MPI, recovering node

 
# 1  
Old 03-30-2009

Hi all,

I'm writing an MPI application that has to detect failures and recover from them. If one node fails, I would like to remove that node from the MPI_COMM_WORLD group and continue with the remaining nodes.

Does anybody know how I can do that?

I'm using MPICH-G2 by the way.

Thanks in advance.
# 2  
Old 04-15-2009
Sadly, MPI isn't really built for this. It's one of the big drawbacks of MPI in general. I once asked an MPI guru about this and he said "for performance reasons, MPI is designed for a static number of nodes at startup time". Apparently, collective operations such as gather/scatter won't work correctly (or efficiently) if the set of nodes changes at run time.

But it's been 3 solid years since I heard this, so maybe Open MPI has improved on the state of affairs. However, at least with MPICH, once a node fails, the whole process tree is supposed to die. If it doesn't, it's because your cluster admin hasn't set things up correctly.
# 3  
Old 04-20-2009
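There is a technique called "checkpointing" that tackles such problems: the job periodically saves its state so that, after a failure, it can be restarted from the last checkpoint instead of from scratch.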

Good starting point: https://ftg.lbl.gov/CheckpointRestar...TC2008-BKK.pdf
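
To make the idea concrete, here is a minimal sketch of application-level checkpointing in plain C. Note that this is the do-it-yourself variant, not the system-level checkpoint/restart that the link above covers. The names SolverState, save_checkpoint and load_checkpoint are made up for illustration, the MPI calls are left out, and it assumes a POSIX system where rename() replaces an existing file. In an MPI job each rank would presumably write its own file, for example with the rank number in the file name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-process state; a real application keeps its own data here. */
typedef struct {
    int    iteration;    /* last completed iteration */
    double partial_sum;  /* example of work accumulated so far */
} SolverState;

/* Write the state to "<path>.tmp" first, then rename it over path, so a
 * crash in the middle of a write never leaves a truncated checkpoint.
 * Assumes POSIX rename() semantics (an existing target is replaced).
 * Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const SolverState *s)
{
    assert(path != NULL && s != NULL);

    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Read the state back. Returns 0 on success, -1 if there is no usable
 * checkpoint, in which case the caller starts from the beginning. */
int load_checkpoint(const char *path, SolverState *s)
{
    assert(path != NULL && s != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

At startup the program would call load_checkpoint() and, if it returns 0, continue from s.iteration + 1. Inside the main loop it would call save_checkpoint() every so many iterations. After a node failure the whole job is restarted and picks up from the last saved state.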
# 4  
Old 04-27-2009
It really doesn't make much sense to me. MPICH is supposed to run on many nodes, and there is a good chance that one of them fails during execution. It should at least continue processing with the remaining nodes.

Thanks for the answers though. I'll keep looking for the solution.
# 5  
Old 04-27-2009
The MPI specification predated Beowulf clusters, my friend. Back then, you had computers with varying numbers of CPUs. It was conceived that you might have clusters of computers, but nothing on today's scale. Besides, the guys who dreamt up MPI were computer scientists, i.e., not hardware guys or systems guys. Even MPI-2, which has the ability to spawn and connect to separate MPI instances, doesn't make this easy.

Search for MPI-2 libraries that support process/communication attachment/detachment. You might find something there. Please post back if you do.

UPDATE: See this PDF/slide presentation http://www.cs.utk.edu/~dongarra/WEB-...2-features.pdf

Search for "Process Management". You use "MPI_COMM_SPAWN" to start a new set of processes, passing the command and its arguments, but the parents and the children are then connected through an "INTERcommunicator" (instead of INTRA). You can do MPI_SEND/MPI_RECV across it, but collective functions behave differently there unless you first merge it into an ordinary intracommunicator with MPI_INTERCOMM_MERGE. Still, if a node dies, this doesn't help!! You would basically need to create your own process and communication management on top of MPI. That's why I suggest you look for a library.

Last edited by otheus; 04-27-2009 at 01:24 PM.. Reason: update
# 6  
Old 05-26-2009
The current MPI specification assumes that nodes stay alive during execution. Someone who works on MPI implementations visited my institution two weeks ago and gave a presentation. I asked him the same question, and he said that another specification (MPI-3) will be announced in the summer and that this issue will be addressed there. Right now all I can do is write my own process management on top of the MPI library I'm using (as otheus mentioned before).