High Performance Computing: Message Passing Interface (MPI) programming and tuning, MPI library installation and management, parallel administration tools, cluster monitoring, cluster optimization, and more HPC topics.
MPI, recovering node
Hi all,
I'm writing an MPI application that has to detect failures and recover from them. If one node fails, I would like to remove that node from the MPI_COMM_WORLD group and continue with the remaining nodes. Does anybody know how I can do that? I'm using MPICH-G2, by the way. Thanks in advance.
|
||||
|
There is a technique called "checkpointing" that is meant to tackle this kind of problem.
Good starting point: https://ftg.lbl.gov/CheckpointRestar...TC2008-BKK.pdf
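To make the checkpointing idea concrete, here is a minimal application-level sketch in plain Python. It is not the tool from the linked slides and it makes no MPI calls; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout and the `fail_at` switch are all made up for illustration. The point is only the pattern: save state every few steps, and after a crash resume from the last saved state instead of from zero.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash in the middle of a write cannot
    destroy the last good checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=10, fail_at=None):
    """Sum 0..total_steps-1, checkpointing as it goes (toy workload).

    `fail_at` is a hypothetical switch that simulates a node dying
    when the given step is reached.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["acc"]
```

In a real MPI job every rank would write its own checkpoint at an agreed point (for example right after a barrier), and the whole job would be restarted from those files. That restarts the run; it does not let the surviving ranks carry on by themselves.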
|
||||
|
That doesn't really make sense to me. MPICH is supposed to run on many nodes, and there is a real possibility that a node will fail during execution. It should at least be able to continue processing with the remaining nodes.
Thanks for the answers though. I'll keep looking for a solution.
|
|||||
|
The MPI specification predated Beowulf clusters, my friend. Back then, you had computers with varying numbers of CPUs. Clusters of computers were envisioned, but nothing on today's scale. Besides, the guys who dreamt up MPI were computer scientists, i.e., not hardware guys or systems guys. MPI-2, which has the ability to spawn and connect to separate MPI instances, doesn't make this easy.
Search for MPI-2 libraries that support attaching and detaching processes and communicators. You might find something there. Please post back if you do.
UPDATE: See this PDF/slide presentation: http://www.cs.utk.edu/~dongarra/WEB-...2-features.pdf Search for "Process Management". You use "MPI_COMM_SPAWN" to create a new set of processes with the same arguments on the command line, but now you must use an "INTERcommunicator" (instead of INTRA). MPI_SEND/MPI_RECV work across it, but collective functions only do if your library implements MPI-2's extended collectives. Still, if a node dies, this doesn't help!! You would basically need to create your own process and communication management on top of MPI. That's why I suggest you look for a library.
Last edited by otheus; 04-27-2009 at 12:24 PM. Reason: update
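For what "your own process management" has to keep track of, here is a rough sketch of the bookkeeping only, in plain Python with no MPI calls. The helper names `exclude_failed` and `repartition` are hypothetical. The renumbering follows the rule of MPI_Group_excl (survivors keep their relative order and get new ranks 0..n-1). Note that actually building the new communicator with MPI_Comm_create is a collective call over the old communicator, so with a dead rank it will normally not complete, which is exactly the problem described above.

```python
def exclude_failed(world_size, failed_ranks):
    """Map each surviving old rank to its new rank.

    Survivors keep their relative order, as MPI_Group_excl would arrange them.
    """
    failed = set(failed_ranks)
    for r in failed:
        if not 0 <= r < world_size:
            raise ValueError(f"rank {r} is outside 0..{world_size - 1}")
    survivors = [r for r in range(world_size) if r not in failed]
    return {old: new for new, old in enumerate(survivors)}


def repartition(num_items, old_to_new):
    """Split work items 0..num_items-1 into contiguous blocks, one per survivor.

    Returns a dict keyed by OLD rank, so each process can look up its share
    using the rank it already knows.
    """
    n = len(old_to_new)
    if n == 0:
        raise RuntimeError("no surviving ranks")
    base, extra = divmod(num_items, n)
    shares = {}
    start = 0
    for old, new in sorted(old_to_new.items(), key=lambda kv: kv[1]):
        count = base + (1 if new < extra else 0)  # first `extra` ranks get one more
        shares[old] = range(start, start + count)
        start += count
    return shares
```

For example, with 4 ranks and rank 2 lost, `exclude_failed(4, [2])` maps old ranks 0, 1, 3 to new ranks 0, 1, 2, and `repartition(10, ...)` hands them items 0-3, 4-6 and 7-9.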
|
||||
|
The current MPI specification assumes that nodes will stay alive during execution. A guy who is interested in MPI implementations visited my institution two weeks ago and gave a presentation. I asked him the same question, and he said that another specification (MPI-3) will be announced in the summer and that this issue will be addressed there. Right now all I can do is write my own process management into the MPI library I'm using (as otheus mentioned before).
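One piece such home-grown process management would need is a way to decide that a node is gone. Below is a small heartbeat-timeout sketch in plain Python. Everything in it (the `HeartbeatMonitor` class, its methods, the timeout value) is hypothetical and only shows the idea; how the heartbeats actually travel between nodes, over MPI or a side channel, is left open.

```python
import time


class HeartbeatMonitor:
    """Track the last time each rank reported in; declare silent ranks dead."""

    def __init__(self, ranks, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable so the logic can be tested without sleeping
        now = clock()
        self._last_seen = {r: now for r in ranks}

    def beat(self, rank):
        """Record a heartbeat. Ranks already declared dead are ignored."""
        if rank in self._last_seen:
            self._last_seen[rank] = self._clock()

    def reap(self):
        """Return (and forget) every rank silent for longer than the timeout."""
        now = self._clock()
        dead = sorted(
            r for r, seen in self._last_seen.items() if now - seen > self._timeout
        )
        for r in dead:
            del self._last_seen[r]
        return dead

    def alive(self):
        """Ranks still considered alive, in ascending order."""
        return sorted(self._last_seen)
```

A coordinator process could call `reap()` periodically and then tell the survivors to drop the returned ranks. A timeout cannot tell a dead node from a slow one, so the value has to be chosen with the network and workload in mind.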
|
| Bits Awarded / Charged to SaTYR for this Post | |||
| Date | User | Comment | Amount |
| 05-26-2009 | otheus | Thanks for the follow-up! | 500 |