The UNIX and Linux Forums  
#1 | SaTYR | 03-30-2009
MPI, recovering node

Hi all,

I'm writing an MPI application in which I handle failures and recover from them. To do that, when one node fails, I would like to remove that node from the MPI_COMM_WORLD group and continue with the remaining nodes.

Does anybody know how I can do that?

I'm using MPICH-G2 by the way.

Thanks in advance.
#2 | otheus (Moderator) | 04-15-2009
Sadly, MPI isn't really built for this. It's one of the big drawbacks of MPI in general. I once asked an MPI guru about this and he said "for performance reasons, MPI is designed for a static number of nodes at startup time". Apparently, collective operations such as gather/scatter won't work right (or efficiently) if you have dynamic node allocation.

But it's been 3 solid years since I heard this, so maybe Open MPI has improved the situation. However, at least with MPICH, once a node fails, the whole process tree is supposed to die. If it doesn't, it's because your cluster admin hasn't done things correctly.
#3 | fabtagon | 04-20-2009
There is a technique called "checkpointing" for tackling such problems.

Good starting point: https://ftg.lbl.gov/CheckpointRestar...TC2008-BKK.pdf
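
To make the idea concrete, here is a minimal sketch of application-level checkpointing in plain C. It is not the tool from the link above, just the general pattern: save progress every few iterations and resume from the last saved state after a restart. The State struct, the file layout, and the function names are all hypothetical. In an MPI job each rank would presumably write its own file, for example with the rank number in the path.

```c
#include <stdio.h>

/* Hypothetical application state: how far we got, and the result so far. */
typedef struct {
    long iteration;
    double partial_sum;
} State;

/* Write to a temporary file first, then rename it over the real one, so
 * a crash in the middle of a write cannot damage the last good
 * checkpoint (POSIX rename replaces the target atomically). */
static int save_checkpoint(const char *path, const State *s)
{
    char tmp[512];
    FILE *f;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || n >= (int)sizeof tmp)
        return -1;
    f = fopen(tmp, "wb");
    if (!f)
        return -1;
    if (fwrite(s, sizeof *s, 1, f) != 1) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path);
}

/* Returns 0 and fills *s on success; leaves *s untouched on failure. */
static int load_checkpoint(const char *path, State *s)
{
    State loaded;
    FILE *f = fopen(path, "rb");
    int ok;

    if (!f)
        return -1;
    ok = fread(&loaded, sizeof loaded, 1, f) == 1;
    fclose(f);
    if (!ok)
        return -1;
    *s = loaded;
    return 0;
}

/* Sum 0..stop_at-1, checkpointing every `every` iterations and resuming
 * from the checkpoint at `path` if one exists. */
static State run(const char *path, long stop_at, long every)
{
    State s = {0, 0.0};

    load_checkpoint(path, &s);   /* no file yet: start from zero */
    while (s.iteration < stop_at) {
        s.partial_sum += (double)s.iteration;
        s.iteration++;
        if (s.iteration % every == 0)
            save_checkpoint(path, &s);
    }
    return s;
}
```

If a run is killed at iteration 57 with a checkpoint interval of 10, the next run starts again from iteration 50 and only repeats the last few steps rather than the whole computation.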
#4 | SaTYR | 04-27-2009
It really doesn't make much sense to me. MPICH is supposed to run on many nodes, and there is a good chance that a node will fail during execution. It should at least continue processing with the remaining nodes.

Thanks for the answers though. I'll keep looking for a solution.
#5 | otheus (Moderator) | 04-27-2009
The MPI specification predated Beowulf clusters, my friend. Before that time, you had computers with varying numbers of CPUs. It was conceived that you might have clusters of computers, but nothing on today's scale. Besides, the guys who dreamt up MPI were computer scientists, i.e., not hardware guys or systems guys. MPI-2, which has the ability to spawn and connect to separate MPI instances, doesn't make this easy.

Search for MPI-2 libraries that support process/communication attachment/detachment. You might find something there. Please post back if you do.

UPDATE: See this PDF/slide presentation http://www.cs.utk.edu/~dongarra/WEB-...2-features.pdf

Search for "Process Management". You use "MPI_COMM_SPAWN" to create a new set of processes, passing it the command and its arguments, but now you talk to them through an "INTERcommunicator" (instead of INTRA). You can do MPI_SEND/MPI_RECV across it; collective functions on an intercommunicator were only added in MPI-2, so with a library that lacks them you would first merge it into an intracommunicator with "MPI_INTERCOMM_MERGE". Still, if a node dies, this doesn't help! You would basically need to create your own process and communication management on top of MPI. That's why I suggest you look for a library.

Last edited by otheus; 04-27-2009 at 12:24 PM.. Reason: update
#6 | SaTYR | 05-25-2009
The current MPI specification assumes that nodes will stay alive during execution. Someone who works on MPI implementations visited my institution 2 weeks ago and gave a presentation. I asked him the same question, and he said another specification (MPI-3) will be announced in the summer and that this issue will be addressed in it. Right now all I can do is write my own process management on top of the MPI library I'm using (as otheus mentioned before).