OPAL-RESTART(1) Open MPI OPAL-RESTART(1)
NAME
opal-restart - Restart a previously checkpointed sequential process using the Open PAL Checkpoint/Restart Service (CRS)
Note: This should only be used by the user if the application being restarted is an OPAL-only application. If it is an Open RTE or Open MPI
program, their respective tools should be used instead.
opal-restart [ options ] <SNAPSHOT HANDLE>
opal-restart will attempt to restart a previously checkpointed sequential process from the snapshot handle reference returned by opal_checkpoint.
<SNAPSHOT HANDLE>
The snapshot handle reference returned by opal_checkpoint, used to restart the process. This is required to be the last argument
to this command.
-h | --help
Display help for this command
--fork
Fork off a new process, which is the restarted process. By default, the restarted process will replace the opal-restart process.
-w | --where
The location of the local snapshot reference.
-s | --self
Restart this process using the self CRS component. This component is a special case, all other CRS components are automatically
-v | --verbose
Enable verbose output for debugging.
-gmca | --gmca <key> <value>
Pass global MCA parameters that are applicable to all contexts. <key> is the parameter name; <value> is the parameter value.
-mca | --mca <key> <value>
Send arguments to various MCA modules.
opal-restart can be invoked multiple, non-overlapping times. This allows the user to restart a previously running sequential process. See
opal_crs(7) for more information about the CRS framework and components.
When using the self CRS component, the <FILENAME> argument is replaced by the name of the program to be restarted followed by any arguments
that need to be passed to the program. For example, if under normal execution we would start our program "foo" as:
shell$ setenv OMPI_MCA_crs self
shell$ setenv OMPI_MCA_crs_self_prefix my_callback_prefix
shell$ ./foo arg1 arg2
To restart this process, we may only need to call:
shell$ opal-restart --self \
           -mca crs_self_prefix my_callback_prefix \
           ./foo arg1 arg2
This will cause the "my_callback_prefix_restart" function to be called as soon as the program "foo" calls OPAL_INIT. You do not have to
call your program with the same argument set as before. Therefore, we could have just as correctly called:
shell$ opal-restart --self -mca crs_self_prefix my_callback_prefix ./foo
This depends upon the behavior of the program "foo".
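As a rough illustration of why the argument list may differ on restart, the following C sketch shows a program whose restart callback restores its own state, so that command-line arguments only matter on a fresh start. The callback is named my_callback_prefix_restart here (a C identifier cannot contain a hyphen). Its signature, the state file name, and the helpers save_state() and starting_iteration() are assumptions made for this example and are not taken from the Open PAL sources; see opal_crs(7) for the actual callback interface.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical file used by this sketch to hold the saved state. */
static const char *STATE_FILE = "foo_state.txt";

/* Next iteration to run; -1 means no state has been restored. */
static long restored_iter = -1;

/* Hypothetical helper: persist the iteration counter, as a user-written
 * checkpoint callback might do. Returns 0 on success, -1 on failure. */
int save_state(long iter)
{
    FILE *fp = fopen(STATE_FILE, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%ld\n", iter);
    return fclose(fp) == 0 ? 0 : -1;
}

/* Restart callback for the prefix "my_callback_prefix".
 * The int(void) signature is an assumption of this sketch. */
int my_callback_prefix_restart(void)
{
    FILE *fp = fopen(STATE_FILE, "r");
    long iter;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &iter) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    restored_iter = iter;
    return 0;
}

/* Choose where to start: restored state wins over argv[1];
 * with neither available, start from 0. */
long starting_iteration(int argc, char **argv)
{
    assert(argv != NULL);
    if (restored_iter >= 0)
        return restored_iter;
    if (argc > 1)
        return strtol(argv[1], NULL, 10);
    return 0;
}
```

In a program written this way, the arguments given at restart are ignored once the callback has restored the saved state. A program that reads its configuration only from its arguments would still need them.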
SEE ALSO
opal-checkpoint(1), opal_crs(7)
1.4.5 Feb 10, 2012 OPAL-RESTART(1)
lamssi_cr(7) LAM SSI CR OVERVIEW lamssi_cr(7)
NAME
lamssi_checkpoint_restart - overview of LAM's MPI checkpoint / restart SSI modules
The "kind" for checkpoint / restart SSI modules is "cr". Specifically, the string "cr" (without the quotes) is the prefix that should be
used with the mpirun command line with the -ssi switch. For example:
mpirun -ssi cr blcr C my_mpi_program
LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs. Doing so requires that LAM/MPI was compiled with thread support and
that back-end checkpointing systems are available at run-time. MPI jobs will have to run with at least MPI_THREAD_SERIALIZED support. If
a job elects to run with checkpoint/restart support and an available cr module is found, the job's thread level will automatically be
promoted to MPI_THREAD_SERIALIZED. See the User's Guide for more details.
LAM defines three phases for checkpoint / restart support in each MPI process:
Checkpoint
When the checkpoint request arrives, before the actual checkpoint occurs.
Continue
After a checkpoint has successfully completed, in the same process as the checkpoint was invoked in.
Restart
After a checkpoint has successfully completed, in a new / restarted process.
The Continue and Restart phases are identical except for the process in which they are invoked -- the Continue phase is invoked in the same
process as the Checkpoint phase was invoked. The Restart phase is only invoked in newly restarted processes.
LAM currently has two cr modules: blcr and self. In order for an MPI job to be able to be checkpointed and restarted, all of its MPI SSI
modules must support checkpoint/restart. Currently, this means using the crtcp RPI module or the gm RPI module when compiled with gm_get()
support (see the User's Guide for more details).
blcr CR Module
The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is a software system from Lawrence Berkeley Labs. See the project web
page for more details: http://www.nersc.gov/research/ftg/checkpoint/.
The blcr module has one SSI parameter:
blcr's default priority is 50.
self CR Module
The self module, when used with checkpoint/restart SSI modules, will invoke the user-defined functions to save and restore checkpoints. It
is simply a mechanism for user-defined functions to be invoked at LAM's Checkpoint, Continue, and Restart phases. Hence, the only data that
is saved during the checkpoint is what is written in the user's checkpoint function. No MPI library state is saved at all.
As such, the model for the self module is slightly different than, for example, the blcr module. Specifically, the Restart function is not
invoked in the same process image of the process that was checkpointed. The Restart phase is invoked during MPI_INIT of a new instance of
the application (i.e., it starts over from main()).
Multiple SSI parameters are available:
cr_self_user_prefix
Specify a string prefix for the name of the checkpoint, continue, and restart functions that should be invoked by LAM. That is,
specifying "-ssi cr_self_user_prefix foo" means that LAM expects to find three functions at run-time: int foo_checkpoint(),
int foo_continue(), and int foo_restart(). This is a convenience parameter that can be used instead of the three parameters listed below.
Name of the user function to invoke during the Checkpoint phase.
Name of the user function to invoke during the Continue phase.
Name of the user function to invoke during the Restart phase.
If none of these parameters are specified and the self module is selected, it will use the default prefix lam_cr_self.
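The naming rule described above can be sketched in C as follows. This is only an illustration of how a prefix expands into three function names, not LAM's implementation. The struct, the helper names, and the buffer limit are hypothetical. The sketch also assumes that an explicitly given function name takes precedence over the prefix, and that the default prefix lam_cr_self expands the same way as a user prefix; the text above does not state either point.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CR_NAME_LEN 128 /* hypothetical limit chosen for this sketch */

/* Names of the three user functions, one per phase. */
struct cr_self_names {
    char checkpoint[CR_NAME_LEN];
    char cont[CR_NAME_LEN]; /* "continue" is a C keyword */
    char restart[CR_NAME_LEN];
};

/* Use the explicit name if one was given, otherwise "<prefix>_<suffix>".
 * Returns 0 on success, -1 if the name does not fit in the buffer. */
static int pick_name(char *out, const char *explicit_name,
                     const char *prefix, const char *suffix)
{
    int n = (explicit_name != NULL)
        ? snprintf(out, CR_NAME_LEN, "%s", explicit_name)
        : snprintf(out, CR_NAME_LEN, "%s_%s", prefix, suffix);
    return (n >= 0 && n < CR_NAME_LEN) ? 0 : -1;
}

/* Resolve all three names. Pass NULL for any parameter that was not
 * specified; a NULL prefix falls back to the default "lam_cr_self". */
int resolve_cr_self_names(struct cr_self_names *out, const char *prefix,
                          const char *checkpoint, const char *cont,
                          const char *restart)
{
    assert(out != NULL);
    if (prefix == NULL)
        prefix = "lam_cr_self";
    if (pick_name(out->checkpoint, checkpoint, prefix, "checkpoint") != 0)
        return -1;
    if (pick_name(out->cont, cont, prefix, "continue") != 0)
        return -1;
    if (pick_name(out->restart, restart, prefix, "restart") != 0)
        return -1;
    return 0;
}
```

With the prefix "foo" and no explicit names, this yields foo_checkpoint, foo_continue, and foo_restart, matching the example in the text.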
Finally, the usual priority SSI parameter is also available:
self's default priority is 25.
SEE ALSO
lamssi(7), mpirun(1), LAM User's Guide
LAM 7.1.4 July, 2007 lamssi_cr(7)