ompi_crcp(7) [debian man page]

OMPI_CRCP(7)							     Open MPI							      OMPI_CRCP(7)

NAME
       OMPI_CRCP - Open MPI MCA Checkpoint/Restart Coordination Protocol (CRCP)
       Framework: Overview of Open MPI's CRCP framework, and selected modules.

       Open MPI 1.4.5

DESCRIPTION
       The CRCP Framework is used by Open MPI for the encapsulation of various
       Checkpoint/Restart Coordination Protocols (e.g., Coordinated,
       Uncoordinated, Message/Communication Induced, ...).

GENERAL PROCESS REQUIREMENTS
       In order for a process to use the Open MPI CRCP components, it must
       adhere to a few programmatic requirements. First, the program must call
       MPI_INIT early in its execution. Second, the program must call
       MPI_FINALIZE before termination.

       A user may checkpoint and restart a parallel application by using the
       ompi-checkpoint(1) and ompi-restart(1) commands, respectively.

AVAILABLE COMPONENTS
       Open MPI currently ships with one CRCP component: coord.

       The following MCA parameters apply to all components:

       crcp_base_verbose
              Set the verbosity level for all components. Default is 0, or
              silent except on error.

   coord CRCP Component
       The coord component implements a Coordinated Checkpoint/Restart
       Coordination Protocol similar to the one implemented in LAM/MPI.

       The coord component has the following MCA parameters:

       crcp_coord_priority
              The component's priority to use when selecting the most
              appropriate component for a run.

       crcp_coord_verbose
              Set the verbosity level for this component. Default is 0, or
              silent except on error.

   none CRCP Component
       The none component simply selects no CRCP component. All of the CRCP
       function calls return immediately with ORTE_SUCCESS.

       This component is the last component to be selected by default. This
       means that if another component is available, and the none component
       was not explicitly requested, then Open MPI will attempt to activate
       all of the available components before falling back to this component.

SEE ALSO
       ompi-checkpoint(1), ompi-restart(1), opal-checkpoint(1),
       opal-restart(1), orte_snapc(7), orte_filem(7), opal_crs(7)

1.4.5                            Feb 10, 2012                       OMPI_CRCP(7)
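
The fallback behaviour described for the none component can be sketched in C. This is only an illustration of "try the other components first, use none last", not Open MPI's actual selection code: the crcp_component_t type, the select_component() function, the activate callbacks, and the rule that a higher priority value is tried first are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 16

/* Hypothetical component descriptor; not Open MPI's real data structure. */
typedef struct {
    const char *name;      /* e.g. "coord" or "none" */
    int priority;          /* assumption: higher values are tried first */
    int (*activate)(void); /* returns 0 on success, nonzero on failure */
} crcp_component_t;

/* Stand-in activation callbacks for experimenting with the sketch. */
int activate_ok(void)   { return 0; }
int activate_fail(void) { return 1; }

/*
 * Select a component. If `requested` is non-NULL, only the component with
 * that name is considered. Otherwise every component other than "none" is
 * tried in descending priority order, and "none" is used only when all of
 * the others fail to activate. Returns NULL if nothing could be selected.
 */
const crcp_component_t *
select_component(const crcp_component_t *comps, size_t n, const char *requested)
{
    int tried[MAX_COMPONENTS] = {0};
    const crcp_component_t *none = NULL;

    assert(comps != NULL || n == 0);
    if (n > MAX_COMPONENTS)
        return NULL;

    /* Explicit request: use exactly that component or give up. */
    if (requested != NULL) {
        for (size_t i = 0; i < n; i++) {
            if (strcmp(comps[i].name, requested) == 0)
                return comps[i].activate() == 0 ? &comps[i] : NULL;
        }
        return NULL;
    }

    /* Hold "none" back so that it can only be chosen as a last resort. */
    for (size_t i = 0; i < n; i++) {
        if (strcmp(comps[i].name, "none") == 0) {
            none = &comps[i];
            tried[i] = 1;
        }
    }

    /* Repeatedly try the highest-priority component not yet attempted. */
    for (;;) {
        size_t best = n;
        for (size_t i = 0; i < n; i++) {
            if (!tried[i] &&
                (best == n || comps[i].priority > comps[best].priority))
                best = i;
        }
        if (best == n)
            break;              /* every real component has been tried */
        tried[best] = 1;
        if (comps[best].activate() == 0)
            return &comps[best];
    }

    return (none != NULL && none->activate() == 0) ? none : NULL;
}
```

With a table holding coord and none, passing NULL as the request selects coord when its activate callback succeeds and falls back to none when it fails; passing "none" as the request selects none directly.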

lamssi_cr(7)                                                    LAM SSI CR OVERVIEW                                                   lamssi_cr(7)

NAME
       lamssi_checkpoint_restart - overview of LAM's MPI checkpoint / restart
       SSI modules

DESCRIPTION
       The "kind" for checkpoint / restart SSI modules is "cr". Specifically,
       the string "cr" (without the quotes) is the prefix that should be used
       with the mpirun command line with the -ssi switch. For example:

              mpirun -ssi cr blcr C my_mpi_program

       LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.
       Doing so requires that LAM/MPI was compiled with thread support and
       that back-end checkpointing systems are available at run-time.

       MPI jobs will have to run with at least MPI_THREAD_SERIALIZED support.
       If a job elects to run with checkpoint/restart support and an available
       cr module is found, the job's thread level will automatically be
       promoted to MPI_THREAD_SERIALIZED.

       See the User's Guide for more details.

   Checkpoint Phases
       LAM defines three phases for checkpoint / restart support in each MPI
       process:

       Checkpoint
              When the checkpoint request arrives, before the actual
              checkpoint occurs.

       Continue
              After a checkpoint has successfully completed, in the same
              process as the checkpoint was invoked in.

       Restart
              After a checkpoint has successfully completed, in a new /
              restarted process.

       The Continue and Restart phases are identical except for the process in
       which they are invoked -- the Continue phase is invoked in the same
       process as the Checkpoint phase was invoked. The Restart phase is only
       invoked in newly restarted processes.

AVAILABLE MODULES
       LAM currently has two cr modules: blcr and self.

       For an MPI job to be checkpointed and restarted, all of its MPI SSI
       modules must support checkpoint/restart. Currently, this means using
       the crtcp RPI module or the gm RPI module when compiled with gm_get()
       support (see the User's Guide for more details).

   blcr CR Module
       The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is
       a software system from Lawrence Berkeley Labs. See the project web page
       for more details:

              http://www.nersc.gov/research/ftg/checkpoint/

       The blcr module has one SSI parameter:

       cr_blcr_priority
              blcr's default priority is 50.

   self CR Module
       The self module, when used with checkpoint/restart SSI modules, will
       invoke the user-defined functions to save and restore checkpoints. It
       is simply a mechanism for user-defined functions to be invoked at LAM's
       Checkpoint, Continue, and Restart phases. Hence, the only data that is
       saved during the checkpoint is what is written in the user's checkpoint
       function. No MPI library state is saved at all.

       As such, the model for the self module is slightly different from, for
       example, the blcr module. Specifically, the Restart function is not
       invoked in the same process image of the process that was checkpointed.
       The Restart phase is invoked during MPI_INIT of a new instance of the
       application (i.e., it starts over from main()).

       Multiple SSI parameters are available:

       cr_self_user_prefix
              Specify a string prefix for the name of the checkpoint,
              continue, and restart functions that should be invoked by LAM.
              That is, specifying "-ssi cr_self_user_prefix foo" means that
              LAM expects to find three functions at run-time: int
              foo_checkpoint(), int foo_continue(), and int foo_restart().
              This is a convenience parameter that can be used instead of the
              three parameters listed below.

       cr_self_user_checkpoint
              Name of the user function to invoke during the Checkpoint phase.

       cr_self_user_continue
              Name of the user function to invoke during the Continue phase.

       cr_self_user_restart
              Name of the user function to invoke during the Restart phase.

       If none of these parameters are specified and the self module is
       selected, it will use the default prefix lam_cr_self.

       Finally, the usual priority SSI parameter is also available:

       cr_self_priority
              self's default priority is 25.

SEE ALSO
       lamssi(7), mpirun(1), LAM User's Guide

LAM 7.1.4                         July, 2007                        lamssi_cr(7)
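
As a rough sketch of what the three user-defined functions for the self module might look like with the prefix foo, the C fragment below saves and reloads a single integer. Only the function names and their int return type come from the text above. The iteration variable, the FOO_STATE_FILE name, and the convention that 0 means success are assumptions made for this example, and the functions are left non-static on the assumption that LAM must be able to find them by name at run-time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical checkpoint file and application state for this sketch. */
#define FOO_STATE_FILE "foo_state.txt"
int iteration = 0;

/*
 * Checkpoint phase: the self module saves no MPI library state, so the
 * application writes out whatever it needs in order to resume later.
 * Returning 0 for success is an assumption of this sketch.
 */
int foo_checkpoint(void)
{
    FILE *fp = fopen(FOO_STATE_FILE, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%d\n", iteration) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Continue phase: invoked in the same process that took the checkpoint,
 * so the in-memory state is still valid and nothing has to be reloaded.
 */
int foo_continue(void)
{
    return 0;
}

/*
 * Restart phase: invoked in a new instance of the application, so the
 * saved state must be read back before work resumes.
 */
int foo_restart(void)
{
    FILE *fp = fopen(FOO_STATE_FILE, "r");
    if (fp == NULL)
        return -1;
    int ok = fscanf(fp, "%d", &iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

Following the parameter description above, a job built with these functions would be started with the self module selected and "-ssi cr_self_user_prefix foo" on the mpirun command line, so that LAM looks for foo_checkpoint(), foo_continue(), and foo_restart().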