srun_cr(1) [debian man page]

SRUN_CR(1)							 slurm components							SRUN_CR(1)

NAME
srun_cr - run parallel jobs with checkpoint/restart support

SYNOPSIS
srun_cr [OPTIONS...]

DESCRIPTION
The design of srun_cr is inspired by mpiexec_cr from MVAPICH2 and cr_restart from BLCR. It is a wrapper around the srun command that enables batch job checkpoint/restart support when used with SLURM's checkpoint/blcr plugin.

OPTIONS
The command-line options of srun_cr are identical to those of the srun command. See "man srun" for details.

DETAILS
After initialization, srun_cr registers a thread context callback function. It then forks a process and executes "cr_run --omit srun" with its arguments. cr_run is used to exclude the srun process from being dumped upon checkpoint. All catchable signals sent to srun_cr, except SIGCHLD, are forwarded to the child srun process. SIGCHLD is caught so that srun_cr can mimic the exit status of srun when it exits. srun_cr then loops, waiting for the tasks launched by srun to terminate.

The step launch logic of SLURM is augmented to check whether srun is running under srun_cr. If so, the environment variable SLURM_SRUN_CR_SOCKET should be present; its value is the address of a Unix domain socket created and listened on by srun_cr. After launching the tasks, srun tries to connect to the socket and sends the job ID, the step ID and the nodes allocated to the step to srun_cr.

Upon checkpoint, srun_cr checks whether the tasks have been launched. If so, srun_cr first forwards the checkpoint request to the tasks by calling the SLURM API slurm_checkpoint_tasks() before dumping its process context.

Upon restart, srun_cr checks whether the tasks were previously launched and checkpointed. If so, the environment variable SLURM_RESTART_DIR is set to the directory holding the checkpoint image files of the tasks. Then srun is forked and executed again. The srun command uses this environment variable to restart execution of the tasks from the previous checkpoint.

COPYING
Copyright (C) 2009 National University of Defense Technology, China. Produced at National University of Defense Technology, China (cf, DISCLAIMER). CODE-OCEC-09-009. All rights reserved.

This file is part of SLURM, a resource management program. For details, see <http://www.schedmd.com/slurmdocs/>.

SLURM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

SLURM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SEE ALSO
srun(1)

srun_cr 2.0                                                       March 2009                                                        SRUN_CR(1)
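
The fork, wait and exit-status handling described under DETAILS for srun_cr can be sketched in plain POSIX C. This is a minimal illustration under stated assumptions, not the srun_cr source: run_and_mirror_status() is a hypothetical name, it uses a blocking waitpid() instead of a SIGCHLD handler, it does not forward signals, and mapping a fatal signal to 128 plus the signal number is a shell convention assumed here.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of srun_cr): run argv[0] as a child
 * process and return an exit code that mirrors how the child ended.
 */
int run_and_mirror_status(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: replace this process image with the requested command. */
        execvp(argv[0], argv);
        perror("execvp");
        _exit(127);   /* conventional code for "could not execute" */
    }

    /* Parent: wait for the child, retrying if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return 1;
        }
    }

    if (WIFEXITED(status))
        return WEXITSTATUS(status);     /* normal exit: pass the code through */
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);  /* assumption: shell-style mapping */
    return 1;
}
```

Called with the argument vector of the command to run, the helper returns a value that a wrapper could pass straight to exit(), so the wrapper's caller sees the same status the child produced.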


Slurm API(3)						    Slurm checkpoint functions						      Slurm API(3)

NAME
slurm_checkpoint_able, slurm_checkpoint_complete, slurm_checkpoint_create, slurm_checkpoint_disable, slurm_checkpoint_enable, slurm_checkpoint_error, slurm_checkpoint_restart, slurm_checkpoint_vacate - Slurm checkpoint functions

SYNTAX
#include <slurm/slurm.h>

int slurm_checkpoint_able (
        uint32_t job_id,
        uint32_t step_id,
        time_t *start_time
);

int slurm_checkpoint_complete (
        uint32_t job_id,
        uint32_t step_id,
        time_t start_time,
        uint32_t error_code,
        char *error_msg
);

int slurm_checkpoint_create (
        uint32_t job_id,
        uint32_t step_id,
        uint16_t max_wait,
        char *image_dir
);

int slurm_checkpoint_disable (
        uint32_t job_id,
        uint32_t step_id
);

int slurm_checkpoint_enable (
        uint32_t job_id,
        uint32_t step_id
);

int slurm_checkpoint_error (
        uint32_t job_id,
        uint32_t step_id,
        uint32_t *error_code,
        char **error_msg
);

int slurm_checkpoint_restart (
        uint32_t job_id,
        uint32_t step_id,
        uint16_t stick,
        char *image_dir
);

int slurm_checkpoint_tasks (
        uint32_t job_id,
        uint32_t step_id,
        time_t begin_time,
        char *image_dir,
        uint16_t max_wait,
        char *nodelist
);

int slurm_checkpoint_vacate (
        uint32_t job_id,
        uint32_t step_id,
        uint16_t max_wait,
        char *image_dir
);

ARGUMENTS
begin_time
        When to begin the operation.

error_code
        Error code for checkpoint operation. Only the highest value is preserved.

error_msg
        Error message for checkpoint operation. Only the error_msg value for the highest error_code is preserved.

image_dir
        Directory specification for where the checkpoint file should be read from or written to. The default value is specified by the JobCheckpointDir SLURM configuration parameter.

job_id
        SLURM job ID to perform the operation upon.

max_wait
        Maximum time to allow for the operation to complete, in seconds.

nodelist
        Nodes to which the request is sent.

start_time
        Time at which the last checkpoint operation began (if one is in progress), otherwise zero.

step_id
        SLURM job step ID to perform the operation upon. May be NO_VAL if the operation is to be performed on all steps of the specified job. Specify SLURM_BATCH_SCRIPT to checkpoint a batch job.

stick
        If non-zero, restart the job on the same nodes that it was checkpointed from.

DESCRIPTION
slurm_checkpoint_able
        Report if checkpoint operations can presently be issued for the specified job step. If yes, returns SLURM_SUCCESS and sets start_time if a checkpoint operation is presently active. Returns ESLURM_DISABLED if checkpoint operation is disabled.

slurm_checkpoint_complete
        Note that a requested checkpoint has been completed.

slurm_checkpoint_create
        Request a checkpoint for the identified job step. Continue its execution upon completion of the checkpoint.

slurm_checkpoint_disable
        Make the identified job step non-checkpointable. This can be issued as needed to prevent checkpointing while a job step is in a critical section or for other reasons.

slurm_checkpoint_enable
        Make the identified job step checkpointable.

slurm_checkpoint_error
        Get error information about the last checkpoint operation for a given job step.

slurm_checkpoint_restart
        Request that a previously checkpointed job resume execution. It may continue execution on different nodes than were originally used. Execution may be delayed if resources are not immediately available.

slurm_checkpoint_vacate
        Request a checkpoint for the identified job step. Terminate its execution upon completion of the checkpoint.

RETURN VALUE
Zero is returned upon success. On error, -1 is returned, and the Slurm error code is set appropriately.

ERRORS
ESLURM_INVALID_JOB_ID
        The requested job or job step ID does not exist.

ESLURM_ACCESS_DENIED
        The requesting user lacks authorization for the requested action (e.g. trying to delete or modify another user's job).

ESLURM_JOB_PENDING
        The requested job is still pending.

ESLURM_ALREADY_DONE
        The requested job has already completed.

ESLURM_DISABLED
        The requested operation has been disabled for this job step. This occurs when a checkpoint is requested while checkpoints are disabled.

ESLURM_NOT_SUPPORTED
        The requested operation is not supported on this system.

EXAMPLE
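The ARGUMENTS section states that only the highest error_code, together with its error_msg, is preserved. The bookkeeping this implies can be sketched as follows. It is an illustration only and does not use or reproduce Slurm code: struct ckpt_error, ckpt_error_record() and CKPT_MSG_MAX are hypothetical names, and leaving the stored message unchanged when two reports carry the same code is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

#define CKPT_MSG_MAX 128   /* hypothetical message buffer size */

/* Hypothetical record of the worst error reported so far for one job step. */
struct ckpt_error {
    uint32_t code;              /* 0 means no error has been recorded */
    char     msg[CKPT_MSG_MAX];
};

/*
 * Record one error report. The stored code and message are replaced only
 * when the new code is strictly higher than the stored one, so the record
 * always holds the highest code seen and the message that came with it.
 */
void ckpt_error_record(struct ckpt_error *e, uint32_t code, const char *msg)
{
    if (code <= e->code)
        return;
    e->code = code;
    /* snprintf truncates safely if msg is longer than the buffer. */
    snprintf(e->msg, sizeof(e->msg), "%s", msg ? msg : "");
}
```

Starting from a zeroed record, reporting code 3 and then code 1 leaves code 3 and its message in place; a later report with code 9 replaces both.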
#include <stdio.h>
#include <stdlib.h>
#include <slurm/slurm.h>
#include <slurm/slurm_errno.h>

int main (int argc, char *argv[])
{
        uint32_t job_id, step_id;

        /* Both the job ID and the step ID are required. */
        if (argc < 3) {
                printf("Usage: %s job_id step_id\n", argv[0]);
                exit(1);
        }

        job_id = atoi(argv[1]);
        step_id = atoi(argv[2]);

        /* A non-zero return indicates failure; report the Slurm error. */
        if (slurm_checkpoint_disable(job_id, step_id)) {
                slurm_perror ("slurm_checkpoint_disable:");
                exit (1);
        }
        exit (0);
}

NOTE
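The EXAMPLE program converts its arguments with atoi(), which cannot report malformed or out-of-range input. A stricter conversion could look like the sketch below. parse_u32() is a hypothetical helper, not part of libslurm, and rejecting any input that does not start with a digit (including leading whitespace or a sign) is a design choice made here.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal string into a uint32_t.
 * Returns 0 on success and stores the value in *out.
 * Returns -1 if the string is NULL or empty, does not start with a
 * digit, has trailing characters, or does not fit in 32 bits.
 */
int parse_u32(const char *s, uint32_t *out)
{
    char *end;
    unsigned long v;

    /* Rejects NULL, "", leading whitespace, '+' and '-'. */
    if (s == NULL || !isdigit((unsigned char)*s))
        return -1;

    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT32_MAX)
        return -1;

    *out = (uint32_t)v;
    return 0;
}
```

A caller would test the return value before using the result, for example printing the usage message and exiting when parse_u32(argv[1], &job_id) is non-zero.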
These functions are included in the libslurm library, which must be linked to your process for use (e.g. "cc myprog.c -lslurm"; with many linkers the library must follow the source or object files that use it).

COPYING
Copyright (C) 2004-2007 The Regents of the University of California. Copyright (C) 2008-2009 Lawrence Livermore National Security. Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). CODE-OCEC-09-009. All rights reserved.

This file is part of SLURM, a resource management program. For details, see <http://www.schedmd.com/slurmdocs/>.

SLURM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

SLURM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SEE ALSO
srun(1), squeue(1), free(3), slurm.conf(5)

Morris Jette                                                      March 2009                                                      Slurm API(3)