sanlock(8) [centos man page]

SANLOCK(8)						      System Manager's Manual							SANLOCK(8)

NAME
sanlock - shared storage lock manager

SYNOPSIS
sanlock [COMMAND] [ACTION] ...

DESCRIPTION
The sanlock daemon manages leases for applications running on a cluster of hosts with shared storage. All lease management and coordination is done through reading and writing blocks on the shared storage. Two types of leases are used, each based on a different algorithm:

"delta leases" are slow to acquire and require regular i/o to shared storage. A delta lease exists in a single sector of storage. Acquiring a delta lease involves reads and writes to that sector separated by specific delays. Once acquired, a lease must be renewed by updating a timestamp in the sector regularly. sanlock uses a delta lease internally to hold a lease on a host_id. host_id leases prevent two hosts from using the same host_id and provide basic host liveness information based on the renewals.

"paxos leases" are generally fast to acquire and sanlock makes them available to applications as general purpose resource leases. A paxos lease exists in 1MB of shared storage (8MB for 4k sectors). Acquiring a paxos lease involves reads and writes to max_hosts (2000) sectors in a specific sequence specified by the Disk Paxos algorithm. paxos leases use host_id's internally to indicate the owner of the lease, and the algorithm fails if different hosts use the same host_id. So, delta leases provide the unique host_id's used in paxos leases. paxos leases also refer to delta leases to check if a host_id is alive.

Before sanlock can be used, the user must assign each host a host_id, which is a number between 1 and 2000. Two hosts should not be given the same host_id (even though delta leases attempt to detect this mistake.)

sanlock views a pool of storage as a "lockspace". Each distinct pool of storage, e.g. from different sources, would typically be defined as a separate lockspace, with a unique lockspace name.

Part of this storage space must be reserved and initialized for sanlock to store delta leases.
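
The lease area sizes quoted in this page can be checked with simple arithmetic. The sketch below is illustrative only and is not part of sanlock. It assumes that "1MB" means 1048576 bytes (the same value this page uses as a 1MB offset in its examples), and the function names are hypothetical.

```python
MAX_HOSTS = 2000     # default max_hosts given in the text
MIB = 1024 * 1024    # assumption: "1MB" in the text means 1048576 bytes


def lease_area_bytes(sector_size: int) -> int:
    """Size of one lease area (2000 delta leases, or one paxos lease)."""
    if sector_size == 512:
        return 1 * MIB
    if sector_size == 4096:
        return 8 * MIB
    raise ValueError(f"sector size not covered by the text: {sector_size}")


def sectors_in_lease_area(sector_size: int) -> int:
    """Number of whole sectors that fit in one lease area."""
    return lease_area_bytes(sector_size) // sector_size


if __name__ == "__main__":
    for size in (512, 4096):
        print(size, lease_area_bytes(size), sectors_in_lease_area(size))
```

Under these assumptions both sector sizes give 2048 sectors per lease area, which is enough room for the 2000 sectors mentioned above. This is why the 4k-sector figure is eight times the 512-byte figure.
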
Each host that wants to use the lockspace must first acquire a delta lease on its host_id number within the lockspace. (See the add_lockspace action/api.) The space required for 2000 delta leases in the lockspace (for 2000 possible host_id's) is 1MB (8MB for 4k sectors). (This is the same size required for a single paxos lease.)

More storage space must be reserved and initialized for paxos leases, according to the needs of the applications using sanlock.

The following steps illustrate these concepts using the command line. Applications may choose to do these same steps through libsanlock.

1. Create storage pools, then reserve and initialize host_id leases. This example uses two different LUNs on a SAN: /dev/sdb, /dev/sdc

    # vgcreate pool1 /dev/sdb
    # vgcreate pool2 /dev/sdc
    # lvcreate -n hostid_leases -L 1MB pool1
    # lvcreate -n hostid_leases -L 1MB pool2
    # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
    # sanlock direct init -s LS2:0:/dev/pool2/hostid_leases:0

2. Start the sanlock daemon on each host

    # sanlock daemon

3. Add each lockspace to be used

host1:

    # sanlock client add_lockspace -s LS1:1:/dev/pool1/hostid_leases:0
    # sanlock client add_lockspace -s LS2:1:/dev/pool2/hostid_leases:0

host2:

    # sanlock client add_lockspace -s LS1:2:/dev/pool1/hostid_leases:0
    # sanlock client add_lockspace -s LS2:2:/dev/pool2/hostid_leases:0

4. Applications can now reserve/initialize space for resource leases, and then acquire the leases as they need to access the resources.

The resource leases that are created and how they are used depend on the application. For example, say application A, running on host1 and host2, needs to synchronize access to data it stores on /dev/pool1/Adata. A could use a resource lease as follows:

5. Reserve and initialize a single resource lease for Adata

    # lvcreate -n Adata_lease -L 1MB pool1
    # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0

6. Acquire the lease from the app using libsanlock (see sanlock_register, sanlock_acquire).
If the app is already running as pid 123, and has registered with the sanlock daemon, the lease can be added for it manually.

    # sanlock client acquire -r LS1:Adata:/dev/pool1/Adata_lease:0 -p 123

offsets

offsets must be 1MB aligned for disks with 512 byte sectors, and 8MB aligned for disks with 4096 byte sectors.

offsets may be used to place leases on the same device rather than using separate devices and offset 0 as shown in examples above, e.g. these commands above:

    # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
    # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0

could be replaced by:

    # sanlock direct init -s LS1:0:/dev/pool1/leases:0
    # sanlock direct init -r LS1:Adata:/dev/pool1/leases:1048576

failures

If a process holding resource leases fails or exits without releasing its leases, sanlock will release the leases for it automatically.

If the sanlock daemon cannot renew a lockspace host_id for a specific period of time (usually because storage access is lost), sanlock will kill any process holding a resource lease within the lockspace.

If the sanlock daemon crashes or gets stuck, it will no longer renew the expiry time of its per-host_id connections to the wdmd daemon, and the watchdog device will reset the host.

watchdog

sanlock uses the wdmd(8) daemon to access /dev/watchdog. A separate wdmd connection is maintained with wdmd for each host_id being renewed. Each host_id connection has an expiry time for some seconds in the future. After each successful host_id renewal, sanlock updates the associated expiry time in wdmd. If wdmd finds any connection expired, it will not pet /dev/watchdog. After enough successive expired/failed checks, the watchdog device will fire and reset the host.

After a number of failed attempts to renew a host_id, sanlock kills any process using that lockspace. Once all those processes have exited, sanlock will unregister the associated wdmd connection.
wdmd will no longer find the expired connection, and will resume petting /dev/watchdog (assuming it finds no other failed/expired tests.) If the killed processes did not exit quickly enough, the expired wdmd connection will not be unregistered, and /dev/watchdog will reset the host.

Based on these known timeout values, sanlock on another host can calculate, based on the last host_id renewal, when the failed host will have been reset by its watchdog (or killed all the necessary processes).

If the sanlock daemon itself fails, crashes, or gets stuck, it will no longer update the expiry time for its host_id connections to wdmd, which will also lead to the watchdog resetting the host.

safety

sanlock leases are meant to guarantee that two processes on two hosts are never allowed to hold the same resource lease at once. If they were, the resource being protected may be corrupted. There are three levels of protection built into sanlock itself:

1. The paxos leases and delta leases themselves.

2. If the leases cannot function because storage access is lost (host_id's cannot be renewed), the sanlock daemon kills any pids using resource leases in the lockspace.

3. If the pids do not exit after being killed, or if the sanlock daemon fails, the watchdog device resets the host.

OPTIONS
COMMAND can be one of three primary top level choices

    sanlock daemon    start daemon
    sanlock client    send request to daemon (default command if none given)
    sanlock direct    access storage directly (no coordination with daemon)

sanlock daemon [options]

    -D        no fork and print all logging to stderr
    -Q 0|1    quiet error messages for common lock contention
    -R 0|1    renewal debugging, log debug info for each renewal
    -L pri    write logging at priority level and up to logfile (-1 none)
    -S pri    write logging at priority level and up to syslog (-1 none)
    -U uid    user id
    -G gid    group id
    -t num    max worker threads
    -g sec    seconds for graceful recovery
    -w 0|1    use watchdog through wdmd
    -h 0|1    use high priority (RR) scheduling
    -l num    use mlockall (0 none, 1 current, 2 current and future)
    -a 0|1    use async i/o

sanlock client action [options]

sanlock client status

Print processes, lockspaces, and resources being managed by the sanlock daemon. Add -D to show extra internal daemon status for debugging. Add -o p to show resources by pid, or -o s to show resources by lockspace.

sanlock client host_status

Print state of host_id delta leases read during the last renewal. State of all lockspaces is shown (use -s to select one). Add -D to show extra internal daemon status for debugging.

sanlock client gets

Print lockspaces being managed by the sanlock daemon. The LOCKSPACE string will be followed by ADD or REM if the lockspace is currently being added or removed. Add -h 1 to also show hosts in each lockspace.

sanlock client log_dump

Print the sanlock daemon internal debug log.

sanlock client shutdown

Ask the sanlock daemon to exit. Without the force option (-f 0), the command will be ignored if any lockspaces exist. With the force option (-f 1), any registered processes will be killed, their resource leases released, and lockspaces removed.

sanlock client init -s LOCKSPACE

Tell the sanlock daemon to initialize a lockspace on disk.
The -o option can be used to specify the io timeout to be written in the host_id leases. (Also see sanlock direct init.)

sanlock client init -r RESOURCE

Tell the sanlock daemon to initialize a resource lease on disk. (Also see sanlock direct init.)

sanlock client read -s LOCKSPACE

Tell the sanlock daemon to read a lockspace from disk. Only the LOCKSPACE path and offset are required. If host_id is zero, the first record at offset (host_id 1) is used. The complete LOCKSPACE and io timeout are printed.

sanlock client read -r RESOURCE

Tell the sanlock daemon to read a resource lease from disk. Only the RESOURCE path and offset are required. The complete RESOURCE is printed. (Also see sanlock direct read_leader.)

sanlock client align -s LOCKSPACE

Tell the sanlock daemon to report the required lease alignment for a storage path. Only path is used from the LOCKSPACE argument.

sanlock client add_lockspace -s LOCKSPACE

Tell the sanlock daemon to acquire the specified host_id in the lockspace. This will allow resources to be acquired in the lockspace. The -o option can be used to specify the io timeout of the acquiring host, and will be written in the host_id lease.

sanlock client inq_lockspace -s LOCKSPACE

Inquire about the state of the lockspace in the sanlock daemon, whether it is being added or removed, or is joined.

sanlock client rem_lockspace -s LOCKSPACE

Tell the sanlock daemon to release the specified host_id in the lockspace. Any processes holding resource leases in this lockspace will be killed, and the resource leases not released.

sanlock client command -r RESOURCE -c path args

Register with the sanlock daemon, acquire the specified resource lease, and exec the command at path with args. When the command exits, the sanlock daemon will release the lease. -c must be the final option.

sanlock client acquire -r RESOURCE -p pid
sanlock client release -r RESOURCE -p pid

Tell the sanlock daemon to acquire or release the specified resource lease for the given pid.
The pid must be registered with the sanlock daemon. acquire can optionally take a versioned RESOURCE string RESOURCE:lver, where lver is the version of the lease that must be acquired, or fail.

sanlock client convert -r RESOURCE -p pid

Tell the sanlock daemon to convert the mode of the specified resource lease for the given pid. If the existing mode is exclusive (default), the mode of the lease can be converted to shared with RESOURCE:SH. If the existing mode is shared, the mode of the lease can be converted to exclusive with RESOURCE (no :SH suffix).

sanlock client inquire -p pid

Print the resource leases held by the given pid. The format is a versioned RESOURCE string "RESOURCE:lver" where lver is the version of the lease held.

sanlock client request -r RESOURCE -f force_mode

Request the owner of a resource do something specified by force_mode. A versioned RESOURCE:lver string must be used with a greater version than is presently held. Zero lver and force_mode clears the request.

sanlock client examine -r RESOURCE

Examine the request record for the currently held resource lease and carry out the action specified by the requested force_mode.

sanlock client examine -s LOCKSPACE

Examine requests for all resource leases currently held in the named lockspace. Only lockspace_name is used from the LOCKSPACE argument.

sanlock direct action [options]

    -a 0|1    use async i/o
    -o sec    io timeout in seconds

sanlock direct init -s LOCKSPACE
sanlock direct init -r RESOURCE

Initialize storage for 2000 host_id (delta) leases for the given lockspace, or initialize storage for one resource (paxos) lease. Both options require 1MB of space. The host_id in the LOCKSPACE string is not relevant to initialization, so the value is ignored. (The default of 2000 host_ids can be changed for special cases using the -n num_hosts and -m max_hosts options.) With -s, the -o option specifies the io timeout to be written in the host_id leases.
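
When several leases share one device, each sanlock direct init command needs its own aligned offset. The sketch below is illustrative only and does not call sanlock. It builds the -s and -r argument strings for one lockspace followed by its resource leases, using the alignment rule from the offsets section (1MB for 512 byte sectors, 8MB for 4096 byte sectors). It assumes "1MB" means 1048576 bytes, and all function names are hypothetical.

```python
from typing import List, Sequence

MIB = 1024 * 1024  # assumption: "1MB" in the text means 1048576 bytes


def lease_alignment(sector_size: int) -> int:
    """Required offset alignment in bytes for the given sector size."""
    if sector_size == 512:
        return 1 * MIB
    if sector_size == 4096:
        return 8 * MIB
    raise ValueError(f"sector size not covered by the text: {sector_size}")


def lockspace_arg(name: str, host_id: int, path: str, offset: int) -> str:
    """Format lockspace_name:host_id:path:offset."""
    return f"{name}:{host_id}:{path}:{offset}"


def resource_arg(lockspace: str, resource: str, path: str, offset: int) -> str:
    """Format lockspace_name:resource_name:path:offset."""
    return f"{lockspace}:{resource}:{path}:{offset}"


def init_commands(lockspace: str, path: str, resources: Sequence[str],
                  sector_size: int = 512) -> List[str]:
    """Place the host_id leases at offset 0 and each resource lease in the
    next aligned slot on the same path."""
    step = lease_alignment(sector_size)
    # host_id is ignored by init, so 0 is used as in the examples on this page
    cmds = ["sanlock direct init -s " + lockspace_arg(lockspace, 0, path, 0)]
    for slot, resource in enumerate(resources, start=1):
        cmds.append("sanlock direct init -r "
                    + resource_arg(lockspace, resource, path, slot * step))
    return cmds


if __name__ == "__main__":
    for cmd in init_commands("LS1", "/dev/pool1/leases", ["Adata"]):
        print("# " + cmd)
```

For lockspace LS1, path /dev/pool1/leases and the single resource Adata, this prints the same two commands shown in the offsets section. The device must be large enough to hold every slot that is laid out this way.
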
sanlock direct read_leader -s LOCKSPACE
sanlock direct read_leader -r RESOURCE

Read a leader record from disk and print the fields. The leader record is the single sector of a delta lease, or the first sector of a paxos lease.

sanlock direct dump path[:offset]

Read disk sectors and print leader records for delta or paxos leases. Add -f 1 to print the request record values for paxos leases, and host_ids set in delta lease bitmaps.

LOCKSPACE option string

-s lockspace_name:host_id:path:offset

    lockspace_name    name of lockspace
    host_id           local host identifier in lockspace
    path              path to storage reserved for leases
    offset            offset on path (bytes)

RESOURCE option string

-r lockspace_name:resource_name:path:offset

    lockspace_name    name of lockspace
    resource_name     name of resource
    path              path to storage reserved for leases
    offset            offset on path (bytes)

RESOURCE option string with suffix

-r lockspace_name:resource_name:path:offset:lver

    lver              leader version

-r lockspace_name:resource_name:path:offset:SH

    SH                indicates shared mode

Defaults

sanlock help shows the default values for the options above.

sanlock version shows the build version.

USAGE
Request/Examine

The first part of making a request for a resource is writing the request record of the resource (the sector following the leader record). To make a successful request:

o RESOURCE:lver must be greater than the lver presently held by the other host. This implies the leader record must be read to discover the lver, prior to making a request.

o RESOURCE:lver must be greater than or equal to the lver presently written to the request record. Two hosts may write a new request at the same time for the same lver, in which case both would succeed, but the force_mode from the last would win.

o The force_mode must be greater than zero.

o To unconditionally clear the request record (set both lver and force_mode to 0), make a request with RESOURCE:0 and force_mode 0.

The owner of the requested resource will not know of the request unless it is explicitly told to examine its resources via the "examine" api/command, or otherwise notified.

The second part of making a request is notifying the resource lease owner that it should examine the request records of its resource leases. The notification will cause the lease owner to automatically run the equivalent of "sanlock client examine -s LOCKSPACE" for the lockspace of the requested resource.

The notification is made using a bitmap in each host_id delta lease. Each bit represents one of the possible host_ids (1-2000). If host A wants to notify host B to examine its resources, A sets the bit in its own bitmap that corresponds to the host_id of B. When B next renews its delta lease, it reads the delta leases for all hosts and checks each bitmap to see if its own host_id has been set. It finds the bit for its own host_id set in A's bitmap, and examines its resource request records. (The bit remains set in A's bitmap for request_finish_seconds.)

force_mode determines the action the resource lease owner should take:

1 (FORCE): kill the process holding the resource lease.
When the process has exited, the resource lease will be released, and can then be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if SIGKILL is restricted.)

2 (GRACEFUL): run the program configured by sanlock_killpath against the process holding the resource lease. If no killpath is defined, then FORCE is used.

Graceful recovery

When a lockspace host_id cannot be renewed for a specific period of time, sanlock enters a recovery mode in which it attempts to forcibly release any resource leases in that lockspace. If all the leases are not released within 60 seconds, the watchdog will fire, resetting the host.

The most immediate way of releasing the resource leases in the failed lockspace is by sending SIGKILL to all pids holding the leases, and automatically releasing the resource leases as the pids exit. After all pids have exited, no resource leases are held in the lockspace, the watchdog expiration is removed, and the host can avoid the watchdog reset.

A slightly more graceful approach is to send SIGTERM to a pid before escalating to SIGKILL. sanlock does this by sending SIGTERM to each pid, once a second, for the first N seconds, before sending SIGKILL once a second for the remaining M seconds (N/M can be tuned with the -g daemon option.)

An even more graceful approach is to configure a program for sanlock to run that will terminate or suspend each pid, and explicitly release the leases it held. sanlock will run this program for each pid. It has N seconds to terminate the pid or explicitly release its leases before sanlock escalates to SIGKILL for the remaining M seconds.

SEE ALSO
wdmd(8)

2011-08-05                                                          SANLOCK(8)
