rds-rdma(7) [linux man page]

RDS zerocopy(7)                                      Miscellaneous Information Manual                                      RDS zerocopy(7)

NAME

       RDS-rdma - Zerocopy Interface for RDMA over RDS

DESCRIPTION

       This  manual  page  describes the zerocopy interface of RDS, which was added in RDSv3. For a description of the basic RDS interface, please
       refer to rds(7).

       The principal mode of operation for RDS zerocopy is like this: one participant (the client) wishes to initiate a direct transfer to or from
       some area of memory in its process address space.  This memory does not have to be aligned.

       The client obtains a handle for this region of memory, called the RDMA cookie, and passes it to the other participant (the server).
       To the application, the cookie is an opaque 64-bit data type.

       The client sends this handle to the server application, along with other details of the RDMA request (such as which  data  to  transfer  to
       that memory area).  Throughout the following discussion, we will refer to this message as the RDMA request.

       The server uses this RDMA cookie to initiate the requested RDMA transfer. The RDMA transfer is combined atomically with a normal RDS
       message, which is delivered to the client. The rest of this page calls this message the RDMA ACK. Atomic in this context means that
       either the RDMA succeeds and the RDMA ACK is delivered, or neither happens.

       Thus,  when  the  client  receives the RDMA ACK, it knows that the RDMA has completed successfully. It can then release the RDMA cookie for
       this memory region, if it wishes to.

       RDMA operations are not reliable: unlike normal RDS messages, RDS RDMA operations may fail and be dropped.

INTERFACE

       The interface is currently based on control messages (ancillary data) sent or received via the  sendmsg(2)  and  recvmsg(2)  system  calls.
       Optionally, an older interface can be used that is based on the setsockopt(2) system call. However, we recommend using control messages, as
       this reduces the number of system calls required.

   Control message interface
       With the control message interface, the RDMA cookie is passed to the server out-of-band, included in an extension header  attached  to  the
       RDS message.

       The following outlines the mode of operation; the data types used are specified in detail in a subsequent section.

       Initially, the client sends the RDMA request along with an RDS_CMSG_RDMA_MAP control message. The control message contains the address
       and length of the memory region for which to obtain a handle, some flags, and a pointer to a memory location (in the caller's address
       space) where the kernel will store the RDMA cookie.

       Alternatively, if the application has already obtained an RDMA cookie for the memory range it wants to RDMA to or from, it can hand
       this cookie to the kernel using the RDS_CMSG_RDMA_DEST control message.

       Either way, the kernel will include the resulting RDMA cookie in an extension header that is transmitted as part of the RDMA request to the
       server.

       When the server receives the RDMA request, the kernel delivers the cookie wrapped inside an RDS_CMSG_RDMA_DEST control message.

       The server then initiates the data transfer by sending the RDMA ACK message along with an RDS_CMSG_RDMA_ARGS control message. This
       control message contains the RDMA cookie and the local memory to copy to or from.

       The server process may request a notification when an RDMA operation completes. Notifications are delivered as RDS_CMSG_RDMA_STATUS
       control messages. When an application calls recvmsg(2), it will either receive a regular RDS message (possibly with other RDMA-related
       control messages), or an empty message with one or more status control messages.

       In addition, the application can ask to be notified whenever an RDMA operation fails for some reason and is discarded, regardless of
       whether it asked for a completion notification for that individual operation. This behavior is turned on by setting the RDS_RECVERR
       socket option.

   Setsockopt interface
       In addition to the control message interface, RDS allows a process to register and release memory ranges for RDMA through calls to setsock-
       opt(2).

       RDS_GET_MR
              To obtain an RDMA cookie for a given memory range, the application can use setsockopt with RDS_GET_MR. This operates essentially
              the same way as the RDS_CMSG_RDMA_MAP control message: the argument contains the address and length of the memory range to be
              registered, and a pointer to an RDMA cookie variable, in which the system call will store the cookie for the registered range.

       RDS_FREE_MR
              Memory ranges can be released by calling setsockopt with RDS_FREE_MR, giving the RDMA cookie and additional flags as arguments.

       RDS_RECVERR
              This is a boolean option which can be set as well as queried (using getsockopt).  When enabled, RDS will send RDMA notification mes-
              sages to the application for any RDMA operation that fails. This option defaults to off.

       For all of these calls, the level argument to setsockopt is SOL_RDS.
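
       As a sketch of the setsockopt interface, the following helpers register a memory range with RDS_GET_MR and switch RDS_RECVERR on or
       off. The helper names register_region and set_recverr are made up for this example. The sketch assumes that the constants and
       struct rds_get_mr_args come from <linux/rds.h>, and that fd is an RDS socket that has already been bound.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/rds.h>   /* assumed source of SOL_RDS, RDS_GET_MR, RDS_RECVERR */

/* Hypothetical helper: register buf[0..len) for RDMA on an RDS socket.
 * The kernel stores the resulting cookie in *cookie.
 * Returns 0 on success, -1 with errno set otherwise. */
static int register_region(int fd, void *buf, size_t len,
                           rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;

    memset(&args, 0, sizeof(args));
    args.vec.addr    = (uint64_t)(uintptr_t)buf;
    args.vec.bytes   = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags       = 0;   /* no USE_ONCE: release later with RDS_FREE_MR */

    return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}

/* Hypothetical helper: enable (on != 0) or disable failure notifications. */
static int set_recverr(int fd, int on)
{
    return setsockopt(fd, SOL_RDS, RDS_RECVERR, &on, sizeof(on));
}
```

       Both helpers simply pass through the setsockopt result, so the caller can inspect errno as described in the ERRORS section.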

RDMA MACROS AND TYPES

       RDMA cookie
              typedef u_int64_t       rds_rdma_cookie_t;

              This encapsulates a memory location in the client process. In the current implementation, it contains the R_Key of the remote memory
              region, and the offset into it (so that the application does not have to worry about alignment).

              The   RDMA  cookie  is  used  in  several  struct  types  described  below.   The  RDS_CMSG_RDMA_DEST  control  message  contains  a
              rds_rdma_cookie_t all by itself as payload.
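
              To illustrate how an R_Key and an offset can share one 64-bit value, here is a small sketch. The bit placement (R_Key in the
              low 32 bits, offset in the high 32 bits) and the function names are assumptions made for this example; this page does not
              specify the layout, and applications should keep treating the cookie as opaque.

```c
#include <stdint.h>

typedef uint64_t rds_rdma_cookie_t;   /* same width as the typedef above */

/* ASSUMPTION: R_Key occupies the low 32 bits, the offset the high 32 bits. */
static rds_rdma_cookie_t cookie_pack(uint32_t r_key, uint32_t offset)
{
    return (rds_rdma_cookie_t)r_key | ((rds_rdma_cookie_t)offset << 32);
}

/* Extract the R_Key part under the same assumption. */
static uint32_t cookie_r_key(rds_rdma_cookie_t cookie)
{
    return (uint32_t)cookie;
}

/* Extract the offset part under the same assumption. */
static uint32_t cookie_offset(rds_rdma_cookie_t cookie)
{
    return (uint32_t)(cookie >> 32);
}
```

              Carrying the offset inside the cookie is what lets the client register an unaligned buffer without telling the server about
              the alignment.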

       Mapping arguments
              The following data type is used with RDS_CMSG_RDMA_MAP control messages and with the RDS_GET_MR socket option:

              struct rds_iovec {
                      u_int64_t       addr;
                      u_int64_t       bytes;
              };

              struct rds_get_mr_args {
                      struct rds_iovec vec;
                      u_int64_t       cookie_addr;
                      u_int64_t       flags;
              };

              The cookie_addr field specifies the memory location where the kernel stores the RDMA cookie.

              The flags value is a bitwise OR of any of the following flags:

              RDS_RDMA_USE_ONCE
                     This tells the kernel that the allocated RDMA cookie is to be used exactly once. When the RDMA ACK message arrives, the  ker-
                     nel will automatically unbind the memory area and release any resources associated with the cookie.

                     If  this  flag  is  not  set,  it  is the application's responsibility to release the memory region at a later time using the
                     RDS_FREE_MR socket option.

              RDS_RDMA_INVALIDATE
                     Normally, RDMA memory mappings are invalidated lazily, as this requires some relatively costly synchronization with the HCA.
                     However, this means that the server application can continue to access the registered memory for some indeterminate amount of
                     time. If this flag is set, the RDS code will invalidate the mapping at the time it is released (either upon arrival of the
                     RDMA ACK, if RDS_RDMA_USE_ONCE was specified, or when the application destroys it using RDS_FREE_MR).
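
              The mapping arguments can be attached to an outgoing RDMA request roughly as follows. This is a sketch only: the helper name
              attach_rdma_map is hypothetical, the definitions are assumed to come from <linux/rds.h>, and the caller is assumed to supply a
              suitably aligned control buffer and to fill in the destination address and payload of the msghdr before calling sendmsg(2).

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/rds.h>   /* assumed source of the RDS constants and structs */

/* Hypothetical helper: put one RDS_CMSG_RDMA_MAP control message into
 * `control` and attach it to `msg`. The kernel will store the cookie in
 * *cookie when the message is sent. Returns 0, or -1 if `control` is too
 * small. */
static int attach_rdma_map(struct msghdr *msg, void *control, size_t control_len,
                           void *buf, size_t len, rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;
    struct cmsghdr *cmsg;

    if (control_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(control, 0, control_len);
    msg->msg_control    = control;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    memset(&args, 0, sizeof(args));
    args.vec.addr    = (uint64_t)(uintptr_t)buf;
    args.vec.bytes   = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags       = RDS_RDMA_USE_ONCE;   /* unbound when the RDMA ACK arrives */

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type  = RDS_CMSG_RDMA_MAP;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

              Using RDS_RDMA_USE_ONCE here means the client does not need a separate RDS_FREE_MR call for a one-shot transfer.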

       RDMA Operation
              RDMA operations are initiated by the server using the RDS_CMSG_RDMA_ARGS control message, which takes the following data as payload:

              struct rds_rdma_args {
                      rds_rdma_cookie_t cookie;
                      struct rds_iovec remote_vec;
                      u_int64_t       local_vec_addr;
                      u_int64_t       nr_local;
                      u_int64_t       flags;
                      u_int32_t       user_token;
              };

              The  cookie  argument contains the RDMA cookie received from the client.  The local memory is given via an array of rds_iovecs.  The
              array address is given in local_vec_addr, and its number of elements is given in nr_local.

              The struct member remote_vec specifies a location relative to the memory area identified by the cookie: remote_vec.addr is an offset
              into  that  region, and remote_vec.bytes is the length of the memory window to copy to/from.  This length must match the size of the
              local memory area, i.e. the sum of bytes in all members of the local iovec.
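
              The length rule can be checked in user space before the request is submitted. The following sketch uses a hypothetical helper,
              lengths_match, and repeats the struct rds_iovec listing from above (with uint64_t in place of u_int64_t) so that it compiles
              on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct rds_iovec listing above. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Hypothetical helper: returns 1 if the local segments add up to exactly
 * remote->bytes, and 0 otherwise (including when the sum would overflow). */
static int lengths_match(const struct rds_iovec *remote,
                         const struct rds_iovec *local, size_t nr_local)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < nr_local; i++) {
        if (local[i].bytes > UINT64_MAX - total)
            return 0;
        total += local[i].bytes;
    }
    return total == remote->bytes;
}
```

              A mismatch caught here avoids the EINVAL that the ERRORS section lists for this case.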

              The flags field contains the bitwise OR of any of the following flags:

              RDS_RDMA_READWRITE
                     If set, an RDMA WRITE is initiated from the server's memory to the client's. If not set, RDS will do an RDMA READ from the
                     client's memory to the server's memory.

              RDS_RDMA_FENCE
                     By default, InfiniBand makes no guarantee about the ordering of an RDMA READ with respect to subsequent SEND operations.
                     Setting this flag asks that the RDMA READ be fenced off from the subsequent RDMA ACK message. Setting this flag requires an
                     additional round-trip of the IB fabric, but it is a good idea to set this flag by default, unless you are really sure you
                     do not want it.

              RDS_RDMA_NOTIFY_ME
                     This flag requests a notification upon completion of the RDMA operation (successful or otherwise). The notification will
                     contain the value of the user_token field passed in by the application. This allows the application to release resources
                     (such as buffers) associated with the RDMA transfer.

              The user_token can be used to pass an application specific identifier to the kernel. This token is returned to the application  when
              a status notification is generated (see the following section).
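
              On the server side, the arguments described above could be assembled along these lines. The helper name attach_rdma_args is
              hypothetical, the definitions are assumed to come from <linux/rds.h> (so the width of user_token follows whatever that header
              declares), and the sketch assumes the transfer starts at offset 0 of the client's region.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/rds.h>   /* assumed source of the RDS constants and structs */

/* Hypothetical helper: attach one RDS_CMSG_RDMA_ARGS control message to the
 * RDMA ACK in `msg`. `local` must stay valid until sendmsg(2) returns.
 * Returns 0, or -1 if `control` is too small. */
static int attach_rdma_args(struct msghdr *msg, void *control, size_t control_len,
                            rds_rdma_cookie_t cookie,
                            struct rds_iovec *local, size_t nr_local,
                            uint64_t flags, uint32_t token)
{
    struct rds_rdma_args args;
    struct cmsghdr *cmsg;
    size_t i;

    if (control_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(&args, 0, sizeof(args));
    args.cookie          = cookie;
    args.remote_vec.addr = 0;                 /* assumed: start of the region */
    for (i = 0; i < nr_local; i++)            /* remote length = sum of local */
        args.remote_vec.bytes += local[i].bytes;
    args.local_vec_addr  = (uint64_t)(uintptr_t)local;
    args.nr_local        = nr_local;
    args.flags           = flags;
    args.user_token      = token;

    memset(control, 0, control_len);
    msg->msg_control    = control;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type  = RDS_CMSG_RDMA_ARGS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

              For an RDMA WRITE with a completion notification, a caller would pass RDS_RDMA_READWRITE | RDS_RDMA_NOTIFY_ME as flags; for
              an RDMA READ, RDS_RDMA_FENCE without RDS_RDMA_READWRITE.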

       RDMA Notification
              The RDS kernel code is able to notify the server application when an RDMA operation completes. These notifications are delivered via
              RDS_CMSG_RDMA_STATUS control messages.

              By default, no notifications are generated. There are two ways an application can request them. First, status notifications can
              be enabled on a per-operation basis by setting the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. Second, the application can
              request notifications for all RDMA operations that fail by setting the RDS_RECVERR socket option (see above). In both cases, the
              format of the notification is the same, and at most one notification will be sent per completed operation.

              The message format is this:

              struct rds_rdma_notify {
                      u_int32_t       user_token;
                      int32_t         status;
              };

              The  user_token  field contains the value previously given to the kernel in the RDS_CMSG_RDMA_ARGS control message. The status field
              contains a status value, with 0 indicating success, and non-zero indicating an error.

              The following status codes are currently defined:

              RDS_RDMA_SUCCESS
                     The RDMA operation succeeded.

              RDS_RDMA_REMOTE_ERROR
                     The RDMA operation failed due to a remote access error. This is usually due to an invalid R_Key, offset or transfer size.

              RDS_RDMA_CANCELED
                     The RDMA operation was canceled by the application. (This error code is not yet generated.)

              RDS_RDMA_DROPPED
                     The RDMA operation was discarded after the connection broke and was re-established. The operation may have been processed
                     partially.

              RDS_RDMA_OTHER_ERROR
                     Any other failure.
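
              A receiver might pick the notifications out of the ancillary data returned by recvmsg(2) as sketched below. The helper names
              rdma_status_name and collect_rdma_status are made up for this example, and the constants and struct rds_rdma_notify are assumed
              to come from <linux/rds.h>.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/rds.h>   /* assumed source of the RDS constants and structs */

/* Hypothetical helper: short description of a notification status code. */
static const char *rdma_status_name(int32_t status)
{
    switch (status) {
    case RDS_RDMA_SUCCESS:      return "success";
    case RDS_RDMA_REMOTE_ERROR: return "remote access error";
    case RDS_RDMA_CANCELED:     return "canceled";
    case RDS_RDMA_DROPPED:      return "dropped";
    case RDS_RDMA_OTHER_ERROR:  return "other error";
    default:                    return "unknown";
    }
}

/* Hypothetical helper: copy up to `max` RDS_CMSG_RDMA_STATUS payloads from
 * the control data of `msg` into `out`. Returns the number copied. */
static size_t collect_rdma_status(struct msghdr *msg,
                                  struct rds_rdma_notify *out, size_t max)
{
    struct cmsghdr *cmsg;
    size_t n = 0;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS || cmsg->cmsg_type != RDS_CMSG_RDMA_STATUS)
            continue;                       /* some other control message */
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(*out)) || n == max)
            continue;                       /* truncated, or no room left */
        memcpy(&out[n++], CMSG_DATA(cmsg), sizeof(*out));
    }
    return n;
}
```

              Each collected entry carries the user_token of the operation it refers to, which the application can use to find and release
              the matching buffers.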

       RDMA setsockopt arguments
              When  using  the  RDS_GET_MR  socket option to register a memory range, the application passes a pointer to a struct rds_get_mr_args
              variable, described above.

              The RDS_FREE_MR call takes an argument of type struct rds_free_mr_args:

              struct rds_free_mr_args {
                      rds_rdma_cookie_t cookie;
                      u_int64_t       flags;
              };

              cookie specifies the RDMA cookie to be released. RDMA access to the memory range will usually not be revoked instantly, because the
              operation is rather costly. However, if the flags argument contains RDS_RDMA_INVALIDATE, RDS will invalidate the indicated mapping
              immediately, as described in section Mapping arguments above.

              If the cookie argument is 0, and RDS_RDMA_INVALIDATE is set, RDS will invalidate old memory mappings on all devices.

ERRORS

       In addition to the usual error codes returned by sendmsg, recvmsg and setsockopt, RDS returns the following error codes:

       EAGAIN RDS was unable to map a memory range because the limit was exceeded (returned by RDS_CMSG_RDMA_MAP and RDS_GET_MR).

       EINVAL When sending a message, there were conflicting control messages (e.g. two RDMA_MAP messages, or an RDMA_MAP and an RDMA_DEST
              message).

              In an RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the application specified a memory range greater than the maximum size supported.

              When  setting  up an RDMA operation with RDS_CMSG_RDMA_ARGS, the size of the local memory (given in the rds_iovec) did not match the
              size of the remote memory range.

       EBUSY  RDS was unable to obtain a DMA mapping for the indicated memory.

LIMITS

       Currently, the following limits apply:

       o      The maximum size of a zerocopy transfer is 1MB. This can be adjusted via the fmr_message_size module parameter.

       o      The maximum number of memory ranges that can be mapped is limited to 2048 at the moment. This can be adjusted via the  fmr_pool_size
              module parameter. However, the actual limit imposed by the hardware may in fact be lower.
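
       An application that needs to move more data than one zerocopy transfer allows has to split the work into several RDMA operations.
       The arithmetic is sketched below. The helper names and the EXAMPLE_MAX_RDMA_BYTES constant are made up for this example, and the
       constant assumes that the 1MB default means 1024 * 1024 bytes; a real program would use whatever limit its system is configured
       with.

```c
#include <stdint.h>

/* ASSUMPTION: the default 1MB limit, read as 1024 * 1024 bytes. */
#define EXAMPLE_MAX_RDMA_BYTES (1024u * 1024u)

/* Number of RDMA operations needed to move `total` bytes when a single
 * operation may carry at most `max_bytes`. Returns 0 if max_bytes is 0. */
static uint64_t rdma_chunks_needed(uint64_t total, uint64_t max_bytes)
{
    if (max_bytes == 0)
        return 0;
    return total / max_bytes + (total % max_bytes != 0);
}

/* Length of chunk number `index` (0-based), or 0 if there is no such chunk. */
static uint64_t rdma_chunk_len(uint64_t total, uint64_t max_bytes, uint64_t index)
{
    uint64_t offset;

    if (index >= rdma_chunks_needed(total, max_bytes))
        return 0;
    offset = index * max_bytes;   /* cannot overflow: offset < total here */
    return (total - offset < max_bytes) ? total - offset : max_bytes;
}
```

       Each chunk still needs its own mapped memory range or its own window into one, so the limit on mapped ranges applies as well.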

AUTHORS

       RDS was written and is Copyright (C) 2007-2008 by Oracle, Inc.

                                                                                                                                   RDS zerocopy(7)