numa_intro(3) Library Functions Manual numa_intro(3)
NAME
numa_intro - Introduction to NUMA support
DESCRIPTION
NUMA, or Non-Uniform Memory Access, refers to a hardware architectural feature in modern multi-processor platforms that attempts to
address the increasing disparity between requirements for processor speed and bandwidth and the bandwidth capabilities of memory systems,
including the interconnect between processors and memory. NUMA systems address this problem by grouping resources--processors, I/O busses,
and memory--into building blocks that balance an appropriate number of processors and I/O busses with a local memory system that delivers
the necessary bandwidth. The local building blocks are combined into a larger system by means of a system-level interconnect with a
platform-specific topology.
The local processor and I/O components on a particular building block can access their own "local" memory with the lowest possible latency
for a particular system design. The local building block can in turn access the resources (processors, I/O, and memory) of remote building
blocks at the cost of increased access latency and decreased global access bandwidth. The term "Non-Uniform Memory Access" refers to the
difference in latency between "local" and "remote" memory accesses that can occur on a NUMA platform.
Overall system throughput and individual application performance are optimized on a NUMA platform by maximizing the ratio of local resource
accesses to remote accesses. This is achieved by recognizing and preserving the "affinity" that processes have for the various resources on
the system building blocks. For this reason, the building blocks are called "Resource Affinity Domains" or RADs.
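The effect of the local-to-remote ratio can be illustrated with a small calculation. The following is a minimal sketch, not part of any Tru64 UNIX library: the function name avg_latency_ns and the latency figures in the comments are hypothetical, and the model assumes a single fixed latency for every remote access, which real platforms with multi-hop topologies do not guarantee.

```c
#include <assert.h>

/*
 * Average memory access latency, in nanoseconds, for a workload in which
 * local_fraction of all accesses (0.0 to 1.0) are satisfied from the local
 * RAD and the remainder from a remote RAD.
 *
 * Hypothetical helper for illustration only; the caller supplies the
 * latencies, for example 100.0 ns local and 300.0 ns remote (made-up values).
 */
double avg_latency_ns(double local_fraction, double local_ns, double remote_ns)
{
    assert(local_fraction >= 0.0 && local_fraction <= 1.0);
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns;
}
```

With the made-up figures of 100 ns local and 300 ns remote, a workload that is 75% local averages 150 ns per access, while a fully local workload averages 100 ns.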
RADs are supported only on a class of platforms known as Cache Coherent NUMA, or CC NUMA, where all memory is accessible and cache coherent
with respect to all processors and I/O busses. The Tru64 UNIX operating system includes enhancements to optimize system throughput and
application performance on CC NUMA platforms for legacy applications as well as those that use NUMA-aware APIs. System enhancements to
support NUMA are discussed in the following subsections. Along with system performance monitoring and tuning facilities, these enhancements
allow the operating system to make a "best effort" to optimize the performance of any given collection of applications or application
components on a CC NUMA platform.
NUMA Enhancements to Basic UNIX Algorithms and Default Behaviors
For NUMA, modifications to basic UNIX algorithms (scheduling, memory allocation, and so forth) and to default behaviors maximize local
accesses transparently to applications. These modifications, which include the following, directly benefit legacy and non-NUMA-aware
applications that were designed for uniprocessors or Uniform Memory Access Symmetric Multiprocessors but run on CC NUMA platforms:
Topology-aware placement of data
The operating system attempts to allocate memory for application (and kernel) data on the RAD closest to where the data will be
accessed; or, for data that is globally accessed, the operating system may allocate memory across the available RADs. When there is
insufficient free memory on optimal RADs, the memory allocations for data may "overflow" onto nearby RADs.
Replication of read-only code and data
The operating system will attempt to make a local copy of read-only data, such as shared program and library code. Kernel code and
kernel read-only data are replicated on all RADs at boot time. If insufficient free local memory is available, the operating system
may choose to utilize a remote copy rather than wait for free local memory.
Memory affinity-aware scheduling
The operating system scheduler takes "cache affinity" into account when choosing a processor to run a process thread on multiprocessor
platforms. Cache affinity assumes that a process thread builds a "memory footprint" in a particular processor's cache. On CC
NUMA platforms, the scheduler also takes into account the fact that processes will have memory allocated on particular RADs, and
will attempt to keep processes running on processors that are in the same RAD as their memory footprints.
Load balancing
To minimize the requirement for remote memory allocation (overflow), the scheduler will take into account memory availability on a
RAD as well as the processor load average for the RAD. Although these two factors may at times conflict with one another, the
scheduler will attempt to balance the load so that processes run where there are memory pages as well as processor cycles available.
This balancing involves both the initial selection of a RAD at process creation and migration of processes or individual pages in
response to changing loads as processes come and go or their resource requirements or access patterns change.
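The placement and load-balancing behaviors described above can be sketched in a few lines of C. This is a simplified illustration under stated assumptions, not the algorithm used by the Tru64 UNIX kernel: the rad_state structure, its fields, and both function names are hypothetical, and a real scheduler would also weigh cache affinity, page migration cost, and changing access patterns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-RAD bookkeeping; not a Tru64 UNIX data structure. */
struct rad_state {
    size_t free_pages;  /* free memory pages on this RAD */
    double load_avg;    /* processor load average for this RAD */
    int    distance;    /* distance from the preferred RAD (0 = preferred) */
};

/*
 * Placement with overflow: return the index of the closest RAD that has
 * at least `pages` free pages, or -1 if no RAD can satisfy the request.
 * When the preferred RAD is full, the result "overflows" to a nearby RAD.
 */
int place_pages(const struct rad_state *rads, size_t nrads, size_t pages)
{
    int best = -1;

    assert(rads != NULL || nrads == 0);
    for (size_t i = 0; i < nrads; i++) {
        if (rads[i].free_pages < pages)
            continue;                       /* not enough free memory here */
        if (best < 0 || rads[i].distance < rads[best].distance)
            best = (int)i;
    }
    return best;
}

/*
 * Load balancing: among the RADs that have at least `pages` free pages,
 * return the index of the one with the lowest load average, or -1 if
 * none qualifies. Ties go to the lowest index.
 */
int pick_rad(const struct rad_state *rads, size_t nrads, size_t pages)
{
    int best = -1;

    assert(rads != NULL || nrads == 0);
    for (size_t i = 0; i < nrads; i++) {
        if (rads[i].free_pages < pages)
            continue;                       /* would force remote allocation */
        if (best < 0 || rads[i].load_avg < rads[best].load_avg)
            best = (int)i;
    }
    return best;
}
```

The two functions deliberately treat memory availability as a hard filter and then rank by a single criterion. That keeps the sketch short, but it does not model the case noted above in which memory availability and processor load conflict and must be traded off against each other.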
NUMA Enhancements to Application Programming Interfaces
Application programmers can use new or modified library routines to further increase local accesses on CC NUMA platforms. Using these APIs,
programmers can write new applications or modify old ones to provide additional information to the operating system or to take explicit
control over process, thread, memory object placement, or some combination of these. NUMA-aware routines are included in the following
libraries:
The Standard C Library (libc)
The POSIX Threads Library (libpthread)
The NUMA Library (libnuma)
The reference pages that document NUMA-aware APIs note their library location.
SEE ALSO
Files: numa_types(4)