NUMA-Aware Memory Allocation is the practice of placing memory pages on the NUMA (Non-Uniform Memory Access) node closest to the processor that will most frequently access them, minimizing memory latency and maximizing bandwidth for parallel applications — on modern multi-socket servers, ignoring NUMA topology can cause 2-3× performance degradation due to remote memory access penalties.
NUMA Architecture Fundamentals:
- Memory Locality: each processor socket has directly attached memory (local DRAM) — accessing local memory takes 80-100 ns, while accessing memory on another socket (remote) takes 130-200 ns, a 1.5-2× latency penalty
- Bandwidth Asymmetry: local memory bandwidth per socket is typically 100-200 GB/s (DDR5), while the inter-socket interconnect (UPI, Infinity Fabric) provides 50-100 GB/s — remote bandwidth is 50-70% of local
- NUMA Node: a processor socket and its local memory form a NUMA node — a dual-socket server has 2 NUMA nodes, a quad-socket has 4, and AMD EPYC processors expose multiple NUMA nodes per socket (NPS4 mode creates 4 nodes per socket)
- Topology Discovery: numactl --hardware displays the system's NUMA topology — shows node distances, memory sizes, and CPU-to-node mappings
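The topology that numactl --hardware reports can also be read directly from sysfs. The following is a minimal sketch, assuming a Linux system that exposes /sys/devices/system/node; the helper names parse_cpulist and read_topology are illustrative and not part of any library.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def read_topology(root="/sys/devices/system/node"):
    """Return {node_id: {"cpus": [...], "distances": [...]}}; empty if no node entries exist."""
    topo = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        try:
            with open(os.path.join(path, "cpulist")) as f:
                cpus = parse_cpulist(f.read())
            with open(os.path.join(path, "distance")) as f:
                distances = [int(d) for d in f.read().split()]
        except OSError:
            continue  # skip nodes whose files cannot be read
        topo[node_id] = {"cpus": cpus, "distances": distances}
    return topo


if __name__ == "__main__":
    for node, info in sorted(read_topology().items()):
        print(f"node {node}: cpus={info['cpus']} distances={info['distances']}")
```

The distance values are the kernel's relative costs (10 conventionally means local), not nanoseconds.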
Linux NUMA Memory Policies:
- First-Touch: the behavior of the default local-allocation policy; a page is allocated on the NUMA node of the processor that first touches (typically first writes to) it, which is effective when initialization and computation happen on the same threads
- Interleave: pages are distributed round-robin across specified NUMA nodes — provides uniform average latency and balances memory bandwidth across nodes — ideal for shared data structures accessed by all threads
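- Bind: restricts allocation to the specified NUMA nodes, so pages stay on those nodes wherever the threads run; a thread that migrates to another node then accesses them remotely, which is why bind is combined with process pinning to guarantee locality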
- Preferred: attempts allocation on the specified node but falls back to others if memory is exhausted — softer constraint than bind, prevents out-of-memory failures on overcommitted nodes
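The latency trade-off between these policies can be estimated with a simple weighted average. This is a rough model, not a measurement: it assumes accesses are spread uniformly over the data, that every remote node costs the same, and it uses 90 ns local and 160 ns remote as illustrative values taken from the ranges quoted earlier. The function names are hypothetical.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Expected latency when `local_fraction` of accesses hit local memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def interleave_local_fraction(num_nodes):
    """With round-robin interleaving over N nodes, 1/N of the pages are local to any thread."""
    return 1.0 / num_nodes


if __name__ == "__main__":
    local_ns, remote_ns = 90, 160  # assumed values within the quoted ranges
    print("all local (ideal first-touch):", average_latency_ns(1.0, local_ns, remote_ns))
    for nodes in (2, 4):
        frac = interleave_local_fraction(nodes)
        print(f"interleave over {nodes} nodes:", average_latency_ns(frac, local_ns, remote_ns))
```

Under these assumptions interleaving gives up some latency relative to a perfect first-touch layout, in exchange for spreading bandwidth demand evenly across all nodes.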
Programming APIs:
- numactl Command: numactl --membind=0 --cpunodebind=0 ./program — pins both threads and memory to node 0 — simplest approach requiring no code changes
- libnuma (numa_alloc_onnode): programmatic NUMA allocation — numa_alloc_onnode(size, node) allocates size bytes on the specified NUMA node, enabling fine-grained per-object placement
- mbind System Call: sets the NUMA policy for a specific memory range; the mode argument is MPOL_BIND, MPOL_INTERLEAVE, or MPOL_PREFERRED, and a node mask specifies the nodes the mode applies to
- mmap with NUMA: combine mmap(MAP_ANONYMOUS) with mbind to create NUMA-aware memory regions — enables custom allocators with per-page NUMA control
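As an illustration of per-object placement, the sketch below calls libnuma's numa_available, numa_max_node, numa_alloc_onnode, and numa_free through ctypes. It assumes libnuma is installed as a shared library; when it is missing, or the kernel reports NUMA as unavailable, the loader returns None. The wrapper names load_libnuma and touch_on_node are hypothetical, and a C program would call the same functions directly.

```python
import ctypes
import ctypes.util


def load_libnuma():
    """Return a ctypes handle to libnuma, or None if it is missing or NUMA is unavailable."""
    name = ctypes.util.find_library("numa")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return None
    # Declare the C signatures so pointers and sizes are passed at full width.
    lib.numa_available.restype = ctypes.c_int
    lib.numa_max_node.restype = ctypes.c_int
    lib.numa_alloc_onnode.restype = ctypes.c_void_p
    lib.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
    lib.numa_free.restype = None
    lib.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    if lib.numa_available() < 0:
        return None
    return lib


def touch_on_node(lib, node, size=4096):
    """Allocate `size` bytes on `node`, write to them so the pages are faulted in, then free."""
    ptr = lib.numa_alloc_onnode(size, node)
    if not ptr:
        return False
    ctypes.memset(ptr, 0, size)  # the write triggers the actual page allocation
    lib.numa_free(ptr, size)
    return True


if __name__ == "__main__":
    lib = load_libnuma()
    if lib is None:
        print("libnuma not usable on this system")
    else:
        for node in range(lib.numa_max_node() + 1):
            print(f"node {node}: allocation ok = {touch_on_node(lib, node)}")
```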
Parallel Programming Patterns:
- Parallel First-Touch Initialization: initialize arrays in a parallel loop with the same thread-to-data mapping as the computation — each thread touches its portion first, placing pages on the correct NUMA node — dramatically improves performance compared to serial initialization
- Socket-Aware Thread Binding: pin OpenMP threads to specific cores with OMP_PLACES=cores and OMP_PROC_BIND=close — ensures threads and their data remain on the same NUMA node throughout execution
- Per-Node Data Structures: allocate separate copies of shared data structures on each NUMA node — threads access their node-local copy, periodic synchronization merges results
- NUMA-Aware Memory Pools: custom allocators maintain per-node free lists — thread-local allocation draws from the local node's pool, eliminating cross-node allocation overhead
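Parallel first-touch initialization can be sketched as follows. Each worker pins itself to one CPU and then writes the first byte of every page in its own contiguous chunk, so the page faults (and therefore the placement under the default policy) happen on that worker's node. This is an illustration of the pattern only: it assumes Linux, the function name first_touch_parallel is hypothetical, and a production code would normally do this in C or Fortran with an OpenMP loop that uses the same static schedule as the computation.

```python
import mmap
import os
import threading

PAGE = mmap.PAGESIZE


def first_touch_parallel(buf, worker_cpus):
    """Give each worker one contiguous run of pages; the worker pins itself and touches them."""
    num_pages = len(buf) // PAGE
    per_worker = -(-num_pages // len(worker_cpus))  # ceiling division

    def worker(index, cpu):
        try:
            os.sched_setaffinity(0, {cpu})  # pid 0 = the calling thread on Linux
        except OSError:
            pass  # pinning not permitted: continue unpinned, placement follows the scheduler
        start = index * per_worker
        stop = min(start + per_worker, num_pages)
        for page in range(start, stop):
            buf[page * PAGE] = (index % 255) + 1  # first write faults the page in

    threads = [
        threading.Thread(target=worker, args=(i, cpu))
        for i, cpu in enumerate(worker_cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))[:4]
    # Anonymous private mapping: pages have no physical placement until touched.
    buf = mmap.mmap(-1, 64 * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    first_touch_parallel(buf, cpus)
    print("pages touched by", len(cpus), "pinned workers")
```

The computation phase should then reuse the same index-to-chunk mapping and the same pinning, otherwise the locality established here is lost.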
Common Pitfalls:
- Serial Initialization: initializing a large array in the main thread places all of its pages on the main thread's node under first-touch (node 0 in this example); subsequent parallel access from threads on node 1 is then remote for every access
- Thread Migration: if the OS migrates a thread to a different NUMA node, its previously local memory becomes remote — use taskset, pthread_setaffinity_np, or cgroup cpusets to prevent migration
- Memory Balancing: Linux's automatic NUMA balancing (AutoNUMA) migrates pages to reduce remote accesses — can help but also adds overhead from page scanning and migration, sometimes hurting performance
- Transparent Huge Pages (THP): 2MB huge pages reduce TLB misses but make NUMA migration more expensive — a single misplaced 2MB page wastes more bandwidth than a misplaced 4KB page
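Before tuning around the last two pitfalls it helps to know how the machine is configured. The sketch below reads the automatic NUMA balancing switch and the active THP mode from their usual procfs and sysfs locations; it assumes a Linux kernel built with these features, and returns None for any file that is absent. The helper names are hypothetical.

```python
import re


def read_text(path):
    """Return the stripped contents of `path`, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def active_thp_mode(text):
    """Extract the bracketed choice, e.g. 'always [madvise] never' -> 'madvise'."""
    match = re.search(r"\[(\w+)\]", text or "")
    return match.group(1) if match else None


def numa_pitfall_settings():
    """Collect the settings that govern automatic page migration and huge pages."""
    return {
        # "0" means automatic NUMA balancing is off; a non-zero value means it is on
        "numa_balancing": read_text("/proc/sys/kernel/numa_balancing"),
        "thp_enabled": active_thp_mode(
            read_text("/sys/kernel/mm/transparent_hugepage/enabled")
        ),
    }


if __name__ == "__main__":
    for key, value in numa_pitfall_settings().items():
        print(f"{key}: {value}")
```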
Diagnosis and Monitoring:
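- numastat: displays per-node memory allocation statistics; numa_miss counts pages allocated on a node although the process preferred a different one, and numa_foreign counts pages intended for a node but allocated elsewhere, so rising values reveal allocations that could not be satisfied on the preferred node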
- perf stat: hardware performance counters track local vs. remote memory accesses — high remote access ratios indicate NUMA placement problems
- Intel VTune: NUMA analysis view correlates memory access latency with thread placement — identifies specific data structures causing remote access bottlenecks
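The counters that numastat prints are also available per node in sysfs, which makes them easy to track from a script. The following is a minimal sketch that assumes the usual "name value" layout of /sys/devices/system/node/nodeN/numastat; parse_numastat and miss_ratio are illustrative names, and the ratio is only a rough indicator since the counters are cumulative since boot.

```python
import glob


def parse_numastat(text):
    """Parse 'name value' lines (numa_hit, numa_miss, numa_foreign, ...) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def miss_ratio(stats):
    """Fraction of this node's allocations that were placed here despite another node being preferred."""
    hits = stats.get("numa_hit", 0)
    misses = stats.get("numa_miss", 0)
    total = hits + misses
    return misses / total if total else 0.0


if __name__ == "__main__":
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*/numastat")):
        try:
            with open(path) as f:
                stats = parse_numastat(f.read())
        except OSError:
            continue
        print(f"{path}: miss ratio = {miss_ratio(stats):.4f}")
```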
NUMA-aware programming transforms memory access from a variable-latency operation into a predictable low-latency one. For memory-bandwidth-bound applications (which include most HPC and data analytics workloads), proper NUMA placement is often among the largest performance optimizations after basic parallelization.