NUMA-Aware Memory Allocation

NUMA-Aware Memory Allocation — Optimizing memory placement and access patterns on Non-Uniform Memory Access architectures where memory latency and bandwidth depend on the physical proximity between processors and memory banks.

NUMA Architecture Fundamentals — Modern multi-socket servers organize processors and memory into NUMA nodes, each containing a subset of CPU cores and locally attached DRAM. Accessing local memory within the same NUMA node is significantly faster than remote access across the interconnect. The latency ratio between remote and local access typically ranges from 1.5x to 3x depending on the number of hops. Memory bandwidth is similarly affected, with local bandwidth often 2-3x higher than remote bandwidth per core.
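The effect of the remote-to-local latency ratio can be made concrete with a small weighted-average estimate. This is a minimal sketch, not a measurement: the function name effective_latency_ns and the 80 ns local latency are illustrative assumptions, and real systems add caching and queuing effects that this model ignores.

```python
def effective_latency_ns(local_ns: float, remote_ratio: float, local_fraction: float) -> float:
    """Average memory latency when `local_fraction` of accesses hit the local node.

    local_ns       -- assumed latency of a local access, in nanoseconds
    remote_ratio   -- remote latency divided by local latency (e.g. 1.5 to 3.0)
    local_fraction -- share of accesses served by the local node, in [0, 1]
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_ratio
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Illustrative numbers only: 80 ns local latency, remote access 2x slower.
    for fraction in (1.0, 0.75, 0.5, 0.0):
        latency = effective_latency_ns(80.0, 2.0, fraction)
        print(f"local fraction {fraction:.2f}: {latency:.0f} ns average")
```

Under these assumed numbers, a workload with only half of its accesses local pays an average of 120 ns per access instead of 80 ns.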

Allocation Policies and Strategies — The first-touch policy, the Linux default, places each physical page on the NUMA node of the CPU that first accesses it rather than on the node where the allocation call was made, which makes initialization patterns critical. The interleave policy distributes pages round-robin across the selected NUMA nodes, providing uniform average latency at the cost of locality. The bind policy restricts allocation to specific NUMA nodes regardless of which thread accesses the data. Linux provides numactl for process-level control and libnuma for fine-grained programmatic allocation through calls such as numa_alloc_onnode() and numa_alloc_interleaved().
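The difference between these policies can be illustrated with a toy placement model. The sketch below does not call the kernel or libnuma; place_pages, local_fraction, and worker_node are hypothetical helpers that only mimic, in simplified form, where each policy would put a page and how many pages end up local to the thread that later works on them.

```python
def place_pages(num_pages, policy, first_toucher, num_nodes, bind_node=0):
    """Return the node chosen for each page under a simplified policy model.

    first_toucher -- function mapping a page index to the node of the thread
                     that touches that page first (used by "first-touch" only)
    """
    placement = []
    for page in range(num_pages):
        if policy == "first-touch":
            node = first_toucher(page)
        elif policy == "interleave":
            node = page % num_nodes  # round-robin across nodes
        elif policy == "bind":
            node = bind_node         # every page forced onto one node
        else:
            raise ValueError(f"unknown policy: {policy}")
        placement.append(node)
    return placement


def local_fraction(placement, accessor):
    """Share of pages that live on the node of the thread that accesses them."""
    local = sum(1 for page, node in enumerate(placement) if accessor(page) == node)
    return local / len(placement)


def worker_node(page):
    """Hypothetical workload: 8 pages, first half used on node 0, second on node 1."""
    return 0 if page < 4 else 1


if __name__ == "__main__":
    serial_init = place_pages(8, "first-touch", lambda page: 0, num_nodes=2)
    parallel_init = place_pages(8, "first-touch", worker_node, num_nodes=2)
    interleaved = place_pages(8, "interleave", worker_node, num_nodes=2)
    for name, placement in [("serial init", serial_init),
                            ("parallel init", parallel_init),
                            ("interleave", interleaved)]:
        print(name, placement, local_fraction(placement, worker_node))
```

In this model, initializing the data from a single thread leaves half of the pages remote to the workers, while initializing it with the same thread layout that later processes it makes every page local.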

Thread and Memory Affinity — Binding threads to specific cores using pthread_setaffinity_np() or hwloc ensures consistent NUMA node placement. Memory-intensive parallel loops should partition data so each thread primarily accesses memory allocated on its local NUMA node. OpenMP provides OMP_PLACES and OMP_PROC_BIND environment variables for portable affinity control. The combination of thread pinning and first-touch allocation creates a natural alignment between computation and data placement.
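As a rough illustration of pinning plus data partitioning, the sketch below uses Python's os.sched_setaffinity (Linux only) in place of pthread_setaffinity_np(), and reads a node's CPU list from the sysfs file /sys/devices/system/node/nodeN/cpulist. The helper names parse_cpulist, cpus_of_node, pin_to_node, and chunk_bounds are assumptions made for this example, not part of any library.

```python
import os


def parse_cpulist(text):
    """Parse a kernel CPU list such as '0-3,8-11' into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty CPU list
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_of_node(node):
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as handle:
        return parse_cpulist(handle.read())


def pin_to_node(node):
    """Restrict the caller (pid 0) to the CPUs of `node` that it may already use."""
    target = set(cpus_of_node(node)) & os.sched_getaffinity(0)
    if not target:
        raise RuntimeError(f"no usable CPUs on node {node}")
    os.sched_setaffinity(0, target)
    return target


def chunk_bounds(n_items, n_workers, worker):
    """Half-open [start, end) range of items owned by `worker` in a block partition."""
    base, extra = divmod(n_items, n_workers)
    start = worker * base + min(worker, extra)
    end = start + base + (1 if worker < extra else 0)
    return start, end
```

A worker would typically call pin_to_node() first and then initialize and process only its own chunk_bounds() range, so that first-touch places those pages on the node where the worker runs.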

Performance Diagnosis and Tuning — The Linux kernel maintains per-node allocation counters such as numa_hit and numa_miss, reported by numastat, which show whether pages were allocated on the intended node. Hardware performance counters, read through tools such as perf and Intel VTune, measure local versus remote memory accesses at run time and quantify NUMA effects on application performance. Page migration using move_pages() or automatic NUMA balancing in Linux can correct suboptimal initial placement. Memory-intensive applications can see 30-50% performance improvement from proper NUMA-aware allocation compared to naive placement, although the gain depends on the workload's access pattern.
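The numa_hit and numa_miss values that numastat prints are also available per node in /sys/devices/system/node/nodeN/numastat as "name value" lines. They count page allocations rather than individual memory accesses. The following sketch parses that format and derives a miss ratio; parse_numastat, miss_ratio, and read_node_stats are hypothetical helper names, and the sample text in the usage section is made up for illustration.

```python
def parse_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def miss_ratio(stats):
    """Fraction of pages allocated on this node that were meant for another node."""
    hits = stats.get("numa_hit", 0)
    misses = stats.get("numa_miss", 0)
    total = hits + misses
    return misses / total if total else 0.0


def read_node_stats(node):
    """Read the allocation counters of one NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/numastat") as handle:
        return parse_numastat(handle.read())


if __name__ == "__main__":
    # Made-up sample in the sysfs format, used so the sketch runs anywhere.
    sample = (
        "numa_hit 900\n"
        "numa_miss 100\n"
        "numa_foreign 0\n"
        "interleave_hit 5\n"
        "local_node 890\n"
        "other_node 110\n"
    )
    print(f"miss ratio: {miss_ratio(parse_numastat(sample)):.2f}")
```

A rising miss ratio between two readings suggests that allocations are spilling onto nodes other than the preferred one, which is a cue to check thread placement and initialization order.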

NUMA-aware memory allocation is essential for extracting full performance from modern multi-socket servers, directly impacting the scalability of memory-intensive parallel workloads.
