NUMA-Aware Programming

Keywords: numa aware programming,memory binding,libnuma,numa topology,numa optimization

NUMA-Aware Programming is the practice of allocating and accessing memory in ways that minimize cross-NUMA-node memory accesses — exploiting the topology of Non-Uniform Memory Access systems to reduce memory latency and increase bandwidth.

NUMA Topology

- Modern servers: 2–8 NUMA nodes; each node pairs a group of CPU cores with its own local DRAM.
- Local access: CPU accesses DRAM on same node — 80–100ns, full bandwidth.
- Remote access: CPU accesses DRAM on different node via QPI/UPI/Infinity Fabric — 150–300ns, reduced bandwidth.
- Remote penalty: 2–4x slower than local access.

Detecting NUMA Topology

```bash
numactl --hardware   # Show nodes, CPUs per node, memory
lscpu | grep NUMA    # NUMA node count
numastat             # NUMA hit/miss statistics per process
```
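
The same information can be read programmatically through libnuma, which is useful when a program wants to size per-node data structures at startup. A minimal sketch, compiled with `-lnuma` (the output format is illustrative):

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    // libnuma calls are only valid if the kernel exposes NUMA support
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int max_node = numa_max_node();
    printf("Configured NUMA nodes: %d\n", numa_num_configured_nodes());

    for (int node = 0; node <= max_node; node++) {
        long long free_bytes;
        long long total = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MiB total, %lld MiB free\n",
               node, total >> 20, free_bytes >> 20);
    }

    // Relative access cost between nodes (10 = local; larger = slower)
    printf("distance node 0 -> node %d: %d\n",
           max_node, numa_distance(0, max_node));
    return 0;
}
```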

Memory Allocation Policies

```c
#include <numa.h>

// Allocate on the current node (first-touch policy, the default)
void *p1 = malloc(size);   // placed on the node that first accesses it

// Explicit node allocation
void *p2 = numa_alloc_onnode(size, node_id);

// Interleave pages across all nodes (good for shared data)
void *p3 = numa_alloc_interleaved(size);

// Bind the calling thread to the CPUs of a node
numa_run_on_node(node_id);
```
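
Two practical details when using the `numa_alloc_*` family: call `numa_available()` before any other libnuma function, and release the memory with `numa_free()` rather than `free()`. A minimal sketch (the 64 MiB buffer size is an illustrative assumption):

```c
#include <numa.h>
#include <string.h>

#define BUF_SIZE (64UL * 1024 * 1024)   // illustrative size

void node_local_buffer_example(void) {
    if (numa_available() < 0)
        return;                          // fall back to plain malloc on non-NUMA systems

    // Bind a 64 MiB buffer to node 0
    void *buf = numa_alloc_onnode(BUF_SIZE, 0);
    if (buf) {
        memset(buf, 0, BUF_SIZE);        // touching the pages commits them on that node
        numa_free(buf, BUF_SIZE);        // numa_alloc_* memory must go back via numa_free
    }
}
```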

First-Touch Policy

- Default Linux policy: Allocate on node where memory is first accessed.
- Pitfall: If main thread initializes data, it all lands on main thread's node.
- NUMA-aware initialization: Have each thread initialize its own portion.
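
A common way to apply this is a parallel first-touch initialization loop, sketched below with OpenMP. It assumes the threads are pinned (e.g. via `OMP_PLACES=cores` and `OMP_PROC_BIND=close`) and that later compute loops use the same static schedule, so each thread keeps working on the pages it faulted in.

```c
#include <stdlib.h>

// NUMA-aware first-touch initialization: each thread touches the pages it
// will later work on, so those pages land on that thread's local node.
double *alloc_and_init(size_t n) {
    double *a = malloc(n * sizeof *a);   // no physical pages placed yet

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;                      // first touch decides page placement

    return a;
}
```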

Thread Pinning (CPU Affinity)

```c
#define _GNU_SOURCE        // pthread_setaffinity_np and CPU_* macros are GNU extensions
#include <pthread.h>
#include <sched.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
```

- Pin thread to specific cores on specific NUMA node → predictable local memory access.
- Use with NUMA allocation: Thread pinned to node 0 + memory allocated on node 0 = local.
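
Putting the two together is the usual pattern. The sketch below pins the calling thread to a core and hands it a node-local buffer; the helper name is hypothetical, and it assumes `core_id` actually belongs to `node`:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>
#include <string.h>

// Hypothetical worker setup: pin the calling thread to a core on `node`
// and give it a node-local scratch buffer, so its accesses stay local.
static void *make_local_buffer(int node, int core_id, size_t size) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);           // core_id is assumed to lie on `node`
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

    void *buf = numa_alloc_onnode(size, node);   // memory bound to the same node
    if (buf)
        memset(buf, 0, size);                    // fault pages in while pinned
    return buf;
}
```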

NUMA Impact on MPI

- MPI rank-to-core binding: Place communicating ranks on the same NUMA node.
- OpenMPI: `--bind-to core --map-by socket` controls NUMA-aware placement; see the example launch line below.
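
A typical launch line using those options might look like the following (the rank count and binary name are illustrative; `--report-bindings` prints the resulting placement so it can be verified):

```bash
# Bind each rank to one core, distribute ranks round-robin across sockets,
# and print the resulting placement for verification.
mpirun -np 16 --bind-to core --map-by socket --report-bindings ./my_app
```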

NUMA-aware programming is a critical optimization for multi-socket server workloads — database servers, HPC simulations, and in-memory analytics routinely achieve 2–3x performance improvements by aligning memory allocation with memory access patterns.
