NUMA-Aware Programming

Keywords: numa aware programming,memory binding,libnuma,numa topology,numa optimization

NUMA-Aware Programming is the practice of allocating and accessing memory in ways that minimize cross-NUMA-node memory accesses — exploiting the topology of Non-Uniform Memory Access systems to reduce memory latency and increase bandwidth.

NUMA Topology

- Modern servers: 2–8 NUMA nodes; each node pairs a group of CPU cores with its own local DRAM.
- Local access: CPU accesses DRAM on same node — 80–100ns, full bandwidth.
- Remote access: CPU accesses DRAM on different node via QPI/UPI/Infinity Fabric — 150–300ns, reduced bandwidth.
- Remote penalty: 2–4x slower than local access.

Detecting NUMA Topology

```bash
numactl --hardware   # Show nodes, CPUs per node, memory
lscpu | grep NUMA    # NUMA node count
numastat             # NUMA hit/miss statistics per process
```
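
The same information can be read programmatically through libnuma, which is useful when a program wants to size per-node data structures at startup. A minimal sketch, compiled with `-lnuma` (the output format is illustrative):

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    // libnuma calls are only valid if the kernel exposes NUMA support
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int max_node = numa_max_node();
    printf("Configured NUMA nodes: %d\n", numa_num_configured_nodes());

    for (int node = 0; node <= max_node; node++) {
        long long free_bytes;
        long long total = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MiB total, %lld MiB free\n",
               node, total >> 20, free_bytes >> 20);
    }

    // Relative access cost between nodes (10 = local; larger = slower)
    printf("distance node 0 -> node %d: %d\n",
           max_node, numa_distance(0, max_node));
    return 0;
}
```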

Memory Allocation Policies

```c
#include <numa.h>

// Allocate on the current node (first-touch policy, the default)
void *p1 = malloc(size);   // placed on the node that first accesses it

// Explicit node allocation
void *p2 = numa_alloc_onnode(size, node_id);

// Interleave pages across all nodes (good for shared data)
void *p3 = numa_alloc_interleaved(size);

// Bind the calling thread to the CPUs of a node
numa_run_on_node(node_id);
```
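
Two practical details when using the `numa_alloc_*` family: call `numa_available()` before any other libnuma function, and release the memory with `numa_free()` rather than `free()`. A minimal sketch (the 64 MiB buffer size is an illustrative assumption):

```c
#include <numa.h>
#include <string.h>

#define BUF_SIZE (64UL * 1024 * 1024)   // illustrative size

void node_local_buffer_example(void) {
    if (numa_available() < 0)
        return;                          // fall back to plain malloc on non-NUMA systems

    // Bind a 64 MiB buffer to node 0
    void *buf = numa_alloc_onnode(BUF_SIZE, 0);
    if (buf) {
        memset(buf, 0, BUF_SIZE);        // touching the pages commits them on that node
        numa_free(buf, BUF_SIZE);        // numa_alloc_* memory must go back via numa_free
    }
}
```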

First-Touch Policy

- Default Linux policy: Allocate on node where memory is first accessed.
- Pitfall: If main thread initializes data, it all lands on main thread's node.
- NUMA-aware initialization: Have each thread initialize its own portion.
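
A common way to apply this is a parallel first-touch initialization loop, sketched below with OpenMP. It assumes the threads are pinned (e.g. via `OMP_PLACES=cores` and `OMP_PROC_BIND=close`) and that later compute loops use the same static schedule, so each thread keeps working on the pages it faulted in.

```c
#include <stdlib.h>

// NUMA-aware first-touch initialization: each thread touches the pages it
// will later work on, so those pages land on that thread's local node.
double *alloc_and_init(size_t n) {
    double *a = malloc(n * sizeof *a);   // no physical pages placed yet

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;                      // first touch decides page placement

    return a;
}
```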

Thread Pinning (CPU Affinity)

```c
#define _GNU_SOURCE        // pthread_setaffinity_np and CPU_* macros are GNU extensions
#include <pthread.h>
#include <sched.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
```

- Pin thread to specific cores on specific NUMA node → predictable local memory access.
- Use with NUMA allocation: Thread pinned to node 0 + memory allocated on node 0 = local.
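
Putting the two together is the usual pattern. The sketch below pins the calling thread to a core and hands it a node-local buffer; the helper name is hypothetical, and it assumes `core_id` actually belongs to `node`:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <numa.h>
#include <string.h>

// Hypothetical worker setup: pin the calling thread to a core on `node`
// and give it a node-local scratch buffer, so its accesses stay local.
static void *make_local_buffer(int node, int core_id, size_t size) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);           // core_id is assumed to lie on `node`
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

    void *buf = numa_alloc_onnode(size, node);   // memory bound to the same node
    if (buf)
        memset(buf, 0, size);                    // fault pages in while pinned
    return buf;
}
```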

NUMA Impact on MPI

- MPI rank-to-core binding: Place communicating ranks on the same NUMA node.
- OpenMPI: `--bind-to core --map-by socket` controls NUMA-aware placement; see the example launch line below.
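
A typical launch line using those options might look like the following (the rank count and binary name are illustrative; `--report-bindings` prints the resulting placement so it can be verified):

```bash
# Bind each rank to one core, distribute ranks round-robin across sockets,
# and print the resulting placement for verification.
mpirun -np 16 --bind-to core --map-by socket --report-bindings ./my_app
```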

NUMA-aware programming is a critical optimization for multi-socket server workloads — database servers, HPC simulations, and in-memory analytics routinely achieve 2–3x performance improvements by aligning memory allocation with memory access patterns.
