NUMA Architecture and Memory Affinity

Keywords: numa architecture memory access, numa node affinity, libnuma binding, first touch policy numa, remote numa penalty

NUMA architecture and memory affinity enable explicit placement of data and threads on multi-socket systems so that cores exploit the bandwidth and latency of their local memory, which is critical for HPC and data-center applications scaling to hundreds of cores.

Non-Uniform Memory Access Topology

- NUMA Organization: Multiple sockets (CPUs), each with local memory attached. Local socket memory ~100ns latency, remote socket memory ~200-400ns (2-4x penalty).
- Memory Bandwidth Asymmetry: Local DRAM bandwidth (e.g., ~100 GB/s per socket) is shared by that socket's cores. Remote accesses cross the QPI/UPI or Infinity Fabric interconnect, which provides less bandwidth than the local memory channels.
- Example Topology: Dual-socket Xeon with 32 cores per socket. Each core can access both sockets' memory, but local access is preferred (see the topology-dump sketch after this list).
- UMA vs NUMA: Older systems provided uniform memory access (UMA) through a shared front-side bus. Modern multi-socket systems are inherently NUMA because a centralized memory controller does not scale.
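
A quick way to inspect this topology on Linux is libnuma. The sketch below is a minimal example (it assumes contiguous node numbering; link with -lnuma) that prints the configured node count and the firmware-reported distance matrix, where 10 means local and larger values mean farther away:

```c
/* numa_topology.c - print NUMA node count and distance matrix (link with -lnuma) */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {              /* -1 means the kernel has no NUMA support */
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes(); /* assumes nodes are numbered 0..nodes-1 */
    printf("Configured NUMA nodes: %d\n", nodes);

    /* ACPI SLIT-style distances: 10 = local, ~20+ = one interconnect hop away */
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```

On a dual-socket machine this typically prints a 2x2 matrix with 10 on the diagonal and roughly 20 or more off the diagonal, mirroring the local-vs-remote latency gap described above.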

NUMA Node Binding and Thread Affinity

- NUMA Node Definition: Logical grouping of cores plus their associated memory. Socket-based binding pins threads to cores in the same socket as their data.
- numactl Command: numactl --cpunodebind=0 --membind=0 ./application restricts the process's threads to node 0's cores and its allocations to node 0's memory, preventing the OS from migrating either to another node.
- libnuma Library: Programmatic NUMA control via numa_alloc_onnode(), numa_bind(), numa_set_preferred(), and related calls. Enables application-level NUMA awareness (see the sketch after this list).
- cpuset Cgroups: Linux control groups restrict processes to CPU/memory subsets. System-wide NUMA orchestration via cgroups.
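
As a hedged illustration of the libnuma calls above, the sketch below pins the calling thread to node 0 and places its working buffer there; the node number and buffer size are arbitrary choices (link with -lnuma):

```c
/* bind_node0.c - pin the current thread and its buffer to NUMA node 0 (link with -lnuma) */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Restrict this thread to node 0's CPUs (analogous to numactl --cpunodebind=0) */
    if (numa_run_on_node(0) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* Allocate 64 MiB whose pages are bound to node 0 (analogous to --membind=0) */
    size_t size = 64UL << 20;
    double *buf = numa_alloc_onnode(size, 0);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* Touch the pages so they are actually faulted in on node 0 */
    for (size_t i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;

    printf("thread and 64 MiB buffer bound to NUMA node 0\n");
    numa_free(buf, size);
    return 0;
}
```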

First-Touch Policy

- Memory Allocation Mechanism: A physical page is placed on the NUMA node of the thread that first touches it (typically the first write, which faults the page in). The OS tracks which node each page resides on.
- Default Behavior: malloc() only reserves virtual address space; physical pages are allocated lazily on first touch, and the default Linux policy places them on the touching thread's node (local allocation). Applications override placement with numa_alloc_onnode() or numactl.
- First-Touch Implication: Thread A allocates an array but never initializes it; thread B initializes it. The pages land on thread B's node, so B gets local access while other threads pay the remote penalty.
- Guideline: Initialize data on the thread that will access it, or explicitly allocate it on the target node before any other thread touches it (see the sketch after this list).
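
A common way to apply this guideline is parallel first-touch initialization: each thread writes the slice of the array it will later compute on, so those pages fault in on its own node. A minimal sketch, assuming pinned OpenMP threads (e.g. OMP_PROC_BIND=true and OMP_PLACES=cores) and an illustrative array size; compile with -fopenmp:

```c
/* first_touch.c - initialize data with the same threads that will later compute on it */
#include <stdio.h>
#include <stdlib.h>

#define N (1UL << 27)   /* ~134M doubles (~1 GiB): spans many pages across threads */

int main(void) {
    double *a = malloc(N * sizeof(double));   /* virtual reservation only, no pages yet */
    if (!a) return 1;

    /* First touch: each thread writes its own static chunk, so those pages
     * are faulted in on that thread's NUMA node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute phase: the same static schedule maps the same indices to the same
     * threads, so every access stays node-local. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}
```

If instead a single thread initialized the whole array, all of its pages would land on that one thread's node, and threads on every other node would pay the remote penalty throughout the compute phase.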

Remote vs Local Memory Latency Impact

- Latency Difference: Local ~100ns, remote ~300ns (3x penalty). Impacts iterative workloads (large loop counts × remote access = significant slowdown).
- Bandwidth Scaling: With all-to-all access patterns the cross-socket interconnect becomes the bottleneck: a single socket may deliver ~100 GB/s from local DRAM, while the dual-socket aggregate under heavy cross-socket traffic may reach only ~150-200 GB/s (sub-linear scaling).
- Cache Effects: L3 cache (8-20 MB per socket) mitigates some remote access penalties. If working set fits in L3, remote penalty minimal.
- Example Impact: A loop performing 1000 dependent memory accesses takes 1000 × 200ns = 200µs when the data is remote versus 1000 × 100ns = 100µs when it is local, so a 2x slowdown is possible (see the latency probe sketched after this list).
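
To make these numbers concrete, one can time a dependent pointer chase over a buffer allocated on the local node and then on a remote node. The sketch below is a rough probe rather than a rigorous benchmark; it assumes at least two NUMA nodes and uses arbitrary sizes (link with -lnuma):

```c
/* chase.c - rough local-vs-remote latency probe via pointer chasing (link with -lnuma) */
#include <stdio.h>
#include <time.h>
#include <numa.h>

#define SLOTS (1UL << 24)            /* 16M pointers (~128 MiB): defeats the caches */
#define HOPS  (10 * 1000 * 1000L)

static double chase_ns(int node) {
    void **buf = numa_alloc_onnode(SLOTS * sizeof(void *), node);
    if (!buf) return -1.0;

    /* Build one long dependent chain; the odd stride yields a full-cycle permutation
     * and is large enough to sidestep hardware prefetching. */
    for (size_t i = 0; i < SLOTS; i++)
        buf[i] = &buf[(i + 4097) % SLOTS];

    struct timespec t0, t1;
    void **p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < HOPS; i++)
        p = *p;                       /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    fprintf(stderr, "sink %p\n", (void *)p);   /* keep the chase from being optimized out */
    numa_free(buf, SLOTS * sizeof(void *));
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / HOPS;
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least two nodes\n");
        return 1;
    }
    numa_run_on_node(0);                               /* execute on node 0's cores */
    printf("local  (node 0): %.1f ns/access\n", chase_ns(0));
    printf("remote (node 1): %.1f ns/access\n", chase_ns(1));
    return 0;
}
```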

NUMA-Aware Data Structures

- Replicated Data: Hot, read-mostly data is replicated per socket (each socket holds a copy). Costs one copy's worth of memory per socket, but eliminates remote reads; writes must update every copy.
- Data Partitioning: Divide large arrays by NUMA node. Thread i processes array[i×partition_size:(i+1)×partition_size], guaranteeing local access (see the per-node chunk sketch after this list).
- Hash Table Striping: Hash table buckets assigned to NUMA nodes. Hash function distributes keys across nodes balancing load and access locality.
- Graph Partitioning: Graph algorithms (matrix computations, machine learning) partition vertices/edges by NUMA locality. Minimize cross-node edges.
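
One way to realize the partitioning idea is to give each node its own explicitly placed chunk of the logical array and record the slices in a small descriptor; a thread bound to node i then walks only chunk i. The layout and names below are illustrative assumptions, not a standard API (link with -lnuma):

```c
/* partitioned.c - one logical array stored as per-node chunks (link with -lnuma) */
#include <stdio.h>
#include <numa.h>

#define MAX_NODES 64

struct numa_partitioned {
    int     nodes;              /* number of NUMA nodes used               */
    size_t  per_node;           /* elements stored on each node            */
    double *chunk[MAX_NODES];   /* chunk[i] is physically placed on node i */
};

/* Split `total` elements evenly across all configured nodes. */
static int np_alloc(struct numa_partitioned *p, size_t total) {
    p->nodes    = numa_num_configured_nodes();
    p->per_node = (total + p->nodes - 1) / p->nodes;
    for (int n = 0; n < p->nodes; n++) {
        p->chunk[n] = numa_alloc_onnode(p->per_node * sizeof(double), n);
        if (!p->chunk[n])
            return -1;
    }
    return 0;
}

/* A thread pinned to node `n` (e.g. via numa_run_on_node(n)) processes only
 * chunk[n], so every element it touches is in local DRAM. */
static void np_scale(const struct numa_partitioned *p, int n, double factor) {
    for (size_t i = 0; i < p->per_node; i++)
        p->chunk[n][i] *= factor;
}

int main(void) {
    if (numa_available() < 0) return 1;
    struct numa_partitioned p;
    if (np_alloc(&p, 1UL << 26) != 0) return 1;    /* 64M doubles split across nodes */
    numa_run_on_node(0);
    np_scale(&p, 0, 2.0);                          /* the node-0 thread touches only node-0 data */
    printf("nodes=%d, elements per node=%zu\n", p.nodes, p.per_node);
    return 0;
}
```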

Memory Interleaving vs Binding

- Interleaved Mode: OS spreads pages round-robin across NUMA nodes. Balances memory usage and bandwidth, but on an N-node system roughly (N-1)/N of accesses land on remote nodes, so average latency suffers.
- Bound Mode: Pages allocated on a specific node. Requires explicit NUMA awareness (in the application or via numactl). Excellent latency, but the work distribution must match the binding.
- Hybrid Approaches: Bind hot/critical data to the local node, interleave cold data. Best of both worlds (sketched below).
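
A minimal sketch of the hybrid approach with libnuma; the split between hot and cold data and the sizes are illustrative assumptions (link with -lnuma):

```c
/* hybrid_placement.c - bind hot data locally, interleave cold data (link with -lnuma) */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) return 1;

    /* Hot, latency-critical working set: keep it on the caller's local node */
    size_t hot_sz  = 256UL << 20;                     /* 256 MiB */
    void  *hot     = numa_alloc_local(hot_sz);

    /* Cold, capacity-oriented data: spread pages round-robin across all nodes */
    size_t cold_sz = 1UL << 30;                       /* 1 GiB   */
    void  *cold    = numa_alloc_interleaved(cold_sz);

    if (!hot || !cold) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    printf("hot data local, cold data interleaved across %d nodes\n",
           numa_num_configured_nodes());

    numa_free(hot, hot_sz);
    numa_free(cold, cold_sz);
    return 0;
}
```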

NUMA Scheduling and OS Coordination

- OS NUMA Scheduler: Linux kernel scheduler (CFS) considers NUMA locality. Migrates threads toward memory (if cheaper than migrating memory).
- Task Scheduler Trade-offs: Migrate thread (cache cold) vs keep thread (remote memory). Decision based on current load, task runtime, memory intensity.
- AutoNUMA: Linux feature periodically migrates pages toward threads that access them (and vice versa). Reduces manual tuning but adds overhead.
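
Whether automatic balancing is active can be checked through the kernel.numa_balancing sysctl. The short sketch below simply reads it from procfs (0 means disabled, nonzero means the kernel periodically migrates pages and tasks as described above):

```c
/* check_autonuma.c - report whether automatic NUMA balancing (AutoNUMA) is enabled */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
    if (!f) {
        perror("kernel.numa_balancing not available");
        return 1;
    }
    int mode = 0;
    if (fscanf(f, "%d", &mode) == 1)
        printf("AutoNUMA balancing: %s (mode %d)\n",
               mode ? "enabled" : "disabled", mode);
    fclose(f);
    return 0;
}
```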

NUMA in Multi-Socket HPC Servers

- Dual/Quad Socket Systems: 2-4 sockets per server, 64-256 cores total. Typical HPC configuration in data centers.
- Binding Strategy: MPI ranks bound one per NUMA node, so each rank's memory stays local. Data exchange between ranks happens through explicit MPI messages (shared memory within a server, InfiniBand between servers) rather than implicit remote NUMA accesses (the sketch after this list shows how a rank can verify its placement).
- Memory Scaling: Dual-socket Xeon: 256 GB-1 TB memory (128-512 GB per socket). Jobs that fit in one server stay NUMA-local; larger jobs spill to other servers over the network, which is slower still.
- Benchmark Sensitivity: The STREAM bandwidth benchmark runs several times slower when its arrays are remote rather than local; GEMM (compute-bound) is largely unaffected by NUMA placement.
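
As a hedged verification sketch: once the launcher has applied its binding (e.g. one rank per NUMA node via the MPI runtime's binding options), each process can report which CPU and NUMA node it actually landed on; no MPI calls are needed for the check itself (link with -lnuma):

```c
/* where_am_i.c - report the CPU and NUMA node this process is running on (link with -lnuma) */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) return 1;
    int cpu  = sched_getcpu();           /* CPU the calling thread is currently on */
    int node = numa_node_of_cpu(cpu);    /* NUMA node that CPU belongs to          */
    int pref = numa_preferred();         /* node new allocations will prefer       */
    printf("cpu=%d numa_node=%d preferred_alloc_node=%d\n", cpu, node, pref);
    return 0;
}
```

Running one instance per rank and comparing the reported nodes against the intended binding quickly exposes misconfigured affinity.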
