Heterogeneous Memory and CXL is the emerging memory architecture that connects different types of memory (DRAM, HBM, persistent memory, storage-class memory) through standardized interconnects into a unified, tiered memory hierarchy accessible to CPUs, GPUs, and accelerators. By letting memory capacity and bandwidth scale independently of the processor, it addresses the fundamental constraint that traditional memory channels limit both. CXL (Compute Express Link) is the industry-standard protocol enabling this interconnect fabric.
The Memory Capacity Problem
- Modern CPU DRAM: 8–12 channels × 64 GB/channel = 512–768 GB per socket maximum.
- AI training: GPT-4-class model requires 1–2 TB for weights + KV cache → exceeds single-socket DRAM.
- Database servers: In-memory databases with multi-TB datasets → need more capacity than DRAM channels allow.
- Solution: Add memory capacity beyond DRAM channels via CXL-attached memory expanders.
CXL (Compute Express Link)
- Open standard (CXL Consortium: Intel, AMD, ARM, NVIDIA, Samsung, Micron, SK Hynix, etc.).
- Physical layer: PCIe 5.0 or 6.0 → uses existing PCIe infrastructure.
- Protocol layer: Three sub-protocols:
- CXL.io: PCIe-compatible I/O (device config, interrupts).
  - CXL.cache: Accelerator caches host memory → bidirectional cache coherence.
  - CXL.mem: Host accesses device memory → accelerator exposes memory to host.
CXL Device Types
| Type | CXL Protocols | Use Case |
|------|--------------|----------|
| Type 1 | CXL.io + CXL.cache | SmartNIC, FPGA (cache host memory) |
| Type 2 | CXL.io + CXL.cache + CXL.mem | GPU, accelerator (bidirectional) |
| Type 3 | CXL.io + CXL.mem | Memory expander (add DRAM capacity) |
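On Linux, kernels with the CXL driver stack expose enumerated devices under /sys/bus/cxl/devices (memN for Type 3 expanders, plus ports and decoders). A minimal C sketch that lists them, assuming such a kernel:

```c
/* Minimal sketch: list CXL devices enumerated by the Linux CXL driver.
 * Assumes a kernel with the CXL bus driver, which populates
 * /sys/bus/cxl/devices with entries such as mem0 (Type 3 expander). */
#include <stdio.h>
#include <dirent.h>

int main(void) {
    DIR *dir = opendir("/sys/bus/cxl/devices");
    if (!dir) {
        perror("opendir /sys/bus/cxl/devices"); /* no CXL driver or devices */
        return 1;
    }
    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                  /* skip "." and ".." */
        printf("%s\n", e->d_name);     /* e.g. "mem0" for a Type 3 expander */
    }
    closedir(dir);
    return 0;
}
```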
CXL Memory Expander
- DIMM-like device that connects via a PCIe slot → adds 256 GB–2 TB of DRAM to a server.
- Host CPU accesses CXL memory transparently → appears as a NUMA node.
- Latency: ~150–300 ns (vs. 75–90 ns for local DRAM) → acceptable for capacity-sensitive, latency-tolerant workloads.
- Bandwidth: ~50–60 GB/s per CXL link (PCIe 5.0 ×16) → far less than aggregate local DDR5 (~51 GB/s per channel × 8–12 channels ≈ 400–600 GB/s).
- Use case: Tiered memory → hot data in local DRAM, warm data in CXL DRAM.
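Because an expander appears as a CPU-less NUMA node, software can discover it with libnuma. A sketch under that assumption (node numbering and distances are machine-specific; link with -lnuma):

```c
/* Sketch: find memory-only NUMA nodes, which is how CXL Type 3 expanders
 * typically appear to Linux. Node IDs and distances vary by system. */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int max_node = numa_max_node();
    struct bitmask *cpus = numa_allocate_cpumask();
    for (int node = 0; node <= max_node; node++) {
        if (numa_node_to_cpus(node, cpus) != 0)
            continue;
        /* A node with memory but no CPUs is likely a CXL expander. */
        int has_cpus = (numa_bitmask_weight(cpus) > 0);
        printf("node %d: %s, distance from node 0 = %d\n",
               node, has_cpus ? "has CPUs" : "memory-only (CXL?)",
               numa_distance(0, node));
    }
    numa_free_cpumask(cpus);
    return 0;
}
```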
Memory Tiering
```
Processor ─┬─ L3 Cache (on-chip)
           ├─ Local DRAM (DDR5):   512 GB,  75 ns, 400 GB/s
           ├─ CXL DRAM (Type 3):     2 TB, 200 ns,  50 GB/s
           └─ NVMe SSD (via PCIe):  64 TB, 100 µs,   7 GB/s
```
- OS tiering: Linux NUMA balancing, tierd daemon → migrate hot pages to the fast tier, cold pages to the slow tier.
- Application-aware tiering: Programmer hints via madvise(), mbind() → place specific data in a specific tier.
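A minimal sketch of such application-directed tiering, assuming a hypothetical memory-only CXL node 2 (find real node IDs with numactl --hardware; link with -lnuma):

```c
/* Sketch: place a buffer on a chosen tier with mbind(), then hint that it
 * has gone cold with madvise(). Node ID below is hypothetical. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>

#define CXL_NODE 2   /* hypothetical memory-only CXL node */

int main(void) {
    size_t len = 64UL << 20;  /* 64 MiB */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bind the range to the CXL node before first touch. */
    unsigned long nodemask = 1UL << CXL_NODE;
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");             /* e.g. node does not exist */

    memset(buf, 0, len);             /* first touch places pages on the node */

    /* Later, hint that the data is cold so the kernel may reclaim it first
     * (MADV_COLD requires Linux >= 5.4). */
    if (madvise(buf, len, MADV_COLD) != 0)
        perror("madvise(MADV_COLD)");

    munmap(buf, len);
    return 0;
}
```

Placement can be verified afterwards with numastat -p <pid> or by reading /proc/<pid>/numa_maps.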
CXL Switch and Fabric
- CXL 2.0: CXL switches → multiple devices/memory pools behind a switch → hosts can draw capacity from pools they do not exclusively own.
- CXL 3.0: Fabric → direct device-to-device communication, shared memory across multiple hosts.
- Memory pooling: One large CXL memory pool shared across multiple servers → allocate on demand.
- Benefit: Server memory utilization improves (no stranded memory) → lower TCO.
HBM on CPU/APU
- AMD MI300X: 192 GB HBM3 integrated with the compute dies → highest-bandwidth memory for AI (5.3 TB/s).
- Intel Sapphire Rapids HBM (Xeon Max): Xeon + HBM on the same package → CPU can use HBM as a last-level cache or address it directly.
- Benefits: Lower latency than external DRAM (on-package), much higher bandwidth.
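When HBM is exposed in flat mode, applications can request it explicitly, for example through Intel's memkind library. A hedged sketch, assuming memkind is installed (link with -lmemkind):

```c
/* Sketch: allocate from high-bandwidth memory via the memkind library.
 * MEMKIND_HBW fails on machines without HBM exposed as flat memory, so
 * this falls back to ordinary DRAM. */
#include <stdio.h>
#include <memkind.h>

int main(void) {
    size_t len = 16UL << 20;  /* 16 MiB */
    memkind_t kind = MEMKIND_HBW;
    void *buf = memkind_malloc(kind, len);
    if (!buf) {
        /* No HBM available: fall back to the default (DRAM) kind. */
        kind = MEMKIND_DEFAULT;
        buf = memkind_malloc(kind, len);
    }
    if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }
    printf("allocated %zu bytes from %s\n", len,
           kind == MEMKIND_HBW ? "HBM" : "DRAM");
    memkind_free(kind, buf);
    return 0;
}
```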
NUMA Programming for Heterogeneous Memory
- Each memory tier is a NUMA node → access with numa_alloc_onnode(), mbind(), numactl.
- Profile memory access patterns → identify hot vs. cold data → manually bind hot data to HBM/local DRAM.
- Transparent HBM: OS automatically uses HBM as a cache → application-transparent performance boost.
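A minimal libnuma sketch of manual hot-data placement, assuming node 0 is the fast tier (local DRAM or flat-mode HBM; real node IDs vary by machine):

```c
/* Sketch: bind a hot data structure to a fast NUMA node with libnuma.
 * Node 0 is an assumption; check the topology with numactl --hardware.
 * Compile with: gcc hot.c -lnuma */
#include <stdio.h>
#include <string.h>
#include <numa.h>

#define FAST_NODE 0   /* assumed fast tier: local DRAM or flat-mode HBM */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    size_t len = 32UL << 20;  /* 32 MiB of "hot" data */
    char *hot = numa_alloc_onnode(len, FAST_NODE);
    if (!hot) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }
    memset(hot, 0, len);      /* touch pages so they land on the node */
    printf("hot buffer placed on node %d\n", FAST_NODE);
    numa_free(hot, len);
    return 0;
}
```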
Heterogeneous memory and CXL represent the next architectural revolution in computing infrastructure. By decoupling memory capacity from compute nodes and letting memory scale independently over a standardized CXL fabric, this technology enables AI servers to access terabytes of memory economically, database systems to hold entire datasets in DRAM tiers, and hyperscale clouds to dramatically improve memory utilization across fleets. It addresses the memory capacity wall that threatens to limit AI and data-intensive application growth at a time when model sizes and dataset scales are growing faster than any other dimension of computing.