CUDA Unified Memory is a memory architecture feature that creates a single coherent virtual address space accessible by both CPU and GPU, with the CUDA runtime automatically migrating pages between host and device memory on demand. This dramatically simplifies GPU programming by eliminating explicit cudaMemcpy calls while still achieving near-optimal performance when combined with prefetching.
Unified Memory Fundamentals:
- cudaMallocManaged: allocates memory accessible from both CPU and GPU code through the same pointer – the runtime handles physical page placement and migration transparently
- Page Faulting: when the GPU accesses a page resident in CPU memory (or vice versa), a page fault triggers automatic migration – the first access incurs fault-handling latency (roughly 10–50 µs per page), but subsequent accesses run at full bandwidth
- Page Size: managed memory uses 4KB pages on the CPU and 64KB pages on the GPU (since the Pascal architecture) – larger GPU pages amortize fault overhead but coarsen migration granularity
- Oversubscription: unified memory allows allocations exceeding GPU physical memory – pages are evicted to CPU memory under pressure, enabling workloads that wouldn't otherwise fit on the GPU
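The single-pointer model described above can be sketched as follows; the kernel, array size, and scale factor are illustrative choices, not from the text:

```cuda
// Minimal sketch: one managed allocation, one pointer, valid on both
// the host and the device. Assumes a Pascal-or-newer GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // first GPU touch faults pages over from the CPU
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation serves both processors -- no separate host/device copies.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU first touch: pages placed in host memory

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU faults pages in on demand
    cudaDeviceSynchronize();                         // required before the CPU reads again

    printf("data[0] = %f\n", data[0]);  // pages migrate back to the CPU on access
    cudaFree(data);
    return 0;
}
```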
Migration and Prefetching:
- On-Demand Migration: pages migrate to the accessing processor on first touch – this incurs an initial performance penalty but enables correct execution without programmer intervention
- Explicit Prefetching: cudaMemPrefetchAsync() migrates pages to a specified device before they're needed – eliminates page-fault latency and achieves bandwidth utilization comparable to explicit cudaMemcpy
- Access Hints: cudaMemAdvise() provides hints about memory access patterns – cudaMemAdviseSetPreferredLocation pins pages to a device, while cudaMemAdviseSetReadMostly creates read-only replicas on accessing devices
- Thrashing Prevention: when CPU and GPU repeatedly access the same pages, thrashing degrades performance – preferred-location hints and read-mostly flags eliminate unnecessary migrations
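A sketch of the prefetch-and-advise flow described above. The buffer names, sizes, and kernel are hypothetical; the cudaMemAdvise and cudaMemPrefetchAsync calls are real CUDA runtime APIs:

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(const float *x, float *y, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    int device = 0;
    cudaGetDevice(&device);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *x, *y;
    cudaMallocManaged(&x, bytes);
    cudaMallocManaged(&y, bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Hint: x is read-mostly, so the driver may replicate its pages on the
    // GPU instead of migrating them back and forth (thrashing prevention).
    cudaMemAdvise(x, bytes, cudaMemAdviseSetReadMostly, device);

    // Migrate both arrays to the GPU before the kernel needs them,
    // eliminating on-demand fault latency.
    cudaMemPrefetchAsync(x, bytes, device, stream);
    cudaMemPrefetchAsync(y, bytes, device, stream);

    saxpy<<<(n + 255) / 256, 256, 0, stream>>>(x, y, n, 3.0f);

    // Prefetch results back to the CPU (cudaCpuDeviceId) before the host reads.
    cudaMemPrefetchAsync(y, bytes, cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    cudaFree(x); cudaFree(y);
    cudaStreamDestroy(stream);
    return 0;
}
```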
Architecture Evolution:
- Kepler (CC 3.x): first-generation Unified Memory arrived with CUDA 6.0 on Kepler (Unified Virtual Addressing itself dates back to CUDA 4.0) – with no GPU page faulting, the runtime migrates managed data to the GPU in bulk at kernel launch, and allocations cannot exceed device memory
- Pascal (CC 6.0): true unified memory with hardware page faulting on the GPU – the first architecture supporting on-demand page migration and memory oversubscription
- Volta (CC 7.0): added access-counter-based migration – hardware counters track access frequency and automatically migrate hot pages to the accessing processor without explicit prefetch hints
- Hopper (CC 9.0): Confidential Computing support for unified memory, plus hardware-accelerated page migration with reduced fault latency (<5 µs)
Performance Optimization Patterns:
- Initialization on GPU: allocate with cudaMallocManaged, then initialize the data in a GPU kernel (first touch places the pages in GPU memory) – avoids CPU-to-GPU migration entirely
- Prefetch Before Kernel Launch: call cudaMemPrefetchAsync for all input data, launch the kernel, then prefetch the output back to the CPU – overlaps migration with computation when issued on streams
- Structure of Arrays: an SoA layout enables efficient prefetching of individual arrays – an Array of Structures forces entire structure pages to migrate even when a kernel accesses only one field
- Multi-GPU Access: unified memory works across multiple GPUs with peer-to-peer access – pages migrate to the GPU that accesses them most frequently, enabling dynamic load balancing
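The initialize-on-GPU pattern from the first bullet can be sketched like this; the kernel names and sizes are illustrative:

```cuda
// Sketch of the "initialize on GPU" pattern: first touch happens inside a
// kernel, so pages are created directly in device memory and no
// host-to-device migration ever occurs.
#include <cuda_runtime.h>

__global__ void initKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = (float)i;      // GPU first touch: pages placed on the device
}

__global__ void compute(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * a[i];   // pages already resident, no faults
}

int main() {
    const int n = 1 << 20;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));

    initKernel<<<(n + 255) / 256, 256>>>(a, n);
    compute<<<(n + 255) / 256, 256>>>(a, n);

    // Bring results back ahead of the CPU read instead of faulting page by page.
    cudaMemPrefetchAsync(a, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    float first = a[0];   // host access after prefetch: full-bandwidth read
    (void)first;
    cudaFree(a);
    return 0;
}
```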
Comparison with Explicit Memory Management:
- Development Productivity: unified memory reduces typical CUDA memory-management code by roughly 60–70% – it eliminates cudaMalloc/cudaMemcpy/cudaFree boilerplate and simplifies pointer-based data structures
- Performance Without Hints: naive unified memory typically reaches 70–85% of explicit-management performance due to page-fault overhead – acceptable for prototyping and development
- Performance With Prefetching: properly prefetched unified memory matches explicit cudaMemcpy performance to within 1–3% – achieving full PCIe or NVLink bandwidth utilization
- Complex Data Structures: linked lists, trees, and graphs work naturally with unified memory – explicit management requires deep-copy serialization or structure flattening
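The complex-data-structures point can be illustrated with a pointer-linked list that is built on the CPU and traversed on the GPU with no deep copy; the Node layout and list length are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Node {
    int   value;
    Node *next;
};

__global__ void sumList(const Node *head, int *out) {
    // Single-threaded traversal, just to show the host-built pointers
    // remain valid on the device.
    int s = 0;
    for (const Node *p = head; p != nullptr; p = p->next) s += p->value;
    *out = s;
}

int main() {
    // Build the list on the CPU from managed nodes. With explicit memory
    // management this would require flattening or per-node deep copies.
    Node *head = nullptr;
    for (int i = 1; i <= 10; ++i) {
        Node *n;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = i;
        n->next = head;
        head = n;
    }

    int *result;
    cudaMallocManaged(&result, sizeof(int));
    sumList<<<1, 1>>>(head, result);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *result);   // 1 + 2 + ... + 10

    for (Node *p = head; p != nullptr; ) {
        Node *next = p->next;
        cudaFree(p);
        p = next;
    }
    cudaFree(result);
    return 0;
}
```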
Unified memory doesn't replace the need to understand GPU memory architecture: achieving peak performance still requires awareness of access patterns, prefetching, and page placement. It does, however, provide a dramatically simpler programming model that scales from rapid prototyping to production-quality GPU applications.