NVLink Interconnect is NVIDIA's proprietary high-bandwidth, low-latency GPU-to-GPU interconnect. By providing roughly 7× the bandwidth of PCIe 5.0 x16, with direct GPU memory access at 900 GB/s bidirectional (NVLink 4.0) and sub-microsecond latency, it makes tightly-coupled multi-GPU systems practical for model parallelism, large-batch training, and unified memory architectures that treat multiple GPUs as a single coherent memory space.
NVLink Architecture:
- Physical Layer: high-speed serial links running at 50 Gb/s per lane with NRZ signaling (NVLink 3.0) or 100 Gb/s per lane with PAM4 (4-level pulse amplitude modulation, NVLink 4.0); each NVLink comprises multiple lanes bundled into a bidirectional connection
- Link Configuration: H100 GPUs have 18 NVLink connections, each providing 50 GB/s bidirectional (25 GB/s each direction); total 900 GB/s bidirectional per GPU; A100 has 12 NVLinks at 600 GB/s total; compare to PCIe 5.0 x16 at 128 GB/s bidirectional
- Protocol: cache-coherent protocol supporting load/store semantics; GPUs can directly read/write remote GPU memory using standard CUDA memory operations; hardware handles address translation, routing, and coherency
- Topology Flexibility: NVLinks can connect GPUs in various topologies (ring, mesh, hypercube, fully-connected via NVSwitch); topology determines effective bandwidth between non-adjacent GPUs
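The topology a particular machine exposes can be probed directly from the CUDA runtime. Below is a minimal sketch (standard CUDA runtime API only; device count and ordering depend on the system) that reports, for each GPU pair, whether peer access is available and CUDA's relative performance rank for the link, which distinguishes NVLink paths from PCIe fallbacks:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal topology probe: for each GPU pair, report whether peer access is
// available and CUDA's relative performance rank for the link (lower rank =
// faster path, which distinguishes NVLink routes from PCIe fallbacks).
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            int rank = -1;
            if (canAccess)
                cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank,
                                          src, dst);
            printf("GPU %d -> GPU %d : peer=%d perfRank=%d\n",
                   src, dst, canAccess, rank);
        }
    }
    return 0;
}
```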
NVSwitch Fabric:
- Switch Architecture: NVSwitch is a dedicated switch chip providing full non-blocking connectivity among GPUs; each NVSwitch has 64 NVLink ports (NVSwitch 3.0 in H100 systems); multiple NVSwitches create a two-tier fabric for larger GPU counts
- DGX H100 Configuration: 8 H100 GPUs connected via 4 NVSwitches; every GPU has direct NVLink path to every other GPU; 900 GB/s bidirectional bandwidth between any GPU pair; total fabric bandwidth 7.2 TB/s
- Scalability: DGX SuperPOD connects 32 DGX H100 nodes (256 GPUs) using InfiniBand for inter-node and NVLink for intra-node; hybrid topology optimizes for locality (NVLink for nearby GPUs, IB for distant GPUs)
- Comparison to Direct Connection: without NVSwitch, 8 GPUs in a ring or mesh topology have non-uniform bandwidth (full link bandwidth only between directly connected GPUs, roughly 225-450 GB/s between distant pairs routed through intermediate hops); NVSwitch provides a uniform 900 GB/s between all pairs
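The fabric figure follows from simple arithmetic over the published per-link numbers; the sketch below just makes the derivation explicit (the constants are the NVLink 4.0 / DGX H100 values quoted above, not measurements):

```cuda
#include <cstdio>

// Back-of-envelope check of the DGX H100 fabric numbers quoted above.
// Derived from published per-link figures, not measured.
int main() {
    const double gbPerLink = 50.0;  // GB/s bidirectional per NVLink 4.0 link
    const int linksPerGpu  = 18;    // H100
    const int gpus         = 8;     // DGX H100

    double perGpu = gbPerLink * linksPerGpu;  // 900 GB/s per GPU
    double fabric = perGpu * gpus / 1000.0;   // 7.2 TB/s aggregate fabric
    printf("per-GPU: %.0f GB/s, fabric: %.1f TB/s\n", perGpu, fabric);
    return 0;
}
```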
Performance Characteristics:
- Bandwidth: NVLink 4.0 delivers 900 GB/s bidirectional per GPU, roughly 7× PCIe 5.0 x16 (128 GB/s bidirectional); enables model parallelism where layer outputs (multi-GB activations) transfer between GPUs every forward/backward pass
- Latency: GPU-to-GPU load/store latency <1μs over NVLink vs 3-5μs over PCIe; low latency critical for fine-grained parallelism (tensor parallelism with frequent small transfers)
- CPU Overhead: NVLink transfers are executed by GPU copy engines and kernel-issued load/stores without staging through host memory; cudaMemcpy() between peer GPUs uses NVLink automatically; the CPU does little more than enqueue the operation
- Coherency: NVLink supports cache-coherent memory access; GPU can cache remote GPU memory in its L2; reduces latency for repeated accesses to same remote data; coherency protocol ensures consistency across GPU caches
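To check how close a given pair of GPUs gets to these numbers, achieved copy bandwidth can be timed with CUDA events. A minimal sketch, assuming two peer-capable GPUs (the device IDs and 1 GiB transfer size are arbitrary illustrative choices):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: time a GPU 0 -> GPU 1 copy to estimate achieved link
// bandwidth. With peer access enabled on an NVLink pair the copy travels
// over NVLink; otherwise it falls back to PCIe. Warm-up runs are omitted
// for brevity, so the first measurement includes some setup cost.
int main() {
    const size_t bytes = 1ull << 30;  // 1 GiB, an arbitrary test size
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 address GPU 1 directly
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);  // dst on GPU 1, src on GPU 0
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("achieved: %.1f GB/s (one direction)\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```

On an NVLink-connected pair this should report a few hundred GB/s per direction; a PCIe fallback lands far lower.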
Programming Model:
- Peer Access: cudaDeviceEnablePeerAccess() enables direct addressing; GPU 0 can use device pointers from GPU 1 directly in kernels; cudaMemcpy() automatically uses NVLink for peer transfers
- Unified Memory: with NVLink, Unified Memory (cudaMallocManaged) provides single address space across GPUs; page migration and coherency handled by hardware/driver; simplifies multi-GPU programming but may have performance overhead from page faults
- NCCL Optimization: NCCL detects NVLink topology and uses optimized algorithms; ring all-reduce over NVLink achieves 95%+ of theoretical bandwidth; tree algorithms for NVSwitch topologies exploit full bisection bandwidth
- Explicit Topology Control: NCCL_TOPO_FILE environment variable specifies custom topology; enables manual optimization for non-standard configurations; useful for debugging performance issues or testing different communication patterns
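A minimal sketch of the load/store model from the first two items: after cudaDeviceEnablePeerAccess(), a kernel launched on GPU 0 can dereference a buffer that physically resides on GPU 1 (the kernel, sizes, and device IDs here are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

// Minimal sketch of NVLink load/store semantics: after peer access is
// enabled, a kernel running on GPU 0 can dereference a pointer that was
// allocated on GPU 1. Kernel and sizes are illustrative only.
__global__ void scaleRemote(float* remote, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] *= 2.0f;  // load/store into GPU 1's memory
}

int main() {
    const int n = 1 << 20;
    float* buf1 = nullptr;

    cudaSetDevice(1);
    cudaMalloc(&buf1, n * sizeof(float));  // buffer lives on GPU 1
    cudaMemset(buf1, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);                // GPU 0 may address GPU 1
    scaleRemote<<<(n + 255) / 256, 256>>>(buf1, n);  // runs on GPU 0
    cudaDeviceSynchronize();

    // Alternatively, cudaMallocManaged gives one pointer valid on every
    // GPU, with the driver migrating pages on demand.
    return 0;
}
```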
Use Cases and Benefits:
- Model Parallelism: split large models (GPT-3, Megatron) across GPUs; layer outputs (activation tensors) transfer over NVLink every forward/backward pass; 900 GB/s enables model parallelism with <10% communication overhead
- Pipeline Parallelism: different layers on different GPUs; micro-batches flow through pipeline; NVLink bandwidth enables fine-grained pipelines (small micro-batches) with high throughput
- Data Parallelism: gradient all-reduce over NVLink; an 8-GPU all-reduce completes in a few milliseconds for billion-parameter models; enables large batch sizes (global batch = 8× per-GPU batch) without a communication bottleneck (see the NCCL sketch after this list)
- Large Batch Training: NVLink enables efficient batch splitting across GPUs; each GPU processes subset of batch, exchanges activations/gradients; 900 GB/s supports batch sizes of 10,000+ images for vision models
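The data-parallel gradient step above maps to a single NCCL call per GPU. A minimal single-process sketch, assuming two GPUs and an illustrative gradient buffer size; NCCL detects the NVLink topology and chooses its algorithm automatically:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

// Minimal single-process sketch of a data-parallel gradient all-reduce.
// The two-GPU count and buffer size are illustrative assumptions.
int main() {
    const int nDev = 2;
    int devs[nDev] = {0, 1};
    const size_t count = 1 << 24;  // 16M floats, ~64 MB of "gradients"

    ncclComm_t comms[nDev];
    float* grads[nDev];
    cudaStream_t streams[nDev];

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&grads[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, nDev, devs);  // one communicator per GPU

    // Sum gradients in place across all GPUs (the data-parallel step).
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce complete\n");
    return 0;
}
```

In a real training loop the same call would run once per iteration after the backward pass, typically on a dedicated stream so it overlaps with compute.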
Limitations and Considerations:
- Proprietary Technology: NVLink only connects NVIDIA GPUs; vendor lock-in limits flexibility; AMD Infinity Fabric and Intel Xe Link are competing technologies but less mature
- Distance Limitations: NVLink cables limited to ~2m; restricts GPU placement to single chassis or adjacent racks; inter-rack communication requires InfiniBand or Ethernet
- Cost: NVSwitch adds significant cost ($10K+ per switch); DGX systems with NVSwitch 2-3× more expensive than PCIe-only systems; cost justified only for workloads bottlenecked by GPU-to-GPU communication
- Topology Complexity: optimal NVLink topology depends on workload communication pattern; ring topology optimal for all-reduce, mesh for all-to-all, fully-connected (NVSwitch) for arbitrary patterns; misconfigured topology can leave bandwidth underutilized
NVLink is the interconnect that makes multi-GPU systems behave like single massive GPUs: by providing many times the bandwidth of PCIe, NVLink enables model parallelism, large-batch training, and unified memory architectures that would be impractical with conventional interconnects, defining the architecture of modern AI supercomputers.