NVLink is NVIDIA's high-bandwidth interconnect for GPU-to-GPU and GPU-to-CPU communication. It provides 600-900 GB/s of bidirectional bandwidth per GPU, compared with roughly 128 GB/s bidirectional (64 GB/s per direction) for a PCIe Gen5 ×16 link, enabling efficient multi-GPU scaling for large model training and inference.
What Is NVLink?
- Definition: Proprietary high-speed GPU interconnect.
- Purpose: Fast multi-GPU communication.
- Bandwidth: Roughly 5-7× a PCIe Gen5 ×16 link (600-900 GB/s vs. about 128 GB/s, both bidirectional).
- Use Cases: Multi-GPU training, large model sharding.
Why NVLink Matters
- Model Parallelism: Large models span multiple GPUs.
- Gradient Sync: Data-parallel training exchanges gradients between GPUs on every step.
- Memory Pooling: GPUs can read and write each other's memory directly.
- Inference: Large models need GPU sharding.
- Scaling Efficiency: Minimizes communication bottleneck.
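To make the gradient-sync point concrete, the sketch below estimates all-reduce time with the standard ring all-reduce cost model, in which each GPU sends and receives 2(N-1)/N times the payload. The function name and the example figures (8 GPUs, a 10 GB gradient buffer, per-direction link speeds of 64 GB/s for PCIe Gen5 ×16 and 450 GB/s for NVLink 4) are illustrative assumptions. Latency and protocol overhead are ignored, so real times will be higher.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Idealized ring all-reduce time (bandwidth term only, no latency)."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize with a single GPU
    # Each GPU transfers 2 * (N - 1) / N of the payload over its link.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / link_gb_per_s


if __name__ == "__main__":
    # Assumed per-direction link speeds in GB/s (illustrative peak values).
    for name, bw in [("PCIe Gen5 x16", 64.0), ("NVLink 4", 450.0)]:
        t = ring_allreduce_seconds(payload_gb=10.0, num_gpus=8, link_gb_per_s=bw)
        print(f"{name}: {t * 1000:.0f} ms")
```

Because the per-GPU traffic approaches twice the payload as the GPU count grows, link bandwidth rather than GPU count sets the floor on synchronization time in this model.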
Bandwidth Comparison
Interconnect Speeds:
Interconnect | Bandwidth (bidirectional) | Generation
------------------|---------------------------|------------
NVLink 4 (Hopper) | 900 GB/s per GPU | H100
NVLink 3 (Ampere) | 600 GB/s per GPU | A100
NVLink 2 (Volta) | 300 GB/s per GPU | V100
PCIe Gen5 | ~128 GB/s (×16) | Current
PCIe Gen4 | ~64 GB/s (×16) | Previous
InfiniBand NDR | 400 Gb/s per port (~50 GB/s per direction) | Network
Practical Impact:
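Operation | PCIe Gen5 | NVLink 4
-----------------------|--------------|----------
Copy 80 GB (H100 mem) | 1.25 sec | 0.18 sec
Gradient sync (10 GB) | 156 ms | 22 ms
AllReduce efficiency | 70-80% | 95%+
Transfer times are idealized: payload divided by peak per-direction bandwidth (64 GB/s for PCIe Gen5 ×16, 450 GB/s for NVLink 4).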
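One-way transfer time is the payload divided by per-direction bandwidth. A minimal sketch of that arithmetic follows, assuming peak per-direction rates of 64 GB/s for PCIe Gen5 ×16 and 450 GB/s for NVLink 4 (a one-way copy can use only one direction of a link, so it sees half of the 900 GB/s bidirectional figure). The helper name is hypothetical, and real transfers will be slower than these peak-rate estimates.

```python
# Assumed peak per-direction rates in GB/s (illustrative, not measured).
PEAK_GB_PER_S = {
    "PCIe Gen5 x16": 64.0,
    "NVLink 4": 450.0,  # half of the 900 GB/s bidirectional total
}


def transfer_seconds(payload_gb: float, gb_per_s: float) -> float:
    """Lower bound on one-way transfer time at peak bandwidth."""
    return payload_gb / gb_per_s


if __name__ == "__main__":
    for payload_gb in (80.0, 10.0):
        for name, rate in PEAK_GB_PER_S.items():
            seconds = transfer_seconds(payload_gb, rate)
            print(f"{payload_gb:.0f} GB over {name}: {seconds:.3f} s")
```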
NVLink Topologies
DGX H100 Topology:
8× H100 GPUs with NVSwitch
┌──────────────────────────────────────────────┐
│               NVSwitch Fabric                │
│          (Full bisection bandwidth)          │
└──────────────────────────────────────────────┘
  │     │     │     │     │     │     │     │
┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│H100││H100││H100││H100││H100││H100││H100││H100│
│ #1 ││ #2 ││ #3 ││ #4 ││ #5 ││ #6 ││ #7 ││ #8 │
└────┘└────┘└────┘└────┘└────┘└────┘└────┘└────┘
Any GPU can talk to any GPU at full bandwidth
Consumer NVLink (GeForce RTX):
RTX 3090: NVLink bridge, 2 GPUs
RTX 4090: No NVLink support
NVSwitch
What It Enables:
Without NVSwitch:
- Direct links only between neighbor GPUs
- Limited topology
With NVSwitch:
- All-to-all connectivity
- Full bisection bandwidth
- Any GPU reaches any other GPU through the switch fabric at full NVLink bandwidth
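The difference can be illustrated by counting how many links a message crosses. The sketch below assumes a hypothetical ring of directly linked GPUs for the "without NVSwitch" case (real non-switched systems use other layouts, such as hybrid cube meshes) and treats the NVSwitch fabric as a single switch traversal between any pair. Both function names are illustrative.

```python
def ring_hops(src: int, dst: int, num_gpus: int) -> int:
    """Links crossed between two GPUs on a bidirectional ring (shortest way round)."""
    distance = abs(src - dst) % num_gpus
    return min(distance, num_gpus - distance)


def switched_hops(src: int, dst: int) -> int:
    """Switch traversals between two GPUs behind a non-blocking switch fabric."""
    return 0 if src == dst else 1


if __name__ == "__main__":
    n = 8
    worst_ring = max(ring_hops(0, d, n) for d in range(n))
    worst_switched = max(switched_hops(0, d) for d in range(n))
    print(f"worst case on ring: {worst_ring} hops; via switch: {worst_switched} hop")
```

On a ring, traffic between distant GPUs also competes with traffic on the intermediate links, which is what full bisection bandwidth avoids.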
DGX Generations:
System | GPUs | Topology | GPU-GPU BW
-------------|------|---------------------|------------
DGX A100 | 8 | NVSwitch (full) | 600 GB/s
DGX H100 | 8 | NVSwitch (full) | 900 GB/s
DGX GH200 | 256 | Grace Hopper + NVLink Switch System | 900 GB/s
Programming with NVLink
NCCL (NVIDIA Collective Communications Library):
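import os
import torch
import torch.distributed as dist
# Launch with torchrun so that RANK, WORLD_SIZE, and LOCAL_RANK are set
dist.init_process_group(backend="nccl")
# Bind each process to its own GPU before allocating tensors
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
# NCCL selects the fastest available path (NVLink where present, otherwise PCIe)
tensor = torch.randn(1000, device="cuda")
dist.all_reduce(tensor)  # sums the tensor across all ranks, in place
dist.destroy_process_group()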
Peer-to-Peer Memory Access:
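// Check that srcDevice can address dstDevice's memory before enabling P2P
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, srcDevice, dstDevice);
cudaSetDevice(srcDevice);  // peer access is enabled from the current device
if (canAccess) cudaDeviceEnablePeerAccess(dstDevice, 0);
// Copy between devices; the transfer goes over NVLink when the GPUs are linked
cudaMemcpyPeer(dst, dstDevice, src, srcDevice, size);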
Checking NVLink:
# Check NVLink status
nvidia-smi nvlink -s
# Show topology
nvidia-smi topo -m
# NVLink utilization counters (counter set 0)
nvidia-smi nvlink -g 0
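The topology check can also be run from a script. The following is a minimal sketch that wraps `nvidia-smi topo -m` with Python's standard library and returns None on machines without the NVIDIA tools. The function name `gpu_topology` is hypothetical, and the 30-second timeout is an arbitrary assumption.

```python
import shutil
import subprocess
from typing import Optional


def gpu_topology() -> Optional[str]:
    """Return the output of `nvidia-smi topo -m`, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "topo", "-m"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # command failed or timed out
    return result.stdout


if __name__ == "__main__":
    matrix = gpu_topology()
    print(matrix if matrix is not None else "nvidia-smi not available")
```

In the matrix that the command prints, entries of the form NV followed by a number mark GPU pairs connected by that many NVLinks, while entries such as PIX, PHB, and SYS mark PCIe paths.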
NVLink vs. PCIe Use Cases
Use Case | Best Interconnect
----------------------|------------------
Single GPU inference | PCIe (sufficient)
Multi-GPU training | NVLink (essential)
Large model inference | NVLink (model sharding)
Consumer workstation | PCIe (NVLink limited)
Data center | NVLink + InfiniBand
NVLink is essential infrastructure for multi-GPU AI. Without a high-bandwidth interconnect, communication overhead dominates and scaling across GPUs becomes inefficient, which makes NVLink critical for training large models and serving them across GPU clusters.