
NVLink is NVIDIA's high-bandwidth interconnect for GPU-to-GPU and GPU-to-CPU communication. It provides 600-900 GB/s of bidirectional bandwidth per GPU (A100 and H100 generations), compared with roughly 64 GB/s per direction for a PCIe Gen5 ×16 link, which enables efficient multi-GPU scaling for large model training and inference.

What Is NVLink?

Why NVLink Matters

Bandwidth Comparison

Interconnect Speeds:

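Interconnect      | Per direction | Bidirectional | GPU / Notes
------------------|---------------|---------------|---------------------------
NVLink 4 (Hopper) | 450 GB/s      | 900 GB/s      | H100
NVLink 3 (Ampere) | 300 GB/s      | 600 GB/s      | A100
NVLink 2 (Volta)  | 150 GB/s      | 300 GB/s      | V100
PCIe Gen5 (×16)   | ~64 GB/s      | ~128 GB/s     | Current
PCIe Gen4 (×16)   | ~32 GB/s      | ~64 GB/s      | Previous
InfiniBand NDR    | 50 GB/s       | 100 GB/s      | Network, 400 Gbps per port

NVLink figures are per-GPU totals across all of that GPU's links.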
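
The rows above mix units (GB/s and Gbps) and link directions, so it can help to normalize them before comparing. The sketch below converts a few rows to GB/s per direction. The helper names are illustrative, and it assumes the NVLink figures are bidirectional totals while InfiniBand NDR's 400 Gbps is a per-direction rate.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def per_direction(bidirectional_gb_s: float) -> float:
    """Split a bidirectional total evenly across the two directions."""
    return bidirectional_gb_s / 2.0


# Per-direction bandwidth in GB/s (assumptions stated in the text above).
links = {
    "NVLink 4 (H100)": per_direction(900),
    "NVLink 3 (A100)": per_direction(600),
    "NVLink 2 (V100)": per_direction(300),
    "InfiniBand NDR": gbps_to_gigabytes_per_s(400),
}

for name, gb_s in links.items():
    print(f"{name:16s} {gb_s:6.1f} GB/s per direction")
```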

Practical Impact:

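Operation              | PCIe Gen5 ×16 | NVLink 4
-----------------------|---------------|-----------
Copy 80 GB (H100 mem)  | 1.25 sec      | ~0.18 sec
Gradient sync (10 GB)  | 156 ms        | ~22 ms
AllReduce efficiency   | 70-80%        | 95%+

Transfer times are ideal one-way figures: payload size divided by per-direction bandwidth (64 GB/s for PCIe Gen5 ×16, 450 GB/s for NVLink 4), ignoring latency and protocol overhead.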
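
Transfer-time estimates like these come from dividing payload size by link bandwidth. Here is a minimal sketch of that arithmetic, assuming peak theoretical bandwidth with no latency or protocol overhead. `transfer_time_s` is a hypothetical helper, and the two bandwidth constants are per-direction assumptions (a one-way copy cannot use both directions of a link).

```python
def transfer_time_s(size_gb: float, bandwidth_gb_s: float) -> float:
    """Ideal time to move size_gb gigabytes over a link of bandwidth_gb_s GB/s."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_s


PCIE_GEN5_X16 = 64.0     # GB/s per direction (assumed peak)
NVLINK4_ONE_WAY = 450.0  # GB/s per direction (half of the 900 GB/s total)

for label, size_gb in [("80 GB memory copy", 80.0), ("10 GB gradient sync", 10.0)]:
    pcie = transfer_time_s(size_gb, PCIE_GEN5_X16)
    nvlink = transfer_time_s(size_gb, NVLINK4_ONE_WAY)
    print(f"{label}: PCIe {pcie * 1e3:.0f} ms, NVLink {nvlink * 1e3:.0f} ms")
```

Real transfers will be slower than these figures, since measured throughput sits below the theoretical peak.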

NVLink Topologies

DGX H100 Topology:

8× H100 GPUs with NVSwitch

    ┌─────────────────────────────────────────────────────┐
    │                   NVSwitch Fabric                   │
    │             (Full bisection bandwidth)              │
    └─────────────────────────────────────────────────────┘
      │      │      │      │      │      │      │      │
    ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
    │H100│ │H100│ │H100│ │H100│ │H100│ │H100│ │H100│ │H100│
    └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘

Any GPU can communicate with any other GPU at full NVLink bandwidth.

Consumer NVLink (GeForce RTX):

RTX 3090: NVLink bridge, 2 GPUs
RTX 4090: No NVLink support

NVSwitch

What It Enables:

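Without NVSwitch:
- Direct links only between neighboring GPUs
- Traffic between unlinked GPUs must relay through another GPU or fall back to PCIe

With NVSwitch:
- All-to-all connectivity
- Full bisection bandwidth
- Any GPU reaches any other GPU through the switch, without relaying through another GPU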
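
To make the contrast concrete, the sketch below counts hops between GPUs in two simplified models: a bidirectional ring, standing in for "direct links only between neighbors", and a switch, where every GPU is one switch traversal from every other. Both models and the function names are illustrative assumptions, not a description of any specific NVIDIA system.

```python
def ring_hops(src: int, dst: int, num_gpus: int) -> int:
    """Fewest link traversals between two GPUs on a bidirectional ring."""
    distance = abs(src - dst) % num_gpus
    return min(distance, num_gpus - distance)


def switch_hops(src: int, dst: int) -> int:
    """With a switch, any two distinct GPUs are one switch traversal apart."""
    return 0 if src == dst else 1


num_gpus = 8
worst_ring = max(ring_hops(0, d, num_gpus) for d in range(num_gpus))
worst_switch = max(switch_hops(0, d) for d in range(num_gpus))
print(f"worst case hops: ring={worst_ring}, switch={worst_switch}")
# prints: worst case hops: ring=4, switch=1
```

In the ring model the worst-case distance grows with the GPU count, while the switch model keeps it constant.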

DGX Generations:

System       | GPUs | Topology                     | GPU-GPU BW (bidirectional)
-------------|------|------------------------------|---------------------------
DGX A100     | 8    | NVSwitch (full)              | 600 GB/s
DGX H100     | 8    | NVSwitch (full)              | 900 GB/s
DGX GH200    | 256  | Grace Hopper + NVLink Switch | 900 GB/s

Programming with NVLink

NCCL (NVIDIA Collective Communications Library):

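import os

import torch
import torch.distributed as dist

# Launch with torchrun, which sets LOCAL_RANK, RANK, and WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # bind this process to one GPU

# Initialize with the NCCL backend (NCCL picks NVLink paths when present)
dist.init_process_group(backend="nccl")

# AllReduce sums the tensor across all ranks, in place
tensor = torch.randn(1000, device="cuda")
dist.all_reduce(tensor)

dist.destroy_process_group()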
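
For a rough sense of how link bandwidth affects an all-reduce, the sketch below applies the standard bandwidth-only cost model for a ring all-reduce. This is an assumption for illustration: NCCL chooses among several algorithms at run time, and the model ignores latency. `ring_allreduce_time_s` is a hypothetical helper, and the bandwidths are per-direction figures.

```python
def ring_allreduce_time_s(size_gb: float, num_gpus: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends 2 * (N - 1) / N times the buffer size: (N - 1) chunks of
    size S / N in the reduce-scatter phase, and the same again in all-gather.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * size_gb
    return traffic_gb / bandwidth_gb_s


# 10 GB of gradients across 8 GPUs; bandwidths are per-direction assumptions.
for name, bw in [("PCIe Gen5 x16", 64.0), ("NVLink 4", 450.0)]:
    t = ring_allreduce_time_s(10.0, 8, bw)
    print(f"{name}: {t * 1e3:.1f} ms")
```

Because each GPU moves nearly twice the buffer size, the estimate is close to double a simple one-way copy of the same data.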

Peer-to-Peer Memory Access:

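// Check that srcDevice can address dstDevice's memory, then enable P2P
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, srcDevice, dstDevice);
if (canAccess) {
    cudaSetDevice(srcDevice);
    cudaDeviceEnablePeerAccess(dstDevice, 0);  // flags must be 0
}

// Copy between devices; travels over NVLink when the GPUs are linked
cudaMemcpyPeer(dst, dstDevice, src, srcDevice, size);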

Checking NVLink:

# Check NVLink status
nvidia-smi nvlink -s

# Show topology
nvidia-smi topo -m

# NVLink traffic counters, counter set 0 (recent drivers report throughput with: nvidia-smi nvlink -gt d)
nvidia-smi nvlink -g 0
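
The topology matrix can also be read programmatically. The sketch below lists the GPU pairs that the matrix marks with an `NV#` entry, which denotes an NVLink connection. `SAMPLE_TOPO` is a hand-written illustration of the matrix layout, not output captured from a real system, and `nvlink_pairs` is a hypothetical helper. In practice the text would come from running `nvidia-smi topo -m`.

```python
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV12    SYS     SYS     0-31
GPU1    NV12     X      SYS     SYS     0-31
GPU2    SYS     SYS      X      NV12    32-63
GPU3    SYS     SYS     NV12     X      32-63
"""


def nvlink_pairs(topo_text: str) -> list[tuple[str, str]]:
    """Return GPU pairs whose matrix entry starts with 'NV' (an NVLink path)."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    gpus = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    pairs = []
    for line in lines[1:]:
        tokens = line.split()
        if not tokens or tokens[0] not in gpus:
            continue  # skip legend lines and non-GPU rows
        row_idx = gpus.index(tokens[0])
        # Only the first len(gpus) entries of a row are GPU-to-GPU cells.
        for col_idx, entry in enumerate(tokens[1:1 + len(gpus)]):
            if entry.startswith("NV") and row_idx < col_idx:  # count each pair once
                pairs.append((gpus[row_idx], gpus[col_idx]))
    return pairs


print(nvlink_pairs(SAMPLE_TOPO))
# prints: [('GPU0', 'GPU1'), ('GPU2', 'GPU3')]
```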

NVLink vs. PCIe Use Cases

Use Case              | Best Interconnect
----------------------|-----------------------------------------------
Single GPU inference  | PCIe (sufficient)
Multi-GPU training    | NVLink (essential)
Large model inference | NVLink (model sharding)
Consumer workstation  | PCIe (NVLink limited)
Data center           | NVLink within nodes + InfiniBand between nodes

NVLink is essential infrastructure for multi-GPU AI. Without a high-bandwidth interconnect, communication overhead dominates as GPUs are added and scaling becomes inefficient, which makes NVLink critical for training large models and for serving them across multiple GPUs.
