InfiniBand is a high-bandwidth, low-latency networking technology that uses RDMA for GPU cluster communication. With 200-400 Gbps per port and microsecond-scale latency, it is the interconnect of choice for large-scale AI training, where multi-node communication efficiency determines how well a job scales.
What Is InfiniBand?
- Definition: High-performance networking fabric for clusters.
- Technology: RDMA (Remote Direct Memory Access).
- Vendor: NVIDIA (formerly Mellanox) is the dominant supplier.
- Use Case: HPC, AI training, storage networks.
Why InfiniBand for AI
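- Bandwidth: 400 Gbps per port (NDR) vs. the 100 Gbps Ethernet common in general-purpose data centers; 400G Ethernet matches it on raw speed.
- Latency: ~1 μs vs. ~10-50 μs for TCP/IP over Ethernet.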
- RDMA: Bypass CPU for GPU-to-GPU transfers.
- Scaling: Efficient all-reduce across thousands of GPUs.
- Proven: Used in largest AI training runs.
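To see why both bandwidth and latency matter for all-reduce, here is a rough sketch using the textbook alpha-beta cost model for a ring all-reduce. The function name, the 1 GiB buffer, and the two link profiles are illustrative assumptions; real NCCL chooses among several algorithms and overlaps communication with compute, so treat the numbers as order-of-magnitude only.

```python
def ring_allreduce_seconds(message_bytes, num_ranks, link_gbps, latency_us):
    """Estimate ring all-reduce time with the alpha-beta cost model."""
    if num_ranks < 2:
        return 0.0
    bytes_per_second = link_gbps * 1e9 / 8   # Gbps -> bytes per second
    alpha = latency_us * 1e-6                # per-step latency in seconds
    steps = 2 * (num_ranks - 1)              # reduce-scatter + all-gather
    chunk_bytes = message_bytes / num_ranks  # each step moves one chunk
    return steps * (alpha + chunk_bytes / bytes_per_second)


if __name__ == "__main__":
    grads = 1 * 1024**3  # hypothetical 1 GiB gradient buffer
    profiles = [("400 Gbps, 1 us", 400, 1.0), ("100 Gbps, 30 us", 100, 30.0)]
    for label, gbps, latency in profiles:
        t = ring_allreduce_seconds(grads, 1024, gbps, latency)
        print(f"{label}: {t * 1e3:.1f} ms per all-reduce")
```

The latency term is multiplied by the number of steps, which grows with the number of ranks, so per-hop latency becomes more important as the cluster gets larger.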
InfiniBand Generations
Speed Evolution:
Generation | Speed (per port) | Year
-----------|------------------|------
EDR | 100 Gbps | 2014
HDR | 200 Gbps | 2019
NDR | 400 Gbps | 2022
XDR | 800 Gbps | 2024
GDR | 1600 Gbps | Future
Comparison with Ethernet:
Aspect | InfiniBand NDR | 400G Ethernet
--------------|----------------|---------------
Bandwidth | 400 Gbps | 400 Gbps
Latency | ~1 μs | ~10-50 μs (TCP/IP); lower with RoCE
RDMA | Native | Via RoCE (extra configuration)
Flow control | Credit-based (lossless) | Drop-based (lossless only with PFC)
CPU overhead | Minimal | Higher
AI training | Optimized | Improving
Cost | Higher | Lower
RDMA Explained
How RDMA Works:
Traditional Network:
Application memory → CPU copy to kernel buffer → NIC → Network → NIC → CPU copy from kernel buffer → Application memory
RDMA:
GPU Memory → NIC → Network → NIC → GPU Memory
(CPU stays off the data path, zero-copy)
GPUDirect RDMA:
┌─────────┐    NVLink    ┌─────────┐
│  GPU 0  │◄────────────►│  GPU 1  │
└────┬────┘              └────┬────┘
     │ PCIe                   │ PCIe
     ▼                        ▼
┌─────────┐  InfiniBand  ┌─────────┐
│   NIC   │◄────────────►│   NIC   │
└─────────┘    (RDMA)    └─────────┘
GPUDirect: the NIC reads and writes GPU memory directly over PCIe
No staging through host memory, CPU off the data path, minimal latency
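On Linux, GPUDirect RDMA typically depends on a peer-memory kernel module. The sketch below checks /proc/modules-style text for two module names commonly associated with it. The module list is an assumption about a typical NVIDIA setup, and some newer stacks can use a DMA-BUF path instead, so an empty result is not conclusive.

```python
from pathlib import Path

# Kernel modules commonly associated with GPUDirect RDMA on NVIDIA systems.
# "nvidia_peermem" ships with recent NVIDIA drivers; "nv_peer_mem" is the
# older out-of-tree module. Treat this list as an assumption, not a spec.
PEER_MEM_MODULES = ("nvidia_peermem", "nv_peer_mem")


def loaded_peer_mem_modules(proc_modules_text):
    """Return peer-memory module names found in /proc/modules-style text."""
    loaded = {
        line.split()[0]  # the module name is the first field on each line
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in PEER_MEM_MODULES if name in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    found = loaded_peer_mem_modules(text)
    print("peer-memory modules loaded:", found or "none")
```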
AI Training Infrastructure
Typical Large Cluster:
┌─────────────────────────────────────────────────────────┐
│                     Spine Switches                      │
│       (InfiniBand NDR, high-radix, non-blocking)        │
└─────────────────────────────────────────────────────────┘
           │             │             │             │
           ▼             ▼             ▼             ▼
      ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
      │  Leaf   │   │  Leaf   │   │  Leaf   │   │  Leaf   │
      │ Switch  │   │ Switch  │   │ Switch  │   │ Switch  │
      └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
           │             │             │             │
   ┌───────┼───────┐    ...           ...           ...
   │       │       │
┌──────┐┌──────┐┌──────┐
│DGX 1 ││DGX 2 ││DGX 3 │ (8 H100s each)
└──────┘└──────┘└──────┘
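The size of a non-blocking leaf-spine fabric follows from the switch radix. A minimal sketch, assuming every switch has the same port count and each leaf splits its ports evenly between hosts and uplinks (the function name and the 64-port example are illustrative):

```python
def two_tier_fat_tree(radix):
    """Size a non-blocking two-tier (leaf-spine) fat tree built from one switch model."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    down_ports = radix // 2  # leaf ports facing hosts; the other half are uplinks
    leaves = radix           # each spine port connects to a distinct leaf
    spines = radix // 2      # every leaf has one uplink to every spine
    return {
        "leaves": leaves,
        "spines": spines,
        "endpoints": leaves * down_ports,
    }


if __name__ == "__main__":
    print(two_tier_fat_tree(64))
```

With 64-port switches this gives 64 leaves, 32 spines, and 2,048 endpoints. If each GPU has its own network port (an assumption about node design), that is 2,048 GPUs, or 256 eight-GPU nodes; larger clusters need a third switching tier.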
NCCL with InfiniBand
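import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Set NCCL environment for InfiniBand (must happen before init_process_group)
os.environ["NCCL_IB_DISABLE"] = "0"     # 0 = allow the InfiniBand transport (the default)
os.environ["NCCL_NET_GDR_LEVEL"] = "5"  # Legacy numeric level: allow GPUDirect RDMA at any GPU-NIC distance

# Initialize distributed; a launcher such as torchrun supplies RANK, WORLD_SIZE, LOCAL_RANK
dist.init_process_group(
    backend="nccl",
    init_method="env://",
)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Training code - NCCL selects InfiniBand automatically when it is available
model = torch.nn.Linear(1024, 1024).cuda()  # Placeholder model for illustration
model = DistributedDataParallel(model, device_ids=[local_rank])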
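To judge whether NCCL is actually using the fabric well, a measured all-reduce time can be converted into the two bandwidth figures that the nccl-tests benchmarks report. The sketch below assumes you have already timed one all-reduce yourself (for example around dist.all_reduce with torch.cuda.synchronize() before and after); the function name is hypothetical.

```python
def allreduce_bandwidth(message_bytes, seconds, num_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce, following nccl-tests' definitions."""
    if seconds <= 0 or num_ranks < 1:
        raise ValueError("seconds must be positive and num_ranks at least 1")
    algbw = message_bytes / seconds / 1e9            # payload size divided by elapsed time
    busbw = algbw * 2 * (num_ranks - 1) / num_ranks  # corrects for traffic each rank must move
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB reduced across 8 ranks in 100 ms.
    algbw, busbw = allreduce_bandwidth(1e9, 0.1, 8)
    print(f"algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")
```

Bus bandwidth is the figure to compare against link speed, because it stays roughly constant as the number of ranks changes.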
Checking InfiniBand
# List InfiniBand devices
ibstat
# Show port status
ibstatus
# Check link speed
ibstat mlx5_0 | grep Rate
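# Performance test: start the server side on one node
ib_write_bw -d mlx5_0
# Then run the client on a second node, pointing at the server
ib_write_bw -d mlx5_0 <server-hostname>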
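For fleet-wide checks it helps to parse ibstat output rather than read it by eye. A minimal sketch, assuming the usual "CA '<name>'" / "Port N:" / "Key: value" layout; the sample text is made up for illustration and the helper names are hypothetical.

```python
import re

# Illustrative ibstat-style output; the values are made up for this example.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4129
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Link layer: InfiniBand
"""

WANTED_KEYS = ("State", "Physical state", "Rate", "Link layer")


def parse_ibstat_ports(text):
    """Collect selected fields for each port listed in ibstat-style text."""
    ports = []
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        ca = re.match(r"CA '(.+)'$", line)
        port = re.match(r"Port (\d+):$", line)
        if ca:
            device = ca.group(1)
        elif port:
            ports.append({"device": device, "port": int(port.group(1))})
        elif ports and ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if key in WANTED_KEYS:
                ports[-1][key] = value  # values are kept as strings
    return ports


def is_healthy(port, expected_gbps):
    """True if the port is Active and reports at least the expected rate."""
    return port.get("State") == "Active" and float(port.get("Rate", 0)) >= expected_gbps


if __name__ == "__main__":
    for p in parse_ibstat_ports(SAMPLE_IBSTAT):
        status = "ok" if is_healthy(p, 400) else "check"
        print(p["device"], p["port"], status)
```

In practice the text would come from running ibstat on each node (for example via subprocess.run), and a port reporting a lower rate than expected usually points to a cable or negotiation problem.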
InfiniBand vs. Alternatives
Use Case                  | Best Choice
--------------------------|------------------------
AI training (1000+ GPUs)  | InfiniBand NDR
Small clusters (<64 GPUs) | Either (cost-dependent)
Cloud/flexibility         | Ethernet (easier)
Maximum performance       | InfiniBand
Budget constrained        | 400G Ethernet + RoCE
Cost Considerations
Component | InfiniBand | 400G Ethernet
-------------------|------------|---------------
NIC/HCA (each) | ~$3-5K | ~$1-2K
Switch (per port) | ~$500-1K | ~$200-400
Total system cost | Higher | Lower
Performance/$ | Better at scale | Better for small clusters
InfiniBand is the performance backbone of large-scale AI training. When frontier models are trained across thousands of GPUs, the efficiency of collective operations, enabled by InfiniBand's low latency and RDMA capabilities, directly determines how well training scales.