GPU Cluster Networking and HPC Fabric is the high-speed interconnect infrastructure that connects hundreds to tens of thousands of GPU nodes in AI training clusters and HPC systems, determining how efficiently computation and communication overlap during distributed workloads; at scale, the network rather than the compute is often the bottleneck. At 1,000+ GPUs, the collective communication operations (AllReduce, AllToAll) required by distributed deep learning can consume 30-60% of total training time, making fabric topology, bandwidth, and latency directly responsible for training throughput.
Network Technologies Comparison
| Technology | Bandwidth/Port | Latency | Distance | Use Case |
|-----------|---------------|---------|----------|----------|
| InfiniBand HDR | 200 Gb/s | 0.6 µs | Datacenter | HPC, AI training |
| InfiniBand NDR | 400 Gb/s | 0.5 µs | Datacenter | Large AI clusters |
| RoCE v2 | 100-400 Gb/s | 1-3 µs | Datacenter | AI, cloud GPU |
| NVLink | 600-900 GB/s | <1 µs | Within node | GPU-GPU within server |
| Ethernet (standard) | 100-400 Gb/s | 5-50 µs | WAN/LAN | General networking |
RDMA and RoCE
- RDMA (Remote Direct Memory Access): Transfers data directly between the memory of different nodes without CPU involvement; with GPUDirect RDMA, the NIC reads and writes GPU memory directly.
- RoCE (RDMA over Converged Ethernet): RDMA over standard Ethernet infrastructure → cheaper hardware than InfiniBand while approaching InfiniBand latency.
- RDMA advantage: Eliminates CPU and OS overhead on the data path → latency drops from roughly 50 µs (TCP) to 1-3 µs (RoCE).
- Key use: AllReduce operations in PyTorch DDP and DeepSpeed → reduced synchronization overhead (see the NCCL sketch below).
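As a minimal sketch of how this is consumed from a framework: PyTorch's NCCL backend uses RDMA (InfiniBand or RoCE) when the fabric supports it, and a handful of NCCL environment variables steer which interfaces it uses. The variable names below are real NCCL knobs, but the interface and HCA values are placeholders that depend on the cluster.

```python
import os
import torch
import torch.distributed as dist

# Real NCCL environment variables; the values below are placeholders and
# depend on the cluster's NIC naming (assumption, adjust per site).
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # RDMA-capable HCA to use
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for bootstrap traffic
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # keep IB/RoCE transport enabled

def allreduce_demo() -> None:
    # Rank, world size, and master address are normally injected by the
    # launcher (torchrun, Slurm, etc.) through environment variables.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # One gradient-sized tensor; NCCL moves it GPU-to-GPU over the fabric,
    # bypassing host memory when GPUDirect RDMA is available.
    grad = torch.ones(64 * 1024 * 1024, device="cuda")  # ~256 MB of fp32
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("allreduce done, first element =", grad[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    allreduce_demo()
```

Launched with `torchrun --nproc_per_node=8 ...` on each node, every rank ends up with the summed tensor; the same collective underlies gradient synchronization in DDP.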
Fabric Topologies
Fat-Tree (Most Common)
```
          [Core switches]
           /    |    \
      [Agg switches]        (aggregation layer)
         /    |    \
     [Leaf switches]        (rack-level)
        |     |     |
      [GPU nodes]           (servers)
```
- Full bisection bandwidth: Any server can communicate at full speed with any other.
- Scalable: Adding core/spine switches and uplinks scales bisection bandwidth.
- Used by: Meta, Microsoft, Google GPU clusters.
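For a sense of how fat-trees scale, the classic 3-tier k-ary fat-tree built from switches with k ports supports k³/4 hosts at full bisection bandwidth. The sketch below is a simplified sizing model; real GPU fabrics often use 2-tier leaf/spine variants or deliberate oversubscription.

```python
def fat_tree_capacity(radix: int) -> dict:
    """Sizing of a classic 3-tier k-ary fat-tree built from switches with
    `radix` ports. Simplified model: no oversubscription, identical switches."""
    k = radix
    assert k % 2 == 0, "switch radix must be even"
    return {
        "pods": k,
        "core_switches": (k // 2) ** 2,
        "agg_switches": k * (k // 2),
        "leaf_switches": k * (k // 2),
        "hosts": k ** 3 // 4,  # full bisection bandwidth at this host count
    }

# Example: 64-port switches support up to 64**3 / 4 = 65,536 hosts.
print(fat_tree_capacity(64))
```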
Dragonfly+
- All-to-all connections between groups of switches → fewer hops across large clusters.
- Lower average hop count than fat-tree → lower latency at scale.
- Trade-off: More complex routing, potentially non-uniform bandwidth.
Torus (3D)
- Grid topology with wrap-around connections → each node connects to 6 neighbors.
- Used by: IBM Blue Gene, Google TPU v4 pods.
- Advantage: Good for nearest-neighbor communication patterns (physics simulations, LLM pipeline parallelism).
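A quick sketch of the addressing: in a 3D torus every node, including those at the grid edges, has exactly six neighbors because links wrap around in each dimension. This is a toy model of the topology, not any vendor's routing scheme.

```python
def torus_neighbors(node, dims):
    """Return the 6 neighbors of node = (x, y, z) in a 3D torus of shape
    dims = (X, Y, Z), with wrap-around links in every dimension."""
    x, y, z = node
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),  # +x / -x
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),  # +y / -y
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),  # +z / -z
    ]

# Even a "corner" node of a 4x4x4 torus has 6 neighbors thanks to wrap-around.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```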
Adaptive Routing
- Static routing: Each flow takes one fixed path → susceptible to congestion hotspots.
- Adaptive routing: Packets dynamically choose paths based on link congestion → avoids hotspots.
- ECMP (Equal-Cost Multi-Path): Traffic hashed across multiple equal-cost paths → better load distribution.
- Hardware adaptive routing (InfiniBand HDR): Per-packet adaptive routing → packets can arrive out of order → receiver must handle reordering.
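To make the ECMP behavior concrete, here is a toy path-selection sketch: hash the flow's 5-tuple and map it onto one of the equal-cost uplinks. Real switches use hardware hash functions rather than MD5; the values in the example (including the RoCE v2 UDP port 4791) are illustrative.

```python
import hashlib

def ecmp_pick_path(src_ip: str, dst_ip: str, src_port: int,
                   dst_port: int, proto: str, num_paths: int) -> int:
    """Toy ECMP: hash the 5-tuple and pin the flow to one of num_paths
    equal-cost uplinks. All packets of a flow take the same path, so there
    is no reordering, but several large flows can still collide on one link."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# A RoCE v2 flow (UDP port 4791) pinned to one of 8 uplinks.
print(ecmp_pick_path("10.0.0.1", "10.0.1.9", 51515, 4791, "UDP", num_paths=8))
```

That flow-level pinning is exactly why per-packet adaptive routing can outperform ECMP under skewed traffic, at the cost of reordering.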
Collective Communication Algorithms
- Ring AllReduce: Each GPU sends to its neighbor around a ring → 2(N-1) steps for N GPUs → bandwidth-efficient for large messages at scale (simulated in the sketch below).
- Tree AllReduce: Binary tree reduction → O(log N) steps → faster for small messages.
- Recursive halving/doubling: Combines both ideas → good for mid-size clusters.
- AllToAll: Each GPU sends different data to every other GPU → used by Mixture-of-Experts expert parallelism and some tensor-parallel layouts → the traffic pattern is a full permutation, which is hard on the topology.
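The sketch below simulates ring AllReduce on toy data to show the two phases (reduce-scatter, then all-gather) and the resulting per-GPU traffic of roughly 2(N-1)/N times the buffer size. It is a stripped-down model; NCCL's real implementation pipelines chunks and overlaps communication with computation.

```python
def ring_allreduce(chunks_per_gpu):
    """Simulate ring AllReduce (sum) on a list of per-GPU chunk lists.
    chunks_per_gpu[g][c] is chunk c held by GPU g; len() == number of GPUs."""
    n = len(chunks_per_gpu)

    # Phase 1: reduce-scatter, N-1 steps. After step s, GPU g's chunk (g - s) % n
    # holds the partial sum of s + 1 contributions.
    for step in range(n - 1):
        for gpu in range(n):
            idx = (gpu - step) % n
            chunks_per_gpu[(gpu + 1) % n][idx] += chunks_per_gpu[gpu][idx]

    # Phase 2: all-gather, N-1 steps. The fully reduced chunk held by each GPU
    # is forwarded around the ring; receivers overwrite instead of adding.
    for step in range(n - 1):
        for gpu in range(n):
            idx = (gpu + 1 - step) % n
            chunks_per_gpu[(gpu + 1) % n][idx] = chunks_per_gpu[gpu][idx]
    return chunks_per_gpu

# 4 GPUs, 4 chunks each; GPU g contributes the value g + 1 everywhere.
result = ring_allreduce([[float(g + 1)] * 4 for g in range(4)])
assert all(chunk == 10.0 for gpu in result for chunk in gpu)  # 1+2+3+4

# Traffic: each GPU sends 2*(N-1)/N of the buffer. For a 10 GB gradient bucket
# on 1,024 GPUs that is ~20 GB per GPU, roughly 0.4 s at 400 Gb/s (50 GB/s)
# line rate, ignoring latency and compute overlap.
```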
Network Congestion Control
- DCQCN (Data Center Quantized Congestion Notification): RoCE congestion control → switches mark packets with ECN, senders reduce their rate in response (see the toy sketch below).
- InfiniBand credit-based flow control: Link-level credits prevent buffer-overflow drops → a lossless fabric.
- Priority Flow Control (PFC): Pauses individual traffic classes instead of the whole link → lossless RoCE without stalling unrelated traffic classes.
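As a toy illustration of the DCQCN reaction point (not the actual specified algorithm, which tracks an EWMA congestion estimate and has staged recovery phases), a sender cuts its rate multiplicatively when it receives a Congestion Notification Packet and creeps back toward line rate otherwise. The constants below are illustrative only.

```python
class DcqcnLikeRateLimiter:
    """Toy model of DCQCN-style rate reaction at the sending NIC.
    Constants are illustrative, not the values from the DCQCN specification."""

    def __init__(self, line_rate_gbps: float):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps

    def on_cnp(self) -> None:
        # ECN-marked packets at a congested switch trigger CNPs back to the
        # sender; react with a multiplicative rate decrease.
        self.rate *= 0.5

    def on_recovery_timer(self) -> None:
        # While no congestion is signaled, recover gradually toward line rate.
        self.rate = min(self.line_rate, self.rate + 0.05 * self.line_rate)

limiter = DcqcnLikeRateLimiter(line_rate_gbps=400.0)
limiter.on_cnp(); limiter.on_cnp()     # congestion: 400 -> 200 -> 100 Gb/s
limiter.on_recovery_timer()            # recovery:   100 -> 120 Gb/s
print(f"{limiter.rate:.0f} Gb/s")
```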
GPU Cluster Scale Examples
| Cluster | GPU Count | Network | Topology |
|---------|----------|---------|----------|
| Meta RSC | 16,000 GPUs | 200 Gb/s InfiniBand | Fat-tree |
| NVIDIA DGX SuperPOD | 4,096 GPUs | 400 Gb/s InfiniBand | Fat-tree |
| Google TPU v4 Pod | 4,096 TPUs | Optical 3D torus | 3D torus |
| Microsoft Azure NDv4 | 100s-1,000s of GPUs | 200 Gb/s InfiniBand | Fat-tree |
GPU cluster networking is the circulatory system of modern AI: as model sizes grow from billions to trillions of parameters and training runs require thousands of GPUs running for weeks, the fabric that connects them determines whether those GPUs collaborate efficiently or spend most of their time waiting for gradients, making network architecture, bandwidth, and latency as critical to AI training throughput as the GPU compute itself.