GPU Cluster Networking

Keywords: gpu cluster networking architecture, infiniband gpu interconnect, high speed cluster network, gpu cluster topology, datacenter network gpu

GPU Cluster Networking is the high-bandwidth, low-latency interconnect infrastructure that lets thousands of GPUs communicate efficiently during distributed training. It relies on specialized network fabrics such as InfiniBand, RoCE, and proprietary interconnects (NVLink, Gaudi's on-chip Ethernet) to deliver the aggregate bandwidth and microsecond-level latency required to scale deep learning workloads across hundreds of nodes without communication becoming the bottleneck.

Network Requirements for GPU Clusters:
- Bandwidth Scaling: modern GPUs (e.g., H100 at roughly 2,000 TFLOPS of sparse low-precision compute) demand interconnect bandwidth that keeps pace; sustaining high communication efficiency in data-parallel training calls for GPU-to-GPU transfer rates of 400-900 GB/s per GPU, so an 8-GPU node needs 3.2-7.2 TB/s of aggregate bisection bandwidth
- Latency Sensitivity: collective operations (all-reduce, all-gather) in distributed training are latency-bound for small message sizes; sub-microsecond network latency enables efficient gradient synchronization for models with many small layers, and because collectives chain many sequential hops, each microsecond of per-hop latency adds milliseconds to iteration time at scale
- Message Size Distribution: training workloads exhibit bimodal message patterns — large bulk transfers (multi-GB activation checkpoints, model states) benefit from bandwidth, while frequent small messages (gradient chunks, control signals) are latency-sensitive; network must optimize for both regimes
- Fault Tolerance: at 10,000+ GPU scale, hardware failures occur daily; network must support fast failure detection, traffic rerouting, and job migration without cascading failures that take down entire training runs
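The bandwidth/latency trade-off above can be made concrete with the standard alpha-beta cost model for a ring all-reduce. The constants below (1 μs per hop, 400 GB/s links) are illustrative assumptions, not measurements from any particular cluster:

```python
# Sketch: alpha-beta cost model for a ring all-reduce, showing why both
# latency and bandwidth matter. All constants are illustrative assumptions.

def ring_allreduce_time(msg_bytes, n_gpus, alpha_s=1e-6, beta_s_per_byte=1 / 400e9):
    """Estimate ring all-reduce time: 2(n-1) latency-bound steps, plus
    2(n-1)/n of the message crossing each link at the given bandwidth."""
    latency_term = 2 * (n_gpus - 1) * alpha_s
    bandwidth_term = 2 * (n_gpus - 1) / n_gpus * msg_bytes * beta_s_per_byte
    return latency_term + bandwidth_term

# A small gradient chunk (64 KB) across 512 GPUs is latency-dominated.
small = ring_allreduce_time(64 * 1024, 512)
# A large gradient bucket (1 GB) across 512 GPUs is bandwidth-dominated.
large = ring_allreduce_time(1 << 30, 512)
print(f"64 KB: {small * 1e3:.2f} ms, 1 GB: {large * 1e3:.2f} ms")
```

Under these assumptions the 64 KB collective spends nearly all of its time in per-hop latency, which is why small-message regimes reward sub-microsecond fabrics while bulk transfers reward raw link bandwidth.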

InfiniBand Architecture:
- RDMA Capabilities: Remote Direct Memory Access bypasses the CPU and OS kernel, enabling GPU-to-GPU transfers with <1 μs latency and near-line-rate bandwidth; RDMA read/write operations directly access remote GPU memory without interrupting the remote CPU
- HDR/NDR InfiniBand: HDR (High Data Rate) provides 200 Gb/s per port (25 GB/s); NDR (Next Data Rate) delivers 400 Gb/s (50 GB/s); eight NDR ports supply 3.2 Tb/s of aggregate bandwidth, enough to serve the 8-16 H100 GPUs typically attached per switch
- Adaptive Routing: InfiniBand switches dynamically route packets across multiple paths to avoid congestion; improves effective bandwidth utilization by 20-40% compared to static routing in fat-tree topologies
- Congestion Control: credit-based flow control prevents packet loss; ECN (Explicit Congestion Notification) and PFC (Priority Flow Control) manage congestion without dropping packets — critical for RDMA which cannot tolerate packet loss
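In practice, training frameworks reach these fabrics through communication libraries such as NCCL. A minimal sketch of environment settings commonly tuned on InfiniBand clusters follows; the device prefix and GDR level are site-specific assumptions, not universal values:

```shell
# Sketch: NCCL settings often tuned on InfiniBand clusters.
# Values are illustrative; the right ones depend on your HCAs and topology.
export NCCL_IB_DISABLE=0         # keep the RDMA (IB verbs) transport enabled
export NCCL_IB_HCA=mlx5          # select ConnectX-class IB adapters by prefix
export NCCL_NET_GDR_LEVEL=PHB    # allow GPUDirect RDMA when GPU and NIC
                                 # share a PCIe host bridge
export NCCL_DEBUG=INFO           # log which transports and rings NCCL chose
```

With `NCCL_DEBUG=INFO`, the startup log shows whether the IB/RDMA transport was actually selected, which is the quickest way to confirm traffic is not silently falling back to TCP.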

Alternative Network Technologies:
- RoCE (RDMA over Converged Ethernet): implements RDMA semantics over Ethernet; RoCEv2 uses UDP/IP for routing flexibility; requires lossless Ethernet (PFC, ECN) for reliability; 200/400 GbE RoCE is competitive with InfiniBand at lower cost but higher latency (2-5 μs vs <1 μs)
- NVLink/NVSwitch: NVIDIA's proprietary GPU-to-GPU interconnect; NVLink 4.0 provides 900 GB/s bidirectional per GPU (18 links × 50 GB/s bidirectional each); NVSwitch enables full non-blocking connectivity among 8 GPUs in a node, with intra-node bandwidth roughly 10× higher than PCIe
- Gaudi Interconnect: Intel Gaudi accelerators integrate 24× 100 GbE RDMA ports directly on chip; eliminates separate NICs and enables flexible network topologies; each Gaudi chip is a network endpoint and router
- AWS EFA (Elastic Fabric Adapter): cloud-optimized RDMA network for EC2; provides OS-bypass, low-latency communication for distributed ML; abstracts underlying network hardware (custom ASICs) behind standard libfabric API
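The per-link figures above imply a large gap between intra-node and inter-node bandwidth, which is why collective algorithms try to keep traffic inside the node first. A back-of-the-envelope sketch (the one-NIC-per-GPU ratio is a typical but deployment-specific assumption):

```python
# Sketch: intra-node vs inter-node bandwidth gap, using the figures above.
nvlink_gb_s = 900        # GB/s bidirectional per GPU (NVLink 4.0)
ndr_port_gb_s = 50       # GB/s per NDR 400 Gb/s port
nics_per_gpu = 1         # assumption: one 400G NIC per GPU

ratio = nvlink_gb_s / (ndr_port_gb_s * nics_per_gpu)
print(f"intra-node NVLink is ~{ratio:.0f}x inter-node NDR per GPU")
# Hierarchical collectives exploit this gap: reduce within the node over
# NVLink first, then send only the reduced data across the slower fabric.
```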

Network Topology Impact:
- Fat-Tree: most common datacenter topology; full bisection bandwidth between any two nodes; scales to 10,000+ nodes with 3-5 switch tiers; predictable performance but high switch count and cabling complexity
- Dragonfly: hierarchical topology with dense intra-group connectivity and sparse inter-group links; reduces switch count by 40% vs fat-tree; adaptive routing critical to avoid hotspots on inter-group links
- Torus/Mesh: direct node-to-node connections in 2D/3D grid; common in HPC (Cray, Fugaku supercomputer); lower diameter than fat-tree but non-uniform bandwidth (edge nodes have fewer links); requires topology-aware job placement
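Fat-tree scale and switch cost can be estimated from the standard k-ary fat-tree closed form for k-port switches; the 64-port example below is an assumption roughly matching current-generation switch radices:

```python
# Sketch: size a k-ary fat-tree built from k-port switches, using the
# standard closed form: k^3/4 hosts, 5k^2/4 switches (edge + agg + core).
def fat_tree(k):
    hosts = k ** 3 // 4
    edge = agg = k * (k // 2)    # k pods, each with k/2 edge and k/2 agg switches
    core = (k // 2) ** 2
    return hosts, edge + agg + core

hosts, switches = fat_tree(64)   # assume 64-port switches
print(hosts, switches)           # 65536 hosts served by 5120 switches
```

The cubic growth in hosts versus quadratic growth in switches is what lets fat-trees reach the 10,000+ node scale cited above, at the price of the large switch and cable counts the text notes.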

GPU cluster networking is the critical infrastructure that determines whether distributed training scales efficiently or stalls on communication. The combination of RDMA-capable fabrics, adaptive routing, and topology optimization enables training runs that would otherwise be impossible, making the difference between days and months for frontier model development.
