GPU Cluster Networking and HPC Fabric

GPU Cluster Networking and HPC Fabric is the high-speed interconnect infrastructure that connects hundreds to tens of thousands of GPU nodes in AI training clusters and HPC systems. It determines how efficiently computation and communication overlap during distributed workloads, where the network, rather than compute, is often the bottleneck. At scale (1000+ GPUs), the collective communication operations (AllReduce, AllToAll) required by distributed deep learning can account for 30–60% of total training time, making fabric topology, bandwidth, and latency directly responsible for training throughput.
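
To make the link between bandwidth, latency, and training time concrete, the following is a minimal sketch of the standard latency-bandwidth cost model for a ring AllReduce. The choice of the ring algorithm, the function names, and the assumption that communication does not overlap with compute are illustrative and not taken from this article; real collective libraries pick algorithms dynamically and overlap communication with computation.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_gbps, latency_us):
    """Estimate ring AllReduce time with a simple latency-bandwidth model.

    Assumes (hypothetically) one link of `link_gbps` per GPU and a fixed
    per-step latency. A ring AllReduce runs 2 * (N - 1) steps, and each
    step moves a chunk of message_bytes / N.
    """
    if num_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    bytes_per_second = link_gbps * 1e9 / 8  # Gb/s -> bytes/s
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_us * 1e-6 + chunk_bytes / bytes_per_second)


def comm_fraction(comm_seconds, compute_seconds):
    """Share of a training step spent communicating, assuming no overlap."""
    return comm_seconds / (comm_seconds + compute_seconds)


if __name__ == "__main__":
    # Illustrative inputs only: 10 GB of gradients, 1024 GPUs, 400 Gb/s links.
    t = ring_allreduce_seconds(10e9, 1024, link_gbps=400, latency_us=0.5)
    print(f"estimated AllReduce time: {t:.3f} s")
    print(f"comm fraction with 0.5 s compute: {comm_fraction(t, 0.5):.0%}")
```

In this model the bandwidth term barely changes with GPU count while the latency term grows linearly with it, which is one reason both link speed and per-hop latency matter at scale.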

Network Technologies Comparison

| Technology | Bandwidth/Port | Latency | Distance | Use Case |
|---|---|---|---|---|
| InfiniBand HDR | 200 Gb/s | 0.6 µs | Datacenter | HPC, AI training |
| InfiniBand NDR | 400 Gb/s | 0.5 µs | Datacenter | Large AI clusters |
| RoCE v2 | 100–400 Gb/s | 1–3 µs | Datacenter | AI, cloud GPU |
| NVLink | 600–900 GB/s | <1 µs | Within node | GPU-GPU within server |
| Ethernet (standard) | 100–400 Gb/s | 5–50 µs | WAN/LAN | General networking |
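
Note that the NVLink figure is in gigabytes per second (GB/s) while the others are in gigabits per second (Gb/s). As a rough illustration of what the numbers above imply, the sketch below converts everything to Gb/s and estimates the time to move one message over a single link. The `LINKS` dictionary and `transfer_seconds` helper are hypothetical names for this example; the NVLink latency of 1.0 µs is an assumed upper bound standing in for the table's "<1 µs", and protocol overhead is ignored.

```python
# (bandwidth in Gb/s, latency in microseconds), taken from the table above.
LINKS = {
    "InfiniBand HDR": (200, 0.6),
    "InfiniBand NDR": (400, 0.5),
    "NVLink (900 GB/s)": (900 * 8, 1.0),  # GB/s -> Gb/s; latency is an assumed bound
}


def transfer_seconds(num_bytes, gbps, latency_us):
    """Idealized time to send num_bytes over one link: latency + size / bandwidth."""
    return latency_us * 1e-6 + num_bytes * 8 / (gbps * 1e9)


if __name__ == "__main__":
    one_gib = 2**30
    for name, (gbps, latency_us) in LINKS.items():
        ms = transfer_seconds(one_gib, gbps, latency_us) * 1e3
        print(f"{name:20s} {ms:8.3f} ms per GiB")
```

For large messages the bandwidth term dominates, so latency differences of a microsecond or two matter mainly for small, frequent messages.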

RDMA and RoCE

Fabric Topologies

Fat-Tree (Most Common)

    [Core switches]
         / | \
  [Agg switches]   (aggregation layer)
       / | \
   [Leaf switches]  (rack-level)
       | | |
    [GPU nodes]     (servers)
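
The three layers in the diagram can be sized with a short calculation. This is a sketch under a stated assumption: the classic three-tier k-ary fat-tree built entirely from identical k-port switches with no oversubscription. Production GPU fabrics often deviate from this (for example, with oversubscribed or rail-optimized designs), and `fat_tree_capacity` is a hypothetical helper name.

```python
def fat_tree_capacity(k):
    """Size a three-tier k-ary fat-tree built from k-port switches.

    Layout assumed here: k pods, each with k/2 leaf and k/2 aggregation
    switches; each leaf switch uses k/2 ports for hosts and k/2 for uplinks.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number of ports >= 2")
    half = k // 2
    return {
        "hosts": k * half * half,    # k^3 / 4 end hosts
        "leaf_switches": k * half,   # k^2 / 2
        "agg_switches": k * half,    # k^2 / 2
        "core_switches": half * half # k^2 / 4
    }


if __name__ == "__main__":
    for ports in (4, 32, 64):
        print(ports, fat_tree_capacity(ports))
```

Because host count grows with the cube of the switch radix, moving from 32-port to 64-port switches raises the maximum non-blocking cluster size by a factor of eight under this model.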

Dragonfly+

Torus (3D)

Adaptive Routing

Collective Communication Algorithms

Network Congestion Control

GPU Cluster Scale Examples

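| Cluster | Accelerator Count | Network | Topology |
|---|---|---|---|
| Meta RSC | 16,000 GPUs | 200 Gb/s InfiniBand | Fat-tree (two-level Clos) |
| NVIDIA DGX SuperPOD | 4,096 GPUs | 400 Gb/s InfiniBand | Fat-tree |
| Google TPU v4 Pod | 4,096 TPUs | Optical circuit switching | 3D torus |
| Microsoft Azure NDv4 | 100s to 1000s of GPUs | 200 Gb/s InfiniBand | Fat-tree |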

GPU cluster networking is the circulatory system of modern AI. As model sizes grow from billions to trillions of parameters and training runs require thousands of GPUs running for weeks, the fabric that connects them determines whether those GPUs collaborate efficiently or spend most of their time waiting for gradients. Network architecture, bandwidth, and latency are therefore as critical to AI training throughput as the GPU compute itself.
