InfiniBand Architecture is the high-performance networking standard designed for low-latency, high-bandwidth interconnects in HPC and AI clusters, providing hardware-offloaded RDMA operations, reliable transport with sub-microsecond latency, and a scalable switched-fabric architecture that has become the de facto standard for GPU cluster networking in large-scale machine learning infrastructure.
InfiniBand Protocol Stack:
- Physical Layer: SerDes electrical signaling at 25-100 Gb/s per lane depending on generation (EDR 25, HDR 50, NDR 100); 4× lane aggregation yields 100-400 Gb/s links, with 12× configurations also defined; copper cables (DAC) for <5 m, active optical cables (AOC) for 5-100 m, fiber optics for longer distances
- Link Layer: packets up to 4 KB MTU, protected by a 16-bit variant CRC (VCRC) and a 32-bit invariant CRC (ICRC) for error detection; credit-based flow control ensures lossless transmission; virtual lanes (up to 15 data VLs + 1 management VL) enable QoS and deadlock-free routing
- Network Layer: 128-bit Global Identifier (GID) addressing in IPv6 format; LID (Local Identifier) routing within a subnet, GID routing via the Global Route Header (GRH) between subnets; the GRH mirrors the IPv6 header, which eases gateways to IP networks (see the port-query sketch after this list)
- Transport Layer: multiple transport services, namely Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), and Unreliable Datagram (UD); RC is the most common for RDMA, providing in-order delivery with hardware-level retransmission
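The layer attributes above are visible to applications through the Verbs device/port query calls. Below is a minimal libibverbs sketch, assuming a single-HCA host and port 1 (both assumptions for illustration), that prints the SM-assigned LID, the negotiated MTU, the virtual-lane count, and the active speed/width enums:

```c
/* Minimal sketch: open the first RDMA device and print port attributes.
 * Compile with: gcc query_port.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port)) { perror("ibv_query_port"); return 1; }

    /* LID is assigned by the Subnet Manager; MTU is negotiated at the link layer */
    printf("device: %s\n", ibv_get_device_name(devs[0]));
    printf("state: %d  LID: %u  LMC: %u\n", port.state, port.lid, port.lmc);
    printf("active MTU enum: %d  max VLs: %u\n", port.active_mtu, port.max_vl_num);
    printf("active speed enum: %u  width enum: %u\n", port.active_speed, port.active_width);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```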
Queue Pair (QP) Model:
- Send/Receive Queues: each QP consists of a Send Queue (SQ) and Receive Queue (RQ); applications post Work Requests (WRs) to queues; HCA (Host Channel Adapter) processes WRs asynchronously and posts Completion Queue Entries (CQEs) when operations complete
- RDMA Operations: RDMA Write (write to remote memory without remote CPU involvement), RDMA Read (read from remote memory), RDMA Atomic (atomic compare-and-swap, fetch-and-add); Send/Receive for traditional message passing
- Memory Registration: applications register memory regions with the HCA, receiving an R_Key (remote key) and L_Key (local key); registration pins physical pages and grants HCA DMA access; remote peers use R_Key to access registered memory via RDMA operations
- Zero-Copy Transfer: data moves directly from application buffer to NIC to remote NIC to remote application buffer; the CPU only posts the operation descriptor, with no data copying through kernel buffers, achieving 95%+ of wire bandwidth (see the RDMA Write sketch after this list)
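The registration and RDMA Write flow described above maps onto a handful of Verbs calls. The fragment below is an illustrative sketch, not a complete program: it assumes a protection domain, a connected RC queue pair, and a completion queue already exist, and that the peer's buffer address and R_Key were exchanged out of band (for example over TCP or RDMA CM); the function name rdma_write_example is hypothetical.

```c
/* Sketch of the RDMA Write path: register memory, post the WR, poll the CQ. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Memory registration: pins the pages and returns lkey/rkey handles */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) return -1;

    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;
    wr.opcode = IBV_WR_RDMA_WRITE;      /* remote CPU is not involved */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a CQE on completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;             /* R_Key advertised by the peer */

    if (ibv_post_send(qp, &wr, &bad_wr)) { ibv_dereg_mr(mr); return -1; }

    /* Busy-poll the completion queue until the HCA reports the write done */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    ibv_dereg_mr(mr);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```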
Subnet Management:
- Subnet Manager (SM): centralized control plane that discovers topology, assigns LIDs, computes routing tables, and configures switch forwarding; typically runs on a dedicated management node or integrated into a switch
- LID Assignment: SM assigns 16-bit LIDs to each port; unicast LIDs for point-to-point, multicast LIDs for one-to-many; LID Mask Control (LMC) gives each port multiple consecutive LIDs so the SM can route multiple paths between endpoints for load balancing (see the sketch after this list)
- Routing Algorithms: SM computes forwarding tables using algorithms like Min-Hop (shortest path), DFSSSP (Deadlock-Free Single-Source Shortest Path), or Fat-Tree optimized routing; tables downloaded to switches via Subnet Management Packets (SMPs)
- Topology Discovery: SM sends SMP queries to discover switches, links, and endpoints; builds complete topology graph; reconfigures routing on link failures or topology changes; discovery and reconfiguration complete in seconds for 1000-node clusters
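As a concrete illustration of LMC: a port with LMC = n answers to 2^n consecutive LIDs starting at its base LID, and each destination LID can be routed along a different switch path. The sketch below simply enumerates that LID set; the base LID and LMC values are made-up examples, since real values come from the SM.

```c
/* LMC expands one port into 2^LMC path LIDs starting at the base LID. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t base_lid = 0x40;  /* example value; real LIDs come from the SM */
    uint8_t  lmc      = 2;     /* 2^2 = 4 paths per destination port */

    for (unsigned i = 0; i < (1u << lmc); i++)
        printf("path %u -> destination LID 0x%04x\n", i, base_lid + i);
    return 0;
}
```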
Performance Characteristics:
- Latency: RC Send/Receive latency <1 µs for small messages (ConnectX-7); RDMA Write latency 0.6-0.8 µs; latency dominated by HCA processing and wire time, not software overhead
- Bandwidth: NDR (400 Gb/s) achieves 48+ GB/s effective bandwidth for large messages; 95%+ efficiency due to hardware offload and zero-copy (see the back-of-envelope sketch after this list); multiple QPs enable full link utilization from concurrent operations
- CPU Efficiency: RDMA operations consume <5% CPU utilization at line rate; CPU freed for computation while network transfers proceed in background; critical for GPU workloads where CPU orchestrates GPU kernels
- Scalability: single subnet supports 48K endpoints (16-bit LID space); multi-subnet fabrics with routers scale to millions of endpoints; flat address space within subnet simplifies programming model
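To see where the 48+ GB/s figure comes from, the back-of-envelope sketch below divides a 4 KB MTU by the MTU plus per-packet header/CRC bytes from the IB spec and scales the 400 Gb/s data rate accordingly; it ignores FEC, the first-packet RETH, and PCIe effects, so it is an upper-bound estimate rather than a measured number.

```c
/* Back-of-envelope goodput estimate for an NDR 4x link at 4 KB MTU.
 * Header sizes per the IB spec: LRH 8 B, BTH 12 B, ICRC 4 B, VCRC 2 B. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 400.0;            /* NDR 4x data rate */
    const double mtu = 4096.0;                 /* payload bytes per packet */
    const double overhead = 8 + 12 + 4 + 2;    /* LRH + BTH + ICRC + VCRC */

    double efficiency = mtu / (mtu + overhead);
    double goodput_GBs = link_gbps / 8.0 * efficiency;

    printf("per-packet efficiency: %.1f%%\n", efficiency * 100.0);
    printf("approx. goodput: %.1f GB/s\n", goodput_GBs);  /* ~49.7 GB/s ideal */
    return 0;
}
```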
Programming Interfaces:
- Verbs API: low-level C API (libibverbs) for direct HCA access; applications create QPs, post WRs, poll CQs; maximum performance but complex programming model requiring careful resource management
- UCX: Unified Communication X library; its UCP layer provides high-level abstractions (Active Messages, RMA, Atomics) over Verbs and other transports; automatic protocol selection, multi-rail support, and fault tolerance; used by MPI implementations and ML frameworks
- MPI over IB: MPI libraries (Open MPI, MVAPICH, Intel MPI) implement MPI semantics using IB Verbs; MPI_Send/Recv map to IB Send/Recv or RDMA operations; collective operations are optimized for IB hardware multicast and adaptive routing (see the MPI sketch after this list)
- NCCL over IB: NVIDIA Collective Communications Library detects IB devices and uses RDMA for GPU-to-GPU transfers; implements ring, tree, and collnet algorithms optimized for IB topology; achieves 90%+ of theoretical bandwidth for all-reduce operations
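For the application, all of this plumbing stays behind standard interfaces. The minimal MPI sketch below performs the all-reduce pattern discussed above; when the MPI library is built with IB support, this unchanged code is carried over Verbs/RDMA. The program is a generic example, not drawn from any particular IB tutorial.

```c
/* Minimal MPI all-reduce example.
 * Compile/run: mpicc allreduce.c -o allreduce && mpirun -np 4 ./allreduce */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes its rank id; the fabric-wide sum comes back */
    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.0f\n", size, global);

    MPI_Finalize();
    return 0;
}
```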
InfiniBand architecture is the networking foundation of much of modern AI infrastructure: its hardware-offloaded RDMA, sub-microsecond latency, and lossless fabric enable efficient distributed training of frontier models, making it the interconnect of choice for many of the AI labs and cloud providers building GPU supercomputers.