InfiniBand Architecture is the high-performance networking standard designed for low-latency, high-bandwidth interconnects in HPC and AI clusters, providing hardware-offloaded RDMA operations, reliable transport with sub-microsecond latency, and a scalable switched-fabric architecture that has become the de facto standard for GPU cluster networking in large-scale machine learning infrastructure.
InfiniBand Protocol Stack:
- Physical Layer: SerDes electrical signaling at 25-100 Gb/s per lane depending on generation (EDR 25, HDR 50, NDR 100); 4× lane aggregation yields 100-400 Gb/s links, with 12× configurations also defined; copper cables (DAC) for <5 m, active optical cables (AOC) for 5-100 m, fiber optics for longer distances
- Link Layer: packets up to 4 KB MTU, protected by a 16-bit variant CRC (VCRC) and a 32-bit invariant CRC (ICRC) for error detection; credit-based flow control ensures lossless transmission; virtual lanes (up to 15 data VLs + 1 management VL) enable QoS and deadlock-free routing
- Network Layer: 128-bit Global Identifier (GID) addressing in IPv6 format; LID (Local Identifier) routing within a subnet, GID routing via the Global Route Header (GRH) between subnets; the GRH mirrors the IPv6 header, which eases gateways to IP networks (see the port-query sketch after this list)
- Transport Layer: multiple transport services, namely Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), and Unreliable Datagram (UD); RC is the most common for RDMA, providing in-order delivery with hardware-level retransmission
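The layer attributes above are visible to applications through the Verbs device/port query calls. Below is a minimal libibverbs sketch, assuming a single-HCA host and port 1 (both assumptions for illustration), that prints the SM-assigned LID, the negotiated MTU, the virtual-lane count, and the active speed/width enums:

```c
/* Minimal sketch: open the first RDMA device and print port attributes.
 * Compile with: gcc query_port.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port)) { perror("ibv_query_port"); return 1; }

    /* LID is assigned by the Subnet Manager; MTU is negotiated at the link layer */
    printf("device: %s\n", ibv_get_device_name(devs[0]));
    printf("state: %d  LID: %u  LMC: %u\n", port.state, port.lid, port.lmc);
    printf("active MTU enum: %d  max VLs: %u\n", port.active_mtu, port.max_vl_num);
    printf("active speed enum: %u  width enum: %u\n", port.active_speed, port.active_width);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```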
Queue Pair (QP) Model:
- Send/Receive Queues: each QP consists of a Send Queue (SQ) and Receive Queue (RQ); applications post Work Requests (WRs) to queues; HCA (Host Channel Adapter) processes WRs asynchronously and posts Completion Queue Entries (CQEs) when operations complete
- RDMA Operations: RDMA Write (write to remote memory without remote CPU involvement), RDMA Read (read from remote memory), RDMA Atomic (atomic compare-and-swap, fetch-and-add); Send/Receive for traditional message passing
- Memory Registration: applications register memory regions with the HCA, receiving an R_Key (remote key) and L_Key (local key); registration pins physical pages and grants HCA DMA access; remote peers use R_Key to access registered memory via RDMA operations
- Zero-Copy Transfer: data moves directly from application buffer to NIC to remote NIC to remote application buffer; the CPU only posts the operation descriptor, with no data copying through kernel buffers, achieving 95%+ of wire bandwidth (see the RDMA Write sketch after this list)
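The registration and RDMA Write flow described above maps onto a handful of Verbs calls. The fragment below is an illustrative sketch, not a complete program: it assumes a protection domain, a connected RC queue pair, and a completion queue already exist, and that the peer's buffer address and R_Key were exchanged out of band (for example over TCP or RDMA CM); the function name rdma_write_example is hypothetical.

```c
/* Sketch of the RDMA Write path: register memory, post the WR, poll the CQ. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Memory registration: pins the pages and returns lkey/rkey handles */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) return -1;

    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;
    wr.opcode = IBV_WR_RDMA_WRITE;      /* remote CPU is not involved */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a CQE on completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;             /* R_Key advertised by the peer */

    if (ibv_post_send(qp, &wr, &bad_wr)) { ibv_dereg_mr(mr); return -1; }

    /* Busy-poll the completion queue until the HCA reports the write done */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    ibv_dereg_mr(mr);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```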
Subnet Management:
- Subnet Manager (SM): centralized control plane that discovers topology, assigns LIDs, computes routing tables, and configures switch forwarding; typically runs on a dedicated management node or integrated into a switch
- LID Assignment: SM assigns 16-bit LIDs to each port; unicast LIDs for point-to-point, multicast LIDs for one-to-many; LID Mask Control (LMC) gives each port multiple consecutive LIDs so the SM can route multiple paths between endpoints for load balancing (see the sketch after this list)
- Routing Algorithms: SM computes forwarding tables using algorithms like Min-Hop (shortest path), DFSSSP (Deadlock-Free Single-Source Shortest Path), or Fat-Tree optimized routing; tables downloaded to switches via Subnet Management Packets (SMPs)
- Topology Discovery: SM sends SMP queries to discover switches, links, and endpoints; builds complete topology graph; reconfigures routing on link failures or topology changes; discovery and reconfiguration complete in seconds for 1000-node clusters
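As a concrete illustration of LMC: a port with LMC = n answers to 2^n consecutive LIDs starting at its base LID, and each destination LID can be routed along a different switch path. The sketch below simply enumerates that LID set; the base LID and LMC values are made-up examples, since real values come from the SM.

```c
/* LMC expands one port into 2^LMC path LIDs starting at the base LID. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t base_lid = 0x40;  /* example value; real LIDs come from the SM */
    uint8_t  lmc      = 2;     /* 2^2 = 4 paths per destination port */

    for (unsigned i = 0; i < (1u << lmc); i++)
        printf("path %u -> destination LID 0x%04x\n", i, base_lid + i);
    return 0;
}
```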
Performance Characteristics:
- Latency: RC Send/Receive latency <1 µs for small messages (ConnectX-7); RDMA Write latency 0.6-0.8 µs; latency dominated by HCA processing and wire time, not software overhead
- Bandwidth: NDR (400 Gb/s) achieves 48+ GB/s effective bandwidth for large messages; 95%+ efficiency due to hardware offload and zero-copy (see the back-of-envelope sketch after this list); multiple QPs enable full link utilization from concurrent operations
- CPU Efficiency: RDMA operations consume <5% CPU utilization at line rate; CPU freed for computation while network transfers proceed in background; critical for GPU workloads where CPU orchestrates GPU kernels
- Scalability: single subnet supports 48K endpoints (16-bit LID space); multi-subnet fabrics with routers scale to millions of endpoints; flat address space within subnet simplifies programming model
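To see where the 48+ GB/s figure comes from, the back-of-envelope sketch below divides a 4 KB MTU by the MTU plus per-packet header/CRC bytes from the IB spec and scales the 400 Gb/s data rate accordingly; it ignores FEC, the first-packet RETH, and PCIe effects, so it is an upper-bound estimate rather than a measured number.

```c
/* Back-of-envelope goodput estimate for an NDR 4x link at 4 KB MTU.
 * Header sizes per the IB spec: LRH 8 B, BTH 12 B, ICRC 4 B, VCRC 2 B. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 400.0;            /* NDR 4x data rate */
    const double mtu = 4096.0;                 /* payload bytes per packet */
    const double overhead = 8 + 12 + 4 + 2;    /* LRH + BTH + ICRC + VCRC */

    double efficiency = mtu / (mtu + overhead);
    double goodput_GBs = link_gbps / 8.0 * efficiency;

    printf("per-packet efficiency: %.1f%%\n", efficiency * 100.0);
    printf("approx. goodput: %.1f GB/s\n", goodput_GBs);  /* ~49.7 GB/s ideal */
    return 0;
}
```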
Programming Interfaces:
- Verbs API: low-level C API (libibverbs) for direct HCA access; applications create QPs, post WRs, poll CQs; maximum performance but complex programming model requiring careful resource management
- UCX: Unified Communication X library; its UCP layer provides high-level abstractions (Active Messages, RMA, Atomics) over Verbs and other transports; automatic protocol selection, multi-rail support, and fault tolerance; used by MPI implementations and ML frameworks
- MPI over IB: MPI libraries (Open MPI, MVAPICH, Intel MPI) implement MPI semantics using IB Verbs; MPI_Send/Recv map to IB Send/Recv or RDMA operations; collective operations are optimized for IB hardware multicast and adaptive routing (see the MPI sketch after this list)
- NCCL over IB: NVIDIA Collective Communications Library detects IB devices and uses RDMA for GPU-to-GPU transfers; implements ring, tree, and collnet algorithms optimized for IB topology; achieves 90%+ of theoretical bandwidth for all-reduce operations
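For the application, all of this plumbing stays behind standard interfaces. The minimal MPI sketch below performs the all-reduce pattern discussed above; when the MPI library is built with IB support, this unchanged code is carried over Verbs/RDMA. The program is a generic example, not drawn from any particular IB tutorial.

```c
/* Minimal MPI all-reduce example.
 * Compile/run: mpicc allreduce.c -o allreduce && mpirun -np 4 ./allreduce */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes its rank id; the fabric-wide sum comes back */
    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.0f\n", size, global);

    MPI_Finalize();
    return 0;
}
```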
InfiniBand architecture is the networking foundation of much of modern AI infrastructure: its hardware-offloaded RDMA, sub-microsecond latency, and lossless fabric enable efficient distributed training of frontier models, making it the interconnect of choice for many of the AI labs and cloud providers building GPU supercomputers.