HPC Cluster Networking enables extreme-scale distributed computation through high-bandwidth, low-latency interconnects like InfiniBand and RoCE, with RDMA verbs API providing efficient point-to-point and collective communication.
InfiniBand Generations (HDR, EDR, NDR)
- InfiniBand Bandwidth Evolution (4x link): SDR (10 Gbps) → DDR (20 Gbps) → QDR (40 Gbps) → FDR (56 Gbps) → EDR (100 Gbps) → HDR (200 Gbps) → NDR (400 Gbps). Per-lane signaling rates are one quarter of these figures.
- EDR (Enhanced Data Rate): 100 Gbps per 4x link (4 lanes × 25 Gbps effective). Dual-port NICs provide 200 Gbps aggregate. Typical for TOP500 clusters deployed before ~2021.
- HDR (High Data Rate): 200 Gbps per 4x link (4 lanes × 50 Gbps). Dual-port = 400 Gbps. Deployed in recent supercomputers (e.g., NVIDIA Selene).
- Lane Count: Standard InfiniBand host links are 4x (four lanes). Narrower (1x) and wider (8x/12x) variants exist in the specification, but 4x dominates in practice.
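The lane arithmetic above can be sketched directly; the per-lane rates below are the approximate effective data rates for each generation (illustrative table, not from any library):

```python
# Illustrative sketch: per-link bandwidth = lanes x per-lane data rate
# for a standard 4x InfiniBand link. Rates are approximate effective Gbps.
LANE_RATE_GBPS = {
    "SDR": 2.5, "DDR": 5, "QDR": 10, "FDR": 14,
    "EDR": 25, "HDR": 50, "NDR": 100,
}

def link_bandwidth_gbps(generation: str, lanes: int = 4) -> float:
    """Aggregate link bandwidth for the given generation and lane count."""
    return lanes * LANE_RATE_GBPS[generation]

print(link_bandwidth_gbps("EDR"))        # 4 x 25  -> 100 Gbps
print(link_bandwidth_gbps("HDR"))        # 4 x 50  -> 200 Gbps
print(2 * link_bandwidth_gbps("HDR"))    # dual-port HDR -> 400 Gbps
```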
RDMA Verbs API and Queue Pairs
- RC (Reliable Connected): Point-to-point reliable, in-order delivery over a connection between exactly two endpoints. Supports send/recv plus one-sided RDMA read/write and atomics.
- UD (Unreliable Datagram): Connectionless, datagram semantics. No delivery or ordering guarantee; lost datagrams are not retransmitted, and each message is limited to one path MTU. Lower per-peer state makes it attractive for large-scale all-to-all collectives.
- Queue Pair (QP): Endpoint consisting of Send Queue (SQ) and Recv Queue (RQ). Application posts work requests (WRs) to queues; hardware executes asynchronously.
- Completion Queue (CQ): Collects completed work. Application polls/waits on CQ to detect completion. Decouples WR submission from completion detection.
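The SQ/CQ split can be modeled as two queues with an asynchronous consumer. This is a conceptual sketch only; the class and method names are made up and do not mirror the real libibverbs API:

```python
# Conceptual model of the verbs submission/completion split: the app
# posts work requests (WRs) to a send queue, "hardware" drains them
# asynchronously, and completions surface on a completion queue (CQ).
from collections import deque

class ToyQueuePair:
    def __init__(self, cq):
        self.send_queue = deque()
        self.cq = cq

    def post_send(self, wr_id, payload):
        # Non-blocking: just enqueue the work request and return.
        self.send_queue.append((wr_id, payload))

    def hw_progress(self):
        # Stand-in for the NIC: execute one WR, report completion on the CQ.
        if self.send_queue:
            wr_id, _ = self.send_queue.popleft()
            self.cq.append(("SEND_COMPLETE", wr_id))

cq = deque()                  # completion queue shared by app and "hardware"
qp = ToyQueuePair(cq)
qp.post_send(1, b"hello")     # returns immediately, WR pending
qp.post_send(2, b"world")
qp.hw_progress()              # "hardware" completes WR 1
print(cq.popleft())           # polling the CQ -> ('SEND_COMPLETE', 1)
```

The point of the model: `post_send` never blocks, and the application learns about completion only by draining the CQ, exactly the decoupling the bullet describes.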
RoCE (RDMA over Converged Ethernet) v2
- RoCE Protocol: RDMA over Ethernet reusing the InfiniBand transport layer. Runs over standard Ethernet switching (RDMA-capable NICs are still required). RoCE v1 encapsulates directly in Ethernet frames; RoCE v2 adds UDP/IP encapsulation.
- RoCE v2: Uses UDP/IP, so traffic is routable across L3 switches (vs RoCE v1, which is link-local only). The UDP source port carries per-flow entropy for ECMP load balancing across paths.
- Congestion Control (DCQCN): Data Center Quantized Congestion Notification. Switches mark packets via ECN on congestion; the receiver reflects congestion notification packets (CNPs) back to the sender, which throttles its rate. Reduces packet loss.
- Switch Requirements: RoCE needs ECN-capable switches for DCQCN, and deployments typically also enable PFC (Priority Flow Control) for lossless operation. Not all enterprise switches support ECN marking.
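The sender side of DCQCN can be sketched from its published update rules; the gain constant and the simplified recovery step below are illustrative, not values from any particular NIC firmware:

```python
# Hedged sketch of DCQCN's sender-side rate control: on a CNP, cut the
# rate by alpha/2 and push alpha toward 1; with no CNPs, decay alpha
# and recover toward the pre-cut target rate. Constants are illustrative.
class DcqcnSender:
    G = 1 / 16                       # alpha smoothing gain (assumed)

    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps   # current sending rate Rc
        self.target = line_rate_gbps # target rate Rt
        self.alpha = 1.0             # congestion estimate

    def on_cnp(self):
        # Congestion notification: remember target, multiplicative decrease.
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.G) * self.alpha + self.G

    def on_recovery_timer(self):
        # Quiet period: decay alpha, move rate halfway back toward target.
        self.alpha *= 1 - self.G
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender(100.0)
s.on_cnp()
print(round(s.rate, 1))   # 100 * (1 - 1/2) = 50.0 Gbps
s.on_recovery_timer()
print(round(s.rate, 1))   # halfway back toward 100 -> 75.0 Gbps
```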
IB Queue Pair States and Transitions
- RESET → INIT → RTR (Ready to Receive): Initial connection setup. Endpoints exchange QP numbers and starting PSNs (packet sequence numbers) out of band before moving to RTR.
- RTR → RTS (Ready to Send): Each side then transitions its QP to RTS. Both sides must reach RTR before either sends; the full handshake is RESET → INIT → RTR → RTS on both sides.
- RTS → Error (SQE/ERR): A failed work request (bad WR, disabled QP) moves the QP into an error state and flushes outstanding work. The application recovers by cycling the QP back through RESET.
- Connection Semantics: After connection establishment, the endpoints exchange messages reliably and in order; per-packet CRC protection (ICRC/VCRC) plus link-level retry make undetected bit errors vanishingly rare.
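The legal transitions above form a small state machine, sketched here as a toy validator (state names mirror the verbs states, but this is not the `ibv_modify_qp` API):

```python
# Toy validator for the RC QP state machine: only the transitions
# described above are legal; anything else raises.
LEGAL = {
    "RESET": {"INIT"},
    "INIT":  {"RTR", "RESET"},
    "RTR":   {"RTS", "RESET"},
    "RTS":   {"ERR", "RESET"},
    "ERR":   {"RESET"},
}

def transition(state: str, nxt: str) -> str:
    if nxt not in LEGAL.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt

s = "RESET"
for nxt in ("INIT", "RTR", "RTS"):   # normal bring-up on each side
    s = transition(s, nxt)
print(s)                             # RTS: ready to exchange data
s = transition(s, "ERR")             # a send failure flushes the QP
s = transition(s, "RESET")           # recovery path goes back through RESET
print(s)                             # RESET
```

Note the asymmetry the bullets describe: there is no shortcut from RESET to RTS; a QP must pass through INIT and RTR first.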
Adaptive Routing and Switch Topology
- Deterministic Routing: Fixed path selection (up/down routing). Simple, loop-free but may not use all available bandwidth.
- Adaptive Routing: Path dynamically selected based on network congestion. Balances load across paths, improves bisection bandwidth. Requires more processing.
- Network Topology Options: Fat-tree (Clos network) most common. Dragonfly (Cray) alternative offering higher radix, lower hop count for large clusters.
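The deterministic-vs-adaptive contrast can be shown in a few lines; the hash rule and queue depths below are made-up examples, not any switch's actual forwarding logic:

```python
# Minimal sketch contrasting output-port selection at a switch:
# deterministic routing fixes the port per destination, adaptive
# routing picks the least-congested candidate at forwarding time.
def deterministic_port(dest: int, ports: list) -> int:
    # Fixed path: the same destination always maps to the same port.
    return ports[dest % len(ports)]

def adaptive_port(ports: list, queue_depth: dict) -> int:
    # Congestion-aware: choose the port with the shallowest output queue.
    return min(ports, key=lambda p: queue_depth[p])

ports = [0, 1, 2, 3]
depth = {0: 7, 1: 2, 2: 9, 3: 4}      # current output-queue occupancy
print(deterministic_port(10, ports))  # always port 2 for destination 10
print(adaptive_port(ports, depth))    # port 1, the least congested
```

This is also why adaptive routing "requires more processing": the switch must track queue state and evaluate candidates per packet or per flowlet.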
Fat-Tree and Dragonfly Topologies
- Fat-Tree: Tree with uniform bandwidth at each level (no bottleneck toward the root). Level 0 = hosts, Level 1 = edge switches, Level 2+ = core switches. Full bisection bandwidth = (number_of_hosts × link_bandwidth) / 2.
- Dragonfly: Hierarchical topology: routers within a group are densely connected (often a full mesh), and groups are linked all-to-all by long global links. Excellent for all-to-all traffic, with fewer long cables and hops than a comparable fat-tree.
- Switch Radix: Dragonfly exploits high-radix routers to keep hop counts low; a fat-tree can be built from moderate-radix switches (36-64 ports) by adding tiers, at the cost of more switches and cables.
- Scaling: Fat-tree scales comfortably to roughly ten thousand nodes; beyond that, core-switch and cable counts grow quickly, and dragonfly is often preferred.
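The fat-tree numbers above follow from two standard formulas, applied here with example inputs (host counts and link rates are illustrative):

```python
# Applying the fat-tree formulas from the bullets above:
# full bisection bandwidth = hosts x link bandwidth / 2, and a
# 3-level fat-tree of k-port switches supports k^3 / 4 hosts.
def fat_tree_bisection_gbps(hosts: int, link_gbps: float) -> float:
    return hosts * link_gbps / 2

def fat_tree_hosts(k: int) -> int:
    return k ** 3 // 4

print(fat_tree_hosts(48))                    # 48-port switches -> 27648 hosts
print(fat_tree_bisection_gbps(1024, 200.0))  # 1024 HDR hosts -> 102400.0 Gbps
```

The k^3/4 figure shows why moderate-radix switches still reach large scale: 48-port switches already cover the ~10,000-node regime mentioned above.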
Performance Characteristics
- Latency: RDMA latency ~1-2 µs (hardware offload). TCP/IP latency ~10-100 µs (kernel processing). This 10-100x difference is critical for tightly synchronized algorithms.
- Bandwidth: Link bandwidth fully utilized (>95%) for streaming loads. Point-to-point utilization high (message matching overhead minimal).
- Injection Bandwidth: Peak injection = (number of NIC ports) × (link bandwidth). Typical HPC node: 2 × 100 Gbps = 200 Gbps injection.
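These characteristics combine in the usual latency/bandwidth cost model, time(n) = L + n/B. The latency and bandwidth values plugged in below are the rough figures quoted above, not measurements:

```python
# Simple message-cost model: transfer time = latency + size / bandwidth.
def msg_time_us(nbytes: int, latency_us: float, bw_gbps: float) -> float:
    bytes_per_us = bw_gbps * 1e9 / 8 / 1e6   # Gbps -> bytes per microsecond
    return latency_us + nbytes / bytes_per_us

# An 8-byte value (e.g., one all-reduce element) over a 100 Gbps link:
rdma = msg_time_us(8, latency_us=1.5, bw_gbps=100)    # RDMA-class latency
tcp  = msg_time_us(8, latency_us=50.0, bw_gbps=100)   # kernel TCP latency
print(round(rdma, 3), round(tcp, 3))  # latency dominates small messages
```

For small messages the bandwidth term is negligible, which is why the 10-100x latency gap, not link speed, dominates synchronized algorithms.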
MPI over RDMA Performance
- Rendezvous Protocol: Small messages are sent eagerly into preposted receive buffers. Large messages use rendezvous: the sender waits until the receiver has posted a matching buffer, then transfers (often via zero-copy RDMA). The crossover threshold is tunable, typically tens of KB up to ~1 MB depending on the MPI implementation.
- Collective Optimization: All-reduce implemented via tree (minimize latency) or ring (maximize bandwidth). InfiniBand topology determines optimal algorithm.
- Bandwidth Saturation: Typical HPC applications saturate InfiniBand during communication-heavy parallel regions (synchronous collectives). Overlapping computation with communication (nonblocking operations) hides the remaining latency.
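The eager/rendezvous split reduces to a threshold test; the 64 KB cutoff below is an illustrative default, since real MPI implementations expose it as a tunable:

```python
# Sketch of eager vs rendezvous protocol selection. The threshold is an
# assumed example value, not any particular MPI's default.
EAGER_THRESHOLD = 64 * 1024  # bytes

def choose_protocol(nbytes: int) -> str:
    # Eager: copy into a preposted receive buffer, no handshake needed.
    # Rendezvous: handshake first, then a (typically zero-copy) transfer.
    return "eager" if nbytes <= EAGER_THRESHOLD else "rendezvous"

print(choose_protocol(4 * 1024))     # small message -> eager
print(choose_protocol(4 * 1024**2))  # large message -> rendezvous
```

The trade-off the threshold encodes: eager costs an extra copy but no round trip; rendezvous costs a handshake round trip but transfers directly into the receiver's buffer.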