HPC Cluster Networking

Keywords: hpc cluster infiniband networking, infiniband hdr edr, rdma over converged ethernet roce, verbs api rdma, opa omni path architecture

HPC Cluster Networking enables extreme-scale distributed computation through high-bandwidth, low-latency interconnects like InfiniBand and RoCE, with the RDMA verbs API providing efficient point-to-point and collective communication.

InfiniBand Generations (HDR, EDR, NDR)

- InfiniBand Bandwidth Evolution (per 4x link): SDR (10 Gbps) → DDR (20 Gbps) → QDR (40 Gbps) → FDR (56 Gbps) → EDR (100 Gbps) → HDR (200 Gbps) → NDR (400 Gbps).
- EDR (Enhanced Data Rate): 100 Gbps per 4x port (4 lanes × 25 Gbps). Dual-port NICs provide 200 Gbps aggregate. Common in TOP500 clusters deployed before ~2021.
- HDR (High Data Rate): 200 Gbps per 4x port (4 lanes × 50 Gbps). Dual-port = 400 Gbps. Deployed in recent large systems (e.g., NVIDIA Selene).
- Lane Count: The standard InfiniBand port is 4x (four lanes); 1x, 8x, and 12x widths are also defined but rarely deployed. The active width and speed can be queried at runtime, as sketched below.
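
The negotiated width and per-lane speed of a port can be read at runtime through libibverbs. A minimal sketch, assuming the first RDMA device and port 1 (both illustrative choices):

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);    /* enumerate RDMA devices */
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);      /* open the first device */
    if (!ctx) return 1;

    struct ibv_port_attr attr;
    if (ibv_query_port(ctx, 1, &attr) == 0) {                 /* port numbers start at 1 */
        /* active_width: 1 = 1x, 2 = 4x, 4 = 8x, 8 = 12x; active_speed encodes the per-lane rate */
        printf("state=%d width_code=%d speed_code=%d max_mtu=%d\n",
               attr.state, attr.active_width, attr.active_speed, attr.max_mtu);
    }
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Compile with -libverbs. A width code of 2 denotes a 4x port; combined with the per-lane speed code it gives the aggregate link rate (e.g., 4 × 25 Gbps for EDR).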

RDMA Verbs API and Queue Pairs

- RC (Reliable Connected): Point-to-point, reliable, in-order delivery over a connection between two endpoints. Supports send/recv as well as one-sided RDMA read/write; the usual transport for MPI point-to-point traffic.
- UD (Unreliable Datagram): Connectionless, datagram semantics. No ordering guarantee; lost datagrams are not retransmitted. Keeps per-peer state low, which helps at very large scale and for connection management.
- Queue Pair (QP): Endpoint consisting of a Send Queue (SQ) and a Receive Queue (RQ). The application posts work requests (WRs) to the queues; the hardware executes them asynchronously.
- Completion Queue (CQ): Collects completed work. The application polls or waits on the CQ to detect completion, decoupling WR submission from completion detection (see the sketch after this list).
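
The flow described above maps directly onto libibverbs calls. A minimal sketch, assuming the first RDMA device; error handling and the out-of-band connection exchange needed to reach RTS are omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Minimal verbs setup: protection domain, completion queue, RC queue pair,
 * registered buffer, and one send work request polled to completion. */
int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);  /* completion queue, depth 16 */

    struct ibv_qp_init_attr qia = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,                                  /* reliable connected */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qia);

    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

    /* ... move qp through INIT/RTR/RTS and exchange addresses with the peer here ... */

    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = 4096, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &wr, &bad);                               /* post to the send queue */

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)                        /* busy-poll the CQ */
        ;
    printf("completion status: %d\n", wc.status);

    ibv_dereg_mr(mr); ibv_destroy_qp(qp); ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); free(buf);
    return 0;
}
```

Posting the work request returns immediately; completion is only detected by the poll loop on the CQ, which is the decoupling described above.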

RoCE (RDMA over Converged Ethernet) v2

- RoCE Protocol: Carries the InfiniBand transport layer over Ethernet, so applications keep using the verbs API. RDMA-capable NICs are required, but existing Ethernet switching infrastructure can be reused.
- RoCE v2: Encapsulates traffic in UDP/IP, so it is routable across L3 switches (RoCE v1 uses a plain Ethernet frame and is link-local only). The UDP source port supplies entropy for ECMP hashing, spreading flows across paths. Endpoints are addressed by GIDs rather than InfiniBand LIDs, as in the GID-listing sketch below.
- Congestion Control (DCQCN): Data Center Quantized Congestion Notification. Switches mark packets with ECN under congestion, the receiver returns congestion notifications, and the sender throttles its rate, reducing packet loss.
- Switch Requirements: Lossless RoCE deployments typically rely on PFC (Priority Flow Control) and ECN marking; not all enterprise switches support both.
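
A small sketch of listing a port's GID table with ibv_query_gid; whether a given index is RoCE v1 or v2 is exposed by the kernel under /sys/class/infiniband/<dev>/ports/<port>/gid_attrs/types/. The device and port choices here are illustrative:

```c
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Print the non-empty GID table entries of port 1 on the first RDMA device.
 * For RoCE, the GID (not an InfiniBand LID) identifies the endpoint. */
int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    struct ibv_port_attr pattr;
    ibv_query_port(ctx, 1, &pattr);

    unsigned char zeros[16] = { 0 };
    for (int i = 0; i < pattr.gid_tbl_len; i++) {
        union ibv_gid gid;
        if (ibv_query_gid(ctx, 1, i, &gid))
            continue;
        if (!memcmp(gid.raw, zeros, 16))            /* skip empty table entries */
            continue;
        printf("gid[%d]: ", i);
        for (int b = 0; b < 16; b++)
            printf("%s%02x", (b && b % 2 == 0) ? ":" : "", (unsigned)gid.raw[b]);
        printf("\n");
    }
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```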

IB Queue Pair States and Transitions

- RESET → INIT → RTR (Ready to Receive): Initial setup. The endpoints exchange QP numbers, LIDs/GIDs, and starting PSNs (packet sequence numbers) out of band, then each QP is moved to RTR with the peer's address information.
- RTR → RTS (Ready to Send): The sender transitions to RTS; once both sides reach RTS, data can flow in both directions. The full RESET → INIT → RTR → RTS sequence must be completed on both endpoints (see the ibv_modify_qp sketch below).
- RTS → Error: A failed send work request (e.g., invalid WR, remote access error, retry exhaustion) moves the QP into an error state (SQE/ERR). The application recovers by resetting and re-initializing the QP.
- Connection Semantics: Once established, an RC connection delivers messages reliably and in order; link-level and end-to-end CRCs (VCRC/ICRC) protect packet integrity.
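
A hedged sketch of driving an RC queue pair through these states with ibv_modify_qp; the remote QPN, LID, and PSN parameters are placeholders that would normally come from an out-of-band exchange (e.g., over TCP):

```c
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Drive an RC queue pair from RESET through INIT and RTR to RTS. */
static int connect_rc_qp(struct ibv_qp *qp, uint8_t port,
                         uint32_t remote_qpn, uint16_t remote_lid,
                         uint32_t local_psn, uint32_t remote_psn) {
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));                 /* RESET -> INIT */
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = port;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                 IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    memset(&attr, 0, sizeof(attr));                 /* INIT -> RTR */
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_4096;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = remote_lid;           /* InfiniBand addressing; RoCE uses a GRH/GID instead */
    attr.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                 IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                 IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    memset(&attr, 0, sizeof(attr));                 /* RTR -> RTS */
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;                        /* ack timeout = 4.096 us * 2^14 ~ 67 ms */
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;                         /* 7 = retry indefinitely on RNR NAK */
    attr.sq_psn        = local_psn;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                                    IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                                    IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}
```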

Adaptive Routing and Switch Topology

- Deterministic Routing: Fixed path selection (up/down routing). Simple, loop-free but may not use all available bandwidth.
- Adaptive Routing: Path dynamically selected based on network congestion. Balances load across paths, improves bisection bandwidth. Requires more processing.
- Network Topology Options: Fat-tree (Clos network) most common. Dragonfly (Cray) alternative offering higher radix, lower hop count for large clusters.

Fat-Tree and Dragonfly Topologies

- Fat-Tree: Tree with uniform bandwidth at each level, so a non-blocking build has no bandwidth bottleneck. Level 0 = hosts, Level 1 = edge (leaf) switches, Level 2+ = core switches. Bisection bandwidth of a non-blocking fat-tree = (number_of_hosts × link_bandwidth) / 2.
- Dragonfly: Routers are organized into groups; within a group, routers are densely (often fully) connected, and groups are joined by global links so every group reaches every other group in a few hops. Well suited to all-to-all traffic and minimizes long optical cables.
- Switch Radix: Fat-trees are typically built from moderate-radix switch ASICs (36-64 ports is common for InfiniBand), adding levels to scale. Dragonfly was designed to exploit high-radix routers, keeping diameter and cable count low.
- Scaling: Fat-tree is common up to roughly 10,000 nodes; at larger scales Dragonfly (and variants such as Dragonfly+) are often preferred for cost, as in the sizing sketch below.
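
A back-of-the-envelope sizing sketch, assuming the usual non-blocking folded-Clos formulas (a two-level fat-tree of radix-r switches supports r²/2 hosts, a three-level one r³/4) and the bisection-bandwidth expression above; the radix and link-rate values are illustrative:

```c
#include <stdio.h>

/* Rough fat-tree sizing: hosts supported and bisection bandwidth for a
 * non-blocking design, compared across a few switch radices. */
int main(void) {
    const double link_gbps = 200.0;                 /* e.g., an HDR InfiniBand link */
    const int radix[] = { 40, 64, 128 };            /* switch port counts to compare */

    for (int i = 0; i < 3; i++) {
        int r = radix[i];
        long hosts2 = (long)r * r / 2;              /* 2-level (leaf/spine) maximum */
        long hosts3 = (long)r * r * r / 4;          /* 3-level maximum */
        printf("radix %3d: 2-level %6ld hosts (bisection %.0f Tbps), 3-level %8ld hosts\n",
               r, hosts2, hosts2 * link_gbps / 2.0 / 1000.0, hosts3);
    }
    return 0;
}
```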

Performance Characteristics

- Latency: RDMA latency ~1-2 µs (hardware offload). TCP/IP latency ~10-100 µs (kernel processing). This 10-100x difference is critical for tightly synchronized algorithms, as illustrated below.
- Bandwidth: Link bandwidth is nearly fully utilized (>95%) for streaming workloads; point-to-point utilization stays high because message-matching overhead is minimal.
- Injection Bandwidth: Peak injection = (number of NIC ports) × (link bandwidth). Typical HPC node: 2 × 100 Gbps = 200 Gbps injection.
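
To make the latency-versus-bandwidth trade-off concrete, a simple alpha-beta transfer-time model (t ≈ α + n/B); the latency and bandwidth figures below are illustrative assumptions, not measurements:

```c
#include <stdio.h>

/* Alpha-beta transfer-time model: t = alpha + n / B.
 * alpha = per-message latency, B = link bandwidth (bytes/s). */
int main(void) {
    const double alpha_rdma = 1.5e-6, alpha_tcp = 30e-6;   /* illustrative latencies, seconds */
    const double bw = 200e9 / 8.0;                         /* 200 Gbps link in bytes/s */
    const double sizes[] = { 8, 4096, 1 << 20 };           /* message sizes in bytes */

    for (int i = 0; i < 3; i++) {
        double n = sizes[i];
        double t_rdma = alpha_rdma + n / bw;
        double t_tcp  = alpha_tcp  + n / bw;
        printf("%8.0f B: rdma %6.2f us, tcp %6.2f us (ratio %.1fx)\n",
               n, t_rdma * 1e6, t_tcp * 1e6, t_tcp / t_rdma);
    }
    return 0;
}
```

Under these assumptions, small messages are dominated by α (the RDMA advantage is largest), while at 1 MB both paths are bandwidth-bound and the gap largely disappears.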

MPI over RDMA Performance

- Rendezvous Protocol: Small messages are sent eagerly into preposted receive buffers; large messages use rendezvous, where the sender waits until the receiver has posted a matching buffer and the payload then moves via RDMA. The eager/rendezvous threshold is commonly in the kilobyte-to-tens-of-kilobyte range and is tunable per MPI implementation.
- Collective Optimization: All-reduce is implemented with tree algorithms (minimizing latency for small messages) or ring algorithms (maximizing bandwidth for large messages); the InfiniBand topology and message size determine the optimal choice.
- Bandwidth Saturation: Synchronous collectives in parallel regions can saturate the InfiniBand links; overlapping asynchronous computation with communication hides latency, as in the sketch below.
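
A minimal sketch of communication/computation overlap using MPI-3 nonblocking collectives; the buffer size and the do_independent_work() placeholder are illustrative assumptions:

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder for computation that does not depend on the reduction result. */
static void do_independent_work(double *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 1.0001 + 1.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    enum { N = 1 << 20 };
    static double sendbuf[N], recvbuf[N], other[N];
    MPI_Request req;

    /* Start the all-reduce, overlap it with independent work, then wait. */
    MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    do_independent_work(other, N);                  /* hides communication latency */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("all-reduce of %d doubles complete\n", N);

    MPI_Finalize();
    return 0;
}
```

How much latency is actually hidden depends on the MPI library's progress engine; some implementations require a progress thread or periodic MPI calls for the nonblocking collective to advance during the compute phase.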
