RDMA and InfiniBand Programming is the practice of using Remote Direct Memory Access (RDMA) technology to transfer data directly between the memory of two computers without involving the operating system or CPU of either machine on the data path. RDMA achieves sub-microsecond latency and near-line-rate bandwidth (up to 400 Gbps with NDR InfiniBand), making it essential for high-performance computing, distributed storage, and large-scale AI training.
RDMA Fundamentals:
- Zero-Copy Transfer: data moves directly from the sending application's memory buffer to the receiving application's memory buffer via the RDMA-capable network adapter (RNIC); there are no intermediate copies through kernel buffers, eliminating CPU overhead and memory-bandwidth waste
- Kernel Bypass: RDMA operations are posted from user space directly to the RNIC hardware via memory-mapped I/O; the OS kernel is not involved in the data path, reducing per-message CPU overhead to under 1 µs
- One-Sided Operations: RDMA Read and Write transfer data to/from remote memory without any CPU involvement on the remote side; the remote process doesn't even know its memory was accessed, enabling truly asynchronous communication
- Two-Sided Operations: Send/Receive involves both sides; the sender posts a send work request and the receiver posts a receive work request, similar to traditional message passing but with RDMA performance (the sketch after this list contrasts the two styles)
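To make the distinction concrete, here is a minimal C sketch of the sender's side of both styles, assuming a Reliable Connected QP already in the RTS state, a registered memory region `mr`, and (for the one-sided case) the peer's buffer address and rkey exchanged out of band; the helper names are illustrative, not part of libibverbs.

```c
/* Sketch: one-sided RDMA Write vs. two-sided Send at the work-request level.
 * Assumes a connected RC QP and a registered MR; error handling abbreviated. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_write(struct ibv_qp *qp, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,   /* local source buffer */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: remote CPU stays idle */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* target address on the peer */
    wr.wr.rdma.rkey        = rkey;               /* peer's remote access key */
    return ibv_post_send(qp, &wr, &bad);
}

int post_send_msg(struct ibv_qp *qp, struct ibv_mr *mr, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)mr->addr, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;  /* two-sided: lands in a receive buffer the peer pre-posted */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}
```

The only structural difference is the opcode and the remote address/rkey: the write names its destination explicitly, while the send leaves buffer placement to the receiver.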
InfiniBand Architecture:
- Speed Tiers: SDR (10 Gbps), DDR (20 Gbps), QDR (40 Gbps), FDR (56 Gbps), EDR (100 Gbps), HDR (200 Gbps), NDR (400 Gbps); per-port bandwidth doubles roughly every 3 years
- Subnet Architecture: hosts connect through Host Channel Adapters (HCAs) via switches; a subnet manager configures routing tables, LID (Local Identifier) assignments, and partition membership (see the port-query sketch after this list)
- Reliable Connected (RC): the most common transport; it establishes a reliable, ordered, connection-oriented channel between two Queue Pairs (similar to TCP but in hardware)
- Unreliable Datagram (UD): connectionless transport allowing one Queue Pair to communicate with any other; lower overhead but no reliability guarantees, and messages are limited to one MTU
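As a small illustration of the subnet manager's role, the sketch below uses ibv_query_port() to read back what the SM configured for the first device's port 1 (verbs port numbering starts at 1); error handling is abbreviated.

```c
/* Sketch: inspect an HCA port after the subnet manager has configured the
 * fabric. ACTIVE state and a nonzero LID indicate the SM has done its job. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port)) { perror("ibv_query_port"); return 1; }

    /* The LID is assigned by the subnet manager; routing uses it for forwarding. */
    printf("port state: %s, LID: 0x%x, active MTU enum: %d\n",
           ibv_port_state_str(port.state), port.lid, (int)port.active_mtu);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```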
Verbs API (libibverbs):
- Protection Domain: ibv_alloc_pd() creates an isolation boundary for RDMA resources; all memory regions and queue pairs must belong to a protection domain
- Memory Registration: ibv_reg_mr() pins physical memory pages and provides the RNIC with a translation table; registered memory can't be swapped out, and the RNIC accesses it without CPU involvement
- Queue Pair (QP): ibv_create_qp() creates a send/receive queue pair; work requests are posted to the send queue (ibv_post_send) or receive queue (ibv_post_recv) for the RNIC to process
- Completion Queue (CQ): ibv_create_cq() creates a queue where the RNIC posts completion notifications; ibv_poll_cq() retrieves completed work requests, enabling polling-based low-latency processing (the setup sketch after this list shows these calls in order)
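Putting the four primitives together, here is a minimal setup-and-teardown sketch; the buffer size and queue depths are arbitrary, and connection establishment (exchanging QP numbers and LIDs, then transitioning the QP through INIT/RTR/RTS, typically via librdmacm or a TCP side channel) is omitted.

```c
/* Sketch: the canonical verbs resource-setup sequence -- device, PD, MR, CQ, QP.
 * Error handling is abbreviated for brevity. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 4096

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    /* Protection domain: isolation boundary for MRs and QPs. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register (and pin) a buffer; flags grant the peer write/read access. */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    /* Completion queue: the RNIC deposits work completions here. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* Reliable Connected QP sharing one CQ for sends and receives. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 128, .max_recv_wr = 128,
                 .max_send_sge = 1,  .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* These values are what you advertise to the peer out of band. */
    printf("QP number: %u, rkey: %u\n", qp->qp_num, mr->rkey);

    /* Teardown in reverse order of creation. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```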
RDMA Operations:
- RDMA Write: ibv_post_send with IBV_WR_RDMA_WRITE; transfers data from a local buffer to a specified remote memory address without remote CPU involvement; requires knowing the remote address and rkey
- RDMA Read: ibv_post_send with IBV_WR_RDMA_READ; fetches data from remote memory into a local buffer, enabling pull-based data access patterns
- Atomic Operations: IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD perform atomic compare-and-swap or fetch-and-add on 64-bit remote memory locations, enabling distributed lock-free data structures
- Send/Receive: traditional two-sided messaging; the receiver must pre-post receive buffers, and the sender's data is placed in the next available receive buffer; a simpler programming model, but it requires CPU involvement on both sides (the receive-side sketch after this list shows the pre-post and polling loop)
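The receive side of Send/Receive pairs a pre-posted buffer with a polling loop over the CQ. The sketch below assumes the qp, cq, and mr from the setup sketch above, and busy-polls for simplicity (trading a burned core for the lowest latency).

```c
/* Sketch: pre-posting a receive and busy-polling for its completion.
 * Assumes the qp, cq, and mr created in the setup sketch. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

static int post_receive(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t id)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };
    /* Must be posted before the peer's send arrives; on RC, an unmatched
     * send triggers receiver-not-ready (RNR) handling. */
    struct ibv_recv_wr wr = { .wr_id = id, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);
}

static void wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking; returns 0 if CQ is empty */
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        fprintf(stderr, "completion error: %s\n", ibv_wc_status_str(wc.status));
}
```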
Performance Optimization:
- Doorbell Batching: post multiple work requests before ringing the doorbell (an MMIO write to the RNIC); this cuts MMIO overhead from one write per request to one per batch (combined with the next two techniques in the sketch after this list)
- Inline Sends: small messages (commonly up to 64 bytes, bounded by the QP's max_inline_data attribute) can be copied into the work request descriptor itself; this eliminates a DMA read by the RNIC, cutting small-message latency by roughly 200-400 ns
- Selective Signaling: request a completion notification only for every Nth work request; this cuts CQ polling overhead and RNIC completion processing by a factor of N
- Shared Receive Queue (SRQ): multiple QPs draw from a single receive buffer pool; this cuts per-connection memory overhead from O(connections × buffers) to O(total_buffers)
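The first three techniques compose naturally in one send path. The sketch below, assuming a connected RC QP created with sq_sig_all = 0 and a sufficiently large max_inline_data, chains a batch of writes through wr.next so that a single ibv_post_send rings the doorbell once, signals only the final work request, and inlines payloads below an assumed 64-byte threshold.

```c
/* Sketch: doorbell batching + selective signaling + inline sends.
 * Assumes an RC QP created with sq_sig_all = 0, so unsignaled WRs generate
 * no completions (they still occupy send-queue slots until a later signaled
 * completion is reaped). The 64-byte inline threshold is an assumption. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define BATCH 16

int post_write_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                     uint64_t remote_addr, uint32_t rkey, uint32_t msg_len)
{
    struct ibv_sge     sge[BATCH];
    struct ibv_send_wr wr[BATCH], *bad;

    memset(wr, 0, sizeof(wr));
    for (int i = 0; i < BATCH; i++) {
        sge[i].addr   = (uintptr_t)mr->addr + (uint64_t)i * msg_len;
        sge[i].length = msg_len;
        sge[i].lkey   = mr->lkey;

        wr[i].wr_id               = i;
        wr[i].opcode              = IBV_WR_RDMA_WRITE;
        wr[i].sg_list             = &sge[i];
        wr[i].num_sge             = 1;
        wr[i].wr.rdma.remote_addr = remote_addr + (uint64_t)i * msg_len;
        wr[i].wr.rdma.rkey        = rkey;
        /* Chain the WRs so one post (one doorbell MMIO) submits them all. */
        wr[i].next                = (i + 1 < BATCH) ? &wr[i + 1] : NULL;

        wr[i].send_flags = 0;
        if (msg_len <= 64)                 /* fits the assumed inline limit */
            wr[i].send_flags |= IBV_SEND_INLINE;
    }
    /* Selective signaling: only the last WR produces a CQ entry. */
    wr[BATCH - 1].send_flags |= IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr[0], &bad);  /* one doorbell for all 16 WRs */
}
```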
RDMA is the networking technology that makes modern AI supercomputers possible: NVIDIA's DGX SuperPOD clusters use InfiniBand RDMA to connect thousands of GPUs with the low latency and high bandwidth needed for efficient distributed training of models with hundreds of billions of parameters.