RDMA Programming

RDMA Programming is the paradigm of direct memory access between remote systems without CPU or OS involvement — enabling applications to read from or write to remote memory with sub-microsecond latency and near-zero CPU overhead by offloading data transfer to specialized network hardware, fundamentally changing the performance characteristics of distributed systems from CPU-bound to network-bound.

RDMA Operation Types:
- RDMA Write: local application writes data directly to remote memory; remote CPU is not notified or interrupted; one-sided operation requires only the initiator to be involved; typical use: pushing gradient updates to parameter server without waking the server CPU
- RDMA Read: local application reads data from remote memory; remote CPU unaware of the operation; higher latency than Write (requires round-trip for data return) but still <2μs; use case: fetching model parameters from remote GPU memory during distributed inference
- RDMA Send/Receive: two-sided operation requiring both sender and receiver to post matching operations; receiver must pre-post Receive buffers; provides message boundaries and ordering guarantees; used when receiver needs notification of incoming data
- RDMA Atomic: atomic compare-and-swap or fetch-and-add on remote memory; enables lock-free distributed data structures; critical for parameter server implementations where multiple workers atomically update shared parameters

Memory Registration and Protection:
- Registration Process: application calls ibv_reg_mr() to register a memory region; kernel pins physical pages (prevents swapping), creates DMA mapping, and returns L_Key (local access) and R_Key (remote access); registration is expensive (microseconds per MB) — applications cache registrations
- Memory Windows: dynamic sub-regions of registered memory with separate R_Keys; enables fine-grained access control without re-registering entire buffers; Type 1 windows bound at creation, Type 2 windows bound dynamically via Bind operations
- Access Permissions: registration specifies allowed operations (Local Write, Remote Write, Remote Read, Remote Atomic); HCA enforces permissions in hardware; attempting unauthorized access generates error completion
- Deregistration: ibv_dereg_mr() unpins pages and invalidates keys; must ensure no outstanding RDMA operations reference the region; improper deregistration causes segmentation faults or data corruption

Programming Model:
- Queue Pair Setup: create QP with ibv_create_qp(), transition through states (RESET → INIT → RTR → RTS) using ibv_modify_qp(); exchange QP numbers and GIDs with remote peer (out-of-band via TCP or shared file system)
- Posting Operations: construct Work Request (WR) with opcode (RDMA_WRITE, RDMA_READ, SEND), local buffer scatter-gather list, remote address/R_Key (for RDMA ops); call ibv_post_send() to submit WR to HCA; non-blocking call returns immediately
- Completion Polling: call ibv_poll_cq() to check Completion Queue for finished operations; CQE contains status (success/error), WR identifier, and byte count; polling is more efficient than event-driven for high-rate operations (avoids context switches)
- Signaling: not all WRs generate CQEs; applications set IBV_SEND_SIGNALED flag on periodic WRs (e.g., every 64th operation) to reduce CQ traffic; unsignaled WRs complete silently — application infers completion from signaled WR

Performance Optimization:
- Inline Data: small messages (<256 bytes) embedded directly in WR; avoids DMA setup overhead; reduces latency by 20-30% for small transfers; critical for latency-sensitive control messages
- Doorbell Batching: multiple WRs posted before ringing doorbell (writing to HCA MMIO register); amortizes doorbell cost across operations; improves throughput by 2-3× for small messages
- Selective Signaling: only signal every Nth operation to reduce CQ contention; application tracks outstanding unsignaled operations; must signal before QP runs out of send queue slots
- Memory Alignment: align buffers to cache line boundaries (64 bytes); prevents false sharing and improves DMA efficiency; misaligned buffers can reduce bandwidth by 10-15%

Common Patterns:
- Rendezvous Protocol: sender sends small notification via Send/Recv; receiver responds with RDMA Write permission (address + R_Key); sender performs RDMA Write of large payload; avoids receiver buffer exhaustion from unexpected large messages
- Circular Buffers: pre-registered ring buffer for streaming data; producer RDMA Writes to next slot, consumer polls for new data; eliminates per-message registration overhead; requires careful synchronization to prevent overwrites
- Aggregation Buffers: batch small updates into larger RDMA operations; reduces per-operation overhead; trade-off between latency (waiting for batch to fill) and efficiency (fewer operations)
- Persistent Connections: maintain QPs across multiple operations; connection setup (QP state transitions, address exchange) is expensive (milliseconds); amortize over thousands of operations

Error Handling:
- Completion Errors: WR failures generate error CQEs with status codes (remote access error, transport retry exceeded, local protection error); application must drain QP and reset to recover
- Timeout and Retry: HCA automatically retries lost packets; configurable timeout and retry count; excessive retries indicate network congestion or remote failure
- QP State Machine: errors transition QP to ERROR state; must drain outstanding WRs, then reset QP to RESET state before reuse; improper error handling leaves QP in unusable state

RDMA programming is the low-level foundation that enables high-performance distributed systems — by eliminating CPU overhead and achieving sub-microsecond latency, RDMA transforms the economics of distributed computing, making communication so cheap that entirely new architectures (disaggregated memory, remote GPU access, distributed shared memory) become practical.

Want to learn more?