Communication-computation overlap is the technique of executing data transfers concurrently with useful computation, hiding the latency of inter-GPU, inter-node, or device-host communication behind productive work. It is one of the most important optimizations for scaling distributed training and HPC applications efficiently across many devices.
Why Overlap Matters
- Without overlap: Total time = Compute + Communication (serial).
- With overlap: Total time = max(Compute, Communication).
- At scale (hundreds of GPUs): Communication can consume 30-50% of total step time; overlap recovers most of it.
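The two formulas above can be made concrete with a quick back-of-the-envelope calculation. The numbers below are purely illustrative, not measurements:

```python
# Hypothetical per-step costs: 700 ms of compute, 300 ms of communication
# (30% of the serial step time, at the low end of the range above).
compute_ms = 700.0
comm_ms = 300.0

serial_ms = compute_ms + comm_ms          # no overlap: compute + comm
overlapped_ms = max(compute_ms, comm_ms)  # perfect overlap: max(compute, comm)

savings = 1 - overlapped_ms / serial_ms
print(f"serial={serial_ms} ms, overlapped={overlapped_ms} ms, saved {savings:.0%}")
# serial=1000.0 ms, overlapped=700.0 ms, saved 30%
```

With perfect overlap, the entire communication cost disappears as long as compute time dominates; once communication exceeds compute, it becomes the bottleneck even with perfect overlap.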
Overlap Techniques in Distributed Training
1. Gradient AllReduce Overlap (DDP standard)
- Backward pass computes gradients layer by layer.
- As soon as a layer's gradient is ready, an AllReduce for that layer (or gradient bucket) is launched.
- While that AllReduce runs, the backward pass continues computing the next layer's gradients.
- Result: AllReduce is mostly hidden behind backward computation.
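The scheduling idea above can be sketched in pure Python. This is a minimal simulation, not DDP's actual implementation: the thread pool stands in for the communication stream, the lambda stands in for NCCL AllReduce, and all names and values are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def backward_with_overlap(layer_grads, allreduce):
    # Backward walks layers in reverse; each finished gradient is handed to
    # the communicator immediately, so the (simulated) AllReduce runs while
    # later layers' gradients are still being computed.
    pool = ThreadPoolExecutor(max_workers=1)  # stands in for the comm stream
    pending = []
    for name, grad in reversed(layer_grads):
        pending.append(pool.submit(allreduce, name, grad))  # launch async comm
        # ... the next layer's gradient would be computed here, concurrently ...
    results = [f.result() for f in pending]  # sync before optimizer.step()
    pool.shutdown()
    return results

# Toy "allreduce": average each gradient over a world size of 4.
reduced = backward_with_overlap(
    [("layer1", 4.0), ("layer2", 8.0)],
    lambda name, g: (name, g / 4),
)
print(reduced)  # [('layer2', 2.0), ('layer1', 1.0)]
```

Real DDP implements this with per-bucket autograd hooks and launches the collectives on a dedicated CUDA stream, but the dependency structure is the same: launch early, synchronize only once before the optimizer needs the reduced gradients.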
2. Prefetch Parameters (FSDP/ZeRO-3)
- FSDP must all-gather parameters before each layer's forward pass.
- Prefetch: Start the all-gather for layer N+1 while layer N is computing.
- Result: Communication for next layer overlaps with current layer's computation.
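The prefetch pattern can also be sketched with a background thread. Again a hedged simulation with illustrative names: the thread pool stands in for the communication stream, and the `all_gather`/`compute` callables stand in for parameter gathering and the layer's forward pass:

```python
from concurrent.futures import ThreadPoolExecutor

def forward_with_prefetch(layers, all_gather, compute):
    # While layer i runs forward, the (simulated) all-gather for layer i+1
    # is already in flight on the communication thread.
    pool = ThreadPoolExecutor(max_workers=1)
    x = 1.0
    handle = pool.submit(all_gather, layers[0])  # gather first layer's params
    for i, _ in enumerate(layers):
        params = handle.result()  # wait only for THIS layer's parameters
        if i + 1 < len(layers):
            handle = pool.submit(all_gather, layers[i + 1])  # prefetch next
        x = compute(params, x)  # overlaps with the in-flight prefetch
    pool.shutdown()
    return x

out = forward_with_prefetch(
    [2.0, 3.0],
    all_gather=lambda shard: shard,  # stand-in for gathering full weights
    compute=lambda w, x: w * x,
)
print(out)  # 6.0
```

The key invariant: the program only blocks on the all-gather for the layer it is about to run, never on future layers, so communication for layer N+1 is paid for by layer N's compute time.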
3. Pipeline Parallelism Overlap
- While microbatch K is in forward on stage N, microbatch K-1 is in backward on stage N.
- Different stages process different microbatches simultaneously.
- Pipeline fill/drain bubbles remain, but the steady state achieves full overlap.
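The fill/drain cost mentioned above has a well-known closed form for a GPipe-style schedule: with $p$ stages and $m$ microbatches, the bubble fraction is $(p-1)/(m+p-1)$. A small helper makes the scaling behavior visible:

```python
def bubble_fraction(stages, microbatches):
    # Fraction of pipeline time lost to fill/drain bubbles in a
    # GPipe-style schedule: (p - 1) / (m + p - 1).
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(4, 4))   # ~0.43: too few microbatches per stage
print(bubble_fraction(4, 32))  # ~0.086: many microbatches amortize the bubble
```

This is why pipeline-parallel training uses many microbatches per step: the bubble shrinks as $m$ grows, approaching the full steady-state overlap described above.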
Implementation on GPUs
| Mechanism | GPU Support | Use Case |
|-----------|-----------|----------|
| CUDA Streams | All NVIDIA GPUs | Overlap kernel execution with memcpy |
| GPUDirect RDMA | IB + NVIDIA GPU | NIC reads GPU memory directly (no CPU staging copy) |
| NCCL async ops | NCCL 2.x+ | Non-blocking collective operations |
| cudaMemcpyAsync | All | Async host-device transfers |
CUDA Stream Overlap Pattern
- Stream 1: Compute kernel.
- Stream 2: Communication (NCCL AllReduce or memcpy).
- Both streams execute concurrently on different GPU hardware units.
- GPUs have dedicated copy engines separate from the compute SMs, so true overlap is possible in hardware.
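The two-stream pattern can be mimicked with two host threads, each sleeping to stand in for work on a separate hardware unit. This is only a timing sketch, not real GPU code; the point is that the wall time tracks max(compute, comm), not their sum:

```python
import threading
import time

events = []

def compute():       # stands in for kernels on the compute stream
    time.sleep(0.1)
    events.append("compute_done")

def communicate():   # stands in for an async collective on the comm stream
    time.sleep(0.1)
    events.append("comm_done")

t0 = time.perf_counter()
threads = [threading.Thread(target=compute), threading.Thread(target=communicate)]
for t in threads:
    t.start()
for t in threads:
    t.join()         # explicit sync point, analogous to cudaStreamSynchronize
elapsed = time.perf_counter() - t0
print(f"overlapped wall time: {elapsed * 1000:.0f} ms")  # ~100 ms, not ~200 ms
```

In real CUDA code the same shape appears as two `cudaStream_t` handles, with kernels launched on one and `cudaMemcpyAsync`/NCCL calls enqueued on the other, joined by a stream synchronize or event wait.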
Measuring Overlap Efficiency
- Overlap ratio: $\frac{T_{serial} - T_{overlapped}}{T_{comm}}$
- 100% = perfect overlap (all communication hidden).
- 0% = no overlap (fully serial).
- Profile with NVIDIA Nsight Systems: Visual timeline shows concurrent stream execution.
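The overlap-ratio formula above translates directly into a one-line helper; the timings passed in would come from a profiler such as Nsight Systems:

```python
def overlap_ratio(t_serial, t_overlapped, t_comm):
    # (T_serial - T_overlapped) / T_comm: the fraction of standalone
    # communication time that was hidden behind computation.
    return (t_serial - t_overlapped) / t_comm

print(overlap_ratio(1000, 700, 300))   # 1.0 -> perfect overlap
print(overlap_ratio(1000, 1000, 300))  # 0.0 -> fully serial
```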
Challenges
- Data dependencies: Cannot prefetch arbitrarily far ahead; overlap is limited by the data-flow order.
- Memory pressure: Prefetching requires extra buffers, increasing memory usage.
- Synchronization: Must ensure communication completes before result is needed.
Communication-computation overlap is the fundamental technique that makes distributed computing practical: without it, the communication overhead of multi-GPU and multi-node training would make scaling beyond a few devices economically infeasible.