Communication-computation overlap is the technique of executing data transfers concurrently with useful computation, hiding the latency of inter-GPU, inter-node, or device-host communication behind productive work. It is one of the most important optimizations for scaling distributed training and HPC applications efficiently across many devices.
Why Overlap Matters
- Without overlap: Total time = Compute + Communication (serial).
- With overlap: Total time = max(Compute, Communication).
- At scale (hundreds of GPUs): Communication can consume 30-50% of total step time; overlap recovers most of it.
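The two formulas above can be made concrete with a quick back-of-the-envelope calculation. The numbers below are purely illustrative, not measurements:

```python
# Hypothetical per-step costs: 700 ms of compute, 300 ms of communication
# (30% of the serial step time, at the low end of the range above).
compute_ms = 700.0
comm_ms = 300.0

serial_ms = compute_ms + comm_ms          # no overlap: compute + comm
overlapped_ms = max(compute_ms, comm_ms)  # perfect overlap: max(compute, comm)

savings = 1 - overlapped_ms / serial_ms
print(f"serial={serial_ms} ms, overlapped={overlapped_ms} ms, saved {savings:.0%}")
# serial=1000.0 ms, overlapped=700.0 ms, saved 30%
```

With perfect overlap, the entire communication cost disappears as long as compute time dominates; once communication exceeds compute, it becomes the bottleneck even with perfect overlap.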
Overlap Techniques in Distributed Training
1. Gradient AllReduce Overlap (DDP standard)
- Backward pass computes gradients layer by layer.
- As soon as a layer's gradient is ready, an AllReduce for that layer (or gradient bucket) is launched.
- While that AllReduce runs, the backward pass continues computing the next layer's gradients.
- Result: AllReduce is mostly hidden behind backward computation.
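The scheduling idea above can be sketched in pure Python. This is a minimal simulation, not DDP's actual implementation: the thread pool stands in for the communication stream, the lambda stands in for NCCL AllReduce, and all names and values are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def backward_with_overlap(layer_grads, allreduce):
    # Backward walks layers in reverse; each finished gradient is handed to
    # the communicator immediately, so the (simulated) AllReduce runs while
    # later layers' gradients are still being computed.
    pool = ThreadPoolExecutor(max_workers=1)  # stands in for the comm stream
    pending = []
    for name, grad in reversed(layer_grads):
        pending.append(pool.submit(allreduce, name, grad))  # launch async comm
        # ... the next layer's gradient would be computed here, concurrently ...
    results = [f.result() for f in pending]  # sync before optimizer.step()
    pool.shutdown()
    return results

# Toy "allreduce": average each gradient over a world size of 4.
reduced = backward_with_overlap(
    [("layer1", 4.0), ("layer2", 8.0)],
    lambda name, g: (name, g / 4),
)
print(reduced)  # [('layer2', 2.0), ('layer1', 1.0)]
```

Real DDP implements this with per-bucket autograd hooks and launches the collectives on a dedicated CUDA stream, but the dependency structure is the same: launch early, synchronize only once before the optimizer needs the reduced gradients.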
2. Prefetch Parameters (FSDP/ZeRO-3)
- FSDP must all-gather parameters before each layer's forward pass.
- Prefetch: Start the all-gather for layer N+1 while layer N is computing.
- Result: Communication for next layer overlaps with current layer's computation.
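The prefetch pattern can also be sketched with a background thread. Again a hedged simulation with illustrative names: the thread pool stands in for the communication stream, and the `all_gather`/`compute` callables stand in for parameter gathering and the layer's forward pass:

```python
from concurrent.futures import ThreadPoolExecutor

def forward_with_prefetch(layers, all_gather, compute):
    # While layer i runs forward, the (simulated) all-gather for layer i+1
    # is already in flight on the communication thread.
    pool = ThreadPoolExecutor(max_workers=1)
    x = 1.0
    handle = pool.submit(all_gather, layers[0])  # gather first layer's params
    for i, _ in enumerate(layers):
        params = handle.result()  # wait only for THIS layer's parameters
        if i + 1 < len(layers):
            handle = pool.submit(all_gather, layers[i + 1])  # prefetch next
        x = compute(params, x)  # overlaps with the in-flight prefetch
    pool.shutdown()
    return x

out = forward_with_prefetch(
    [2.0, 3.0],
    all_gather=lambda shard: shard,  # stand-in for gathering full weights
    compute=lambda w, x: w * x,
)
print(out)  # 6.0
```

The key invariant: the program only blocks on the all-gather for the layer it is about to run, never on future layers, so communication for layer N+1 is paid for by layer N's compute time.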
3. Pipeline Parallelism Overlap
- While microbatch K is in forward on stage N, microbatch K-1 is in backward on stage N.
- Different stages process different microbatches simultaneously.
- Pipeline fill/drain bubbles remain, but the steady state achieves full overlap.
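The fill/drain cost mentioned above has a well-known closed form for a GPipe-style schedule: with $p$ stages and $m$ microbatches, the bubble fraction is $(p-1)/(m+p-1)$. A small helper makes the scaling behavior visible:

```python
def bubble_fraction(stages, microbatches):
    # Fraction of pipeline time lost to fill/drain bubbles in a
    # GPipe-style schedule: (p - 1) / (m + p - 1).
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(4, 4))   # ~0.43: too few microbatches per stage
print(bubble_fraction(4, 32))  # ~0.086: many microbatches amortize the bubble
```

This is why pipeline-parallel training uses many microbatches per step: the bubble shrinks as $m$ grows, approaching the full steady-state overlap described above.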
Implementation on GPUs
| Mechanism | GPU Support | Use Case |
|-----------|-----------|----------|
| CUDA Streams | All NVIDIA GPUs | Overlap kernel execution with memcpy |
| GPUDirect RDMA | IB + NVIDIA GPU | NIC reads GPU memory directly (no CPU staging copy) |
| NCCL async ops | NCCL 2.x+ | Non-blocking collective operations |
| cudaMemcpyAsync | All | Async host-device transfers |
CUDA Stream Overlap Pattern
- Stream 1: Compute kernel.
- Stream 2: Communication (NCCL AllReduce or memcpy).
- Both streams execute concurrently on different GPU hardware units.
- GPUs have dedicated copy engines separate from the compute SMs, so true overlap is possible in hardware.
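The two-stream pattern can be mimicked with two host threads, each sleeping to stand in for work on a separate hardware unit. This is only a timing sketch, not real GPU code; the point is that the wall time tracks max(compute, comm), not their sum:

```python
import threading
import time

events = []

def compute():       # stands in for kernels on the compute stream
    time.sleep(0.1)
    events.append("compute_done")

def communicate():   # stands in for an async collective on the comm stream
    time.sleep(0.1)
    events.append("comm_done")

t0 = time.perf_counter()
threads = [threading.Thread(target=compute), threading.Thread(target=communicate)]
for t in threads:
    t.start()
for t in threads:
    t.join()         # explicit sync point, analogous to cudaStreamSynchronize
elapsed = time.perf_counter() - t0
print(f"overlapped wall time: {elapsed * 1000:.0f} ms")  # ~100 ms, not ~200 ms
```

In real CUDA code the same shape appears as two `cudaStream_t` handles, with kernels launched on one and `cudaMemcpyAsync`/NCCL calls enqueued on the other, joined by a stream synchronize or event wait.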
Measuring Overlap Efficiency
- Overlap ratio: $\frac{T_{serial} - T_{overlapped}}{T_{comm}}$
- 100% = perfect overlap (all communication hidden).
- 0% = no overlap (fully serial).
- Profile with NVIDIA Nsight Systems: Visual timeline shows concurrent stream execution.
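The overlap-ratio formula above translates directly into a one-line helper; the timings passed in would come from a profiler such as Nsight Systems:

```python
def overlap_ratio(t_serial, t_overlapped, t_comm):
    # (T_serial - T_overlapped) / T_comm: the fraction of standalone
    # communication time that was hidden behind computation.
    return (t_serial - t_overlapped) / t_comm

print(overlap_ratio(1000, 700, 300))   # 1.0 -> perfect overlap
print(overlap_ratio(1000, 1000, 300))  # 0.0 -> fully serial
```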
Challenges
- Data dependencies: Cannot prefetch arbitrarily far ahead; overlap is limited by the data-flow order.
- Memory pressure: Prefetching requires extra buffers, increasing memory usage.
- Synchronization: Must ensure communication completes before result is needed.
Communication-computation overlap is the fundamental technique that makes distributed computing practical: without it, the communication overhead of multi-GPU and multi-node training would make scaling beyond a few devices economically infeasible.