Communication-Computation Overlap

Keywords: communication computation overlap, async comm, overlap transfer compute, latency hiding comm, pipeline communication

Communication-computation overlap is the technique of executing data transfers concurrently with useful computation, hiding the latency of inter-GPU, inter-node, or device-host communication behind productive work. It is among the most important optimizations for scaling distributed training and HPC applications efficiently across many devices.

Why Overlap Matters

- Without overlap (serial): $T_{total} = T_{compute} + T_{comm}$.
- With perfect overlap: $T_{total} = \max(T_{compute}, T_{comm})$.
- At scale (hundreds of GPUs), communication can account for 30-50% of step time; overlap recovers most of it (see the worked example below).
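As a worked example: with $T_{compute} = 100$ ms and $T_{comm} = 40$ ms per step, serial execution takes $100 + 40 = 140$ ms, while perfect overlap takes $\max(100, 40) = 100$ ms, hiding all 40 ms of communication. Once $T_{comm}$ exceeds $T_{compute}$, overlap alone can no longer hide it and the workload becomes communication-bound.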

Overlap Techniques in Distributed Training

1. Gradient AllReduce Overlap (DDP standard)
- Backward pass computes gradients layer by layer.
- As soon as a layer's gradient is ready → start the AllReduce for that layer (in practice, gradients are grouped into buckets).
- While the AllReduce runs → the backward pass continues computing the next layer's gradients.
- Result: AllReduce mostly hidden behind backward computation (see the sketch below).
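A minimal PyTorch sketch of this pattern. DDP does the overlap automatically: it registers autograd hooks and launches an async AllReduce per gradient bucket (25 MB by default) while backward continues. The model, sizes, and launch setup here are illustrative:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched via torchrun; one process per GPU.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# DDP buckets gradients; when a bucket is full it launches an async
# AllReduce on a communication stream while backward keeps computing.
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

x = torch.randn(32, 1024, device="cuda")
loss = model(x).sum()
loss.backward()  # AllReduce of earlier buckets overlaps later layers' backward
```

The knob most relevant to overlap is `bucket_cap_mb`: larger buckets mean fewer, more bandwidth-efficient AllReduces, but the first one launches later.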

2. Prefetch Parameters (FSDP/ZeRO-3)
- FSDP must all-gather parameters before each layer's forward pass.
- Prefetch: Start all-gathering layer N+1 while computing layer N.
- Result: Communication for the next layer overlaps with the current layer's computation (see the sketch below).
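A hedged sketch of the corresponding PyTorch FSDP knobs (the wrapping policy and model are illustrative, and the exact API depends on the PyTorch version; prefetch operates at the granularity of FSDP-wrapped units):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, BackwardPrefetch
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

dist.init_process_group("nccl")

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()

model = FSDP(
    model,
    # Wrap each Linear as its own FSDP unit so there is a "next layer" to prefetch.
    auto_wrap_policy=ModuleWrapPolicy({torch.nn.Linear}),
    # Issue the all-gather for unit N+1 while unit N's forward is still computing.
    forward_prefetch=True,
    # In backward, prefetch the next unit's parameters during the current
    # unit's gradient computation.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)
```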

3. Pipeline Parallelism Overlap
- Stage N interleaves the forward of microbatch K with the backward of microbatch K-1 (the 1F1B schedule).
- Different stages process different microbatches simultaneously.
- Pipeline fill/drain bubbles remain, but steady state achieves full overlap (quantified below).
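One way to quantify what remains un-overlapped: for a 1F1B-style schedule with $S$ pipeline stages and $M$ microbatches, the fill/drain bubble occupies roughly $\frac{S-1}{M+S-1}$ of total step time. For example, $S = 4$ stages and $M = 16$ microbatches gives $3/19 \approx 16\%$; increasing $M$ shrinks the bubble toward zero while steady-state work stays fully overlapped.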

Implementation on GPUs

| Mechanism | GPU Support | Use Case |
|-----------|-----------|----------|
| CUDA Streams | All NVIDIA GPUs | Overlap kernel execution with memcpy |
| GPUDirect RDMA | InfiniBand + NVIDIA GPUs | NIC reads GPU memory directly, no CPU staging copy |
| NCCL async ops | NCCL 2.x+ | Non-blocking collective operations |
| cudaMemcpyAsync | All CUDA GPUs | Async host↔device transfers (requires pinned host memory for true overlap) |

CUDA Stream Overlap Pattern

- Stream 1: Compute kernel.
- Stream 2: Communication (NCCL AllReduce or memcpy).
- Both streams execute concurrently on different GPU hardware units.
- The GPU has dedicated copy engines separate from the compute SMs → true overlap.
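A minimal PyTorch sketch of the two-stream pattern (tensors and sizes are illustrative; pinned host memory is assumed so the copy engine can run asynchronously):

```python
import torch

comm_stream = torch.cuda.Stream()  # side stream for transfers

x = torch.randn(4096, 4096, device="cuda")
host_buf = torch.randn(4096, 4096, pin_memory=True)  # pinned memory enables true async copy
dev_buf = torch.empty(4096, 4096, device="cuda")

with torch.cuda.stream(comm_stream):
    # Runs on a copy engine, concurrently with the matmul below.
    dev_buf.copy_(host_buf, non_blocking=True)

y = x @ x  # compute kernel on the default stream

# Make the default stream wait until the copy finishes before consuming dev_buf.
torch.cuda.current_stream().wait_stream(comm_stream)
z = y + dev_buf
```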

Measuring Overlap Efficiency

- Overlap ratio: $\frac{T_{serial} - T_{overlapped}}{T_{comm}}$
- 100% = perfect overlap (all communication hidden).
- 0% = no overlap (fully serial).
- Profile with NVIDIA Nsight Systems: its visual timeline shows concurrent stream execution.
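As a sketch, the metric is straightforward to compute from wall-clock measurements (the function name and timings here are hypothetical):

```python
def overlap_ratio(t_serial: float, t_overlapped: float, t_comm: float) -> float:
    """Fraction of communication time hidden by overlap."""
    return (t_serial - t_overlapped) / t_comm

# With the worked numbers from above: 140 ms serial, 100 ms overlapped, 40 ms comm.
print(overlap_ratio(0.140, 0.100, 0.040))  # 1.0 -> all communication hidden
```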

Challenges

- Data dependencies: Cannot prefetch too far ahead → prefetch depth is limited by dataflow order.
- Memory pressure: Prefetching requires buffering data → increases memory usage.
- Synchronization: Must ensure communication completes before its result is consumed (see the double-buffering sketch below).
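A hedged double-buffering sketch illustrating both trade-offs: two device buffers bound the extra memory, and CUDA events provide the required synchronization in both directions (all names and sizes are illustrative):

```python
import torch

copy_stream = torch.cuda.Stream()
NBUF = 2  # prefetch depth of one -> memory overhead bounded at 2 buffers
bufs = [torch.empty(1024, 1024, device="cuda") for _ in range(NBUF)]
copied = [torch.cuda.Event() for _ in range(NBUF)]    # copy into slot finished
consumed = [torch.cuda.Event() for _ in range(NBUF)]  # compute reading slot finished

host_batches = [torch.randn(1024, 1024, pin_memory=True) for _ in range(8)]

def prefetch(i: int) -> None:
    slot = i % NBUF
    with torch.cuda.stream(copy_stream):
        copy_stream.wait_event(consumed[slot])  # never overwrite a buffer still being read
        bufs[slot].copy_(host_batches[i], non_blocking=True)
        copied[slot].record(copy_stream)

for ev in consumed:  # mark all buffers as initially free
    ev.record()

prefetch(0)
for i in range(len(host_batches)):
    slot = i % NBUF
    if i + 1 < len(host_batches):
        prefetch(i + 1)  # next batch's copy overlaps this batch's compute
    torch.cuda.current_stream().wait_event(copied[slot])  # data dependency: wait for copy
    out = bufs[slot] @ bufs[slot]
    consumed[slot].record()  # slot is free for reuse once this compute finishes
```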

Communication-computation overlap is the fundamental technique that makes distributed computing practical; without it, the communication overhead of multi-GPU and multi-node training would make scaling beyond a few devices economically infeasible.
