Ring AllReduce

Keywords: ring allreduce algorithm, ring communication, bandwidth optimal allreduce, allreduce collective

Ring AllReduce is a bandwidth-optimal collective communication algorithm that reduces and broadcasts data across N workers by passing partial results around a logical ring. For a vector of size S, each worker sends and receives only 2S(N-1)/N data, matching the theoretical lower bound, and this per-worker traffic stays nearly constant as more workers are added. These properties make it the dominant algorithm for synchronizing gradients in data-parallel deep learning training.

The AllReduce Problem

- N workers each hold a vector of size S.
- Goal: Every worker ends up with the element-wise sum (or average) of all N vectors.
- Naive approach (send all to one, reduce, broadcast): bottlenecked on a single root node, which must move O(N×S) data (sketched below).
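
As a point of reference, here is a minimal single-process sketch of the naive approach, with NumPy arrays standing in for each worker's vector; the function and variable names are illustrative rather than taken from any library.

```python
import numpy as np

def naive_allreduce(vectors):
    """Naive allreduce: gather everything at a root, reduce, broadcast.

    The root receives (N-1)*S elements and sends the result back out,
    so its traffic grows linearly with the number of workers.
    """
    root_sum = np.zeros_like(vectors[0])
    for v in vectors:            # root receives every worker's vector
        root_sum += v            # reduce at the root
    return [root_sum.copy() for _ in vectors]  # broadcast result to every worker

# Example: 4 workers, each holding a vector of size 8
workers = [np.full(8, rank, dtype=np.float64) for rank in range(4)]
result = naive_allreduce(workers)
assert all(np.array_equal(r, np.full(8, 6.0)) for r in result)  # 0+1+2+3 = 6
```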

Ring AllReduce — Two Phases

Phase 1: Reduce-Scatter (N-1 steps)
1. Divide each worker's data into N chunks.
2. Step 1: Worker i sends chunk i to worker (i+1) mod N; receives chunk (i-1) mod N from worker (i-1) mod N; adds the received chunk into its own copy of that chunk.
3. Repeat for N-1 steps: at each step, every worker forwards the chunk it accumulated in the previous step, so each chunk travels one hop per step and picks up one more worker's partial sum.
4. After N-1 steps: each worker holds exactly one fully reduced chunk (with the numbering above, worker i holds chunk (i+1) mod N).

Phase 2: AllGather (N-1 steps)
1. Same ring pattern, but now each worker forwards fully reduced chunks; a received chunk is simply copied instead of accumulated.
2. After N-1 steps: Every worker has all N fully-reduced chunks = complete AllReduce result.
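
The two phases can be illustrated with a small single-process simulation, where Python lists stand in for workers and copying a chunk between lists stands in for one send/receive over the ring; the indexing below is one common convention, and all names are illustrative.

```python
import numpy as np

def ring_allreduce(vectors):
    """Single-process simulation of ring allreduce (reduce-scatter + allgather)."""
    n = len(vectors)
    # Divide each worker's data into n chunks (same boundaries on every worker).
    chunks = [list(np.array_split(v.astype(np.float64), n)) for v in vectors]

    # Phase 1: reduce-scatter. At step t, worker i sends chunk (i - t) mod n to
    # worker (i + 1) mod n, which adds it into its own copy of that chunk.
    for t in range(n - 1):
        outgoing = [chunks[i][(i - t) % n].copy() for i in range(n)]
        for i in range(n):
            dst, c = (i + 1) % n, (i - t) % n
            chunks[dst][c] = chunks[dst][c] + outgoing[i]
    # Worker i now holds the fully reduced chunk (i + 1) mod n.

    # Phase 2: allgather. At step t, worker i forwards fully reduced chunk
    # (i + 1 - t) mod n to worker (i + 1) mod n, which overwrites its copy.
    for t in range(n - 1):
        outgoing = [chunks[i][(i + 1 - t) % n].copy() for i in range(n)]
        for i in range(n):
            dst, c = (i + 1) % n, (i + 1 - t) % n
            chunks[dst][c] = outgoing[i]

    return [np.concatenate(chunks[i]) for i in range(n)]

# Example: 4 workers, each holding a different vector of length 10.
workers = [np.arange(10) * (rank + 1) for rank in range(4)]
expected = sum(w.astype(np.float64) for w in workers)
for out in ring_allreduce(workers):
    assert np.allclose(out, expected)
```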

Bandwidth Analysis

| Algorithm | Data Transferred (per worker) | Latency (steps) |
|-----------|------------------------------|----------------|
| Naive (tree) | 2S | 2 log₂(N) |
| Ring AllReduce | 2S × (N-1)/N | 2(N-1) |
| Recursive Halving-Doubling | 2S × (N-1)/N | 2 log₂(N) |

- Ring is bandwidth-optimal: Each link carries exactly the minimum data required.
- Ring has high latency: 2(N-1) steps — worse than tree for small messages.
- For gradient sync (large messages, typically hundreds of MB to several GB of gradients): bandwidth dominates → Ring wins (a quick numeric check follows).
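
As a sanity check of the table, the small helper below (illustrative names, using the per-worker costs listed above) reproduces the 1.75 GB figure that appears in the gradient-sync example later in this section.

```python
import math

def per_worker_traffic_gb(algorithm, n_workers, size_gb):
    """Per-worker data moved (in GB) for the algorithms in the table above."""
    if algorithm == "tree":
        return 2 * size_gb                                 # reduce up + broadcast down
    if algorithm in ("ring", "halving-doubling"):
        return 2 * size_gb * (n_workers - 1) / n_workers   # bandwidth lower bound
    raise ValueError(algorithm)

def steps(algorithm, n_workers):
    """Number of communication steps (latency term)."""
    if algorithm == "ring":
        return 2 * (n_workers - 1)
    return 2 * math.ceil(math.log2(n_workers))             # tree / halving-doubling

print(per_worker_traffic_gb("ring", 8, 1.0))   # 1.75
print(steps("ring", 8), steps("tree", 8))      # 14 6
```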

Implementation in Practice

- NCCL (NVIDIA): Implements ring AllReduce over NVLink and InfiniBand.
  - Automatically selects ring vs. tree based on message size and topology.
  - For large messages (> 256KB): Ring AllReduce is the default.
- Gloo (Meta): CPU-based ring AllReduce for PyTorch.
- Horovod: Originally popularized ring AllReduce for distributed deep learning.
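
In PyTorch, these libraries sit behind the torch.distributed backends, so a gradient-style AllReduce reduces to a single collective call. The sketch below assumes a torchrun-style launch that sets the usual rendezvous environment variables; the helper name and tensor shape are illustrative.

```python
import os
import torch
import torch.distributed as dist

def allreduce_average(tensor: torch.Tensor) -> torch.Tensor:
    """Sum a tensor across all ranks, then divide by world size to average."""
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor /= dist.get_world_size()
    return tensor

if __name__ == "__main__":
    # Launch with e.g. `torchrun --nproc_per_node=8 this_script.py`.
    # NCCL handles GPU tensors (ring/tree under the hood); Gloo handles CPU tensors.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}") if backend == "nccl" else torch.device("cpu")

    # Stand-in for a flattened gradient tensor.
    grads = torch.full((1 << 20,), float(dist.get_rank()), device=device)
    allreduce_average(grads)
    print(f"rank {dist.get_rank()}: {grads[0].item()}")  # same value on every rank

    dist.destroy_process_group()
```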

Ring AllReduce for Gradient Sync

- Each GPU computes gradients locally → ring AllReduce averages gradients across all GPUs.
- With 8 GPUs and 1 GB of gradients:
  - Each GPU sends/receives $2 \times 1\,\text{GB} \times \frac{7}{8} = 1.75\,\text{GB}$ in total.
  - Over NVLink (300 GB/s bidirectional): ~6 ms.
  - This is overlapped with backward computation → nearly free.
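
The same arithmetic as a short check, with the 300 GB/s link speed taken as an assumption from the bullet above.

```python
n_gpus = 8
grad_gb = 1.0               # 1 GB of gradients per GPU
link_gb_per_s = 300.0       # assumed NVLink bidirectional bandwidth

traffic_gb = 2 * grad_gb * (n_gpus - 1) / n_gpus   # 1.75 GB per GPU
time_ms = traffic_gb / link_gb_per_s * 1000        # ~5.8 ms
print(traffic_gb, round(time_ms, 1))
```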

Ring AllReduce is the algorithm that enabled efficient multi-GPU deep learning — its bandwidth-optimal scaling means that adding more GPUs for data-parallel training incurs minimal communication overhead, directly enabling the large-scale training runs behind modern language models.
