Ring AllReduce

Ring AllReduce is a bandwidth-optimal collective communication algorithm that reduces data across N workers and distributes the result to all of them by passing partial results around a logical ring. Each worker sends 2(N-1)/N times its data size in total, so the per-worker traffic stays below twice the data size no matter how many workers participate. This near-constant communication volume has made it a widely used algorithm for synchronizing gradients in data-parallel deep learning training.

The AllReduce Problem

Ring AllReduce — Two Phases

Phase 1: Reduce-Scatter (N-1 steps)
1. Divide each worker's data into N chunks.
2. Step 1: Worker i sends chunk i to worker (i+1) mod N, receives chunk (i-1) mod N from worker (i-1) mod N, and adds the received chunk to its own copy of that chunk.
3. Repeat for N-1 steps in total: at each step, every worker forwards the chunk it accumulated in the previous step, so partial sums travel around the ring.
4. After N-1 steps: Worker i holds the fully reduced chunk (i+1) mod N.
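The reduce-scatter steps can be sketched as a single-process simulation. This is an illustrative sketch, not a distributed implementation: the function name `ring_reduce_scatter` and the list-of-lists data layout are assumptions made for the example. It follows the send schedule in which worker i starts by sending chunk i, which leaves worker i holding the fully reduced chunk (i+1) mod N.

```python
def ring_reduce_scatter(data):
    """Simulate reduce-scatter on a ring of N workers (hypothetical helper).

    data[i] is worker i's input: a list of N chunks, each a list of numbers.
    Returns every worker's buffer after N-1 steps.
    """
    n = len(data)
    # Deep-copy so the caller's input is left untouched.
    buf = [[list(chunk) for chunk in worker] for worker in data]
    for step in range(n - 1):
        # Snapshot what each worker sends: all sends in a step are simultaneous.
        # At step s, worker i sends chunk (i - s) mod N.
        outgoing = [list(buf[i][(i - step) % n]) for i in range(n)]
        for i in range(n):
            src = (i - 1) % n          # left neighbour on the ring
            idx = (src - step) % n     # index of the chunk the neighbour sent
            buf[i][idx] = [a + b for a, b in zip(buf[i][idx], outgoing[src])]
    return buf


workers = [
    [[1], [2], [3]],        # worker 0
    [[10], [20], [30]],     # worker 1
    [[100], [200], [300]],  # worker 2
]
result = ring_reduce_scatter(workers)
# Worker i now holds the complete sum for chunk (i + 1) mod N.
reduced = [result[i][(i + 1) % 3] for i in range(3)]
```

Only one chunk per worker is complete at this point; the other entries in each buffer still hold partial sums and are overwritten in the second phase.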

Phase 2: AllGather (N-1 steps)
1. Same ring pattern, but each worker now forwards fully reduced chunks, and the receiver stores the incoming chunk instead of adding to it.
2. After N-1 steps: Every worker has all N fully reduced chunks, which is the complete AllReduce result.
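The allgather phase can be sketched the same way. The example below is a single-process simulation with an assumed starting state: worker i already owns the fully reduced chunk (i+1) mod N, which is what the Phase 1 send schedule produces. The function name `ring_allgather` is hypothetical.

```python
def ring_allgather(owned):
    """Simulate allgather on a ring of N workers (hypothetical helper).

    owned[i] is the fully reduced chunk held by worker i, assumed to be
    chunk (i + 1) mod N. Returns each worker's full list of N chunks.
    """
    n = len(owned)
    buf = [[None] * n for _ in range(n)]
    for i in range(n):
        buf[i][(i + 1) % n] = owned[i]
    for step in range(n - 1):
        # At step s, worker i forwards chunk (i + 1 - s) mod N,
        # i.e. the chunk it received in the previous step.
        outgoing = [buf[i][(i + 1 - step) % n] for i in range(n)]
        for i in range(n):
            src = (i - 1) % n
            idx = (src + 1 - step) % n
            buf[i][idx] = outgoing[src]  # store, do not accumulate
    return buf


# Worker 0 owns chunk 1, worker 1 owns chunk 2, worker 2 owns chunk 0.
gathered = ring_allgather([[222], [333], [111]])
```

After the N-1 steps every worker's buffer lists the chunks in index order, so all workers end with identical results.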

Bandwidth Analysis

S is the size of each worker's data and N is the number of workers.

Algorithm                    Data Transferred (per worker)   Latency (steps)
Naive (tree)                 2S                              2 log₂(N)
Ring AllReduce               2S × (N-1)/N                    2(N-1)
Recursive Halving-Doubling   2S × (N-1)/N                    2 log₂(N)
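To make the ring row concrete, the short sketch below evaluates its two formulas and plugs them into a simple latency-plus-bandwidth cost model. The cost model (a fixed latency per step plus bytes divided by link bandwidth) is a common simplifying assumption, not something stated in the table, and all function and parameter names are hypothetical.

```python
import math


def ring_allreduce_cost(size_bytes, n_workers):
    """Per-worker data sent and step count for Ring AllReduce."""
    data = 2 * size_bytes * (n_workers - 1) / n_workers
    steps = 2 * (n_workers - 1)
    return data, steps


def log_step_count(n_workers):
    """Step count of the 2 log2(N) algorithms, for comparison."""
    return 2 * math.log2(n_workers)


def estimated_time(data_bytes, steps, latency_s, bandwidth_bytes_per_s):
    """Assumed cost model: per-step latency plus transfer time."""
    return steps * latency_s + data_bytes / bandwidth_bytes_per_s


# Example: 800 bytes per worker on a ring of 8 workers.
data, steps = ring_allreduce_cost(800, 8)
```

With 8 workers the ring needs 14 steps where the logarithmic algorithms need 6, which is why the ring is strongest when messages are large enough for the bandwidth term to dominate the per-step latency.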

Implementation in Practice

Ring AllReduce for Gradient Sync

Ring AllReduce helped make multi-GPU deep learning efficient. Because each GPU's communication volume stays below twice the gradient size regardless of ring size, adding more GPUs to a data-parallel job adds little bandwidth cost per GPU, although the number of steps, and with it the latency term, grows linearly with N. This property supports the large-scale training runs behind modern language models.
