gradient accumulation

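Gradient accumulation simulates a larger batch size by summing gradients over several forward/backward passes (micro-batches) before performing a single optimizer step, which makes it possible to train large models on memory-constrained hardware. Memory constraint: the batch size that fits in one pass is limited by GPU VRAM, while larger batches may be needed for stable convergence or for reliable BatchNorm statistics. Method: (1) split the desired batch B into N micro-batches of size B/N; (2) run the forward and backward pass for micro-batch 1; once the backward pass finishes, that micro-batch's computation graph and activations can be freed, and only its gradients are kept; (3) add those gradients to a persistent gradient buffer; (4) repeat for the remaining micro-batches; (5) call optimizer.step() and then zero_grad(). Trade-off: the N passes run sequentially, so an update typically takes longer than a single pass over the full batch would (if it fit in memory), but peak activation memory scales with the micro-batch size rather than with B; memory for parameters, gradients, and optimizer state is unchanged. Communication: in distributed data-parallel training, gradients need to be reduced (averaged) across workers only once, after the last micro-batch, which reduces network overhead. Normalization: if each micro-batch loss is a mean over its samples, the loss (or the accumulated gradient) must be divided by the number of accumulation steps N so that the result matches the gradient of the mean loss over the full batch. Batch Normalization warning: BN statistics are updated per micro-batch, not over the effective batch B, so accumulation does not reproduce large-batch BN behavior; GroupNorm, which does not depend on batch statistics, or SyncBatchNorm, which pools statistics across devices (though not across accumulation steps), may be needed. Gradient accumulation thus decouples physical memory limits from algorithmic batch size requirements.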
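The accumulation and normalization steps can be illustrated with a minimal, framework-free sketch. It assumes a hypothetical one-parameter linear model y ≈ w·x with a mean-squared-error loss and a hand-written gradient; the names mse_grad, accumulated_grad, and sgd_step are invented for this illustration and are not taken from any library. The sketch only checks that accumulating micro-batch gradients, each scaled by 1/N, reproduces the full-batch gradient.

```python
# Illustrative sketch only: a 1-D linear model y ~ w * x with an MSE loss.
# All function names and data below are hypothetical.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, num_micro_batches):
    """Sum per-micro-batch gradients, scaling each by 1/N."""
    batch_size = len(xs)
    assert batch_size % num_micro_batches == 0, "sketch assumes N divides B"
    micro_size = batch_size // num_micro_batches

    grad = 0.0  # plays the role of the persistent gradient buffer
    for i in range(num_micro_batches):
        lo, hi = i * micro_size, (i + 1) * micro_size
        # "Forward/backward" for one micro-batch: only B/N samples are touched.
        micro_grad = mse_grad(w, xs[lo:hi], ys[lo:hi])
        # Each micro-batch gradient is already a mean, so divide by N.
        grad += micro_grad / num_micro_batches
    return grad


def sgd_step(w, grad, lr):
    """Single optimizer step, applied once after all micro-batches."""
    return w - lr * grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)                                # one pass over B = 8
accum = accumulated_grad(w, xs, ys, num_micro_batches=4)  # 4 passes over B/N = 2
w_new = sgd_step(w, accum, lr=0.01)                       # one update per batch
```

Without the division by N, the accumulated value would be N times the full-batch gradient, which acts like an unintended N-fold increase in the learning rate. In an autograd framework the same effect is usually obtained by scaling each micro-batch loss by 1/N before the backward pass, on the assumption that the loss is mean-reduced.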
