Activation Checkpointing (Gradient Checkpointing)

Keywords: activation checkpointing, gradient checkpointing, memory optimization

Activation Checkpointing (Gradient Checkpointing) is a fundamental memory optimization technique widely employed in large-scale neural network training. It trades a modest increase in computation time (approximately $30\%$) for a dramatic reduction in peak GPU memory consumption ($50\%$ to $70\%$) by deliberately discarding intermediate layer activations during the forward pass and recomputing them on the fly during the backward pass, only when they are needed for gradient calculation.

The Memory Wall

- The Standard Forward Pass: During a standard forward pass through a network with $L$ layers, every layer's output activation tensor must be preserved in GPU memory because the backpropagation algorithm requires these activations to compute the gradients. For a model with $L = 96$ layers, all $96$ activation tensors are simultaneously resident in memory.
- The Scaling Crisis: Memory consumption scales linearly with depth: $O(L)$. For a GPT-3 scale model ($96$ layers, batch size $32$, sequence length $2048$), the stored activations alone consume tens to hundreds of gigabytes, as the estimate below illustrates, far exceeding the parameter memory footprint and representing the dominant memory bottleneck.
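
To make the scaling concrete, here is a rough back-of-the-envelope estimate in Python. It counts only the per-layer output tensors in fp16; real transformer layers also keep attention scores and MLP intermediates, so actual usage is higher. The hidden size of $12288$ is the published GPT-3 value and is used here purely for illustration; per-GPU figures shrink once tensor/pipeline parallelism and smaller micro-batches divide the work.

```python
# Rough estimate of activation memory when every layer's output is kept for backprop.
# Counts only the (batch, seq_len, hidden) output tensor of each layer.
def activation_memory_gb(layers, batch, seq_len, hidden, bytes_per_elem=2):
    return layers * batch * seq_len * hidden * bytes_per_elem / 1e9

# Illustrative GPT-3-scale dimensions (hidden size assumed, not from the text above).
print(activation_memory_gb(layers=96, batch=32, seq_len=2048, hidden=12288))
# ~155 GB of layer outputs alone for the full model -- even split across 8 GPUs,
# that is tens of gigabytes per device before counting any other intermediates.
```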

The Checkpointing Strategy

Instead of storing every layer's activations, the algorithm designates only $\sqrt{L}$ evenly spaced layers as "checkpoints" and stores only their activation tensors. All intermediate activations between checkpoints are discarded as soon as the forward pass no longer needs them.

1. Forward Pass: Compute all $L$ layers normally. Store activations only at every $\sqrt{L}$-th layer. Discard all others.
2. Backward Pass: When backpropagation reaches a segment between two checkpoints, the algorithm re-executes the forward pass for that segment (starting from the nearest stored checkpoint activation) to recompute the discarded intermediate activations on-the-fly.
3. Memory Reduction: Memory consumption drops from $O(L)$ to $O(\sqrt{L})$. For a $96$-layer model, this means storing $\sim 10$ checkpoint activations instead of $96$, approximately a $10\times$ memory reduction.
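
A minimal PyTorch sketch of this strategy uses torch.utils.checkpoint.checkpoint_sequential to split a deep stack into roughly $\sqrt{L}$ checkpointed segments. The toy MLP blocks and hidden width below are placeholders for real transformer layers, and the use_reentrant flag assumes a recent PyTorch release:

```python
import math
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

L = 96          # number of layers
d = 1024        # hypothetical hidden width, for illustration only

# A deep stack of simple blocks standing in for transformer layers.
layers = nn.Sequential(*[
    nn.Sequential(nn.Linear(d, d), nn.GELU()) for _ in range(L)
])

x = torch.randn(8, d, requires_grad=True)

# Split the stack into ~sqrt(L) segments; only segment-boundary activations
# are stored, and everything inside a segment is recomputed during backward.
segments = int(math.sqrt(L))        # ~10 checkpoints for L = 96
out = checkpoint_sequential(layers, segments, x, use_reentrant=False)
out.sum().backward()                # recomputation happens here
```

Production frameworks apply the same idea per transformer block rather than per generic segment, but the memory/compute trade-off is identical.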

The Computational Cost

Each activation between two checkpoints is computed twice: once during the original forward pass and once during recomputation in the backward pass. The recomputation across the $\sqrt{L}$ segments amounts to roughly one extra full forward pass, and since the backward pass costs roughly twice the forward pass in FLOPs, this adds approximately $33\%$ to the total training FLOPs.
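
Under that common rule of thumb (backward $\approx 2\times$ forward FLOPs), the overhead works out as:

$$\text{overhead} \;\approx\; \frac{F_{\text{recompute}}}{F_{\text{forward}} + F_{\text{backward}}} \;\approx\; \frac{1}{1 + 2} \;\approx\; 33\%$$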

The Practical Impact

Activation Checkpointing is the critical enabling technique that makes training 13B+ parameter models feasible on $8\times$ A100 40GB clusters. Without it, the activation memory alone would overflow the GPU. It is a standard, non-negotiable component of every production LLM and large ViT training pipeline (Megatron-LM, DeepSpeed, PyTorch FSDP).
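
As one illustration of how such pipelines enable it, recent PyTorch releases ship a checkpoint wrapper used alongside FSDP that wraps selected submodules so their internal activations are recomputed in the backward pass. The sketch below uses that API with a toy Block class standing in for a real transformer layer; the module path is an internal one and its signature may differ across PyTorch versions, so treat this as a sketch rather than a fixed recipe:

```python
import functools
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl, apply_activation_checkpointing, checkpoint_wrapper,
)

# Toy stand-in for a transformer block; a real pipeline would target its own
# decoder-layer class here instead.
class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.net(x)

model = nn.Sequential(*[Block(512) for _ in range(24)])

# Wrap every Block so its internal activations are discarded after the forward
# pass and recomputed on demand during backward.
wrapper = functools.partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
apply_activation_checkpointing(model, checkpoint_wrapper_fn=wrapper,
                               check_fn=lambda m: isinstance(m, Block))

out = model(torch.randn(4, 512, requires_grad=True))
out.sum().backward()
```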

Activation Checkpointing is the rewind-and-replay memory trick — deliberately forgetting the intermediate computational steps and accepting the cost of redoing them on demand, trading a fraction of extra time for a massive liberation of precious GPU memory.
