The problem

Gradient clipping caps the magnitude of gradients so that exploding gradients do not destabilize training.

The problem: Large gradients cause huge weight updates, loss spikes, or NaN values. They are common in RNNs, deep networks, and early in training.

Clipping methods:
Clip by value: Clamp each gradient element to [-threshold, threshold]. Simple, but it can change the gradient's direction.
Clip by norm: If the gradient vector's norm exceeds the maximum, scale the vector down to that maximum norm. Preserves direction. More common.
Clip by global norm: Compute a single norm across the gradients of all parameters and scale them all by the same factor. Recommended for most uses.

Typical values: A threshold of 1.0 is common, sometimes 0.5 or 5.0. The right value depends on the model and optimizer.

When to use: Always for RNNs/LSTMs, recommended for transformer training, and useful whenever training is unstable.

Implementation: torch.nn.utils.clip_grad_norm_ (PyTorch), tf.clip_by_global_norm (TensorFlow). Usually called after the backward pass and before optimizer.step.

Relationship to loss scaling: With mixed precision, unscale the gradients before clipping (or scale the threshold by the same factor), so that the threshold is compared against the true gradient magnitudes.

Monitoring: Log gradient norms. Clipping on nearly every step may indicate a learning rate issue. Occasional clipping is fine.
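To make the difference between clipping by value and clipping by global norm concrete, here is a minimal sketch in plain Python. It is not how any library implements clipping; the helper names (clip_by_value, global_norm, clip_by_global_norm) and the toy gradient values are assumptions made for illustration, with each parameter's gradient flattened to a list of floats.

```python
import math

def clip_by_value(grads, threshold):
    """Clamp every gradient element to [-threshold, threshold]."""
    return [[max(-threshold, min(threshold, g)) for g in param] for param in grads]

def global_norm(grads):
    """L2 norm taken over all elements of all parameter gradients."""
    return math.sqrt(sum(g * g for param in grads for g in param))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor if the global norm exceeds max_norm.

    Returns the (possibly scaled) gradients and the norm measured before clipping,
    which is the value worth logging for monitoring.
    """
    norm = global_norm(grads)
    if norm <= max_norm:
        return [list(param) for param in grads], norm
    scale = max_norm / norm
    return [[g * scale for g in param] for param in grads], norm

# Two hypothetical parameters with made-up gradients; the global norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]

# Element-wise clamping: the nonzero entries both become 1.0, so their
# 3:4 ratio is lost and the overall direction changes.
by_value = clip_by_value(grads, 1.0)

# Global-norm clipping: every entry is multiplied by the same factor,
# so the 3:4 ratio (the direction) is preserved.
by_norm, pre_clip_norm = clip_by_global_norm(grads, 1.0)
```

In a real training loop the same step sits between the backward pass and the optimizer update, which is where torch.nn.utils.clip_grad_norm_ or tf.clip_by_global_norm would be called instead of these helpers. Logging pre_clip_norm on each step gives the gradient-norm history mentioned under Monitoring.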
