Gradient clipping limits gradient magnitude during training to prevent exploding gradients, stabilizing optimization of deep networks and recurrent architectures. Methods: (1) clip-by-value: clamp each gradient element to [-threshold, threshold], (2) clip-by-norm (most common): if ||g|| > max_norm, scale g → g × max_norm/||g||, preserving direction. Typical values: max_norm = 1.0 for transformers, 0.25-5.0 depending on architecture. Why needed: deep networks and RNNs can have gradient norms grow exponentially through layers (exploding gradients), causing divergence or NaN losses. When to use: LLM training (standard practice), RNN/LSTM training, fine-tuning with high learning rates, and unstable training regimes. Implementation: PyTorch torch.nn.utils.clip_grad_norm_, TensorFlow tf.clip_by_global_norm. Monitoring: log gradient norms to detect instability—sudden spikes indicate need for clipping. Trade-off: too aggressive clipping slows convergence (effectively reduces learning rate). Complements other stabilization techniques: learning rate warmup, weight decay, and normalization layers.
Related Topics
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.