Regularization Techniques
Why Regularization?
Regularization prevents overfitting by constraining model complexity, which improves generalization to unseen data.
Dropout
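How It Works
During training, each neuron's activation is set to zero independently with probability $p$, and the surviving activations are rescaled:
$$ h' = \frac{1}{1-p} \cdot h \odot m, \quad m_j \sim \text{Bernoulli}(1-p) $$
Scaling by $\frac{1}{1-p}$ keeps the expected value of each activation unchanged, so no rescaling is needed at inference time, when dropout is disabled.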
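The scaling rule can be checked numerically. The following is a minimal sketch in plain Python, assuming the mask is drawn independently per element; the function name `inverted_dropout` and its arguments are hypothetical and not taken from any library.

```python
import random

def inverted_dropout(h, p, training=True, rng=None):
    """Zero each activation with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)  # dropout is the identity at inference time
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    # rng.random() >= p holds with probability (1 - p), so that is the keep rate
    return [x * scale if rng.random() >= p else 0.0 for x in h]

rng = random.Random(0)           # fixed seed so the run is repeatable
h = [1.0] * 100_000              # constant input makes the mean easy to read
out = inverted_dropout(h, p=0.1, rng=rng)
mean_out = sum(out) / len(out)   # stays close to the input mean of 1.0
print(f"mean after dropout: {mean_out:.4f}")
```

Because the rescaling happens during training, the same function with `training=False` returns the input untouched.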
Typical Values
| Component | Dropout Rate |
|---|---|
| Attention | 0.0-0.1 |
| FFN | 0.0-0.1 |
| Embedding | 0.0-0.1 |
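Modern LLMs
Many large LLMs (Llama, for example) are pretrained with little or no dropout:
- Large models trained on enough data overfit less, so dropout has less to fix
- Dropout adds noise to every update, which slows convergence
- Other regularizers (weight decay, data augmentation) are preferred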
Weight Decay
L2 Regularization
Add a penalty proportional to the squared magnitude of the weights:
$$ L_{total} = L_{task} + \lambda \sum_i w_i^2 $$
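To see why this penalty is called weight decay, it helps to work through one step of plain gradient descent. The learning rate $\eta$ is a symbol introduced here for illustration; the rest follows from the loss above.
$$ \frac{\partial L_{total}}{\partial w_i} = \frac{\partial L_{task}}{\partial w_i} + 2\lambda w_i $$
$$ w_i \leftarrow w_i - \eta \frac{\partial L_{total}}{\partial w_i} = (1 - 2\eta\lambda)\, w_i - \eta \frac{\partial L_{task}}{\partial w_i} $$
Each step therefore shrinks every weight by the factor $(1 - 2\eta\lambda)$ before applying the task gradient. This equivalence holds for plain gradient descent; it does not hold once the gradient is rescaled by an adaptive optimizer.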
AdamW vs Adam
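- Adam with L2: Suboptimal. The penalty is added to the gradient, so it is rescaled by Adam's adaptive per-parameter learning rate, which couples the regularization strength to each weight's gradient history
- AdamW: Decouples weight decay from the gradient-based update by shrinking the weights directly (the recommended approach)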
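```python
import torch

params = torch.nn.Linear(16, 4).parameters()  # placeholder; pass your own model's parameters

# AdamW (preferred): weight decay is decoupled from the adaptive update
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)

# What NOT to do: Adam with weight_decay, which is applied as an L2 term in the gradient
# optimizer = torch.optim.Adam(params, lr=1e-4, weight_decay=0.01)
```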
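The difference between the two optimizers above can be illustrated with a single scalar weight. This is a simplified sketch of the Adam update written from scratch, not PyTorch's implementation; `adam_step` and its `decoupled` flag are hypothetical names, and the task gradient is set to zero so that only the decay term acts.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar weight; returns the new (w, m, v)."""
    if weight_decay and not decoupled:
        grad = grad + weight_decay * w      # L2 style: decay enters the gradient
    m = beta1 * m + (1 - beta1) * grad      # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    if weight_decay and decoupled:
        w = w - lr * weight_decay * w       # AdamW style: shrink the weight directly
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Zero task gradient isolates the effect of the decay term.
for wd in (0.01, 0.1):
    w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=wd, decoupled=False)
    w_dec, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=wd, decoupled=True)
    print(f"wd={wd}: L2-in-gradient w={w_l2:.6f}, decoupled w={w_dec:.6f}")
```

In this setup the L2 variant moves the weight by roughly `lr` whether `wd` is 0.01 or 0.1, because Adam's normalization divides the penalty gradient by its own magnitude. The decoupled variant shrinks the weight by `lr * wd * w`, so the decay strength is controlled by `wd` as intended.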
Typical Values
- Pretraining: weight_decay = 0.1
- Fine-tuning: weight_decay = 0.01
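Layer Normalization
Not strictly regularization, but it improves training stability:
$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta $$
- Normalizes each token's activations to zero mean and unit variance across the feature dimension ($\mu$ and $\sigma^2$ are the mean and variance of $x$; $\epsilon$ is a small constant for numerical stability)
- Learnable scale (γ) and shift (β)
- Pre-LN (normalization applied before each attention and FFN sublayer, inside the residual branch) is more stable for deep networks than Post-LN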
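The normalization and the Pre-LN arrangement can be sketched in plain Python for a single feature vector. This is an illustration under assumptions: the small constant `eps` guards against division by zero, and the function names `layer_norm` and `pre_ln_block` are hypothetical, with `sublayer` standing in for attention or the FFN.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n   # population variance over the features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

def pre_ln_block(x, sublayer, gamma, beta):
    """Pre-LN residual block: normalize first, apply the sublayer, then add the residual."""
    normed = layer_norm(x, gamma, beta)
    return [xi + si for xi, si in zip(x, sublayer(normed))]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y = layer_norm(x, gamma=ones, beta=zeros)
y_mean = sum(y) / len(y)
y_var = sum((yi - y_mean) ** 2 for yi in y) / len(y)
print(f"mean={y_mean:.6f}, var={y_var:.6f}")

# With an identity sublayer, the block returns x plus its normalized copy.
z = pre_ln_block(x, lambda v: v, ones, zeros)
```

In the Pre-LN form the residual path `x` is never normalized, so the identity connection carries the signal through the block unchanged, which is one common explanation for its stability in deep stacks.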
Data Augmentation
For LLMs, augmentation includes:
- Paraphrasing training examples
- Back-translation
- Token dropout/masking
- Mixing training examples
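Of the options above, token masking is the simplest to sketch. The example below is a minimal illustration, assuming whitespace-split tokens and a `[MASK]` placeholder string; the function name `mask_tokens` and the 15% rate are illustrative choices, not values from this article.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Replace each token with mask_token independently with probability mask_prob."""
    rng = rng or random.Random()
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]

rng = random.Random(0)  # fixed seed so the run is repeatable
sentence = "regularization prevents overfitting by constraining model complexity".split()
augmented = mask_tokens(sentence, rng=rng)
print(" ".join(augmented))
```

Calling the function several times on the same sentence yields different masked variants, which is what makes it usable as augmentation.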
Other Techniques
| Technique | Description |
|---|---|
| Early stopping | Stop when validation loss stops improving |
| Gradient clipping | Cap the gradient norm to limit the size of each update |
| Label smoothing | Soften one-hot targets |
| Stochastic depth | Randomly skip layers during training |
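Two of the techniques in the table reduce to a few lines of arithmetic. The sketch below shows one common variant of each, assuming label smoothing spreads the mass `epsilon` uniformly over all classes and gradient clipping rescales by the global L2 norm; the function names are hypothetical.

```python
import math

def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Smoothed target: (1 - epsilon) on the true class plus epsilon spread uniformly."""
    base = epsilon / num_classes
    dist = [base] * num_classes
    dist[target_index] += 1.0 - epsilon
    return dist

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm        # already within the limit
    scale = max_norm / norm
    return [g * scale for g in grads], norm

targets = smooth_labels(target_index=2, num_classes=4, epsilon=0.1)
clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(targets, clipped, original_norm)
```

Clipping by the global norm preserves the direction of the gradient and changes only its length, unlike clipping each component separately.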