
Regularization Techniques

Why Regularization?

Regularization prevents overfitting by constraining model complexity, which improves generalization to unseen data.

Dropout

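How It Works

During training, each activation is set to zero independently with probability $p$, and the surviving activations are rescaled: $$ h' = \frac{1}{1-p} \cdot h \odot m, \quad m_i \sim \mathrm{Bernoulli}(1-p) $$

Here $m$ is the random binary mask and $\odot$ is element-wise multiplication. Scaling by $\frac{1}{1-p}$ keeps the expected value of each activation unchanged, so no rescaling is needed at inference time, when dropout is disabled.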
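The masking and rescaling can be sketched without any framework. This is a minimal illustration, not how a deep learning library implements dropout; the function name `inverted_dropout` and its `rng` and `training` parameters are assumptions made for this example.

```python
import random

def inverted_dropout(h, p, rng, training=True):
    """Zero each activation with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)  # dropout is a no-op at inference time
    scale = 1.0 / (1.0 - p)
    # Each element survives with probability (1 - p); survivors are scaled up.
    return [x * scale if rng.random() >= p else 0.0 for x in h]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
activations = [1.0] * 100_000
dropped = inverted_dropout(activations, p=0.1, rng=rng)

# The mean stays close to the original mean of 1.0 despite the zeros.
mean_after = sum(dropped) / len(dropped)
```

With a large enough vector, `mean_after` lands near the original mean, which is the point of the $\frac{1}{1-p}$ factor.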

Typical Values

| Component | Dropout Rate |
| --- | --- |
| Attention | 0.0-0.1 |
| FFN | 0.0-0.1 |
| Embedding | 0.0-0.1 |

Modern LLMs

Most large LLMs (GPT-4, Llama) use minimal or no dropout.

Weight Decay

L2 Regularization

Add a penalty proportional to the squared magnitude of the weights, with $\lambda$ controlling its strength: $$ L_{total} = L_{task} + \lambda \sum_i w_i^2 $$
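As a small worked example of the formula, the sketch below computes the penalty and its gradient for a toy weight vector. The weights, the task loss of 0.30, and the helper names are hypothetical values chosen for illustration.

```python
def l2_penalty(weights, lam):
    """lambda * sum_i w_i^2"""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lambda * w_i."""
    return [2.0 * lam * w for w in weights]

weights = [0.5, -1.0, 2.0]   # hypothetical weights
task_loss = 0.30             # hypothetical task loss
lam = 0.01

total_loss = task_loss + l2_penalty(weights, lam)
penalty_grad = l2_penalty_grad(weights, lam)
```

The gradient of the penalty is proportional to each weight, so larger weights are pulled toward zero more strongly.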

AdamW vs Adam

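```python
import torch

model = torch.nn.Linear(16, 4)  # placeholder model, for illustration only

# AdamW (preferred): weight decay is applied directly to the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# What NOT to do: Adam with weight_decay, which adds an L2 term to the gradient
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
```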
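Why the two optimizers differ can be seen in a single scalar update. The sketch below is a simplified, framework-free Adam step written for this comparison; `adam_step`, `new_state`, and the `decoupled` flag are hypothetical names, and the code is not PyTorch's implementation.

```python
import math

def adam_step(w, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update for a single scalar weight (illustrative only)."""
    state["t"] += 1
    if not decoupled:
        # Adam + L2: the decay term is folded into the gradient, so it is
        # later divided by sqrt(v_hat) like every other gradient component.
        grad = grad + weight_decay * w
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    if decoupled:
        # AdamW: shrink the weight directly, outside the adaptive scaling.
        w = w - lr * weight_decay * w
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

# A zero task gradient isolates the effect of the decay term.
w_adamw = adam_step(1.0, 0.0, new_state(), weight_decay=0.01, decoupled=True)
w_adam_l2 = adam_step(1.0, 0.0, new_state(), weight_decay=0.01, decoupled=False)
```

In this single step the decoupled update shrinks the weight by about `lr * weight_decay` (1e-6), while the L2 variant moves it by about `lr` (1e-4), because the adaptive normalization rescales the decay term along with the gradient.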

Typical Values

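Layer Normalization

Not strictly a regularization technique, but it improves training stability. Each activation vector is normalized by its own mean $\mu$ and standard deviation $\sigma$, computed across the feature dimension, then rescaled by a learned gain $\gamma$ and bias $\beta$: $$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta $$

The small constant $\epsilon$ guards against division by zero.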
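The normalization can be sketched for a single feature vector as follows. This is a minimal illustration using plain lists; the function name `layer_norm` and the default `eps=1e-5` are assumptions for this example, and a real model would use its framework's layer.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then apply per-feature gain and bias."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)     # eps avoids division by zero
    return [(v - mu) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# With unit gain and zero bias the output has mean ~0 and variance ~1.
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```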

Data Augmentation

For LLMs, augmentation includes:

Other Techniques

| Technique | Description |
| --- | --- |
| Early stopping | Stop when validation loss stops improving |
| Gradient clipping | Limit gradient magnitude |
| Label smoothing | Soften one-hot targets |
| Stochastic depth | Randomly skip layers during training |
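Three of the techniques in the table are small enough to sketch directly. The helpers below are illustrative, framework-free versions; their names, the `patience` and `min_delta` parameters, the smoothing value of 0.1, and the validation losses are all assumptions made for this example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Gradient clipping: rescale so the L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Label smoothing: (1 - epsilon) * one_hot + epsilon / num_classes."""
    off = epsilon / num_classes
    return [off + (1.0 - epsilon) if i == target_index else off
            for i in range(num_classes)]

class EarlyStopping:
    """Early stopping: stop after `patience` epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 scaled to norm 1
soft = smooth_labels(target_index=2, num_classes=4)      # still sums to 1

stopper = EarlyStopping(patience=2)
history = [1.0, 0.8, 0.85, 0.9, 0.7]  # hypothetical validation losses
stopped_at = next(i for i, loss in enumerate(history) if stopper.should_stop(loss))
```

Here training would stop at epoch index 3, after two consecutive epochs without improving on the best loss of 0.8, so the later value 0.7 is never seen. That trade-off is why `patience` is usually tuned rather than fixed.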