Optimizers for Deep Learning
What is an Optimizer?
An optimizer updates model parameters using the gradients of the loss function, with the goal of minimizing that loss.
Common Optimizers
SGD (Stochastic Gradient Descent)
$$ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) $$
SGD with Momentum
$$ v_{t+1} = \gamma v_t + \eta \nabla L(\theta_t) $$
$$ \theta_{t+1} = \theta_t - v_{t+1} $$
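The momentum update can be sketched in plain Python as below. This is a minimal illustration, not a library implementation: the function name `sgd_momentum_step`, the toy quadratic loss, and the values `lr=0.1` and `gamma=0.9` are assumptions chosen for the example.

```python
def sgd_momentum_step(theta, velocity, grad, lr=0.1, gamma=0.9):
    """One SGD-with-momentum update on lists of floats.

    v_{t+1} = gamma * v_t + lr * grad
    theta_{t+1} = theta_t - v_{t+1}
    """
    new_velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    new_theta = [t - v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


def quadratic_grad(theta):
    # Toy loss L(theta) = 0.5 * sum(theta_i^2); its gradient is theta itself.
    return list(theta)


theta, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(300):
    theta, velocity = sgd_momentum_step(theta, velocity, quadratic_grad(theta))
print(theta)  # both entries end up very close to the minimum at 0
```

Setting `gamma=0.0` in this sketch recovers plain SGD, since the velocity then equals the scaled gradient.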
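Adam (Adaptive Moment Estimation)
Adam and its variants are the most common choice for training LLMs. Adam maintains exponential moving averages of the gradient ($m$) and the squared gradient ($v$), where $g_t = \nabla L(\theta_t)$:
$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$
$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
Because $m$ and $v$ are initialized at zero, both are bias-corrected before the update:
$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
$$ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$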
Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$
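A single Adam step might look like the following sketch in plain Python. It uses the default $\beta_1$, $\beta_2$, and $\epsilon$ listed above and the usual bias-corrected estimates $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$. The function name `adam_step`, the step counter `t` starting at 1, and the learning rate `lr=1e-3` are assumptions made for illustration.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, m_i, v_i, g in zip(theta, m, v, grad):
        m_i = beta1 * m_i + (1 - beta1) * g        # moving average of gradient
        v_i = beta2 * v_i + (1 - beta2) * g * g    # moving average of squared gradient
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, m, v, grad=[0.5, -3.0], t=1)
print(theta)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's magnitude.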
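AdamW (Adam with Weight Decay)
Decouples weight decay from the gradient. The decay term $\lambda\theta_t$ is applied directly to the parameters instead of being added to the gradient, so it is not rescaled by the adaptive denominator. Preferred for LLM training:
$$ \theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_t\right) $$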
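To make the decay term concrete, here is a minimal scalar sketch of the AdamW update written to match the formula above. The function name `adamw_step` and the values `lr=1e-4` and `weight_decay=0.01` are illustrative assumptions. Passing a zero gradient isolates the decay term.

```python
import math


def adamw_step(theta, m, v, grad, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; t is the 1-based step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive = m_hat / (math.sqrt(v_hat) + eps)
    # The decay term is added outside the adaptive ratio, so it is not
    # divided by sqrt(v_hat).
    theta = theta - lr * (adaptive + weight_decay * theta)
    return theta, m, v


# With a zero gradient, only the weight decay term changes the parameter.
theta, m, v = adamw_step(theta=2.0, m=0.0, v=0.0, grad=0.0, t=1)
print(theta)
```

In this sketch a zero gradient shrinks the parameter by the factor `1 - lr * weight_decay`, independent of the gradient history.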
Optimizer Comparison
| Optimizer | Optimizer State Memory | Convergence | Use Case |
|---|---|---|---|
| SGD | None (1x params with momentum) | Slow | Simple models, CV |
| Adam | 2x params | Fast | Most DL |
| AdamW | 2x params | Fast | LLM training |
| 8-bit Adam | 2x params, stored in 8 bits | Fast | Memory-constrained |
| Adafactor | Sublinear (factored second moment) | Moderate | Large models |
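The memory column can be turned into rough numbers with a little arithmetic. The sketch below assumes a hypothetical 7B-parameter model, fp32 (4-byte) states for SGD and Adam, and 1-byte states for 8-bit Adam. It ignores quantization metadata, and it omits Adafactor because its state size depends on the layer shapes. The helper name `optimizer_state_gb` is made up for this example.

```python
def optimizer_state_gb(num_params, states_per_param, bytes_per_state):
    """Approximate optimizer state size in GB (1 GB = 1e9 bytes)."""
    return num_params * states_per_param * bytes_per_state / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model

estimates = {
    "SGD (no momentum)": optimizer_state_gb(num_params, 0, 4),
    "SGD with momentum": optimizer_state_gb(num_params, 1, 4),
    "Adam / AdamW (fp32 states)": optimizer_state_gb(num_params, 2, 4),
    "8-bit Adam": optimizer_state_gb(num_params, 2, 1),
}
for name, gb in estimates.items():
    print(f"{name}: {gb:.0f} GB")
```

These figures cover optimizer state only. Weights, gradients, and activations add to the total.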
Learning Rate
Typical Values
| Task | Learning Rate |
|---|---|
| Pretraining | 1e-4 to 3e-4 |
| Full fine-tuning | 1e-5 to 5e-5 |
| LoRA fine-tuning | 1e-4 to 3e-4 |
Learning Rate Schedules
- Constant: Fixed throughout training
- Linear decay: Decrease linearly to 0 over training
- Cosine annealing: Smooth decay following a half cosine curve
- Warmup + decay: Start low, ramp up to the peak rate, then decay
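The warmup plus decay schedule can be written as a small function of the step index. The sketch below assumes linear warmup followed by cosine decay. The function name `warmup_cosine_lr`, the `min_lr` floor, and the example step counts are illustrative, not taken from a particular library.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, base_lr=3e-4, warmup_steps=100, total_steps=1000))
```

Setting `warmup_steps=0` gives pure cosine annealing, and a nonzero `min_lr` keeps the rate from decaying all the way to zero.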
PyTorch Example
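```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(16, 4)  # placeholder model
num_steps, warmup_steps = 1000, 100  # placeholder step counts

# AdamW optimizer
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
)

# Linear warmup, then cosine decay; call scheduler.step() after each optimizer.step()
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=num_steps - warmup_steps)
scheduler = SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps]
)
```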