Optimizers for Deep Learning

What is an Optimizer?

An optimizer updates a model's parameters using the gradients of the loss function, with the goal of minimizing that loss.

Common Optimizers

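SGD (Stochastic Gradient Descent)

$$ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) $$

Here $\eta$ is the learning rate and $\nabla L(\theta_t)$ is the gradient of the loss, estimated on a mini-batch.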
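To make the SGD update rule concrete, here is a minimal sketch in plain Python. The helper names `sgd_step` and `grad_quadratic`, and the toy loss $L(\theta) = \sum_i \theta_i^2$, are assumptions chosen for illustration. A real training loop would obtain gradients from mini-batches through automatic differentiation.

```python
def sgd_step(theta, grad, lr):
    """Return theta - lr * grad, element-wise, for lists of floats."""
    return [p - lr * g for p, g in zip(theta, grad)]


def grad_quadratic(theta):
    """Gradient of the toy loss L(theta) = sum(theta_i ** 2)."""
    return [2.0 * p for p in theta]


# Repeated steps shrink the parameters toward the minimum at zero.
theta = [1.0, -2.0]
for _ in range(100):
    theta = sgd_step(theta, grad_quadratic(theta), lr=0.1)
```

With this loss and `lr=0.1`, each step multiplies every parameter by 0.8, so the iterates decay geometrically toward zero.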

SGD with Momentum

$$ v_{t+1} = \gamma v_t + \eta \nabla L(\theta_t) $$

$$ \theta_{t+1} = \theta_t - v_{t+1} $$

Here $v_t$ is the velocity (initialized to zero) and $\gamma$ is the momentum coefficient.
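The two momentum equations can be sketched as a single step function. The name `momentum_step`, the default `gamma=0.9`, and the toy quadratic loss are assumptions for illustration, not part of any library API.

```python
def momentum_step(theta, velocity, grad, lr, gamma=0.9):
    """One SGD-with-momentum step; returns (new_theta, new_velocity)."""
    # v_{t+1} = gamma * v_t + lr * grad
    new_velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    # theta_{t+1} = theta_t - v_{t+1}
    new_theta = [p - v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


# Toy loss L(theta) = sum(theta_i ** 2), so the gradient is 2 * theta.
theta = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grad = [2.0 * p for p in theta]
    theta, velocity = momentum_step(theta, velocity, grad, lr=0.05)
```

The velocity has to be carried between steps. This is the extra per-parameter state that momentum adds on top of plain SGD.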

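Adam (Adaptive Moment Estimation)

The most widely used optimizer family for LLMs. It maintains exponential moving averages of the gradient ($m$) and of the squared gradient ($v$), where $g_t = \nabla L(\theta_t)$:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$

Because $m$ and $v$ start at zero, both are bias-corrected before use:

$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$

$$ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$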

Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$
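As a rough illustration of how the moving averages, the bias correction, and the defaults above fit together, the following plain-Python sketch performs one Adam step on a list of floats. The function name `adam_step` and its signature are hypothetical. Framework implementations operate on tensors and differ in detail.

```python
import math


def adam_step(theta, m, v, grad, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on lists of floats; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for m and v being initialized at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


theta, m, v = adam_step([1.0], [0.0], [0.0], grad=[4.0], t=1, lr=0.1)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's scale.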

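AdamW (Adam with Weight Decay)

Decouples weight decay from the adaptive update. In Adam with L2 regularization, the decay term is added to the gradient and is therefore rescaled by $\sqrt{\hat{v}_t}$. AdamW instead applies $\lambda\theta_t$ directly to the weights. Preferred for LLM training:

$$ \theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\theta_t\right) $$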
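The decoupling can be seen in a small plain-Python sketch of one AdamW step following the formula above. The name `adamw_step` and its parameters are assumptions for illustration. The point to notice is that the decay term uses the raw weight and never enters the moment estimates.

```python
import math


def adamw_step(theta, m, v, grad, t, lr, weight_decay,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step on lists of floats; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    new_theta = []
    for p, mi, vi in zip(theta, m, v):
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        adaptive = m_hat / (math.sqrt(v_hat) + eps)
        # Decay uses the raw weight p; it never passes through m or v.
        new_theta.append(p - lr * (adaptive + weight_decay * p))
    return new_theta, m, v


# With a zero gradient, only the decay term acts on the weight.
theta, m, v = adamw_step([2.0], [0.0], [0.0], grad=[0.0], t=1,
                         lr=0.1, weight_decay=0.5)
```

In the zero-gradient call, the weight shrinks by the factor `1 - lr * weight_decay`, independent of the gradient statistics.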

Optimizer Comparison

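| Optimizer | Memory (optimizer state) | Convergence | Use Case |
|---|---|---|---|
| SGD | Low | Slow | Simple models, CV |
| Adam | 2x params | Fast | Most DL |
| AdamW | 2x params | Fast | LLM training |
| 8-bit Adam | Low | Fast | Memory-constrained |
| Adafactor | Low | Moderate | Large models |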
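To put the "2x params" entry in perspective, the arithmetic below estimates optimizer-state memory for a hypothetical 7-billion-parameter model. The parameter count, the fp32 (4-byte) state assumption, and the helper name `optimizer_state_bytes` are illustrative. The 8-bit estimate ignores the small overhead of quantization metadata, and Adafactor's factored state is not modeled.

```python
def optimizer_state_bytes(num_params, states_per_param, bytes_per_state):
    """Bytes used by optimizer state, excluding weights and gradients."""
    return num_params * states_per_param * bytes_per_state


GIB = 1024 ** 3
num_params = 7_000_000_000  # hypothetical model size

# Adam/AdamW keep two states (m and v) per parameter.
adam_fp32 = optimizer_state_bytes(num_params, 2, 4)
# 8-bit Adam stores the same two states in one byte each.
adam_8bit = optimizer_state_bytes(num_params, 2, 1)
# SGD with momentum keeps one velocity value per parameter.
sgd_momentum = optimizer_state_bytes(num_params, 1, 4)

for name, size in [("Adam fp32", adam_fp32),
                   ("Adam 8-bit", adam_8bit),
                   ("SGD + momentum", sgd_momentum)]:
    print(f"{name}: {size / GIB:.1f} GiB")
```

Under these assumptions the fp32 Adam state is four times the size of the 8-bit state and twice that of SGD with momentum.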

Learning Rate

Typical Values

| Task | Learning Rate |
|---|---|
| Pretraining | 1e-4 to 3e-4 |
| Full fine-tuning | 1e-5 to 5e-5 |
| LoRA fine-tuning | 1e-4 to 3e-4 |

Learning Rate Schedules

PyTorch Example

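import torch.nn as nn
import torch.optim as optim

# Placeholder model and step count (assumed values) so the snippet runs on its own
model = nn.Linear(16, 1)
num_steps = 1000

# AdamW optimizer (decoupled weight decay)
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
    betas=(0.9, 0.999),
)

# Cosine annealing from lr down to eta_min (default 0) over num_steps.
# CosineAnnealingLR has no warmup phase of its own.
# Call scheduler.step() once after each optimizer.step().
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_steps
)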
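`CosineAnnealingLR` on its own does not include a warmup phase. As a framework-independent sketch of what a linear warmup followed by cosine decay could look like, the function below computes the learning rate for a given step. The name `lr_at` and the parameters `warmup_steps`, `total_steps`, and `min_lr` are assumptions for illustration.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly and reach base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


schedule = [lr_at(s, base_lr=1e-4, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
```

In PyTorch, one way to use such a function would be to pass `lambda s: lr_at(s, 1.0, warmup_steps, total_steps)` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's initial learning rate by the returned factor.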