AdamW

Keywords: adamw, model training

AdamW is a variant of the Adam optimizer that implements weight decay correctly by decoupling it from the gradient-based update. This fixes a subtle but significant flaw in the original Adam optimizer's handling of L2 regularization, and AdamW has since become the standard optimizer for training transformer-based language models.

The issue was identified by Loshchilov and Hutter (2019): in standard Adam, L2 regularization (adding λ||θ||² to the loss) interacts poorly with Adam's adaptive learning rates. The regularization gradient (2λθ) is scaled by Adam's per-parameter learning rate adjustments, so parameters with larger historical gradients (and hence smaller effective learning rates) receive less regularization, violating the intent of uniform weight decay. AdamW fixes this by applying weight decay directly to the parameter update rather than through the loss gradient: θ_t = θ_{t-1} - α(m̂_t / (√v̂_t + ε) + λθ_{t-1}), where the weight decay term λθ_{t-1} is applied alongside the Adam update rather than being incorporated into the gradient. This seemingly minor change produces meaningful improvements in generalization, especially for models trained with longer schedules.

The update rule: compute the first moment estimate m_t = β₁m_{t-1} + (1-β₁)g_t and the second moment estimate v_t = β₂v_{t-1} + (1-β₂)g_t², compute the bias-corrected estimates m̂_t = m_t / (1-β₁ᵗ) and v̂_t = v_t / (1-β₂ᵗ), then update θ_t = θ_{t-1} - α(m̂_t / (√v̂_t + ε)) - αλθ_{t-1}.

Typical default hyperparameters: learning rate α = 1e-4 to 3e-4, β₁ = 0.9, β₂ = 0.999 (or 0.95 for LLM training), ε = 1e-8, and weight decay λ = 0.01 to 0.1.

AdamW has become the default optimizer for virtually all large language model training (GPT, LLaMA, BERT, T5), typically combined with learning rate warmup (linear warmup for 1-5% of training) followed by cosine or linear decay scheduling.
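To make the update rule concrete, here is a minimal NumPy sketch of a single AdamW step following the formulas above. The function name, its signature, and the toy quadratic-loss loop are illustrative assumptions, not the API of any particular framework.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update with decoupled weight decay (illustrative sketch)."""
    # Exponential moving averages of the gradient and squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adam update, plus weight decay applied directly to the parameters
    # rather than through the gradient
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps)) - lr * weight_decay * theta
    return theta, m, v

# Toy usage: minimize the quadratic loss ||theta||^2 for a few steps
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 4):
    grad = 2 * theta                      # gradient of the toy loss
    theta, m, v = adamw_step(theta, grad, m, v, t)
```

Note that the decoupled decay term lr * weight_decay * theta is subtracted outside the adaptive scaling by √v̂_t, which is the distinction between AdamW and Adam with L2 regularization described above.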
