Learning Rate Warmup and Cosine Scheduling are complementary techniques that strategically adjust the learning rate during training: gradually increasing the learning rate during warmup prevents the gradient shock that comes with random weight initialization, while cosine annealing smoothly reduces the learning rate to enable fine-grained optimization, yielding both faster convergence and better final performance.
Learning Rate Warmup Phase:
- Linear Warmup: increase the learning rate from 0 to target_lr over warmup_steps (typically 1,000-10,000 steps) → linear_lr(t) = target_lr × (t / warmup_steps), as sketched after this list
- Initialization Impact: with random weight initialization, early gradients are large and noisy → warmup prevents large updates that would destabilize training
- Adam Optimizer Interaction: warmup is especially important for Adam; its second-moment estimates are still noisy in early steps, so without warmup the adaptive learning rates become too aggressive
- Warmup Duration: typically 10% of training steps for smaller models, 5% for large models → shorter warmup suffices for well-initialized models
- BERT Standard: 10K warmup steps over 1M total steps (a 1% ratio) → consistent across BERT variants
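A minimal sketch of the linear warmup rule above, assuming illustrative defaults (target_lr = 3e-4, warmup_steps = 10,000) rather than values from any particular paper:

```python
def linear_warmup_lr(step: int, target_lr: float = 3e-4, warmup_steps: int = 10_000) -> float:
    """Ramp the learning rate linearly from 0 to target_lr, then hold it."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps  # linear_lr(t) = target_lr * (t / warmup_steps)
```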
Mathematical Formulation:
- Linear Warmup: lr(t) = min(t/warmup_steps, 1) × base_lr for t ≤ warmup_steps
- Learning Rate at Step t: combines warmup with the base schedule (e.g., cosine) applied after the warmup phase
- Gradient Impact: with warmup, gradient magnitudes are typically 0.1-0.5 in early steps, increasing to 1.0-2.0 by the end of warmup
- Loss Curvature: warmup allows the model to move into low-loss regions before aggressive optimization begins (see the combined sketch after this list)
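Putting the two formulas together, a hedged sketch of warmup followed by cosine decay; the function name and signature are my own:

```python
import math

def warmup_cosine_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay toward 0 over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # lr(t) = min(t/warmup_steps, 1) * base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```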
Cosine Annealing Schedule:
- Formula: lr(t) = base_lr × (1 + cos(π·t/T))/2 where t is the current step and T is total steps → smooth decay from base_lr to ≈0
- Characteristics: slow decay at the start, fastest at mid-training, flattening again near the end → a natural optimization progression
- Restart Schedules: periodic resets to the base learning rate (warm restarts) can help escape poor local minima → the "SGDR" schedule, sketched after this list
- Cosine vs Step Decay: cosine changes the learning rate smoothly, avoiding the sudden drops of step schedules that can disrupt optimization
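A simplified SGDR-style restart schedule with fixed-length cycles; the original SGDR also lengthens each successive cycle by a factor T_mult, omitted here for brevity:

```python
import math

def sgdr_lr(step: int, base_lr: float, cycle_steps: int) -> float:
    """Cosine annealing with warm restarts: the learning rate snaps back
    to base_lr at the start of every cycle of cycle_steps steps."""
    t = step % cycle_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
```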
Training Curve Behavior:
- Warmup Phase (0-10K steps): loss decreases slowly (2-5% improvement per 1K steps), highly variable
- Main Training (10K-90K steps): rapid loss decrease (10-20% per 10K steps), smooth convergence trajectory
- Annealing Phase (90K-100K steps): fine-grained optimization, loss improvements <1% per 1K steps
- Final Performance: cosine annealing often achieves 1-2% better validation accuracy than linear decay over the same epoch count
Practical Examples and Benchmarks:
- BERT-Base Training: 1M steps total, 10K linear warmup, then linear decay to zero → the standard recipe across BERT variants
- GPT-2 Training: very short warmup (well under 1% of steps) followed by smooth decay → state-of-the-art zero-shot perplexity on WikiText-103 at publication
- Llama 2 Training: 2,000-step linear warmup, then cosine decay to 10% of the peak learning rate → a consistent recipe across model scales (7B to 70B); the floor variant is sketched after this list
- T5 Training: ~524K pre-training steps with 10K warmup and an inverse square root schedule → the learning rate never decays all the way to zero, preserving a gradient signal
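A sketch of the decay-to-floor variant appearing in recipes like Llama 2's; min_lr_ratio = 0.1 mirrors the 10%-of-peak floor mentioned above, and the other defaults are illustrative:

```python
import math

def warmup_cosine_floor_lr(step, base_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
    """Warmup, then cosine decay that bottoms out at min_lr_ratio * base_lr
    instead of zero, preserving a gradient signal late in training."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    min_lr = min_lr_ratio * base_lr
    return min_lr + (base_lr - min_lr) * cosine
```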
Advanced Scheduling Variants:
- Warmup and Polynomial Decay: lr = base_lr × max(0, 1 - t/total_steps)^p where p ∈ [0.5, 2.0] → an alternative to cosine (sketched, with the two variants below, after this list)
- Step-Based Decay: reduce the learning rate by a fixed factor (e.g., 0.1×) at specific steps → coarse-grained control
- Exponential Decay: lr(t) = base_lr × decay_rate^t → smooth exponential decrease
- Inverse Square Root: lr(t) = c / √t → used after linear warmup in the original Transformer paper
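The three closed-form variants above, sketched as short functions; names and default values are illustrative assumptions:

```python
def polynomial_lr(step, base_lr, total_steps, power=1.0):
    """Polynomial decay; power = 1.0 reduces to plain linear decay."""
    return base_lr * max(0.0, 1.0 - step / total_steps) ** power

def exponential_lr(step, base_lr, decay_rate=0.9999):
    """Per-step exponential decay (the factor is often applied per epoch instead)."""
    return base_lr * decay_rate ** step

def inverse_sqrt_lr(step, base_lr, warmup_steps=4_000):
    """Transformer-style schedule: linear warmup, then 1/sqrt(step) decay,
    normalized so the peak equals base_lr at step == warmup_steps."""
    step = max(1, step)
    return base_lr * (warmup_steps ** 0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
```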
Interaction with Batch Size:
- Large Batch Training: larger batch sizes tolerate, and benefit from, higher learning rates after warmup → faster convergence per step
- Scaling Rules: linear scaling (lr_new = lr_old × batch_ratio) is the common heuristic; square-root scaling (lr_new = lr_old × √batch_ratio) is sometimes preferred for adaptive optimizers, and LARS adds layer-wise adaptation for very large batches
- Warmup Adjustment: warmup is usually held fixed in examples seen, so warmup steps scale inversely with batch size → warmup_steps_new = warmup_steps × (batch_size_old / batch_size_new)
- Linear Scaling Rule: multiplying batch size by k and the learning rate by k keeps the per-example update magnitude roughly constant, up to a task-dependent batch-size limit → both heuristics are sketched after this list
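A small helper sketching both scaling heuristics; scale_for_batch_size is a hypothetical name, and the rules are starting points for tuning rather than guarantees:

```python
def scale_for_batch_size(base_lr, base_warmup_steps, old_bs, new_bs, rule="linear"):
    """Adjust lr and warmup when the effective batch size changes: the linear
    rule multiplies lr by the batch ratio, the sqrt rule by its square root;
    warmup is held fixed in examples seen, so warmup steps shrink as batch grows."""
    ratio = new_bs / old_bs
    lr = base_lr * (ratio if rule == "linear" else ratio ** 0.5)
    warmup_steps = max(1, round(base_warmup_steps / ratio))
    return lr, warmup_steps

# Going from batch 256 to 1024 under the linear rule:
# lr 3e-4 -> 1.2e-3, warmup 10_000 steps -> 2_500 steps
```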
Optimizer-Specific Considerations:
- SGD Warmup: less critical than for Adam, but still helpful for stability → a simple learning rate schedule is often sufficient
- Adam Warmup: essential due to adaptive learning rate behavior → without warmup, early adaptive rates are too aggressive while second-moment estimates are still noisy
- LAMB Optimizer: layer-wise adaptation enables much larger batch sizes → reduces warmup importance, though it remains beneficial
- AdamW (Decoupled Weight Decay): decouples weight decay from the adaptive gradient update → warmup remains important for stability (a PyTorch wiring is sketched after this list)
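For PyTorch users, one way to wire warmup plus cosine decay onto AdamW with the built-in LambdaLR, CosineAnnealingLR, and SequentialLR schedulers; the model and hyperparameters are placeholders:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 10)  # stand-in model
warmup_steps, total_steps = 1_000, 100_000

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
warmup = LambdaLR(optimizer, lr_lambda=lambda s: min(1.0, (s + 1) / warmup_steps))
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=3e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# In the training loop, call optimizer.step() and then scheduler.step() once per batch.
```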
Multi-Phase Training Strategies:
- Pre-training then Fine-tuning: pre-training uses a full warmup and cosine schedule over hundreds of thousands to millions of steps; fine-tuning uses a short warmup (500-1,000 steps) with aggressive decay
- Progressive Warmup: gradually increasing the batch size alongside learning rate warmup → enables stable large-batch training
- Cyclic Learning Rates: combining warmup with periodic restarts → explores different regions of the loss landscape (see the warm-restart example after this list)
- Curriculum Learning Integration: warmup pairs naturally with curricula, keeping the learning rate low on easy examples before annealing over the harder distribution → improves sample efficiency
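PyTorch also ships warm restarts directly as CosineAnnealingWarmRestarts; a minimal usage sketch with illustrative cycle lengths:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 10)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# The first cycle lasts T_0 scheduler steps; each later cycle is T_mult times longer.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10_000, T_mult=2, eta_min=1e-4)
```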
Empirical Tuning Guidelines:
- Warmup Fraction: commonly 1-10% of total training steps (e.g., 10K out of 100K-1M) → longer warmup for less stable setups or harder tasks
- Cosine Minimum: setting a minimum learning rate (e.g., 0.1 × base) prevents decay to exactly zero → maintains a gradient signal late in training
- Base Learning Rate: determined separately through grid search; typically 1e-5 to 5e-5 for fine-tuning pretrained transformers, 1e-4 to 1e-3 for pre-training
- Total Steps: estimated as epochs × steps_per_epoch; commonly 1-3M steps for pre-training, 10K-100K for fine-tuning → a worked sizing example follows this list
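A back-of-the-envelope sizing of a fine-tuning schedule following these guidelines; every number below is an assumption for illustration:

```python
# Hypothetical fine-tuning run: 1M examples, batch 256, 3 epochs.
num_examples, batch_size, epochs = 1_000_000, 256, 3
steps_per_epoch = num_examples // batch_size   # 3906
total_steps = epochs * steps_per_epoch         # 11718
warmup_steps = int(0.06 * total_steps)         # 6% warmup -> 703 steps
base_lr, min_lr = 3e-5, 3e-6                   # fine-tuning range with a 10% floor
print(total_steps, warmup_steps, base_lr, min_lr)
```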
Distributed Training Considerations:
- Synchronization: warmup and annealing affect gradient updates across devices → consistent schedules are important for reproducibility
- Effective Batch Size: total batch size (per-GPU × num_GPUs) determines learning rate scaling → warmup duration should scale proportionally
- Checkpointing and Resumption: maintain a consistent learning rate schedule across checkpoint restarts → track the step count globally, as sketched after this list
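One common pattern is to make the schedule a pure function of a globally tracked step count, so resumed or multi-device runs recompute identical learning rates; names and defaults below are illustrative:

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=2_000, total_steps=500_000):
    """Deterministic lr as a function of the global step, so every worker
    and every resumed run lands on the same schedule."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

checkpoint = {"global_step": 123_456}          # saved alongside the model weights
resume_lr = lr_at(checkpoint["global_step"])   # identical on every device
```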
Learning Rate Warmup and Cosine Scheduling are fundamental optimization techniques, enabling stable training of deep networks through strategic learning rate management that combines initialization protection (warmup) with smooth convergence (cosine annealing).