Loss Scaling Techniques are numerical methods for preventing gradient underflow in FP16 training: the loss is multiplied by a large scale factor (typically 1024-65536) before backpropagation, amplifying small gradients into the representable FP16 range, and the gradients are divided by the same factor before the optimizer step. This enables stable FP16 training that would otherwise stagnate from gradient underflow, though the technique has been largely obsoleted by BF16, whose FP32-sized exponent range avoids underflow without any scaling.
Gradient Underflow Problem:
- FP16 Range: smallest positive normal number is 2⁻¹⁴ ≈ 6×10⁻⁵, and subnormals extend only to ≈6×10⁻⁸; gradients below this range underflow to zero; common in later training stages when gradients become small
- Impact: underflowed gradients cause weights to stop updating; training stagnates; validation loss plateaus; model fails to converge to optimal accuracy
- Frequency: without loss scaling, 20-50% of gradients underflow in typical deep networks; critical layers (early layers in ResNet, embedding layers in Transformers) particularly affected
- Detection: histogram of gradient magnitudes shows spike at zero; indicates underflow; compare FP16 vs FP32 gradient distributions
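The underflow threshold can be checked directly with Python's standard struct module, which supports the IEEE 754 half-precision format ('e'); this toy snippet shows a gradient value vanishing in FP16 and surviving once scaled (the values are illustrative):

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8                   # below FP16's subnormal floor (~6e-8)
print(to_fp16(tiny_grad))          # 0.0 -> the gradient is lost

scale = 1024.0                     # loss scaling rescues it
print(to_fp16(tiny_grad * scale))  # ~1e-5, nonzero and representable
```

Casting a full gradient tensor this way and counting zeros is exactly the histogram check described above.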
Static Loss Scaling:
- Mechanism: multiply loss by fixed scale S before backward(); loss_scaled = loss × S; gradients scaled by S; unscale before optimizer: grad_unscaled = grad_scaled / S
- Scale Selection: typical values 128-2048; too small → underflow persists; too large → overflow (gradients >65504); requires manual tuning per model and dataset
- Implementation: loss_scaled = loss * scale; loss_scaled.backward(); for param in model.parameters(): param.grad /= scale; optimizer.step()
- Limitations: optimal scale varies during training; early training tolerates higher scale; late training requires lower scale; static scale suboptimal throughout training
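A minimal runnable version of the static-scaling loop above, sketched with a toy CPU model (the model, data, and scale value are illustrative; real FP16 training would also cast the model and inputs to half precision on a GPU):

```python
import torch

# Toy static-loss-scaling loop mirroring the Implementation bullet above.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scale = 1024.0                         # fixed scale, chosen manually

x, y = torch.randn(8, 4), torch.randn(8, 1)
for _ in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss * scale).backward()          # backward on the scaled loss
    for param in model.parameters():
        param.grad /= scale            # unscale before the step
    optimizer.step()
```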
Dynamic Loss Scaling:
- Adaptive Scaling: automatically adjusts scale based on overflow detection; starts high (65536); decreases on overflow; increases when stable; converges to optimal scale
- Growth Phase: if no overflow for N consecutive steps (N=2000 typical), scale *= 2; gradually increases to maximize gradient precision; exploits periods of stability
- Backoff Phase: if overflow detected (any gradient contains Inf/NaN), scale /= 2; skip optimizer step; prevents NaN propagation; retries next iteration with lower scale
- Convergence: scale typically converges to 1024-8192; balances underflow prevention (scale too low) with overflow avoidance (scale too high); adapts to training dynamics
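The growth/backoff policy can be sketched as a small pure-Python state machine (the class and method names are made up for illustration; PyTorch's GradScaler implements the same policy internally):

```python
import math

class DynamicLossScaler:
    """Illustrative sketch of dynamic loss scaling's growth/backoff policy."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads):
        """Inspect gradients; return True if the optimizer step may run."""
        if any(not math.isfinite(g) for g in grads):
            self.scale *= self.backoff_factor   # backoff: halve on overflow
            self._good_steps = 0
            return False                        # skip optimizer.step()
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor    # growth: double when stable
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(growth_interval=2)
scaler.update([float('inf')])               # overflow: 65536 -> 32768
scaler.update([0.1]); scaler.update([0.2])  # two clean steps: back to 65536
print(scaler.scale)                         # 65536.0
```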
Overflow Detection and Handling:
- Detection: check if any gradient contains Inf or NaN; torch.isfinite(grad).all() for each parameter; single Inf/NaN indicates overflow
- Skip Step: when overflow detected, skip optimizer.step(); weights unchanged; prevents NaN propagation through model; training continues with reduced scale
- Gradient Zeroing: zero_grad() after skipped step; clears overflowed gradients; next iteration uses reduced scale; typically succeeds without overflow
- Frequency: well-tuned dynamic scaling overflows 0.1-1% of steps; higher frequency indicates scale too aggressive or learning rate too high
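A toy illustration, using plain Python floats, of why a single overflowed gradient forces a skipped step:

```python
import math

# One overflowed gradient, if applied, poisons the weights permanently:
w = 1.0
bad_update = w - 0.01 * float('inf')   # weight becomes -inf
print(bad_update * 0.0)                # nan: Inf times 0 in the next forward

# The finiteness guard applied before optimizer.step()
# (torch.isfinite(grad).all() in PyTorch):
grads = [0.1, float('nan'), 0.3]
step_ok = all(math.isfinite(g) for g in grads)
print(step_ok)                         # False -> skip step, reduce scale
```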
GradScaler Implementation (PyTorch):
- Initialization: scaler = torch.cuda.amp.GradScaler(init_scale=65536, growth_factor=2, backoff_factor=0.5, growth_interval=2000)
- Forward and Backward: with autocast(): loss = loss_fn(model(input), target); then scaler.scale(loss).backward() → scales loss, computes scaled gradients (backward runs outside the autocast block)
- Optimizer Step: scaler.step(optimizer) → unscales gradients, checks for overflow, steps optimizer if no overflow, skips if overflow
- Scale Update: scaler.update() → adjusts scale based on overflow status; increases if no overflow for growth_interval steps; decreases if overflow
- State Management: scaler maintains internal state (current scale, growth tracker, overflow status); persists across iterations; enables adaptive behavior
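Putting the calls together, a hedged sketch of the canonical AMP loop with a toy model: on a CPU-only machine autocast and the scaler are disabled so the sketch still runs, while on CUDA it performs real FP16 training with dynamic loss scaling (model, data, and hyperparameters are illustrative):

```python
import torch

use_cuda = torch.cuda.is_available()
device = 'cuda' if use_cuda else 'cpu'
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

torch.manual_seed(0)
model = torch.nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(init_scale=65536.0, enabled=use_cuda)
x, y = torch.randn(8, 4, device=device), torch.randn(8, 1, device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_cuda):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()      # scaled backward pass
    scaler.step(optimizer)             # unscale, overflow check, maybe step
    scaler.update()                    # grow or back off the scale
```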
Gradient Clipping with Loss Scaling:
- Unscale Before Clipping: scaler.unscale_(optimizer); torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm); scaler.step(optimizer); scaler.update()
- Reason: gradient norm computed on scaled gradients is incorrect; norm_scaled = norm_unscaled ร scale; clipping on scaled gradients clips at wrong threshold
- Unscale Operation: divides all gradients by current scale; makes gradients comparable to FP32 training; enables correct norm calculation and clipping
- Multiple Unscale: calling unscale_() multiple times is safe (no-op after first call); enables flexible code organization
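The threshold error from clipping scaled gradients can be verified with a two-element toy gradient:

```python
import math

# Two toy (unscaled) gradient components and their scaled counterparts:
grads = [0.3, 0.4]
scale = 1024.0
scaled = [g * scale for g in grads]

true_norm = math.hypot(*grads)     # ~0.5
scaled_norm = math.hypot(*scaled)  # ~512.0 = true_norm * scale
print(true_norm, scaled_norm)

# With max_norm=1.0, clipping the scaled gradients would shrink them ~512x,
# while the correctly unscaled gradients need no clipping at all (0.5 <= 1.0).
```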
Loss Scaling with Gradient Accumulation:
- Scaling Pattern: loss_scaled = (loss / accumulation_steps) * scale; loss_scaled.backward() → scale accounts for both accumulation and FP16
- Accumulation: gradients accumulate in scaled form; unscale once after all accumulation steps; optimizer step uses unscaled accumulated gradients
- Implementation: inside the loop, for each microbatch: loss = model(input[i]); scaler.scale(loss / accumulation_steps).backward(); then, after the loop completes: scaler.step(optimizer); scaler.update(); optimizer.zero_grad()
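The arithmetic behind this pattern, checked on toy per-microbatch gradient values:

```python
# Each backward() on (loss_i / K) * scale adds grad_i * scale / K into
# .grad, so a single unscale after the loop recovers the mean gradient.
micro_grads = [0.4, 0.2, 0.6, 0.8]   # toy per-microbatch gradients
K, scale = len(micro_grads), 1024.0

accumulated = sum(g * scale / K for g in micro_grads)  # what .grad holds
unscaled = accumulated / scale                         # one unscale at the end
print(unscaled)   # ~0.5, the mean of the microbatch gradients
```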
BF16 Eliminates Loss Scaling:
- BF16 Range: smallest positive normal number is 2⁻¹²⁶ ≈ 1.2×10⁻³⁸; same exponent range as FP32; gradient underflow extremely rare
- Simplified Code: no GradScaler needed; with autocast(dtype=torch.bfloat16): forward pass; then loss.backward(); optimizer.step() → roughly 2 lines vs 5 for FP16
- Stability: BF16 training stability comparable to FP32; FP16 occasionally diverges even with dynamic scaling; BF16 rarely diverges
- Recommendation: use BF16 on Ampere/Hopper; use FP16 with loss scaling only on Volta/Turing
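The range difference can be demonstrated with the standard struct module, simulating BF16 by truncating an FP32 bit pattern to its top 16 bits (real hardware rounds rather than truncates, which does not change the outcome here):

```python
import struct

def fp16(x: float) -> float:
    """Round a float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

def bf16(x: float) -> float:
    """Approximate BF16 by truncating an FP32 value to its top 16 bits."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

grad = 1e-20          # far below FP16's range, easy for FP32/BF16
print(fp16(grad))     # 0.0  -> FP16 would need loss scaling here
print(bf16(grad))     # ~1e-20 -> BF16 keeps it with no scaling
```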
Debugging Loss Scaling Issues:
- Scale Monitoring: log scaler.get_scale() every N steps; if scale <100, frequent overflow; if scale >100000, possible underflow; optimal 1024-8192
- Overflow Frequency: count skipped steps; >5% indicates problem; reduce learning rate or use BF16; <0.1% is normal
- Gradient Histogram: plot gradient magnitudes; spike at zero indicates underflow; spike at 65504 indicates overflow; normal distribution indicates good scaling
- Convergence Comparison: compare FP16+scaling vs FP32 convergence; if FP16 diverges or converges slower, increase initial scale or use BF16
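A small diagnostic along the lines of the gradient-histogram check above, computing the fraction of gradient values that would flush to zero in FP16 (the function name is made up; toy values for illustration):

```python
import struct

def fp16_zero_fraction(grads):
    """Fraction of nonzero gradient values that flush to zero in FP16
    (a spike at zero in the histogram = underflow)."""
    def to_fp16(x):
        return struct.unpack('e', struct.pack('e', x))[0]
    lost = sum(1 for g in grads if g != 0.0 and to_fp16(g) == 0.0)
    return lost / len(grads)

print(fp16_zero_fraction([1e-3, -2e-4, 5e-2]))   # 0.0: healthy
print(fp16_zero_fraction([1e-9, -3e-10, 1e-8]))  # 1.0: raise the scale
```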
Advanced Techniques:
- Per-Layer Scaling: different scale for different layers; early layers use higher scale (smaller gradients); later layers use lower scale (larger gradients); complex but optimal
- Adaptive Growth Interval: adjust growth_interval based on overflow frequency; frequent overflow → longer interval; rare overflow → shorter interval; faster convergence to optimal scale
- Scale Warmup: start with low scale (1024), gradually increase to 65536 over first 1000 steps; prevents early training instability; then switch to dynamic scaling
- Overflow Prediction: predict overflow before it occurs using gradient statistics; preemptively reduce scale; avoids skipped steps; experimental technique
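The warmup idea can be sketched as a simple schedule function (the names and the exponential ramp shape are illustrative; the bullet does not prescribe a particular ramp):

```python
def warmup_scale(step, low=1024.0, high=65536.0, warmup_steps=1000):
    """Exponential ramp from `low` to `high` over the first warmup_steps,
    after which a dynamic scaler would take over."""
    if step >= warmup_steps:
        return high
    return low * (high / low) ** (step / warmup_steps)

print(warmup_scale(0))       # 1024.0
print(warmup_scale(500))     # ~8192.0 (geometric midpoint)
print(warmup_scale(1000))    # 65536.0
```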
Performance Impact:
- Overhead: loss scaling adds <1% overhead; scale/unscale operations are element-wise multiplications; negligible compared to forward/backward pass
- Skipped Steps: each skipped step wastes one forward+backward pass; 1% overflow rate → 1% wasted compute; acceptable for stability benefits
- Memory: GradScaler state is <1 KB; negligible memory overhead; no impact on batch size or model size
Loss scaling techniques are the numerical engineering that made FP16 training practical: by amplifying small gradients into the representable range and carefully managing overflow, loss scaling enabled the 2-4× training speedup of FP16 on Volta/Turing GPUs, though the advent of BF16 on Ampere/Hopper has largely obsoleted these techniques by providing sufficient numerical range without the scaling complexity.