Loss scaling techniques are numerical methods that prevent gradient underflow in FP16 training. The loss is multiplied by a large scale factor (typically 1,024 to 65,536) before backpropagation, which shifts small gradients into the range FP16 can represent; the gradients are then divided by the same factor before the optimizer step. This keeps FP16 training stable where underflowed gradients would otherwise stall convergence. Loss scaling is largely unnecessary with BF16, which keeps the 8-bit exponent of FP32 and therefore has enough range to avoid underflow without scaling.

Gradient Underflow Problem:
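
As a rough illustration of the underflow described above, the sketch below rounds a value to FP16 using Python's `struct` module (format code `e`). The gradient value `1e-8` and the helper name `to_fp16` are assumptions chosen for the example. In real training the loss is scaled and the chain rule carries the factor into every gradient; here the gradient is multiplied directly to keep the example short.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the FP16 maximum (65504);
        # FP16 hardware would produce +/-inf in that case.
        return float("inf") if x > 0 else float("-inf")

true_grad = 1e-8   # hypothetical gradient, below FP16's smallest subnormal (~5.96e-8)
scale = 65536.0    # 2**16, the upper end of the range quoted above

unscaled_fp16 = to_fp16(true_grad)        # underflows to zero
scaled_fp16 = to_fp16(true_grad * scale)  # lands in FP16's normal range
recovered = scaled_fp16 / scale           # unscale in higher precision

print(unscaled_fp16, scaled_fp16, recovered)
```

Without scaling, the gradient is lost entirely; with scaling, it survives the FP16 round trip with only half-precision rounding error.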

Static Loss Scaling:

Dynamic Loss Scaling:
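
One common way to adjust the scale at run time is to shrink it when an overflow is seen and grow it after a run of clean steps. The class below is a minimal sketch of that policy, not any library's implementation. The name `DynamicLossScale` is hypothetical, and the default values are chosen to match the documented defaults of PyTorch's `GradScaler` (`init_scale=2**16`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
class DynamicLossScale:
    """Toy grow/backoff policy for a loss scale (illustrative only)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if found_overflow:
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A full interval of clean steps: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True

# Short interval so the growth is visible in a few steps.
scaler = DynamicLossScale(growth_interval=3)
history = []
for overflow in [False, False, False, True, False]:
    scaler.update(overflow)
    history.append(scaler.scale)

print(history)
```

The scale doubles after three clean steps, then halves again on the simulated overflow at the fourth step.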

Overflow Detection and Handling:
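
A simple detection scheme is to check the scaled gradients for inf or NaN and skip the optimizer step when any are found. The sketch below assumes that scheme; the function names, the plain-SGD update, and the sample gradients are all hypothetical.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to FP16, mapping out-of-range finite values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def grads_are_finite(grads) -> bool:
    """True when no gradient is inf or NaN."""
    return all(math.isfinite(g) for g in grads)

def scaled_step(weights, true_grads, scale, lr=0.1):
    """Return (new_weights, stepped); the step is skipped on overflow."""
    scaled = [to_fp16(g * scale) for g in true_grads]
    if not grads_are_finite(scaled):
        return list(weights), False          # leave weights untouched
    unscaled = [g / scale for g in scaled]   # unscale in higher precision
    return [w - lr * g for w, g in zip(weights, unscaled)], True

weights = [1.0, -2.0]
grads = [0.5, 3.0]  # hypothetical gradients of the unscaled loss

# 3.0 * 65536 exceeds the FP16 maximum (65504), so this step is skipped.
same_w, stepped_big = scaled_step(weights, grads, scale=65536.0)
# With a smaller scale both gradients are representable and the step runs.
new_w, stepped_small = scaled_step(weights, grads, scale=1024.0)

print(stepped_big, stepped_small, new_w)
```

Skipping the step keeps a single overflowed batch from corrupting the weights; only that batch's update is lost.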

GradScaler Implementation (PyTorch):

Gradient Clipping with Loss Scaling:
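
Clipping thresholds are defined for true gradient magnitudes, so the gradients need to be unscaled before they are clipped. In PyTorch this corresponds to calling `scaler.unscale_(optimizer)` before `torch.nn.utils.clip_grad_norm_`. The pure-Python sketch below uses hypothetical values and a hypothetical `clip_by_global_norm` helper to show what goes wrong when the order is reversed.

```python
import math

def global_norm(grads) -> float:
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

scale = 1024.0
max_norm = 1.0
true_grads = [0.3, 0.4]  # global norm 0.5, already under the threshold
scaled_grads = [g * scale for g in true_grads]

# Wrong order: the scaled norm (512) is compared against max_norm,
# so the gradients are crushed, then shrunk again by unscaling.
wrong = [g / scale for g in clip_by_global_norm(scaled_grads, max_norm)]

# Right order: unscale first, then clip against the true norm.
right = clip_by_global_norm([g / scale for g in scaled_grads], max_norm)

print(global_norm(wrong), global_norm(right))
```

In the wrong order, gradients that needed no clipping end up about 500 times smaller than intended.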

Loss Scaling with Gradient Accumulation:
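
When gradients are accumulated over several micro-batches, a reasonable pattern is to keep the scale fixed for the whole accumulation window, sum the scaled gradients, and unscale once before the optimizer step. The sketch below assumes that pattern, with hypothetical micro-batch gradients and a higher-precision accumulation buffer standing in for an FP32 gradient tensor.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to FP16, mapping out-of-range finite values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

scale = 65536.0
accum_steps = 4
micro_grads = [2e-8, 3e-8, 1e-8, 2e-8]  # hypothetical per-micro-batch gradients

# Without scaling, every contribution underflows to zero in FP16.
naive = sum(to_fp16(g / accum_steps) for g in micro_grads)

# With scaling: the loss is divided by accum_steps so the sum is an average,
# and the same scale is used for every micro-batch in the window.
accumulated = 0.0  # higher-precision buffer
for g in micro_grads:
    accumulated += to_fp16(g / accum_steps * scale)

final_grad = accumulated / scale           # unscale once, before the step
expected = sum(micro_grads) / accum_steps  # average of the true gradients

print(naive, final_grad, expected)
```

Changing the scale in the middle of a window would mix gradients scaled by different factors, which a single unscale could not undo.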

BF16 Eliminates Loss Scaling:
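
To make the range argument concrete, the sketch below rounds the same small value to FP16 and to BF16. BF16 is emulated by keeping the top 16 bits of the float32 encoding with round-to-nearest-even; the helper names are assumptions, and NaN inputs are not handled.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to FP16, mapping out-of-range finite values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate bfloat16 rounding via the float32 bit pattern (no NaN handling)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny_grad = 1e-8  # hypothetical gradient that FP16 cannot represent

fp16_value = to_fp16(tiny_grad)  # underflows to zero
bf16_value = to_bf16(tiny_grad)  # survives, with reduced precision

print(fp16_value, bf16_value)
```

The BF16 value is nonzero without any scaling, at the cost of fewer mantissa bits than FP16.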

Debugging Loss Scaling Issues:

Advanced Techniques:

Performance Impact:

Loss scaling is the numerical engineering that made FP16 training practical. By amplifying small gradients into the representable range and carefully managing overflow, it enabled 2-4× training speedups on Volta and Turing GPUs. BF16 support on Ampere and Hopper has since made these techniques largely unnecessary, because BF16 provides sufficient numerical range without the complexity of scaling.
