Mixed Precision Training

Keywords: mixed precision training, fp16 training, bfloat16 training, automatic mixed precision (AMP), loss scaling

Mixed Precision Training is the technique that uses lower precision (FP16 or BF16) for most computations while maintaining FP32 for critical operations — reducing memory usage by 40-50% and accelerating training by 2-3× on modern GPUs with Tensor Cores, while preserving model convergence and final accuracy through careful loss scaling and selective FP32 accumulation.

Precision Formats:
- FP32 (Float32): standard precision; 1 sign bit, 8 exponent bits, 23 mantissa bits; range 10^-38 to 10^38; precision ~7 decimal digits; default for deep learning training
- FP16 (Float16): half precision; 1 sign, 5 exponent, 10 mantissa; range ~6×10^-8 (smallest subnormal) to 65504; precision ~3 decimal digits; 2× memory reduction; supported on NVIDIA Volta+ (V100, A100, H100)
- BF16 (BFloat16): brain float; 1 sign, 8 exponent, 7 mantissa; same range as FP32 (10^-38 to 10^38); coarser precision but essentially no overflow risk; preferred for training; supported on NVIDIA Ampere+ (A100, H100), Google TPU, Intel
- TF32 (TensorFloat32): NVIDIA format; 1 sign, 8 exponent, 10 mantissa; automatic on Ampere+ for FP32 operations; transparent speedup with no code changes; 8× faster matmul vs FP32
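The bit layouts above determine each format's range and resolution. As a quick illustration (a minimal sketch assuming PyTorch is installed), torch.finfo reports these properties directly, and casting a value above 65504 shows FP16 overflowing where BF16 does not; TF32 is a matmul compute mode rather than a storage dtype, so it is not listed:

```python
import torch

# Inspect the range and resolution of each storage format.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  "
          f"smallest_normal={info.tiny:.3e}  eps={info.eps:.3e}")

# FP16 overflows above 65504; BF16 keeps FP32's exponent range but rounds coarsely.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (overflow)
print(x.to(torch.bfloat16))  # ~70144.0 (rounded, but no overflow)
```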

Mixed Precision Training Algorithm:
- Forward Pass: compute activations in FP16/BF16; store activations in FP16/BF16 for memory savings; matmul operations use Tensor Cores (8-16× faster than FP32 CUDA cores)
- Loss Computation: compute loss in FP16/BF16; apply loss scaling (multiply by large constant, typically 2^16) to prevent gradient underflow; scaled loss prevents small gradients from becoming zero in FP16
- Backward Pass: compute gradients in FP16/BF16; unscale gradients (divide by loss scale); check for inf/nan (indicates overflow); skip update if overflow detected
- Optimizer Step: convert FP16/BF16 gradients to FP32; maintain an FP32 master copy of the weights; update the FP32 weights; cast back to FP16/BF16 for the next iteration (a PyTorch sketch of the full loop follows this list)
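A minimal sketch of this loop using PyTorch's native AMP; the tiny model, data, and hyperparameters below are placeholders, not from the source. With native AMP the weights themselves stay in FP32 and autocast casts per operation, which plays the role of the FP32 master copy described above:

```python
import torch
from torch import nn

device = "cuda"  # Tensor Core speedups assume a Volta-or-newer NVIDIA GPU
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()   # handles dynamic loss scaling

for step in range(100):
    inputs = torch.randn(64, 512, device=device)           # placeholder batch
    targets = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad(set_to_none=True)

    # Forward pass: matmuls run in FP16 on Tensor Cores; weights remain FP32.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(inputs), targets)

    # Backward on the scaled loss so tiny gradients do not flush to zero.
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients; skips the step on inf/nan
    scaler.update()          # grows or shrinks the scale factor dynamically
```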

Loss Scaling:
- Static Scaling: fixed scale factor (typically 2^16 for FP16); simple but may overflow or underflow; requires manual tuning per model
- Dynamic Scaling: automatically adjusts scale factor; increase by 2× every N steps if no overflow; decrease by 0.5× if overflow detected; typical N=2000; robust across models and tasks
- Gradient Clipping: clip gradients after unscaling so the threshold applies to true gradient magnitudes; prevents extreme values from destabilizing updates; typical threshold 1.0-5.0; essential for stable training (see the GradScaler sketch after this list)
- BF16 Advantage: BF16 rarely needs loss scaling due to larger exponent range; simplifies training; reduces overhead; preferred when available
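The dynamic schedule above maps directly onto the knobs of PyTorch's GradScaler; the values below are its defaults, written out explicitly for illustration, and the commented loop shows the unscale-then-clip ordering:

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0 ** 16,   # starting scale, typically 2^16
    growth_factor=2.0,      # multiply by 2x after a clean run of steps
    backoff_factor=0.5,     # halve the scale when inf/nan gradients appear
    growth_interval=2000,   # "N" overflow-free steps before growing
)

# Inside the training loop (model/optimizer as in the earlier sketch):
#   scaler.scale(loss).backward()
#   scaler.unscale_(optimizer)                            # back to true gradients
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   scaler.step(optimizer)                                # skipped if inf/nan found
#   scaler.update()
```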

Memory and Speed Benefits:
- Memory Reduction: activations and gradients in FP16/BF16 reduce memory by 40-50%; enables 1.5-2× larger batch sizes; critical for large models (GPT-3 scale requires mixed precision)
- Tensor Core Acceleration: FP16/BF16 matmul 8-16× faster than FP32 on Tensor Cores; A100 delivers 312 TFLOPS FP16 vs 19.5 TFLOPS FP32; H100 delivers 1000 TFLOPS FP16 vs 60 TFLOPS FP32
- Bandwidth Savings: 2× less data movement between HBM and compute; reduces memory bottleneck; particularly beneficial for memory-bound operations (element-wise, normalization)
- End-to-End Speedup: 2-3× faster training for large models (BERT, GPT, ResNet); speedup increases with model size; smaller models may see 1.5-2× due to overhead
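Back-of-the-envelope arithmetic makes the activation saving concrete (a sketch with a hypothetical activation shape; overall savings are below 2× because optimizer state and master weights stay in FP32):

```python
import torch

batch, seq_len, hidden = 32, 2048, 4096          # hypothetical activation shape
numel = batch * seq_len * hidden

for dtype in (torch.float32, torch.bfloat16):
    gib = numel * (torch.finfo(dtype).bits // 8) / 2**30
    print(f"{dtype}: {gib:.2f} GiB per activation tensor")   # 1.00 GiB vs 0.50 GiB
```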

Numerical Stability Considerations:
- Gradient Underflow: gradients below FP16's smallest subnormal (~6×10^-8) flush to zero, and values below ~6×10^-5 lose precision; loss scaling prevents this; critical for early layers in deep networks where gradients are small
- Activation Overflow: large activations (>65504) overflow in FP16; rare with proper initialization and normalization; BF16 eliminates this issue
- Accumulation Precision: sum reductions (batch norm, softmax) use FP32 accumulation; prevents precision loss from many small additions; critical for numerical stability
- Layer Norm: compute in FP32 for stability; variance computation sensitive to precision; FP16 layer norm can cause training divergence
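The accumulation point is easy to demonstrate: a running FP16 sum of many small values stalls once the total is large relative to FP16's spacing, which is why frameworks accumulate reductions in FP32. A small illustrative sketch:

```python
import torch

x = torch.full((50_000,), 1e-4, dtype=torch.float16)   # true sum = 5.0

fp16_sum = torch.zeros((), dtype=torch.float16)
for v in x:                      # naive element-by-element FP16 accumulation
    fp16_sum += v

fp32_sum = x.sum(dtype=torch.float32)   # FP32 accumulator, as frameworks use

print(fp16_sum.item())   # stalls near 0.25: further additions round away
print(fp32_sum.item())   # ~5.0
```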

Framework Implementation:
- PyTorch AMP: torch.cuda.amp.autocast() for automatic mixed precision; GradScaler for loss scaling; minimal code changes; automatic operation selection (FP16 vs FP32)
- TensorFlow AMP: tf.keras.mixed_precision API; automatic loss scaling; policy-based precision control; seamless integration with Keras models
- NVIDIA Apex: legacy library for mixed precision; more manual control; still used for advanced use cases; being superseded by native framework support
- Automatic Operation Selection: frameworks automatically choose precision per operation; matmul in FP16/BF16, reductions in FP32, softmax in FP32; user can override for specific operations
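The per-op selection can be observed directly by inspecting output dtypes under autocast, and a nested autocast(enabled=False) region is the standard way to override it (a sketch assuming a CUDA GPU and the PyTorch API named above):

```python
import torch

a = torch.randn(1024, 1024, device="cuda")   # FP32 inputs
b = torch.randn(1024, 1024, device="cuda")

with torch.cuda.amp.autocast(dtype=torch.float16):
    mm = a @ b                           # matmul: expected to run in FP16
    sm = torch.softmax(mm, dim=-1)       # softmax: expected to be widened to FP32
    print(mm.dtype, sm.dtype)

    # Overriding the automatic choice for a precision-sensitive region:
    with torch.cuda.amp.autocast(enabled=False):
        exact = a.float() @ b.float()    # forced back to a plain FP32 matmul
    print(exact.dtype)                   # torch.float32
```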

Best Practices:
- Use BF16 When Available: simpler (no loss scaling), more stable, same speedup as FP16; preferred on A100, H100, TPU; FP16 only for older GPUs (V100)
- Gradient Accumulation: accumulate gradients in FP32 when using gradient accumulation; prevents precision loss over multiple accumulation steps
- Batch Size Tuning: increase batch size with saved memory; improves training stability and final accuracy; typical increase 1.5-2×
- Validation: verify convergence matches FP32 training; check final accuracy within 0.1-0.2%; monitor for inf/nan during training
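Putting the first practice together: on BF16-capable hardware the loop simplifies because no GradScaler is needed. A hedged sketch with placeholder model and data, assuming an Ampere-or-newer GPU:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    x = torch.randn(128, 256, device="cuda")            # placeholder batch
    y = torch.randint(0, 10, (128,), device="cuda")
    optimizer.zero_grad(set_to_none=True)

    # BF16 autocast: same Tensor Core speedup, FP32-like range, no loss scaling.
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(x), y)

    loss.backward()          # parameter gradients land in FP32 (weights are FP32)
    optimizer.step()

    if not torch.isfinite(loss):   # monitor for inf/nan, as recommended above
        raise RuntimeError(f"non-finite loss at step {step}")
```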

Model-Specific Considerations:
- Transformers: work well with mixed precision; attention computation benefits from Tensor Cores; layer norm in FP32 critical; standard practice for BERT, GPT training
- CNNs: excellent mixed precision performance; conv operations highly optimized for Tensor Cores; batch norm in FP32; ResNet, EfficientNet train stably in FP16/BF16
- RNNs: more sensitive to precision; may require FP32 for hidden state accumulation; LSTM/GRU can diverge in FP16 without careful tuning; BF16 more stable
- GANs: discriminator/generator can have different precision needs; may require FP32 for discriminator stability; generator typically fine in FP16/BF16
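When a specific module is precision-sensitive (layer norm in transformers, the recurrence in RNNs, the discriminator in a GAN), a common pattern is to opt just that module out of autocast and force FP32 regardless of the automatic policy. A minimal sketch using a hypothetical FP32LayerNorm wrapper:

```python
import torch
from torch import nn

class FP32LayerNorm(nn.LayerNorm):
    """LayerNorm that always computes in FP32, then casts back to the input dtype."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.cuda.amp.autocast(enabled=False):
            return super().forward(x.float()).to(x.dtype)

# Drop-in replacement inside a transformer block, for example:
#   self.norm = FP32LayerNorm(hidden_size)
```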

Mixed Precision Training is the essential technique that makes modern large-scale deep learning practical — by leveraging specialized hardware (Tensor Cores) and careful numerical management, it delivers 2-3× speedup and 40-50% memory reduction with no accuracy loss, enabling the training of models that would otherwise be impossible within reasonable time and budget constraints.
