Mixed Precision Training

Mixed Precision Training is a deep learning training technique that uses lower-precision floating-point formats (FP16, BF16, or FP8) for computation while maintaining FP32 master weights for numerical stability. On modern AI accelerators it delivers up to 3× higher throughput and roughly 2× lower activation memory with minimal impact on model accuracy, and it is now considered the default training mode for essentially all large-scale deep learning work.

Why Numerical Precision Matters

Neural network training involves billions of floating-point multiply-accumulate operations per step. Higher-precision formats (FP32, FP64) represent real numbers with more bits, reducing rounding errors that accumulate across deep networks. However, higher precision comes at a direct throughput cost: the NVIDIA H100 SXM datasheet lists 989 TFLOPS for TF32 Tensor Core math (plain FP32 is far lower, at 67 TFLOPS) but 3,958 TFLOPS for FP8, with both Tensor Core figures quoted with structured sparsity. That 4× gap in peak rate translates directly into potential training speed.

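The fundamental insight of mixed precision training is that different operations have different precision requirements: the bulk matrix math in the forward and backward passes tolerates reduced precision, while weight updates and gradient accumulation need FP32.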

FP32 vs FP16 vs BF16 vs FP8

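| Format | Total Bits | Exponent Bits | Mantissa Bits | Dynamic Range (max magnitude) | Notes |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ±3.4×10^38 | Standard training default (legacy) |
| FP16 | 16 | 5 | 10 | ±6.5×10^4 | Needs loss scaling; overflow risk |
| BF16 | 16 | 8 | 7 | ±3.4×10^38 | Same range as FP32; preferred for training |
| FP8 E4M3 | 8 | 4 | 3 | ±448 | Forward pass optimized |
| FP8 E5M2 | 8 | 5 | 2 | ±57344 | Backward pass optimized |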
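The maximum magnitudes in the table follow from the bit layout. As a rough sketch (the helper name `max_finite` and its `ieee_like` flag are illustrative, not from any library), the largest finite value can be computed from the exponent and mantissa widths. The sketch assumes FP8 E4M3 follows the common convention in which the top exponent code still encodes normal numbers and only the all-ones mantissa is reserved for NaN, which is what yields 448 rather than 240.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool = True) -> float:
    """Largest finite magnitude for a binary float with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # IEEE-style: the top exponent code is reserved for inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_significand = 2.0 - 2.0 ** -man_bits
    else:
        # Assumed E4M3 convention: top exponent code holds normal values,
        # and only the all-ones mantissa pattern is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_significand = 2.0 - 2.0 ** -(man_bits - 1)
    return max_significand * 2.0 ** max_exp


# (exponent bits, mantissa bits, IEEE-style special values?)
FORMATS = {
    "FP32": (8, 23, True),
    "FP16": (5, 10, True),
    "BF16": (8, 7, True),
    "FP8 E4M3": (4, 3, False),
    "FP8 E5M2": (5, 2, True),
}

for name, (e_bits, m_bits, ieee) in FORMATS.items():
    print(f"{name:9s} max finite = {max_finite(e_bits, m_bits, ieee):.4g}")
```

The exponent width dominates the result: BF16 and FP32 share 8 exponent bits and land within a fraction of a percent of each other, while FP16's 5 exponent bits cap it at 65504.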

BF16 vs FP16 — The Critical Difference

FP16 has only 5 exponent bits, giving it a much smaller dynamic range than FP32. Gradient values during backpropagation span many orders of magnitude — gradients for early layers in deep networks can be many thousands of times smaller than gradients for final layers. FP16 loses these small gradients entirely (they underflow to zero), which is why FP16 training requires loss scaling.
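The underflow problem, and the way loss scaling sidesteps it, can be reproduced with nothing but Python's standard library, whose `struct` module supports the IEEE half-precision format via the `"e"` code. This is a minimal sketch: the gradient value `1e-8` and the helper name `to_fp16` are chosen for illustration, and the scale of 2^16 is simply a typical starting value.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


small_grad = 1.0e-8          # illustrative gradient, below FP16's smallest subnormal (~6e-8)
loss_scale = 2.0 ** 16       # illustrative scale factor

# Stored directly in FP16, the gradient is flushed to zero.
unscaled = to_fp16(small_grad)

# Scaling the loss multiplies every gradient by the same factor (chain rule),
# which lifts the value into FP16's representable range.
scaled = to_fp16(small_grad * loss_scale)

# Unscaling is done in higher precision before the optimizer step.
recovered = scaled / loss_scale

print(f"unscaled FP16 value : {unscaled}")
print(f"scaled FP16 value   : {scaled}")
print(f"recovered gradient  : {recovered:.4e}")
```

The recovered value matches the original to within FP16's relative rounding error, whereas the unscaled value is lost entirely.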

BF16 trades mantissa precision for exponent range: it has the same 8 exponent bits as FP32, so it overflows or underflows essentially only where FP32 would too. On NVIDIA A100/H100 and Google TPUs (which natively support BF16), BF16 is generally preferable for training: same dynamic range as FP32, no loss scaling required, and half the storage of FP32 per value. The cost is precision, since 7 mantissa bits give only about 2 to 3 significant decimal digits. On older hardware (V100, which supports FP16 but not BF16 natively), FP16 with loss scaling is the only half-precision option.
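The range-versus-precision trade can be seen numerically. The sketch below emulates BF16 by rounding an FP32 bit pattern to its top 16 bits (BF16 is not available in the standard library, so `to_bf16` is a hand-written stand-in that only handles ordinary finite inputs) and compares it with FP16 via `struct`. The three test values are arbitrary illustrations.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate BF16 rounding of a finite value (NaN/inf and near-max values not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value; raises OverflowError if x is out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x: float) -> bool:
    """True if x can be stored in FP16 without overflowing."""
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False


big, tiny, fine = 1.0e5, 1.0e-8, 1.001

print("1e5   fits FP16:", fits_fp16(big), "| BF16:", to_bf16(big))
print("1e-8  in FP16  :", to_fp16(tiny), "| BF16:", to_bf16(tiny))
print("1.001 in FP16  :", to_fp16(fine), "| BF16:", to_bf16(fine))
```

BF16 keeps both the large and the tiny value at roughly the right magnitude, but collapses 1.001 to 1.0, while FP16 resolves 1.001 yet cannot hold the other two.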

The Standard Mixed Precision Recipe (AMP)

The NVIDIA-recommended procedure, implemented by PyTorch's torch.cuda.amp (exposed as torch.amp in recent releases):

1. Maintain FP32 master weights: The optimizer always stores and updates the authoritative copy of weights in FP32.
2. Cast to FP16/BF16 for compute: Before each forward pass, weights are cast from FP32 to FP16/BF16. Activations and gradients are computed in half precision on Tensor Cores.
3. Loss scaling (FP16 only): Multiply the loss by a large constant (e.g., 2^16) before the backward pass to shift gradient values into the representable FP16 range. Unscale before the optimizer step.
4. FP32 gradient accumulation: Gradients from the FP16 backward pass are converted back to FP32 and accumulated into FP32 master copies.
5. FP32 optimizer step: Adam/AdamW updates the FP32 master weights using FP32 gradients.
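Why the FP32 master copy matters can be shown with a toy experiment: a weight near 1.0 receives many small updates, each smaller than half the FP16 spacing at that magnitude. The sketch below uses `struct` to emulate FP16 and FP32 rounding; the update size, step count, and helper names are illustrative assumptions rather than values from any real training run.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1.0e-4   # illustrative per-step weight change (lr * grad)
steps = 1000

# Case 1: the weight itself lives in FP16. Near 1.0 the FP16 spacing is
# about 9.8e-4, so every 1e-4 update rounds away and the weight never moves.
w_fp16_only = 1.0
for _ in range(steps):
    w_fp16_only = to_fp16(w_fp16_only + update)

# Case 2: updates accumulate in an FP32 master weight; the half-precision
# copy used for compute is re-derived from it by a cast.
master = 1.0
for _ in range(steps):
    master = to_fp32(master + update)
w_compute = to_fp16(master)

print(f"FP16-only weight after {steps} steps : {w_fp16_only}")
print(f"FP32 master weight                  : {master:.6f}")
print(f"FP16 compute copy of master         : {w_compute}")
```

The FP16-only weight is stuck at its starting value, while the master weight reaches roughly 1.1 and its half-precision copy follows it.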

PyTorch AMP Implementation

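import torch
from torch.cuda.amp import autocast, GradScaler  # newer releases expose the same API under torch.amp

# Assumes model, optimizer, criterion and dataloader are already defined,
# and that each batch is an (inputs, targets) pair on the GPU.
amp_dtype = torch.bfloat16  # or torch.float16

# Loss scaling is only needed for FP16; with enabled=False the scaler calls pass through
scaler = GradScaler(enabled=(amp_dtype == torch.float16))

for inputs, targets in dataloader:
    optimizer.zero_grad(set_to_none=True)

    with autocast(dtype=amp_dtype):
        output = model(inputs)
        loss = criterion(output, targets)

    scaler.scale(loss).backward()   # scale the loss so small FP16 gradients do not underflow
    scaler.unscale_(optimizer)      # unscale first so clipping sees the true gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)          # skips the step if gradients contain inf/NaN
    scaler.update()                 # adjust the scale factor for the next iteration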

For BF16, the GradScaler is redundant but kept for API compatibility. Modern code omits it for BF16 training.
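In practice the loss scale is not a fixed constant: GradScaler adjusts it dynamically, backing off when gradients overflow and growing again after a run of clean steps. The class below is a simplified pure-Python sketch of that policy, not PyTorch's implementation. `ToyLossScaler` is a hypothetical name, and the default arguments mirror GradScaler's documented defaults (initial scale 2^16, growth factor 2, backoff factor 0.5, growth interval 2000).

```python
import math


class ToyLossScaler:
    """Simplified dynamic loss scaling: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads) -> bool:
        """Inspect this step's gradients; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and skip this optimizer step.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # A full interval without overflow: try a larger scale.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


# Usage with a short growth interval so the behaviour is visible.
scaler = ToyLossScaler(growth_interval=3)
history = []
for grads in ([0.1, float("inf")], [0.1, 0.2], [0.3, 0.1], [0.2, 0.2]):
    stepped = scaler.update(grads)
    history.append((stepped, scaler.scale))
    print(f"step taken: {stepped!s:5}  scale: {scaler.scale:.0f}")
```

The first step overflows, so it is skipped and the scale halves; three clean steps later the scale doubles back to its starting value.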

FP8 Training — The Frontier (H100 and Beyond)

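NVIDIA H100 introduced hardware-native FP8 Tensor Core support, exposed to frameworks through the Transformer Engine library. FP8 training follows a more complex protocol than BF16 AMP, because the narrow FP8 range requires scaling factors to be tracked and updated for the tensors being cast.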

FP8 training can reach up to ~2× the throughput of BF16 on H100 for transformer-dominated workloads, with accuracy recovery requiring careful tuning of the scaling factor update frequency.
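The role of the scaling factor can be illustrated with a simplified emulation of E4M3 rounding. This is a sketch under stated assumptions: `quantize_e4m3` and `fp8_roundtrip` are hypothetical helpers, the scale is derived from the current tensor's absolute maximum, and the gradient values are invented for illustration. Transformer Engine's actual recipes (for example, delayed scaling driven by a history of absolute maxima) are more involved.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def quantize_e4m3(x: float) -> float:
    """Round x to a 3-mantissa-bit grid, saturating at ±448 (simplified emulation)."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    _, exp = math.frexp(x)         # x = m * 2**exp with 0.5 <= |m| < 1
    exp = max(exp, -5)             # below 2**-6 the grid is fixed (subnormal range)
    step = 2.0 ** (exp - 4)        # 4 significant bits: 1 implicit + 3 stored
    return round(x / step) * step  # Python's round() resolves ties to even


def fp8_roundtrip(values, scale):
    """Scale, quantize to the E4M3 grid, then unscale back to real units."""
    return [quantize_e4m3(v * scale) / scale for v in values]


grads = [3.0e-5, -1.2e-4, 7.5e-4]           # illustrative gradient values
amax = max(abs(g) for g in grads)
scale = E4M3_MAX / amax                      # map the largest value onto ±448

unscaled = fp8_roundtrip(grads, 1.0)         # no scaling: values fall below the grid
scaled = fp8_roundtrip(grads, scale)         # per-tensor scaling preserves them

print("without scaling:", unscaled)
print("with scaling   :", [f"{v:.3e}" for v in scaled])
```

Without a scale every value rounds to zero; with it, each value survives to within E4M3's coarse relative precision of a few percent.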

Memory and Throughput Gains

For a 7B parameter model trained on H100:

| Configuration | Model Memory | Activation Memory | Throughput |
|---|---|---|---|
| FP32 full | 28 GB | ~40 GB | 1× baseline |
| AMP BF16 | 14 GB weights + 28 GB master | ~20 GB | ~2.5× |
| FP8 training | 7 GB weights + 28 GB master | ~10 GB | ~4× |

The FP32 master weights persist throughout training regardless of compute precision — this is a fixed 4 bytes/parameter cost that cannot be eliminated without sacrificing training stability.
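The model-memory column is straightforward bytes-per-parameter arithmetic. The sketch below reproduces it for a 7B-parameter model, assuming decimal gigabytes (10^9 bytes) and counting only weights and master weights; optimizer state such as Adam moments is deliberately left out, as it appears to be in the table. The function name is illustrative.

```python
def weight_memory_gb(n_params: float, compute_bytes: int, master_bytes: int = 0) -> float:
    """Weight memory in decimal GB: compute copy plus optional FP32 master copy."""
    return n_params * (compute_bytes + master_bytes) / 1e9


N_PARAMS = 7e9  # 7B-parameter model

configs = {
    "FP32 full": (4, 0),     # FP32 weights double as the master copy
    "AMP BF16": (2, 4),      # 2-byte compute copy + 4-byte master
    "FP8 training": (1, 4),  # 1-byte compute copy + 4-byte master
}

totals = {}
for name, (compute_bytes, master_bytes) in configs.items():
    totals[name] = weight_memory_gb(N_PARAMS, compute_bytes, master_bytes)
    print(f"{name:13s} {totals[name]:.0f} GB of weight storage")
```

Under these assumptions the weight storage alone is larger with mixed precision than with pure FP32 (42 GB and 35 GB versus 28 GB); the memory savings in the table come from the activation column.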

Integration with Distributed Training

Mixed precision interacts with all major distributed training frameworks.

Mixed precision is not optional at scale: by the peak ratios quoted above, training GPT-4 class models purely in FP32 could require up to 4× more GPU-hours and about 2× more activation memory, which would add tens of millions of dollars to the pre-training budget.
