The core trick is a full-precision master copy of the weights.


Mixed-precision training is the standard recipe that lets modern models train with roughly half the activation memory and about twice the matmul throughput without losing accuracy. The idea is simple to state and subtle to get right: do the heavy compute (the matrix multiplies in the forward and backward pass) in a 16-bit format that the hardware's tensor cores process quickly, while keeping a full-precision copy of the things that must stay accurate. Nearly every large model today is trained this way, and the two failure modes it has to defend against, underflow of tiny gradients and small weight updates rounding away to nothing, are exactly what the recipe is built around.

The core trick is a full-precision master copy of the weights. You keep the authoritative weights in FP32, cast a 16-bit copy for each step's forward and backward pass, compute the gradients in 16-bit, and then apply the update to the FP32 master weights. This matters because a weight update is often thousands of times smaller than the weight itself; in pure FP16, whose 10-bit mantissa resolves only about three decimal digits, that tiny increment rounds away to nothing and training silently stalls. Accumulating the update into an FP32 master copy preserves it. Reductions like the loss and the gradient accumulation are likewise done in FP32.

FP16 and BF16 make opposite trade-offs with the same 16 bits. FP16 spends 5 bits on the exponent and 10 on the mantissa: good precision, but a narrow dynamic range, so small gradients fall below the smallest representable value and underflow to zero. BF16 spends 8 bits on the exponent, the same as FP32, and only 7 on the mantissa: coarser precision, but it covers the full FP32 range, so gradients almost never underflow. That single difference is why BF16 has largely won for training: it needs no special handling, whereas FP16 requires loss scaling to be usable.

Loss scaling is how you make FP16 safe. Before the backward pass you multiply the loss by a large constant S, which shifts the entire gradient distribution up out of the FP16 underflow region; after backprop, and before the optimizer step, you divide the gradients back down by S. Dynamic loss scaling automates the choice of S: it pushes S up until a gradient overflows to infinity, then backs off and skips that step, continually tracking the largest safe value. BF16's wide range means you can usually skip loss scaling entirely.

The payoff is why it is universal. Sixteen-bit matrix multiplies run at roughly twice the rate of FP32 or better on tensor-core hardware, and the activations stored for the backward pass take half the memory, which is often the difference between a model fitting on a device or not. NVIDIA's TF32 is a related middle ground that keeps FP32 range with a reduced mantissa for the matmul inputs, and FP8 pushes the same idea further for the largest training runs. In every case the principle is identical: compute cheap, but keep a precise master copy so the small quantities survive.

| Format | Exponent / mantissa bits | Dynamic range | Loss scaling? | Role |
|---|---|---|---|---|
| FP32 | 8 / 23 | Full | n/a | Master weights, reductions |
| TF32 | 8 / 10 | FP32 range | No | Matmul inputs (NVIDIA) |
| BF16 | 8 / 7 | FP32 range | Usually no | Default training compute |
| FP16 | 5 / 10 | Narrow | Yes | Training compute (needs scaling) |
| FP8 | 4 / 3 (E4M3) or 5 / 2 (E5M2) | Very narrow | Yes (per-tensor) | Largest-scale training |

The shallow reading of mixed precision is "use fewer bits to go faster." That misses the whole engineering problem, which is that not every number in training can afford fewer bits. The weight updates and the reductions need range and precision the 16-bit formats cannot give them, so the technique is really about sorting the numbers: heavy matmuls go cheap, the master weights and accumulations stay precise, and loss scaling shuttles the gradient distribution into whatever range the compute format can represent. Read mixed precision as "keep a precise master copy while computing cheap" rather than "just use fewer bits", and the choice between BF16 and FP16, and the need for loss scaling, follow directly from one question: does this number need dynamic range, or precision, or both?
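The master-copy argument above can be checked with a few lines of NumPy. This is a minimal sketch, not a training loop: the function name `accumulate`, the starting weight of 1.0, and the update size of 1e-4 are illustrative assumptions chosen so that the update is smaller than half the FP16 spacing near 1.0.

```python
import numpy as np

def accumulate(storage_dtype, update=1e-4, steps=1000):
    """Apply the same small update `steps` times to a weight held in `storage_dtype`."""
    w = storage_dtype(1.0)
    # In both cases the update itself comes out of 16-bit compute.
    step = np.float16(update)
    for _ in range(steps):
        # Cast the step to the storage dtype so the addition happens in that precision.
        w = storage_dtype(w + storage_dtype(step))
    return float(w)

fp16_only = accumulate(np.float16)    # weight stored in FP16
fp32_master = accumulate(np.float32)  # weight stored in an FP32 master copy

print("FP16 weights only:", fp16_only)    # stays at 1.0; every update rounds away
print("FP32 master copy: ", fp32_master)  # ends close to 1.1
```

The only difference between the two runs is where the running sum lives, which is the point of the master copy: the 16-bit update is identical, but only the FP32 accumulator can register it.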

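The loss-scaling procedure described above (scale up, backpropagate in FP16, unscale in FP32, back off and skip on overflow) can be sketched in NumPy as follows. Everything here is an assumption for illustration: `DynamicLossScaler`, `backward_fp16`, and the default values are hypothetical names and numbers, not the API of any particular framework, and the "backward pass" is a stand-in that simply multiplies a known gradient by S and casts it to FP16.

```python
import numpy as np

class DynamicLossScaler:
    """Hypothetical helper that tracks the loss scale S for FP16 training."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def unscale(self, scaled_grads_fp16):
        """Return FP32 gradients divided by S, or None if the step must be skipped."""
        grads = scaled_grads_fp16.astype(np.float32)
        if not np.all(np.isfinite(grads)):
            # Overflow: shrink S and tell the caller to skip this optimizer step.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return None
        grads = grads / np.float32(self.scale)  # divide by the S used for this step
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long run without overflow: try a larger S.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return grads

def backward_fp16(true_grad, scale):
    """Stand-in for an FP16 backward pass on a loss multiplied by `scale`."""
    with np.errstate(over="ignore"):
        return (true_grad * np.float32(scale)).astype(np.float16)

scaler = DynamicLossScaler()
tiny = np.array([1e-8], dtype=np.float32)

unscaled_fp16 = tiny.astype(np.float16)                         # underflows without scaling
recovered = scaler.unscale(backward_fp16(tiny, scaler.scale))   # survives with scaling

large = np.array([10.0], dtype=np.float32)
skipped = scaler.unscale(backward_fp16(large, scaler.scale))    # overflows, so step is skipped

print(unscaled_fp16, recovered, skipped, scaler.scale)
```

Checking for non-finite values before dividing by S keeps the overflow test independent of the scale, and the unscaled gradients are returned in FP32 so they can be applied directly to the master weights.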
