Fixup Initialization

Fixup Initialization is a weight initialization scheme for residual networks that enables stable training of very deep networks without any normalization layers. It scales the initial weights of residual branches down as network depth grows, so that signals and gradients propagate through hundreds of layers at initialization without relying on the normalizing effect of batch normalization. Zhang et al. (2019) introduced it as a theoretically motivated alternative to BatchNorm: it supports small-batch and single-example training, removes the coupling between samples in a batch that BatchNorm imposes, and gives simpler training dynamics for theoretical analysis.

What Is Fixup Initialization?
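
The depth-dependent scaling can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' reference code. It assumes the rule as we understand it from Zhang et al. (2019): the last layer of each residual branch starts at zero, and the other layers of a branch with m layers are drawn from a standard initializer and multiplied by L^(-1/(2m-2)), where L is the number of residual blocks. The helper names (`fixup_scale`, `he_normal`, `init_branch`) are hypothetical, and the scalar biases and multipliers that the paper also adds are left out.

```python
import math
import random


def fixup_scale(num_blocks, layers_per_branch):
    """Multiplier L^(-1/(2m-2)) for the non-final layers of a branch."""
    if layers_per_branch < 2:
        raise ValueError("the rule needs at least 2 layers per branch")
    return num_blocks ** (-1.0 / (2 * layers_per_branch - 2))


def he_normal(fan_in, fan_out, rng):
    """He-style Gaussian init: entries drawn from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


def init_branch(width, num_blocks, layers_per_branch, rng):
    """Initial weights for one residual branch made of square layers."""
    scale = fixup_scale(num_blocks, layers_per_branch)
    layers = []
    for _ in range(layers_per_branch - 1):
        weights = he_normal(width, width, rng)
        # Shrink the standard init by the depth-dependent factor.
        layers.append([[scale * v for v in row] for row in weights])
    # The last layer starts at zero, so the block is the identity at init.
    layers.append([[0.0] * width for _ in range(width)])
    return layers


rng = random.Random(0)
branch = init_branch(width=8, num_blocks=16, layers_per_branch=2, rng=rng)
print(len(branch), fixup_scale(16, 2))
```

With 16 blocks and two layers per branch the multiplier is 16^(-1/2). Deeper networks, or branches with fewer layers, get a smaller factor.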

Why Fixup Works: Theoretical Basis

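The core insight concerns signal and gradient propagation at initialization: each residual branch adds its output to the skip path, so without normalization the activation variance grows with depth unless the branch weights are scaled down accordingly.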
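
As a rough illustration of why branch scaling matters for signal propagation, the sketch below pushes one random vector through a stack of linear residual blocks, x <- x + g * W x, and reports how much its squared norm grows. It is a simplified toy model, not the Fixup procedure itself. The function name `residual_stack_gain`, the width, and the choice g = depth^(-1/2) are assumptions made for the example.

```python
import math
import random


def residual_stack_gain(num_blocks, branch_gain, width=64, seed=0):
    """Return |x_out|^2 / |x_in|^2 after `num_blocks` residual blocks.

    Each block computes x <- x + branch_gain * W x, where W has i.i.d.
    N(0, 1/width) entries, so W x has roughly the same squared norm as x.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm_in = sum(v * v for v in x)
    std = 1.0 / math.sqrt(width)
    for _ in range(num_blocks):
        # Fresh random branch weights for every block.
        branch = [sum(rng.gauss(0.0, std) * v for v in x)
                  for _ in range(width)]
        x = [v + branch_gain * b for v, b in zip(x, branch)]
    return sum(v * v for v in x) / norm_in


depth = 20
unscaled = residual_stack_gain(depth, branch_gain=1.0)
scaled = residual_stack_gain(depth, branch_gain=depth ** -0.5)
print(f"unscaled branches:     {unscaled:.3g}")
print(f"depth-scaled branches: {scaled:.3g}")
```

In this toy model each unscaled block roughly doubles the squared norm, so the growth is exponential in depth. With the branch shrunk by depth^(-1/2), each block adds only about a 1/depth fraction and the total growth stays within a small constant factor.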

Fixup vs. Batch Normalization

| Property | Batch Normalization | Fixup Initialization |
| --- | --- | --- |
| Normalization | Dynamic, computed over the batch | Static, achieved at initialization via scaling |
| Small-batch training | Noisy estimates, degrades | Unaffected (no batch statistics) |
| Single-example inference | Requires stored running statistics | Exact (no statistics needed) |
| Coupling between samples | Samples in the same batch interact | Fully independent examples |
| Theoretical cleanliness | Complex stochastic dynamics | Clean, analyzable gradient flow |
| Performance on standard benchmarks | Slightly better (large batch) | Competitive, especially at small batch sizes |
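
The coupling between samples in a batch can be seen in a small sketch. It assumes training-mode batch normalization over a batch of scalars and compares it with a fixed per-example scale and bias, which stands in for a network with no batch statistics. The function names and constants are hypothetical and chosen only for illustration.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize a batch of scalars with the batch's own mean and variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def fixed_affine(values, scale=0.5, bias=0.1):
    """Per-example scale and bias; nothing is shared across the batch."""
    return [scale * v + bias for v in values]


batch_a = [1.0, 2.0, 3.0]
batch_b = [1.0, 2.0, 30.0]  # only the third example differs

# Under batch norm, the first example's output depends on its batch-mates.
print(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
# Under a fixed per-example transform, it does not.
print(fixed_affine(batch_a)[0], fixed_affine(batch_b)[0])
```

Changing one example shifts the batch mean and variance, which changes the normalized value of every other example. A transform with no batch statistics treats each example independently.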

Practical Applications

Fixup Initialization is a normalization-free path to training deep residual networks. It shows that stability across hundreds of layers can come from the scale of the initial weights rather than from runtime statistics, which makes it a theoretically clean and practical alternative to BatchNorm in specialized training scenarios such as small-batch or single-example training.
