Weight Initialization

Weight Initialization is the process of setting neural network parameter values before training begins — a critical but often underappreciated decision that directly determines whether gradients flow correctly, whether training converges, and how quickly the model learns. Poor initialization causes either vanishing gradients (parameters barely update) or exploding gradients (training diverges with NaN losses), while proper initialization keeps the variance of activations and gradients stable across all layers from the very first step.

Why Initialization Matters: The Variance Propagation Problem

Consider a network with $L$ layers, each with $n$ neurons. At initialization with weights sampled i.i.d. from $N(0, \sigma^2)$:
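The standard argument can be sketched as follows, under simplifying assumptions that are not stated above: linear layers (no activation function), zero-mean activations, and weights independent of the activations. The superscript layer index and the symbols $z$ and $x$ are notation introduced here for illustration. For a pre-activation $z_j^{(l)} = \sum_{i=1}^{n} W_{ji}^{(l)} x_i^{(l-1)}$:

$$\mathrm{Var}\big(z_j^{(l)}\big) = n\,\sigma^2\,\mathrm{Var}\big(x_i^{(l-1)}\big) \quad\Longrightarrow\quad \mathrm{Var}\big(z^{(L)}\big) = (n\sigma^2)^{L}\,\mathrm{Var}\big(x^{(0)}\big)$$

Under these assumptions, $n\sigma^2 < 1$ shrinks the signal geometrically with depth, $n\sigma^2 > 1$ grows it geometrically, and $n\sigma^2 = 1$ (that is, $\sigma^2 = 1/n$) keeps it constant.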

For gradients (via backpropagation), the same analysis applies in the reverse direction, with the number of outputs of each layer (fan-out) taking the place of the number of inputs (fan-in). Proper initialization keeps both forward activations and backward gradients at a stable variance across all layers.
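The forward half of this argument can be checked numerically. The following is a minimal NumPy sketch, assuming purely linear layers of equal width; the function name `activation_variance_by_layer` and the chosen sizes are illustrative, not taken from any library.

```python
import numpy as np

def activation_variance_by_layer(sigma, n=256, depth=20, batch=512, seed=0):
    """Push random inputs through `depth` linear layers whose weights are
    i.i.d. N(0, sigma^2) and return the activation variance after each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, n))          # unit-variance inputs
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(n, n))  # fresh weights for each layer
        x = x @ W                                # linear layer, no activation
        variances.append(float(x.var()))
    return variances

n = 256
for label, sigma in [("too small", 0.5 / np.sqrt(n)),
                     ("balanced", 1.0 / np.sqrt(n)),
                     ("too large", 2.0 / np.sqrt(n))]:
    v = activation_variance_by_layer(sigma, n=n)
    print(f"{label:10s} var after layer 1: {v[0]:.3e}, after layer 20: {v[-1]:.3e}")
```

With $n\sigma^2 = 0.25$ the variance should collapse toward zero, with $n\sigma^2 = 4$ it should blow up by many orders of magnitude, and with $n\sigma^2 = 1$ it should stay close to its starting value.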

The Symmetry Breaking Requirement

Before discussing which distribution to use, a fundamental constraint: zero initialization is catastrophically wrong for hidden layers.

If all weights $W = 0$:

This is the symmetry problem — any initialization that fails to break symmetry (e.g., all-zeros, all-constants) prevents the network from leveraging its full representational capacity. Weights must be initialized with random, different values.
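The symmetry problem can be made concrete with a tiny two-layer network. This is an illustrative sketch, assuming a tanh hidden layer, no biases, and a squared-error loss; the helper `hidden_weight_gradient` is a hypothetical name written for this example.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, x, y):
    """Gradient of 0.5 * ||W2 @ tanh(W1 @ x) - y||^2 with respect to W1."""
    h = np.tanh(W1 @ x)                 # hidden activations
    err = W2 @ h - y                    # output error
    dh = (W2.T @ err) * (1.0 - h**2)    # backpropagate through tanh
    return np.outer(dh, x)              # one row per hidden unit

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # one input example
y = rng.standard_normal(2)              # one target
n_hidden = 3

# Constant init: every hidden unit starts with the same weights.
g_const = hidden_weight_gradient(np.full((n_hidden, 4), 0.5),
                                 np.full((2, n_hidden), 0.5), x, y)
# Zero init: no error signal reaches the hidden weights at all.
g_zero = hidden_weight_gradient(np.zeros((n_hidden, 4)),
                                np.zeros((2, n_hidden)), x, y)
# Random init: each hidden unit receives a different gradient.
g_rand = hidden_weight_gradient(0.5 * rng.standard_normal((n_hidden, 4)),
                                0.5 * rng.standard_normal((2, n_hidden)), x, y)

print("constant init, rows identical:", np.allclose(g_const, g_const[0]))
print("zero init, gradient all zero: ", np.allclose(g_zero, 0.0))
print("random init, rows identical:  ", np.allclose(g_rand, g_rand[0]))
```

With constant weights every hidden unit gets the same gradient row, so the units remain copies of each other after any number of updates; with all-zero weights the hidden-layer gradient vanishes entirely in this setup.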

Xavier/Glorot Initialization (2010)

Glorot and Bengio derived an initialization for networks with saturating activations such as sigmoid or tanh, under the assumption that units start out in their approximately linear regime:

$$W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$$

or equivalently (normal form):

$$W \sim N\!\left(0, \frac{2}{n_{in} + n_{out}}\right)$$

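The derivation balances variance preservation in the forward pass (which alone would give variance $1/n_{in}$) and in the backward pass (which alone would give $1/n_{out}$) by using the compromise $2/(n_{in} + n_{out})$. The scheme is named after Xavier Glorot. Its uniform form is the default kernel initializer for Keras layers such as Dense, including in TensorFlow's Keras API. PyTorch's torch.nn.Linear does not use it by default; it uses a Kaiming-uniform variant instead, so Xavier initialization has to be applied explicitly there (for example with torch.nn.init.xavier_uniform_).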
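Both forms of the Xavier/Glorot rule are easy to write down directly. The sketch below uses plain NumPy with illustrative function names (`xavier_uniform`, `xavier_normal`) rather than any framework's API, and checks that the two variants have the same variance.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_out, n_in) matrix from U[-limit, limit],
    with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def xavier_normal(n_in, n_out, rng):
    """Sample an (n_out, n_in) matrix from N(0, 2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 512, 256
target_var = 2.0 / (n_in + n_out)

print("target variance :", target_var)
print("uniform variant :", xavier_uniform(n_in, n_out, rng).var())
print("normal variant  :", xavier_normal(n_in, n_out, rng).var())
```

The two forms agree because a uniform distribution on $[-b, b]$ has variance $b^2/3$, and $b^2 = 6/(n_{in} + n_{out})$ gives exactly $2/(n_{in} + n_{out})$.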

He/Kaiming Initialization (2015)

He et al. (Microsoft Research) derived initialization for ReLU networks. ReLU zeroes approximately half its inputs, effectively halving the variance. The He correction doubles the scale to compensate:

$$W \sim N\!\left(0, \frac{2}{n_{in}}\right)$$

For Leaky ReLU with slope $a$ (where $a=0.01$ typically):

$$W \sim N\!\left(0, \frac{2}{(1+a^2) \cdot n_{in}}\right)$$

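He initialization is the usual choice for ReLU-family activations (ReLU, Leaky ReLU, ELU). Note that PyTorch's default for Conv2d layers calls kaiming_uniform_ with $a = \sqrt{5}$, which works out to $U(-1/\sqrt{n_{in}}, 1/\sqrt{n_{in}})$, a variance of $1/(3 n_{in})$ rather than $2/n_{in}$. To get the He variance given above, apply torch.nn.init.kaiming_normal_ or kaiming_uniform_ with nonlinearity='relu' explicitly.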
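The effect of the factor of 2 can be seen by pushing random inputs through a stack of ReLU layers. This is a minimal NumPy sketch with illustrative names (`he_normal`, `naive_normal`, `mean_square_after_relu_layers`); it tracks the mean squared activation, since ReLU outputs are not zero-mean.

```python
import numpy as np

def he_normal(n_in, n_out, rng, negative_slope=0.0):
    """Sample (n_out, n_in) weights from N(0, 2 / ((1 + a^2) * n_in)),
    where a is the Leaky ReLU slope (a = 0 gives plain ReLU)."""
    std = np.sqrt(2.0 / ((1.0 + negative_slope**2) * n_in))
    return rng.normal(0.0, std, size=(n_out, n_in))

def naive_normal(n_in, n_out, rng):
    """Baseline with variance 1 / n_in, i.e. without the factor of 2."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def mean_square_after_relu_layers(sample_weights, n=512, depth=10,
                                  batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers of width n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, n))       # E[x^2] = 1 at the input
    for _ in range(depth):
        W = sample_weights(n, n, rng)         # shape (n_out, n_in)
        x = np.maximum(x @ W.T, 0.0)          # linear layer followed by ReLU
    return float(np.mean(x**2))

print("Var(W) = 1/n_in:", mean_square_after_relu_layers(naive_normal))
print("Var(W) = 2/n_in:", mean_square_after_relu_layers(he_normal))
```

With variance $1/n_{in}$ the mean squared activation should roughly halve at every layer, while with $2/n_{in}$ it should stay near 1.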

Initialization for Transformers

Modern LLMs require specialized initialization strategies:

| Component | Initialization | Reasoning |
|---|---|---|
| Embedding layers | $N(0, 1)$ or $N(0, 0.02^2)$ (std 0.02) | A small std keeps the initial output distribution near uniform |
| Attention Q, K projections | Xavier, or $N(0, \sigma^2)$ with $\sigma = 1/\sqrt{d_{model}}$ | Preserve attention logit scale |
| Attention V, output proj. | $N(0, 0.02^2)$, or std $\sigma/\sqrt{2L}$ for a base std $\sigma$ | Small output prevents early residual domination |
| Feed-forward layers | He (ReLU) or Xavier (GELU) | Match activation function |
| Layer Norm $\gamma$ | 1.0 | Identity at init |
| Layer Norm $\beta$ | 0.0 | No shift at init |

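GPT-2 Initialization Strategy: Output projection weights (the last linear map of each branch before it is added to the residual stream) are initialized from a zero-mean normal distribution with standard deviation $0.02/\sqrt{2L}$, where $L$ is the number of transformer layers. Each layer adds two branches (attention and feed-forward) to the residual stream, so there are $2L$ additions in total; scaling each by $1/\sqrt{2L}$ keeps the variance of their sum roughly independent of depth at initialization, which improves gradient flow in deep networks.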
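The depth scaling can be illustrated with a toy residual stream. This is a rough sketch, not GPT-2's actual code: branch activations are replaced by unit-variance Gaussian noise, the stream starts at zero, and the name `residual_stream_variance` and the sizes are assumptions made for the example.

```python
import numpy as np

def residual_stream_variance(n_layers, scale_by_depth, d_model=256,
                             batch=128, base_std=0.02, seed=0):
    """Add 2 * n_layers random residual branches to a zero stream and
    return the variance of the stream afterwards."""
    rng = np.random.default_rng(seed)
    n_branches = 2 * n_layers                      # attention + MLP per layer
    std = base_std / np.sqrt(n_branches) if scale_by_depth else base_std
    x = np.zeros((batch, d_model))
    for _ in range(n_branches):
        h = rng.standard_normal((batch, d_model))  # stand-in for branch activations
        W_out = rng.normal(0.0, std, size=(d_model, d_model))
        x = x + h @ W_out                          # residual addition
    return float(x.var())

for n_layers in (12, 48):
    print(f"L={n_layers:2d}",
          "unscaled:", residual_stream_variance(n_layers, False),
          "scaled:", residual_stream_variance(n_layers, True))
```

In this toy setting the unscaled variance should grow in proportion to the number of layers, while the scaled variance should stay about the same for both depths.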

LLaMA Initialization: Uses Kaiming uniform initialization for linear layers with $a = \sqrt{5}$ (the PyTorch default for Linear layers, equivalent to $U(-1/\sqrt{n_{in}}, 1/\sqrt{n_{in}})$ and therefore smaller than the ReLU-gain He variance $2/n_{in}$), with RMSNorm scales initialized to 1.0 and no bias terms.
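The relationship between the Kaiming-uniform bound and the slope parameter $a$ can be verified with a few lines of arithmetic. The helper `kaiming_uniform_bound` below is a hypothetical name that applies the Leaky ReLU variance formula from the He section; it is a sketch of the arithmetic, not PyTorch's implementation.

```python
import math

def kaiming_uniform_bound(fan_in, a=0.0):
    """Half-width b of U(-b, b) whose variance is 2 / ((1 + a^2) * fan_in)."""
    gain = math.sqrt(2.0 / (1.0 + a**2))
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std          # U(-b, b) has variance b^2 / 3

fan_in = 4096
print("a = 0        :", kaiming_uniform_bound(fan_in))
print("a = sqrt(5)  :", kaiming_uniform_bound(fan_in, math.sqrt(5)))
print("1/sqrt(fan_in):", 1.0 / math.sqrt(fan_in))
```

With $a = \sqrt{5}$ the gain is $\sqrt{2/6}$, so the bound reduces to $1/\sqrt{n_{in}}$, whereas $a = 0$ gives the larger bound $\sqrt{6/n_{in}}$.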

Orthogonal Initialization

For recurrent networks (LSTMs, GRUs), orthogonal initialization is important:

$$W = Q \text{ where } Q \text{ is a random orthogonal matrix}$$

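Orthogonal matrices have singular values all equal to 1, so multiplying by one leaves the norm of a vector unchanged, which preserves gradient norms through the recurrent connection. This matters most when training deep RNNs or LSTMs on long sequences, where the same recurrent matrix is applied once per time step.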
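One common way to draw a random orthogonal matrix is a QR decomposition of a Gaussian matrix. The sketch below uses NumPy and an illustrative function name `orthogonal`, and compares repeated multiplication by $Q$ with multiplication by $0.9\,Q$, a matrix whose singular values all equal 0.9.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    # Fix column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n, steps = 128, 100
Q = orthogonal(n, rng)

v0 = rng.standard_normal(n)
v_orth, v_shrunk = v0.copy(), v0.copy()
for _ in range(steps):
    v_orth = Q @ v_orth              # singular values all equal to 1
    v_shrunk = (0.9 * Q) @ v_shrunk  # singular values all equal to 0.9

print("Q^T Q = I:", np.allclose(Q.T @ Q, np.eye(n)))
print("norm ratio after 100 steps, Q    :", np.linalg.norm(v_orth) / np.linalg.norm(v0))
print("norm ratio after 100 steps, 0.9 Q:", np.linalg.norm(v_shrunk) / np.linalg.norm(v0))
```

The vector multiplied by $Q$ keeps its norm over all 100 steps, while the one multiplied by $0.9\,Q$ shrinks by a factor of $0.9^{100}$, which is the linear analogue of a vanishing gradient over a long sequence.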

Practical Rules

| Network Type | Activation | Recommended Init |
|---|---|---|
| CNN (image classification) | ReLU | He/Kaiming |
| MLP (feed-forward) | ReLU/Leaky ReLU | He |
| MLP (classification head) | Sigmoid/Tanh | Xavier/Glorot |
| Transformer (attention) | n/a | Xavier or scaled |
| Transformer (FFN with GELU) | GELU | Xavier or He |
| RNN/LSTM | Tanh | Xavier + orthogonal recurrent |
| Embeddings | n/a | $N(0, 0.02^2)$ (std 0.02) or $U(-1, 1)$ |

Pre-trained Initialization: Transfer Learning

For fine-tuning pre-trained models (the dominant paradigm in 2024-2026):

Proper initialization is not a detail — it is a prerequisite for training to work at all. Modern frameworks handle it automatically, but understanding why specific initializations are needed is essential for debugging training instabilities, designing custom architectures, and understanding why models trained from scratch behave differently from fine-tuned ones.
