Weight Initialization is the process of setting neural network parameter values before training begins. It is a critical but often underappreciated decision that strongly affects whether gradients flow correctly, whether training converges, and how quickly the model learns. Poor initialization causes either vanishing gradients (parameters barely update) or exploding gradients (training diverges with NaN losses), while proper initialization keeps the variance of activations and gradients roughly stable across layers from the very first step.
Why Initialization Matters: The Variance Propagation Problem
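Consider a network with $L$ layers, each with $n$ neurons, and ignore the nonlinearity for the moment. At initialization with weights sampled i.i.d. from $N(0, \sigma^2)$, each layer multiplies the activation variance by $n\sigma^2$, so after $L$ layers it scales as $(n\sigma^2)^L$:
- Variance of activations grows exponentially with depth if $\sigma^2 > 1/n$
- Variance of activations shrinks exponentially toward 0 if $\sigma^2 < 1/n$
- Only at the critical $\sigma^2 \approx 1/n$ does variance remain stable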
For gradients (via backpropagation), the same analysis applies in the reverse direction, with each layer's fan-out taking the place of its fan-in. Proper initialization ensures both forward activations and backward gradients maintain stable variance across all layers.
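The three regimes can be checked numerically. The following is a minimal NumPy sketch, assuming purely linear layers of equal width and unit-variance Gaussian inputs; the function name and the sizes (`n=256`, `depth=20`) are illustrative choices, not part of the analysis above.

```python
import numpy as np

def final_activation_variance(sigma2, n=256, depth=20, batch=512, seed=0):
    """Push unit-variance inputs through `depth` linear layers; return the output variance."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))
    for _ in range(depth):
        # Weights drawn i.i.d. from N(0, sigma2); no nonlinearity (assumption).
        W = rng.normal(0.0, np.sqrt(sigma2), size=(n, n))
        h = h @ W
    return float(h.var())

n = 256
for scale in (0.5, 1.0, 2.0):
    v = final_activation_variance(scale / n, n=n)
    print(f"sigma^2 = {scale}/n -> variance after 20 layers: {v:.3e}")
```

With these settings the variance should collapse by several orders of magnitude for $\sigma^2 = 0.5/n$, blow up by a similar factor for $\sigma^2 = 2/n$, and stay on the order of 1 only at $\sigma^2 = 1/n$.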
The Symmetry Breaking Requirement
Before discussing which distribution to use, a fundamental constraint: zero initialization is catastrophically wrong for hidden layers.
If all weights $W = 0$:
- All neurons in a layer compute identical outputs
- All neurons receive identical gradients
- All neurons update identically forever
- The network behaves as a single neuron regardless of width
This is the symmetry problem: any initialization that fails to break symmetry (e.g., all-zeros, all-constants) prevents the network from using its full representational capacity. Weights must therefore be initialized with distinct random values.
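The symmetry argument can be seen directly in the gradients. This is a small sketch, assuming a two-layer tanh network without biases and a squared-error loss; `hidden_weight_grad` is a hypothetical helper written for this illustration.

```python
import numpy as np

def hidden_weight_grad(W1, W2, x, y):
    """Gradient of the mean of 0.5*(tanh(x W1) W2 - y)^2 with respect to W1."""
    h = np.tanh(x @ W1)               # hidden activations, shape (batch, hidden)
    err = h @ W2 - y                  # output error, shape (batch, 1)
    dh = (err @ W2.T) * (1 - h**2)    # backpropagate through tanh
    return x.T @ dh / len(x)          # shape (inputs, hidden)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))
y = rng.standard_normal((32, 1))

# Constant init: every hidden unit has the same incoming and outgoing weights.
g_const = hidden_weight_grad(np.full((4, 3), 0.5), np.full((3, 1), 0.5), x, y)
# Random init: each hidden unit starts out different.
g_rand = hidden_weight_grad(rng.standard_normal((4, 3)), rng.standard_normal((3, 1)), x, y)

# Each column is the gradient for one hidden unit's incoming weights.
print("constant init, all columns equal:", np.allclose(g_const, g_const[:, :1]))
print("random init, all columns equal:  ", np.allclose(g_rand, g_rand[:, :1]))
```

Under constant initialization every hidden unit receives the same gradient column, so a gradient step leaves the units identical. Random initialization gives each unit its own gradient.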
Xavier/Glorot Initialization (2010)
Glorot and Bengio derived an initialization suited to networks with sigmoid or tanh activations:
$$W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$$
or, with the same variance, in normal form:
$$W \sim N\!\left(0, \frac{2}{n_{in} + n_{out}}\right)$$
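The derivation balances variance preservation in the forward pass (which calls for $\sigma^2 = 1/n_{in}$) and in the backward pass (which calls for $\sigma^2 = 1/n_{out}$) by averaging the two fan sizes. Named after Xavier Glorot, it is the default for Keras and TensorFlow Keras Dense layers (`glorot_uniform`). PyTorch `torch.nn.Linear` does not use it by default; in PyTorch it is available as `torch.nn.init.xavier_uniform_` and `torch.nn.init.xavier_normal_`.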
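The uniform and normal forms share one variance because $U[-b, b]$ has variance $b^2/3$, and $b = \sqrt{6}/\sqrt{n_{in} + n_{out}}$ gives $2/(n_{in} + n_{out})$. A minimal NumPy sketch of both samplers follows; the function names and layer sizes are illustrative assumptions, not a framework API.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) matrix from U[-b, b] with b = sqrt(6 / (n_in + n_out))."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def xavier_normal(n_in, n_out, rng):
    """Sample an (n_in, n_out) matrix from N(0, 2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 512, 256
target_var = 2.0 / (n_in + n_out)
W_u = xavier_uniform(n_in, n_out, rng)
W_n = xavier_normal(n_in, n_out, rng)

print(f"target variance:  {target_var:.6f}")
print(f"uniform variance: {W_u.var():.6f}")
print(f"normal variance:  {W_n.var():.6f}")
```

Both empirical variances should land close to the target, which is why the two forms are usually treated as interchangeable in practice.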
He/Kaiming Initialization (2015)
He et al. (Microsoft Research) derived initialization for ReLU networks. ReLU zeroes approximately half its inputs, effectively halving the variance. The He correction doubles the scale to compensate:
$$W \sim N\!\left(0, \frac{2}{n_{in}}\right)$$
For Leaky ReLU with slope $a$ (where $a=0.01$ typically):
$$W \sim N\!\left(0, \frac{2}{(1+a^2) \cdot n_{in}}\right)$$
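He initialization is the standard choice for ReLU-family activations (ReLU, Leaky ReLU, ELU). Note that PyTorch's built-in default for Conv2d and Linear layers is `kaiming_uniform_` with $a = \sqrt{5}$, which yields the smaller variance $1/(3 n_{in})$ rather than $2/n_{in}$. To get the He scale, call `torch.nn.init.kaiming_normal_` or `torch.nn.init.kaiming_uniform_` explicitly with `nonlinearity='relu'`.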
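The factor of 2 can be checked by tracking the mean squared activation through a stack of ReLU layers. This is a sketch under simplifying assumptions (equal-width dense layers, no biases, Gaussian inputs); `relu_mean_square` and its parameters are hypothetical names for this illustration.

```python
import numpy as np

def relu_mean_square(var_times_n, n=512, depth=10, batch=512, seed=0):
    """Mean squared activation after `depth` ReLU layers with weight variance var_times_n / n."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(var_times_n / n), size=(n, n))
        h = np.maximum(h @ W, 0.0)    # ReLU discards the negative half
    return float(np.mean(h**2))

he = relu_mean_square(2.0)            # He scale: variance 2/n
uncorrected = relu_mean_square(1.0)   # variance 1/n, no ReLU correction
print(f"variance 2/n: {he:.3e}")
print(f"variance 1/n: {uncorrected:.3e}")
print(f"ratio:        {he / uncorrected:.1f}")
```

With the He scale the signal should stay on the order of 1, while the uncorrected scale loses roughly a factor of 2 per layer. Because both runs share a seed and ReLU is positively homogeneous, the ratio comes out as $2^{10}$ for ten layers.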
Initialization for Transformers
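Modern LLMs typically use component-specific initialization strategies. In the table and paragraphs of this section, the second argument of $N(0, \cdot)$ is a standard deviation rather than a variance: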
| Component | Initialization | Reasoning |
|---|---|---|
| Embedding layers | $N(0, 1)$ or $N(0, 0.02)$ | The smaller scale keeps initial predictions close to a uniform distribution |
| Attention Q, K projections | Xavier or $N(0, \sigma)$ with $\sigma = 1/\sqrt{d_{model}}$ | Preserve attention logit scale |
| Attention V, output proj. | $N(0, 0.02)$ or $N(0, \sigma/\sqrt{2L})$ | Small output prevents early residual domination |
| Feed-forward layers | He (ReLU) or Xavier (GELU) | Match activation function |
| Layer Norm $\gamma$ | 1.0 | Identity at init |
| Layer Norm $\beta$ | 0.0 | No shift at init |
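The "preserve attention logit scale" reasoning in the table can be checked numerically. This is a minimal NumPy sketch, assuming unit-variance token inputs and a single attention head; the function name and the dimensions are illustrative and not taken from any particular model.

```python
import numpy as np

def attention_logit_std(w_std, d_model=256, d_head=64, n_tokens=128, seed=0):
    """Std of scaled dot-product logits when Q and K projections use N(0, w_std) weights."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, d_model))            # unit-variance inputs (assumption)
    Wq = rng.normal(0.0, w_std, size=(d_model, d_head))
    Wk = rng.normal(0.0, w_std, size=(d_model, d_head))
    logits = (x @ Wq) @ (x @ Wk).T / np.sqrt(d_head)        # scaled dot-product scores
    return float(logits.std())

d_model = 256
print("std = 1/sqrt(d_model):", attention_logit_std(1.0 / np.sqrt(d_model)))
print("std = 0.02:           ", attention_logit_std(0.02))
```

With $\sigma = 1/\sqrt{d_{model}}$ the logits should come out with standard deviation near 1, so the softmax starts neither saturated nor flat. The fixed 0.02 scale gives noticeably smaller logits at this width, which starts attention close to uniform.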
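GPT-2 Initialization Strategy: Output projection weights (the last matrix in each attention and feed-forward sublayer, whose output is added to the residual stream) are initialized from a normal distribution with standard deviation $0.02/\sqrt{2L}$, where $L$ is the number of transformer layers. Each layer adds two such outputs to the residual stream, so this scaling keeps the stream from growing with depth at initialization, improving gradient flow in deep networks.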
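As a rough illustration of the scaled scheme, the sketch below initializes the weight matrices of one block with NumPy. The dictionary keys, the 4x feed-forward width, and the helper names are assumptions made for this example; they do not reproduce any specific codebase.

```python
import numpy as np

def gpt2_style_std(n_layers, base_std=0.02):
    """Std for residual output projections: base_std / sqrt(2 * n_layers)."""
    return base_std / np.sqrt(2 * n_layers)

def init_block(d_model, n_layers, rng, base_std=0.02):
    """Return illustrative weight matrices for one transformer block."""
    out_std = gpt2_style_std(n_layers, base_std)
    return {
        # Projections that do not write to the residual stream use the base std.
        "attn_qkv": rng.normal(0.0, base_std, size=(d_model, 3 * d_model)),
        "mlp_in": rng.normal(0.0, base_std, size=(d_model, 4 * d_model)),
        # Projections that write to the residual stream use the scaled std.
        "attn_out": rng.normal(0.0, out_std, size=(d_model, d_model)),
        "mlp_out": rng.normal(0.0, out_std, size=(4 * d_model, d_model)),
    }

rng = np.random.default_rng(0)
block = init_block(d_model=64, n_layers=12, rng=rng)
for name, W in block.items():
    print(f"{name:9s} shape={W.shape} std={W.std():.5f}")
```

For a 12-layer model the scaled standard deviation is $0.02/\sqrt{24}$, roughly a fifth of the base value.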
LLaMA Initialization: Uses the PyTorch default for Linear layers, Kaiming uniform with $a = \sqrt{5}$. Despite the name, this setting yields $U(-1/\sqrt{n_{in}}, 1/\sqrt{n_{in}})$, a smaller scale than the He variance of $2/n_{in}$. RMSNorm scales are initialized to 1.0 and there are no bias terms.
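As a worked check of the $a = \sqrt{5}$ setting, assume PyTorch's Kaiming-uniform rule: the gain is $\sqrt{2/(1+a^2)}$, the standard deviation is the gain divided by $\sqrt{n_{in}}$, and the uniform bound is $\sqrt{3}$ times the standard deviation. Then:

$$\text{gain}^2 = \frac{2}{1 + 5} = \frac{1}{3}, \qquad \text{std} = \frac{1}{\sqrt{3\,n_{in}}}, \qquad \text{bound} = \sqrt{3}\cdot\text{std} = \frac{1}{\sqrt{n_{in}}}$$

The resulting weight variance is $1/(3 n_{in})$, a factor of 6 below the He variance $2/n_{in}$.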
Orthogonal Initialization
For recurrent networks (LSTMs, GRUs), orthogonal initialization is important:
$$W = Q \text{ where } Q \text{ is a random orthogonal matrix}$$
Orthogonal matrices have all singular values equal to 1, so repeated multiplication by the recurrent matrix neither shrinks nor amplifies gradient norms at initialization. This is particularly helpful when training deep RNNs or LSTMs on long sequences.
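One common way to draw such a matrix is the QR decomposition of a Gaussian matrix. The sketch below assumes a square recurrent matrix and ignores the nonlinearity and gating of a real RNN; `orthogonal` is a hypothetical helper, not a framework function.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
W = orthogonal(128, rng)
singular_values = np.linalg.svd(W, compute_uv=False)

# Backpropagate a vector through 100 steps of the linear recurrence.
g0 = rng.standard_normal(128)
g = g0.copy()
for _ in range(100):
    g = W.T @ g

print("singular values in:", singular_values.min(), singular_values.max())
print("gradient norm before/after:", np.linalg.norm(g0), np.linalg.norm(g))
```

The norm should be unchanged after 100 steps up to floating-point error, whereas a Gaussian matrix with even slightly mis-scaled variance would shrink or amplify it exponentially.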
Practical Rules
| Network Type | Activation | Recommended Init |
|---|---|---|
| CNN (image classification) | ReLU | He/Kaiming |
| MLP (feed-forward) | ReLU/Leaky ReLU | He |
| MLP (classification head) | Sigmoid/Tanh | Xavier/Glorot |
| Transformer (attention) | — | Xavier or scaled |
| Transformer (FFN with GELU) | GELU | Xavier or He |
| RNN/LSTM | Tanh | Xavier + orthogonal recurrent |
| Embeddings | — | $N(0, 0.02)$ or $U(-1, 1)$ |
Pre-trained Initialization: Transfer Learning
For fine-tuning pre-trained models (the dominant paradigm in 2024-2026):
- Model weights start from pre-trained values (not random)
- New task-specific heads are randomly initialized (Xavier/He)
- LoRA adapters are initialized so the initial update $\Delta W = BA$ is zero: $A \sim N(0, \sigma^2)$, $B = 0$
- This preserves the pre-trained model behavior at the start of fine-tuning
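The LoRA point can be illustrated with a short NumPy sketch. The dimensions, the 0.01 standard deviation for `A`, and the `lora_forward` helper are assumptions for this example, and the random `W0` merely stands in for a pre-trained weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4

W0 = rng.standard_normal((d_out, d_in))      # stand-in for a frozen pre-trained weight
A = rng.normal(0.0, 0.01, size=(r, d_in))    # random down-projection
B = np.zeros((d_out, r))                     # zero up-projection

def lora_forward(x, W0, A, B, scale=1.0):
    """Base output plus the low-rank update x (BA)^T."""
    return x @ W0.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((8, d_in))
delta_W = B @ A
print("update is zero at init:", not np.any(delta_W))
print("output matches base model:", np.allclose(lora_forward(x, W0, A, B), x @ W0.T))
```

Only one factor is zero on purpose: with $A$ random, the gradient with respect to $B$ is generally nonzero, so the adapter can start learning immediately. If both factors were zero, both gradients would vanish.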
Proper initialization is not a detail — it is a prerequisite for training to work at all. Modern frameworks handle it automatically, but understanding why specific initializations are needed is essential for debugging training instabilities, designing custom architectures, and understanding why models trained from scratch behave differently from fine-tuned ones.