Layer Normalization Variants are extensions and modifications of standard LayerNorm that adapt the normalization computation to specific architectures, modalities, or efficiency requirements.
Key Variants
- Pre-Norm: LayerNorm applied before the attention/FFN sublayer, inside the residual branch (used in GPT-2 and most later models). More stable for deep transformers because the residual path stays an identity mapping (see the block-ordering sketch after this list).
- Post-Norm: LayerNorm applied after the attention/FFN sublayer and the residual addition (the original Transformer). Can reach better final quality but is harder to train at depth and typically needs learning-rate warmup.
- RMSNorm: Drops the mean-centering step (and usually the bias) and rescales by the root mean square alone, which is cheaper with little quality loss. Used in LLaMA and Gemma (sketched below).
- DeepNorm: Up-weights the residual connection in a Post-Norm block and pairs it with scaled initialization, enabling stable training of 1,000-layer transformers (included in the block-ordering sketch).
- QK-Norm: Applies LayerNorm to the query and key vectors inside attention before the dot product, which bounds the attention logit growth that can otherwise destabilize training (sketched below).
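To make the Pre-Norm, Post-Norm, and DeepNorm orderings concrete, here is a minimal PyTorch sketch of the three residual-block variants. The class names and the generic sublayer argument are illustrative, not any model's actual API, and DeepNorm's companion trick of scaling certain weight initializations is omitted:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-Norm: x + Sublayer(Norm(x)). The residual path stays an
    identity, so gradients reach early layers cleanly at any depth."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """Post-Norm: Norm(x + Sublayer(x)), the original Transformer
    order. Normalization sits on the residual path itself."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class DeepNormBlock(nn.Module):
    """DeepNorm: Norm(alpha * x + Sublayer(x)) with alpha > 1,
    e.g. (2 * num_layers) ** 0.25 for a decoder-only stack."""
    def __init__(self, dim: int, sublayer: nn.Module, alpha: float):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer
        self.alpha = alpha

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))
```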
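A minimal RMSNorm sketch, assuming a learned per-dimension gain and an epsilon for numerical stability (the module name and default eps are illustrative; production implementations such as LLaMA's additionally upcast to float32 inside the normalization):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalize by the root mean square of the features: no mean
    subtraction and no bias, just a learned per-dimension gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), then apply the learned gain
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Dropping the mean and the bias removes one reduction and one parameter per feature, which is where the efficiency gain over full LayerNorm comes from.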
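Finally, a hedged sketch of QK-Norm inside a standard multi-head attention layer: queries and keys are normalized per head before the scaled dot product, so the logits cannot grow without bound. The class layout and parameter names are assumptions for illustration; implementations vary in whether they use LayerNorm or RMSNorm here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # normalize queries and keys over the head dimension
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # QK-Norm: bound the attention logits by normalizing q and k first
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v)
        return self.out(y.transpose(1, 2).reshape(B, T, C))
```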
Why It Matters
- Architecture-Dependent: The choice of normalization variant significantly impacts training stability and final performance.
- Scaling: Pre-Norm + RMSNorm is standard for billion-parameter LLMs due to training stability.
- Research: Active area with new variants proposed regularly as architectures evolve.
LayerNorm Variants are the normalization toolkit for transformers, with each variant tuned to a specific architectural need.