Positional Encoding Variants

Positional encoding variants are the different methods for injecting position information into neural network architectures, particularly Transformers, whose self-attention is otherwise permutation-equivariant and cannot distinguish token order or spatial location. Because self-attention treats its inputs as an unordered set, positional encodings supply the sequential or spatial structure that lets Transformers process language, images, and other structured data where position carries meaning.

Why Positional Encoding Variants Matter in AI/ML:
Positional encodings are critical for Transformer performance because they are the main mechanism by which these networks represent sequence order, relative distance, and spatial relationships. Without them, an unmasked self-attention stack would treat "the cat sat on the mat" and "mat the on sat cat the" as the same bag of tokens.

• Sinusoidal (original Transformer): Fixed encoding using sine and cosine at geometrically spaced frequencies, with wavelengths growing from 2π to 10000·2π across dimensions: PE(pos,2i) = sin(pos/10000^(2i/d)), PE(pos,2i+1) = cos(pos/10000^(2i/d)), where d is the model dimension; for any fixed offset k, PE(pos+k) is a linear function of PE(pos), so the model can learn to attend by relative position via linear projections
• Learned absolute: Trainable embedding vectors for each position (one per position up to max length); simple and effective, but there are no embeddings for positions beyond the maximum length, so it cannot generalize to longer sequences; used in BERT and GPT-2
• Rotary Position Embedding (RoPE): Encodes position by rotating query and key vectors in 2D subspaces by angles proportional to position; the attention dot product then depends only on the relative offset between the two positions; supports length extrapolation better than absolute encodings
• ALiBi (Attention with Linear Biases): Adds a linear bias proportional to key-query distance directly to attention scores: bias = -m·|i-j|, where m is a head-specific slope; the slopes are fixed rather than learned, so the method is simple, adds no parameters, and enables strong length extrapolation
• Relative position bias: T5-style scheme that adds a learned scalar to attention logits based on the relative distance between tokens; small distances get their own buckets, while longer distances share logarithmically sized buckets
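
The sinusoidal formula and the RoPE rotation can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the RoPE sketch assumes consecutive dimensions (2i, 2i+1) are paired and reuses the 10000 base (some implementations pair dimensions differently), and the sample vectors are arbitrary.

```python
import math


def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position, as a list of d_model floats."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)      # even dimensions use sine
        pe[2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


def rope_rotate(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos / base^(2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s       # standard 2D rotation
        out[2 * i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary example query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Under these assumptions, score_near and score_far agree up to floating-point error, which is the sense in which relative position emerges from the rotated dot product.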

| Encoding | Type | Length Extrapolation | Parameters | Used In |
|----------|------|---------------------|-----------|---------|
| Sinusoidal | Fixed, absolute | Poor | 0 | Original Transformer |
| Learned Absolute | Learned, absolute | None | max_len × d | BERT, GPT-2 |
| RoPE | Rotary, relative | Good | 0 | LLaMA, PaLM, Mistral |
| ALiBi | Linear bias, relative | Excellent | 0 (fixed per-head slopes) | BLOOM, MPT |
| T5 Relative Bias | Learned, relative | Moderate | n_heads × n_buckets | T5, Flan-T5 |
| Conditional (CPE) | Input-dependent | Good | Learned | Some vision transformers |
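
The ALiBi row reports zero parameters because both the slopes and the bias follow from a closed-form rule. A minimal sketch, assuming the geometric slope schedule m_h = 2^(-8h/n) described in the ALiBi paper for head counts that are powers of two; the function names are hypothetical and the bias uses the symmetric -m·|i-j| form given earlier.

```python
def alibi_slopes(n_heads):
    """Fixed per-head slopes m_h = 2^(-8h / n_heads) for h = 1..n_heads."""
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]


def alibi_bias(seq_len, slope):
    """Bias matrix whose entry [i][j] is -slope * |i - j|."""
    return [
        [-slope * abs(i - j) for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])   # bias for the head with the steepest slope
```

The bias matrix would be added to the query-key scores before the softmax, so heads with larger slopes penalize distant tokens more strongly. In a causal model only the entries with j ≤ i are used.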

Positional encoding is a fundamental design choice for Transformer architectures that directly affects length generalization, relative distance modeling, and computational efficiency. The evolution from fixed sinusoidal encodings to rotary and linear-bias methods reflects the field's deepening understanding of how position information should be integrated into attention-based computation.
