Learned position embeddings are trainable parameter vectors assigned to each spatial position in a Vision Transformer's input sequence — providing the model with spatial location information by adding a unique, learned vector to each patch token so the transformer can distinguish where in the image each patch originated.
What Are Learned Position Embeddings?
- Definition: A set of trainable vectors, one per input sequence position, that are added to the patch embeddings before processing by transformer encoder layers. For ViT-Base with 196 patches + 1 CLS token, this is a learnable parameter matrix of shape (197, 768).
- Origin: Derived from the original Transformer architecture (Vaswani et al., 2017) and adapted for vision by ViT (Dosovitskiy et al., 2020).
- Initialization: Typically initialized with small random values (common ViT implementations use a truncated normal with standard deviation 0.02) and optimized during training through backpropagation like any other model parameter.
- Addition Operation: Position information is injected by element-wise addition: token_input = patch_embedding + position_embedding[i] for position i.
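For concreteness, here is a minimal PyTorch sketch of that addition, using the ViT-Base shapes quoted above; the variable names are illustrative, not from any particular library:

```python
import torch
import torch.nn as nn

dim, num_tokens = 768, 197                                  # 196 patches + 1 CLS token (ViT-Base)
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))   # trainable (197, 768) table
nn.init.trunc_normal_(pos_embed, std=0.02)                  # random init, refined by training

patch_tokens = torch.randn(8, num_tokens, dim)              # a batch of patch embeddings (+ CLS)
x = patch_tokens + pos_embed                                # element-wise addition, broadcast over the batch
```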
Why Learned Position Embeddings Matter
- Spatial Awareness: Without position embeddings, the transformer treats the input as a bag of patches with no spatial ordering — it cannot distinguish top-left from bottom-right, making spatial reasoning impossible.
- Permutation Invariance Problem: Self-attention is inherently permutation-equivariant, meaning that permuting the input tokens simply permutes the outputs; the encoder itself carries no notion of where each patch sits. Position embeddings break this symmetry and inject spatial structure (see the sketch after this list).
- Simplicity: Learned embeddings are the simplest position encoding — just add a parameter matrix. No special implementation, no mathematical formulas, no architectural modifications.
- Task Adaptation: The model can learn task-specific position patterns — for classification, it might learn center-weighted position biases; for detection, it might learn edge-aware position patterns.
- Empirical Baseline: Learned position embeddings remain a strong baseline; the original Transformer reported nearly identical results for learned and fixed sinusoidal encodings, and ViT's ablations likewise found little difference among its position-embedding variants.
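A toy sketch of the permutation point above (the sizes are arbitrary and nothing here is specific to ViT): shuffling the tokens fed to a self-attention layer just shuffles its outputs, so without position embeddings the model never sees spatial information.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 5, 64)                  # five "patch" tokens, no position information
perm = torch.tensor([3, 1, 4, 0, 2])       # arbitrary reordering of the tokens

out, _ = attn(x, x, x)
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Self-attention is permutation-equivariant: permuting the inputs
# permutes the outputs in exactly the same way.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```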
How Learned Position Embeddings Work
Training Phase:
- Initialize position_embedding as a learnable nn.Parameter of shape (N+1, D).
- At each forward pass: x = patch_embed(image) + position_embedding.
- Gradients flow through position embeddings during backpropagation.
- The model learns to assign vectors that encode useful spatial information.
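A minimal sketch of this training-time wiring in PyTorch, assuming a ViT-Base-like configuration (224×224 input, 16×16 patches, dim 768); the class and attribute names are illustrative, not from a specific library:

```python
import torch
import torch.nn as nn

class PatchEmbedWithLearnedPos(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2                  # 196 for ViT-Base
        # Patch embedding as a strided convolution: one token per patch
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned absolute position embeddings: one trainable vector per position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, images):
        B = images.shape[0]
        x = self.proj(images)                         # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)              # (B, 196, dim)
        cls = self.cls_token.expand(B, -1, -1)        # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)                # (B, 197, dim)
        # Gradients flow into pos_embed like any other parameter
        return x + self.pos_embed
```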
What the Model Learns:
Analysis of trained position embeddings reveals clear spatial structure:
- Nearby positions have similar embeddings (high cosine similarity).
- Same-row and same-column positions show strong correlation patterns.
- The 2D spatial grid structure emerges naturally despite being stored as a 1D list.
- Corner and edge positions are distinct from center positions.
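One way to see this structure is to compute pairwise cosine similarities between the trained patch position embeddings, as sketched below; the function name is an assumption and the grid size matches a ViT-Base-style model:

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed: torch.Tensor) -> torch.Tensor:
    # pos_embed: (1, 1 + N, dim) with the CLS embedding first (N = 196 for ViT-Base)
    patches = pos_embed[0, 1:]                 # drop CLS -> (N, dim)
    patches = F.normalize(patches, dim=-1)     # unit-normalize each embedding
    return patches @ patches.T                 # (N, N) cosine-similarity matrix

# Row i, reshaped to the 14x14 patch grid, typically shows high similarity
# concentrated around position i's own row and column in a trained model.
```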
Limitations of Learned Position Embeddings
| Limitation | Description | Impact |
|-----------|-----------|--------|
| Fixed Sequence Length | Trained for specific number of positions (e.g., 197) | Cannot handle different resolutions natively |
| Resolution Mismatch | Training at 224×224 (196 patches), inference at 384×384 (576 patches) requires interpolation | Performance degradation at non-training resolutions |
| Interpolation Artifacts | Bicubic interpolation of position embeddings introduces artifacts | Especially problematic for large resolution changes |
| No Translation Invariance | Position (3,5) and (10,5) have independent embeddings | Must learn spatial patterns at every position separately |
| Data Hungry | Needs sufficient training data to learn meaningful position patterns | May underfit with limited data |
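The first two rows are easy to see directly: the learned table has a fixed sequence length, so tokens from a larger image cannot simply be added to it. A toy sketch with ViT-Base shapes:

```python
import torch

pos_embed = torch.zeros(1, 197, 768)     # trained at 224x224: 196 patches + CLS
tokens_384 = torch.zeros(1, 577, 768)    # 384x384 input: 576 patches + CLS

try:
    _ = tokens_384 + pos_embed           # sequence lengths 577 vs 197 cannot broadcast
except RuntimeError as err:
    print(err)                           # size mismatch -> interpolation is required
```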
Resolution Transfer Protocol
When fine-tuning a ViT at a different resolution than pretraining:
1. Set the CLS token's position embedding aside, then reshape the 1D sequence of patch position embeddings into a 2D grid: (N, D) → (H_train, W_train, D).
2. Apply bicubic interpolation to the new grid size: (H_train, W_train, D) → (H_new, W_new, D).
3. Flatten back to a 1D sequence and re-attach the CLS embedding: (H_new × W_new + 1, D).
4. Fine-tune with the interpolated position embeddings (typically with a lower learning rate for the position parameters).
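A sketch of this recipe in PyTorch, following the steps above; the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, 1 + old_grid**2, D) with the CLS embedding first
    cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_embed.shape[1] ** 0.5)
    D = patch_embed.shape[-1]
    # Step 1: reshape the 1D sequence into a 2D grid of embeddings
    grid = patch_embed.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    # Step 2: bicubic interpolation to the new grid size
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    # Step 3: flatten back to 1D and re-attach the CLS embedding
    patch_embed = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_embed, patch_embed], dim=1)

# Example: 224x224 (14x14 grid) -> 384x384 (24x24 grid), patch size 16
new_pos = interpolate_pos_embed(torch.randn(1, 197, 768), new_grid=24)   # (1, 577, 768)
```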
Learned Position Embeddings vs. Alternatives
| Method | Learned | Resolution Flexible | Translation Invariant | Parameters |
|--------|---------|--------------------|--------------------|-----------|
| Learned Absolute | Yes | No | No | N × D |
| Sinusoidal Fixed | No | Partially | No | 0 |
| Relative Bias | Yes | Yes (within window) | Yes | (2M-1)² |
| CPE (Convolutional) | Yes | Yes | Yes | 9C |
| RoPE | No | Yes | Yes | 0 |
| No Position | — | Yes | Yes | 0 |
Here N is the number of positions, D the embedding dimension, M the attention window size (the relative bias table is counted per head), and C the number of channels for a 3×3 depthwise convolution.
Learned position embeddings are the simplest and most intuitive spatial encoding for Vision Transformers — while newer alternatives offer better resolution flexibility and translation invariance, learned embeddings remain the default choice in many architectures due to their simplicity, strong baseline performance, and ease of implementation.