Learned position embeddings are trainable parameter vectors assigned to each spatial position in a Vision Transformer's input sequence — providing the model with spatial location information by adding a unique, learned vector to each patch token so the transformer can distinguish where in the image each patch originated.
What Are Learned Position Embeddings?
- Definition: A set of trainable vectors, one per input sequence position, that are added to the patch embeddings before processing by transformer encoder layers. For ViT-Base with 196 patches + 1 CLS token, this is a learnable parameter matrix of shape (197, 768).
- Origin: Derived from the original Transformer architecture (Vaswani et al., 2017) and adapted for vision by ViT (Dosovitskiy et al., 2020).
- Initialization: Typically initialized with small random values (common ViT implementations use a truncated normal with standard deviation 0.02) and optimized during training through backpropagation like any other model parameter.
- Addition Operation: Position information is injected by element-wise addition: token_input = patch_embedding + position_embedding[i] for position i.
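For concreteness, here is a minimal PyTorch sketch of that addition, using the ViT-Base shapes quoted above; the variable names are illustrative, not from any particular library:

```python
import torch
import torch.nn as nn

dim, num_tokens = 768, 197                                  # 196 patches + 1 CLS token (ViT-Base)
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))   # trainable (197, 768) table
nn.init.trunc_normal_(pos_embed, std=0.02)                  # random init, refined by training

patch_tokens = torch.randn(8, num_tokens, dim)              # a batch of patch embeddings (+ CLS)
x = patch_tokens + pos_embed                                # element-wise addition, broadcast over the batch
```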
Why Learned Position Embeddings Matter
- Spatial Awareness: Without position embeddings, the transformer treats the input as a bag of patches with no spatial ordering — it cannot distinguish top-left from bottom-right, making spatial reasoning impossible.
- Permutation Invariance Problem: Self-attention is inherently permutation-equivariant, meaning that permuting the input tokens simply permutes the outputs; the encoder itself carries no notion of where each patch sits. Position embeddings break this symmetry and inject spatial structure (see the sketch after this list).
- Simplicity: Learned embeddings are the simplest position encoding — just add a parameter matrix. No special implementation, no mathematical formulas, no architectural modifications.
- Task Adaptation: The model can learn task-specific position patterns — for classification, it might learn center-weighted position biases; for detection, it might learn edge-aware position patterns.
- Empirical Baseline: Learned position embeddings remain a strong baseline; the original Transformer reported nearly identical results for learned and fixed sinusoidal encodings, and ViT's ablations likewise found little difference among its position-embedding variants.
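A toy sketch of the permutation point above (the sizes are arbitrary and nothing here is specific to ViT): shuffling the tokens fed to a self-attention layer just shuffles its outputs, so without position embeddings the model never sees spatial information.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 5, 64)                  # five "patch" tokens, no position information
perm = torch.tensor([3, 1, 4, 0, 2])       # arbitrary reordering of the tokens

out, _ = attn(x, x, x)
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Self-attention is permutation-equivariant: permuting the inputs
# permutes the outputs in exactly the same way.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```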
How Learned Position Embeddings Work
Training Phase:
- Initialize position_embedding as a learnable nn.Parameter of shape (N+1, D).
- At each forward pass: x = patch_embed(image) + position_embedding.
- Gradients flow through position embeddings during backpropagation.
- The model learns to assign vectors that encode useful spatial information.
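A minimal sketch of this training-time wiring in PyTorch, assuming a ViT-Base-like configuration (224×224 input, 16×16 patches, dim 768); the class and attribute names are illustrative, not from a specific library:

```python
import torch
import torch.nn as nn

class PatchEmbedWithLearnedPos(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2                  # 196 for ViT-Base
        # Patch embedding as a strided convolution: one token per patch
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned absolute position embeddings: one trainable vector per position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, images):
        B = images.shape[0]
        x = self.proj(images)                         # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)              # (B, 196, dim)
        cls = self.cls_token.expand(B, -1, -1)        # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)                # (B, 197, dim)
        # Gradients flow into pos_embed like any other parameter
        return x + self.pos_embed
```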
What the Model Learns:
Analysis of trained position embeddings reveals clear spatial structure:
- Nearby positions have similar embeddings (high cosine similarity).
- Same-row and same-column positions show strong correlation patterns.
- The 2D spatial grid structure emerges naturally despite being stored as a 1D list.
- Corner and edge positions are distinct from center positions.
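One way to see this structure is to compute pairwise cosine similarities between the trained patch position embeddings, as sketched below; the function name is an assumption and the grid size matches a ViT-Base-style model:

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed: torch.Tensor) -> torch.Tensor:
    # pos_embed: (1, 1 + N, dim) with the CLS embedding first (N = 196 for ViT-Base)
    patches = pos_embed[0, 1:]                 # drop CLS -> (N, dim)
    patches = F.normalize(patches, dim=-1)     # unit-normalize each embedding
    return patches @ patches.T                 # (N, N) cosine-similarity matrix

# Row i, reshaped to the 14x14 patch grid, typically shows high similarity
# concentrated around position i's own row and column in a trained model.
```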
Limitations of Learned Position Embeddings
| Limitation | Description | Impact |
|-----------|-----------|--------|
| Fixed Sequence Length | Trained for specific number of positions (e.g., 197) | Cannot handle different resolutions natively |
| Resolution Mismatch | Training at 224×224 (196 patches), inference at 384×384 (576 patches) requires interpolation | Performance degradation at non-training resolutions |
| Interpolation Artifacts | Bicubic interpolation of position embeddings introduces artifacts | Especially problematic for large resolution changes |
| No Translation Invariance | Position (3,5) and (10,5) have independent embeddings | Must learn spatial patterns at every position separately |
| Data Hungry | Needs sufficient training data to learn meaningful position patterns | May underfit with limited data |
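The first two rows are easy to see directly: the learned table has a fixed sequence length, so tokens from a larger image cannot simply be added to it. A toy sketch with ViT-Base shapes:

```python
import torch

pos_embed = torch.zeros(1, 197, 768)     # trained at 224x224: 196 patches + CLS
tokens_384 = torch.zeros(1, 577, 768)    # 384x384 input: 576 patches + CLS

try:
    _ = tokens_384 + pos_embed           # sequence lengths 577 vs 197 cannot broadcast
except RuntimeError as err:
    print(err)                           # size mismatch -> interpolation is required
```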
Resolution Transfer Protocol
When fine-tuning a ViT at a different resolution than pretraining:
1. Set the CLS token's position embedding aside, then reshape the 1D sequence of patch position embeddings into a 2D grid: (N, D) → (H_train, W_train, D).
2. Apply bicubic interpolation to the new grid size: (H_train, W_train, D) → (H_new, W_new, D).
3. Flatten back to a 1D sequence and re-attach the CLS embedding: (H_new × W_new + 1, D).
4. Fine-tune with the interpolated position embeddings (typically with a lower learning rate for the position parameters).
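A sketch of this recipe in PyTorch, following the steps above; the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, 1 + old_grid**2, D) with the CLS embedding first
    cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_embed.shape[1] ** 0.5)
    D = patch_embed.shape[-1]
    # Step 1: reshape the 1D sequence into a 2D grid of embeddings
    grid = patch_embed.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    # Step 2: bicubic interpolation to the new grid size
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    # Step 3: flatten back to 1D and re-attach the CLS embedding
    patch_embed = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_embed, patch_embed], dim=1)

# Example: 224x224 (14x14 grid) -> 384x384 (24x24 grid), patch size 16
new_pos = interpolate_pos_embed(torch.randn(1, 197, 768), new_grid=24)   # (1, 577, 768)
```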
Learned Position Embeddings vs. Alternatives
| Method | Learned | Resolution Flexible | Translation Invariant | Parameters |
|--------|---------|--------------------|--------------------|-----------|
| Learned Absolute | Yes | No | No | N × D |
| Sinusoidal Fixed | No | Partially | No | 0 |
| Relative Bias | Yes | Yes (within window) | Yes | (2M-1)² |
| CPE (Convolutional) | Yes | Yes | Yes | 9C |
| RoPE | No | Yes | Yes | 0 |
| No Position | — | Yes | Yes | 0 |
Here N is the number of positions, D the embedding dimension, M the attention window size (the relative bias table is counted per head), and C the number of channels for a 3×3 depthwise convolution.
Learned position embeddings are the simplest and most intuitive spatial encoding for Vision Transformers — while newer alternatives offer better resolution flexibility and translation invariance, learned embeddings remain the default choice in many architectures due to their simplicity, strong baseline performance, and ease of implementation.