Learned position embeddings

Learned position embeddings are trainable parameter vectors, one for each position in a Vision Transformer's input sequence. A unique learned vector is added to each patch token, which gives the model spatial location information so the transformer can distinguish where in the image each patch originated.
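The addition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the sizes (196 patches, width 768), the prepended class token, and the function name `add_position` are assumptions chosen for the example, and in a real model `pos_embed` would be a trainable parameter updated by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 196 patch tokens of width 768, plus one class token.
num_patches, dim = 196, 768
patch_tokens = rng.normal(size=(num_patches, dim))
cls_token = np.zeros((1, dim))

# One vector per sequence position. Randomly initialised here; trainable in a real model.
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, dim))


def add_position(patch_tokens, cls_token, pos_embed):
    """Prepend the class token, then add one position vector to every token."""
    tokens = np.concatenate([cls_token, patch_tokens], axis=0)
    if tokens.shape != pos_embed.shape:
        raise ValueError("sequence length does not match the position table")
    return tokens + pos_embed


encoder_input = add_position(patch_tokens, cls_token, pos_embed)
```

The shape check is the point where the fixed-length nature of the table shows up: a sequence with any other number of tokens cannot be combined with `pos_embed` without resizing it first.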

What Are Learned Position Embeddings?

Why Learned Position Embeddings Matter

How Learned Position Embeddings Work

Training Phase:

What the Model Learns:

Limitations of Learned Position Embeddings

| Limitation | Description | Impact |
|---|---|---|
| Fixed Sequence Length | Trained for a specific number of positions (e.g., 197) | Cannot handle different resolutions natively |
| Resolution Mismatch | Training at 224×224 (196 patches) and inference at 384×384 (576 patches) requires interpolation | Performance degradation at non-training resolutions |
| Interpolation Artifacts | Bicubic interpolation of position embeddings introduces artifacts | Especially problematic for large resolution changes |
| No Translation Invariance | Positions (3,5) and (10,5) have independent embeddings | Must learn spatial patterns at every position separately |
| Data Hungry | Needs sufficient training data to learn meaningful position patterns | May underfit with limited data |
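The sequence lengths quoted in the table follow from simple arithmetic. The sketch below assumes square images and a 16×16 patch size, which is consistent with the 196 and 576 figures but is not stated in the table itself; the 197 figure then corresponds to 196 patches plus one extra token such as a class token. The helper name `num_patches` is hypothetical.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side


train_patches = num_patches(224)      # 14 x 14 grid
infer_patches = num_patches(384)      # 24 x 24 grid
train_positions = train_patches + 1   # one extra position for a class token
```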

Resolution Transfer Protocol

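When fine-tuning a ViT at a different resolution than pretraining:

1. Set aside any non-patch embeddings (such as the class token), then reshape the patch position embeddings from a 1D sequence to a 2D grid: (N, D) → (H_train, W_train, D), where N = H_train × W_train and D is the embedding dimension.
2. Apply bicubic interpolation over the spatial grid: (H_train, W_train, D) → (H_new, W_new, D).
3. Flatten back to a 1D sequence, (H_new × W_new, D), and re-attach the embeddings set aside in step 1.
4. Fine-tune with the interpolated position embeddings (typically with a lower learning rate for the position embeddings).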
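The reshape, interpolate, and flatten steps above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions rather than a reference one: it uses the Keys cubic convolution kernel with a = -0.5, half-pixel sample alignment, and edge replication at the borders, and it assumes a single leading class-token embedding that is copied through unchanged. Deep learning frameworks ship their own bicubic resize routines, which may use a different kernel constant or alignment, so numerical results will not match them exactly. All function names here are hypothetical.

```python
import numpy as np


def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5 is an assumed, common choice)."""
    x = np.abs(x)
    near = (a + 2) * x**3 - (a + 3) * x**2 + 1          # used where |x| <= 1
    far = a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a   # used where 1 < |x| < 2
    return np.where(x <= 1, near, np.where(x < 2, far, 0.0))


def resize_matrix(n_old, n_new):
    """Return an (n_new, n_old) matrix that resamples a 1D signal cubically."""
    # Map each output sample centre into input coordinates (half-pixel convention).
    src = (np.arange(n_new) + 0.5) * n_old / n_new - 0.5
    base = np.floor(src).astype(int)
    rows = np.arange(n_new)
    weights = np.zeros((n_new, n_old))
    for offset in (-1, 0, 1, 2):                        # four taps per output sample
        idx = base + offset
        # Taps that fall outside the grid are clamped to the nearest border sample.
        np.add.at(weights, (rows, np.clip(idx, 0, n_old - 1)), cubic_kernel(src - idx))
    return weights


def resize_pos_embed(pos_embed, old_grid, new_grid, num_prefix_tokens=1):
    """Resize (prefix + H_old*W_old, D) position embeddings to a new patch grid."""
    prefix, patch = pos_embed[:num_prefix_tokens], pos_embed[num_prefix_tokens:]
    (h_old, w_old), (h_new, w_new) = old_grid, new_grid
    dim = patch.shape[-1]
    grid = patch.reshape(h_old, w_old, dim)             # step 1: sequence -> 2D grid
    w_rows = resize_matrix(h_old, h_new)
    w_cols = resize_matrix(w_old, w_new)
    # Step 2: separable bicubic interpolation, applied per embedding channel.
    resized = np.einsum("ih,jw,hwd->ijd", w_rows, w_cols, grid)
    # Step 3: flatten back to a sequence and re-attach the prefix embeddings.
    return np.concatenate([prefix, resized.reshape(h_new * w_new, dim)], axis=0)


pretrained = np.random.default_rng(0).normal(scale=0.02, size=(197, 8))
finetune_init = resize_pos_embed(pretrained, (14, 14), (24, 24))
```

Here `finetune_init` has 577 rows: the untouched class-token embedding followed by a 24×24 grid of interpolated patch embeddings. Step 4, fine-tuning, is outside the scope of this sketch.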

Learned Position Embeddings vs. Alternatives

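| Method | Learned | Resolution Flexible | Translation Invariant | Parameters |
|---|---|---|---|---|
| Learned Absolute | Yes | No | No | N × D |
| Sinusoidal Fixed | No | Partially | No | 0 |
| Relative Bias | Yes | Yes (within window) | Yes | (2M-1)² |
| CPE (Convolutional) | Yes | Yes | Yes | 9C |
| RoPE | No | Yes | Yes | 0 |
| No Position | N/A | Yes | Yes | 0 |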
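To give a sense of scale for the Parameters column, the formulas can be evaluated for some example sizes. The values below (N = 197 positions, D = 768, window size M = 7, C = 768 channels) are hypothetical and chosen only for illustration. The 9C figure is read here as a 3×3 depthwise convolution without bias, which is an assumption about what the table intends.

```python
def learned_absolute_params(num_positions: int, dim: int) -> int:
    """N x D: one D-dimensional vector per position."""
    return num_positions * dim


def relative_bias_params(window_size: int) -> int:
    """(2M-1)^2: one bias per relative (dy, dx) offset inside an M x M window."""
    return (2 * window_size - 1) ** 2


def cpe_params(channels: int) -> int:
    """9C: assumed 3x3 depthwise convolution, 9 weights per channel, no bias."""
    return 9 * channels


counts = {
    "learned_absolute": learned_absolute_params(197, 768),
    "relative_bias": relative_bias_params(7),
    "cpe": cpe_params(768),
}
```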

Learned position embeddings are the simplest and most intuitive spatial encoding for Vision Transformers. Newer alternatives offer better resolution flexibility and translation invariance, but learned embeddings remain the default choice in many architectures because of their simplicity, strong baseline performance, and ease of implementation.
