Relative position bias is a learned spatial encoding in Vision Transformers that captures the relative distance and direction between pairs of patches rather than their absolute positions. This provides translation invariance, so spatial relationships like "nose is above mouth" hold regardless of where the face appears in the image, which improves generalization and enables flexible resolution handling.
What Is Relative Position Bias?
- Definition: A learnable bias term added to the attention logits that encodes the relative spatial offset between each pair of tokens in the attention computation, replacing or augmenting absolute position embeddings.
- Relative vs. Absolute: Absolute position embeddings assign a fixed vector to each spatial location (e.g., "position 5 = vector_5"). Relative position bias encodes relationships between positions (e.g., "3 steps right and 2 steps down = bias_value").
- Implementation: For a window of M×M tokens, the relative position between any two tokens ranges from -(M-1) to +(M-1) along each axis, creating a (2M-1) × (2M-1) bias table indexed by relative offset.
- Swin Transformer: Relative position bias is a core component of Swin Transformer, added directly to the attention scores before softmax: Attention(Q, K, V) = Softmax(QK^T/√d_k + B)V, where B is the relative position bias matrix (a toy illustration follows below).
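As a minimal toy illustration in PyTorch (a 1D sequence of three tokens, with random numbers standing in for real logits), the bias is just a table gathered by relative offset and added to the attention logits:

```python
import torch

# 3 tokens in 1D -> relative offsets -2..+2 -> 5 unique offsets.
bias_table = torch.randn(5)                    # one learnable scalar per offset
offsets = torch.arange(3)[:, None] - torch.arange(3)[None, :]  # (3, 3), in [-2, 2]
B = bias_table[offsets + 2]                    # shift to [0, 4], gather -> (3, 3)

scores = torch.randn(3, 3)                     # stand-in for Q @ K^T / sqrt(d_k)
attn = torch.softmax(scores + B, dim=-1)       # bias added to logits before softmax
```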
Why Relative Position Bias Matters
- Translation Invariance: A patch at position (3,5) relating to a patch at (3,7) has the same relative offset as (10,5) relating to (10,7) — the model learns that "2 steps right" is the same relationship regardless of absolute position.
- Better Generalization: Models with relative position bias generalize better to unseen spatial configurations because they learn relationships rather than memorizing absolute positions.
- Resolution Flexibility: When transferring a model trained at 224×224 to 384×384, relative position biases can be interpolated naturally because the relative relationships (nearby, far, same-row) maintain their meaning (see the interpolation sketch after this list).
- Empirical Superiority: Swin Transformer and subsequent work consistently show that relative position bias outperforms absolute position embeddings on classification, detection, and segmentation benchmarks.
- Window Attention Compatibility: Relative position bias naturally fits window-based attention — within each M×M window, the bias table is compact and efficiently indexed.
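As a concrete sketch of that transfer (the helper name and shapes here are illustrative, not a library API), the snippet below resizes a per-head bias table from window size M=7 to M=12 by bicubic interpolation over the 2D offset grid, a common recipe when adapting Swin checkpoints to a new resolution:

```python
import torch
import torch.nn.functional as F

def interpolate_bias_table(table, M_old, M_new):
    """Resize a ((2*M_old-1)**2, heads) bias table to window size M_new
    via bicubic interpolation over the 2D relative-offset grid."""
    heads = table.shape[1]
    side_old, side_new = 2 * M_old - 1, 2 * M_new - 1
    grid = table.t().reshape(1, heads, side_old, side_old)   # (1, heads, 13, 13) for M_old=7
    grid = F.interpolate(grid, size=(side_new, side_new),
                         mode="bicubic", align_corners=False)
    return grid.reshape(heads, -1).t()                       # ((2*M_new-1)**2, heads)

# e.g. adapting a model trained with M=7 windows to M=12 windows:
new_table = interpolate_bias_table(torch.randn(169, 4), M_old=7, M_new=12)
print(new_table.shape)   # torch.Size([529, 4]) -- 23*23 offsets per head
```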
How Relative Position Bias Works
Bias Table Construction:
- For a window size M, relative positions range from -(M-1) to +(M-1) along each axis.
- Total unique relative positions: (2M-1) × (2M-1). For M=7: 13×13 = 169 learnable bias values.
- Each bias value is a scalar added to the corresponding attention logit.
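In PyTorch terms, the table is just a learnable parameter with one row per relative offset (a sketch; `bias_table` is an illustrative name, and the truncated-normal initialization follows Swin's reference code):

```python
import torch
import torch.nn as nn

M, num_heads = 7, 4                              # e.g. Swin uses M=7 windows
# One learnable scalar per unique relative offset, per attention head:
# (2*M - 1)**2 = 13*13 = 169 entries for M=7.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
nn.init.trunc_normal_(bias_table, std=0.02)      # init as in Swin
print(bias_table.shape)                          # torch.Size([169, 4])
```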
Index Mapping:
- For token i at (row_i, col_i) and token j at (row_j, col_j):
- Relative row offset: Δrow = row_i - row_j + (M-1) (shifted to positive range)
- Relative col offset: Δcol = col_i - col_j + (M-1)
- Bias index: Δrow × (2M-1) + Δcol
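The same mapping, vectorized over all token pairs at once (a PyTorch sketch; the helper name is ours, but the computation mirrors the Swin reference code):

```python
import torch

def relative_position_index(M):
    """Map each (i, j) token pair in an MxM window to an index
    into the (2M-1)**2-entry bias table."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(M), torch.arange(M), indexing="ij"))   # (2, M, M)
    coords = coords.flatten(1)                              # (2, M*M)
    rel = coords[:, :, None] - coords[:, None, :]           # (2, M*M, M*M)
    rel = rel.permute(1, 2, 0).contiguous()                 # (M*M, M*M, 2)
    rel[:, :, 0] += M - 1          # shift row offsets into [0, 2M-2]
    rel[:, :, 1] += M - 1          # shift col offsets into [0, 2M-2]
    rel[:, :, 0] *= 2 * M - 1      # row-major flattening: drow*(2M-1) + dcol
    return rel.sum(-1)             # (M*M, M*M), values in [0, (2M-1)**2)
```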
Attention Computation:
- Standard: Attention(Q, K, V) = Softmax(QK^T / √d_k)V
- With bias: Attention(Q, K, V) = Softmax(QK^T / √d_k + B)V
- B is an M²×M² matrix (per head) populated from the (2M-1)²-entry bias table via the index mapping above.
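Putting the pieces together, here is a minimal sketch of a window attention layer with the bias added before softmax. It reuses the `relative_position_index` helper from the previous snippet; the class name and arguments are illustrative, not any library's exact API:

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Window attention with relative position bias (minimal sketch)."""
    def __init__(self, dim, num_heads, M):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        nn.init.trunc_normal_(self.bias_table, std=0.02)
        # Fixed (M*M, M*M) lookup built once from the index-mapping snippet above.
        self.register_buffer("rel_index", relative_position_index(M))

    def forward(self, x):                                   # x: (B, M*M, dim)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # QK^T / sqrt(d_k)
        bias = self.bias_table[self.rel_index.reshape(-1)]  # (N*N, heads)
        bias = bias.reshape(N, N, -1).permute(2, 0, 1)      # (heads, N, N)
        attn = torch.softmax(attn + bias.unsqueeze(0), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)
```

A call like `WindowAttention(dim=96, num_heads=3, M=7)` on a `(batch, 49, 96)` tensor of window tokens returns output of the same shape. Note that the bias depends only on relative offsets, so the same `(heads, N, N)` matrix is shared across all windows and batch elements.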
Comparison of Position Encoding Methods
| Method | Type | Translation Invariant | Resolution Flexible | Parameters |
|--------|------|----------------------|--------------------|-----------|
| Learned Absolute | Additive embedding | No | No (fixed length) | N × D |
| Sinusoidal Absolute | Fixed, no learning | No | Partially | 0 |
| Relative Position Bias | Attention bias | Yes | Yes (interpolate) | (2M-1)² per head |
| RoPE (Rotary) | Rotation in Q/K | Yes | Yes | 0 |
| Conditional (CPE) | Conv-based | Yes | Yes | Conv params |
Relative Position Bias Variants
- Per-Head Bias (Swin): Each attention head has its own bias table, allowing different heads to learn different spatial relationship patterns.
- Shared Bias: A single bias table shared across heads — fewer parameters, slightly lower performance.
- Continuous Bias (Log-CPB): Swin Transformer V2 uses a small MLP to generate bias values from continuous log-spaced coordinates, enabling better transfer across window sizes (see the sketch after this list).
- 3D Relative Bias: Extended to video transformers by adding a temporal relative position dimension.
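For the continuous variant, here is a minimal sketch of Swin V2-style Log-CPB (the class name and hidden width are illustrative; the normalization to [-8, 8] followed by a log2 rescale follows the paper's recipe, but this is not the reference implementation):

```python
import torch
import torch.nn as nn

class LogCPB(nn.Module):
    """Continuous relative position bias: an MLP maps log-spaced
    relative coordinates to per-head biases (Swin V2-style sketch)."""
    def __init__(self, num_heads, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, num_heads, bias=False))

    def forward(self, M):
        # Relative offsets in [-(M-1), M-1] along each axis.
        rng = torch.arange(-(M - 1), M, dtype=torch.float32)
        coords = torch.stack(torch.meshgrid(rng, rng, indexing="ij"), dim=-1)
        coords = coords / (M - 1) * 8                        # normalize to [-8, 8]
        coords = torch.sign(coords) * torch.log2(coords.abs() + 1) / 3  # log-spaced
        return self.mlp(coords).permute(2, 0, 1)             # (heads, 2M-1, 2M-1)
```

Because the MLP takes coordinates as input rather than indexing a fixed-size table, the same weights can produce bias grids for any window size, which is what makes the transfer across window sizes work.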
Relative position bias is the position encoding method of choice for modern Vision Transformers — by learning how patches relate to each other rather than where they are in absolute terms, it provides the spatial understanding transformers need while maintaining the flexibility to generalize across resolutions and spatial configurations.