Conditional position encoding (CPE) is a dynamic position embedding method that generates position information from the input features themselves using a convolutional layer — enabling Vision Transformers to handle variable input resolutions at inference time without the position embedding mismatch that plagues standard learned absolute position embeddings.
What Is Conditional Position Encoding?
- Definition: A position encoding mechanism that uses a depth-wise convolutional layer applied to the input feature maps to generate position-dependent features that are added to the token representations, rather than using fixed or learned lookup tables indexed by position.
- Dynamic Generation: Unlike absolute position embeddings that are fixed after training, CPE generates position information conditioned on the actual input features and spatial arrangement — the encoding adapts to the content and resolution of each input.
- Implementation: Typically a depth-wise Conv2D with a 3×3 kernel applied to the 2D-reshaped token features, producing position-aware features that are added back to the tokens as a residual between transformer blocks (exact placement varies; see Design Choices below).
- Origin: Introduced in CPVT (Conditional Position encoding Vision Transformer, Chu et al., 2021) and adopted by subsequent architectures including PVTv2 and Twins.
Why CPE Matters
- Resolution Flexibility: A standard ViT with learned position embeddings trained at 224×224 has 196 position vectors (a 14×14 grid); at 384×384 inference there are 576 patches, so the embeddings must be interpolated, which degrades performance. CPE generates appropriate position encodings for any resolution natively (see the sketch after this list).
- No Interpolation Needed: Since CPE derives position information from the spatial arrangement of features through convolution, it naturally adapts to any spatial dimension without interpolation artifacts.
- Variable-Size Inputs: Applications like object detection and video processing require handling diverse input sizes — CPE eliminates the fixed-resolution constraint of absolute position embeddings.
- Zero-Padding Position Signal: The convolutional layer's zero-padding at feature map boundaries implicitly encodes absolute position information (tokens near edges see padding, center tokens don't), providing both relative and absolute position cues.
- Plug-and-Play: CPE can be inserted into any ViT architecture as a simple convolutional layer between transformer blocks, requiring minimal architectural changes.
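To make the interpolation problem concrete, here is a minimal PyTorch sketch of the resizing step a standard ViT needs when the inference resolution changes. The shapes follow the common ViT-B/16 convention (16×16 patches, 768-dim tokens, class token omitted), and the bicubic mode mirrors common practice rather than any single implementation:

```python
import torch
import torch.nn.functional as F

# Learned absolute embeddings: one vector per patch position, fixed at training.
# ViT-B/16 at 224x224 -> 14x14 = 196 patch positions (class token omitted).
pos_embed = torch.randn(1, 196, 768)

# At 384x384 inference there are 24x24 = 576 patches, so the 14x14 grid of
# embeddings must be resized -- the interpolation step CPE avoids entirely.
grid = pos_embed.reshape(1, 14, 14, 768).permute(0, 3, 1, 2)   # (1, 768, 14, 14)
grid = F.interpolate(grid, size=(24, 24), mode="bicubic", align_corners=False)
pos_embed_384 = grid.permute(0, 2, 3, 1).reshape(1, 576, 768)  # (1, 576, 768)
print(pos_embed_384.shape)  # torch.Size([1, 576, 768])
```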
How CPE Works
Standard Position Embedding (ViT):
- Fixed position vectors: x = patch_embed + pos_embed[i] for position i.
- Position information is static and resolution-dependent.
Conditional Position Encoding (CPE):
- Reshape tokens back to 2D spatial layout: (B, N, C) → (B, C, H, W).
- Apply a depth-wise Conv2D(C, C, kernel_size=3, padding=1, groups=C).
- Reshape back: (B, C, H, W) → (B, N, C).
- Add the result to the token features: x = x + CPE(x) (see the sketch below).
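These steps translate directly into a small module. Below is a minimal PyTorch sketch in the spirit of CPVT's positional encoding generator (PEG); the class name, dimensions, and interface are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional encoding generator: a depth-wise 3x3 conv over the token grid."""

    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the conv depth-wise: 9 weights per channel.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape
        assert N == H * W, "token count must match the spatial grid"
        feat = x.transpose(1, 2).reshape(B, C, H, W)        # (B, N, C) -> (B, C, H, W)
        x = x + self.proj(feat).flatten(2).transpose(1, 2)  # x = x + CPE(x)
        return x

# The same trained weights serve any grid: 14x14 at training, 24x24 at inference.
peg = PEG(dim=768)
print(peg(torch.randn(2, 14 * 14, 768), 14, 14).shape)  # torch.Size([2, 196, 768])
print(peg(torch.randn(2, 24 * 24, 768), 24, 24).shape)  # torch.Size([2, 576, 768])
```

Because the convolution is defined over whatever grid it receives, no interpolation or retraining is needed when the resolution changes, as the two calls above show.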
Key Insight: The convolution's receptive field provides local spatial context, and the zero-padding at boundaries provides absolute position awareness — together, they give the transformer both relative and absolute spatial information without explicit position embeddings.
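The zero-padding effect is easy to verify. The toy check below (not from the paper; a single-channel conv for readability) feeds a constant image through a zero-padded 3×3 convolution and shows that border outputs differ from interior ones:

```python
import torch
import torch.nn as nn

# Constant input through a zero-padded conv: interior outputs are identical,
# but edge and corner outputs differ because padded zeros enter the kernel's
# receptive field -- this asymmetry is the implicit absolute-position signal.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
out = conv(torch.ones(1, 1, 5, 5))
print(out[0, 0])  # border values differ from the constant interior value
```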
CPE vs. Other Position Encodings
| Method | Resolution Flexible | Content Adaptive | Parameters | Implementation |
|--------|-------------------|-----------------|-----------|---------------|
| Learned Absolute | No (need interpolation) | No | N × D | Lookup table |
| Sinusoidal | Partially | No | 0 | Mathematical |
| Relative Bias | Yes (within windows) | No | (2M-1)² | Bias table |
| RoPE | Yes | No | 0 | Rotation |
| CPE | Yes (fully) | Yes | 9C (3×3 DW conv) | Convolution |
CPE Design Choices
- Kernel Size: 3×3 is standard, providing sufficient spatial context with minimal parameters; larger kernels (5×5, 7×7) widen the context but with diminishing returns.
- Depth-Wise Convolution: Uses groups=C for efficiency — each channel processes its spatial neighborhood independently, minimizing parameter count and compute.
- Placement: Can be inserted after the first transformer block, once per stage, or in every block. CPVT found that a single PEG placed after the first encoder block works well, while PVTv2-style designs instead fold the convolution into every block's feed-forward layer (see the sketch after this list).
- With vs. Without Absolute Position: Some architectures combine CPE with absolute position embeddings for best performance at the training resolution while maintaining flexibility at other resolutions.
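As one concrete placement, here is a hedged sketch of an encoder that injects a single PEG after its first block, the arrangement CPVT reported as effective. It reuses the PEG class from the earlier sketch, and the stock nn.TransformerEncoderLayer stands in for whatever attention block an architecture actually uses:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy encoder with one PEG injected after the first block (CPVT-style)."""

    def __init__(self, dim: int, depth: int, num_heads: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.peg = PEG(dim)  # the PEG class from the sketch above

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i == 0:  # inject position information once, early in the stack
                x = self.peg(x, H, W)
        return x

enc = Encoder(dim=768, depth=4, num_heads=12)
print(enc(torch.randn(2, 196, 768), 14, 14).shape)  # torch.Size([2, 196, 768])
```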
Architectures Using CPE
- CPVT: Original CPE paper — replaces absolute position embeddings entirely with conv-based conditional encoding.
- PVTv2: Pyramid Vision Transformer v2 drops fixed position embeddings in its hierarchical multi-scale architecture, injecting position through a 3×3 depth-wise convolution inside each feed-forward layer.
- Twins: Both Twins-PCPVT and Twins-SVT adopt CPVT's conditional position encoding, pairing it with pyramid spatial-reduction attention and spatially separable (local-window plus global sub-sampled) attention, respectively.
- SegFormer: Its Mix-FFN places a 3×3 convolution inside each feed-forward layer so that zero padding leaks position information, removing explicit position encodings.
Conditional position encoding is among the most flexible position encoding methods for real-world Vision Transformer deployment: by generating spatial information dynamically from the input itself, CPE frees transformers from the fixed-resolution training constraint and enables seamless deployment across diverse image sizes and aspect ratios.