Patch merging

Keywords: patch merging in vit, computer vision

Patch merging is the downsampling operation in hierarchical Vision Transformers that combines neighboring patches into larger, deeper feature representations, reintroducing the multi-scale pyramid structure of CNNs into transformer architectures. By progressively reducing spatial resolution while increasing channel depth, it enables efficient processing of high-resolution images.

What Is Patch Merging?

- Definition: A spatial downsampling operation that groups adjacent patches (typically 2×2 neighborhoods) and concatenates their feature vectors, then applies a linear projection to produce a merged representation with reduced spatial dimensions and increased channel depth.
- Swin Transformer: Patch merging was introduced as a core component of the Swin Transformer (Liu et al., 2021), creating a four-stage hierarchical architecture analogous to CNN feature pyramids (e.g., ResNet stages).
- Operation: Given feature maps of shape (H×W, C), group 2×2 adjacent tokens → concatenate to get (H/2 × W/2, 4C) → linear project to (H/2 × W/2, 2C).
- Multi-Scale Features: Each merging stage halves the spatial resolution and doubles the channel depth, creating feature maps at 1/4, 1/8, 1/16, and 1/32 of the original image resolution.

Why Patch Merging Matters

- Hierarchical Features: Dense prediction tasks (object detection, segmentation) require features at multiple scales. A flat ViT produces only a single-scale feature map, while patch merging enables multi-scale feature pyramids.
- Computational Efficiency: By reducing spatial resolution progressively, self-attention in later stages operates on fewer tokens — a 56×56 feature map (3136 tokens) becomes 7×7 (49 tokens) after three merging stages.
- FPN Compatibility: Hierarchical features from patch merging stages can be directly fed into Feature Pyramid Networks (FPN), enabling ViT backbones to plug into existing detection and segmentation frameworks (Mask R-CNN, Cascade R-CNN).
- CNN Design Wisdom: Decades of CNN research showed that gradually reducing spatial resolution while increasing channel depth is highly effective for visual feature learning; patch merging brings this principle to transformers.
- Resolution Scalability: The multi-scale design naturally handles different input resolutions without modifying the architecture.
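
The efficiency claim can be checked with a few lines of arithmetic. The sketch below assumes a 224×224 input with Swin's 4×4 patch embedding and counts the token pairs that global self-attention would compare at each scale:

```python
# Token counts per stage for a 224x224 input: 4x4 patch embedding
# gives a 56x56 grid, then one 2x2 patch merge between consecutive stages.
patch_side = 224 // 4                            # 56 tokens per side
sides = [patch_side // (2 ** s) for s in range(4)]
tokens = [side * side for side in sides]
for stage, (side, n) in enumerate(zip(sides, tokens), start=1):
    # self-attention cost scales quadratically in the token count
    print(f"stage {stage}: {side}x{side} = {n} tokens, "
          f"global attention ~ {n * n:,} token pairs")
```

Between the first and last stage the token count drops 64×, so the quadratic attention cost drops by roughly 4096×.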

Patch Merging Mechanism

Step 1 — Spatial Grouping:
- From the 2D token grid, partition tokens into non-overlapping 2×2 neighborhoods: the block at output position (i, j) contains the tokens at (2i, 2j), (2i+1, 2j), (2i, 2j+1), and (2i+1, 2j+1).

Step 2 — Concatenation:
- Concatenate the four tokens' feature vectors along the channel dimension.
- Result: 4 vectors of dim C → 1 vector of dim 4C.

Step 3 — Linear Projection:
- Apply a linear layer: Linear(4C, 2C) to reduce the concatenated dimension.
- This learned projection determines how to combine the four patches' information; in Swin, a LayerNorm is applied to the 4C-dimensional vector before the projection.

Step 4 — Output:
- Spatial resolution halved in both dimensions: (H/2, W/2).
- Channel dimension doubled: 2C.
- Total token count reduced by 4×.
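
The four steps can be sketched in plain NumPy. This is a minimal illustration rather than the Swin reference code: `w_proj` stands in for the learned projection and is random here, and the LayerNorm that Swin applies before the projection is omitted.

```python
import numpy as np

def patch_merge(x, w_proj):
    """Merge 2x2 neighborhoods: (H, W, C) -> (H/2, W/2, 2C).

    x      : feature map of shape (H, W, C), H and W even
    w_proj : projection matrix of shape (4C, 2C), learned in practice
    """
    H, W, C = x.shape
    # Step 1: split the grid into non-overlapping 2x2 blocks (stride 2)
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    # Step 2: concatenate the four tokens of each block along channels -> 4C
    x = x.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    # Step 3: linear projection 4C -> 2C
    return x @ w_proj

rng = np.random.default_rng(0)
x = rng.standard_normal((56, 56, 96))            # stage-1 feature map, C=96
w = rng.standard_normal((4 * 96, 2 * 96)) * 0.02
y = patch_merge(x, w)
print(y.shape)   # (28, 28, 192): resolution halved, channels doubled
```

The reshape/transpose pair fixes one concatenation order for the four tokens; any fixed order works, since the projection can learn to compensate.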

Swin Transformer Stages with Patch Merging

Token and channel counts assume a 224×224 input with 4×4 patch embedding (Swin-T); each patch-merging layer sits between consecutive stages, so its output matches the next stage's input.

| Stage | Resolution | Tokens | Channels | Window Size |
|-------|-----------|--------|----------|-------------|
| Stage 1 | H/4 × W/4 | 3136 | 96 | 7×7 |
| Merge 1 | H/8 × W/8 | 784 | 192 | n/a |
| Stage 2 | H/8 × W/8 | 784 | 192 | 7×7 |
| Merge 2 | H/16 × W/16 | 196 | 384 | n/a |
| Stage 3 | H/16 × W/16 | 196 | 384 | 7×7 |
| Merge 3 | H/32 × W/32 | 49 | 768 | n/a |
| Stage 4 | H/32 × W/32 | 49 | 768 | 7×7 |
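
The stage values follow directly from the halve-and-double rule. A quick sketch, assuming a 224×224 input and Swin-T's base channel count C=96:

```python
# Reproduce the stage table for a 224x224 input: 4x4 patch embedding
# gives a 56x56 grid with 96 channels; each 2x2 merge halves the grid
# side and doubles the channel count.
side, chans = 224 // 4, 96
rows = []
for stage in range(1, 5):
    rows.append((side * side, chans))
    print(f"Stage {stage}: {side}x{side} grid, {side * side} tokens, "
          f"{chans} channels")
    side, chans = side // 2, chans * 2
```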

Patch Merging Variants

- Standard (Swin): 2×2 concatenation + linear projection (most common).
- Convolutional Merging: Use a strided convolution (stride=2, kernel=2) instead of concatenation + linear — provides similar effect with slightly different learned features.
- Adaptive Merging: Token merging based on similarity rather than fixed spatial grouping (used in ToMe — Token Merging for efficient ViTs).
- Hierarchical ViT: PVT (Pyramid Vision Transformer) uses spatial reduction attention instead of explicit patch merging.
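
The relationship between the standard and convolutional variants can be verified numerically: a stride-2, kernel-2 convolution computes the same linear map as 2×2 concatenation followed by projection with the flattened kernel. A NumPy sketch with arbitrary small shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C, C_out = 8, 8, 4, 8
x = rng.standard_normal((H, W, C))
k = rng.standard_normal((2, 2, C, C_out))        # conv kernel (kh, kw, Cin, Cout)

# (a) literal 2x2 convolution with stride 2, no padding
conv = np.zeros((H // 2, W // 2, C_out))
for i in range(H // 2):
    for j in range(W // 2):
        patch = x[2*i:2*i+2, 2*j:2*j+2, :]       # one 2x2 neighborhood
        conv[i, j] = np.tensordot(patch, k, axes=([0, 1, 2], [0, 1, 2]))

# (b) concatenate-then-project, reusing the flattened kernel as the weights
w = k.reshape(4 * C, C_out)                      # (kh*kw*Cin, Cout)
merged = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
merged = merged.reshape(H // 2, W // 2, 4 * C) @ w

print(np.allclose(conv, merged))  # True: same linear map, different packaging
```

In practice the two variants still learn different features, since initialization, normalization placement, and surrounding layers differ between implementations.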

Patch merging is the architectural bridge between flat transformers and multi-scale CNNs — by progressively reducing spatial resolution and building hierarchical features, it enables Vision Transformers to excel at dense prediction tasks that require understanding images at multiple scales simultaneously.
