Patch merging is the downsampling operation in hierarchical Vision Transformers that combines neighboring patches into larger, deeper feature representations. It reintroduces the multi-scale pyramid structure of CNNs into transformer architectures, progressively reducing spatial resolution while increasing channel depth so that high-resolution images can be processed efficiently.
What Is Patch Merging?
- Definition: A spatial downsampling operation that groups adjacent patches (typically 2×2 neighborhoods) and concatenates their feature vectors, then applies a linear projection to produce a merged representation with reduced spatial dimensions and increased channel depth.
- Swin Transformer: Patch merging was introduced as a core component of the Swin Transformer (Liu et al., 2021), creating a four-stage hierarchical architecture analogous to CNN feature pyramids (e.g., ResNet stages).
- Operation: Given feature maps of shape (H×W, C), group 2×2 adjacent tokens → concatenate to get (H/2 × W/2, 4C) → apply a linear projection to get (H/2 × W/2, 2C) (see the sketch after this list).
- Multi-Scale Features: Each merging stage halves the spatial resolution and doubles the channel depth, creating feature maps at 1/4, 1/8, 1/16, and 1/32 of the original image resolution.
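As a concrete reference, here is a minimal PyTorch sketch of the operation, modeled on the Swin-style formulation (the norm-before-projection ordering follows the reference design; interface details such as passing H and W explicitly are an assumption of this sketch):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 token neighborhood: (B, H*W, C) -> (B, H/2 * W/2, 2C)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)                         # normalize concatenated features
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)  # learned 4C -> 2C projection

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, L, C = x.shape
        assert L == H * W and H % 2 == 0 and W % 2 == 0
        x = x.view(B, H, W, C)
        # Pick out the four members of every 2x2 neighborhood.
        x0 = x[:, 0::2, 0::2, :]  # top-left
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        x = x.view(B, (H // 2) * (W // 2), 4 * C)  # flatten back to a token sequence
        return self.reduction(self.norm(x))        # (B, H/2 * W/2, 2C)
```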
Why Patch Merging Matters
- Hierarchical Features: Dense prediction tasks (object detection, segmentation) require features at multiple scales; a plain, non-hierarchical ViT produces only single-scale features, while patch merging enables multi-scale feature pyramids.
- Computational Efficiency: By reducing spatial resolution progressively, self-attention in later stages operates on far fewer tokens; a 56×56 feature map (3136 tokens) becomes 7×7 (49 tokens) after three merging stages (a quick cost check follows this list).
- FPN Compatibility: Hierarchical features from patch merging stages can be directly fed into Feature Pyramid Networks (FPN), enabling ViT backbones to plug into existing detection and segmentation frameworks (Mask R-CNN, Cascade R-CNN).
- CNN Design Wisdom: Decades of CNN research established gradual spatial reduction with increasing channel depth as an effective recipe for visual feature learning; patch merging brings this principle to transformers.
- Resolution Scalability: The multi-scale design naturally handles different input resolutions without modifying the architecture.
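A quick back-of-the-envelope check of the efficiency bullet above (plain Python; assumes a 224×224 input so the stage-1 grid is 56×56, and measures global self-attention cost, which scales with the square of the token count):

```python
# Relative global self-attention cost (proportional to tokens^2) per stage.
side = 56                      # stage-1 grid after 4x4 patch embedding of a 224x224 image
base_cost = (side * side) ** 2
for stage in range(1, 5):
    tokens = side * side
    rel = (tokens ** 2) / base_cost
    print(f"stage {stage}: {side}x{side} grid, {tokens:>4} tokens, "
          f"attention cost {rel:.4%} of stage 1")
    side //= 2                 # one patch-merging layer halves each spatial side
```

Each merge cuts global-attention cost by roughly 16× (token count drops 4×, cost is quadratic); Swin additionally limits attention to local windows, but the token-count reduction benefits any attention scheme.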
Patch Merging Mechanism
Step 1 — Spatial Grouping:
- From the 2D token grid, select tokens at positions (i, j), (i+1, j), (i, j+1), (i+1, j+1) forming a 2×2 neighborhood.
Step 2 — Concatenation:
- Concatenate the four tokens' feature vectors along the channel dimension.
- Result: 4 vectors of dim C → 1 vector of dim 4C.
Step 3 — Linear Projection:
- Apply a linear layer: Linear(4C, 2C) to reduce the concatenated dimension.
- This learned projection determines how best to combine the four patches' information (traced in code after these steps).
Step 4 — Output:
- Spatial resolution halved in both dimensions: (H/2, W/2).
- Channel dimension doubled: 2C.
- Total token count reduced by 4×.
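Tracing the four steps functionally (a standalone sketch; the Linear layer here is randomly initialized, standing in for the trained projection):

```python
import torch
import torch.nn as nn

B, H, W, C = 1, 56, 56, 96
x = torch.randn(B, H, W, C)                           # token grid before merging

# Step 1 -- spatial grouping: the four members of each 2x2 neighborhood.
x0, x1 = x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :]
x2, x3 = x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]

# Step 2 -- concatenation along the channel dimension: C -> 4C.
merged = torch.cat([x0, x1, x2, x3], dim=-1)
print(merged.shape)                                   # torch.Size([1, 28, 28, 384])

# Step 3 -- linear projection: 4C -> 2C.
proj = nn.Linear(4 * C, 2 * C, bias=False)
out = proj(merged)

# Step 4 -- output: half the resolution, double the channels, 4x fewer tokens.
print(out.shape)                                      # torch.Size([1, 28, 28, 192])
```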
Swin Transformer Stages with Patch Merging
Token counts below assume a 224×224 input with Swin-T channel widths (C = 96 at stage 1).
| Stage | Resolution | Tokens | Channels | Window Size |
|-------|------------|--------|----------|-------------|
| Stage 1 | H/4 × W/4 (56×56) | 3136 | 96 | 7×7 |
| Merge 1 | H/8 × W/8 (28×28) | 784 | 192 | – |
| Stage 2 | H/8 × W/8 (28×28) | 784 | 192 | 7×7 |
| Merge 2 | H/16 × W/16 (14×14) | 196 | 384 | – |
| Stage 3 | H/16 × W/16 (14×14) | 196 | 384 | 7×7 |
| Merge 3 | H/32 × W/32 (7×7) | 49 | 768 | – |
| Stage 4 | H/32 × W/32 (7×7) | 49 | 768 | 7×7 |
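The table's progression is pure arithmetic and can be reproduced in a few lines (again assuming the 224×224 Swin-T configuration):

```python
# Reproduce the stage table for a 224x224 input with Swin-T widths.
h = w = 224 // 4   # stage-1 grid after 4x4 patch embedding
c = 96
for stage in range(1, 5):
    print(f"Stage {stage}: {h}x{w} = {h * w:>4} tokens, {c} channels")
    h, w, c = h // 2, w // 2, c * 2   # effect of one patch-merging layer
```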
Patch Merging Variants
- Standard (Swin): 2×2 concatenation + linear projection (most common).
- Convolutional Merging: A strided convolution (kernel=2, stride=2) in place of concatenation + linear; a 2×2 stride-2 convolution computes the same linear map over each neighborhood, just parameterized as a convolution and without the normalization (sketched after this list).
- Adaptive Merging: Token merging based on similarity rather than fixed spatial grouping (used in ToMe — Token Merging for efficient ViTs).
- Strided Patch Embedding (PVT): The Pyramid Vision Transformer builds its hierarchy with strided convolutional patch embeddings between stages and uses spatial-reduction attention to shrink keys and values, rather than Swin-style 2×2 patch merging.
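A minimal sketch of the convolutional variant above (channel-first layout; the name conv_merge is illustrative):

```python
import torch
import torch.nn as nn

C = 96
conv_merge = nn.Conv2d(C, 2 * C, kernel_size=2, stride=2, bias=False)

x = torch.randn(1, C, 56, 56)   # (B, C, H, W) feature map
y = conv_merge(x)
print(y.shape)                  # torch.Size([1, 192, 28, 28]): 4x fewer positions and
                                # 2x channels, matching standard patch merging
```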
Patch merging is the architectural bridge between flat transformers and multi-scale CNNs — by progressively reducing spatial resolution and building hierarchical features, it enables Vision Transformers to excel at dense prediction tasks that require understanding images at multiple scales simultaneously.