Patch Merging

Patch merging is the downsampling operation in hierarchical Vision Transformers. It combines neighboring patches into fewer tokens with more channels, reintroducing the multi-scale pyramid structure of CNNs into transformer architectures. Spatial resolution falls progressively while feature channel depth grows, which keeps high-resolution images efficient to process.

What Is Patch Merging?

Why Patch Merging Matters

Patch Merging Mechanism

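Step 1 — Spatial Grouping: The H × W grid of tokens is divided into non-overlapping 2×2 groups of neighboring tokens.

Step 2 — Concatenation: The four C-channel tokens in each group are concatenated along the channel dimension, giving one 4C-channel vector per group.

Step 3 — Linear Projection: A linear layer projects each 4C-channel vector down to 2C channels.

Step 4 — Output: The result is an H/2 × W/2 grid with 2C channels, so the token count falls by 4× while the channel depth doubles.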
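The four steps can be sketched in plain Python on nested lists. This is a minimal illustration rather than a reference implementation: the function name `patch_merge`, the bias-free `weight` matrix, and the order in which the four neighbors are concatenated are assumptions made for the example, and the normalization that Swin-style implementations typically apply before the projection is left out.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def patch_merge(x: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """Merge each 2x2 block of tokens into a single token.

    x:      feature map of shape H x W x C (H and W must be even)
    weight: hypothetical projection matrix of shape 4C x 2C, no bias
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    if h % 2 or w % 2:
        raise ValueError("H and W must both be even")
    if len(weight) != 4 * c:
        raise ValueError("weight must have 4C rows")
    out_dim = len(weight[0])

    merged: FeatureMap = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # Steps 1-2: take the 2x2 neighborhood and concatenate the four
            # tokens along the channel axis, giving a vector of length 4C.
            # Assumed order: top-left, bottom-left, top-right, bottom-right.
            cat = x[i][j] + x[i + 1][j] + x[i][j + 1] + x[i + 1][j + 1]
            # Step 3: linear projection from 4C to out_dim (2C here).
            proj = [
                sum(cat[k] * weight[k][o] for k in range(4 * c))
                for o in range(out_dim)
            ]
            row.append(proj)
        merged.append(row)
    # Step 4: the output has shape H/2 x W/2 x out_dim.
    return merged


# Toy input: a 4 x 4 grid with C = 2, each token filled with its flat index.
H, W, C = 4, 4, 2
x = [[[float(r * W + col)] * C for col in range(W)] for r in range(H)]
# Toy weight that averages all 4C inputs into each of the 2C outputs.
weight = [[1.0 / (4 * C)] * (2 * C) for _ in range(4 * C)]
y = patch_merge(x, weight)
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 4
```

With the averaging weight, each output token is simply the mean of its 2×2 neighborhood. A trained layer would learn the projection instead.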

Swin Transformer Stages with Patch Merging

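| Stage   | Resolution  | Tokens (224×224 input) | Channels | Window Size |
|---------|-------------|------------------------|----------|-------------|
| Stage 1 | H/4 × W/4   | 3136                   | 96       | 7×7         |
| Merge 1 | H/8 × W/8   | 784                    | 192      | 7×7         |
| Stage 2 | H/8 × W/8   | 784                    | 192      | 7×7         |
| Merge 2 | H/16 × W/16 | 196                    | 384      | 7×7         |
| Stage 3 | H/16 × W/16 | 196                    | 384      | 7×7         |
| Merge 3 | H/32 × W/32 | 49                     | 768      | 7×7         |
| Stage 4 | H/32 × W/32 | 49                     | 768      | 7×7         |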
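The token and channel counts in the table follow from simple arithmetic: each merge halves the side length of the token grid and doubles the channel count. The sketch below reproduces them under three assumptions that the table implies but does not state: a square 224×224 input, an initial 4×4 patch embedding, and 96 channels in the first stage. The function name `swin_stage_shapes` is hypothetical.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (stage name, grid side, token count, channels) for each stage."""
    side = image_size // patch_size  # stage 1 works on an H/4 x W/4 grid
    channels = embed_dim
    rows = []
    for stage in range(1, num_stages + 1):
        rows.append((f"Stage {stage}", side, side * side, channels))
        if stage < num_stages:
            # Patch merging between stages: half the side, double the channels.
            side //= 2
            channels *= 2
    return rows


for name, side, tokens, channels in swin_stage_shapes():
    print(f"{name}: {side}x{side} grid, {tokens} tokens, {channels} channels")
# Stage 1: 56x56 grid, 3136 tokens, 96 channels
# Stage 2: 28x28 grid, 784 tokens, 192 channels
# Stage 3: 14x14 grid, 196 tokens, 384 channels
# Stage 4: 7x7 grid, 49 tokens, 768 channels
```

Note that at the final stage the 7×7 token grid is the same size as a single 7×7 window.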

Patch Merging Variants

Patch merging is the architectural bridge between flat transformers and multi-scale CNNs. By progressively reducing spatial resolution and building hierarchical features, it lets Vision Transformers handle dense prediction tasks that require understanding an image at several scales at once.

