Shifted window attention is the cross-window communication mechanism in the Swin Transformer. By shifting the window partition grid by half a window size between consecutive transformer layers, it lets information flow across window boundaries while keeping the computational efficiency of local window attention, effectively providing global context through alternating local computations.
What Is Shifted Window Attention?
- Definition: A technique where the spatial partitioning of attention windows is offset by (⌊M/2⌋, ⌊M/2⌋) positions on the token grid between consecutive transformer layers, where M is the window size, so that tokens at the boundary of one layer's windows fall in the interior of the next layer's windows, enabling cross-boundary information exchange.
- Swin Transformer Core: The defining innovation of the Swin Transformer (Liu et al., 2021, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows") that solves the isolation problem of non-overlapping window attention.
- Alternating Pattern: Layer L uses regular window partition. Layer L+1 shifts the partition by (⌊M/2⌋, ⌊M/2⌋). Layer L+2 returns to regular partition. This alternation continues through all layers.
- Effective Global Receptive Field: After just a few alternating layers, information can propagate across the entire image through successive cross-window connections.
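To make the alternation concrete, here is a minimal PyTorch-style sketch (the `(B, H, W, C)` tensor layout and the helper name `shift_for_block` are illustrative assumptions, not Swin's actual API):

```python
import torch

def shift_for_block(x, block_index, window_size):
    """Alternation rule (hypothetical helper): even blocks keep the regular
    window grid, odd blocks shift the grid by half a window.
    x: feature map of shape (B, H, W, C)."""
    shift = 0 if block_index % 2 == 0 else window_size // 2
    if shift > 0:
        # Cyclic shift toward the top-left; windows taken from the shifted
        # map then straddle the boundaries of the previous layer's windows.
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return x, shift
```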
Why Shifted Window Attention Matters
- Breaks Window Isolation: Without shifting, tokens in different windows can never interact — shifted windows create bridges between previously isolated regions.
- Maintains Linear Complexity: The shift itself has negligible computational cost — it only changes how tokens are grouped into windows and adds no extra attention computation (see the rough cost comparison after this list).
- Comparable to Cross-Window Attention: Alternating regular and shifted windows achieves information flow similar to overlapping windows or explicit cross-window attention, but every window keeps the same size, so attention can be batched efficiently with lower implementation complexity.
- Hierarchical Global Context: Combined with patch merging (spatial downsampling), shifted windows enable global context to emerge naturally — early layers handle local features, later layers (with reduced spatial resolution) handle global relationships.
- SOTA Performance: At release, Swin Transformer with shifted window attention set new state-of-the-art results on COCO object detection and ADE20K semantic segmentation, alongside highly competitive ImageNet classification accuracy.
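A rough cost comparison, using the per-layer attention complexity estimates given in the Swin paper (h×w tokens, channel dimension C, window size M); the function below is a back-of-the-envelope sketch, not a profiler:

```python
def attention_flops(h, w, C, M=None):
    """Swin paper complexity estimates for one attention layer:
    global MSA:   4*h*w*C^2 + 2*(h*w)^2*C   (quadratic in token count)
    window W-MSA: 4*h*w*C^2 + 2*M^2*h*w*C   (linear in token count)"""
    proj = 4 * h * w * C * C            # Q/K/V and output projections
    if M is None:
        return proj + 2 * (h * w) ** 2 * C
    return proj + 2 * M * M * h * w * C

# Stage-1 feature map of Swin-T: 56x56 tokens, C=96, window size M=7
print(f"{attention_flops(56, 56, 96) / 1e9:.2f} GFLOPs (global)")
print(f"{attention_flops(56, 56, 96, M=7) / 1e9:.2f} GFLOPs (windowed)")
```

On this stage-1 map the quadratic term pushes global attention to roughly 2 GFLOPs per layer, while the windowed version stays around 0.15 GFLOPs.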
How Shifted Window Attention Works
Regular Window (Layer L):
- Feature map partitioned into non-overlapping M×M windows.
- Example with M=4 on an 8×8 map: 4 windows, each 4×4 = 16 tokens.
- Self-attention computed independently within each window.
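A minimal sketch of the regular partition step in PyTorch (the function name `window_partition` mirrors the reference implementation; shapes assume the common `(B, H, W, C)` layout with H and W divisible by M):

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows,
    returning (num_windows * B, M, M, C); self-attention then runs
    independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

# The 8x8 example with M=4: 4 windows of 4x4 = 16 tokens each
x = torch.randn(1, 8, 8, 96)
print(window_partition(x, 4).shape)  # torch.Size([4, 4, 4, 96])
```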
Shifted Window (Layer L+1):
- Window grid shifted by (⌊M/2⌋, ⌊M/2⌋) = (2, 2) token positions.
- New window boundaries now cross the centers of the previous windows.
- Tokens that were at the edges of regular windows are now in the middle of shifted windows.
- Cross-boundary information flows naturally through attention within the new windows.
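A toy illustration of that cross-boundary effect on the 8×8, M=4 example (this computes window indices directly and ignores the border handling that the cyclic-shift trick below takes care of):

```python
import torch

H = W = 8; M = 4; shift = M // 2

def window_ids(shift):
    """Window index of every token on the H x W grid when the M x M
    partition grid is offset by `shift` in both directions."""
    r = (torch.arange(H) + shift) // M
    c = (torch.arange(W) + shift) // M
    n_cols = int(c.max()) + 1
    return r[:, None] * n_cols + c[None, :]

regular, shifted = window_ids(0), window_ids(shift)
# Tokens (0, 3) and (0, 4) sit in different regular windows...
print(regular[0, 3].item(), regular[0, 4].item())  # 0 1
# ...but share a window after the shift, so they can attend to each
# other in the next layer.
print(shifted[0, 3].item(), shifted[0, 4].item())  # 1 1
```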
Efficient Masking Implementation:
- Naive shifting increases the window count and leaves smaller, irregular windows at the image borders.
- Cyclic Shift: Instead of padding, the feature map is cyclically shifted, creating full-size windows everywhere.
- Attention Mask: A mask prevents tokens from different original spatial regions from attending to each other within the same shifted window.
- This approach maintains a uniform window count and avoids padding overhead.
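A sketch of the cyclic-shift-plus-mask trick, following the masking pattern used in the public Swin reference implementation (names and shapes here are simplified for illustration, and `shifted_window_mask` is this sketch's own helper):

```python
import torch

def window_partition(x, M):
    # (B, H, W, C) -> (num_windows * B, M, M, C), as in the earlier sketch
    B, H, W, C = x.shape
    return (x.view(B, H // M, M, W // M, M, C)
             .permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C))

def shifted_window_mask(H, W, M, shift):
    """Build the attention mask used after a cyclic shift: tokens that came
    from different regions of the un-shifted map may end up in the same
    window, and the mask stops them from attending to each other."""
    img_mask = torch.zeros(1, H, W, 1)
    regions = (slice(0, -M), slice(-M, -shift), slice(-shift, None))
    region_id = 0
    for hs in regions:
        for ws in regions:
            img_mask[:, hs, ws, :] = region_id  # label each original region
            region_id += 1
    mask_windows = window_partition(img_mask, M).view(-1, M * M)
    diff = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    # Large negative logits -> ~zero attention weight after softmax
    return diff.masked_fill(diff != 0, -100.0).masked_fill(diff == 0, 0.0)

# The shift itself is just a roll of the feature map: no padding, no extra windows
x = torch.randn(1, 8, 8, 96)
shifted_x = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
print(shifted_window_mask(H=8, W=8, M=4, shift=2).shape)  # (4, 16, 16)
```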
Information Flow Example
| Layer | Window Config | Cross-Window Info |
|-------|-------------|-------------------|
| Layer 1 | Regular windows | None — isolated |
| Layer 2 | Shifted windows | Adjacent windows connected |
| Layer 3 | Regular windows | 2-hop connections form |
| Layer 4 | Shifted windows | 3-hop connections — near global |
| Layer 5+ | Alternating | Effectively global receptive field |
Swin Transformer Architecture
| Stage | Layers | Window | Resolution (tokens) | Shifted Layers |
|-------|--------|--------|---------------------|----------------|
| Stage 1 | 2 | 7×7 | 56×56 | 1 |
| Stage 2 | 2 | 7×7 | 28×28 | 1 |
| Stage 3 | 6-18 (by variant) | 7×7 | 14×14 | 3-9 |
| Stage 4 | 2 | 7×7 | 7×7 | 1 (window spans the whole map, so attention is already global) |
Resolutions assume a 224×224 input with 4×4 patch embedding.
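For reference, the per-variant widths and depths from the Swin paper, as a plain config sketch (not tied to any particular library; the dictionary layout is this sketch's own):

```python
# Per the Swin paper: embedding dim C, blocks per stage, heads per stage.
# Every stage alternates regular (W-MSA) and shifted (SW-MSA) blocks,
# with a 7x7 window throughout.
SWIN_VARIANTS = {
    "Swin-T": dict(embed_dim=96,  depths=(2, 2, 6, 2),  heads=(3, 6, 12, 24)),
    "Swin-S": dict(embed_dim=96,  depths=(2, 2, 18, 2), heads=(3, 6, 12, 24)),
    "Swin-B": dict(embed_dim=128, depths=(2, 2, 18, 2), heads=(4, 8, 16, 32)),
    "Swin-L": dict(embed_dim=192, depths=(2, 2, 18, 2), heads=(6, 12, 24, 48)),
}

for name, cfg in SWIN_VARIANTS.items():
    shifted = sum(d // 2 for d in cfg["depths"])  # one shifted block per pair
    print(f"{name}: {sum(cfg['depths'])} blocks, {shifted} shifted")
```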
Performance Impact
| Model | Attention Type | Input | ImageNet Top-1 | FLOPs |
|-------|---------------|-------|----------------|-------|
| ViT-B/16 | Global | 384² | 77.9% | 55.4G |
| DeiT-B (distilled) | Global + distillation | 224² | 83.4% | 17.6G |
| Swin-B | Shifted window | 224² | 83.5% | 15.4G |
| Swin-L (ImageNet-22K pretrain) | Shifted window | 384² | 87.3% | 103.9G |
All models are trained on ImageNet-1K except where noted; numbers follow the Swin and DeiT papers.
Shifted window attention is an elegant solution to the tradeoff between locality and efficiency in Vision Transformers: by simply alternating window positions between layers, Swin Transformer achieves global information flow from purely local computations, showing that careful architecture design can be more powerful than brute-force compute.