Local window attention

Keywords: local window attention, computer vision

Local window attention is a computational efficiency strategy that restricts self-attention to small, fixed-size local windows rather than the full image — reducing the cost of standard global self-attention from quadratic O(N²) to linear O(N) in the number of tokens, and making transformer processing of high-resolution images computationally feasible.

What Is Local Window Attention?

- Definition: A modified self-attention mechanism where each token only attends to other tokens within the same fixed-size spatial window (typically 7×7 or 8×8 tokens), rather than attending to every token in the entire image.
- Swin Transformer: Introduced as the core attention mechanism in the Swin Transformer (Liu et al., 2021), replacing global self-attention with window-based attention partitioned into non-overlapping local regions.
- Complexity Reduction: For an image with N patches, global attention costs O(N²) — for a 56×56 feature map (3136 tokens), that's ~9.8 million attention computations. Window attention with 7×7 windows costs O((N/49) × 49²) = O(49N), which is linear in N (see the formulas after this list).
- Locality Principle: In natural images, nearby pixels are more correlated than distant pixels — local attention captures the most informative relationships while discarding less useful long-range computations.
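Using the notation of the Swin Transformer paper (an h×w grid of tokens, channel dimension C, window size M), the two costs are usually written as:

```latex
% Cost of global multi-head self-attention (MSA) vs. window-based attention
% (W-MSA) on an h x w token grid with C channels and M x M windows
% (following Liu et al., 2021):
\Omega(\text{MSA})   = 4hwC^{2} + 2(hw)^{2}C
\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC
```

The first term (the QKV and output projections) is identical in both; only the attention term differs, dropping from quadratic to linear in the number of tokens hw.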

Why Local Window Attention Matters

- High-Resolution Processing: Global self-attention is impractical for high-resolution images — a 1024×1024 image with 4×4 patches produces 65,536 tokens, making O(N²) attention (~4.3 billion operations) infeasible. Window attention reduces this to manageable levels.
- Linear Scaling: Compute cost scales linearly with image resolution instead of quadratically, enabling ViTs to process much higher resolutions without a compute explosion.
- Dense Prediction Tasks: Object detection and segmentation require high-resolution feature maps — window attention makes transformer backbones practical for these tasks.
- Memory Efficiency: Memory usage also scales linearly instead of quadratically, enabling larger batch sizes and higher resolution training on the same hardware.
- Competitive Performance: Despite limiting attention scope, window-based transformers achieve state-of-the-art performance by combining local attention with cross-window information exchange mechanisms.

How Local Window Attention Works

Step 1 — Window Partition:
- Divide the H×W feature map into non-overlapping windows of size M×M (typically M=7).
- For a 56×56 feature map with M=7: 8×8 = 64 windows, each containing 49 tokens.
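A minimal PyTorch sketch of this partition step, assuming the (B, H, W, C) tensor layout used by Swin-style implementations (the function name and shapes here are illustrative):

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.

    Returns (num_windows * B, window_size, window_size, C); each window is
    handled independently by the attention step that follows.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, window_size, window_size, C)

# A 56x56 feature map with M = 7 yields 8 x 8 = 64 windows of 49 tokens each
feat = torch.randn(1, 56, 56, 96)
print(window_partition(feat, 7).shape)  # torch.Size([64, 7, 7, 96])
```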

Step 2 — Independent Attention:
- Compute standard multi-head self-attention independently within each window.
- Each token attends to all M² tokens in its window.
- Cost per window: O(M⁴) in FLOPs.
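A sketch of the per-window attention; to keep it short, the input tokens double as queries, keys, and values rather than going through learned QKV projections and a relative position bias, so it illustrates the M² × M² cost structure rather than a full Swin block:

```python
import torch

def window_self_attention(windows, num_heads):
    """Scaled dot-product self-attention computed independently inside each window.

    windows: (num_windows*B, M, M, C). Every token attends only to the M*M
    tokens of its own window, so the score matrix per window is M^2 x M^2.
    """
    nW, M, _, C = windows.shape
    head_dim = C // num_heads
    tokens = windows.reshape(nW, M * M, num_heads, head_dim).transpose(1, 2)  # (nW, heads, M^2, d)
    scores = tokens @ tokens.transpose(-2, -1) / head_dim ** 0.5              # (nW, heads, M^2, M^2)
    attn = scores.softmax(dim=-1)
    out = (attn @ tokens).transpose(1, 2).reshape(nW, M * M, C)
    return out.reshape(nW, M, M, C)

# 64 windows of 7x7 tokens, 96 channels, 3 heads
out = window_self_attention(torch.randn(64, 7, 7, 96), num_heads=3)
print(out.shape)  # torch.Size([64, 7, 7, 96])
```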

Step 3 — Output Assembly:
- Reassemble the independently processed windows back into the full feature map.
- No information crosses window boundaries in this step.
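The reassembly is the exact inverse of the partition step; a matching sketch under the same layout assumptions:

```python
import torch

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: stitch (num_windows*B, M, M, C) windows
    back into the full (B, H, W, C) feature map in their original positions."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B, H, W, -1)

# 64 windows of 7x7 tokens reassemble into one 56x56 map
wins = torch.randn(64, 7, 7, 96)
print(window_reverse(wins, 7, 56, 56).shape)  # torch.Size([1, 56, 56, 96])
```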

Complexity Comparison

| Attention Type | Complexity | 56×56 Feature Map | 112×112 Feature Map |
|---------------|-----------|-------------------|---------------------|
| Global | O(N²) | 9.8M ops | 157M ops |
| Window (M=7) | O(M² × N) | 154K ops | 614K ops |
| Speedup | — | 64× | 256× |
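A quick back-of-the-envelope check of the table, counting only query-key score pairs and ignoring the channel dimension and projection costs (a rough sketch, not a FLOP model):

```python
def attention_pairs(h, w, window=None):
    """Count query-key pairs for global vs. window attention on an h x w token grid.

    Global attention scores every token against every token (N^2 pairs);
    window attention scores each token only against the window^2 tokens
    in its own window (window^2 * N pairs).
    """
    n = h * w
    return n * n if window is None else (window * window) * n

print(attention_pairs(56, 56))              # ~9.8M  (global, 56x56 grid)
print(attention_pairs(56, 56, window=7))    # ~154K  (7x7 windows, 56x56 grid)
print(attention_pairs(256, 256))            # ~4.3B  (global, 1024x1024 image with 4x4 patches)
print(attention_pairs(256, 256, window=7))  # ~3.2M  (7x7 windows, same grid)
```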

Limitations and Solutions

- No Cross-Window Communication: Tokens in different windows cannot interact — solved by shifted window attention, which alternates window positions between layers (see the cyclic-shift sketch after this list).
- Fixed Receptive Field: Each layer only sees M×M tokens — stacking multiple layers with shifted windows gradually expands the effective receptive field.
- Window Boundary Artifacts: Objects split across window boundaries may not be properly modeled — shifted windows and overlapping windows mitigate this.
- Global Context Missing: Some tasks require global context that pure local attention cannot provide — hybrid architectures add occasional global attention layers (e.g., every 4th layer).
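As an illustration of the first mitigation, the shifted-window trick reduces to a cyclic roll of the feature map between layers; a minimal sketch (the attention mask that handles tokens wrapped in from the opposite edge is omitted):

```python
import torch

def cyclic_shift(x, shift_size):
    """Roll a (B, H, W, C) feature map so the next window partition straddles
    the previous layer's window boundaries (the SW-MSA step in Swin).
    A complete implementation also masks attention between tokens that the
    roll wraps in from the opposite image edge; that mask is omitted here.
    """
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

feat = torch.randn(1, 56, 56, 96)
shifted = cyclic_shift(feat, shift_size=3)  # shift of M // 2 = 3 for M = 7
print(shifted.shape)                        # torch.Size([1, 56, 56, 96])
```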

Local Window Attention Variants

- Swin Transformer: Non-overlapping windows with shifted window attention for cross-window communication.
- Neighborhood Attention (NAT): Each token attends to a fixed-size neighborhood of tokens centered on it, giving a sliding-window effect without hard window boundaries.
- Dilated Window Attention: Windows with gaps (dilation) to increase receptive field without increasing window size.
- Axial Attention: Factorizes 2D attention into separate row and column attention, giving each token a global receptive field along each axis at far lower cost than full 2D attention.

Local window attention is the key efficiency breakthrough that made Vision Transformers practical for real-world vision tasks — by recognizing that most visual information is local, window attention achieves near-global understanding at a fraction of the computational cost.
