Axial Attention

Axial Attention is a factorized attention strategy that alternates row-wise and column-wise self-attention to cover an entire image without the quadratic cost of full self-attention. By sweeping along the height axis and the width axis in turn, the layer retains full-field context while reducing complexity from O((HW)^2) to O(HW(H+W)) for an H-by-W token grid. This lets Vision Transformers scale to megapixel inputs such as satellite, microscopy, and clinical imagery without exhausting memory.

What Is Axial Attention?

Why Axial Attention Matters

Axis Configurations

Row-then-Column:

Column-then-Row:

Hybrid:

How It Works

Step 1: Project tokens to queries, keys, and values, and reshape them so that each row is an independent sequence of shape (axis_length, channel). Then run the first attention pass along rows, dividing the query-key dot products by sqrt(d_k) and applying softmax with per-row masks.

Step 2: Feed the row outputs into a second pass that attends along columns, optionally including learned relative offsets, and then apply the standard feed-forward module and layer norm.
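The two passes above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not a reference implementation: the helper names (`attend_along_axis`, `init_params`, `axial_attention`), the random weights, and the residual connections around each pass are assumptions made for the example, and masks, relative offsets, the feed-forward module, and layer norm are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_along_axis(x, wq, wk, wv, axis):
    """Single-head self-attention over one spatial axis of x with shape (H, W, C).

    axis=1 attends within each row (along width);
    axis=0 attends within each column (along height).
    """
    if axis == 0:
        x = x.transpose(1, 0, 2)  # (W, H, C): each column becomes a sequence
    q, k, v = x @ wq, x @ wk, x @ wv  # each (batch, axis_length, channels)
    d_k = q.shape[-1]
    # Score tensor is (batch, L, L) per axis, never (H*W, H*W).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    out = softmax(scores, axis=-1) @ v
    if axis == 0:
        out = out.transpose(1, 0, 2)  # back to (H, W, C)
    return out


def init_params(rng, channels, d_k):
    # Hypothetical helper: random projection weights for one attention pass.
    scale = 1.0 / np.sqrt(channels)
    return (
        rng.normal(size=(channels, d_k)) * scale,       # W_q
        rng.normal(size=(channels, d_k)) * scale,       # W_k
        rng.normal(size=(channels, channels)) * scale,  # W_v keeps C so residuals add
    )


def axial_attention(x, row_params, col_params):
    # Pass 1 (rows): each token sees every token in its own row.
    x = x + attend_along_axis(x, *row_params, axis=1)
    # Pass 2 (columns): each token sees row-mixed features from its column,
    # so information from any position can arrive in two hops.
    x = x + attend_along_axis(x, *col_params, axis=0)
    return x


rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
x = rng.normal(size=(H, W, C))
row_params = init_params(rng, C, d_k=8)
col_params = init_params(rng, C, d_k=8)
y = axial_attention(x, row_params, col_params)
print(y.shape)  # (4, 5, 8)
```

Swapping the two calls inside `axial_attention` gives the column-then-row ordering. In either order, the largest score tensor has H*W*max(H, W) entries rather than (H*W)^2.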

Comparison / Alternatives

| Aspect | Axial | Global (Full) | Window + Shift |
|---|---|---|---|
| Complexity | O(HW(H+W)) | O((HW)^2) | O(HWw^2) with window size w |
| Receptive Field | Two-hop global | Direct global | Patch-clustered, requires shifts |
| Memory Pressure | Linear in axis length per token | Quadratic | Moderate |
| Best Use Case | Gigapixel scenes | Moderate-resolution tasks | Efficiency + locality |
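To make the complexity row concrete, the short sketch below counts query-key scores per layer for each scheme, using the formulas from the table. The function name `attention_score_counts` and the example grid size and window size are assumptions chosen for illustration. Constant factors, heads, and channel dimensions are ignored.

```python
def attention_score_counts(height, width, window):
    """Count query-key scores per layer for an H-by-W token grid."""
    tokens = height * width
    return {
        "axial": tokens * (height + width),  # W scores per token (rows) + H (columns)
        "full": tokens ** 2,                 # every token attends to every token
        "window": tokens * window ** 2,      # every token attends to its w-by-w window
    }


counts = attention_score_counts(height=1024, width=1024, window=8)
for name, n in counts.items():
    print(f"{name:>6}: {n:,}")

# Full attention costs HW / (H + W) times more than axial attention.
ratio = counts["full"] // counts["axial"]
print(ratio)  # 512
```

For this 1024-by-1024 grid, axial attention needs 2^31 scores against 2^40 for full attention, while an 8-by-8 window needs only 2^26 but gives up global context within a single layer.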

Tools & Platforms

Axial attention is an essential tool for scaling transformers to dense, high-resolution imaging tasks. It keeps every patch in play without ever materializing the full (HW)-by-(HW) attention matrix, so practical deployments can see the whole field without exceeding their training budgets.
