Home Knowledge Base Flash Attention

Flash Attention is the IO-aware exact attention algorithm that computes multi-head self-attention without materializing the full N×N attention matrix in GPU HBM — using tiling and online softmax to process attention in blocks that fit in SRAM, reducing memory usage from O(N²) to O(N) and achieving 2-4x wall-clock speedup over standard attention by minimizing HBM read/write operations.

The Memory Wall Problem

Standard attention computes Q·K^T (an N×N matrix), applies softmax, and multiplies by V. For sequence length N=8192 and batch×heads=128, the attention matrix is 128×8192×8192×2 bytes = 16 GB — it doesn't even fit in GPU HBM. Even for shorter sequences, writing and reading this matrix to/from HBM is the dominant cost, not the arithmetic.

How Flash Attention Works

1. Tiling: Divide Q, K, V matrices into blocks that fit in GPU SRAM (the register file and shared memory, ~20 MB on H100 vs. 80 GB HBM). Process attention tile-by-tile. 2. Online Softmax: The key innovation. Standard softmax requires knowing the maximum value across the entire row (for numerical stability). Flash Attention uses the online softmax algorithm: maintain running max and running sum, updating them as each K-block is processed. The final result is mathematically identical to standard softmax. 3. No Materialization: The N×N attention matrix is never fully formed in memory. Each Q-block × K-block partial attention score is computed in SRAM, softmax-weighted, multiplied by V-block, and accumulated — all without writing intermediate results to HBM. 4. Fused Kernel: The entire attention computation (QK^T, masking, softmax, dropout, AV) is fused into a single GPU kernel, eliminating multiple HBM round-trips that standard implementations require.

Performance Impact

MetricStandard AttentionFlash Attention
HBM MemoryO(N²)O(N)
HBM Read/WriteO(N² × d)O(N² × d² / SRAM_size)
Wall-clock (N=2K)1x2-3x faster
Wall-clock (N=16K)OOMWorks, 5-10x faster than sparse approx

Flash Attention 2 and 3

Impact on the Field

Flash Attention made long-context LLMs practical. Before FlashAttention, attention at N=8192 was prohibitively expensive. Now, production models routinely use 32K-128K context lengths because FlashAttention makes the IO cost manageable.

Flash Attention is the algorithm that removed the memory bottleneck from transformer attention — proving that the O(N²) compute cost of attention was acceptable all along; it was the O(N²) memory access cost that was the real problem, and that could be solved by never writing the matrix down.

flash attention algorithmio aware attentiontiled attentionflash attention memoryfused attention kernel

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.