Flash Attention is the IO-aware exact attention algorithm that computes multi-head self-attention without materializing the full N×N attention matrix in GPU HBM. It uses tiling and online softmax to process attention in blocks that fit in SRAM, reducing extra memory from O(N²) to O(N) and achieving a 2-4x wall-clock speedup over standard attention by minimizing HBM reads and writes.
The Memory Wall Problem
Standard attention computes Q·K^T (an N×N matrix), applies softmax, and multiplies by V. For sequence length N=8192 and batch×heads=128, the attention scores alone occupy 128×8192×8192×2 bytes = 16 GiB in fp16, per layer, before counting weights, activations, or optimizer state: a large fraction of an 80 GB accelerator's HBM. Even for shorter sequences, writing this matrix to HBM and reading it back dominates the runtime, not the arithmetic.
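A quick back-of-the-envelope script makes that arithmetic concrete (illustrative only; the batch, head, and precision figures are the ones from the example above, not measurements):

```python
# Back-of-the-envelope check of the numbers above (not a memory profiler).
batch_heads = 128
N = 8192
bytes_per_elem = 2  # fp16

score_matrix_bytes = batch_heads * N * N * bytes_per_elem
print(f"Attention scores at N={N}: {score_matrix_bytes / 2**30:.1f} GiB")  # -> 16.0 GiB

# Doubling the sequence length quadruples the score matrix:
print(f"Attention scores at N={2 * N}: "
      f"{batch_heads * (2 * N) ** 2 * bytes_per_elem / 2**30:.1f} GiB")    # -> 64.0 GiB
```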
How Flash Attention Works
1. Tiling: Divide the Q, K, V matrices into blocks that fit in GPU SRAM (on-chip shared memory and registers, roughly 20 MB in total on an A100, versus 40-80 GB of HBM). Process attention tile-by-tile.
2. Online Softmax: The key innovation. Standard softmax requires knowing the maximum value across the entire row (for numerical stability). Flash Attention uses the online softmax algorithm: maintain a running max and running sum, updating them as each K-block is processed. The final result is mathematically identical to standard softmax (see the sketch after this list).
3. No Materialization: The N×N attention matrix is never fully formed in memory. Each Q-block × K-block partial attention score is computed in SRAM, softmax-weighted, multiplied by V-block, and accumulated — all without writing intermediate results to HBM.
4. Fused Kernel: The entire attention computation (QK^T, masking, softmax, dropout, AV) is fused into a single GPU kernel, eliminating multiple HBM round-trips that standard implementations require.
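The following is a minimal NumPy sketch of the forward pass under simplifying assumptions (single head, no masking or dropout, an arbitrary block size, plain Python loops standing in for the fused kernel). It is meant to show the tiling and online-softmax bookkeeping, not the real CUDA implementation; the `flash_attention_forward` name and block size are illustrative, not any library's API.

```python
import numpy as np

def standard_attention(Q, K, V):
    """Reference implementation: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def flash_attention_forward(Q, K, V, block=64):
    """Tiled attention with online softmax; only block x block score tiles exist."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))

    for qs in range(0, N, block):                  # outer loop over Q-blocks
        Qb = Q[qs:qs + block]
        Ob = np.zeros_like(Qb)                     # output accumulator for this Q-block
        m = np.full(len(Qb), -np.inf)              # running row-wise max (stability)
        l = np.zeros(len(Qb))                      # running softmax denominator

        for ks in range(0, N, block):              # inner loop over K/V-blocks
            Kb, Vb = K[ks:ks + block], V[ks:ks + block]
            S = (Qb @ Kb.T) * scale                # block x block score tile ("in SRAM")
            m_new = np.maximum(m, S.max(axis=-1))
            corr = np.exp(m - m_new)               # rescale previously accumulated values
            P = np.exp(S - m_new[:, None])         # unnormalized probabilities for this tile
            l = l * corr + P.sum(axis=-1)
            Ob = Ob * corr[:, None] + P @ Vb
            m = m_new

        O[qs:qs + block] = Ob / l[:, None]         # normalize once per Q-block

    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(standard_attention(Q, K, V), flash_attention_forward(Q, K, V))
```

The assertion at the end checks that the tiled computation matches the reference up to floating-point tolerance, which is the sense in which Flash Attention is exact rather than approximate.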
Performance Impact
| Metric | Standard Attention | Flash Attention |
|--------|-------------------|----------------|
| HBM Memory | O(N²) | O(N) |
| HBM Read/Write | Θ(N·d + N²) | Θ(N²·d² / M), M = SRAM size |
| Wall-clock (N=2K) | 1x | 2-3x faster |
| Wall-clock (N=16K) | OOM | Works, 5-10x faster than sparse approx |
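Plugging representative numbers into those asymptotics shows why the IO reduction is large. Constants are dropped, and the head dimension d and SRAM size M below are assumed, order-of-magnitude values, so treat the output as a ratio rather than an exact count:

```python
# Rough comparison of HBM accesses implied by the table above (constants omitted).
N, d, M = 8192, 64, 100_000            # M ~ on-chip SRAM capacity in elements (assumed)

standard_io = N * d + N * N            # Theta(N*d + N^2): read Q, K, V; write/read scores
flash_io = N * N * d * d / M           # Theta(N^2 * d^2 / M)

print(f"standard ~{standard_io:.2e}, flash ~{flash_io:.2e}, "
      f"ratio ~{standard_io / flash_io:.0f}x fewer HBM accesses")
# -> roughly a 20-25x reduction for these values
```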
Flash Attention 2 and 3
- FlashAttention-2: Better work partitioning across GPU thread blocks, reduced non-matmul FLOPs, improved warp-level parallelism. Achieves 50-73% of theoretical max FLOPS on A100.
- FlashAttention-3: Exploits H100-specific features (FP8 Tensor Cores, asynchronous memory operations, warpgroup-level programming) for further speedup. Supports FP8 attention for additional 2x throughput.
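In practice these kernels are usually reached through a framework rather than called directly; for instance, PyTorch's `scaled_dot_product_attention` can dispatch to FlashAttention on supported GPUs. The sketch below assumes a recent PyTorch (2.3+) with a CUDA device and fp16 inputs; the shapes are arbitrary:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Shapes: (batch, heads, seq_len, head_dim); fp16 on CUDA so the FlashAttention
# backend is eligible for dispatch.
q, k, v = (torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict dispatch to the FlashAttention backend (raises if it is unavailable).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 16, 8192, 64])
```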
Impact on the Field
Flash Attention made long-context LLMs practical. Before FlashAttention, attention at N=8192 was prohibitively expensive. Now, production models routinely use 32K-128K context lengths because FlashAttention makes the IO cost manageable.
Flash Attention is the algorithm that removed the memory-bandwidth bottleneck from transformer attention. It showed that the O(N²) compute cost of attention was acceptable all along; the real problem was the O(N²) HBM traffic, and that could be solved by never writing the matrix down.