Sliding window attention is an efficient attention mechanism that restricts each token to attending only to nearby tokens within a fixed window, reducing the computational complexity of attention from O(N²) to O(N×W), where N is the sequence length and W is the window size. This makes very long sequences tractable. In the bidirectional form, each token attends to W/2 tokens on either side of it; in the causal form, each token attends to the W tokens preceding it. Local attention of this kind captures short-range dependencies efficiently at the cost of direct global context.

Stacking sliding window attention across layers recovers long-range interactions through multiple hops: with L layers and window size W, the effective receptive field grows to roughly L×W, because information can propagate one window further with each layer. Mistral 7B, for example, uses a window of 4,096 tokens across 32 layers, giving a theoretical attention span of about 131K tokens even though each individual layer looks at most 4,096 positions back.

The approach is used in Longformer, which combines sliding window attention with global attention on a few special tokens, and in models like Mistral 7B. In practice it enables context lengths on the order of 32K-128K tokens with manageable computation, trading full attention's ability to model long-range dependencies directly for computational efficiency. Sliding windows can also be combined with other efficient attention mechanisms, such as sparse attention or linear attention, for further scaling.
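To make the mechanism concrete, here is a minimal PyTorch sketch of causal sliding window attention. The function names are illustrative, and the implementation is a naive reference: it still materializes the full N×N score matrix and then masks out-of-window entries, whereas a production kernel would compute only the scores inside the band to actually realize the O(N×W) cost.

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions j
    with i - window < j <= i (a causal window of size `window`)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

def sliding_window_attention(q, k, v, window: int):
    """Naive reference: build full scores, then block out-of-window pairs.
    Shapes: q, k, v are (..., seq_len, head_dim)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5           # (..., N, N)
    mask = sliding_window_causal_mask(q.size(-2), window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))   # forbid distant pairs
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 16 tokens, head dim 8, window of 4.
q = k = v = torch.randn(1, 16, 8)
out = sliding_window_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([1, 16, 8])
```

Note that every row of the masked score matrix keeps at least one valid entry (a token always attends to itself), so the softmax is well defined even at position 0.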
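The Longformer-style combination mentioned above can be sketched as a mask composition: a bidirectional window mask OR'd with full rows and columns for a handful of global tokens. The helpers below are hypothetical illustrations of that idea, not Longformer's actual implementation.

```python
import torch

def symmetric_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Bidirectional window: each token sees window // 2 neighbors
    on each side (plus itself)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j - i).abs() <= window // 2

def longformer_style_mask(seq_len: int, window: int,
                          global_idx: list[int]) -> torch.Tensor:
    """Window mask plus global attention: tokens at `global_idx`
    attend to every position, and every position attends back."""
    local = symmetric_window_mask(seq_len, window)
    g = torch.zeros(seq_len, dtype=torch.bool)
    g[global_idx] = True
    # A (query i, key j) pair is allowed if it is inside the window
    # or if either token is global.
    return local | g.unsqueeze(0) | g.unsqueeze(1)

# Token 0 plays the role of a [CLS]-style global token.
m = longformer_style_mask(16, window=4, global_idx=[0])
print(m[0].all().item(), m[:, 0].all().item())  # True True
```

The global tokens give the model a cheap channel for sequence-wide information while the window keeps the per-token cost linear in W for everything else.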