BigBird is a sparse attention transformer that combines three attention patterns (local sliding window, global tokens, and random connections) to achieve O(n) complexity while provably preserving the universal approximation properties of full attention. It handles sequences of 4,096-8,192+ tokens on standard GPUs, with graph-theoretic guarantees that its sparse attention pattern can approximate any function that full attention can, a property earlier sparse attention methods lacked.
What Is BigBird?
- Definition: A transformer architecture (Zaheer et al., 2020, Google Research) that replaces full O(n²) attention with a sparse pattern combining three components: local sliding-window attention, a set of global tokens, and random attention connections, together with a theoretical proof that this combination is a universal approximator of sequence-to-sequence functions.
- The Theoretical Breakthrough: Earlier sparse attention methods (Longformer, Sparse Transformer) were empirically effective but lacked theoretical justification. Using graph-theoretic arguments, BigBird proved that its specific combination of local + global + random attention is a universal approximator of sequence functions and, like full attention, is Turing complete.
- The Practical Impact: Processes sequences 8-16× longer than BERT (4K-8K vs 512 tokens) with only 3-4× the compute, enabling genomics (DNA sequences), long-document NLP, and scientific text processing.
Three Attention Components
| Component | Pattern | Purpose | Complexity |
|-----------|--------|---------|-----------|
| Local (Sliding Window) | Each token attends to w nearest neighbors | Capture local syntax and phrases | O(n × w) |
| Global | g designated tokens attend to/from ALL positions | Long-range information aggregation | O(n × g) |
| Random | Each token attends to r randomly chosen positions | Probabilistic graph connectivity (theory requirement) | O(n × r) |
Total per-token attention: w + g + r positions (instead of n).
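To make the three components concrete, here is a minimal NumPy sketch that builds a token-level boolean attention mask from the three patterns. The function name and parameter defaults are illustrative, not from the paper; the actual BigBird implementation operates on blocks of tokens rather than individual positions for GPU efficiency.

```python
import numpy as np

def bigbird_mask(n, w=3, g=2, r=3, seed=0):
    """Sketch of a BigBird-style sparse attention mask.

    mask[i, j] = True means query i may attend to key j.
    Token-level for clarity; the real model works block-wise.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # Local: each token attends to its w-token neighborhood on each side.
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        mask[i, lo:hi] = True

    # Global: the first g tokens attend to, and are attended by, all positions.
    mask[:g, :] = True
    mask[:, :g] = True

    # Random: each token attends to r uniformly sampled extra positions.
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True

    return mask

m = bigbird_mask(n=64)
print(f"mask density: {m.mean():.3f}")  # well below the 1.0 of full attention
```

Each row of the mask has roughly 2w + 1 + g + r True entries, independent of n, which is where the O(n) total complexity comes from.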
Why Random Connections Matter
| Without Random (Local + Global only) | With Random (BigBird) |
|--------------------------------------|----------------------|
| Information must flow through global tokens | Direct random links create shortcuts |
| Connectivity bottlenecked at a handful of hub tokens | Random edges give the attention graph O(log n) diameter |
| No universal approximation guarantee | Proven universal approximator |
| Like a hub-and-spoke network | Like a small-world network |
The random connections are the theoretical key: they make the attention graph behave like a random (expander) graph, so information can flow between any two positions in O(log n) hops, a property the universal approximation argument relies on.
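The small-world effect is easy to verify empirically. The sketch below (illustrative, not from the paper) compares average shortest path length in a local-only ring graph against the same graph with a few random edges per node, mimicking BigBird's random attention links:

```python
import random
import networkx as nx

def ring_graph(n, w):
    """Local-only graph: each node linked to its w nearest neighbors per side."""
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for d in range(1, w + 1):
            g.add_edge(i, (i + d) % n)
    return g

n, w, r = 512, 3, 2
local = ring_graph(n, w)

# Add r random edges per node (BigBird's random attention component).
random.seed(0)
small_world = local.copy()
for i in range(n):
    for j in random.sample(range(n), r):
        if i != j:
            small_world.add_edge(i, j)

print("local only  :", nx.average_shortest_path_length(local))        # ~40+
print("with random :", nx.average_shortest_path_length(small_world))  # ~4-5
```

The random shortcuts collapse path lengths from linear in n to roughly logarithmic, which is exactly the connectivity property the table above describes.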
BigBird Variants
| Variant | Global Token Type | When to Use |
|---------|-----------------|-------------|
| BigBird-ITC (Internal Transformer Construction) | Existing tokens designated as global | Classification, QA (input tokens are globally important) |
| BigBird-ETC (Extended Transformer Construction) | Extra auxiliary tokens added as global | When no natural global tokens exist in input |
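In practice, both variants are available through the Hugging Face transformers library; the snippet below loads the ITC-style checkpoint. The specific parameter values (block_size, num_random_blocks) are the library defaults shown for illustration:

```python
# pip install transformers torch
from transformers import AutoTokenizer, BigBirdModel

# google/bigbird-roberta-base is the ITC-style checkpoint on the Hub.
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",  # the sparse local+global+random pattern
    block_size=64,                  # tokens per attention block
    num_random_blocks=3,            # random blocks each query block attends to
)

inputs = tokenizer("A long document ... " * 500, return_tensors="pt",
                   truncation=True, max_length=4096)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```

Setting attention_type="original_full" instead switches the same weights to dense attention, which the library falls back to automatically when sequences are too short for the block-sparse pattern to pay off.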
BigBird vs Other Efficient Transformers
| Model | Attention Pattern | Theoretical Guarantee | Max Length | Complexity |
|-------|------------------|---------------------|-----------|-----------|
| BigBird | Local + Global + Random | Universal approximation ✓ | 4K-8K | O(n) |
| Longformer | Local + Dilated + Global | No formal proof | 16K | O(n) |
| Reformer | LSH bucketing | Approximate attention only | 64K | O(n log n) |
| Linformer | Low-rank projection | No formal proof | Long | O(n) |
| Performer | Random feature approximation | Approximate kernel attention | Long | O(n) |
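The complexity column is worth making concrete. Assuming BigBird's block-sparse layout with block size 64 and 3 window, 2 global, and 3 random blocks per query (illustrative values), the score-pair count at n = 4,096 works out as follows:

```python
n = 4096                              # sequence length
full = n * n                          # score pairs for full attention
w, g, r = 3 * 64, 2 * 64, 3 * 64      # window/global/random positions per token
sparse = n * (w + g + r)              # BigBird-style per-layer budget
print(f"full:   {full:,}")            # 16,777,216
print(f"sparse: {sparse:,}")          # 2,097,152
print(f"ratio:  {full / sparse:.1f}x")  # 8.0x fewer score pairs
```

Because the sparse budget grows linearly in n while full attention grows quadratically, the gap widens as sequences get longer.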
BigBird is the theoretically grounded efficient transformer: it combines local sliding-window, global, and random attention to achieve linear complexity with a formal proof of universal approximation, establishing that sparse attention need not sacrifice the expressive power of full attention while enabling 8-16× longer sequences on standard GPU hardware for genomics, long-document NLP, and scientific computing applications.