BigBird is a sparse-attention transformer that combines three attention patterns (local sliding window, global tokens, and random connections) to achieve O(n) attention complexity in the sequence length n while provably preserving the universal approximation properties of full attention. This enables sequences of 4,096 to 8,192+ tokens on standard GPUs. Its guarantees, which are based on graph theory, show that the sparse attention pattern can approximate any function that full attention can, a property that other sparse attention methods lacked.

What Is BigBird?

Three Attention Components

| Component | Pattern | Purpose | Complexity |
|---|---|---|---|
| Local (Sliding Window) | Each token attends to w nearest neighbors | Capture local syntax and phrases | O(n × w) |
| Global | g designated tokens attend to/from ALL positions | Long-range information aggregation | O(n × g) |
| Random | Each token attends to r randomly chosen positions | Probabilistic graph connectivity (theory requirement) | O(n × r) |

Total per-token attention: w + g + r positions (instead of n).
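
The three components can be sketched as a per-token set of allowed key positions. This is a minimal illustration in plain Python, not BigBird's actual implementation: the function name `bigbird_style_mask`, the choice of the first g positions as global tokens, and the reading of w as an odd window width that includes the token itself are all assumptions made for the example.

```python
import random


def bigbird_style_mask(n, w, g, r, seed=0):
    """Return allowed[i]: the set of key positions token i may attend to.

    Illustrative assumptions (not from the source):
      * the first g positions are the global tokens,
      * w is an odd window width that includes the token itself,
      * each non-global token samples r extra keys without replacement.
    """
    rng = random.Random(seed)
    half = (w - 1) // 2
    global_ids = set(range(g))
    allowed = []
    for i in range(n):
        if i in global_ids:
            # Global tokens attend to every position.
            allowed.append(set(range(n)))
            continue
        # Local: the sliding window centred on i, clipped at the edges.
        keys = set(range(max(0, i - half), min(n, i + half + 1)))
        # Global: every token also attends to the g global tokens.
        keys |= global_ids
        # Random: r further positions not already covered.
        candidates = [j for j in range(n) if j not in keys]
        keys |= set(rng.sample(candidates, min(r, len(candidates))))
        allowed.append(keys)
    return allowed


n, w, g, r = 64, 3, 2, 3
allowed = bigbird_style_mask(n, w, g, r)
sparse_pairs = sum(len(keys) for keys in allowed)
print(f"interior token attends to {len(allowed[32])} keys (w + g + r = {w + g + r})")
print(f"sparse pairs: {sparse_pairs}, full attention pairs: {n * n}")
```

An interior token ends up with exactly w + g + r keys, while tokens near the sequence edges or next to the global positions have slightly fewer because the sets overlap. The candidate list makes this builder itself quadratic, which is acceptable for a sketch but not for real sequence lengths.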

Why Random Connections Matter

| Without Random (Local + Global only) | With Random (BigBird) |
|---|---|
| Information must flow through global tokens | Direct random links create shortcuts |
| Graph diameter limited by global token count | Random edges reduce graph diameter logarithmically |
| No universal approximation guarantee | Proven universal approximator |
| Like a hub-and-spoke network | Like a small-world network |

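The random connections are the graph-theoretic key: as in a sparse random graph, they let information flow between any two positions in O(log n) hops with high probability, the short-path property that BigBird's theoretical analysis of universal approximation and Turing completeness builds on.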
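
The shortcut effect of random links can be checked on a toy graph. The sketch below treats attention as an undirected graph and measures the fewest hops between the first and last positions with breadth-first search. Global tokens are left out on purpose so that the effect of the random edges is visible in isolation; the function names and the parameters (n = 1024, w = 3, r = 3) are hypothetical choices for illustration.

```python
import random
from collections import deque


def attention_graph(n, w, r, seed=0):
    """Undirected graph: sliding-window edges plus r random edges per node."""
    rng = random.Random(seed)
    half = (w - 1) // 2
    adj = [set() for _ in range(n)]
    for i in range(n):
        neighbours = list(range(max(0, i - half), min(n, i + half + 1)))
        neighbours += rng.sample(range(n), r)  # random links (none if r == 0)
        for j in neighbours:
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def hops(adj, src, dst):
    """Breadth-first search: fewest edges from src to dst, or None."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


n = 1024
local_only = attention_graph(n, w=3, r=0)
with_random = attention_graph(n, w=3, r=3)
print("local only :", hops(local_only, 0, n - 1))
print("with random:", hops(with_random, 0, n - 1))
```

With a window of 3 and no random edges the graph is a simple chain, so the first and last positions are n - 1 = 1023 hops apart. Adding a few random edges per node shortens that path dramatically; the exact number depends on the seed.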

BigBird Variants

| Variant | Global Token Type | When to Use |
|---|---|---|
| BigBird-ITC (Internal Transformer Construction) | Existing tokens designated as global | Classification, QA (input tokens are globally important) |
| BigBird-ETC (Extended Transformer Construction) | Extra auxiliary tokens added as global | When no natural global tokens exist in input |
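
The difference between the two variants comes down to where the global positions come from. A minimal sketch, assuming a tokenized input held as a Python list; the helper names and the `[GLOBAL]` marker are hypothetical and are not part of any BigBird release.

```python
def itc_global_positions(tokens, is_global):
    """ITC: mark existing tokens as global; the sequence length is unchanged."""
    return [i for i, tok in enumerate(tokens) if is_global(tok)]


def etc_add_global_tokens(tokens, num_extra, marker="[GLOBAL]"):
    """ETC: prepend auxiliary tokens and return (extended sequence, global positions)."""
    extended = [marker] * num_extra + list(tokens)
    return extended, list(range(num_extra))


tokens = ["[CLS]", "what", "is", "bigbird", "[SEP]", "a", "sparse", "transformer", "[SEP]"]

# ITC: here only the existing [CLS] token is treated as global.
itc_globals = itc_global_positions(tokens, lambda tok: tok == "[CLS]")

# ETC: two auxiliary tokens are added in front of the input.
etc_tokens, etc_globals = etc_add_global_tokens(tokens, num_extra=2)

print(itc_globals, len(tokens))
print(etc_globals, len(etc_tokens))
```

ITC leaves the input length at 9 and reuses position 0, whereas ETC grows the sequence to 11 tokens and reserves positions 0 and 1 for global attention.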

BigBird vs Other Efficient Transformers

| Model | Attention Pattern | Theoretical Guarantee | Max Length | Complexity |
|---|---|---|---|---|
| BigBird | Local + Global + Random | Universal approximation ✓ | 4K-8K | O(n) |
| Longformer | Local + Dilated + Global | No formal proof | 16K | O(n) |
| Reformer | LSH bucketing | Approximate attention only | 64K | O(n log n) |
| Linformer | Low-rank projection | No formal proof | Long | O(n) |
| Performer | Random feature approximation | Approximate kernel attention | Long | O(n) |
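
To make the complexity column concrete, the sketch below compares how each idealised cost model grows when the sequence length doubles from 4,096 to 8,192 tokens. Constant factors are ignored here, although they matter a great deal in practice, so the figures describe asymptotic scaling only and are not measurements of any of the models above.

```python
import math


def growth_when_doubling(cost, n):
    """Factor by which cost grows when sequence length goes from n to 2n."""
    return cost(2 * n) / cost(n)


# Idealised cost models; constant factors are deliberately ignored.
costs = {
    "O(n^2)": lambda n: n * n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n)": lambda n: n,
}

for name, cost in costs.items():
    print(f"{name:>10}: x{growth_when_doubling(cost, 4096):.2f}")
```

Doubling the length quadruples the cost of full attention, slightly more than doubles an O(n log n) method, and exactly doubles a linear one.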

BigBird is the theoretically grounded efficient transformer: it combines local sliding window, global tokens, and random attention connections to achieve linear complexity with a formal proof of universal approximation. It establishes that sparse attention need not sacrifice the expressive power of full attention, while enabling 4-8× longer sequences on standard GPU hardware for genomics, long-document NLP, and scientific computing applications.
