The Compressive Transformer is a long-range transformer architecture that extends context access through a hierarchical memory system: instead of discarding older attention memories, it compresses them into progressively smaller representations, letting the model reference thousands of tokens of history at a bounded memory cost. It is the architecture that demonstrated how learned compression functions can preserve long-range information that fixed-window transformers simply cannot access.
What Is the Compressive Transformer?
- Definition: An extension of the Transformer-XL architecture that adds a compressed memory tier — when active memories (recent tokens) age out of the attention window, they are compressed into fewer, denser representations rather than being discarded, maintaining access to long-range context.
- Three Memory Tiers: (1) Active memory — the most recent tokens with full-resolution attention (standard transformer window), (2) Compressed memory — older tokens compressed into fewer representations via learned compression functions, (3) Discarded — only the oldest compressed memories are eventually evicted.
- Compression Functions: Old memories are compressed with learned or parameter-free functions such as strided convolution (pooling groups of c memories into one), attention-based pooling (a weighted combination), or max pooling, reducing sequence-axis memory by a factor of c (the compression rate) while preserving the most important information.
- Bounded Memory, Linear Compute: The memory banks have fixed capacity, so per-token attention cost stays constant and total compute grows linearly with sequence length rather than quadratically, enabling processing of sequences far longer than the attention window (a toy sketch of the bookkeeping follows this list).
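A toy sketch of that bookkeeping, assuming a simple two-bank layout with hypothetical names (`CompressiveMemory`, `m`, `c`); the real model stores per-layer hidden-state tensors rather than generic items:

```python
from collections import deque

class CompressiveMemory:
    """Toy sketch of the retained memory tiers and their fixed capacities.

    In the actual architecture, slots leaving `active` are compressed into
    `compressed` rather than dropped; that update step is sketched under
    "Memory Management" further below.
    """
    def __init__(self, m: int, c: int):
        self.m, self.c = m, c                   # window size, compression rate
        self.active = deque(maxlen=m)           # newest slots, full resolution
        self.compressed = deque(maxlen=m // c)  # older slots, compressed at rate c;
                                                # overflow here is discarded for good

    @property
    def total_slots(self) -> int:
        # Fixed upper bound on stored slots: per-token attention cost is
        # therefore constant, and total compute over a length-n stream is O(n).
        return self.m + self.m // self.c
```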
Why Compressive Transformer Matters
- Extended Context: A standard transformer can attend to at most window_size tokens; the Compressive Transformer reaches a multiple of window_size tokens of history (roughly 2× under the sizing in the table below), at the cost of lower-resolution, compressed representations of older content.
- Graceful Information Decay: Rather than a hard cutoff where information beyond the window is completely lost, information degrades gradually through compression — recent context is high-resolution, older context is lower-resolution but still accessible.
- Bounded Memory: Unlike approaches that store all past tokens, the Compressive Transformer maintains a fixed-size memory buffer regardless of sequence length, which makes it practical for deployment on memory-constrained hardware.
- Long-Document Understanding: Tasks requiring understanding of book-length texts (summarization, QA over long documents) benefit from compressed access to earlier content.
- Foundation for Hierarchical Memory: Established the design pattern of multi-tier memory with different resolution levels — influencing subsequent architectures like Memorizing Transformers and focused transformer variants.
Compressive Transformer Architecture
Memory Management:
- Attention window: most recent m tokens with full self-attention.
- When new tokens arrive, the oldest active memories are evicted into the compression buffer.
- The compression function reduces each group of c memories to one compressed representation (compression rate c).
- Compressed memories accumulate in a compressed memory bank of fixed maximum size; once it is full, the oldest compressed memories are discarded (see the sketch after this list).
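A minimal PyTorch sketch of this update step, under simplifying assumptions: a single layer, no batch dimension, hypothetical names (`update_memories`, `compress_fn`), and an evicted block whose length divides evenly by c. The paper additionally trains the compression function with an auxiliary reconstruction loss, which is not shown here.

```python
import torch

def update_memories(new_h: torch.Tensor,       # [s, d] hidden states of the new segment
                    active: torch.Tensor,      # [<=m, d] full-resolution memory (FIFO)
                    compressed: torch.Tensor,  # [<=m//c, d] compressed memory bank (FIFO)
                    compress_fn,               # callable mapping [k, d] -> [k // c, d]
                    m: int, c: int):
    """One update: new states enter the active memory; the oldest active
    states are compressed at rate c instead of being thrown away."""
    # 1) Append the new segment to the active (full-resolution) memory.
    active = torch.cat([active, new_h], dim=0)
    # 2) Evict whatever no longer fits in the m-slot window.
    overflow = active.shape[0] - m
    if overflow > 0:
        evicted, active = active[:overflow], active[overflow:]
        # 3) Compress the evicted block: every c slots become one slot.
        compressed = torch.cat([compressed, compress_fn(evicted)], dim=0)
        # 4) Keep the compressed bank at its fixed size; drop its oldest slots.
        compressed = compressed[-(m // c):]
    return active, compressed

# Example with hypothetical sizes: d=16, m=8, c=2, segments of 4 tokens.
d, m, c = 16, 8, 2
active, compressed = torch.zeros(0, d), torch.zeros(0, d)
mean_pool = lambda x: x.reshape(-1, c, d).mean(dim=1)   # simplest compression function
for _ in range(5):
    active, compressed = update_memories(torch.randn(4, d), active, compressed, mean_pool, m, c)
print(active.shape, compressed.shape)  # torch.Size([8, 16]) torch.Size([4, 16])
```

Attention for the current segment then runs over the concatenation of compressed memories, active memories, and the current segment's states, in the same way Transformer-XL attends to its cached memory.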
Compression Functions:
- Strided Convolution: A 1D convolution with kernel size and stride c along the sequence axis, producing learnable local summaries.
- Attention Pooling: Cross-attention from a single query to c memories — learns content-aware summarization.
- Max Pooling: Element-wise max across c memories — retains strongest activation signals.
- Mean Pooling: Simple averaging over each group of c memories, the parameter-free baseline (minimal sketches of these functions follow below).
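Minimal sketches of the three simplest variants (PyTorch; the width `d`, rate `c`, and function names are illustrative, the convolution weights would be learned jointly with the rest of the model, and the attention-pooling variant is omitted for brevity):

```python
import torch
import torch.nn as nn

d, c = 16, 3  # hypothetical model width and compression rate

def mean_pool(mem: torch.Tensor) -> torch.Tensor:
    """[k, d] -> [k // c, d]: average each group of c memories."""
    return mem.reshape(-1, c, d).mean(dim=1)

def max_pool(mem: torch.Tensor) -> torch.Tensor:
    """[k, d] -> [k // c, d]: element-wise max over each group of c memories."""
    return mem.reshape(-1, c, d).max(dim=1).values

# Strided convolution: kernel size and stride both set to c along the sequence axis.
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=c, stride=c)

def conv_compress(mem: torch.Tensor) -> torch.Tensor:
    """[k, d] -> [k // c, d] via a learned strided 1D convolution."""
    return conv(mem.t().unsqueeze(0)).squeeze(0).t()  # Conv1d wants [batch, channels, length]

old = torch.randn(9, d)  # nine evicted memories
print(mean_pool(old).shape, max_pool(old).shape, conv_compress(old).shape)
# each: torch.Size([3, 16])
```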
Memory Hierarchy Parameters
| Tier | Size | Resolution | Age | Access |
|------|------|-----------|-----|--------|
| Active Memory | m tokens | Full | Recent | Direct attention |
| Compressed Memory | m/c slots (spanning m original tokens) | Compressed (rate c) | Older | Cross-attention |
| Effective Context | m + (m/c)·c = 2m tokens | Mixed | Full range | Roughly 2× Transformer-XL with the same window |
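A quick back-of-the-envelope check of the last row, with hypothetical sizes m = 1024 and c = 4 (illustrative numbers, not the paper's configuration):

```python
m, c = 1024, 4                            # hypothetical window size and compression rate

active_slots = m                          # 1024 full-resolution slots
compressed_slots = m // c                 # 256 compressed slots ...
history_covered = compressed_slots * c    # ... each summarizing c tokens -> 1024 older tokens

effective_context = active_slots + history_covered   # 1024 + 1024 = 2048 = 2 * m
stored_slots = active_slots + compressed_slots        # 1024 + 256 = 1280, fixed for any stream length
print(effective_context, stored_slots)                # 2048 1280
```

Under this table's sizing, where the compressed bank holds m/c slots, the compressed tier always spans exactly m older tokens, so the effective context is 2m for any compression rate; a larger rate mainly shrinks the stored-slot footprint, while a larger compressed bank would extend the reach beyond 2m.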
The Compressive Transformer is the architectural proof that memory doesn't have to be all-or-nothing. It demonstrated that learned compression of older context preserves enough information for long-range tasks while keeping the bounded compute that makes deployment practical, and it pioneered the hierarchical memory design pattern adopted by subsequent efficient transformer architectures.