Hyena

Keywords: hyena, llm architecture

Hyena is a subquadratic attention replacement that combines long convolutions (computed via FFT) with element-wise, data-dependent gating. It achieves O(n log n) complexity instead of attention's O(n²) while retaining the input-conditioned processing crucial for language understanding, matching transformer quality on language modeling at the 1-2B parameter scale and delivering roughly 100× speedups on 64K-token contexts. It represents a fundamentally different architectural path beyond the attention mechanism.

What Is Hyena?

- Definition: A sequence modeling operator (Poli et al., 2023) that replaces the attention mechanism with a composition of long implicit convolutions (parameterized by small neural networks and computed via FFT) and element-wise multiplicative gating that conditions processing on the input data, achieving the "data-dependent" property of attention without the quadratic cost.
- The Motivation: Attention is O(n²) in sequence length, and each efficient-attention variant gives something up: FlashAttention is still quadratic in FLOPs, while sparse and linear attention approximate full attention and can lose quality. Hyena asks: can we build a fundamentally subquadratic operator that matches attention quality?
- The Answer: Long convolutions provide global receptive fields in O(n log n) via FFT, and data-dependent gating provides the input-conditional processing that makes attention so powerful. Combining the two yields a single operator with both properties (a minimal FFT-convolution sketch follows this list).
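
To make the O(n log n) claim concrete, here is a minimal NumPy sketch (not code from the Hyena paper) of a causal long convolution computed with zero-padded FFTs. The sequence u and filter h are placeholder arrays, and the check against np.convolve is only for illustration.

```python
import numpy as np

def fft_causal_conv(u: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Causal long convolution of u with filter h via zero-padded FFTs, O(n log n)."""
    n = u.shape[0]
    fft_size = 2 * n  # pad so the circular FFT product gives a linear convolution
    y = np.fft.irfft(np.fft.rfft(u, n=fft_size) * np.fft.rfft(h, n=fft_size), n=fft_size)
    return y[:n]      # keep only the causal outputs y_0 .. y_{n-1}

# Sanity check against direct convolution on a short toy sequence.
rng = np.random.default_rng(0)
u, h = rng.standard_normal(8), rng.standard_normal(8)
assert np.allclose(fft_causal_conv(u, h), np.convolve(u, h)[:8])
```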

The Hyena Operator

| Component | Function | Analogy to Attention |
|-----------|---------|---------------------|
| Implicit Convolution Filters | Parameterize convolution kernels with small neural networks (sketched below), apply via FFT | Like the attention pattern (which tokens interact) |
| Data-Dependent Gating | Element-wise multiplication by projections of the input | Like attention weights being conditioned on Q and K |
| FFT Computation | Convolution in frequency domain: O(n log n) | Replaces the O(n²) QK^T attention matrix |
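
The "implicit" filter row is what keeps parameter counts independent of sequence length: rather than storing n filter weights, a tiny network maps each position to a filter value. The sketch below illustrates the idea; the positional features, layer sizes, and decay window are assumptions for this example, not the exact parameterization used in the paper.

```python
import numpy as np

def implicit_filter(n: int, feat_dim: int = 8, hidden: int = 16, seed: int = 0) -> np.ndarray:
    """Generate a length-n filter from a tiny MLP over positional features."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)[:, None] / n                                # normalized positions, shape (n, 1)
    freqs = 2.0 ** np.arange(feat_dim // 2)                      # a few frequencies
    feats = np.concatenate([np.sin(np.pi * t * freqs),
                            np.cos(np.pi * t * freqs)], axis=1)  # (n, feat_dim) positional features
    W1 = rng.standard_normal((feat_dim, hidden)) / np.sqrt(feat_dim)
    W2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    h = np.tanh(feats @ W1) @ W2                                 # tiny MLP: one filter value per position
    window = np.exp(-3.0 * t)                                    # decay window biases toward nearby tokens
    return (h * window).squeeze(-1)

h_short = implicit_filter(1024)    # same parameter count whether n is 1K ...
h_long = implicit_filter(65536)    # ... or 64K: parameters do not grow with sequence length
```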

Hyena computation (order 2): y = x₂ ⊙ (h₂ * (x₁ ⊙ (h₁ * v)))

Where v, x₁, x₂ are linear projections of the input, h₁ and h₂ are the implicitly parameterized long-convolution filters, * denotes convolution (computed via FFT), and ⊙ is element-wise multiplication. Higher-order Hyena operators interleave additional convolution/gate pairs in the same way.
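
A minimal single-channel sketch of this order-2 forward pass is shown below. In a real Hyena block, v, x₁, x₂ come from learned projections of the input and h₁, h₂ from implicit filter networks; here they are random placeholder arrays so the data flow is easy to follow.

```python
import numpy as np

def causal_conv(u: np.ndarray, h: np.ndarray) -> np.ndarray:
    """FFT-based causal convolution (same idea as the earlier sketch)."""
    n = len(u)
    return np.fft.irfft(np.fft.rfft(u, 2 * n) * np.fft.rfft(h, 2 * n), 2 * n)[:n]

def hyena_order2(v, x1, x2, h1, h2):
    """y = x2 ⊙ (h2 * (x1 ⊙ (h1 * v))) for one channel; no n×n matrix is ever formed."""
    z = causal_conv(v, h1)   # long convolution: global token mixing in O(n log n)
    z = x1 * z               # data-dependent gate (element-wise multiplication)
    z = causal_conv(z, h2)   # second long convolution
    return x2 * z            # final gate

n = 4096
rng = np.random.default_rng(1)
v, x1, x2, h1, h2 = (rng.standard_normal(n) for _ in range(5))
y = hyena_order2(v, x1, x2, h1, h2)   # output shape: (n,)
```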

Complexity Comparison

| Operator | Complexity | Data-Dependent? | Global Receptive Field? | Exact? |
|----------|-----------|----------------|------------------------|--------|
| Full Attention | O(n²) | Yes (QK^T) | Yes | Yes |
| FlashAttention | O(n²) FLOPs, O(n) memory | Yes | Yes | Yes |
| Linear Attention | O(n) | Approximate | Yes (kernel approx) | No |
| Hyena | O(n log n) | Yes (gating) | Yes (FFT convolution) | N/A (different operator) |
| S4/Mamba | O(n) or O(n log n) | Yes (selective) | Yes (SSM) | N/A (different operator) |
| Local Attention | O(n × w) | Yes | No (window only) | Yes (within window) |
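
As a rough sanity check on the table, the snippet below compares the asymptotic operation counts n² and n·log₂(n). Constants and memory effects are ignored, so the printed ratios are not the measured wall-clock speedups quoted elsewhere in this article.

```python
import math

# Asymptotic ratio of attention-style n^2 work to FFT-convolution-style n*log2(n) work.
for n in (4_096, 65_536, 1_048_576):
    ratio = n**2 / (n * math.log2(n))
    print(f"n = {n:>9,}:  n^2 / (n log2 n)  ~ {ratio:,.0f}x")
# prints ratios of roughly 341x, 4,096x, and 52,429x
```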

Benchmark Results

| Benchmark | Transformer (baseline) | Hyena | Notes |
|-----------|----------------------|-------|-------|
| WikiText-103 (perplexity) | 18.7 (GPT-2 scale) | 18.9 | Within 1% quality |
| The Pile (perplexity) | Comparable | Comparable at 1-2B scale | Matches at moderate scale |
| Long Range Arena | Baseline | Competitive | Synthetic long-range benchmarks |
| Speed (64K context) | 1× (with FlashAttention) | ~100× faster | Dominant advantage at long contexts |

Hyena vs Related Subquadratic Architectures

| Model | Core Mechanism | Complexity | Maturity |
|-------|---------------|-----------|----------|
| Hyena | Implicit convolution + gating | O(n log n) | Research (2023) |
| Mamba (S6) | Selective State Space Model + hardware-aware scan | O(n) | Production-ready (2024) |
| RWKV | Linear attention + recurrence | O(n) | Open-source, active community |
| RetNet | Retention mechanism (parallel + recurrent) | O(n) | Research (Microsoft) |

Hyena represents a fundamentally different approach to sequence modeling beyond attention. By replacing the O(n²) attention matrix with O(n log n) FFT-based implicit convolutions and data-dependent gating, it matches transformer quality at moderate scale while delivering roughly 100× speedups on long contexts, demonstrating that the attention mechanism may not be the only path to high-quality language understanding and opening the door to subquadratic foundation models.
