MetaFormer

MetaFormer is the architectural hypothesis that a transformer's effectiveness comes primarily from its general architecture, alternating token-mixing and channel-mixing blocks, rather than from the attention mechanism specifically. The supporting evidence is PoolFormer, which replaces self-attention with simple average pooling and still achieves competitive ImageNet performance. On this view, the transformer's success reflects the discovery of an architectural topology more than the discovery of an attention mechanism.
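The block structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the PoolFormer reference implementation: it works on a 1D sequence of tokens rather than a 2D feature map, uses edge padding and a ReLU MLP for brevity, and all function names (`layer_norm`, `avg_pool_mixer`, `channel_mlp`, `metaformer_block`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channels (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def avg_pool_mixer(x, window=3):
    """Parameter-free token mixer: local average over neighbouring tokens.

    x has shape (n_tokens, channels). The input is subtracted because the
    residual connection in the block adds it back (assumption: 1D tokens
    with edge padding, a simplification of 2D pooling).
    """
    pad = window // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    pooled = np.stack([padded[i:i + window].mean(axis=0)
                       for i in range(x.shape[0])])
    return pooled - x

def channel_mlp(x, w1, w2):
    """Channel mixer: the same two-layer MLP applied to every token."""
    hidden = np.maximum(x @ w1, 0.0)  # ReLU chosen for brevity
    return hidden @ w2

def metaformer_block(x, token_mixer, w1, w2):
    """norm -> token mixing -> residual, then norm -> channel MLP -> residual."""
    x = x + token_mixer(layer_norm(x))
    x = x + channel_mlp(layer_norm(x), w1, w2)
    return x

rng = np.random.default_rng(0)
n_tokens, channels, hidden = 16, 8, 32  # illustrative sizes
x = rng.normal(size=(n_tokens, channels))
w1 = rng.normal(scale=0.1, size=(channels, hidden))
w2 = rng.normal(scale=0.1, size=(hidden, channels))

y = metaformer_block(x, avg_pool_mixer, w1, w2)
print(y.shape)  # (16, 8)
```

Because `token_mixer` is just a function argument, swapping pooling for another mixer changes one call site and leaves the rest of the block untouched, which is the point the hypothesis makes about the architecture.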

What Is MetaFormer?

Why MetaFormer Matters

Token Mixer Experiments

| Token Mixer | Parameters | ImageNet Top-1 | Complexity |
|---|---|---|---|
| Average Pooling (PoolFormer) | 0 | 82.1% | $O(n)$ |
| Random Matrix | Fixed random | ~80% | $O(n)$ |
| Depthwise Convolution | $K^2C$ per layer | 83.2% | $O(Kn)$ |
| Self-Attention | $4d^2$ per layer | 83.5% | $O(n^2)$ |
| Fourier Transform | 0 | 81.4% | $O(n \log n)$ |
| Spatial MLP (MLP-Mixer) | $n^2$ | 82.7% | $O(n^2)$ |
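To make the parameter and complexity columns concrete, the sketch below plugs example sizes into the table's formulas. The sizes are assumptions chosen for illustration (196 tokens, 768 channels, a 3 x 3 kernel), the function names are hypothetical, and the cost figures are leading terms only, with constants and channel width dropped. The random-matrix row is omitted because the table gives no parameter formula for it.

```python
import math

def mixer_parameters(n, d, k):
    """Per-layer token-mixer parameter counts, using the table's formulas."""
    return {
        "average_pooling": 0,
        "depthwise_conv": k * k * d,   # K^2 C, with C = d channels
        "self_attention": 4 * d * d,   # Q, K, V and output projections, no biases
        "fourier": 0,
        "spatial_mlp": n * n,          # one n x n token-mixing matrix
    }

def mixer_cost_growth(n, k):
    """Leading term of the table's complexity column for n tokens."""
    return {
        "average_pooling": n,
        "depthwise_conv": k * n,
        "self_attention": n * n,
        "fourier": n * math.log2(n),
        "spatial_mlp": n * n,
    }

# Assumed example sizes: 196 tokens (a 14 x 14 patch grid), 768 channels, K = 3.
params = mixer_parameters(n=196, d=768, k=3)
print(params["self_attention"])  # 2359296
print(params["depthwise_conv"])  # 6912
print(params["spatial_mlp"])     # 38416

# Ratio of leading cost terms when the token count grows fourfold.
small = mixer_cost_growth(196, 3)
large = mixer_cost_growth(784, 3)
for name in small:
    print(name, round(large[name] / small[name], 2))
```

With four times as many tokens, the linear mixers cost four times as much while the quadratic ones cost sixteen times as much, which is why the cheaper mixers become more attractive at high resolution.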

MetaFormer Architecture Hierarchy

The MetaFormer framework reveals a hierarchy of token mixing strategies:

Implications for Model Design

In short, MetaFormer locates the transformer's strength in its architectural blueprint rather than in attention: alternating token mixing with channel processing, wrapped in residual connections and normalization, is a general-purpose substrate on which many different mixing mechanisms achieve similar results.
