Token Merging (ToMe)

Token Merging (ToMe) is a training-free inference acceleration method for Vision Transformers (ViTs). It reduces computational cost by progressively merging redundant tokens in each transformer layer: bipartite soft matching on the tokens' feature representations identifies the most similar pairs, and each matched pair is replaced by its weighted average. On ImageNet classification this yields roughly 2× higher throughput with an accuracy drop of less than 1%. Introduced by Bolya et al. (Meta AI, 2023), ToMe requires no retraining and no architectural changes, and it has been applied to a range of pretrained transformer-based models, including DeiT, MAE, SAM, Stable Diffusion, and video transformers.
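The matching-and-merging step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `merge_tokens`, the alternating split into two sets, the choice of cosine similarity, the per-token `sizes` used as averaging weights, and the toy vectors are all introduced here for the example. In a real ViT the similarity vectors would come from the attention layer; here the tokens themselves are passed as `keys` for simplicity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_tokens(tokens, keys, sizes, r):
    """Remove up to r tokens by merging each into its best match (hypothetical helper).

    tokens: list of feature vectors
    keys:   list of vectors used only to measure similarity
    sizes:  how many original tokens each current token already represents
    Returns (new_tokens, new_sizes). Output order is: unmerged tokens from
    set A first, then all tokens from set B.
    """
    # Split tokens into two sets by alternating index (assumption of this sketch).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    r = max(0, min(r, len(a_idx)))
    if r == 0 or not b_idx:
        return [list(t) for t in tokens], list(sizes)

    # Each token in A proposes one edge to its most similar token in B.
    edges = []
    for i in a_idx:
        j = max(b_idx, key=lambda cand: cosine(keys[i], keys[cand]))
        edges.append((cosine(keys[i], keys[j]), i, j))

    # Keep only the r most similar edges; those A tokens will be merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged_into = {i: j for _, i, j in edges[:r]}

    # Accumulate size-weighted sums on the B side, then divide by total size.
    sums = {j: [x * sizes[j] for x in tokens[j]] for j in b_idx}
    new_size = {j: sizes[j] for j in b_idx}
    for i, j in merged_into.items():
        sums[j] = [s + x * sizes[i] for s, x in zip(sums[j], tokens[i])]
        new_size[j] += sizes[i]

    out_tokens = [list(tokens[i]) for i in a_idx if i not in merged_into]
    out_sizes = [sizes[i] for i in a_idx if i not in merged_into]
    for j in b_idx:
        out_tokens.append([s / new_size[j] for s in sums[j]])
        out_sizes.append(new_size[j])
    return out_tokens, out_sizes


# Toy input: tokens 0 and 1 point in the same direction, so they are merged.
tokens = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
merged, merged_sizes = merge_tokens(tokens, keys=tokens, sizes=[1, 1, 1, 1], r=1)
print(merged, merged_sizes)
```

Each call removes at most `r` tokens, so applying it once per layer shrinks the sequence progressively. Carrying `sizes` forward keeps a token that already stands for several patches from being diluted when it is merged again.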

What Is Token Merging?

Why Token Merging Works

Performance Results

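Model                                    | Baseline Throughput | ToMe Throughput      | Accuracy Change
-----------------------------------------|---------------------|----------------------|----------------
DeiT-S                                   | 1,411 img/s         | 2,783 img/s (+97%)   | −0.2%
DeiT-B                                   | 626 img/s           | 1,280 img/s (+104%)  | −0.3%
ViT-H (MAE)                              | 85 img/s            | 198 img/s (+133%)    | −0.2%
Stable Diffusion (U-Net attention blocks) | 3.4 it/s            | 5.4 it/s (+59%)      | Imperceptible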
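The percentage gains in the table follow directly from the two throughput columns. The short sketch below recomputes them and also expresses each as a speedup factor; the helper name `relative_gain` and the dictionary layout are assumptions made for illustration, while the numbers are copied from the table.

```python
def relative_gain(baseline, accelerated):
    """Percentage throughput increase of `accelerated` over `baseline`."""
    return (accelerated / baseline - 1.0) * 100.0


# (baseline throughput, ToMe throughput) as listed in the table above.
results = {
    "DeiT-S": (1411, 2783),
    "DeiT-B": (626, 1280),
    "ViT-H (MAE)": (85, 198),
    "Stable Diffusion": (3.4, 5.4),
}

for model, (baseline, tome) in results.items():
    speedup = tome / baseline
    print(f"{model}: {speedup:.2f}x speedup, +{relative_gain(baseline, tome):.0f}%")
```

Read as speedup factors, the classification rows land between about 2.0× and 2.3×, and the Stable Diffusion row at about 1.6×.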

Applications and Extensions

Token Merging is an elegantly simple inference accelerator for Vision Transformers. It rests on the observation that a pretrained model's own key representations can identify which tokens are redundant, so those tokens can be merged to cut redundant computation with minimal accuracy loss and without retraining, fine-tuning, or architectural modification.
