Token Merging (ToMe)

Keywords: token merging

Token Merging (ToMe) is a training-free inference acceleration method for Vision Transformers that reduces computational cost by progressively combining redundant tokens at each transformer layer. It identifies similar tokens via bipartite soft matching of their feature representations and replaces each matched pair with a weighted average, achieving 2–3× throughput improvement with less than 1% accuracy drop on ImageNet classification. Introduced by Bolya et al. (Meta AI, 2023), ToMe is a remarkably effective inference optimization: it requires no retraining and no architectural changes, and it applies to any pretrained ViT-based model, including DeiT, MAE, SAM, Stable Diffusion, and video transformers.

What Is Token Merging?

- The Redundancy Problem: Vision Transformers split images into N patch tokens (e.g., 196 tokens for a 224×224 image with 16×16 patches). Many of these tokens represent visually similar or background regions and carry highly redundant information — yet all are processed through every attention layer at a cost quadratic in N.
- Token Merging Solution: At each transformer layer, before computing self-attention, identify the r most redundant token pairs using bipartite soft matching, then average each pair into a single merged token. After merging, the layer operates on N - r tokens instead of N.
- Bipartite Soft Matching: Tokens are alternately split into two disjoint sets A and B. Each token in set A is matched to its most similar token in set B, measured by cosine similarity of their attention keys. The r pairs with the highest similarity scores are merged via a size-weighted average of their features, while a running count tracks how many original patches each merged token represents (a minimal merging sketch follows this list).
- Progressive Reduction: ToMe is applied at every layer, progressively reducing the token count — a transformer with 12 layers applying r=8 merges per layer reduces from 196 to 100 tokens by the final layer.
- No Training Required: Merged tokens remain compatible with the pretrained model's attention and MLP computations, so no fine-tuning is needed. ToMe "just works" on any pretrained ViT.
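
The matching-and-merging step is compact enough to sketch directly. Below is a minimal, single-image PyTorch sketch under assumptions of our own (the name tome_merge is illustrative, and the real implementation batches across images and handles attention heads): it partitions tokens into alternating sets, matches by key cosine similarity, and folds the r most similar pairs together with a size-weighted average.

```python
import torch

def tome_merge(x: torch.Tensor, keys: torch.Tensor, size: torch.Tensor, r: int):
    """Merge the r most similar token pairs (single image, illustrative helper).

    x:    (N, d) token features entering the layer
    keys: (N, d) attention keys averaged over heads, used for similarity
    size: (N,)   float count of original patches per token (start: torch.ones(N))
    """
    k = keys / keys.norm(dim=-1, keepdim=True)      # normalize for cosine similarity
    a_idx = torch.arange(0, len(x), 2)              # set A: even-position tokens
    b_idx = torch.arange(1, len(x), 2)              # set B: odd-position tokens
    scores = k[a_idx] @ k[b_idx].T                  # (|A|, |B|) similarity matrix
    best, match = scores.max(dim=-1)                # best B partner for each A token
    order = best.argsort(descending=True)           # most redundant pairs first
    src, kept = order[:r], order[r:]                # merge r A-tokens away

    x, size = x.clone(), size.clone()
    for s in src.tolist():
        i, j = a_idx[s].item(), b_idx[match[s]].item()
        # Size-weighted average keeps every merged token an exact average
        # of all the original patches folded into it so far.
        x[j] = (x[j] * size[j] + x[i] * size[i]) / (size[j] + size[i])
        size[j] = size[j] + size[i]
    keep = torch.cat([a_idx[kept], b_idx]).sort().values
    return x[keep], size[keep]                      # N - r tokens remain
```

Applied once per layer, this reproduces the progressive schedule described above: with 196 tokens and r = 8, a 12-layer ViT steps 196 → 188 → ... → 100.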

Why Token Merging Works

- Soft Information Preservation: Unlike token pruning, which discards tokens entirely, merging averages information from two tokens. Little information is lost, because the merged token carries the combined signal of both; what is eliminated is redundancy, not content.
- Attention Score Tracking: ToMe tracks how many original patches each merged token represents (its size s) and uses proportional attention, adding log s to the attention logits so a merged token receives attention as if its constituent patches were still present (see the sketch after this list).
- Architectural Alignment: The key-based similarity matching aligns with what attention already computes: keys summarize what each token offers to the attention mechanism, so tokens with near-identical keys contribute nearly identical information, and merging them early does not disrupt the attention structure.
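
To make the size tracking concrete, here is a short sketch of proportional attention as described in the ToMe paper, softmax(QK^T/√d + log s); the shapes and the proportional_attention name are our own illustrative choices.

```python
import torch

def proportional_attention(q, k, v, size):
    # q, k, v: (heads, N, d); size: (N,) float patch counts from merging.
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    # Bias each key's column by log(size): a token standing in for s patches
    # then receives attention as if all s patches were still present.
    logits = logits + size.log()[None, None, :]
    return logits.softmax(dim=-1) @ v
```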

Performance Results

| Model | Baseline Throughput | ToMe Throughput | Accuracy Drop |
|-------|-------------------|-----------------|---------------|
| DeiT-S | 1,411 img/s | 2,783 img/s (+97%) | −0.2% |
| DeiT-B | 626 img/s | 1,280 img/s (+104%) | −0.3% |
| ViT-H (MAE) | 85 img/s | 198 img/s (+133%) | −0.2% |
| Stable Diffusion (U-Net transformer blocks) | 3.4 it/s | 5.4 it/s (+59%) | Imperceptible |

Applications and Extensions

- Stable Diffusion Acceleration: ToMe for SD reduces the attention tokens in the U-Net's transformer blocks, providing 1.5–2× speedup in image generation with imperceptible quality change (see the usage sketch after this list).
- Video Transformers: Temporal token merging (merging similar tokens across consecutive frames) achieves 5× speedup for video understanding models.
- SAM (Segment Anything): ToMe applied to SAM's image encoder reduces per-image encoding time significantly — enabling faster interactive segmentation.
- Training Efficiency: ToMe can also be applied during training to reduce memory and compute — enabling training of larger models in the same memory budget.
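
For Stable Diffusion specifically, the authors released an open-source package, tomesd, that applies the patch in one call. The following is a hedged usage sketch assuming the Hugging Face diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint; the ratio value is just an example, and the tomesd README is the authority on the current API.

```python
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Patch the pipeline's U-Net so its transformer blocks merge ~50% of
# tokens before self-attention; tomesd.remove_patch(pipe) undoes it.
tomesd.apply_patch(pipe, ratio=0.5)

image = pipe("a photo of an astronaut riding a horse").images[0]
```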

Token Merging is the elegantly simple inference accelerator that Vision Transformers deserved. Its core observation is that a pretrained model's own key representations can identify which tokens are redundant, enabling safe, nearly lossless elimination of computational redundancy without retraining, fine-tuning, or architectural modification.
