The distillation token is a learnable embedding used in DeiT (Data-efficient Image Transformers) that is trained to match a CNN teacher model's predictions, enabling Vision Transformers to achieve strong performance with significantly less training data by combining the transformer's global attention with the CNN teacher's inductive biases toward local features and translation equivariance.
What Is the Distillation Token?
- Definition: A trainable vector (same dimension as patch embeddings) added alongside the CLS token in the input sequence, specifically trained to match the output predictions of a pretrained CNN teacher model through knowledge distillation.
- DeiT Innovation: Introduced by Touvron et al. (Facebook AI, 2021) in "Training data-efficient image transformers & distillation through attention" as the key innovation enabling ViTs to train effectively on ImageNet-1K alone (without JFT-300M).
- Dual Token System: The input sequence becomes [CLS, distill, patch_1, ..., patch_N], with two special tokens: CLS is trained on ground-truth labels and distill on teacher predictions (see the sketch after this list).
- Teacher Model: Typically a strong CNN, such as the RegNetY-16GF used in the DeiT paper, that provides the prediction targets for the distillation token.
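As a rough illustration of the dual-token input, here is a minimal PyTorch-style sketch (names such as DeiTEmbedding are illustrative, not the official DeiT code) of prepending the CLS and distillation tokens to the patch embeddings:

```python
import torch
import torch.nn as nn

class DeiTEmbedding(nn.Module):
    """Sketch: prepend the CLS and distillation tokens to the patch embeddings."""
    def __init__(self, num_patches: int, embed_dim: int = 768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # trained on ground-truth labels
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # trained on teacher predictions
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, embed_dim)
        b = patch_embeddings.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        # Sequence: [CLS, distill, patch_1, ..., patch_N]
        tokens = torch.cat([cls, dist, patch_embeddings], dim=1)
        return tokens + self.pos_embed
```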
Why the Distillation Token Matters
- Data Efficiency: Original ViT required JFT-300M (300M images) to outperform CNNs — DeiT with distillation matches or exceeds ViT performance using only ImageNet-1K (1.28M images), a 234× data reduction.
- Inductive Bias Transfer: CNNs have built-in translation equivariance and locality bias — the distillation token transfers these inductive biases to the transformer without modifying its architecture.
- Complementary Representations: The CLS token and distillation token learn different representations — CLS optimizes for ground truth labels while distill captures the teacher's learned feature preferences, and their combination is stronger than either alone.
- No Architecture Change: Distillation is achieved by simply adding one extra token and one extra loss term — the transformer architecture itself remains unmodified.
- Training Speed: DeiT with distillation converges faster than standard ViT training, reducing the compute budget needed for competitive vision transformer training.
How Distillation Token Works
Training Setup:
- Teacher: Pretrained CNN (e.g., RegNetY-16GF, which reaches 82.9% ImageNet top-1 with the training recipe used in the DeiT paper).
- Student: DeiT transformer with both CLS and distillation tokens.
- Two parallel loss functions computed simultaneously.
Loss Function:
- CLS Loss: Standard cross-entropy between CLS token prediction and ground truth label.
- Distillation Loss: Cross-entropy or KL divergence between distillation token prediction and teacher's soft predictions.
- Total Loss: L = (1-α) × L_cls + α × L_distill, where α balances the two losses (typically α = 0.5).
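A minimal PyTorch-style sketch of this combined objective, written here with the soft (KL-divergence) distillation term; the alpha and tau values are illustrative hyperparameters, not prescriptions from the paper beyond the α = 0.5 default noted above:

```python
import torch
import torch.nn.functional as F

def deit_loss(cls_logits, dist_logits, teacher_logits, labels, alpha=0.5, tau=3.0):
    """Combined objective: (1 - alpha) * CE on ground truth + alpha * distillation term.
    This variant uses soft distillation (KL divergence with temperature tau)."""
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_dist = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)  # rescale so the gradient magnitude is comparable across temperatures
    return (1 - alpha) * loss_cls + alpha * loss_dist
```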
Hard vs. Soft Distillation:
- Soft Distillation: The student matches the teacher's probability distribution (soft labels with temperature scaling); this is the standard knowledge distillation approach.
- Hard Distillation: The student matches the teacher's argmax prediction (hard label). Surprisingly, hard distillation works better for DeiT: it is both simpler and more effective.
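A hard-distillation version of the same objective (same assumptions as the previous sketch) simply replaces the KL term with cross-entropy against the teacher's argmax:

```python
import torch.nn.functional as F

def deit_loss_hard(cls_logits, dist_logits, teacher_logits, labels, alpha=0.5):
    """Hard distillation: the distillation head is trained on the teacher's argmax prediction."""
    hard_teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's predicted class per image
    loss_cls = F.cross_entropy(cls_logits, labels)                 # CLS head vs. ground-truth labels
    loss_dist = F.cross_entropy(dist_logits, hard_teacher_labels)  # distillation head vs. teacher labels
    return (1 - alpha) * loss_cls + alpha * loss_dist
```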
Inference:
- The classifier heads on the CLS and distillation tokens each produce a prediction; their softmax outputs are averaged (a simple late fusion) to give the final prediction.
- The combined prediction outperforms either token alone.
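A sketch of this late fusion at inference, assuming the model exposes separate logits for the two heads:

```python
import torch

def deit_predict(cls_logits: torch.Tensor, dist_logits: torch.Tensor) -> torch.Tensor:
    """Fuse the two heads by averaging their softmax distributions, then take the argmax."""
    probs = (cls_logits.softmax(dim=-1) + dist_logits.softmax(dim=-1)) / 2
    return probs.argmax(dim=-1)  # predicted class index per image
```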
DeiT Performance Results
| Model | Params | ImageNet Top-1 | Training Data | Teacher |
|-------|--------|---------------|---------------|---------|
| ViT-B/16 (no distill) | 86M | 77.9% | ImageNet-1K | None |
| DeiT-B (no distill) | 86M | 81.8% | ImageNet-1K | None |
| DeiT-B (distilled) | 87M | 83.4% | ImageNet-1K | RegNetY-16GF |
| DeiT-B (distilled, fine-tuned at 384) | 87M | 85.2% | ImageNet-1K | RegNetY-16GF |
Key Insights
- CNN Teachers > Transformer Teachers: Using a CNN as the teacher works better than using a larger transformer — the complementary inductive biases provide more information gain.
- Hard Labels Outperform Soft Labels: Counter-intuitively, hard-label distillation outperforms soft-label distillation for DeiT, suggesting the teacher's confident predictions provide cleaner learning signals.
- Token Specialization: Analysis shows the CLS token and distillation token attend to different image regions — CLS focuses on discriminative object parts while distill mirrors the CNN's attention patterns.
The distillation token is the key innovation that democratized Vision Transformer training — by learning from a CNN teacher through a simple additional token, DeiT proved that powerful ViTs could be trained on standard academic datasets without requiring Google-scale private data.