Distillation token

The distillation token is a special learnable embedding used in DeiT (Data-efficient Image Transformers) that learns from the predictions of a CNN teacher model. It lets Vision Transformers reach strong performance with significantly less training data by combining the transformer's global attention with the CNN teacher's inductive biases about local features and translation equivariance.

What Is the Distillation Token?

Why the Distillation Token Matters

How Distillation Token Works

Training Setup:
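
As a rough illustration of the setup, the distillation token joins the class token and the patch embeddings in the input sequence, so a 224x224 image split into 16x16 patches gives 196 patch tokens plus two special tokens. The helper below is a hypothetical sketch: the function name, the token labels, and the exact token ordering are assumptions for illustration, and it models only the sequence layout, not the embeddings.

```python
def build_token_layout(image_size=224, patch_size=16):
    """Return the symbolic token order fed to the transformer encoder."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Assumed ordering: class token, distillation token, then patch tokens.
    return ["[CLS]", "[DIST]"] + [f"patch_{i}" for i in range(num_patches)]


layout = build_token_layout()
print(len(layout))  # 196 patch tokens + 2 special tokens
```

Both special tokens interact with the patch tokens through self-attention in every layer, and each one feeds its own linear classification head at the output.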

Loss Function:
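
A sketch of the hard-label objective usually associated with DeiT is given below. The notation is an assumption made here for illustration: Z_cls and Z_dist are the logits of the class-token and distillation-token heads, Z_t the teacher logits, psi the softmax, and y the ground-truth label.

```latex
% Hard-label distillation: the class head is supervised by the true label y,
% the distillation head by the teacher's hard decision y_t.
\mathcal{L}_{\text{hard}}
  = \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}\big(\psi(Z_{\mathrm{cls}}),\, y\big)
  + \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}\big(\psi(Z_{\mathrm{dist}}),\, y_t\big),
\qquad y_t = \arg\max_c Z_t(c)
```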

Hard vs. Soft Distillation:
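
The difference between the two variants can be sketched in plain Python. Hard distillation trains the distillation head on the teacher's argmax label with cross-entropy, while soft distillation matches the teacher's temperature-softened distribution with a KL term. The function names, the KL direction (teacher relative to student), and the default values of `tau` and `lam` are assumptions for illustration, not a reference implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy of the softmax distribution against one target class."""
    return -math.log(softmax(logits)[target_index])


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    # The teacher's hard decision acts as a second label.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(cls_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))


def soft_distillation_loss(cls_logits, dist_logits, teacher_logits, label,
                           tau=3.0, lam=0.1):
    # Soften both distributions with temperature tau (illustrative default).
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(dist_logits, tau)
    # KL(teacher || student); zero-probability teacher terms contribute nothing.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return (1 - lam) * cross_entropy(cls_logits, label) + lam * tau ** 2 * kl


cls_logits, dist_logits, teacher_logits = [2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.0, 5.0, 1.0]
print(hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label=0))
print(soft_distillation_loss(cls_logits, dist_logits, teacher_logits, label=0))
```

Note that the hard variant needs only the teacher's predicted class, so the teacher's full output distribution does not have to be stored or softened.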

Inference:
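
At inference time the teacher is no longer needed, and the class-token and distillation-token heads can each produce a prediction. One common choice is late fusion of the two heads. The sketch below averages their softmax outputs; the function names are hypothetical, and some implementations average the logits instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_prediction(cls_logits, dist_logits):
    """Average the two heads' probabilities and return (class index, probabilities)."""
    p_cls = softmax(cls_logits)
    p_dist = softmax(dist_logits)
    fused = [(a + b) / 2 for a, b in zip(p_cls, p_dist)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


best, probs = fused_prediction([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(best, probs)
```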

DeiT Performance Results

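| Model | Params | ImageNet Top-1 | Training Data | Teacher |
|---|---|---|---|---|
| ViT-B/16 (no distillation) | 86M | 77.9% | ImageNet-1K | None |
| DeiT-B (no distillation) | 86M | 81.8% | ImageNet-1K | None |
| DeiT-B (distilled) | 87M | 83.4% | ImageNet-1K | RegNetY-16GF |
| DeiT-B (distilled, 384px fine-tuning, 1000 epochs) | 87M | 85.2% | ImageNet-1K | RegNetY-16GF |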

Key Insights

The distillation token is the key innovation that made Vision Transformer training broadly accessible. By learning from a CNN teacher through a single additional token, DeiT showed that strong ViTs can be trained on ImageNet-1K alone, without very large private datasets such as JFT-300M.
