Knowledge Distillation

Keywords: knowledge distillation training, teacher student network, soft label distillation, feature distillation intermediate, distillation temperature scaling

Knowledge Distillation is a model compression technique in which a large, high-performing teacher model transfers its learned representations to a smaller, more efficient student model. The student is trained to mimic the teacher's soft probability distributions rather than only the hard ground-truth labels, which lets it capture inter-class relationships and decision boundaries that hard labels cannot convey.

Distillation Framework:
- Soft Labels: the teacher's output probabilities (after softmax) contain rich information; for a cat image, the teacher might output [cat: 0.85, dog: 0.10, fox: 0.04, ...], and these relative probabilities tell the student that cats look somewhat like dogs, which hard one-hot labels [cat: 1, rest: 0] cannot express
- Temperature Scaling: softmax temperature T controls the entropy of the teacher's output distribution; higher T (2-20) softens the distribution, making small probabilities more visible; distillation loss uses temperature T; inference uses T=1
- Combined Loss: the student minimizes α·KL(teacher_soft, student_soft) + (1-α)·CE(ground_truth, student_hard), with typical α = 0.5-0.9; the soft-label term (commonly scaled by T² so gradient magnitudes stay comparable across temperatures) provides the teacher's dark knowledge, while the hard-label term anchors the student to ground truth (see the loss sketch after this list)
- Offline vs Online: offline distillation uses a pre-trained, frozen teacher whose outputs can even be pre-computed for the entire dataset; online distillation trains teacher and student simultaneously, allowing the teacher to continue improving during distillation
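
As a concrete illustration of the combined loss above, here is a minimal PyTorch-style sketch; the function name and the default values T = 4.0 and alpha = 0.7 are illustrative assumptions, not values prescribed by the technique.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
        """Combined loss: alpha * soft-label KL + (1 - alpha) * hard-label CE."""
        # Soften both distributions with temperature T.
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_soft_student = F.log_softmax(student_logits / T, dim=-1)

        # KL divergence between teacher and student soft distributions,
        # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
        soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

        # Standard cross-entropy against ground-truth labels (computed at T = 1).
        hard_loss = F.cross_entropy(student_logits, labels)

        return alpha * soft_loss + (1.0 - alpha) * hard_loss

In practice the teacher's logits would be computed under torch.no_grad(), so only the student receives gradients.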

Distillation Strategies:
- Logit Distillation (Hinton): student matches teacher's final softmax output distribution; simplest and most common; effective for classification tasks but loses intermediate feature information
- Feature Distillation (FitNets): student matches teacher's intermediate feature maps at selected layers; requires adaptation layers (1×1 convolutions) when teacher and student have different channel dimensions; captures richer representational knowledge than logit-only distillation (a feature-matching sketch follows this list)
- Attention Transfer: student matches teacher's attention maps (spatial or channel attention patterns); forces the student to focus on the same regions as the teacher, which is particularly effective for vision models
- Relational Distillation: student preserves the relationships between sample representations (e.g., pairwise distances or angles in embedding space) rather than matching individual outputs, capturing structural knowledge that is invariant to representation scale
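
The feature-distillation idea can be sketched in a few lines; FeatureAdapter and feature_distillation_loss are hypothetical names used for illustration, and the sketch assumes the teacher and student feature maps already share the same spatial resolution.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureAdapter(nn.Module):
        """1x1 convolution projecting student features to the teacher's channel width."""
        def __init__(self, student_channels, teacher_channels):
            super().__init__()
            self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

        def forward(self, student_feat):
            return self.proj(student_feat)

    def feature_distillation_loss(student_feat, teacher_feat, adapter):
        # L2 distance between adapted student features and frozen teacher features.
        return F.mse_loss(adapter(student_feat), teacher_feat.detach())

The adapter is trained jointly with the student and discarded at deployment, so it adds no inference cost.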

Advanced Techniques:
- Self-Distillation: a model distills knowledge from its own deeper layers to shallower layers, or from a previously trained copy of itself to a freshly initialized one (born-again training); no separate teacher is required; typically improves image-classification accuracy by 1-3%
- Multi-Teacher Distillation: an ensemble of diverse teacher models provides averaged or combined soft labels; the student learns from the collective knowledge of multiple specialists, and regions where the ensemble agrees produce a stronger teaching signal (a brief soft-label averaging sketch follows this list)
- Progressive Distillation: chain of progressively smaller students, each distilling from the previous one rather than directly from the large teacher; bridges large capacity gaps that single-step distillation struggles with
- Task-Specific Distillation: for LLMs, distillation on task-specific data (instruction following, code generation, reasoning) is often more efficient than general-purpose distillation; compact models such as DistilBERT, TinyLlama, and the Phi series illustrate the broader push toward small models built with large-model knowledge
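
For multi-teacher distillation, the simplest combination rule is a (possibly weighted) average of the teachers' softened distributions, which can then stand in for the single-teacher soft targets in a loss like the one sketched earlier; the function below is an illustrative sketch, not a standard API.

    import torch
    import torch.nn.functional as F

    def multi_teacher_soft_labels(teacher_logits_list, T=4.0, weights=None):
        """Average the soft distributions of several teachers, optionally weighted."""
        probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
        stacked = torch.stack(probs, dim=0)  # [num_teachers, batch, num_classes]
        if weights is None:
            return stacked.mean(dim=0)
        w = torch.tensor(weights, dtype=stacked.dtype, device=stacked.device)
        w = w / w.sum()  # normalize so the weights sum to 1
        return (w.view(-1, 1, 1) * stacked).sum(dim=0)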

Results and Applications:
- Compression Ratios: typically 4-10× parameter reduction with <2% accuracy loss; DistilBERT achieves 97% of BERT performance with 40% fewer parameters and 60% faster inference
- Cross-Architecture: teacher and student can have different architectures (e.g., CNN teacher → efficient-architecture student); knowledge transfers across architecture families
- Deployment: distilled models deployed on edge devices (phones, embedded systems) where teacher models are too large; enables state-of-the-art accuracy within strict latency and memory budgets

Knowledge distillation is one of the most practical techniques for deploying large-model capabilities on resource-constrained hardware: it transfers the dark knowledge embedded in teacher probability distributions to compact student models, bringing the accuracy benefits of massive models to every device and application.
