Advanced Knowledge Distillation

Advanced knowledge distillation covers techniques for transferring knowledge from a large teacher model to a smaller student model that go beyond basic soft-label matching. These include intermediate feature distillation, attention transfer, relational knowledge distillation, and task-specific distillation strategies. They let the student capture the structural knowledge the teacher has learned, not just its output predictions.

Beyond Basic Logit Distillation

Basic KD (Hinton et al., 2015) trains the student's temperature-softened softmax output to match the teacher's soft labels. Advanced methods distill knowledge from multiple levels of the network:

Teacher Model                    Student Model
┌─────────────┐                  ┌──────────────┐
│ Input Layer │                  │ Input Layer  │
├─────────────┤                  ├──────────────┤
│ Hidden 1    │──── Feature ────→│ Hidden 1     │
│ Hidden 2    │    Distillation  │              │
│ Hidden 3    │──── Attention ──→│ Hidden 2     │
│ Hidden 4    │    Transfer      │              │
├─────────────┤                  ├──────────────┤
│ Logits      │──── Logit KD ───→│ Logits       │
└─────────────┘                  └──────────────┘
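
For reference, the logit-level baseline at the bottom of the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not a canonical implementation: the function name `logit_kd_loss`, the `temperature` and `alpha` defaults, and the random tensors are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F


def logit_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a temperature-softened KL term."""
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    # KL(teacher || student); the T^2 factor keeps the gradient scale
    # comparable across temperatures.
    soft_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Illustrative data: batch of 8 samples, 10 classes (assumed shapes).
torch.manual_seed(0)
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = logit_kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The feature-level and attention-level terms described in the following sections are typically added to this loss with their own weights.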

Feature/Intermediate Layer Distillation

Method                                        | What is distilled                        | Loss
FitNets                                       | Hidden-layer activations                 | MSE(student_feat, teacher_feat) via adapter
Attention Transfer (AT)                       | Spatial attention maps                   | MSE between L2-normalized attention maps
PKT (Probabilistic Knowledge Transfer)        | Feature distribution in embedding space  | KL divergence
NST (Neuron Selectivity Transfer)             | Neuron activation distributions          | MMD (Maximum Mean Discrepancy)
CRD (Contrastive Representation Distillation) | Representation structure                 | Contrastive loss
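
As an illustration of the Attention Transfer row, here is a minimal PyTorch sketch. It assumes convolutional feature maps of shape (N, C, H, W) with matching spatial size, and it builds each attention map by averaging squared activations over channels. The names `attention_map` and `attention_transfer_loss` and the tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F


def attention_map(feat, p=2):
    """Collapse (N, C, H, W) activations to an L2-normalized (N, H*W) spatial map."""
    amap = feat.abs().pow(p).mean(dim=1)  # average |activation|^p over channels
    return F.normalize(amap.flatten(1), dim=1)


def attention_transfer_loss(student_feat, teacher_feat):
    # Channel counts may differ; only the spatial size (H, W) has to match.
    diff = attention_map(student_feat) - attention_map(teacher_feat.detach())
    return diff.pow(2).mean()


# Illustrative data: student has 32 channels, teacher has 128 (assumed shapes).
torch.manual_seed(0)
student_feat = torch.randn(4, 32, 8, 8, requires_grad=True)
teacher_feat = torch.randn(4, 128, 8, 8)
loss_at = attention_transfer_loss(student_feat, teacher_feat)
loss_at.backward()
```

Because the channel dimension is averaged away before comparison, this loss needs no projection layer even when teacher and student widths differ.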

Adapter layers (1×1 conv or linear projection) bridge dimension mismatches between teacher and student hidden layers.
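
A FitNets-style hint loss with a 1×1 convolutional adapter might look like the sketch below. The class name `HintLoss`, the channel counts, and the tensor shapes are assumptions for illustration, and the sketch assumes teacher and student feature maps already share the same spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HintLoss(nn.Module):
    """Project student features to the teacher's width, then apply MSE."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv adapter, trained jointly with the student; it is
        # typically only needed during distillation.
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        projected = self.adapter(student_feat)
        # The teacher is frozen, so no gradient flows into its features.
        return F.mse_loss(projected, teacher_feat.detach())


# Illustrative data: 32-channel student layer, 128-channel teacher layer.
torch.manual_seed(0)
hint = HintLoss(student_channels=32, teacher_channels=128)
student_feat = torch.randn(4, 32, 8, 8, requires_grad=True)
teacher_feat = torch.randn(4, 128, 8, 8)
loss_hint = hint(student_feat, teacher_feat)
loss_hint.backward()
```

For transformer hidden states, the same idea applies with `nn.Linear` in place of the 1×1 convolution.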

Relational Knowledge Distillation (RKD)

Instead of matching individual outputs, RKD transfers the relationships between samples:

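import torch
import torch.nn.functional as F

# Distance-wise RKD: preserve pairwise distance structure.
# teacher_embeddings and student_embeddings are assumed to be (N, D_t) and (N, D_s) tensors.
with torch.no_grad():  # the teacher is frozen
    teacher_dist = torch.cdist(teacher_embeddings, teacher_embeddings)  # NxN matrix
student_dist = torch.cdist(student_embeddings, student_embeddings)
# Normalize by the mean distance so the two embedding spaces are scale-comparable.
loss_rkd = F.smooth_l1_loss(student_dist / student_dist.mean(),
                            teacher_dist / teacher_dist.mean())  # Huber loss, delta = 1

# Angle-wise RKD: preserve angular relationships among triplets of samples.
# Captures higher-order structural information.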
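
The angle-wise variant mentioned above can be sketched as follows. This is a minimal illustration that materializes all N³ triplets of a batch, so it only suits modest batch sizes. The names `triplet_cosines` and `rkd_angle_loss` and the embedding sizes are hypothetical.

```python
import torch
import torch.nn.functional as F


def triplet_cosines(embeddings):
    """Cosine of the angle at anchor j for every triplet (i, j, k); shape (N, N, N)."""
    # diff[j, i] = e_i - e_j, the vector from anchor j to sample i.
    diff = embeddings.unsqueeze(0) - embeddings.unsqueeze(1)
    unit = F.normalize(diff, p=2, dim=2)  # zero vectors (i == j) stay zero
    # result[j, i, k] = <unit[j, i], unit[j, k]>
    return torch.bmm(unit, unit.transpose(1, 2))


def rkd_angle_loss(student_embeddings, teacher_embeddings):
    with torch.no_grad():  # the teacher is frozen
        teacher_cos = triplet_cosines(teacher_embeddings)
    student_cos = triplet_cosines(student_embeddings)
    return F.smooth_l1_loss(student_cos, teacher_cos)  # Huber loss, delta = 1


# Illustrative data: embedding widths differ between teacher and student.
torch.manual_seed(0)
teacher_embeddings = torch.randn(16, 256)
student_embeddings = torch.randn(16, 64, requires_grad=True)
loss_angle = rkd_angle_loss(student_embeddings, teacher_embeddings)
loss_angle.backward()
```

Since only angles between samples are compared, the loss is unaffected by translating or uniformly rescaling either embedding space, and teacher and student embeddings do not need the same dimensionality.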

LLM-Specific Distillation

Distilling large language models has unique considerations:

Multi-Teacher and Self-Distillation

Advanced knowledge distillation is a leading model compression technique for deploying LLM-class capability on resource-constrained devices. By transferring structural, relational, and intermediate representations in addition to predictions, modern distillation can reach compression ratios of 10-100× while retaining roughly 90-98% of teacher performance, depending on the task and architecture.
