Advanced Knowledge Distillation

Keywords: knowledge distillation advanced, feature distillation, logit distillation, intermediate layer distillation, distillation loss

Advanced Knowledge Distillation encompasses sophisticated techniques for transferring knowledge from a large teacher model to a smaller student model beyond basic soft-label matching. These include intermediate feature distillation, attention transfer, relational knowledge distillation, and task-specific distillation strategies that let the student capture structural knowledge the teacher has learned, not just its output predictions.

Beyond Basic Logit Distillation

Basic KD (Hinton et al., 2015) matches the student's temperature-softened softmax output to the teacher's soft labels. Advanced methods distill knowledge at multiple levels of the network:

```
Teacher Model                          Student Model
┌─────────────┐                        ┌──────────────┐
│ Input Layer │                        │ Input Layer  │
├─────────────┤                        ├──────────────┤
│ Hidden 1    │──── Feature ──────────→│ Hidden 1     │
│ Hidden 2    │     Distillation       │              │
│ Hidden 3    │──── Attention ────────→│ Hidden 2     │
│ Hidden 4    │     Transfer           │              │
├─────────────┤                        ├──────────────┤
│ Logits      │──── Logit KD ─────────→│ Logits       │
└─────────────┘                        └──────────────┘
```
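
For reference, the logit-KD arrow in the diagram corresponds to the basic Hinton-style objective: a temperature-softened KL term toward the teacher blended with the usual hard-label cross-entropy. A minimal PyTorch-style sketch (the function name and hyperparameter values are illustrative):

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD: softened KL toward the teacher plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```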

Feature/Intermediate Layer Distillation

| Method | What is Distilled | Loss |
|--------|------------------|------|
| FitNets | Hidden layer activations | MSE(student_feat, teacher_feat) via adapter |
| Attention Transfer (AT) | Attention maps (spatial) | MSE on attention map norm |
| PKT (Probabilistic KT) | Feature distribution in embedding space | KL divergence |
| NST (Neuron Selectivity) | Neuron activation distributions | MMD (Maximum Mean Discrepancy) |
| CRD (Contrastive Rep. Dist.) | Representation structure | Contrastive loss |
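
To make one row of the table concrete, attention transfer compares L2-normalized spatial attention maps computed from convolutional features. A minimal sketch, assuming PyTorch-style feature maps of shape (batch, channels, H, W) with matching spatial sizes:

```python
import torch.nn.functional as F

def attention_transfer_loss(student_feat, teacher_feat):
    """AT loss: MSE between L2-normalized spatial attention maps of conv features."""
    def attention_map(feat):                        # feat: (batch, channels, H, W)
        a = feat.pow(2).mean(dim=1)                 # average squared activations over channels
        return F.normalize(a.flatten(1), dim=1)     # L2-normalize each sample's (H*W) map
    return F.mse_loss(attention_map(student_feat), attention_map(teacher_feat))
```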

Adapter layers (1×1 conv or linear projection) bridge dimension mismatches between teacher and student hidden layers.
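
A minimal FitNets-style sketch of such an adapter, assuming PyTorch and 4-D convolutional features (the class name is illustrative; spatial dimensions are assumed to match):

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """1x1 conv projecting student feature maps to the teacher's channel width."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # FitNets-style hint loss on the projected activations
        # (assumes matching spatial dimensions)
        return F.mse_loss(self.proj(student_feat), teacher_feat)
```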

Relational Knowledge Distillation (RKD)

Instead of matching individual outputs, RKD transfers the relationships between samples:

```python
import torch
import torch.nn.functional as F

# Distance-wise RKD: preserve pairwise distance structure
teacher_dist = torch.cdist(teacher_embeddings, teacher_embeddings)  # N x N matrix
student_dist = torch.cdist(student_embeddings, student_embeddings)
loss_rkd = F.smooth_l1_loss(student_dist / student_dist.mean(),     # Huber-style loss on
                            teacher_dist / teacher_dist.mean())     # mean-normalized distances

# Angle-wise RKD: preserve angular relationships among triplets
# Captures higher-order structural information
```
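
The angle-wise variant noted in the comments above can be sketched as follows, matching the cosine of the angle formed at each anchor embedding by every pair of other samples (assumes PyTorch; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def angle_wise_rkd_loss(student_emb, teacher_emb):
    """Match cosines of the angles formed at each anchor by pairs of other samples."""
    def triplet_angles(e):                            # e: (N, D) embeddings
        diff = e.unsqueeze(0) - e.unsqueeze(1)        # (N, N, D) difference vectors
        unit = F.normalize(diff, p=2, dim=2)          # unit directions from each anchor
        return torch.bmm(unit, unit.transpose(1, 2))  # (N, N, N) cosine similarities
    # Note: O(N^3) memory, so this is applied per mini-batch
    return F.smooth_l1_loss(triplet_angles(student_emb), triplet_angles(teacher_emb))
```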

LLM-Specific Distillation

Distilling large language models has unique considerations:

- Black-box distillation: When teacher weights are inaccessible (e.g., GPT-4 → open model), use only the teacher's generated outputs. Techniques include instruction-following data generation and chain-of-thought distillation (distilling reasoning traces, not just final answers).
- White-box distillation: With access to teacher weights, match logit distributions over the full vocabulary (a 50K+ dimensional KL divergence) and intermediate transformer layer representations, as sketched after this list.
- Progressive distillation: Gradually reduce model size through multiple distillation stages rather than one large compression step.
- Distillation for specific capabilities: Selectively distill math reasoning, code generation, or instruction following by curating task-specific transfer sets.
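
As a sketch of the white-box case, token-level logit matching over the full vocabulary might look like the following (assumes PyTorch-style logits of shape (batch, seq_len, vocab_size); names and the temperature value are illustrative):

```python
import torch.nn.functional as F

def whitebox_lm_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level KL divergence over the full vocabulary.

    Both logits tensors have shape (batch, seq_len, vocab_size) and come from
    forward passes of the student and teacher on the same input ids.
    """
    T = temperature
    vocab = student_logits.size(-1)
    student_logp = F.log_softmax(student_logits / T, dim=-1).reshape(-1, vocab)
    teacher_p = F.softmax(teacher_logits / T, dim=-1).reshape(-1, vocab)
    # batchmean over flattened tokens gives a per-token average; T^2 rescales gradients
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (T * T)
```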

Multi-Teacher and Self-Distillation

- Multi-teacher: Ensemble of specialists, each contributing expertise. Student learns from a domain-weighted combination of teacher outputs (see the sketch after this list).
- Self-distillation: Model distills knowledge from its own deeper layers to shallower layers, or from a previous training epoch to the current one.
- Born-Again Networks: Iteratively distill — student becomes the new teacher for the next round, often surpassing the original teacher.
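
A minimal sketch of the domain-weighted combination used in the multi-teacher setting (assumes PyTorch; the weighting scheme here is illustrative and could come from fixed priors or a learned domain classifier):

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_targets(teacher_logits_list, domain_weights, T=2.0):
    """Blend several teachers' softened distributions into one soft target.

    teacher_logits_list: list of (batch, num_classes) tensors, one per teacher.
    domain_weights: per-teacher relevance weights (fixed priors or learned).
    """
    w = torch.tensor(domain_weights, dtype=torch.float)
    w = w / w.sum()                                               # normalize weights
    probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
    return sum(wi * p for wi, p in zip(w, probs))                 # combined soft target
```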

Advanced knowledge distillation is the primary model compression technique enabling deployment of LLM-class intelligence on resource-constrained devices. By transferring not just predictions but structural, relational, and intermediate representations, modern distillation achieves compression ratios of 10-100× while retaining 90-98% of teacher performance.
