Knowledge Distillation
What is Distillation?
Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model, transferring the teacher's knowledge while reducing model size and inference cost.
Basic Distillation
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    # Soft targets from teacher
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Hard targets (actual labels)
    hard_loss = F.cross_entropy(student_logits, labels)
    # Combined: scale the soft loss by T^2 so its gradient magnitude
    # stays comparable to the hard loss as the temperature changes
    return alpha * soft_loss * (temperature ** 2) + (1 - alpha) * hard_loss
```
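To build intuition for the `temperature` parameter, here is a standalone sketch with made-up logits showing how dividing by T flattens the teacher's distribution, exposing the relative probabilities of the non-argmax classes:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.0])

sharp = F.softmax(logits, dim=-1)       # T = 1: nearly one-hot
soft = F.softmax(logits / 4.0, dim=-1)  # T = 4: much flatter

print(sharp)  # the largest class dominates
print(soft)   # the smaller classes now carry visible probability mass
```

The softened distribution is what lets the student learn the teacher's similarity structure between classes, not just its top prediction.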
Training Loop
```python
import torch

student = SmallModel()
teacher = LargeModel()
teacher.eval()  # Freeze teacher

optimizer = torch.optim.Adam(student.parameters())

for batch in dataloader:
    inputs, labels = batch
    # Teacher forward pass without gradient tracking
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
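`SmallModel`, `LargeModel`, and `dataloader` above are placeholders. A self-contained sketch of a single distillation step, using throwaway linear models and random data (hypothetical shapes: 16 input features, 10 classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Throwaway stand-ins for the teacher and student (hypothetical sizes)
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Linear(16, 10)
teacher.eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# One training step on a random batch
inputs = torch.randn(8, 16)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    teacher_logits = teacher(inputs)

loss = distillation_loss(student(inputs), teacher_logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```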
LLM-Specific Distillation
Response Distillation
Train on teacher outputs:
```python
# Generate training data from the teacher
training_data = []
for prompt in prompts:
    response = teacher.generate(prompt)
    training_data.append((prompt, response))

# Fine-tune the student on this data
student.finetune(training_data)
```
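The (prompt, response) pairs are typically serialized into whatever supervised fine-tuning format the student's training code expects. A minimal sketch, assuming a simple instruction/output JSONL schema (the field names are hypothetical, not a fixed standard):

```python
import json

def to_sft_record(prompt, response):
    # Hypothetical schema; match whatever your fine-tuning framework expects
    return {"instruction": prompt, "output": response}

pairs = [
    ("Explain overfitting in one sentence.",
     "Overfitting is when a model memorizes its training data instead of generalizing."),
]

jsonl = "\n".join(json.dumps(to_sft_record(p, r)) for p, r in pairs)
print(jsonl)
```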
Intermediate Layer Distillation
Match hidden states, not just outputs:
```python
def layer_distillation_loss(student_hidden, teacher_hidden):
    # Project student hidden states to the teacher's dimension
    projected = student.projector(student_hidden)
    # MSE between intermediate representations
    return F.mse_loss(projected, teacher_hidden)
```
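`student.projector` above is assumed to exist on the student; concretely, it can be a learned linear map from the student's hidden size to the teacher's. A sketch with hypothetical sizes (student 256, teacher 768):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical hidden sizes: student 256, teacher 768
projector = nn.Linear(256, 768)

student_hidden = torch.randn(4, 32, 256)  # (batch, seq_len, student_dim)
teacher_hidden = torch.randn(4, 32, 768)  # (batch, seq_len, teacher_dim)

# The projector is trained jointly with the student
loss = F.mse_loss(projector(student_hidden), teacher_hidden)
```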
Distillation Variants
| Variant | Description |
|---------|-------------|
| Response distillation | Train on teacher outputs |
| Feature distillation | Match intermediate features |
| Attention distillation | Match attention patterns |
| Self-distillation | Distill a model into a smaller version of the same architecture |
Popular Distilled Models
| Student | Teacher | Size Reduction |
|---------|---------|----------------|
| DistilBERT | BERT | 40% smaller |
| TinyLlama | Llama | 90% smaller |
| Phi | GPT-3.5 (synthetic data) | Trained from scratch on teacher-generated data |
Benefits
| Benefit | Description |
|---------|-------------|
| Speed | Smaller models run faster |
| Memory | Lower deployment costs |
| Deployment | Edge/mobile friendly |
| Privacy | Can run locally |
Best Practices
- Use high temperature (2-20) for soft labels
- Train on diverse data
- Consider intermediate layer matching
- Evaluate on task-specific benchmarks
- Try progressive distillation for very small students