Knowledge Distillation
What is Distillation?
Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model, transferring the teacher's knowledge while reducing model size and inference cost.
Basic Distillation
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    # Soft targets from teacher
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Hard targets (actual labels)
    hard_loss = F.cross_entropy(student_logits, labels)
    # Combined: scale the soft loss by T^2 so its gradient magnitude
    # stays comparable to the hard loss as the temperature changes
    return alpha * soft_loss * (temperature ** 2) + (1 - alpha) * hard_loss
```
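To build intuition for the `temperature` parameter, here is a standalone sketch with made-up logits showing how dividing by T flattens the teacher's distribution, exposing the relative probabilities of the non-argmax classes:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.0])

sharp = F.softmax(logits, dim=-1)       # T = 1: nearly one-hot
soft = F.softmax(logits / 4.0, dim=-1)  # T = 4: much flatter

print(sharp)  # the largest class dominates
print(soft)   # the smaller classes now carry visible probability mass
```

The softened distribution is what lets the student learn the teacher's similarity structure between classes, not just its top prediction.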
Training Loop
```python
import torch

student = SmallModel()
teacher = LargeModel()
teacher.eval()  # Freeze teacher

optimizer = torch.optim.Adam(student.parameters())

for batch in dataloader:
    inputs, labels = batch
    # Teacher forward pass without gradient tracking
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
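`SmallModel`, `LargeModel`, and `dataloader` above are placeholders. A self-contained sketch of a single distillation step, using throwaway linear models and random data (hypothetical shapes: 16 input features, 10 classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Throwaway stand-ins for the teacher and student (hypothetical sizes)
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Linear(16, 10)
teacher.eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# One training step on a random batch
inputs = torch.randn(8, 16)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    teacher_logits = teacher(inputs)

loss = distillation_loss(student(inputs), teacher_logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```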
LLM-Specific Distillation
Response Distillation
Train on teacher outputs:
```python
# Generate training data from the teacher
training_data = []
for prompt in prompts:
    response = teacher.generate(prompt)
    training_data.append((prompt, response))

# Fine-tune the student on this data
student.finetune(training_data)
```
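The (prompt, response) pairs are typically serialized into whatever supervised fine-tuning format the student's training code expects. A minimal sketch, assuming a simple instruction/output JSONL schema (the field names are hypothetical, not a fixed standard):

```python
import json

def to_sft_record(prompt, response):
    # Hypothetical schema; match whatever your fine-tuning framework expects
    return {"instruction": prompt, "output": response}

pairs = [
    ("Explain overfitting in one sentence.",
     "Overfitting is when a model memorizes its training data instead of generalizing."),
]

jsonl = "\n".join(json.dumps(to_sft_record(p, r)) for p, r in pairs)
print(jsonl)
```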
Intermediate Layer Distillation
Match hidden states, not just outputs:
```python
def layer_distillation_loss(student_hidden, teacher_hidden):
    # Project student hidden states to the teacher's dimension
    projected = student.projector(student_hidden)
    # MSE between intermediate representations
    return F.mse_loss(projected, teacher_hidden)
```
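`student.projector` above is assumed to exist on the student; concretely, it can be a learned linear map from the student's hidden size to the teacher's. A sketch with hypothetical sizes (student 256, teacher 768):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical hidden sizes: student 256, teacher 768
projector = nn.Linear(256, 768)

student_hidden = torch.randn(4, 32, 256)  # (batch, seq_len, student_dim)
teacher_hidden = torch.randn(4, 32, 768)  # (batch, seq_len, teacher_dim)

# The projector is trained jointly with the student
loss = F.mse_loss(projector(student_hidden), teacher_hidden)
```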
Distillation Variants
| Variant | Description |
|---------|-------------|
| Response distillation | Train on teacher outputs |
| Feature distillation | Match intermediate features |
| Attention distillation | Match attention patterns |
| Self-distillation | Distill a model into a smaller version of the same architecture |
Popular Distilled Models
| Student | Teacher | Size Reduction |
|---------|---------|----------------|
| DistilBERT | BERT | 40% smaller |
| TinyLlama | Llama | 90% smaller |
| Phi | GPT-3.5 (synthetic data) | Trained from scratch on teacher-generated data |
Benefits
| Benefit | Description |
|---------|-------------|
| Speed | Smaller models run faster |
| Memory | Lower deployment costs |
| Deployment | Edge/mobile friendly |
| Privacy | Can run locally |
Best Practices
- Use high temperature (2-20) for soft labels
- Train on diverse data
- Consider intermediate layer matching
- Evaluate on task-specific benchmarks
- Try progressive distillation for very small students