Knowledge Distillation

Keywords: distillation, student-teacher, compression

What is Distillation?
Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model, transferring the teacher's knowledge while reducing model size and inference cost.

Basic Distillation
```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    # Soft targets from the teacher, softened by the temperature
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # Hard targets (actual labels)
    hard_loss = F.cross_entropy(student_logits, labels)

    # Combined: the soft loss is scaled by T^2 to keep its gradient magnitude comparable
    return alpha * soft_loss * (temperature ** 2) + (1 - alpha) * hard_loss
```

Training Loop
```python
import torch

student = SmallModel()
teacher = LargeModel()
teacher.eval()  # Teacher is used only for inference

optimizer = torch.optim.Adam(student.parameters())

for batch in dataloader:
    inputs, labels = batch

    # Teacher forward pass with gradients disabled (keeps the teacher frozen)
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    loss = distillation_loss(student_logits, teacher_logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

LLM-Specific Distillation

Response Distillation
Train on teacher outputs:
```python
# Pseudocode: build a synthetic dataset from the teacher's generations,
# then fine-tune the student on it with ordinary supervised training
training_data = []
for prompt in prompts:
    response = teacher.generate(prompt)
    training_data.append((prompt, response))

# Fine-tune the student on the (prompt, response) pairs
student.finetune(training_data)
```
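For a slightly more concrete version of the same idea, here is a rough sketch using Hugging Face transformers; the checkpoint name, generation settings, and the prompts list are assumptions for illustration, not a prescribed recipe.

```python
# Assumed setup: "large-teacher-model" is a placeholder checkpoint name,
# and prompts is an existing list of prompt strings.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "large-teacher-model"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

pairs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=256)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    pairs.append({"prompt": prompt, "response": response})

# pairs can then serve as a supervised fine-tuning dataset for the student
```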

Intermediate Layer Distillation
Match hidden states, not just outputs:
```python
def layer_distillation_loss(student_hidden, teacher_hidden):
    # Project student hidden states up to the teacher's hidden dimension
    # (assumes the student carries a projection layer, e.g. nn.Linear(d_student, d_teacher))
    projected = student.projector(student_hidden)

    # MSE between intermediate representations
    return F.mse_loss(projected, teacher_hidden)
```
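When the student has fewer layers than the teacher, a common choice is to pair each student layer with every k-th teacher layer and average the per-layer losses. The following is a minimal sketch under that assumption; multi_layer_distillation_loss, the uniform layer mapping, and the per-layer projectors list are illustrative, not part of the original recipe.

```python
import torch.nn.functional as F

def multi_layer_distillation_loss(student_hiddens, teacher_hiddens, projectors):
    # student_hiddens: list of [batch, seq, d_student] tensors, one per student layer
    # teacher_hiddens: list of [batch, seq, d_teacher] tensors, one per teacher layer
    # projectors: one nn.Linear(d_student, d_teacher) per student layer (assumed)
    ratio = len(teacher_hiddens) // len(student_hiddens)
    total = 0.0
    for i, (s_hidden, proj) in enumerate(zip(student_hiddens, projectors)):
        t_hidden = teacher_hiddens[i * ratio]  # pair student layer i with teacher layer i*ratio
        total = total + F.mse_loss(proj(s_hidden), t_hidden)
    return total / len(student_hiddens)
```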

Distillation Variants
| Variant | Description |
|---------|-------------|
| Response distillation | Train on teacher outputs |
| Feature distillation | Match intermediate features |
| Attention distillation | Match attention patterns (see the sketch after this table) |
| Self-distillation | Distill larger to smaller version of same arch |
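
To illustrate the attention-distillation variant, the student's attention maps are pushed toward the teacher's. A minimal sketch, assuming both models expose attention probabilities of shape [batch, heads, seq, seq] with matching head counts (attention_distillation_loss is an illustrative name):

```python
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn):
    # student_attn, teacher_attn: attention probabilities, shape [batch, heads, seq, seq]
    # MSE between the student's and teacher's attention maps
    return F.mse_loss(student_attn, teacher_attn)
```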

Popular Distilled Models
| Student | Teacher | Size Reduction |
|---------|---------|----------------|
| DistilBERT | BERT | 40% smaller |
| TinyLlama | Llama | 90% smaller |
| Phi | Unknown | Efficient from scratch |

Benefits
| Benefit | Description |
|---------|-------------|
| Speed | Smaller models run faster |
| Memory | Lower deployment costs |
| Deployment | Edge/mobile friendly |
| Privacy | Can run locally |

Best Practices
- Use a high temperature (2-20) to soften the teacher's labels (see the sketch after this list)
- Train on diverse data
- Consider intermediate layer matching
- Evaluate on task-specific benchmarks
- Try progressive distillation for very small students
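
To make the temperature effect concrete, here is a minimal sketch showing how dividing logits by a temperature before the softmax spreads probability mass across non-argmax classes; the example logits are purely illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.5])   # illustrative teacher logits for 3 classes

p_t1 = F.softmax(logits / 1.0, dim=-1)   # T=1: sharp, roughly [0.93, 0.05, 0.03]
p_t4 = F.softmax(logits / 4.0, dim=-1)   # T=4: softer, roughly [0.53, 0.25, 0.22]

# The softer distribution exposes class-similarity information ("dark knowledge"),
# which is what the student learns from via the KL term in distillation_loss.
print(p_t1, p_t4)
```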
