Knowledge distillation loss

Keywords: distillation loss, soft target, kd, temperature, kl divergence

Knowledge distillation loss matches student model outputs to teacher model soft targets: the probability distributions (soft labels) produced by a larger teacher model, rather than hard labels. Training on these distributions transfers knowledge that captures richer information about the relationships between classes.

What Is Distillation Loss?

- Definition: Loss that encourages student to match teacher predictions.
- Soft Targets: Teacher's probability distribution over classes.
- Temperature: Softens distributions to reveal more structure.
- Combination: Usually combined with standard task loss.

Why Soft Targets Work

- Rich Information: "Cat 0.7, Tiger 0.2, Dog 0.1" vs. just "Cat."
- Dark Knowledge: Wrong answers reveal learned relationships.
- Regularization: Smoother targets prevent overconfident students.
- Efficient Learning: Student learns patterns, not just labels.

Distillation Loss Formula

Standard KD Loss:
```
L_total = α × L_hard + (1 - α) × L_soft

Where:
  L_hard = CrossEntropy(student_logits, true_labels)
  L_soft = KL_Divergence(
      softmax(student_logits / T),
      softmax(teacher_logits / T)
  ) × T²

Parameters:
- T: Temperature (typically 2-20)
- α: Balance factor (typically 0.1-0.5)
```
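
For concreteness, here is the formula evaluated with illustrative numbers (the loss values below are made up for the example):

```python
alpha = 0.3
hard_loss = 1.20  # cross-entropy vs. true labels (illustrative value)
soft_loss = 0.45  # T²-scaled KL vs. teacher (illustrative value)

total = alpha * hard_loss + (1 - alpha) * soft_loss
# 0.3 × 1.20 + 0.7 × 0.45 = 0.36 + 0.315 = 0.675
```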

Temperature Effect:
```
T=1 (sharp):
Cat: 0.95, Dog: 0.03, Bird: 0.02

T=5 (soft):
Cat: 0.45, Dog: 0.30, Bird: 0.25

Higher T → softer distributions → more dark knowledge
```
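
The effect is easy to reproduce. A minimal sketch (the logit values are chosen purely for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for [cat, dog, bird]
logits = torch.tensor([3.0, 0.5, 0.1])

for T in (1.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

# Higher T flattens the distribution, exposing how the teacher ranks
# the "wrong" classes relative to each other (the dark knowledge).
```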

Implementation

PyTorch Distillation Loss:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = nn.CrossEntropyLoss()
        self.kl_loss = nn.KLDivLoss(reduction="batchmean")

    def forward(self, student_logits, teacher_logits, labels):
        # Hard loss (standard cross-entropy against the true labels)
        hard_loss = self.ce_loss(student_logits, labels)

        # Soft loss (KL divergence between temperature-scaled distributions);
        # KLDivLoss expects log-probabilities for the student input
        soft_student = F.log_softmax(student_logits / self.temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / self.temperature, dim=-1)
        # Scale by T² so soft-loss gradients stay comparable across temperatures
        soft_loss = self.kl_loss(soft_student, soft_teacher) * (self.temperature ** 2)

        # Combined loss
        return self.alpha * hard_loss + (1 - self.alpha) * soft_loss

# Usage
criterion = DistillationLoss(temperature=4.0, alpha=0.5)

for inputs, labels in dataloader:
    # Teacher runs in inference mode only
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)

    student_logits = student_model(inputs)
    loss = criterion(student_logits, teacher_logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

LLM Distillation

Sequence-Level Distillation:
```python
def llm_distillation_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """Distillation for language models.

    Logits have shape [batch, seq_len, vocab_size].
    """
    # Soft targets from teacher
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Student log probabilities
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence per position, averaged over the flattened batch
    kl_div = F.kl_div(
        student_log_probs.view(-1, student_log_probs.size(-1)),
        teacher_probs.view(-1, teacher_probs.size(-1)),
        reduction="batchmean",
    )

    # Scale by T²
    soft_loss = kl_div * (temperature ** 2)

    # Hard loss (positions labeled -100 are ignored here; note that
    # they still contribute to the KL term above unless masked)
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return 0.5 * hard_loss + 0.5 * soft_loss
```

Response-Based Distillation:
```python
# Teacher generates a response
teacher_response = teacher.generate(prompt)

# Student is fine-tuned to reproduce that response with standard
# next-token cross-entropy (a runnable sketch follows below)
student_loss = student.forward(prompt + teacher_response)

# Often more practical for large LLMs, since only the teacher's
# generated text is needed, not its logits
```
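
A minimal runnable version of this pattern with Hugging Face transformers. The model names and `max_new_tokens` are illustrative stand-ins, not a recommendation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")       # stand-in teacher
teacher = AutoModelForCausalLM.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "Explain knowledge distillation in one sentence."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Teacher generates the target response (prompt + continuation)
response_ids = teacher.generate(prompt_ids, max_new_tokens=32)

# Student trains with ordinary next-token cross-entropy on the teacher's
# output; masking the prompt tokens with -100 is a common choice
labels = response_ids.clone()
labels[:, : prompt_ids.size(1)] = -100
loss = student(response_ids, labels=labels).loss
loss.backward()
```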

Distillation Variants

```
Method                 | What to Match
-----------------------|------------------------------
Logit distillation     | Final-layer logits
Feature distillation   | Intermediate representations
Attention distillation | Attention maps
Hidden state matching  | Layer-wise hidden states
Response distillation  | Generated outputs
```
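
As an example of the second row, feature distillation is often implemented as an MSE between teacher and student hidden states, with a learned projection when the widths differ. A minimal sketch (the hidden sizes 768 and 384 are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Project student features (width 384) into the teacher's space (width 768)
proj = nn.Linear(384, 768)

def feature_distillation_loss(student_hidden, teacher_hidden):
    """MSE between projected student and teacher hidden states."""
    return F.mse_loss(proj(student_hidden), teacher_hidden)

# Dummy activations shaped [batch, seq_len, hidden]
loss = feature_distillation_loss(torch.randn(2, 16, 384),
                                 torch.randn(2, 16, 768))
```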

Hyperparameter Guidelines

```
Parameter    | Typical Values    | Notes
-------------|-------------------|-------------------------------
Temperature  | 2-10              | Higher reveals more knowledge
Alpha        | 0.1-0.5           | Balance soft/hard loss
Student size | 0.1x-0.5x teacher | Smaller students need higher T
Training     | 1-3× normal       | More epochs often help
```

Choosing Temperature:
```
Low T (1-3):   When teacher is very confident
High T (5-20): When teacher has nuanced predictions
Start:         T=4 is a common default
Tune:          Based on validation performance
```
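
In practice this tuning is a small sweep over candidate temperatures. A sketch, where `evaluate_student` is a hypothetical helper that trains a student at the given temperature and returns validation accuracy:

```python
# evaluate_student is a hypothetical helper: it trains a student with
# the given distillation temperature and returns validation accuracy.
candidates = (2.0, 4.0, 8.0, 16.0)
best_T = max(candidates, key=lambda T: evaluate_student(temperature=T))
print(f"Selected temperature: {best_T}")
```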

Distillation loss is the core mechanism for transferring knowledge from large to small models — by matching soft probability distributions rather than hard labels, it captures the nuanced understanding that teachers develop, enabling students to achieve surprisingly close performance with far fewer parameters.
