
Knowledge Distillation

What is Distillation?

Distillation trains a smaller "student" model to mimic a larger "teacher" model, transferring the teacher's knowledge while reducing model size.

Basic Distillation

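import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    # Soft targets: teacher distribution softened by the temperature
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # Hard targets (actual labels), computed at temperature 1
    hard_loss = F.cross_entropy(student_logits, labels)

    # Combined; the temperature ** 2 factor keeps soft-loss gradients on a
    # scale comparable to the hard loss as the temperature changes
    return alpha * soft_loss * (temperature ** 2) + (1 - alpha) * hard_loss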
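To see what the temperature does to the soft targets, the sketch below applies a temperature-scaled softmax to a made-up logit vector using only the standard library. The helper name `softmax_with_temperature` and the logit values are illustrative assumptions, not part of the loss above.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over a list of logits after dividing each by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a 3-class problem
teacher_logits = [4.0, 2.0, 0.5]

for t in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Higher temperatures flatten the distribution while preserving the ranking of the classes, so the student sees more of the teacher's relative preferences among the non-top classes.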

Training Loop

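import torch

student = SmallModel()  # placeholder: any nn.Module that outputs class logits
teacher = LargeModel()  # placeholder: a larger model with the same output classes
teacher.eval()  # Inference mode: disables dropout, uses running batch-norm stats
for p in teacher.parameters():
    p.requires_grad_(False)  # Freeze teacher weights

optimizer = torch.optim.Adam(student.parameters())

student.train()
for batch in dataloader:
    inputs, labels = batch

    # The teacher only provides targets, so skip gradient tracking
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    loss = distillation_loss(student_logits, teacher_logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()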
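The loop above depends on models and a dataloader defined elsewhere. As a minimal runnable sketch, the version below substitutes two small MLPs and synthetic data; the layer sizes, learning rate, step count, and the choice to take labels from the teacher's own predictions are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss * (temperature ** 2) + (1 - alpha) * hard_loss

# Hypothetical stand-ins: a wider "teacher" MLP and a narrower "student" MLP
teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 5))
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 5))
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher weights stay fixed

# Synthetic inputs; labels are the teacher's own predictions in this toy setup
inputs = torch.randn(256, 20)
with torch.no_grad():
    labels = teacher(inputs).argmax(dim=-1)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)
losses = []
for step in range(50):
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In a real setting the teacher would be a trained model and the labels would come from the dataset; only the student's parameters are handed to the optimizer in either case.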

LLM-Specific Distillation

Response Distillation

Train on teacher outputs:

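# Generate training data from teacher
training_data = []
for prompt in prompts:
    response = teacher.generate(prompt)  # pseudocode: teacher returns generated text
    training_data.append((prompt, response))

# Fine-tune student on this data
# (pseudocode: stands in for a supervised fine-tuning run)
student.finetune(training_data)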
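The data-collection step can be sketched without any model dependency. In the example below, `build_distillation_dataset`, `to_jsonl`, and `fake_teacher` are hypothetical names: `fake_teacher` stands in for a call to a real teacher model, and the prompt/response JSONL layout is one common choice rather than a required format.

```python
import json

def build_distillation_dataset(prompts, teacher_generate):
    """Pair each prompt with the teacher's response, skipping empty generations."""
    records = []
    for prompt in prompts:
        response = teacher_generate(prompt)
        if response.strip():
            records.append({"prompt": prompt, "response": response})
    return records

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

# Hypothetical stand-in for a real teacher model call
def fake_teacher(prompt):
    return "" if not prompt else f"Answer to: {prompt}"

prompts = ["What is distillation?", "", "Define temperature."]
records = build_distillation_dataset(prompts, fake_teacher)
print(to_jsonl(records))
```

Filtering out empty or malformed generations before fine-tuning matters because the student will imitate whatever the dataset contains.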

Intermediate Layer Distillation

Match hidden states, not just outputs:

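import torch.nn.functional as F

def layer_distillation_loss(student_hidden, teacher_hidden, projector):
    # Project student hidden states to the teacher's hidden dimension
    # (projector is a trainable layer, e.g. nn.Linear(student_dim, teacher_dim))
    projected = projector(student_hidden)

    # MSE between intermediate representations
    return F.mse_loss(projected, teacher_hidden)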
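A short usage sketch makes the shapes concrete. The hidden sizes (256 for the student, 768 for the teacher), the random tensors standing in for real hidden states, and the choice to pass the projector as an explicit argument are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

student_dim, teacher_dim = 256, 768  # hypothetical hidden sizes

# Trainable projection, optimized together with the student
projector = nn.Linear(student_dim, teacher_dim)

def layer_distillation_loss(student_hidden, teacher_hidden, projector):
    projected = projector(student_hidden)
    return F.mse_loss(projected, teacher_hidden)

# Random tensors standing in for hidden states, shaped (batch, sequence, hidden)
student_hidden = torch.randn(4, 10, student_dim, requires_grad=True)
teacher_hidden = torch.randn(4, 10, teacher_dim)  # teacher side carries no gradient

loss = layer_distillation_loss(student_hidden, teacher_hidden, projector)
loss.backward()

print(loss.item())
print(projector.weight.grad.shape)
```

Gradients reach both the student's hidden states and the projector, so the projector's parameters need to be included in the optimizer alongside the student's.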

Distillation Variants

| Variant | Description |
| --- | --- |
| Response distillation | Train on teacher outputs |
| Feature distillation | Match intermediate features |
| Attention distillation | Match attention patterns |
| Self-distillation | Distill a larger model into a smaller version of the same architecture |
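Attention distillation has no code elsewhere in this article, so here is one plausible formulation: a KL divergence between teacher and student attention rows, after averaging over heads so that differing head counts do not matter. The function name, the head averaging, and the use of KL rather than MSE are assumptions; published methods differ on all three.

```python
import torch

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) between attention maps.

    Both inputs have shape (batch, heads, query_len, key_len), with each row
    over key_len summing to 1. Head counts may differ; sequence lengths must match.
    """
    # Average over heads so models with different head counts are comparable
    s = student_attn.mean(dim=1)
    t = teacher_attn.mean(dim=1)
    # KL per query position, then averaged over batch and query positions
    kl_per_query = (t * (torch.log(t + eps) - torch.log(s + eps))).sum(dim=-1)
    return kl_per_query.mean()

torch.manual_seed(0)
teacher_attn = torch.softmax(torch.randn(2, 12, 6, 6), dim=-1)  # 12 heads
student_attn = torch.softmax(torch.randn(2, 4, 6, 6), dim=-1)   # 4 heads

print(attention_distillation_loss(student_attn, teacher_attn).item())
print(attention_distillation_loss(teacher_attn, teacher_attn).item())
```

The second call compares the teacher with itself and serves as a sanity check that identical attention maps give a loss of zero.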

Popular Distilled Models

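| Student | Teacher | Size Reduction |
| --- | --- | --- |
| DistilBERT | BERT | 40% smaller |
| TinyLlama | Llama (architecture and tokenizer only; pretrained from scratch) | 90% smaller |
| Phi | Unknown | Efficient from scratch |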
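The size-reduction figure is simply one minus the ratio of parameter counts. The sketch below checks the DistilBERT row using the approximate published counts (about 66M parameters for DistilBERT, about 110M for BERT-base); the helper name `size_reduction` is an assumption.

```python
def size_reduction(student_params, teacher_params):
    """Fraction of the teacher's parameters removed in the student."""
    return 1.0 - student_params / teacher_params

# Approximate published parameter counts
bert_base = 110_000_000
distilbert = 66_000_000

reduction = size_reduction(distilbert, bert_base)
print(f"DistilBERT is about {reduction:.0%} smaller than BERT-base")
# prints "DistilBERT is about 40% smaller than BERT-base"
```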

Benefits

| Benefit | Description |
| --- | --- |
| Speed | Smaller models run faster |
| Memory | Lower deployment costs |
| Deployment | Edge/mobile friendly |
| Privacy | Can run locally |
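The memory benefit can be estimated with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The model sizes below are hypothetical, and the estimate covers weights only, ignoring activations, caches, and runtime overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

FP32, FP16 = 4, 2  # bytes per parameter

# Hypothetical model sizes chosen for illustration
models = {
    "teacher (7B params)": 7_000_000_000,
    "student (1B params)": 1_000_000_000,
}

for name, n in models.items():
    fp32 = weight_memory_gb(n, FP32)
    fp16 = weight_memory_gb(n, FP16)
    print(f"{name}: {fp32:.1f} GB in fp32, {fp16:.1f} GB in fp16")
```

Under these assumptions the student's weights fit in 2 GB at fp16, which is what makes edge and on-device deployment realistic.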

Best Practices

