Self-distillation trains a model to match its own predictions across augmented or otherwise different views of the data, using the model itself as both teacher and student. This improves consistency, regularization, and representation quality without requiring a separate, larger teacher model.
What Is Self-Distillation?
- Definition: Model learns from its own predictions.
- Mechanism: Match predictions across augmentations or training stages.
- Goal: Improve consistency and generalization.
- Advantage: No separate teacher model needed.
Why Self-Distillation Works
- Consistency Regularization: Same input should give same output.
- Dark Knowledge: Soft predictions contain useful structure.
- Ensemble Effect: Different views create implicit ensemble.
- Denoising: Averaged predictions reduce noise.
Types of Self-Distillation
Temporal Self-Distillation (Born-Again Networks):
```
1. Train model to convergence
2. Use final model as teacher
3. Train new model (same architecture) to match it
4. Repeat: often improves each generation
Model_1 → teaches → Model_2 → teaches → Model_3
(often better than Model_1)
```
Layer-wise Self-Distillation:
```
Deep layers (teacher) → Shallow layers (student)
┌─────────────────────────────────────────┐
│ Layer 12 prediction ←─ final output │
│ │ │
│ ├── distill to ──→ Layer 6 pred │
│ │ │
│ └── distill to ──→ Layer 3 pred │
└─────────────────────────────────────────┘
```
Augmentation-Based:
```
Original image → Prediction A
Augmented image → Prediction B
Loss: Match A and B (both from same model)
```
Implementation
Augmentation Consistency:
```python
import torch
import torch.nn.functional as F

def self_distillation_loss(model, x, augment_fn, temperature=4.0):
    # Original prediction (teacher signal); no gradients flow through this side
    with torch.no_grad():
        teacher_logits = model(x)
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Augmented prediction (student signal)
    x_aug = augment_fn(x)
    student_logits = model(x_aug)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Consistency loss, scaled by T^2 to keep gradient magnitudes comparable
    consistency_loss = F.kl_div(
        student_log_probs,
        teacher_probs,
        reduction="batchmean"
    ) * (temperature ** 2)
    return consistency_loss
```
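In practice this loss is typically added to the supervised task loss. A minimal sketch of one training step, where `model`, `augment_fn`, `loader`, and the 0.5 weight are illustrative placeholders:

```python
# Illustrative training step combining task and consistency losses.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for x, y in loader:
    task_loss = F.cross_entropy(model(x), y)
    consistency = self_distillation_loss(model, x, augment_fn)
    loss = task_loss + 0.5 * consistency  # weighting is a tunable assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```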
Born-Again Training:
```python
def born_again_training(model_class, dataset, generations=3, temperature=4.0):
    """Train successive generations of self-distillation."""
    # Initial training (train_standard is an assumed helper defined elsewhere)
    current_model = model_class()
    train_standard(current_model, dataset)

    for gen in range(generations - 1):
        # Current model becomes the frozen teacher
        teacher = current_model.eval()

        # New student with the same architecture
        student = model_class()
        optimizer = torch.optim.AdamW(student.parameters())

        # Train student to match the teacher's softened predictions
        for x, y in dataset:
            with torch.no_grad():
                teacher_logits = teacher(x)
            student_logits = student(x)

            # Combine task loss and distillation loss
            task_loss = F.cross_entropy(student_logits, y)
            distill_loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean"
            ) * (temperature ** 2)
            loss = 0.5 * task_loss + 0.5 * distill_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        current_model = student
        print(f"Generation {gen + 1} complete")
    return current_model
```
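A hypothetical invocation, where `ConvNet` and `train_loader` stand in for your own model class and data:

```python
# Three generations: ConvNet -> teaches -> gen 2 -> teaches -> gen 3
final_model = born_again_training(ConvNet, train_loader, generations=3)
```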
Deep Layer Self-Distillation:
```python
import torch.nn as nn

class SelfDistillationModel(nn.Module):
    def __init__(self, base_model, num_classes, intermediate_dims, final_dim):
        super().__init__()
        self.backbone = base_model
        # Auxiliary classifiers at intermediate layers
        self.aux_classifiers = nn.ModuleList([
            nn.Linear(hidden_dim, num_classes)
            for hidden_dim in intermediate_dims
        ])
        self.final_classifier = nn.Linear(final_dim, num_classes)

    def forward(self, x):
        # Intermediate features (backbone is assumed to expose them)
        features = self.backbone.get_intermediate_features(x)

        # Auxiliary predictions from shallow layers
        aux_logits = [clf(feat) for clf, feat in
                      zip(self.aux_classifiers, features[:-1])]

        # Final prediction from the deepest features
        final_logits = self.final_classifier(features[-1])
        return final_logits, aux_logits

    def compute_loss(self, x, labels, temperature=4.0):
        final_logits, aux_logits = self.forward(x)

        # Task loss on the final head
        task_loss = F.cross_entropy(final_logits, labels)

        # Self-distillation: intermediate layers match the (detached) final head
        soft_targets = F.softmax(final_logits.detach() / temperature, dim=-1)
        distill_loss = sum(
            F.kl_div(F.log_softmax(aux / temperature, dim=-1),
                     soft_targets, reduction="batchmean")
            for aux in aux_logits
        ) * (temperature ** 2)
        return task_loss + 0.3 * distill_loss
```
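A hypothetical usage sketch; the backbone, feature dimensions, and batch names are all illustrative:

```python
# Backbone assumed to return feature vectors of sizes 256, 512, and 768,
# with the last feeding the final classifier.
model = SelfDistillationModel(
    base_model=my_backbone,
    num_classes=10,
    intermediate_dims=[256, 512],
    final_dim=768,
)
loss = model.compute_loss(images, labels)
loss.backward()
```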
Applications
DINO (Self-Supervised Vision):
- Student and teacher share weights (EMA update)
- Different crops → should give same representation
- Learns powerful visual representations without labels
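In DINO the teacher is an exponential moving average (EMA) of the student. A minimal sketch of the EMA update, assuming `teacher` and `student` are same-architecture `nn.Module` instances (the 0.996 momentum is one common choice, not DINO's exact schedule):

```python
@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher weights drift slowly toward the student's current weights
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1 - momentum)
```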
Language Models:
- Predict same output for paraphrased inputs
- Match representations of semantically similar text
- Improve robustness to input variations
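A sketch of paraphrase consistency for a text classifier, where `classifier`, `tokenize`, and `paraphrase_fn` are placeholder assumptions:

```python
def paraphrase_consistency_loss(classifier, text, paraphrase_fn, temperature=2.0):
    # Prediction on the original text acts as the teacher signal
    with torch.no_grad():
        teacher_probs = F.softmax(classifier(tokenize(text)) / temperature, dim=-1)
    # Prediction on the paraphrase is pulled toward it
    student_log_probs = F.log_softmax(
        classifier(tokenize(paraphrase_fn(text))) / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (temperature ** 2)
```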
Benefits vs. Standard Knowledge Distillation
```
Aspect              | Self-Distillation | Teacher-Student
--------------------|-------------------|------------------------
Teacher required    | No                | Yes
Architecture        | Same              | Different allowed
Training simplicity | Higher            | Lower
Max performance     | Good              | Better (bigger teacher)
Use case            | Regularization    | Compression
```
Self-distillation is a powerful regularization technique — by forcing models to be consistent across views or to match their own refined predictions, it improves generalization without the complexity of maintaining separate teacher models.