Self-Training (Pseudo-Labeling) is a semi-supervised learning technique in which a model trained on labeled data generates predictions (pseudo-labels) for unlabeled data, then retrains on the combined labeled and pseudo-labeled dataset. By leveraging large amounts of unlabeled data, it improves performance beyond what the limited labeled data alone can achieve; modern variants like Noisy Student have achieved state-of-the-art results across vision and language tasks.
Basic Self-Training Loop
1. Train teacher model M on labeled dataset D_L.
2. Use M to predict labels for unlabeled dataset D_U → pseudo-labels.
3. Filter/weight pseudo-labels by confidence (threshold τ).
4. Combine: D_train = D_L ∪ D_U(filtered).
5. Train student model on D_train.
6. (Optional) Iterate: student becomes new teacher → repeat (see the sketch after this list).
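A minimal sketch of this loop with scikit-learn; the synthetic dataset, the 100-example labeled split, the 0.9 threshold, and the single refinement round are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_l, y_l = X[:100], y[:100]   # small labeled set D_L
X_u = X[100:]                 # unlabeled pool D_U (labels discarded)

# 1. Train teacher on D_L.
teacher = LogisticRegression(max_iter=1000).fit(X_l, y_l)

# 2-3. Pseudo-label D_U and keep only confident predictions (tau = 0.9).
proba = teacher.predict_proba(X_u)
conf, pseudo = proba.max(axis=1), proba.argmax(axis=1)
keep = conf >= 0.9

# 4-5. Train student on D_L combined with the filtered pseudo-labeled set.
X_train = np.vstack([X_l, X_u[keep]])
y_train = np.concatenate([y_l, pseudo[keep]])
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# 6. (Optional) teacher = student; repeat steps 2-5.
```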
Confidence Thresholding
| Threshold (τ) | Effect |
|---------------|--------|
| High (0.95+) | Few pseudo-labels, high quality → slow learning |
| Medium (0.8-0.95) | Balances quality and quantity → usually optimal |
| Low (0.5-0.8) | Many pseudo-labels, noisy → can degrade model |
| Curriculum | Start high, decrease over time → progressive expansion |
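The curriculum row can be as simple as a decaying schedule. A toy sketch, where the 0.95 start, 0.80 floor, and linear decay rate are assumptions:

```python
def curriculum_threshold(round_idx: int, start: float = 0.95,
                         floor: float = 0.80, decay: float = 0.03) -> float:
    """Linearly lower tau each self-training round, never below `floor`."""
    return max(floor, start - decay * round_idx)

for r in range(6):
    print(f"round {r}: tau = {curriculum_threshold(r):.2f}")
```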
Noisy Student Training (Xie et al., 2020)
- Teacher generates pseudo-labels for 300M unlabeled images (the JFT-300M corpus, labels discarded).
- Student trained with noise: Strong data augmentation (RandAugment), dropout, stochastic depth.
- Key insight: the student should be trained under harder (noised) conditions than the teacher used when generating pseudo-labels.
- Equal-or-larger student model β absorbs more information from data.
- Result: EfficientNet-L2 with Noisy Student → 88.4% top-1 on ImageNet (SOTA at the time).
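A schematic PyTorch sketch of the teacher/student asymmetry described above: the teacher predicts on clean inputs in eval mode, while the student trains under noise (RandAugment plus whatever dropout or stochastic depth its architecture carries). The models, optimizer, and data handling are placeholders, not the paper's EfficientNet pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import RandAugment

randaug = RandAugment(num_ops=2, magnitude=9)  # strong input noise

def noisy_student_step(teacher, student, optimizer, images):
    """One training step; `images` is a batch of uint8 CHW image tensors."""
    teacher.eval()                        # teacher predicts on clean inputs
    with torch.no_grad():
        pseudo = teacher(images.float()).argmax(dim=-1)
    student.train()                       # dropout / stochastic depth active
    noisy = torch.stack([randaug(img) for img in images])
    loss = F.cross_entropy(student(noisy.float()), pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```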
Self-Training in NLP
| Method | Domain | Approach |
|--------|--------|----------|
| Back-Translation | Machine Translation | Translate target-side monolingual text back to the source language; use pairs as pseudo-parallel data |
| Self-Training LLM | Text Classification | LLM labels unlabeled text, fine-tune smaller model |
| PET / iPET | Few-Shot NLP | Pattern-based self-training with cloze-style prompts |
| UDA | General NLP | Consistency training with augmented pseudo-labeled data |
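To make the back-translation row concrete, a hedged sketch using Hugging Face MarianMT (assuming the Helsinki-NLP/opus-mt-de-en checkpoint; any reverse-direction translation model would do): real German target sentences are translated into English to fabricate English→German pseudo-parallel pairs.

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-de-en"   # target -> source direction
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

target_monolingual = ["Das Wetter ist heute schön.",
                      "Ich lerne gerne Sprachen."]
batch = tok(target_monolingual, return_tensors="pt", padding=True)
ids = model.generate(**batch, max_new_tokens=64)
synthetic_source = tok.batch_decode(ids, skip_special_tokens=True)

# Pseudo-parallel pairs: (synthetic English source, real German target),
# used to augment the real parallel corpus when training the forward model.
pairs = list(zip(synthetic_source, target_monolingual))
```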
Confirmation Bias Problem
- Risk: If teacher makes systematic errors → pseudo-labels propagate errors → student inherits and amplifies mistakes.
- Mitigations:
- High confidence threshold.
- Noise/augmentation during student training.
- Multiple rounds with fresh random initialization.
- Mix real labels with pseudo-labels, weighting real labels higher (see the sketch after this list).
- Co-training: Two models label data for each other.
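A minimal sketch of the weighting mitigation, assuming a PyTorch classifier; the 0.3 pseudo-label weight is illustrative, not canonical.

```python
import torch
import torch.nn.functional as F

def mixed_loss(logits, targets, is_pseudo, pseudo_weight=0.3):
    """Per-example cross-entropy that down-weights pseudo-labeled examples.

    is_pseudo: bool tensor, True where the target is a pseudo-label.
    """
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_example)
    weights[is_pseudo] = pseudo_weight    # real labels keep weight 1.0
    return (weights * per_example).mean()
```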
Self-Training vs. Other Semi-Supervised Methods
| Method | Advantage | Disadvantage |
|--------|----------|-------------|
| Self-Training | Simple, works with any model | Confirmation bias, threshold sensitivity |
| Consistency Regularization | No explicit labels needed | Requires augmentation design |
| Contrastive Learning | Strong representations | Doesn't directly use labels |
| FixMatch | Combines pseudo-labeling + consistency | More complex implementation |
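To make the FixMatch row concrete, a compact PyTorch sketch of its unlabeled objective, assuming `weak_aug` and `strong_aug` are user-supplied batch transforms: pseudo-label the weakly augmented view, train on the strongly augmented view, and mask out low-confidence examples.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug,
                            tau=0.95):
    """Unlabeled consistency loss: weak view labels the strong view."""
    with torch.no_grad():
        probs = model(weak_aug(x_unlabeled)).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= tau).float()      # keep only confident predictions
    loss = F.cross_entropy(model(strong_aug(x_unlabeled)), pseudo,
                           reduction="none")
    return (mask * loss).mean()
```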
Self-training is one of the most practical semi-supervised learning techniques: its simplicity, generality across modalities, and strong empirical results make it the go-to approach when abundant unlabeled data is available alongside limited labels, particularly in specialized domains where annotation is expensive.