Self-Training (Pseudo-Labeling) is a semi-supervised learning technique in which a model trained on labeled data generates predictions (pseudo-labels) for unlabeled data, then retrains on the combined labeled and pseudo-labeled dataset. By leveraging large amounts of unlabeled data, it improves performance beyond what the limited labeled data alone can achieve; modern variants like Noisy Student have achieved state-of-the-art results across vision and language tasks.
Basic Self-Training Loop
1. Train teacher model M on labeled dataset D_L.
2. Use M to predict labels for unlabeled dataset D_U → pseudo-labels.
3. Filter/weight pseudo-labels by confidence (threshold τ).
4. Combine: D_train = D_L ∪ D_U(filtered).
5. Train student model on D_train.
6. (Optional) Iterate: student becomes new teacher → repeat (see the sketch after this list).
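A minimal sketch of this loop with scikit-learn; the synthetic dataset, the 100-example labeled split, the 0.9 threshold, and the single refinement round are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_l, y_l = X[:100], y[:100]   # small labeled set D_L
X_u = X[100:]                 # unlabeled pool D_U (labels discarded)

# 1. Train teacher on D_L.
teacher = LogisticRegression(max_iter=1000).fit(X_l, y_l)

# 2-3. Pseudo-label D_U and keep only confident predictions (tau = 0.9).
proba = teacher.predict_proba(X_u)
conf, pseudo = proba.max(axis=1), proba.argmax(axis=1)
keep = conf >= 0.9

# 4-5. Train student on D_L combined with the filtered pseudo-labeled set.
X_train = np.vstack([X_l, X_u[keep]])
y_train = np.concatenate([y_l, pseudo[keep]])
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# 6. (Optional) teacher = student; repeat steps 2-5.
```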
Confidence Thresholding
| Threshold (τ) | Effect |
|---------------|--------|
| High (0.95+) | Few pseudo-labels, high quality → slow learning |
| Medium (0.8-0.95) | Balances quality and quantity → usually optimal |
| Low (0.5-0.8) | Many pseudo-labels, noisy → can degrade model |
| Curriculum | Start high, decrease over time → progressive expansion |
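The curriculum row can be as simple as a decaying schedule. A toy sketch, where the 0.95 start, 0.80 floor, and linear decay rate are assumptions:

```python
def curriculum_threshold(round_idx: int, start: float = 0.95,
                         floor: float = 0.80, decay: float = 0.03) -> float:
    """Linearly lower tau each self-training round, never below `floor`."""
    return max(floor, start - decay * round_idx)

for r in range(6):
    print(f"round {r}: tau = {curriculum_threshold(r):.2f}")
```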
Noisy Student Training (Xie et al., 2020)
- Teacher generates pseudo-labels for 300M unlabeled images (the JFT-300M corpus, labels discarded).
- Student trained with noise: Strong data augmentation (RandAugment), dropout, stochastic depth.
- Key insight: the student should be trained under harder (noised) conditions than the teacher used when generating pseudo-labels.
- Equal-or-larger student model β absorbs more information from data.
- Result: EfficientNet-L2 with Noisy Student → 88.4% top-1 on ImageNet (SOTA at the time).
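A schematic PyTorch sketch of the teacher/student asymmetry described above: the teacher predicts on clean inputs in eval mode, while the student trains under noise (RandAugment plus whatever dropout or stochastic depth its architecture carries). The models, optimizer, and data handling are placeholders, not the paper's EfficientNet pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import RandAugment

randaug = RandAugment(num_ops=2, magnitude=9)  # strong input noise

def noisy_student_step(teacher, student, optimizer, images):
    """One training step; `images` is a batch of uint8 CHW image tensors."""
    teacher.eval()                        # teacher predicts on clean inputs
    with torch.no_grad():
        pseudo = teacher(images.float()).argmax(dim=-1)
    student.train()                       # dropout / stochastic depth active
    noisy = torch.stack([randaug(img) for img in images])
    loss = F.cross_entropy(student(noisy.float()), pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```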
Self-Training in NLP
| Method | Domain | Approach |
|--------|--------|----------|
| Back-Translation | Machine Translation | Translate target-side monolingual text back to the source language; use pairs as pseudo-parallel data |
| Self-Training LLM | Text Classification | LLM labels unlabeled text, fine-tune smaller model |
| PET / iPET | Few-Shot NLP | Pattern-based self-training with cloze-style prompts |
| UDA | General NLP | Consistency training with augmented pseudo-labeled data |
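To make the back-translation row concrete, a hedged sketch using Hugging Face MarianMT (assuming the Helsinki-NLP/opus-mt-de-en checkpoint; any reverse-direction translation model would do): real German target sentences are translated into English to fabricate English→German pseudo-parallel pairs.

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-de-en"   # target -> source direction
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

target_monolingual = ["Das Wetter ist heute schön.",
                      "Ich lerne gerne Sprachen."]
batch = tok(target_monolingual, return_tensors="pt", padding=True)
ids = model.generate(**batch, max_new_tokens=64)
synthetic_source = tok.batch_decode(ids, skip_special_tokens=True)

# Pseudo-parallel pairs: (synthetic English source, real German target),
# used to augment the real parallel corpus when training the forward model.
pairs = list(zip(synthetic_source, target_monolingual))
```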
Confirmation Bias Problem
- Risk: If teacher makes systematic errors → pseudo-labels propagate errors → student inherits and amplifies mistakes.
- Mitigations:
- High confidence threshold.
- Noise/augmentation during student training.
- Multiple rounds with fresh random initialization.
- Mix real labels with pseudo-labels, weighting real labels higher (see the sketch after this list).
- Co-training: Two models label data for each other.
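A minimal sketch of the weighting mitigation, assuming a PyTorch classifier; the 0.3 pseudo-label weight is illustrative, not canonical.

```python
import torch
import torch.nn.functional as F

def mixed_loss(logits, targets, is_pseudo, pseudo_weight=0.3):
    """Per-example cross-entropy that down-weights pseudo-labeled examples.

    is_pseudo: bool tensor, True where the target is a pseudo-label.
    """
    per_example = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_example)
    weights[is_pseudo] = pseudo_weight    # real labels keep weight 1.0
    return (weights * per_example).mean()
```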
Self-Training vs. Other Semi-Supervised Methods
| Method | Advantage | Disadvantage |
|--------|----------|-------------|
| Self-Training | Simple, works with any model | Confirmation bias, threshold sensitivity |
| Consistency Regularization | No explicit labels needed | Requires augmentation design |
| Contrastive Learning | Strong representations | Doesn't directly use labels |
| FixMatch | Combines pseudo-labeling + consistency | More complex implementation |
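To make the FixMatch row concrete, a compact PyTorch sketch of its unlabeled objective, assuming `weak_aug` and `strong_aug` are user-supplied batch transforms: pseudo-label the weakly augmented view, train on the strongly augmented view, and mask out low-confidence examples.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug,
                            tau=0.95):
    """Unlabeled consistency loss: weak view labels the strong view."""
    with torch.no_grad():
        probs = model(weak_aug(x_unlabeled)).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= tau).float()      # keep only confident predictions
    loss = F.cross_entropy(model(strong_aug(x_unlabeled)), pseudo,
                           reduction="none")
    return (mask * loss).mean()
```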
Self-training is one of the most practical semi-supervised learning techniques: its simplicity, generality across modalities, and strong empirical results make it the go-to approach when abundant unlabeled data is available alongside limited labels, particularly in specialized domains where annotation is expensive.