Self-Training (Pseudo-Labeling)

Self-Training (Pseudo-Labeling) is a semi-supervised learning technique in which a model trained on labeled data generates predictions (pseudo-labels) on unlabeled data, then retrains on the combined labeled and pseudo-labeled dataset. It uses large amounts of unlabeled data to improve performance beyond what the limited labeled data alone can achieve. Modern variants such as Noisy Student have reported state-of-the-art results on vision benchmarks, and related methods are widely used in language tasks.

Basic Self-Training Loop

1. Train teacher model M on labeled dataset D_L.
2. Use M to predict labels for unlabeled dataset D_U → pseudo-labels.
3. Filter/weight pseudo-labels by confidence (threshold τ).
4. Combine: D_train = D_L ∪ D_U(filtered).
5. Train student model on D_train.
6. (Optional) Iterate: student becomes the new teacher → repeat.
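The loop above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the nearest-centroid classifier, the softmax-over-distances confidence score, and the names `fit_centroids`, `predict_proba`, and `self_train` are all assumptions chosen to keep the example self-contained. Any model that outputs class probabilities could take their place.

```python
import math
import random


def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    scores = {
        y: -sum((a - b) ** 2 for a, b in zip(x, c))
        for y, c in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def self_train(x_labeled, y_labeled, x_unlabeled, tau=0.9, rounds=3):
    """Return (final model, number of pseudo-labels kept in the last round)."""
    model = fit_centroids(x_labeled, y_labeled)      # step 1: teacher on D_L
    kept = 0
    for _ in range(rounds):                          # step 6: iterate
        pseudo_x, pseudo_y = [], []
        for x in x_unlabeled:                        # step 2: predict on D_U
            probs = predict_proba(model, x)
            label = max(probs, key=probs.get)
            if probs[label] >= tau:                  # step 3: confidence filter
                pseudo_x.append(x)
                pseudo_y.append(label)
        # steps 4-5: train the student on D_L plus the filtered pseudo-labels;
        # the student then serves as the teacher for the next round
        model = fit_centroids(list(x_labeled) + pseudo_x,
                              list(y_labeled) + pseudo_y)
        kept = len(pseudo_x)
    return model, kept


# Toy data (hypothetical): two 2-D Gaussian blobs, four labeled points in total.
rng = random.Random(0)
x_l = [(0.5, -0.2), (-0.3, 0.4), (3.6, 4.2), (4.3, 3.7)]
y_l = [0, 0, 1, 1]
x_u = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
x_u += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(100)]

model, kept = self_train(x_l, y_l, x_u, tau=0.9, rounds=3)
print(f"pseudo-labels kept in last round: {kept} of {len(x_u)}")
```

Pseudo-labels are regenerated from scratch each round rather than accumulated, so an early mistake can be dropped once the model improves. Accumulating them instead is another common design choice.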

Confidence Thresholding

| Threshold (τ) | Effect |
|---|---|
| High (0.95+) | Few pseudo-labels, high quality → slow learning |
| Medium (0.8-0.95) | Balance quality and quantity → usually optimal |
| Low (0.5-0.8) | Many pseudo-labels, noisy → can degrade model |
| Curriculum | Start high, decrease over time → progressive expansion |
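As a rough illustration of the fixed and curriculum thresholds in the table, the sketch below filters predictions by their top-class probability and decays τ linearly over training rounds. The function names, the linear schedule, and the 0.95 → 0.80 range are assumptions made for the example. Other schedules (stepwise, per-class) are equally plausible.

```python
def curriculum_threshold(step, total_steps, tau_start=0.95, tau_end=0.80):
    """Linearly move tau from tau_start (first step) to tau_end (last step)."""
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # clamp out-of-range steps
    return tau_start + frac * (tau_end - tau_start)


def select_pseudo_labels(probabilities, tau):
    """Keep (index, label, confidence) where the top probability is >= tau."""
    selected = []
    for i, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= tau:
            selected.append((i, label, probs[label]))
    return selected


# Hypothetical model outputs for four unlabeled examples (two classes).
probabilities = [[0.97, 0.03], [0.85, 0.15], [0.40, 0.60], [0.10, 0.90]]

for step in range(5):
    tau = curriculum_threshold(step, total_steps=5)
    kept = [i for i, _, _ in select_pseudo_labels(probabilities, tau)]
    print(f"step {step}: tau={tau:.3f}, kept indices={kept}")
```

With these inputs, only the most confident example passes the strict initial threshold. By the final step the relaxed threshold admits three of the four, while the 0.60-confidence prediction is still excluded.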

Noisy Student Training (Xie et al., 2020)

Self-Training in NLP

| Method | Domain | Approach |
|---|---|---|
| Back-Translation | Machine Translation | Translate target→source, use as pseudo-parallel data |
| Self-Training LLM | Text Classification | LLM labels unlabeled text, fine-tune smaller model |
| PET / iPET | Few-Shot NLP | Pattern-based self-training with cloze-style prompts |
| UDA | General NLP | Consistency training between unlabeled examples and their augmented versions |

Confirmation Bias Problem

Self-Training vs. Other Semi-Supervised Methods

| Method | Advantage | Disadvantage |
|---|---|---|
| Self-Training | Simple, works with any model | Confirmation bias, threshold sensitivity |
| Consistency Regularization | No explicit labels needed | Requires augmentation design |
| Contrastive Learning | Strong representations | Doesn't directly use labels |
| FixMatch | Combines pseudo-labeling + consistency | More complex implementation |

Self-training is one of the most practical semi-supervised learning techniques. Its simplicity, generality across modalities, and strong empirical results make it a common first choice when abundant unlabeled data is available alongside limited labels, particularly in specialized domains where annotation is expensive.
