Curriculum Masking

Curriculum Masking is a pre-training strategy for masked language models in which the difficulty of the masking task increases progressively over training. It applies the principle of curriculum learning (easy examples before hard ones) to the masked language modeling objective, with the aim of improving training stability, accelerating convergence, and pushing the model toward more robust and generalizable representations.

The Curriculum Learning Principle

Curriculum learning, formalized by Bengio et al. (2009), observes that humans and animals learn better when presented with examples in order of increasing difficulty — mastering simple cases before confronting complex ones. Applied to masked language modeling, this principle translates to progressively harder masking challenges across the training schedule.

Standard BERT uses a fixed masking strategy throughout training: 15% of tokens are randomly selected, with 80% replaced by [MASK], 10% replaced by a random token, and 10% left unchanged. Curriculum masking questions whether this static schedule is optimal across all training stages.
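For reference, the static baseline described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not BERT's actual preprocessing code; the function name `bert_mask`, the `vocab` list, and the use of `None` for unscored positions are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"

def bert_mask(tokens, vocab, mask_rate=0.15, rng=None):
    """Static BERT-style masking sketch: returns (corrupted_tokens, labels)."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = position is not scored by the loss
    # Select roughly mask_rate of the positions (at least one if any exist).
    n_select = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    for i in rng.sample(range(len(tokens)), n_select):
        labels[i] = tokens[i]  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token (may equal the original)
        # remaining 10%: leave the token unchanged
    return corrupted, labels
```

The rate and the 80/10/10 split are constants here, which is the property curriculum masking relaxes: a curriculum would make `mask_rate` (or the selection rule) depend on training progress.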

Curriculum Dimensions for Masking

Masking Rate Progression:

Masking Strategy Progression:

Span Length Progression:

Difficulty-Based Adaptive Selection: Rather than a fixed schedule, select tokens for masking based on the model's current confidence. Mask positions the model currently predicts with low probability — forcing attention to genuinely hard examples. This adapts automatically to the model's evolving capability throughout training, avoiding both too-easy and too-hard masking at any given stage.
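A minimal sketch of confidence-based selection, assuming the per-position probabilities the model assigns to the correct token have already been computed (for example, from a forward pass on the unmasked or lightly masked input). The function name `select_hard_positions` and the input format are hypothetical, chosen for illustration.

```python
def select_hard_positions(correct_token_probs, mask_rate=0.15):
    """Return the indices of the lowest-confidence positions, in sequence order.

    correct_token_probs[i] is assumed to be the probability the current model
    assigns to the true token at position i.
    """
    n = len(correct_token_probs)
    k = min(n, max(1, round(mask_rate * n))) if n else 0
    # Rank positions from least to most confident and keep the k hardest.
    ranked = sorted(range(n), key=lambda i: correct_token_probs[i])
    return sorted(ranked[:k])
```

Because the probabilities are recomputed as the model improves, the selected set shifts during training without an explicit schedule. Note that taking only the lowest-probability positions does not by itself exclude positions that are too hard; that would need an additional lower bound on confidence.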

Theoretical Justification

Curriculum masking operationalizes two complementary principles:

Self-Paced Learning: Include training examples (masked positions) where the model's current confidence is within a productive learning range: neither trivially easy (the gradient signal is close to zero) nor impossibly hard (the gradient signal is dominated by noise). The masking difficulty functions as a continuous curriculum parameter tuned to the model's current state.
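The "productive range" idea can be illustrated with a simple band filter over model confidence. This is a sketch under stated assumptions: the thresholds `low` and `high` are hypothetical values picked for the example, not figures from any published method, and the input is again the probability the model assigns to the correct token at each position.

```python
def productive_positions(correct_token_probs, low=0.1, high=0.9):
    """Keep positions whose confidence lies in the band [low, high].

    Positions above `high` are treated as already learned (little gradient);
    positions below `low` are treated as too hard for the current model.
    Both thresholds are illustrative assumptions.
    """
    return [i for i, p in enumerate(correct_token_probs) if low <= p <= high]

# Example: positions 0 (too easy) and 2 (too hard) are filtered out.
candidates = productive_positions([0.99, 0.5, 0.02, 0.3, 0.9])
```

For a softmax cross-entropy loss, the gradient with respect to the correct token's logit is p - 1, so it shrinks toward zero as confidence p approaches 1, which is why the upper threshold discards little useful signal.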

Zone of Proximal Development: Vygotsky's educational concept applies directly: learning is most efficient when the challenge is just beyond current capability. Fixed 15% random masking provides challenges of wildly varying difficulty simultaneously; curriculum masking focuses effort in the productive zone.

Empirical Evidence

The empirical picture is mixed but informative:

Implementations in Practice

Curriculum Masking is a progressive difficulty schedule for language model pre-training. It structures the fill-in-the-blank task to begin with easy blanks and advance to conceptually hard ones, building language representations from simple to complex according to the same pedagogical principle that effective teachers apply to human learners.
