Knowledge Masking

Knowledge Masking is a pre-training strategy that uses external knowledge bases or linguistic analysis to define semantically meaningful masking units. Named entities, concepts, and phrases are masked as atomic units instead of randomly selected subword tokens, so the model must predict entire real-world concepts from context rather than reconstruct word fragments from adjacent subwords.

The Limitation of Token-Level Masking

Standard BERT masks individual WordPiece subwords. Suppose "Barack Obama" is tokenized as ["Barack", "O", "##bam", "##a"] (an illustrative split; WordPiece uses the "##" prefix only for pieces that continue a word, so the first piece of "Obama" carries no prefix). Masking only the token "##bam" makes prediction trivial: the model reconstructs the word from the visible fragments "Barack", "O", and "##a" without learning anything meaningful about Barack Obama as a real-world entity.

Knowledge Masking addresses this by treating "Barack Obama" as a single indivisible semantic unit. When masked, all four subword tokens are replaced simultaneously, forcing the model to predict the entity's identity entirely from surrounding context: "the 44th president of the United States," "delivered his inaugural address," "former senator from Illinois." The model must learn what these contextual signals say about a specific real-world person.
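The contrast between the two schemes can be sketched in a few lines of Python. This is a minimal illustration, not ERNIE's implementation: the tokenization is illustrative, and the entity span is assumed to come from an upstream NER step. The helper names `mask_token` and `mask_span` are hypothetical.

```python
from typing import List, Tuple

MASK = "[MASK]"


def mask_token(tokens: List[str], index: int) -> List[str]:
    """Token-level masking: hide one subword and leave its neighbours visible."""
    masked = list(tokens)
    masked[index] = MASK
    return masked


def mask_span(tokens: List[str], span: Tuple[int, int]) -> List[str]:
    """Span-level masking: hide every subword in the half-open range [start, end)."""
    start, end = span
    return [MASK if start <= i < end else tok for i, tok in enumerate(tokens)]


# Illustrative tokenization only; a real WordPiece vocabulary may split differently.
tokens = ["Barack", "O", "##bam", "##a", "delivered", "his", "inaugural", "address"]
entity_span = (0, 4)  # assumed output of an upstream entity recognizer

token_level = mask_token(tokens, 2)             # only "##bam" is hidden
entity_level = mask_span(tokens, entity_span)   # the whole entity is hidden

print(token_level)
print(entity_level)
```

In the token-level output the entity can still be read off from its visible pieces. In the entity-level output only the surrounding context remains, which is the signal knowledge masking wants the model to use.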

ERNIE (Baidu) — The Canonical Implementation

ERNIE 1.0 (2019) from Baidu introduced three-level structured masking:

Basic-Level Masking: Random token masking identical to BERT — establishes baseline language modeling capability and recovers individual word statistics.

Phrase-Level Masking: Uses linguistic analysis (lexical analysis and chunking tools in the ERNIE paper) to identify multiword expressions. Masks entire phrases as units: "New York City," "machine learning," "Nobel Prize in Physics," "rate of return." The model must predict the complete phrase concept, not individual words.

Entity-Level Masking: Uses a named entity recognition (NER) system to identify entity spans. Masks all tokens of each entity simultaneously: person names, location names, organization names, product names, dates. The model predicts the entity identity from surrounding discourse.
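The three levels differ only in what counts as one masking unit, which the following sketch makes explicit. It is a simplified illustration under stated assumptions, not Baidu's code: phrase and entity spans are assumed to be precomputed and non-overlapping, the 15% masking rate is borrowed from BERT, and `build_units` and `mask_units` are hypothetical helper names.

```python
import random
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) over token positions
MASK = "[MASK]"


def build_units(num_tokens: int, spans: List[Span]) -> List[Span]:
    """Return masking units: the given spans plus one-token units for uncovered positions."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    singles = [(i, i + 1) for i in range(num_tokens) if i not in covered]
    return sorted(spans + singles)


def mask_units(tokens: List[str], spans: List[Span],
               mask_rate: float = 0.15, seed: int = 0) -> List[str]:
    """Mask whole units in random order until roughly mask_rate of the tokens are hidden."""
    rng = random.Random(seed)
    units = build_units(len(tokens), spans)
    rng.shuffle(units)
    budget = max(1, round(mask_rate * len(tokens)))
    masked = list(tokens)
    hidden = 0
    for start, end in units:
        if hidden >= budget:
            break
        # A unit is never split, so the budget may be slightly exceeded.
        for i in range(start, end):
            masked[i] = MASK
        hidden += end - start
    return masked


tokens = "Marie Curie won the Nobel Prize in Physics in 1903".split()
phrase_spans = [(4, 8)]  # "Nobel Prize in Physics", assumed to come from a chunker
entity_spans = [(0, 2)]  # "Marie Curie", assumed to come from an NER system

basic_level = mask_units(tokens, [])              # every token is its own unit
phrase_level = mask_units(tokens, phrase_spans)   # the phrase is masked all-or-nothing
entity_level = mask_units(tokens, entity_spans)   # the entity is masked all-or-nothing
```

Passing an empty span list reduces the procedure to random token masking, so the same function covers the basic level as a special case.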

ERNIE 1.0 reported significant improvements on Chinese NLP tasks. The benefits are especially pronounced for Chinese because the text is written without spaces: character-level masking plays much the same role as subword masking in English, while entity boundaries carry linguistic meaning that character boundaries do not.

Knowledge Graph Integration

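ERNIE 2.0 and ERNIE 3.0 build on knowledge masking. ERNIE 2.0 keeps it as one of several pre-training tasks learned continually, and ERNIE 3.0 integrates structured symbolic knowledge by pre-training on knowledge graph triples paired with text.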

Salient Span Masking (Google)

Salient span masking was introduced in Google's REALM and later used as an additional pre-training step for T5 in closed-book question answering. Instead of masking random spans, it masks spans that tend to require world knowledge to predict, chiefly named entities and dates. In REALM, entities are identified with a BERT-based NER tagger and dates with a regular expression, so common function words and stopwords are rarely masked. This captures much of the benefit of entity masking without requiring a knowledge graph or entity linker.
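The selection step can be sketched as follows. This is a minimal illustration, assuming salient spans are dates matched by a simple regular expression plus strings from a small hand-written entity list. The entity list is a hypothetical stand-in for whatever tagger or scoring step a real pipeline would use, and masking one span per training example is a design choice of this sketch.

```python
import re
from typing import List, Tuple

MASK = "[MASK]"

# Simple date pattern (an assumption of this sketch): an optional month name
# followed by a four-digit year, e.g. "July 1969" or "1969".
DATE_RE = re.compile(
    r"\b(?:(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+)?\d{4}\b"
)


def find_salient_spans(text: str, entities: List[str]) -> List[Tuple[int, int]]:
    """Return sorted, non-overlapping character spans for listed entities and dates."""
    candidates = [m.span() for m in DATE_RE.finditer(text)]
    for name in entities:
        candidates += [m.span() for m in re.finditer(re.escape(name), text)]
    spans: List[Tuple[int, int]] = []
    for start, end in sorted(candidates):
        # Keep a candidate only if it starts after the previously kept span ends.
        if not spans or start >= spans[-1][1]:
            spans.append((start, end))
    return spans


def mask_one_span(text: str, span: Tuple[int, int]) -> Tuple[str, str]:
    """Return (masked text, target string) for a single salient span."""
    start, end = span
    return text[:start] + MASK + text[end:], text[start:end]


text = "Neil Armstrong walked on the Moon in July 1969."
spans = find_salient_spans(text, entities=["Neil Armstrong", "Moon"])
examples = [mask_one_span(text, s) for s in spans]

for masked_text, target in examples:
    print(masked_text, "->", target)
```

Each resulting example hides one complete entity or date, so the model cannot recover the target from leftover fragments of the same span.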

Comparison of Masking Strategies

| Variant | Masking Unit | External Resource Required |
|---|---|---|
| Random Token | Individual subword | None |
| Whole Word | All subwords of a word | Word boundary information |
| Phrase Masking | Multiword expression | POS tagger, chunker |
| Entity Masking | Named entity span | NER system |
| Knowledge Masking | Knowledge graph entity | KG + entity linker |
| Salient Span | High-information span | Salient-span detector (tagger or heuristic) |

Benefits for Downstream Tasks

Knowledge masking has been reported to improve performance on entity-centric tasks such as named entity recognition and knowledge-intensive question answering.

For Non-Latin Languages

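The benefit of knowledge masking is especially strong for languages written without spaces between words, such as Chinese, where character-level masking otherwise hides only fragments of words and entities.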

Knowledge Masking is hiding concepts rather than characters — using semantic knowledge to define masking boundaries, forcing the model to learn the identity of real-world objects from context rather than reconstructing word forms from adjacent fragments.
