Knowledge Masking is a pre-training strategy that uses external knowledge bases or linguistic analysis to define semantically meaningful masking units, treating named entities, concepts, and phrases as atomic units for masking rather than randomly selecting subword tokens. This forces the model to predict entire real-world concepts from context rather than reconstruct word fragments from adjacent subwords.
The Limitation of Token-Level Masking
Standard BERT masks individual WordPiece subwords. Suppose "Barack Obama" is tokenized as ["Barack", "Ob", "##ama"]. Masking only the token "##ama" makes prediction trivial: the model reconstructs the word from the visible pieces "Barack" and "Ob" without learning anything meaningful about Barack Obama as a real-world entity.
Knowledge Masking addresses this by treating "Barack Obama" as a single indivisible semantic unit. When masked, all of its subword tokens are replaced simultaneously, forcing the model to predict the entity's identity entirely from surrounding context: "the 44th president of the United States," "delivered his inaugural address," "former senator from Illinois." The model must learn what these contextual signals say about a specific real-world person.
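A minimal sketch of the difference in plain Python, with a toy token list (the tokenization, probabilities, and span indices are illustrative rather than BERT's or ERNIE's actual settings):

```python
import random

# Toy subword-tokenized sentence; "Barack Obama" spans the first three pieces.
tokens = ["Barack", "Ob", "##ama", "delivered", "his", "inaugural", "address"]
entity_spans = [(0, 3)]  # (start, end) token indices of known entity mentions

def token_level_mask(tokens, p=0.15, seed=0):
    """BERT-style masking: each subword is masked independently."""
    rng = random.Random(seed)
    return [tok if rng.random() > p else "[MASK]" for tok in tokens]

def entity_level_mask(tokens, entity_spans, p=0.5, seed=0):
    """Knowledge masking: an entity is masked as one atomic unit or not at all."""
    rng = random.Random(seed)
    masked = list(tokens)
    for start, end in entity_spans:
        if rng.random() < p:
            for i in range(start, end):
                masked[i] = "[MASK]"
    return masked

print(token_level_mask(tokens))                 # may mask "##ama" alone: trivial to recover
print(entity_level_mask(tokens, entity_spans))  # masks "Barack Ob ##ama" together
```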
ERNIE (Baidu) — The Canonical Implementation
ERNIE 1.0 (2019) from Baidu introduced three-level structured masking:
Basic-Level Masking: Random token masking identical to BERT — establishes baseline language modeling capability and recovers individual word statistics.
Phrase-Level Masking: Uses linguistic analysis (POS tagging, chunking, and syntactic parsing) to identify multiword expressions. Masks entire phrases as units: "New York City," "machine learning," "Nobel Prize in Physics," "rate of return." The model must predict the complete phrase concept, not individual words.
Entity-Level Masking: Uses a named entity recognition (NER) system to identify entity spans. Masks all tokens of each entity simultaneously: person names, location names, organization names, product names, dates. The model predicts the entity identity from surrounding discourse.
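A single-pass sketch of the span selection, assuming phrase and entity boundaries have already been produced by a chunker and an NER system (here they are supplied by the caller; ERNIE applies the three granularities in separate pre-training stages, which this sketch collapses for brevity):

```python
import random

def knowledge_mask(tokens, phrase_spans, entity_spans, mask_prob=0.15, seed=0):
    """ERNIE-style masking sketch: treat phrases and entities as atomic units.

    phrase_spans / entity_spans are (start, end) token indices that would come
    from a chunker and an NER system respectively.
    """
    rng = random.Random(seed)
    # Entity and phrase spans are atomic units; remaining tokens are single-token units.
    covered = {i for s, e in entity_spans + phrase_spans for i in range(s, e)}
    units = entity_spans + phrase_spans + [
        (i, i + 1) for i in range(len(tokens)) if i not in covered
    ]

    masked = list(tokens)
    for start, end in units:
        if rng.random() < mask_prob:                       # mask the whole unit or nothing
            masked[start:end] = ["[MASK]"] * (end - start)
    return masked

tokens = ["Barack", "Ob", "##ama", "delivered", "his", "inaugural", "address", "in", "January"]
print(knowledge_mask(tokens,
                     phrase_spans=[(5, 7)],   # "inaugural address" as a phrase unit
                     entity_spans=[(0, 3)]))  # "Barack Obama" as an entity unit
```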
ERNIE 1.0 demonstrated significant improvements on Chinese NLP tasks. The benefit is especially pronounced in Chinese because the script has no spaces: character-level masking plays a role structurally similar to subword masking in English, while entity and phrase boundaries are linguistically meaningful in ways that character boundaries are not.
Knowledge Graph Integration
ERNIE 2.0 and ERNIE 3.0, together with related knowledge-enhanced models, extend knowledge masking to integrate structured symbolic knowledge through components such as the following:
- Entity Linking: Entity mentions in text are linked to their canonical entries in a knowledge graph such as Wikidata or Freebase using an entity-linking system.
- Knowledge Embeddings: Entity embeddings pre-trained on the knowledge graph (using TransE, RotatE, or ComplEx) are incorporated as additional inputs during pre-training.
- Relation Prediction: Auxiliary pre-training tasks predict the relationship between co-occurring entities: "Barack Obama" and "Harvard Law School" are linked by the "attended" relation. The model learns to reason about entity relationships, not just entity identities.
- Knowledge Fusion: A fusion layer combines token-level contextual representations from the Transformer with entity embeddings from the knowledge graph, training the model to integrate both sources of information.
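A sketch of the fusion idea: a gated layer that mixes each token's contextual Transformer state with the knowledge-graph embedding of the entity it is linked to (a zero vector when the token is unlinked). The gating formulation, dimensions, and class name here are assumptions for illustration, not the exact ERNIE aggregator:

```python
import torch
import torch.nn as nn

class KnowledgeFusionLayer(nn.Module):
    """Illustrative fusion layer: mix token states with aligned KG entity embeddings."""

    def __init__(self, hidden_dim: int, kg_dim: int):
        super().__init__()
        self.proj = nn.Linear(kg_dim, hidden_dim)         # map KG space into model space
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, token_states: torch.Tensor, entity_embs: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) from the Transformer
        # entity_embs:  (batch, seq_len, kg_dim), aligned per token by entity linking
        kg = self.proj(entity_embs)
        gate = torch.sigmoid(self.gate(torch.cat([token_states, kg], dim=-1)))
        return gate * token_states + (1.0 - gate) * kg    # gated mixture of the two sources

# Toy usage: 2 sequences, 8 tokens, 768-d Transformer states, 100-d TransE-style entity embeddings.
fusion = KnowledgeFusionLayer(hidden_dim=768, kg_dim=100)
fused = fusion(torch.randn(2, 8, 768), torch.randn(2, 8, 100))
print(fused.shape)  # torch.Size([2, 8, 768])
```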
Salient Span Masking (Google)
Google's REALM introduced salient span masking, later adopted for T5-based closed-book question answering: a named-entity tagger and a date pattern identify salient spans (person, place, and organization names, dates), and each selected span is masked as a whole unit. Because these spans carry most of a sentence's factual content, the objective concentrates pre-training signal on world knowledge. This approximates entity masking without requiring a knowledge graph or an entity linker.
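A minimal sketch of the idea using an off-the-shelf NER tagger (spaCy with the `en_core_web_sm` model is an assumption; any tagger that returns character-level entity spans would do, and dates can be added with a regular expression in the same way):

```python
import spacy  # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

def salient_span_mask(text: str, mask_token: str = "[MASK]") -> str:
    """Mask salient spans (here: named entities found by the tagger) as whole units."""
    doc = nlp(text)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        text = text[:ent.start_char] + mask_token + text[ent.end_char:]
    return text

print(salient_span_mask("Barack Obama delivered his inaugural address in Washington in 2009."))
# e.g. "[MASK] delivered his inaugural address in [MASK] in [MASK]."
```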
Comparison of Masking Strategies
| Variant | Masking Unit | External Resource Required |
|---------|-------------|---------------------------|
| Random Token | Individual subword | None |
| Whole Word | All subwords of a word | Word boundary information |
| Phrase Masking | Multiword expression | POS tagger, chunker |
| Entity Masking | Named entity span | NER system |
| Knowledge Masking | Knowledge graph entity | KG + entity linker |
| Salient Span | Named entity or date span | NER tagger + date patterns |
Benefits for Downstream Tasks
Knowledge masking consistently improves performance on entity-centric tasks:
- Named Entity Recognition: Stronger entity span representations from explicit entity-level supervision.
- Relation Extraction: Predicting relationships between co-occurring entities benefits from relational pre-training.
- Knowledge-Intensive QA: Questions requiring factual recall (TriviaQA, Natural Questions, EntityQuestions) benefit from richer entity representations.
- Entity Linking: Disambiguating entity mentions to knowledge graph entries improves when entity representations are pre-trained with knowledge masking.
- Coreference Resolution: Entity identity tracking across a document benefits from entity-level representations.
- Slot Filling: Extracting structured information about entities is strengthened by entity-aware pre-training.
For Non-Latin Languages
The benefit of knowledge masking is especially strong for:
- Chinese: No word boundaries; entity boundaries are non-trivial to define purely from tokenization.
- Arabic: Morphologically rich; word forms are highly ambiguous without semantic context.
- Japanese: Mixed script (Kanji, Hiragana, Katakana) with no spaces; entity spans require semantic knowledge to identify.
Knowledge Masking hides concepts rather than characters: it uses semantic knowledge to define masking boundaries, forcing the model to learn the identity of real-world objects from context rather than reconstruct word forms from adjacent fragments.