Entity Prediction

Keywords: entity prediction, nlp

Entity Prediction is a pre-training or auxiliary training task in which the model must identify, classify, or link named entities in text. It explicitly supervises entity-level understanding beyond the general masked language modeling (MLM) objective, producing representations that encode the identity and type of real-world objects named in text rather than just distributional word co-occurrence statistics.

What Constitutes a Named Entity

Named entities are real-world objects with consistent proper names that can be referenced across documents:
- Person: Barack Obama, Marie Curie, Elon Musk.
- Organization: Google, United Nations, Stanford University.
- Location: Paris, Mount Everest, the Pacific Ocean.
- Date/Time: January 1, 2024; the 20th century; Q3 earnings.
- Product: iPhone 15, NVIDIA H100, GPT-4.
- Event: World War II, the 2024 Olympics, the French Revolution.

Standard language model pre-training treats these entities identically to common words — the token "Obama" receives the same training signal as "quickly" or "the." Entity prediction tasks force the model to develop specialized representations for real-world referents with consistent global identities.

Task Formulations

Named Entity Recognition (NER) as Pre-training Objective: At each position, predict the entity type label (B-PER, I-PER, B-ORG, I-ORG, O, following the BIO tagging scheme) in addition to or instead of the masked token. This trains the model to identify entity spans and types before any fine-tuning on downstream NER datasets, enabling strong zero-shot NER transfer.
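
A minimal sketch of how such an auxiliary NER head can be attached, assuming a BERT-style encoder with 768-dimensional hidden states and a 9-label BIO scheme (both illustrative choices, not details from a specific paper):

```python
import torch.nn as nn

# Assumed setup: 4 entity types in BIO format plus O -> 9 labels,
# and a BERT-base-sized hidden width. Both are illustrative.
NUM_BIO_LABELS = 9
HIDDEN_SIZE = 768

bio_head = nn.Linear(HIDDEN_SIZE, NUM_BIO_LABELS)
# ignore_index=-100 skips padding and special-token positions
token_ce = nn.CrossEntropyLoss(ignore_index=-100)

def ner_auxiliary_loss(hidden_states, bio_labels):
    """hidden_states: (batch, seq_len, HIDDEN_SIZE) from the encoder;
    bio_labels: (batch, seq_len) integer BIO labels, -100 where ignored."""
    logits = bio_head(hidden_states)  # (batch, seq_len, NUM_BIO_LABELS)
    return token_ce(logits.reshape(-1, NUM_BIO_LABELS), bio_labels.reshape(-1))
```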

Entity Typing: Given an identified entity mention span, predict its fine-grained type from a large type ontology. Hierarchical ontologies use path-style types such as /person/politician/president, /organization/company/tech_company, and /location/city/capital; Ultra-Fine Entity Typing (UFET) extends this to roughly ten thousand free-form types. Fine-grained typing requires integrating context and world knowledge.
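
Because one mention can carry many types simultaneously, fine-grained typing is usually framed as multi-label classification with an independent sigmoid per type. A sketch under assumed shapes (the ontology size and encoder width are illustrative):

```python
import torch.nn as nn

NUM_TYPES = 10_000   # illustrative ontology size
HIDDEN_SIZE = 768    # illustrative encoder width

type_head = nn.Linear(HIDDEN_SIZE, NUM_TYPES)
bce = nn.BCEWithLogitsLoss()  # independent sigmoid per type, not softmax

def typing_loss(mention_repr, type_targets):
    """mention_repr: (batch, HIDDEN_SIZE) pooled over the mention span;
    type_targets: (batch, NUM_TYPES) multi-hot vector of gold types."""
    return bce(type_head(mention_repr), type_targets.float())
```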

Entity Linking / Disambiguation: Given the text "Apple released a new product," link "Apple" to either the company (Wikidata Q312) or the fruit (Q89) based on context. Entity linking requires simultaneously understanding the linguistic context and the knowledge graph structure of candidate entities. The model must disambiguate between thousands of candidate entities sharing the same surface form.
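
One common way to implement disambiguation is to score each candidate entity by the similarity between a context-aware mention vector and the candidate's entity embedding, then pick the highest-scoring candidate. A sketch with illustrative names and shapes:

```python
import torch

def link_mention(mention_repr, candidate_embeddings, candidate_ids):
    """mention_repr: (HIDDEN_SIZE,) context-aware encoding of the mention;
    candidate_embeddings: (num_candidates, HIDDEN_SIZE);
    candidate_ids: list of knowledge-base identifiers, same order."""
    scores = candidate_embeddings @ mention_repr  # dot-product similarity
    return candidate_ids[int(scores.argmax())]

# For the surface form "Apple", the candidate set might be:
# candidate_ids = ["Q312 (Apple Inc.)", "Q89 (apple, the fruit)"]
```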

Entity Slot Filling (LAMA Probing): Given a template "Barack Obama was born in [MASK]," predict the entity that fills the slot. Tests factual recall encoded in model parameters — knowledge acquired during pre-training rather than provided in context. The LAMA benchmark uses such templates to assess how much structured world knowledge language models implicitly store.
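
Such cloze-style probes are easy to run against any masked language model; the sketch below uses the Hugging Face fill-mask pipeline with bert-base-uncased (the model choice is incidental):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Barack Obama was born in [MASK].", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
# A model with strong factual recall should rank a plausible
# location such as "hawaii" near the top.
```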

LUKE — The Entity-Centric Architecture

LUKE (Language Understanding with Knowledge-based Embeddings, 2020) provides the canonical implementation of entity prediction as pre-training:

- Input Representation: Text tokens from standard tokenization + entity spans identified by linking Wikipedia anchor texts.
- Entity Embedding Table: A separate embedding table for 500,000 Wikipedia entities, updated during pre-training alongside word embeddings.
- Dual Masking Objective: At each training step, independently mask some word tokens (standard MLM) and some entity spans (entity prediction task).
- Entity Prediction: Predict masked entity identities from surrounding textual context and visible entity context.
- Extended Self-Attention: Modified attention mechanism handles word-word, word-entity, and entity-entity attention pairs simultaneously, allowing the model to reason about relationships between multiple entities in the same passage.

At time of publication, LUKE achieved state-of-the-art results on entity-centric tasks including NER (CoNLL-2003), relation extraction (TACRED), entity typing (Open Entity), and reading comprehension (ReCoRD and SQuAD 1.1), demonstrating that explicit entity supervision substantially improves entity-centric downstream performance.
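
LUKE is available in Hugging Face Transformers, which makes the dual word/entity input concrete. A usage sketch in which character offsets mark the entity mentions:

```python
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Barack Obama was born in Hawaii."
entity_spans = [(0, 12), (25, 31)]  # character spans of "Barack Obama", "Hawaii"
inputs = tokenizer(text, entity_spans=entity_spans,
                   add_prefix_space=True, return_tensors="pt")

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)         # word-token representations
print(outputs.entity_last_hidden_state.shape)  # entity representations
```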

ERNIE (Tsinghua) — Knowledge Graph Integration

ERNIE from Tsinghua University (distinct from Baidu's ERNIE) integrates entity knowledge through a knowledge fusion architecture:

- Dual Encoder: Separate text encoder (BERT-based) and entity encoder (trained on knowledge graph triples using TransE).
- Fusion Layer: Combines token-level representations with entity embeddings by projecting both into a shared semantic space.
- Denoising Objective: Predicts entity-text alignments that have been deliberately corrupted, forcing the model to learn correct entity-context associations.
- Entity Alignment: Aligns entity mentions in text with knowledge graph entries through named entity linking during pre-training.
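
The TransE objective used for ERNIE's entity embeddings treats a relation as a translation in embedding space: a triple (head, relation, tail) is plausible when head + relation ≈ tail. A minimal scoring sketch:

```python
import torch

def transe_score(head, relation, tail, p=2):
    """head, relation, tail: (dim,) embeddings; higher score = more plausible."""
    return -torch.norm(head + relation - tail, p=p)

# Training pushes true triples above corrupted ones, e.g.
# ("Barack Obama", "born_in", "Hawaii") vs. ("Barack Obama", "born_in", "Paris").
```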

Benefits Across Downstream Tasks

| Task | How Entity Prediction Helps |
|------|-----------------------------|
| Named Entity Recognition | Model already encodes entity spans and type categories |
| Relation Extraction | Entity embeddings encode relational context from KG |
| Entity Linking | Pre-trained disambiguation reduces fine-tuning data needs |
| Open-Domain QA | Factual entities are directly recalled from parameters |
| Coreference Resolution | Entity identity is explicitly represented across mentions |
| Slot Filling | Template-based entity recall is strengthened |
| Information Extraction | Structured fact extraction benefits from entity awareness |

Complementarity with MLM

MLM and entity prediction are complementary objectives. MLM teaches syntactic structure, function word usage, and local distributional semantics. Entity prediction teaches that specific spans refer to real-world objects with consistent identities across documents and across time. Together, they produce models that understand both language structure and world knowledge — the combination essential for knowledge-intensive NLP tasks where factual accuracy matters.
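
In practice the two objectives are typically combined as a weighted sum of per-batch losses, LUKE's dual masking being one instance. A schematic sketch (the weight is an assumption, not a published value):

```python
def joint_pretraining_loss(mlm_loss, entity_loss, entity_weight=1.0):
    """Combined objective: standard MLM plus a weighted entity-prediction term."""
    return mlm_loss + entity_weight * entity_loss
```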

Entity Prediction is teaching the model who's who: explicitly supervising the model to identify, classify, and link the real-world objects named in text, building the factual knowledge base that pure distributional learning from token co-occurrence statistics cannot provide.
