Cloze Task is the psycholinguistic and reading comprehension assessment in which participants fill in words deleted from a text. Formalized by Wilson Taylor in 1953, it is the direct intellectual ancestor of masked language modeling (MLM), which BERT scaled into the most influential self-supervised pre-training objective in modern NLP.
Historical Origins
Wilson L. Taylor introduced the Cloze Task in 1953 in "Cloze Procedure: A New Tool for Measuring Readability." The name derives from the Gestalt psychology concept of "closure" — the human tendency to mentally complete incomplete perceptual patterns. Taylor's insight was that a reader's ability to fill in deleted words from a text directly measures their comprehension of and familiarity with the language and content.
The original application was educational measurement: by deleting every N-th word from a passage (typically every 5th) and asking readers to fill in the blanks, readability researchers could quantify how accessible a text was to a given population without relying on subjective expert judgment.
Original Cloze Task Formats
Fixed-Ratio Deletion: Delete every 5th (or 7th, or 10th) word mechanically. Produces an objective, reproducible test. Example:
"The quick brown fox [___] over the lazy [___]. It was [___] a beautiful [___]."
Rational Deletion: Select words for deletion based on semantic importance — delete nouns and verbs preferentially over function words. More targeted but requires human judgment in test construction.
Exact-Word Scoring: Only the original deleted word counts as correct. Strict, reliable, but penalizes synonyms that preserve meaning equally well.
Acceptable-Word Scoring: Any contextually appropriate word counts as correct. More generous and arguably measures comprehension more validly than exact matching, but requires human scoring.
The Bridge to Machine Learning: Pre-BERT Applications
Cloze format appeared in ML contexts before BERT. Key milestones:
Children's Book Test (CBT, 2015): Created from Project Gutenberg children's books. Each question asks a model to choose the correct word (from 10 candidates) to fill a blank in the final sentence of a short passage. Separate evaluations for named entities, common nouns, verbs, and prepositions allowed dissecting which types of context different model architectures could leverage.
CNN/Daily Mail Reading Comprehension (2015): Reformulated news article bullet-point summaries as cloze items over anonymized entity mentions — replacing named entities with placeholder markers (e.g., @entity14) so that questions cannot be answered from world knowledge alone. Established reading comprehension as a tractable ML benchmark using automatic cloze construction from existing editorial structure (sketched in code after these milestones).
LAMBADA (2016): Predict the final word of a passage where the correct prediction requires understanding the entire preceding narrative context, not just the immediately preceding sentence. Specifically curated to require document-level comprehension rather than local context.
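As a rough illustration of the CNN/Daily Mail construction, a simplified sketch of entity anonymization and cloze creation from a summary bullet (the @entity/@placeholder markers follow the released dataset's conventions; the mapping logic here is a deliberate simplification):

```python
def make_cloze_item(summary, article_entities, answer_entity):
    """Anonymize entity mentions with @entityN markers, then blank the
    answer entity in the summary sentence to form a cloze question."""
    mapping = {ent: f"@entity{i}" for i, ent in enumerate(article_entities)}
    question = summary
    for ent, marker in mapping.items():
        question = question.replace(ent, marker)
    # The answer entity's marker becomes the blank to be filled.
    question = question.replace(mapping[answer_entity], "@placeholder")
    return question, mapping[answer_entity]

q, a = make_cloze_item(
    "Obama met Merkel in Berlin on Tuesday",
    article_entities=["Obama", "Merkel", "Berlin"],
    answer_entity="Merkel",
)
# q: "@entity0 met @placeholder in @entity2 on Tuesday", a: "@entity1"
```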
BERT and the Industrialization of Cloze
BERT (Devlin et al., 2018) transformed the cloze task from an evaluation tool into a training objective, scaling it to billions of examples:
- Scale: Applied to the entirety of English Wikipedia (2.5 billion words) plus BooksCorpus (0.8 billion words).
- Automated Supervision: No human readers needed — the model generates its own supervision by randomly masking tokens and predicting them against the original.
- 15% Random Masking with Three Variants: 15% of input tokens are selected as prediction targets; of those (see the sketch after this list):
- 80% → replaced with the [MASK] token (the standard case).
- 10% → replaced with a random vocabulary token (the model cannot trust the observed token, so it must keep a useful contextual representation of every position).
- 10% → left unchanged (biases the representation toward the actual observed word and reduces the mismatch with fine-tuning, where [MASK] never appears).
- Bidirectionality: BERT reads the entire context simultaneously, conditioning each prediction on both the left and right context of the blank. This differs from left-to-right language modeling (GPT) and produces richer contextual representations for understanding tasks.
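A minimal sketch of the 80/10/10 masking scheme, operating on an already tokenized sequence (function and variable names are illustrative, not taken from the BERT codebase):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=random):
    """BERT-style masking: select ~15% of positions as prediction targets,
    then apply the 80/10/10 treatment to the selected positions."""
    inputs = list(tokens)
    labels = [None] * len(tokens)          # None = position is not a prediction target
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # ~85% of positions are left untouched
        labels[i] = tok                    # loss is computed against the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the original token in place
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_for_mlm(tokens, vocab=tokens)
```

Because masking is random, the same sentence yields different training examples on different passes, which is how a fixed corpus supplies effectively unlimited cloze items.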
Human Cloze vs. MLM: Key Differences
| Aspect | Taylor's Cloze (1953) | BERT MLM |
|--------|----------------------|----------|
| Deletion method | Every N-th word | Random 15% |
| Target focus | Content words (semantic) | All tokens including function words |
| Context window | Full document | 512-token window |
| Scale | Hundreds of sentences | Billions of tokens |
| Evaluation | Human judgment | Cross-entropy loss |
| Purpose | Readability measurement | Representation learning |
| Directionality | Sequential reading | Fully bidirectional |
Zero-Shot Evaluation via Cloze Format
Cloze format enables zero-shot evaluation of language models for factual knowledge:
The LAMA benchmark converts knowledge graph triples into cloze questions:
- "The capital of France is [MASK]." → Expected: "Paris."
- "Barack Obama was born in [MASK]." → Expected: "Honolulu."
- "Penicillin was discovered by [MASK]." → Expected: "Fleming."
By measuring the probability a language model assigns to the correct answer vs. competitors in cloze format, researchers assess how much factual world knowledge was encoded during pre-training — without any fine-tuning or in-context examples.
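As a concrete illustration, a minimal sketch of LAMA-style probing using the Hugging Face transformers fill-mask pipeline (the model choice and prompts are illustrative; LAMA's own prompts come from its templates):

```python
from transformers import pipeline

# A pre-trained MLM scores candidate fillers for the [MASK] position;
# no fine-tuning and no in-context examples are involved.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prompt, expected in [
    ("The capital of France is [MASK].", "paris"),
    ("Penicillin was discovered by [MASK].", "fleming"),
]:
    predictions = unmasker(prompt, top_k=5)   # list of dicts with 'token_str' and 'score'
    top = predictions[0]["token_str"]
    hit = any(p["token_str"] == expected for p in predictions)
    print(f"{prompt} -> {top!r} (expected {expected!r}, in top-5: {hit})")
```

Accuracy at rank 1 (or precision@k) over many such prompts gives a rough measure of how much relational knowledge the pre-trained model stores.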
Cloze in Major NLP Benchmarks
- Children's Book Test: Entity and common noun prediction in narrative text.
- ReCoRD (SuperGLUE): Cloze over CNN/DailyMail news articles requiring commonsense reasoning.
- LAMBADA: Final-word prediction requiring document-level narrative comprehension.
- Winograd Schema Challenge: Binary cloze with pronoun resolution requiring commonsense reasoning to distinguish referents.
- SWAG / HellaSwag: Sentence completion from multiple choices requiring commonsense inference about likely continuations.
Cloze Task is the 1950s classroom exercise that became the foundation of modern language model pre-training — a fill-in-the-blank procedure designed to measure human reading comprehension that, when scaled to billions of examples with bidirectional context, teaches neural networks the statistical and semantic structure of natural language.