Cloze Task is the psycholinguistic and reading comprehension assessment in which participants fill in words deleted from a text. Formalized by Wilson Taylor in 1953, it is the direct intellectual ancestor of masked language modeling (MLM), which BERT scaled into the most influential self-supervised pre-training objective in modern NLP.
Historical Origins
Wilson L. Taylor introduced the Cloze Task in 1953 in "Cloze Procedure: A New Tool for Measuring Readability." The name derives from the Gestalt psychology concept of "closure" — the human tendency to mentally complete incomplete perceptual patterns. Taylor's insight was that a reader's ability to fill in deleted words from a text directly measures their comprehension of and familiarity with the language and content.
The original application was educational measurement: by deleting every N-th word from a passage (typically every 5th) and asking readers to fill in the blanks, readability researchers could quantify how accessible a text was to a given population without relying on subjective expert judgment.
Original Cloze Task Formats
Fixed-Ratio Deletion: Delete every 5th (or 7th, or 10th) word mechanically. Produces an objective, reproducible test. Example:
"The quick brown fox [___] over the lazy [___]. It was [___] a beautiful [___]."
Rational Deletion: Select words for deletion based on semantic importance — delete nouns and verbs preferentially over function words. More targeted but requires human judgment in test construction.
Exact-Word Scoring: Only the original deleted word counts as correct. Strict, reliable, but penalizes synonyms that preserve meaning equally well.
Acceptable-Word Scoring: Any contextually appropriate word counts as correct. More generous and arguably measures comprehension more validly than exact matching, but requires human scoring.
The Bridge to Machine Learning: Pre-BERT Applications
Cloze format appeared in ML contexts before BERT. Key milestones:
Children's Book Test (CBT, 2015): Created from Project Gutenberg children's books. Each question asks a model to choose the correct word (from 10 candidates) to fill a blank in the final sentence of a short passage. Separate evaluations for named entities, common nouns, verbs, and prepositions allowed dissecting which types of context different model architectures could leverage.
CNN/Daily Mail Reading Comprehension (2015): Reformulated news article bullet-point summaries as cloze items over anonymized entity mentions — replacing named entities with placeholder markers (e.g., @entity14) so that questions cannot be answered from world knowledge alone. Established reading comprehension as a tractable ML benchmark using automatic cloze construction from existing editorial structure (sketched in code after these milestones).
LAMBADA (2016): Predict the final word of a passage where the correct prediction requires understanding the entire preceding narrative context, not just the immediately preceding sentence. Specifically curated to require document-level comprehension rather than local context.
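As a rough illustration of the CNN/Daily Mail construction, a simplified sketch of entity anonymization and cloze creation from a summary bullet (the @entity/@placeholder markers follow the released dataset's conventions; the mapping logic here is a deliberate simplification):

```python
def make_cloze_item(summary, article_entities, answer_entity):
    """Anonymize entity mentions with @entityN markers, then blank the
    answer entity in the summary sentence to form a cloze question."""
    mapping = {ent: f"@entity{i}" for i, ent in enumerate(article_entities)}
    question = summary
    for ent, marker in mapping.items():
        question = question.replace(ent, marker)
    # The answer entity's marker becomes the blank to be filled.
    question = question.replace(mapping[answer_entity], "@placeholder")
    return question, mapping[answer_entity]

q, a = make_cloze_item(
    "Obama met Merkel in Berlin on Tuesday",
    article_entities=["Obama", "Merkel", "Berlin"],
    answer_entity="Merkel",
)
# q: "@entity0 met @placeholder in @entity2 on Tuesday", a: "@entity1"
```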
BERT and the Industrialization of Cloze
BERT (Devlin et al., 2018) transformed the cloze task from an evaluation tool into a training objective, scaling it to billions of examples:
- Scale: Applied to the entirety of English Wikipedia (2.5 billion words) plus BooksCorpus (0.8 billion words).
- Automated Supervision: No human readers needed — the model generates its own supervision by randomly masking tokens and predicting them against the original.
- 15% Random Masking with Three Variants: 15% of input tokens are selected as prediction targets; of those (see the sketch after this list):
- 80% → replaced with the [MASK] token (the standard case).
- 10% → replaced with a random vocabulary token (the model cannot trust the observed token, so it must keep a useful contextual representation of every position).
- 10% → left unchanged (biases the representation toward the actual observed word and reduces the mismatch with fine-tuning, where [MASK] never appears).
- Bidirectionality: BERT reads the entire context simultaneously, conditioning each prediction on both the left and right context of the blank. This differs from left-to-right language modeling (GPT) and produces richer contextual representations for understanding tasks.
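A minimal sketch of the 80/10/10 masking scheme, operating on an already tokenized sequence (function and variable names are illustrative, not taken from the BERT codebase):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=random):
    """BERT-style masking: select ~15% of positions as prediction targets,
    then apply the 80/10/10 treatment to the selected positions."""
    inputs = list(tokens)
    labels = [None] * len(tokens)          # None = position is not a prediction target
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # ~85% of positions are left untouched
        labels[i] = tok                    # loss is computed against the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the original token in place
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_for_mlm(tokens, vocab=tokens)
```

Because masking is random, the same sentence yields different training examples on different passes, which is how a fixed corpus supplies effectively unlimited cloze items.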
Human Cloze vs. MLM: Key Differences
| Aspect | Taylor's Cloze (1953) | BERT MLM |
|--------|----------------------|----------|
| Deletion method | Every N-th word | Random 15% |
| Target focus | Content words (semantic) | All tokens including function words |
| Context window | Full document | 512-token window |
| Scale | Hundreds of sentences | Billions of tokens |
| Evaluation | Human judgment | Cross-entropy loss |
| Purpose | Readability measurement | Representation learning |
| Directionality | Sequential reading | Fully bidirectional |
Zero-Shot Evaluation via Cloze Format
Cloze format enables zero-shot evaluation of language models for factual knowledge:
The LAMA benchmark converts knowledge graph triples into cloze questions:
- "The capital of France is [MASK]." → Expected: "Paris."
- "Barack Obama was born in [MASK]." → Expected: "Honolulu."
- "Penicillin was discovered by [MASK]." → Expected: "Fleming."
By measuring the probability a language model assigns to the correct answer vs. competitors in cloze format, researchers assess how much factual world knowledge was encoded during pre-training — without any fine-tuning or in-context examples.
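As a concrete illustration, a minimal sketch of LAMA-style probing using the Hugging Face transformers fill-mask pipeline (the model choice and prompts are illustrative; LAMA's own prompts come from its templates):

```python
from transformers import pipeline

# A pre-trained MLM scores candidate fillers for the [MASK] position;
# no fine-tuning and no in-context examples are involved.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prompt, expected in [
    ("The capital of France is [MASK].", "paris"),
    ("Penicillin was discovered by [MASK].", "fleming"),
]:
    predictions = unmasker(prompt, top_k=5)   # list of dicts with 'token_str' and 'score'
    top = predictions[0]["token_str"]
    hit = any(p["token_str"] == expected for p in predictions)
    print(f"{prompt} -> {top!r} (expected {expected!r}, in top-5: {hit})")
```

Accuracy at rank 1 (or precision@k) over many such prompts gives a rough measure of how much relational knowledge the pre-trained model stores.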
Cloze in Major NLP Benchmarks
- Children's Book Test: Entity and common noun prediction in narrative text.
- ReCoRD (SuperGLUE): Cloze over CNN/DailyMail news articles requiring commonsense reasoning.
- LAMBADA: Final-word prediction requiring document-level narrative comprehension.
- Winograd Schema Challenge: Binary cloze with pronoun resolution requiring commonsense reasoning to distinguish referents.
- SWAG / HellaSwag: Sentence completion from multiple choices requiring commonsense inference about likely continuations.
Cloze Task is the 1950s classroom exercise that became the foundation of modern language model pre-training — a fill-in-the-blank procedure designed to measure human reading comprehension that, when scaled to billions of examples with bidirectional context, teaches neural networks the statistical and semantic structure of natural language.