ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) is a reading comprehension benchmark included in SuperGLUE. It pairs CNN and Daily Mail news article passages with over 120,000 cloze-style queries that require commonsense reasoning to identify the correct named-entity answer, and it is often regarded as the hardest reading comprehension task in the SuperGLUE suite.
Task Format and Structure
ReCoRD presents:
- Passage: A CNN or Daily Mail news article passage.
- Query: A cloze-style statement about the passage with a missing entity marked as @placeholder.
- Entity List: All named entities mentioned in the passage (serving as the candidate answer set).
- Task: Select the entity from the passage that correctly fills the @placeholder in the query.
Example:
Passage: "The government announced a new stimulus package worth $1.9 trillion. Treasury Secretary Janet Yellen defended the plan before Congress. Senate Republicans expressed opposition, arguing the package was too large."
Query: "@placeholder defended the economic relief plan before the legislature."
Entities: {government, stimulus package, Janet Yellen, Congress, Senate Republicans}
Answer: Janet Yellen.
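In code, an instance is just a passage, a query with the @placeholder slot, a candidate entity list, and the gold answer. The sketch below uses illustrative field names (not the exact schema of the released JSON files) to make the structure concrete:

```python
# A simplified ReCoRD-style instance (field names are illustrative,
# not the exact schema of the released data files).
example = {
    "passage": (
        "The government announced a new stimulus package worth $1.9 trillion. "
        "Treasury Secretary Janet Yellen defended the plan before Congress. "
        "Senate Republicans expressed opposition, arguing the package was too large."
    ),
    "query": "@placeholder defended the economic relief plan before the legislature.",
    # Candidate answers are the named entities found in the passage.
    "entities": ["government", "stimulus package", "Janet Yellen",
                 "Congress", "Senate Republicans"],
    "answer": "Janet Yellen",
}

def fill(query: str, entity: str) -> str:
    """Substitute a candidate entity into the cloze query."""
    return query.replace("@placeholder", entity)

# Filling the slot with each candidate makes explicit that the task is
# ranking passage entities, not generating free text.
for candidate in example["entities"]:
    print(fill(example["query"], candidate))
```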
Unlike SQuAD (where answers are arbitrary text spans), ReCoRD restricts answers to named entities appearing in the passage. Unlike multiple-choice benchmarks with fixed distractor sets, the candidate entities are derived from the passage itself, making the task more naturalistic and harder.
Construction Methodology
ReCoRD was constructed from CNN/Daily Mail summary bullets:
- CNN and Daily Mail articles contain editorial highlight bullets summarizing key facts.
- Highlight sentences were converted to cloze queries by removing one named entity mention.
- The removed entity becomes the correct answer.
- All other named entities in the article become distractors.
This construction ensures queries are genuine summaries of key article facts rather than artificially constructed questions. It also means the answer requires understanding which entity in a complex news story plays the role described in the summary.
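A minimal sketch of the cloze-conversion step, assuming an off-the-shelf spaCy NER model (the original pipeline's exact tooling and filtering steps differed):

```python
import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")

def highlight_to_cloze(highlight: str) -> list[tuple[str, str]]:
    """Turn a summary highlight into (cloze query, answer) pairs,
    one per named-entity mention, mirroring ReCoRD's construction."""
    doc = nlp(highlight)
    pairs = []
    for ent in doc.ents:
        # Replace the entity mention with the @placeholder marker.
        query = highlight[:ent.start_char] + "@placeholder" + highlight[ent.end_char:]
        pairs.append((query, ent.text))
    return pairs

for query, answer in highlight_to_cloze(
        "Janet Yellen defended the stimulus package before Congress."):
    print(query, "->", answer)
```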
Why ReCoRD Requires Commonsense Reasoning
Unlike SQuAD, where keyword matching often reveals the answer span, ReCoRD queries frequently use paraphrases, pronouns, or phrasings different from the passage:
- Passage: "Yellen defended the plan before Congress."
- Query: "@placeholder defended the economic relief plan before the legislature."
- "legislature" paraphrases "Congress"; "economic relief plan" paraphrases "stimulus package."
The model must understand that "legislature" means Congress and map the query description to the correct passage sentence. Naive keyword matching fails because query and passage use different vocabulary.
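A quick sketch makes the paraphrase gap concrete: scoring passage sentences by content-word overlap with the query leaves only generic matches like "defended" and "plan", while the discriminative paraphrase pairs ("legislature" vs. "Congress", "economic relief plan" vs. "stimulus package") contribute nothing. The stopword list here is a toy stand-in:

```python
def content_tokens(text: str) -> set[str]:
    """Lowercased tokens minus a tiny stopword list (illustrative only)."""
    stop = {"the", "a", "an", "before", "with", "was", "to", "of"}
    return {t.strip(".,").lower() for t in text.split()} - stop

query = "@placeholder defended the economic relief plan before the legislature."
sentences = [
    "The government announced a new stimulus package worth $1.9 trillion.",
    "Treasury Secretary Janet Yellen defended the plan before Congress.",
    "Senate Republicans expressed opposition, arguing the package was too large.",
]

q = content_tokens(query.replace("@placeholder", ""))
for s in sentences:
    overlap = q & content_tokens(s)
    # Only 'defended' and 'plan' survive for the answer sentence; the
    # paraphrased content words never match the passage vocabulary.
    print(f"{len(overlap)} shared tokens: {sorted(overlap)}")
```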
Additionally, many ReCoRD queries are genuinely ambiguous without world knowledge:
- "@placeholder signed the trade agreement with China." — Multiple world leaders might plausibly be the signatory; the model must read the passage carefully to identify which one.
Evaluation Metrics
ReCoRD is evaluated using:
- Exact Match (EM): Fraction of predictions exactly matching the ground truth entity string (normalized).
- Token-level F1: Partial credit for predictions sharing tokens with the ground truth, handling multi-word entity names.
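A compact sketch of both metrics, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); the official evaluation script may differ in minor details:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (SQuAD-style normalization; the official script may differ slightly)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)

def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall; gives partial
    credit for multi-word entity names."""
    pred, gold = normalize(prediction).split(), normalize(truth).split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Janet Yellen", "janet yellen"))                    # True
print(round(token_f1("Secretary Janet Yellen", "Janet Yellen"), 3))   # 0.8
```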
Human performance: ~91.3 EM / ~91.7 F1.
Top models (2021): ~91–92 EM, approaching human performance on this task.
ReCoRD in SuperGLUE
ReCoRD is one of the eight SuperGLUE tasks and consistently among the hardest for early SuperGLUE-era models:
| Model | ReCoRD F1 |
|-------|----------|
| BERT-large baseline | 71.3 |
| RoBERTa-large | 90.0 |
| ALBERT-xxlarge | 91.4 |
| Human | 91.7 |
The rapid improvement from BERT (71.3) to RoBERTa (90.0) reflects how strongly ReCoRD benefits from improved pre-training: larger pre-training corpora that include news text (RoBERTa's CC-News, for example) transfer directly to news article reading comprehension.
Relationship to CNN/Daily Mail Dataset
ReCoRD is the "hard version" of the CNN/Daily Mail reading comprehension dataset introduced in 2015. The original CNN/Daily Mail dataset used entity anonymization (replacing named entities with placeholders like Entity123) and was criticized for being solvable by simple matching heuristics. ReCoRD preserves real entity names and requires genuine comprehension and commonsense inference, addressing the original dataset's limitations.
Why Entity-Constrained Cloze Is Challenging
The entity-constrained answer space creates a specific three-step challenge. The model must:
1. Parse the query to understand what type of entity is being asked about (a person? a law? an organization?).
2. Identify which passage sentences describe that type of entity performing the described action.
3. Select among multiple passage entities of the same type (e.g., several politicians or organizations mentioned in the same story).
Step 3 is especially difficult when multiple entities could plausibly fill the role; it demands fine-grained passage comprehension rather than rough topic matching.
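One common baseline pattern for this selection step (a sketch, not the method of any particular SuperGLUE submission) is to substitute each candidate into the placeholder and score the resulting statement against the passage. The toy_scorer below is a deliberately naive stand-in for a trained NLI- or LM-based scorer:

```python
def rank_candidates(passage: str, query: str,
                    entities: list[str], scorer) -> list[tuple[float, str]]:
    """Score each candidate by substituting it into the cloze query and
    asking `scorer` how well the filled statement fits the passage.
    `scorer` is a hypothetical stand-in for a trained sentence-pair model."""
    scored = []
    for entity in entities:
        statement = query.replace("@placeholder", entity)
        scored.append((scorer(passage, statement), entity))
    return sorted(scored, reverse=True)

# Toy scorer (illustrative only): count passage words that appear in the
# filled statement; a real system would use a learned model instead.
def toy_scorer(passage: str, statement: str) -> float:
    words = set(statement.lower().split())
    return sum(w in words for w in passage.lower().split())

best = rank_candidates(
    "Treasury Secretary Janet Yellen defended the plan before Congress.",
    "@placeholder defended the economic relief plan before the legislature.",
    ["Janet Yellen", "Congress", "Senate Republicans"],
    toy_scorer,
)[0]
print(best)  # highest-scoring (score, entity) pair
```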
Applications
ReCoRD-style tasks mirror real-world applications in:
- News Summarization: Extracting key entity-action facts from articles.
- Information Extraction: Populating knowledge bases from news with entity-attribute-value triples.
- Question Answering over News: Answering factual questions about recent events requires the same passage comprehension + entity identification skills.
ReCoRD is news reading with entity-level comprehension — a benchmark that tests whether models can extract specific factual claims from journalistic prose, identify the correct entity among multiple plausible candidates, and bridge the paraphrase gap between query formulations and passage content.