ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset)

ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) is a reading comprehension benchmark included in SuperGLUE. It consists of over 120,000 cloze-style queries drawn from more than 70,000 CNN and Daily Mail news articles, and each query requires commonsense reasoning to identify the correct named-entity answer in its passage. It was among the hardest SuperGLUE tasks for early models.

Task Format and Structure

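ReCoRD presents a news passage, a cloze-style query in which one entity is replaced by the token @placeholder, and a set of candidate entities taken from the passage. The task is to choose the entity that correctly fills the placeholder.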

Example: Passage: "The government announced a new stimulus package worth $1.9 trillion. Treasury Secretary Janet Yellen defended the plan before Congress. Senate Republicans expressed opposition, arguing the package was too large."

Query: "@placeholder defended the economic relief plan before the legislature."

Entities: {government, stimulus package, Janet Yellen, Congress, Senate Republicans}

Answer: Janet Yellen.

Unlike SQuAD (where answers are arbitrary text spans), ReCoRD restricts answers to named entities appearing in the passage. Unlike multiple-choice benchmarks with fixed distractors, the candidate set is derived from the passage itself, which makes the task both more naturalistic and harder.
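The format above can be sketched as a small data structure. This is a minimal illustration only: the class name `RecordExample` and its field and method names are hypothetical and do not come from any official ReCoRD loader.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecordExample:
    """One ReCoRD-style instance (field names are illustrative assumptions)."""
    passage: str
    query: str            # contains the literal token "@placeholder"
    entities: List[str]   # candidate answers, all taken from the passage
    answers: List[str]    # gold entities (a subset of `entities`)

    def fill(self, entity: str) -> str:
        """Return the query with the placeholder replaced by a candidate."""
        if entity not in self.entities:
            raise ValueError(f"{entity!r} is not a candidate entity")
        return self.query.replace("@placeholder", entity)

    def is_correct(self, entity: str) -> bool:
        """Check a candidate against the gold answers."""
        return entity in self.answers


example = RecordExample(
    passage=(
        "The government announced a new stimulus package worth $1.9 trillion. "
        "Treasury Secretary Janet Yellen defended the plan before Congress. "
        "Senate Republicans expressed opposition, arguing the package was too large."
    ),
    query="@placeholder defended the economic relief plan before the legislature.",
    entities=["government", "stimulus package", "Janet Yellen",
              "Congress", "Senate Republicans"],
    answers=["Janet Yellen"],
)

print(example.fill("Janet Yellen"))
```

Restricting `fill` to the candidate list mirrors the entity-constrained answer space: a prediction outside the passage's entities is rejected rather than scored.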

Construction Methodology

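ReCoRD was constructed from CNN/Daily Mail summary bullets: a named entity in a bullet is replaced with @placeholder to form the query, and text from the corresponding article serves as the passage.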

This construction ensures queries are genuine summaries of key article facts rather than artificially constructed questions. It also means the answer requires understanding which entity in a complex news story plays the role described in the summary.
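As a rough sketch of how a summary bullet could be turned into a cloze query, the snippet below masks one entity mention. The function name `make_cloze_query` and the sample bullet are hypothetical, and the real pipeline's entity detection and filtering stages are not reproduced here.

```python
def make_cloze_query(bullet: str, entity: str,
                     placeholder: str = "@placeholder") -> str:
    """Mask the first mention of `entity` in a summary bullet.

    Illustrative only: assumes the entity string appears verbatim in the bullet.
    """
    if entity not in bullet:
        raise ValueError(f"{entity!r} does not occur in the bullet")
    return bullet.replace(entity, placeholder, 1)


# Hypothetical summary bullet for the stimulus-package article used earlier.
bullet = "Janet Yellen defended the economic relief plan before the legislature."
query = make_cloze_query(bullet, "Janet Yellen")
print(query)
```

Because the bullet is written by an editor summarizing the article, the resulting query tends to paraphrase the passage instead of copying its wording.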

Why ReCoRD Requires Commonsense Reasoning

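Unlike SQuAD, where keyword matching often reveals the answer span, ReCoRD queries frequently paraphrase the passage or refer to its content with pronouns or different wording. In the example above, the query says "economic relief plan" and "the legislature" where the passage says "stimulus package" (later "the plan") and "Congress".

The model must understand that "legislature" means Congress and map the query description to the correct passage sentence. Naive keyword matching is unreliable because the query and the passage use different vocabulary.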
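To make the limits of lexical matching concrete, here is a toy word-overlap baseline run on the example above. It is an assumption-laden sketch (the function names, the regex sentence splitter, and the scoring rule are all illustrative), not a baseline from the ReCoRD paper.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_scores(passage: str, query: str, entities: list) -> dict:
    """Score each entity by the word overlap between the query and the
    best-matching sentence that mentions the entity."""
    sentences = re.split(r"(?<=[.!?])\s+", passage)
    query_tokens = tokens(query.replace("@placeholder", ""))
    scores = {}
    for entity in entities:
        best = 0
        for sentence in sentences:
            if entity.lower() in sentence.lower():
                # Ignore the entity's own words when counting overlap.
                context = tokens(sentence) - tokens(entity)
                best = max(best, len(query_tokens & context))
        scores[entity] = best
    return scores


passage = (
    "The government announced a new stimulus package worth $1.9 trillion. "
    "Treasury Secretary Janet Yellen defended the plan before Congress. "
    "Senate Republicans expressed opposition, arguing the package was too large."
)
query = "@placeholder defended the economic relief plan before the legislature."
entities = ["government", "stimulus package", "Janet Yellen",
            "Congress", "Senate Republicans"]

scores = overlap_scores(passage, query, entities)
print(scores)
```

On this passage the overlap narrows the search to the second sentence, but "Janet Yellen" and "Congress" receive the same score because they share that sentence. Separating them requires knowing that "the legislature" refers to Congress, which word overlap cannot supply.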

Additionally, many ReCoRD queries remain ambiguous unless the reader brings world knowledge to bear.

Evaluation Metrics

ReCoRD is evaluated using exact match (EM) and token-level F1 between the predicted entity and the reference answers.

Human performance: ~91.3 EM / ~91.7 F1. Top models (2021): ~91–92 EM, approaching human performance on this task.
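The two metrics can be sketched in the style of SQuAD evaluation. The normalization rules below (lowercasing, stripping punctuation and English articles) and the function names are assumptions for illustration; consult the official evaluation script for the exact behavior.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, golds: list) -> tuple:
    """Take the best EM and F1 over all acceptable gold answers."""
    em = max(exact_match(prediction, g) for g in golds)
    f1 = max(token_f1(prediction, g) for g in golds)
    return em, f1


print(score("Janet Yellen", ["Janet Yellen"]))
print(score("Yellen", ["Janet Yellen"]))
```

A partial prediction such as "Yellen" earns no exact-match credit but a nonzero F1, which is why the two numbers are reported together.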

ReCoRD in SuperGLUE

ReCoRD is one of the eight SuperGLUE tasks and consistently among the hardest for early SuperGLUE-era models:

Model                  ReCoRD F1
BERT-large baseline    71.3
RoBERTa-large          90.0
ALBERT-xxlarge         91.4
Human                  91.7

The rapid improvement from BERT (71.3) to RoBERTa (90.0) reflects how strongly ReCoRD benefits from improved pre-training: larger pre-training corpora that cover news text directly help with reading comprehension over news articles. Models whose pre-training data includes CNN/Daily Mail news text see dramatic improvements.

Relationship to CNN/Daily Mail Dataset

ReCoRD is the "hard version" of the CNN/Daily Mail reading comprehension dataset introduced in 2015. The original CNN/Daily Mail dataset used entity anonymization (replacing named entities with placeholders like Entity123) and was criticized for being solvable by simple matching heuristics. ReCoRD preserves real entity names and requires genuine comprehension and commonsense inference, addressing the original dataset's limitations.
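Entity anonymization of the kind described above can be illustrated with a short sketch. The marker format `@entityN` and the function name `anonymize` are assumptions chosen for the example, and this is not the original dataset's preprocessing code.

```python
def anonymize(text: str, entities: list) -> tuple:
    """Replace each entity name with an opaque marker.

    Longer names are replaced first so that a short name contained in a
    longer one does not break it apart. Returns the rewritten text and the
    marker-to-name mapping.
    """
    mapping = {}
    for index, entity in enumerate(sorted(entities, key=len, reverse=True)):
        marker = f"@entity{index}"
        mapping[marker] = entity
        text = text.replace(entity, marker)
    return text, mapping


sentence = "Treasury Secretary Janet Yellen defended the plan before Congress."
anonymized, mapping = anonymize(sentence, ["Congress", "Janet Yellen"])
print(anonymized)
print(mapping)
```

Once names become opaque markers, a model can no longer use what it knows about the entities themselves, which leaves positional and lexical cues as the main signal. ReCoRD keeps the real names so that world knowledge remains part of the task.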

Why Entity-Constrained Cloze Is Challenging

The entity-constrained answer space creates a specific challenge. The model must:

1. Parse the query to understand what type of entity is being asked about (a person? a law? an organization?).
2. Identify which passage sentences describe that type of entity doing the described action.
3. Select among multiple passage entities of the same type (multiple politicians mentioned, multiple organizations).

Step 3 is especially difficult when multiple entities could plausibly fill the role — requiring fine-grained passage comprehension rather than rough topic matching.

Applications

ReCoRD-style tasks mirror real-world applications in which a system must work out which entity a statement about a news story refers to.

ReCoRD is news reading with entity-level comprehension — a benchmark that tests whether models can extract specific factual claims from journalistic prose, identify the correct entity among multiple plausible candidates, and bridge the paraphrase gap between query formulations and passage content.
