MultiRC (Multi-Sentence Reading Comprehension) is a reading comprehension benchmark in which questions may have multiple correct answers and answering requires integrating evidence from multiple non-adjacent sentences. It challenges the single-span, single-sentence assumptions of SQuAD and tests a model's ability to perform comprehensive, multi-evidence reasoning across an entire passage.
Design Motivations
MultiRC was designed to address two specific limitations of SQuAD and similar reading comprehension benchmarks:
Single-Span Assumption: SQuAD answers are always contiguous text spans. Many real questions have answers that are non-contiguous, require synthesis, or have multiple valid answer components. "What were the causes of World War I?" cannot be answered by a single span.
Single-Sentence Evidence: Most SQuAD questions can be answered from a single sentence in the passage. MultiRC specifically selects questions requiring evidence integration across multiple non-adjacent sentences — testing paragraph-level comprehension rather than sentence-level retrieval.
Task Format
MultiRC uses a multi-label binary classification format:
Passage: A multi-paragraph document (500–1000 words).
Question: "Which of the following contributed to the outcome?"
Answer Choices: 5–9 candidate answers, each labeled True or False independently.
Task: For each candidate answer, predict True or False (multiple correct answers possible).
Example:
Question: "What were the effects of the economic crisis?"
Choices:
(a) "Unemployment rose sharply." → True ✓
(b) "Inflation decreased." → False ✗
(c) "Several banks failed." → True ✓
(d) "GDP growth accelerated." → False ✗
(e) "Government spending increased." → True ✓
The model must verify each candidate independently; getting (a) correct does not imply getting (e) correct, since each requires finding and evaluating different evidence in the passage.
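A single instance can be sketched as one passage, one question, and a set of independently labeled candidates. The Python representation below is illustrative only; the field names are ours, not the official SuperGLUE JSON schema:

```python
from dataclasses import dataclass

@dataclass
class AnswerCandidate:
    text: str
    label: bool  # True if the passage supports this candidate

@dataclass
class MultiRCInstance:
    passage: str                       # multi-paragraph document
    question: str
    candidates: list[AnswerCandidate]  # each one judged independently

example = MultiRCInstance(
    passage="...",  # full passage text omitted here
    question="What were the effects of the economic crisis?",
    candidates=[
        AnswerCandidate("Unemployment rose sharply.", True),
        AnswerCandidate("Inflation decreased.", False),
        AnswerCandidate("Several banks failed.", True),
        AnswerCandidate("GDP growth accelerated.", False),
        AnswerCandidate("Government spending increased.", True),
    ],
)
```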
Dataset Construction
- Source: Diverse text genres including news, fiction, historical texts, biomedical abstracts, and elementary science articles.
- Question writing: Human annotators were instructed to write questions that require reading multiple sentences from the passage.
- Answer writing: Multiple candidates per question, mix of correct and incorrect answers.
- Scale: 6,000+ questions across 800 passages; each question has 5–9 answer candidates.
- Human performance: ~86% F1m (macro-averaged F1); human EM is far lower (roughly 57% in the original paper), since exact match requires judging every candidate for a question correctly.
Evaluation Metrics
MultiRC requires specialized metrics because standard accuracy and F1 do not account for its multi-label structure:
Exact Match (EM): A question is counted as correct only if ALL of its answer candidates are correctly classified. Very strict: getting 4 out of 5 candidates right on a question scores zero.
F1m (Macro-Averaged F1): For each question, compute binary classification F1 (treating True as the positive class), then average these F1 scores across all questions. More forgiving than EM and the original paper's primary metric; it rewards partial credit for partially correct multi-label predictions.
F1a (Micro-Averaged F1): Compute a single F1 over all individual answer-candidate classifications, ignoring question boundaries. Useful for diagnosing specific types of classification errors; SuperGLUE reports F1a alongside EM.
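Under the definitions above, all three metrics are easy to state precisely. The following is a minimal sketch in plain Python (the official evaluation script may handle edge cases, such as questions with no True candidates, differently):

```python
def binary_f1(preds, golds):
    """F1 with True as the positive class."""
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def multirc_metrics(questions):
    """questions: list of (predictions, gold_labels) pairs, one per
    question, each a list of booleans over that question's candidates."""
    em = sum(p == g for p, g in questions) / len(questions)
    f1m = sum(binary_f1(p, g) for p, g in questions) / len(questions)
    # F1a pools every candidate across all questions before computing F1.
    all_preds = [x for p, _ in questions for x in p]
    all_golds = [x for _, g in questions for x in g]
    return {"EM": em, "F1m": f1m, "F1a": binary_f1(all_preds, all_golds)}

# One question, 4 of 5 candidates right: EM is 0, F1m gives partial credit.
preds = [True, False, True, False, False]
golds = [True, False, True, False, True]
print(multirc_metrics([(preds, golds)]))  # {'EM': 0.0, 'F1m': 0.8, 'F1a': 0.8}
```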
Why MultiRC Is Harder than SQuAD
No Span Extraction: Models cannot rely on locating a highlighted span; they must evaluate free-form candidate answer strings against passage evidence.
Multi-Label Complexity: The model must identify ALL correct answers, not just the single best answer. Missing one correct answer or including one incorrect answer counts against performance; a common modeling recipe, sketched at the end of this section, therefore scores each candidate as an independent binary classification.
Multi-Sentence Evidence: Evidence for a single answer candidate may require:
- Reading an initial fact from paragraph 1.
- Connecting it to a qualification in paragraph 3.
- Comparing against a counterexample in paragraph 2.
This requires genuine long-range comprehension, not just sentence-level retrieval.
Distractor Quality: Incorrect answer candidates are plausibly related to the question topic, requiring the model to distinguish relevant from irrelevant facts.
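One common way to model this format, and roughly what the SuperGLUE BERT baseline does, is to score each (passage, question, candidate) triple as an independent binary classification. Below is a sketch using the Hugging Face transformers library; the checkpoint is a stand-in, and the classification head starts untrained, so the model would first need fine-tuning on MultiRC:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any BERT-style encoder works the same way; this checkpoint is a stand-in.
NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=2)
model.eval()

def score_candidate(passage: str, question: str, candidate: str) -> float:
    """Probability that one candidate answer is True for this passage.
    The question and candidate share the second segment, so the encoder
    sees [CLS] passage [SEP] question candidate [SEP]."""
    inputs = tokenizer(passage, f"{question} {candidate}",
                       truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)[0, 1].item()

# Each candidate is scored separately; a 0.5 threshold yields True/False.
# predictions = [score_candidate(passage, q, c) > 0.5 for c in candidates]
```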
MultiRC in SuperGLUE
MultiRC is one of eight SuperGLUE tasks; its per-task score (SuperGLUE averages F1a and EM for MultiRC) contributes to the overall SuperGLUE aggregate. Models that perform well on single-sentence, single-answer tasks (like BoolQ) often struggle on MultiRC due to the multi-label complexity:
| Model | MultiRC F1m |
|-------|------------|
| BERT-large baseline | 70.0 |
| RoBERTa-large | 84.4 |
| ALBERT-xxlarge | 87.4 |
| Human | 86.4 |
ALBERT-xxlarge surpasses human performance on MultiRC F1m, but human Exact Match is much harder to surpass, as humans are more consistent across all the answer candidates within a question.
Multi-Evidence Retrieval Challenge
MultiRC motivates research in multi-hop reading comprehension — the ability to chain evidence from multiple text locations to reach a conclusion:
- Attention Visualization: MultiRC reveals that correct answers require attention patterns spanning multiple paragraphs, not just local context.
- Graph-Based Reasoning: Some approaches model MultiRC as a graph problem: passage sentences are nodes, semantic relationships are edges, and reasoning paths trace from question to evidence to answer.
- Retrieval-Augmented Models: MultiRC motivates sentence-level retrieval before candidate verification: first identify the relevant sentences, then evaluate each candidate against those sentences (see the sketch after this list).
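As a minimal illustration of retrieve-then-verify, the sketch below ranks passage sentences by TF-IDF similarity to the question plus candidate. TF-IDF is a stand-in for the learned retrievers used in practice; the returned evidence would then be passed to a verifier such as the classifier sketched earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_evidence(sentences: list[str], question: str,
                      candidate: str, k: int = 3) -> list[str]:
    """Return the k passage sentences most similar to question + candidate."""
    query = f"{question} {candidate}"
    vectorizer = TfidfVectorizer().fit(sentences + [query])
    sentence_vecs = vectorizer.transform(sentences)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, sentence_vecs)[0]
    top = scores.argsort()[::-1][:k]        # indices of the top-k sentences
    return [sentences[i] for i in sorted(top)]  # preserve passage order
```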
MultiRC is the "select all that apply" reading test: a benchmark that forces comprehensive, multi-evidence reading rather than single-span retrieval, evaluating whether models can verify multiple independent claims against a complex multi-paragraph passage at once.