DuoRC is a reading comprehension benchmark built from pairs of semantically equivalent but lexically different movie plot summaries, designed to test whether a QA system can answer questions when the wording of the evidence passage differs substantially from the wording of the question. Released by Saha et al. in 2018, DuoRC exposed a major weakness in earlier machine reading models: many systems appeared strong on benchmarks like SQuAD because they relied on lexical overlap and span matching, but failed when asked to generalize across paraphrase, abstraction, and shifts in narrative style.
How DuoRC Is Constructed
The dataset uses two plot summaries for the same movie:
- Wikipedia plot: Usually shorter, cleaner, and more encyclopedic
- IMDb plot: Often longer, more narrative, and worded quite differently
Annotators write questions from one version (in the original collection, the shorter Wikipedia plot), while in the harder setting the model must answer using only the other version. That means:
- Key entities may be described differently
- Event order may be compressed or rephrased
- Specific words from the question may never appear in the target passage
This breaks the shortcut used by many extractive QA models: scanning for keyword overlap and copying a span.
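For concreteness, here is a minimal sketch of inspecting the dataset with the Hugging Face `datasets` library. The hub id `ibm/duorc` and the field names below follow the public dataset card; treat them as assumptions to verify against your installed version.

```python
# Minimal sketch: inspect one ParaphraseRC example via Hugging Face datasets.
# Hub id and field names follow the public dataset card (assumed, not
# guaranteed to match every version).
from datasets import load_dataset

paraphrase = load_dataset("ibm/duorc", "ParaphraseRC", split="train")

ex = paraphrase[0]
print(ex["title"])        # movie title shared by both plot versions
print(ex["question"])     # written against one plot version
print(ex["plot"][:300])   # evidence passage: the *other* plot version
print(ex["answers"])      # list of acceptable answer strings
print(ex["no_answer"])    # True when the evidence plot omits the fact
```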
Two Main Tasks in DuoRC
| Setting | Description | Difficulty |
|--------|-------------|------------|
| SelfRC | Question and answer evidence come from the same plot version | Easier |
| ParaphraseRC | Question written from one plot version, answer from the other | Harder and more realistic |
SelfRC is similar to conventional reading comprehension. ParaphraseRC is the real contribution because it forces semantic matching rather than string matching.
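To make the gap between the two settings concrete, here is a toy lexical-overlap reader of exactly the kind ParaphraseRC is designed to defeat. Everything here is illustrative, not a reference implementation.

```python
# A toy "reader" that slides a window over the passage and returns the
# span whose tokens overlap most with the question. On SelfRC the question
# reuses the passage's wording, so this heuristic often lands near the
# answer; on ParaphraseRC the shared tokens are largely absent, and the
# "best" window is close to arbitrary.
import re

def naive_span_reader(question: str, passage: str, window: int = 10) -> str:
    q_tokens = set(re.findall(r"\w+", question.lower()))
    p_tokens = re.findall(r"\w+", passage.lower())
    best_score, best_span = -1, ""
    for i in range(max(1, len(p_tokens) - window + 1)):
        span = p_tokens[i : i + window]
        score = len(q_tokens & set(span))
        if score > best_score:
            best_score, best_span = score, " ".join(span)
    return best_span
```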
Example of the Core Challenge
Suppose the IMDb plot says:
- "A grieving detective tracks a suspect across several cities before discovering the killer is someone close to him."
And the Wikipedia plot says:
- "The investigator follows leads nationwide and eventually learns that the murderer is a trusted associate."
A question written from one version such as "Who turns out to be responsible for the murder?" may require reasoning over descriptions that use entirely different words. A shallow span-matching system will fail even though the story content is the same.
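Running the two plot snippets above through a simple token-overlap check makes the point quantitatively; the tokenization and stopword list are ad hoc.

```python
# Token overlap between the question and each plot version. Both overlaps
# come out empty: even "murder" fails to token-match "murderer" or "killer".
import re

STOP = {"a", "the", "to", "is", "who", "be", "and", "that", "out", "for"}

def content_tokens(text: str) -> set:
    return {t for t in re.findall(r"\w+", text.lower()) if t not in STOP}

imdb = ("A grieving detective tracks a suspect across several cities "
        "before discovering the killer is someone close to him.")
wiki = ("The investigator follows leads nationwide and eventually learns "
        "that the murderer is a trusted associate.")
question = "Who turns out to be responsible for the murder?"

q = content_tokens(question)
for name, plot in [("IMDb", imdb), ("Wikipedia", wiki)]:
    print(f"{name}: {q & content_tokens(plot)}")  # empty set for both
```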
Why DuoRC Mattered Historically
When DuoRC was introduced, it highlighted three important facts:
1. Lexical overlap had inflated benchmark performance: Systems scoring well on SQuAD were often exploiting surface phrase matching and answer-style artifacts rather than understanding the text
2. Semantic understanding is much harder: Real-world documents rarely restate the same fact in identical wording
3. Machine reading needed retrieval, reasoning, and paraphrase robustness: Not just span extraction
This helped push the field toward models with stronger contextual understanding, pretraining, and eventually LLM-based QA systems.
Model Performance and Evolution
Early neural QA models struggled badly on ParaphraseRC:
- Span-based readers in the BiDAF and Match-LSTM family saw steep drops on ParaphraseRC relative to their results on easier QA datasets
- Even with answer generation or span-ranking variants, the performance gap remained substantial
Pretrained transformers improved results:
- BERT/RoBERTa: Better contextual matching and paraphrase sensitivity
- T5/UnifiedQA: Stronger text-to-text formulation for QA
- GPT-4/Claude/Gemini era: Frontier LLMs perform dramatically better because they bring large-scale world knowledge, paraphrase handling, and latent narrative reasoning
However, DuoRC remains useful as a diagnostic benchmark because it measures robustness to rewording, which still matters in production QA and RAG systems.
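As a sketch of how such an evaluation is typically run today, the snippet below applies an off-the-shelf extractive reader to ParaphraseRC. The model choice is illustrative, and scoring details (SQuAD-style exact-match/token-F1 over the list of acceptable answers, separate handling of unanswerable items) are omitted.

```python
# Run an off-the-shelf extractive reader on one ParaphraseRC example,
# assuming the `transformers` and `datasets` libraries. The model name is
# illustrative; any extractive QA checkpoint would do.
from datasets import load_dataset
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
data = load_dataset("ibm/duorc", "ParaphraseRC", split="validation")

sample = data[0]
pred = qa(question=sample["question"], context=sample["plot"])
print(pred["answer"], "| gold:", sample["answers"])
```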
Why DuoRC Still Matters for Production AI
Modern enterprise QA systems face the DuoRC problem constantly:
- A customer asks a support question using different phrasing than the knowledge base article
- A lawyer asks about a clause using plain English while the contract uses dense formal language
- An engineer asks about a hardware failure mode using a shorthand term not used in the official incident report
If a model only works when wording matches exactly, it is not useful in production. DuoRC is therefore a good benchmark for semantic retrieval and reading systems.
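This is why embedding-based semantic retrieval became standard. Below is a minimal sketch using the `sentence-transformers` library; the model name is illustrative and the knowledge-base snippets are invented.

```python
# Semantic retrieval sketch: match a user question to a knowledge-base
# article despite near-zero word overlap. Model name and data are
# illustrative assumptions, not a production recommendation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

kb_articles = [
    "To reset your credentials, open Account Settings and choose "
    "'Change password'.",
    "Invoices are generated on the first business day of each month.",
]
user_question = "I forgot my login, how do I get back into my account?"

q_emb = model.encode(user_question, convert_to_tensor=True)
kb_emb = model.encode(kb_articles, convert_to_tensor=True)
scores = util.cos_sim(q_emb, kb_emb)[0]

print(kb_articles[scores.argmax().item()])  # matches the password article
```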
Relation to Other Benchmarks
| Benchmark | What It Tests | Contrast with DuoRC |
|-----------|---------------|---------------------|
| SQuAD | Span extraction from a single passage | High question-passage lexical overlap rewards string matching |
| NarrativeQA | Long-form story understanding | Hard, but not explicitly paraphrase-focused |
| HotpotQA | Multi-hop reasoning | Stresses evidence combination, with less paraphrase emphasis |
| DuoRC | Semantic generalization across rewritten source texts | Directly penalizes word-matching shortcuts |
Limitations
- Movie plots are a narrow domain, so domain transfer is limited
- Some plot summaries omit details, making certain questions genuinely unanswerable (the snippet after this list shows how to measure this)
- Benchmark size is smaller than modern large-scale QA evaluations
- Frontier LLMs can now solve much of DuoRC, so it is less discriminative than in 2018
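The unanswerable-question limitation is easy to quantify directly, assuming the Hugging Face distribution of DuoRC and its `no_answer` field:

```python
# Fraction of questions marked unanswerable in each configuration.
# Field name `no_answer` follows the dataset card (assumed).
from datasets import load_dataset

for config in ["SelfRC", "ParaphraseRC"]:
    split = load_dataset("ibm/duorc", config, split="validation")
    rate = sum(split["no_answer"]) / len(split)
    print(f"{config}: {rate:.1%} unanswerable")
```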
DuoRC remains important because it captured a core truth about language understanding early: answering questions is easy when the answer is copied verbatim, but much harder when the same meaning is expressed in different words. That distinction is central to evaluating any serious machine reading or retrieval-augmented AI system.