
DuoRC is a reading comprehension benchmark built from pairs of movie plot summaries that tell the same story in different words. It is designed to test whether a QA system can answer questions when the wording of the evidence passage differs substantially from the wording of the question. Released by Saha et al. in 2018, DuoRC exposed a major weakness in earlier machine reading models: many systems appeared strong on benchmarks like SQuAD because they relied on lexical overlap and span matching, but failed when required to generalize across paraphrase, abstraction, and differences in narrative style.

How DuoRC Is Constructed

The dataset uses two plot summaries for the same movie, one from Wikipedia and one from IMDb.

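Annotators read one version and write questions, while the model must answer using the other version. As a result, the question and the passage that contains the answer can describe the same event in very different words.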

This breaks the shortcut used by many extractive QA models: scanning for keyword overlap and copying a span.

Two Main Tasks in DuoRC

| Setting | Description | Difficulty |
|---|---|---|
| SelfRC | Question and answer evidence come from the same plot version | Easier |
| ParaphraseRC | Question written from one plot version, answer from the other | Harder and more realistic |

SelfRC is similar to conventional reading comprehension. ParaphraseRC is the real contribution because it forces semantic matching rather than string matching.
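The difference between the two settings can be sketched as a small data structure. This is an illustration only: the class names, field names, and plot sentences below are hypothetical and are not the dataset's actual schema or content.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MoviePlots:
    """Two summaries of the same movie (hypothetical field names)."""
    title: str
    question_plot: str  # version the annotator read when writing questions
    other_plot: str     # second version of the same story


@dataclass(frozen=True)
class RCExample:
    """One question paired with the passage the model may read."""
    setting: str
    question: str
    passage: str


def build_examples(movie: MoviePlots, question: str) -> List[RCExample]:
    """Pair one question with each plot version to form both settings."""
    return [
        # SelfRC: evidence is the version the question was written from.
        RCExample("SelfRC", question, movie.question_plot),
        # ParaphraseRC: evidence is the other version of the plot.
        RCExample("ParaphraseRC", question, movie.other_plot),
    ]


# Invented plot sentences, used only to show the pairing.
movie = MoviePlots(
    title="Hypothetical Movie",
    question_plot="The detective discovers that the butler committed the murder.",
    other_plot="In the end, the investigator learns the household servant was the killer.",
)
examples = build_examples(movie, "Who committed the murder?")
for ex in examples:
    print(f"{ex.setting}: {ex.passage}")
```

The same question appears in both examples; only the passage changes, which is what separates the two difficulty levels.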

Example of the Core Challenge

Suppose the IMDb plot and the Wikipedia plot of the same film describe the same murder and its resolution, but in different words.

A question written from one version such as "Who turns out to be responsible for the murder?" may require reasoning over descriptions that use entirely different words. A shallow span-matching system will fail even though the story content is the same.
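A rough sketch of why the shortcut breaks: the snippet below scores a passage by the fraction of the question's content words it contains. The two passages are invented for illustration and are not taken from DuoRC, and the stopword list and `overlap_score` helper are assumptions made for this example rather than part of any published baseline.

```python
import re
from typing import Set

# Small hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {
    "a", "an", "the", "to", "be", "for", "of", "out",
    "who", "is", "was", "that", "in", "and",
}


def content_tokens(text: str) -> Set[str]:
    """Lowercase the text and keep word tokens that are not stopwords."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS}


def overlap_score(question: str, passage: str) -> float:
    """Fraction of the question's content words that appear in the passage."""
    q = content_tokens(question)
    if not q:
        return 0.0
    return len(q & content_tokens(passage)) / len(q)


question = "Who turns out to be responsible for the murder?"

# Hypothetical passages describing the same event.
same_wording = "The gardener turns out to be responsible for the murder."
paraphrased = "In the final scene, the groundskeeper is revealed as the killer."

print("same wording:", overlap_score(question, same_wording))
print("paraphrased: ", overlap_score(question, paraphrased))
```

In this toy case the first passage shares every content word with the question and the second shares none, even though both state the same fact. A system that ranks evidence by such overlap has nothing to latch onto in the paraphrased version.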

Why DuoRC Mattered Historically

When DuoRC was introduced, it highlighted three important facts:

1. Lexical overlap had inflated benchmark performance: Systems scoring well on SQuAD were often exploiting answer-style artifacts and phrase matching
2. Semantic understanding is much harder: Real-world documents rarely restate the same fact in identical wording
3. Machine reading needed retrieval plus reasoning plus paraphrase robustness: Not just extraction

This helped push the field toward models with stronger contextual understanding, pretraining, and eventually LLM-based QA systems.

Model Performance and Evolution

Early neural QA models struggled badly on ParaphraseRC, and pretrained transformers later improved results.

However, DuoRC remains useful as a diagnostic benchmark because it measures robustness to rewording, which still matters in production QA and RAG systems.
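One way to use the benchmark diagnostically is to score the same system on both settings and look at the gap. The sketch below uses a simplified token-level F1 of the kind commonly used for extractive QA; it is not the official DuoRC evaluation script, and the prediction and gold strings are made up purely to exercise the functions.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (simplified normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: List[Tuple[str, str]]) -> float:
    """Average token F1 over (prediction, gold) pairs."""
    return sum(token_f1(p, g) for p, g in pairs) / len(pairs)


# Hypothetical (prediction, gold) pairs for one system on each setting.
self_rc_pairs = [("the butler", "the butler"), ("in Paris", "Paris")]
paraphrase_rc_pairs = [("the detective", "the butler"), ("Paris", "Paris")]

robustness_gap = mean_f1(self_rc_pairs) - mean_f1(paraphrase_rc_pairs)
print(f"SelfRC F1:       {mean_f1(self_rc_pairs):.3f}")
print(f"ParaphraseRC F1: {mean_f1(paraphrase_rc_pairs):.3f}")
print(f"Gap:             {robustness_gap:.3f}")
```

A large positive gap suggests the system leans on wording shared between question and passage. Real evaluations would use the dataset's own answer normalization and handle multiple gold answers and unanswerable questions.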

Why DuoRC Still Matters for Production AI

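Modern enterprise QA systems face the DuoRC problem constantly: users phrase questions in their own words, while the answers sit in documents written with different vocabulary.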

If a model only works when wording matches exactly, it is not useful in production. DuoRC is therefore a good benchmark for semantic retrieval and reading systems.

Relation to Other Benchmarks

| Benchmark | What It Tests | Main Weakness Addressed by DuoRC |
|---|---|---|
| SQuAD | Span extraction from same passage | High lexical overlap |
| NarrativeQA | Long-form story understanding | Hard but not explicitly paraphrase-focused |
| HotpotQA | Multi-hop reasoning | Requires evidence combination, less paraphrase emphasis |
| DuoRC | Semantic generalization across rewritten source texts | Directly penalizes word-matching shortcuts |

Limitations

DuoRC remains important because it captured a core truth about language understanding early: answering questions is easy when the answer is copied verbatim, but much harder when the same meaning is expressed in different words. That distinction is central to evaluating any serious machine reading or retrieval-augmented AI system.
