NarrativeQA

Keywords: narrativeqa, evaluation

NarrativeQA is a reading comprehension benchmark built from full-length books and movie scripts. It requires models to answer questions about plots, characters, relationships, and themes across documents averaging 50,000+ words, making it one of the first benchmarks to genuinely require long-document comprehension and an understanding of narrative structure rather than local fact retrieval from short passages.

The Long-Document Challenge

Standard reading comprehension benchmarks use passages of 100–500 words. SQuAD paragraphs average 120 words; GLUE's RTE uses sentence pairs. These short-context benchmarks do not test whether models can track information across chapter boundaries, maintain character models over hundreds of pages, or understand how early plot events cause later consequences.

NarrativeQA addresses this gap by grounding questions in full-length narratives:
- Books: From Project Gutenberg (public domain) — novels averaging 80,000–100,000 words.
- Movie Scripts: From IMSDb (Internet Movie Script Database) — scripts averaging 20,000–40,000 words.

Answering questions about these narratives requires either processing the entire document in one pass (difficult for models with fixed context windows) or accurately retrieving the relevant passages from a very large candidate pool (a difficult retrieval problem).

Dataset Construction

A key design decision distinguishes NarrativeQA from other long-document QA: questions are written based on human-written summaries of the source narratives, not the narratives themselves.

Step 1: Collect books and movie scripts that have professionally written plot summaries (Wikipedia plot summaries matched to each book or script).

Step 2: Crowdworkers read the summary (not the full document) and write questions that probe the plot's key events, characters, and themes. Answers are provided in free text based on the summary.

Step 3: The QA pairs are verified against the full text to ensure the answer is findable in the original document.

This construction guarantees that questions capture genuinely important narrative content (plot summaries highlight the significant events) rather than arbitrary detail. The questions are asked about the summary but must be answered from the full text, creating a search challenge.

Task Format

- Input: Full book or movie script (50,000+ words) + question.
- Output: Free-text answer (not span extraction).
- Answer annotation: Two independent human answers per question, providing inter-annotator variation.
- Scale: 1,567 stories; 46,765 QA pairs.
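
To make the task format concrete, the sketch below loads NarrativeQA through the Hugging Face `datasets` library and inspects one example. The `deepmind/narrativeqa` dataset ID and the field names shown are assumptions based on the public Hugging Face release; verify the schema of the version you download.

```python
# Minimal sketch: inspect one NarrativeQA example via Hugging Face datasets.
# Assumes the public "deepmind/narrativeqa" release; field names may differ
# across versions, so check the schema before relying on it.
from datasets import load_dataset

dataset = load_dataset("deepmind/narrativeqa", split="validation")

example = dataset[0]
document = example["document"]          # full book or script text plus metadata
question = example["question"]["text"]  # free-text question written from the summary
answers = [a["text"] for a in example["answers"]]  # human reference answers

print("Document kind:", document["kind"])          # e.g., 'gutenberg' or 'movie'
print("Document length (words):", len(document["text"].split()))
print("Question:", question)
print("Reference answers:", answers)
```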

The free-text answer format distinguishes NarrativeQA from SQuAD-style span extraction. Answers are evaluated using ROUGE and BLEU metrics against the reference human answers, comparing generated text to reference text rather than checking exact span matches.
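
A minimal scoring sketch is shown below. It uses the `rouge-score` package and takes the best ROUGE-L F1 over the available reference answers; the package choice and the max-over-references aggregation are illustrative assumptions, not the official NarrativeQA evaluation script.

```python
# Minimal sketch: score a generated answer against multiple human references
# with ROUGE-L, keeping the best match. The max-over-references aggregation is
# an illustrative assumption, not the official evaluation protocol.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_multi_ref(prediction: str, references: list[str]) -> float:
    """Return the best ROUGE-L F1 of the prediction against any reference answer."""
    return max(
        scorer.score(ref, prediction)["rougeL"].fmeasure for ref in references
    )

references = [
    "He leaves to search for his missing brother.",
    "To find his lost brother.",
]
print(rouge_l_multi_ref("He goes looking for his lost brother.", references))
```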

Why NarrativeQA Is Challenging

Scale: Most models cannot read 100,000 words in a single pass. The document must be chunked, retrieved, or summarized, and any of these transformations may lose the specific evidence needed to answer a given question.

Long-Range Reasoning: Many NarrativeQA questions require connecting information from multiple distant locations within the same document:
- "What caused the protagonist to leave his hometown?" — caused by events across the first three chapters.
- "How does the relationship between X and Y change throughout the story?" — requires evidence from beginning, middle, and end.
- "Why does the antagonist ultimately fail?" — requires understanding the whole arc.

Character Tracking: Stories involve multiple characters whose actions, relationships, and states change over the narrative. Tracking "what does Elizabeth know about Mr. Darcy at each point in the story" requires maintaining a dynamic character state model.

Temporal Reasoning: Understanding a narrative requires temporal ordering: what happened before what, and which events caused which consequences. Temporal reasoning across 100,000 words is qualitatively different from reasoning over a single paragraph.

Evaluation and Benchmarks

| Model Type | NarrativeQA ROUGE-L |
|-----------|-------------------|
| Paragraph retrieval + Reading | ~36 |
| Abstractive summarization + QA | ~44 |
| Human performance | ~60 |

The large gap between models and humans reflects the genuine difficulty of long-document comprehension. Human annotators have full memory of the narrative; models must retrieve or compress the relevant information.

Retrieval-Augmented Generation for NarrativeQA

Modern approaches to NarrativeQA use RAG-style architectures:
1. Chunking: Split the document into passages of 256–512 tokens with overlap.
2. Retrieval: Use the question to retrieve the top-k relevant chunks with a dense retrieval model (e.g., DPR or ColBERT).
3. Reading: Feed retrieved chunks to a reader model to generate the answer.
4. Re-ranking: Optionally re-rank chunks by relevance to the question before reading.

The challenge: correct answers may span multiple non-adjacent passages. A single retrieved chunk may not contain sufficient evidence to answer plot-level questions.
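
A compact sketch of this pipeline is shown below. The embedding model, chunk parameters, and the `generate_answer` reader call are illustrative assumptions rather than a reference implementation.

```python
# Minimal sketch of chunk -> retrieve -> read over a long narrative.
# The embedding model name, chunk parameters, and the generate_answer stub
# are illustrative assumptions, not a reference implementation.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed dense retrieval model

def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split the document into overlapping word windows (roughly 256-512 tokens)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Embed the question and all chunks, keep the top-k by cosine similarity."""
    chunk_emb = encoder.encode(chunks, convert_to_tensor=True)
    question_emb = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(question_emb, chunk_emb, top_k=k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]

def answer_question(question: str, document: str) -> str:
    context = "\n\n".join(retrieve(question, chunk_words(document)))
    # Hand the concatenated evidence to any reader/generator model here,
    # e.g., a seq2seq reader or an instruction-tuned LLM.
    return generate_answer(question, context)  # hypothetical reader call
```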

Long-Context LLMs and NarrativeQA

GPT-4 (128k context) and Claude 3 (200k context) can ingest substantial portions of NarrativeQA documents directly. Performance improves dramatically with longer context windows:
- 4k context (chunked retrieval): ROUGE-L ~35–40.
- 32k context: ROUGE-L ~50–55.
- Full-document: ~55–65, approaching human performance on shorter documents.
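
Whether a document can be ingested directly depends on token count rather than word count. The rough check below uses tiktoken's `cl100k_base` encoding as a stand-in tokenizer; the actual tokenizer, window size, and output budget are model-specific assumptions.

```python
# Minimal sketch: check whether a full NarrativeQA document plus its question
# fits in a given context window. Uses tiktoken's cl100k_base encoding as a
# stand-in; real token counts depend on the target model's own tokenizer.
import tiktoken

def fits_in_context(document: str, question: str,
                    window: int = 128_000, output_budget: int = 1_000) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(document)) + len(enc.encode(question))
    return n_tokens + output_budget <= window

# A 100,000-word novel is typically on the order of 130k-150k tokens, so it
# may overflow a 128k window while fitting comfortably in a 200k window.
```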

NarrativeQA has become a key benchmark for evaluating long-context LLMs, as it genuinely tests whether extended context is being used effectively rather than just fitting in the window.

NarrativeQA is reading comprehension at the scale of novels — the benchmark that forces models to engage with narrative structure, character arcs, and plot causality across entire books, testing the long-range comprehension capability that separates genuine reading from local fact retrieval.
