NarrativeQA (Long) is the full-document variant of the NarrativeQA benchmark: models must read entire movie scripts or Project Gutenberg novels averaging 50,000-80,000 words to answer free-form questions. It represents the frontier challenge of long-document comprehension, where the answer may be embedded anywhere in a text far exceeding the context window of standard models.
What Is NarrativeQA?
- Origin: Kočiský et al. (2018) from DeepMind.
- Scale: 1,567 stories (783 books + 789 movie scripts) with 46,765 question-answer pairs.
- Format: Each story has ~30 questions; answers are free-form text (averaging ~4 words), not multiple-choice.
- Answer Source: Questions were written by human annotators who read only the plot summary, ensuring questions probe deep story understanding rather than surface pattern matching.
- Two Evaluation Variants: context = the plot summary (~700 words), or context = the full text (~50,000-80,000 words).
Why the Long Version Is Hard
The "Long" setting β using the full book or script rather than a summary β exposes three fundamental challenges:
Challenge 1 - Context Window Overflow:
- Most transformer models cap at 4k-8k tokens (~3k-6k words), while a 60,000-word novel runs to roughly 80,000 tokens.
- Solutions: RAG (retrieve relevant passages), sliding-window attention, hierarchical summarization, or very long-context models (e.g., Claude at 100k tokens, Gemini at 1M tokens).
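The RAG approach above can be sketched in a few lines. This is an illustrative toy, assuming a crude word-overlap scorer rather than a real dense retriever; all function names (`chunk`, `score`, `retrieve`) are hypothetical:

```python
def chunk(words, size=300, overlap=50):
    """Split a word list into overlapping fixed-size chunks."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(1, len(words) - overlap), step)]

def score(chunk_words, question_words):
    """Crude lexical-overlap score: count chunk words that appear in the question."""
    q = {w.lower() for w in question_words}
    return sum(1 for w in chunk_words if w.lower() in q)

def retrieve(document, question, k=3, size=300, overlap=50):
    """Return the top-k chunks most lexically similar to the question."""
    chunks = chunk(document.split(), size, overlap)
    ranked = sorted(chunks, key=lambda c: score(c, question.split()), reverse=True)
    return [" ".join(c) for c in ranked[:k]]
```

A production system would swap the overlap scorer for dense embeddings or BM25, but the pipeline shape (chunk, score, take top-k, feed to the reader) is the same.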
Challenge 2 - Holistic Understanding:
- Some questions require synthesizing character development from chapter 1 and chapter 30: "How did [character] change throughout the story?"
- RAG retrieval of the top-3 passages cannot answer these; the entire narrative arc is needed.
Challenge 3 - Needle in a Haystack:
- Specific factual questions ("What was the name of the detective's partner's dog?") require finding a single sentence in 80,000 words.
- Retrieval can find such facts efficiently, but if retrieval fails on ~5% of questions, those answers become unreachable no matter how good the reader is.
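The compounding effect can be made concrete with a back-of-envelope model (an assumption for illustration, not a measured result): if the reader can only answer when retrieval surfaces the relevant passage, end-to-end accuracy is capped at retrieval recall times reader accuracy.

```python
def rag_accuracy_ceiling(retrieval_recall, reader_accuracy):
    """Upper bound on end-to-end RAG accuracy, assuming the reader
    can only answer questions whose evidence was retrieved."""
    return retrieval_recall * reader_accuracy

# With 95% retrieval recall, even a perfect reader is capped at 95%.
```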
Performance Results
| Model | Setting | ROUGE-L | BLEU-1 | METEOR |
|-------|---------|---------|--------|--------|
| SeqToSeq baseline | Summary | 28.5 | 23.8 | 21.5 |
| BiDAF | Summary | 36.6 | 33.7 | 28.7 |
| GPT-3.5 | Full text (RAG) | 42.1 | 38.4 | 33.2 |
| GPT-4 | Full text (RAG) | 52.3 | 48.1 | 41.6 |
| Claude 2 100k | Full text (no retrieval) | 59.4 | 54.8 | 48.3 |
| Human | Summary | 67.0 | 62.9 | 55.8 |
Evaluation Metrics
NarrativeQA uses three complementary metrics because answers are free-form and often have multiple valid phrasings:
- BLEU: N-gram precision of the generated answer against the reference answers.
- ROUGE-L: F-score based on the longest common subsequence between candidate and reference.
- METEOR: Unigram matching with stemming and synonyms, combining precision and recall (weighted toward recall).
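As an illustration, ROUGE-L can be computed from the longest common subsequence of the candidate and reference token sequences. This is a minimal sketch (official scorers add proper tokenization and multi-reference handling); the function names and the `beta` weighting are assumptions for demonstration:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score: LCS-based precision and recall combined,
    weighted toward recall by beta."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

An exact match scores 1.0; a candidate sharing no tokens with the reference scores 0.0, with partial overlaps falling in between.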
Why NarrativeQA (Long) Matters
- Ultimate Long-Context Test: No benchmark better distinguishes models with 8k vs. 100k context windows than NarrativeQA long; the performance gap is stark and meaningful.
- Literary Understanding: Books contain subtle character psychology, narrative irony, and thematic arcs that require understanding the whole text, making this a genuine test of deep reading comprehension.
- Application Relevance: AI research assistants, legal discovery (reading full case files), and educational summarization all require NarrativeQA-style full-document comprehension.
- RAG Architecture Driver: NarrativeQA long motivated significant research into passage retrieval optimization, dense passage indexing, and hierarchical document representation.
- Context Utilization Research: NarrativeQA long is used to study "lost in the middle": the finding that models best use information at the beginning and end of the context while missing information in the middle of long documents.
Famous "Needle in a Haystack" Test Connection
The NarrativeQA long setting directly inspired the "Needle in a Haystack" evaluation (Kamradt, 2023): placing a specific fact anywhere in a 100k-token document and testing whether the model can retrieve it. NarrativeQA long is the naturalistic version of this synthetic test.
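A synthetic needle test of this kind can be generated in a few lines. This sketch is illustrative only; the filler sentence, needle wording, and word-level depth placement are assumptions, not the original evaluation's exact setup:

```python
FILLER = "The quick brown fox jumps over the lazy dog. "

def make_haystack(needle, total_words=1000, depth=0.5):
    """Embed `needle` at relative position `depth` (0.0 = start,
    1.0 = end) inside roughly `total_words` words of filler text."""
    filler_words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
    pos = int(depth * len(filler_words))
    return " ".join(filler_words[:pos] + needle.split() + filler_words[pos:])
```

Sweeping `depth` from 0.0 to 1.0 while asking the model to recall the needle is exactly how "lost in the middle" position effects are measured.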
NarrativeQA (Long) demands consuming the novel whole: it is the frontier benchmark of truly long-form document comprehension, where genuine understanding requires reading and integrating an entire book rather than finding and extracting a single relevant passage.