NarrativeQA (Long)

Keywords: narrativeqa long, evaluation

NarrativeQA (Long) is the full-document variant of the NarrativeQA benchmark. It requires models to read entire movie scripts or Project Gutenberg novels averaging 50,000-80,000 words and answer free-form questions about them, representing the frontier challenge of long-document comprehension: the answer may be embedded anywhere in a text that far exceeds the context window of standard models.

What Is NarrativeQA?

- Origin: Kočiský et al. (2018) from DeepMind.
- Scale: 1,567 stories (full books and movie scripts) with 46,765 question-answer pairs.
- Format: Each story has ~30 questions; answers are free-form text (averaging ~4 words), not multiple-choice.
- Question Source: Questions and answers were written by human annotators who read only the plot summary, never the full story, so answering from the full text demands genuine narrative understanding rather than surface pattern matching.
- Two Evaluation Variants: the context is either the plot summary (~700 words) or the full story text (~50,000-80,000 words).

Why the Long Version Is Hard

The "Long" setting, which uses the full book or script rather than the summary, exposes three fundamental challenges:

Challenge 1: Context Window Overflow
- Many standard transformer models cap out at 4k-8k tokens (~3k-6k words), while a 60,000-word novel runs to roughly 80,000 tokens.
- Solutions: RAG (retrieve the relevant passages; a minimal retrieval sketch follows this list), sliding-window attention, hierarchical summarization, or very long context models (Claude at 100k tokens, Gemini 1.5 at 1M tokens).
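
To make the retrieval route concrete, here is a minimal chunk-and-retrieve sketch in Python. It uses TF-IDF cosine similarity as a stand-in for a learned dense retriever, and the function names (chunk_text, retrieve), chunk size, overlap, and top_k values are illustrative assumptions rather than anything defined by the benchmark.

```python
# Minimal chunk-and-retrieve sketch for answering a question over a full story.
# TF-IDF similarity stands in for a learned dense retriever; the chunking
# parameters and top_k are illustrative choices, not benchmark settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def chunk_text(text: str, chunk_words: int = 300, overlap: int = 50) -> list[str]:
    """Split a long document into overlapping word-window chunks."""
    words = text.split()
    step = chunk_words - overlap
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), step)]


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question under TF-IDF cosine."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(chunks + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [chunks[i] for i in scores.argsort()[::-1][:top_k]]


# Usage: feed the retrieved passages plus the question to any QA model.
# passages = retrieve(question, chunk_text(full_story_text))
# prompt = "\n\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"
```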

Challenge 2: Holistic Understanding
- Some questions require synthesizing character development from chapter 1 through chapter 30: "How did [character] change throughout the story?"
- Retrieving only the top-3 passages cannot answer these; the entire arc is needed. A map-reduce style workaround is sketched after this list.
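
For arc-level questions like these, a common workaround is hierarchical, map-reduce style processing: take question-focused notes on every chunk, then answer from the combined notes. The sketch below assumes a hypothetical generate() function wrapping whatever LLM is available; it illustrates the general pattern, not an official NarrativeQA method.

```python
# Map-reduce sketch for questions that span the whole narrative arc.
# `generate` is a hypothetical placeholder for any text-generation call;
# it is not part of NarrativeQA or any specific provider's API.
def generate(prompt: str) -> str:
    raise NotImplementedError("wire this up to an LLM of your choice")


def map_reduce_answer(question: str, chunks: list[str]) -> str:
    # Map: take question-focused notes on every chunk so nothing in the arc is skipped.
    notes = [
        generate(f"Passage:\n{chunk}\n\nNote anything relevant to: {question}")
        for chunk in chunks
    ]
    # Reduce: answer from the combined notes, which together cover the whole story.
    combined = "\n".join(notes)
    return generate(f"Notes from the full story:\n{combined}\n\nAnswer briefly: {question}")
```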

Challenge 3: Needle in a Haystack
- Specific factual questions ("What was the name of the detective's partner's dog?") require finding a single sentence in 80,000 words.
- Retrieval can locate such facts efficiently, but every retrieval miss is unrecoverable: if the relevant sentence is not among the retrieved passages, the reader model cannot produce the answer no matter how strong it is. A simple way to estimate this miss rate is sketched after this list.
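
One rough way to estimate that miss rate is to check whether any retrieved chunk contains the reference answer string, as in the sketch below. The retrieval_hit helper is a hypothetical name and reuses retrieve() from the earlier sketch; string containment is only a proxy, since free-form answers may not match the story's wording verbatim.

```python
# Rough retrieval-recall probe: does any retrieved chunk contain the reference answer?
# Reuses retrieve() from the earlier chunk-and-retrieve sketch.
def retrieval_hit(question: str, reference_answer: str, chunks: list[str],
                  top_k: int = 3) -> bool:
    retrieved = retrieve(question, chunks, top_k=top_k)
    needle = reference_answer.lower()
    return any(needle in chunk.lower() for chunk in retrieved)


# Averaging retrieval_hit over a question set estimates recall@k; every miss is a
# question the downstream reader has no chance of answering.
```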

Performance Results

| Model | Setting | ROUGE-L | BLEU-1 | METEOR |
|-------|---------|---------|--------|--------|
| Seq2Seq baseline | Summary | 28.5 | 23.8 | 21.5 |
| BiDAF | Summary | 36.6 | 33.7 | 28.7 |
| GPT-3.5 | Full text (RAG) | 42.1 | 38.4 | 33.2 |
| GPT-4 | Full text (RAG) | 52.3 | 48.1 | 41.6 |
| Claude 2 100k | Full text (no retrieval) | 59.4 | 54.8 | 48.3 |
| Human | Summary | 67.0 | 62.9 | 55.8 |

Evaluation Metrics

NarrativeQA uses three complementary metrics because answers are free-form and often have multiple valid phrasings:
- BLEU: n-gram precision between the generated answer and the reference answers (BLEU-1 uses unigrams).
- ROUGE-L: overlap based on the longest common subsequence between generated and reference answers, usually reported as an F-measure; a minimal version is sketched below.
- METEOR: unigram matching with stemming and synonyms, combining precision and recall (weighted toward recall).
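
For illustration, a simplified single-reference ROUGE-L can be written in a few lines: compute the longest common subsequence (LCS) between candidate and reference tokens, then combine LCS precision and recall into an F-measure. The whitespace tokenization and beta value below are assumptions; reported NarrativeQA scores come from the standard metric implementations, which also handle multiple references.

```python
# Simplified ROUGE-L: an F-measure over the longest common subsequence (LCS) of tokens.
# Single reference, whitespace tokenization; beta = 1.2 is an assumed convention.
def lcs_length(a: list[str], b: list[str]) -> int:
    """Dynamic-programming length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(round(rouge_l("the dog was called Rex", "Rex"), 2))  # 0.38 on this toy pair
```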

Why NarrativeQA (Long) Matters

- Ultimate Long-Context Test: No benchmark better distinguishes models with 8k vs. 100k context windows than NarrativeQA (Long); the performance gap is stark and meaningful.
- Literary Understanding: Books contain subtle character psychology, narrative irony, and thematic arcs that require understanding the whole text, making this a genuine test of deep reading comprehension.
- Application Relevance: AI research assistants, legal discovery (reading full case files), and educational summarization all require NarrativeQA-style full-document comprehension.
- RAG Architecture Driver: NarrativeQA (Long) motivated significant research into passage retrieval optimization, dense passage indexing, and hierarchical document representation.
- Context Utilization Research: NarrativeQA (Long) is used to study the "lost in the middle" effect, i.e. the finding that models make the best use of information at the beginning and end of the context and often miss information in the middle of long documents.

Connection to the "Needle in a Haystack" Test

The NarrativeQA (Long) setting directly inspired the "Needle in a Haystack" evaluation (Kamradt, 2023), which places a specific fact anywhere in a 100k-token document and tests whether the model can retrieve it. NarrativeQA (Long) is the naturalistic counterpart of this synthetic test.
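
A minimal version of that synthetic probe is easy to sketch: insert a known fact at a chosen relative depth in filler text, ask about it, and check whether the expected string appears in the model's answer. The build_haystack and probe_depths helpers below are hypothetical names, and the sketch reuses the generate() placeholder from the earlier map-reduce example; the depths and document length are arbitrary.

```python
# Synthetic needle-in-a-haystack probe: bury one fact at a chosen relative depth
# in filler text and check whether the model's answer surfaces it.
def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert `needle` at fraction `depth` (0.0 = start, 1.0 = end) of `filler`."""
    words = filler.split()
    cut = int(len(words) * depth)
    return " ".join(words[:cut] + [needle] + words[cut:])


def probe_depths(filler: str, needle: str, question: str, expected: str,
                 depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Return, per insertion depth, whether the model's answer contains the expected string."""
    results = {}
    for depth in depths:
        document = build_haystack(filler, needle, depth)
        answer = generate(f"{document}\n\nQuestion: {question}\nAnswer:")  # hypothetical LLM call
        results[depth] = expected.lower() in answer.lower()
    return results
```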

NarrativeQA (Long) is consuming the novel whole: the frontier benchmark of truly long-form document comprehension, where genuine understanding requires reading and integrating an entire book rather than finding and extracting a single relevant passage.
