NarrativeQA (Long)

NarrativeQA (Long) is the full-document variant of the NarrativeQA benchmark. Models must read entire movie scripts or Project Gutenberg novels, averaging 50,000-80,000 words, to answer free-form questions. It represents the frontier challenge of long-document comprehension: the answer may be embedded anywhere in a text that far exceeds the context window of standard models.

What Is NarrativeQA?

Why the Long Version Is Hard

The "Long" setting — using the full book or script rather than a summary — exposes three fundamental challenges:

Challenge 1 — Context Window Overflow:
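
As a rough illustration of the overflow problem, the sketch below estimates how many fixed-size windows a book of the size quoted in the introduction would need. The 1.3 tokens-per-word ratio, the 4,096-token window, the 256-token overlap, and the helper names are all illustrative assumptions, not figures from the benchmark.

```python
import math


def estimate_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate; the 1.3 ratio is an assumed rule of thumb."""
    return round(word_count * tokens_per_word)


def windows_needed(total_tokens: int, window: int = 4096, overlap: int = 256) -> int:
    """Number of overlapping windows required to cover total_tokens."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if total_tokens <= window:
        return 1
    stride = window - overlap  # new tokens contributed by each extra window
    return 1 + math.ceil((total_tokens - window) / stride)


if __name__ == "__main__":
    for words in (50_000, 80_000):
        tokens = estimate_tokens(words)
        print(f"{words} words ~ {tokens} tokens -> {windows_needed(tokens)} windows")
```

Even under these generous assumptions a single story spans many windows, so evidence for one answer can easily be split across chunks that the model never sees together.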

Challenge 2 — Holistic Understanding:

Challenge 3 — Needle in a Haystack:

Performance Results

Model              Setting                    ROUGE-L  BLEU-1  METEOR
SeqToSeq baseline  Summary                    28.5     23.8    21.5
BiDAF              Summary                    36.6     33.7    28.7
GPT-3.5            Full text (RAG)            42.1     38.4    33.2
GPT-4              Full text (RAG)            52.3     48.1    41.6
Claude 2 100k      Full text (no retrieval)   59.4     54.8    48.3
Human              Summary                    67.0     62.9    55.8
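
The "Full text (RAG)" rows refer to retrieval-augmented setups, where only the passages judged relevant to the question are passed to the model. The source does not describe the retriever used, so the following is only a minimal sketch of the idea, assuming fixed-size word chunks and a simple word-overlap score; the function names and the 200-word chunk size are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_into_chunks(text: str, chunk_words: int = 200) -> list[str]:
    """Split a document into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks sharing the most word occurrences with the question."""
    q_counts = Counter(tokenize(question))

    def score(chunk: str) -> int:
        c_counts = Counter(tokenize(chunk))
        return sum(min(n, c_counts[w]) for w, n in q_counts.items())

    # sorted() is stable, so ties keep their original document order.
    return sorted(chunks, key=score, reverse=True)[:k]


if __name__ == "__main__":
    story = "the cat sat on the mat . the captain sailed to the island . rain fell"
    chunks = split_into_chunks(story, chunk_words=8)
    print(top_k_chunks("Where did the captain go?", chunks, k=1))
```

A lexical retriever like this only surfaces passages that share words with the question, which is one reason retrieval can miss answers that depend on understanding the whole narrative.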

Evaluation Metrics

NarrativeQA is scored with three complementary metrics, ROUGE-L, BLEU-1, and METEOR, because answers are free-form and often have multiple valid phrasings.
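
To show what overlap-based scoring of free-form answers looks like, here is a simplified sketch of two of these metrics: ROUGE-L as an F1 over the longest common subsequence, and the clipped unigram precision at the core of BLEU-1. It assumes whitespace tokenization, omits BLEU's brevity penalty, and leaves out METEOR (which needs stemming and synonym resources), so its numbers will not match official scorers exactly.

```python
from collections import Counter


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as a plain F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def unigram_precision(candidate: str, references: list[str]) -> float:
    """Clipped unigram precision (BLEU-1 without the brevity penalty)."""
    cand = Counter(candidate.lower().split())
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for word, n in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], n)
    clipped = sum(min(n, max_ref[word]) for word, n in cand.items())
    return clipped / sum(cand.values())


def best_rouge_l(candidate: str, references: list[str]) -> float:
    """Score against several valid phrasings and keep the best match."""
    return max(rouge_l_f1(candidate, ref) for ref in references)


if __name__ == "__main__":
    refs = ["he fled to paris", "he escaped to france"]
    print(best_rouge_l("he fled", refs), unigram_precision("he fled", refs))
```

Taking the maximum over references is one common way to handle multiple valid phrasings; it rewards an answer for matching any one of the accepted wordings.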

Why NarrativeQA (Long) Matters

Famous "Needle in a Haystack" Test Connection

The NarrativeQA long setting is closely related in spirit to the "Needle in a Haystack" evaluation (Kamradt, 2023), which places a specific fact at a chosen position in a long document (on the order of 100k tokens) and tests whether the model can retrieve it. NarrativeQA long is the naturalistic counterpart of this synthetic test.
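
The synthetic test can be sketched in a few lines. This is a minimal illustration of the idea rather than Kamradt's actual harness: the function names, the sentence-level insertion, and the substring check are assumptions, and the call to a model is left out entirely.

```python
def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def retrieved(model_answer: str, expected: str) -> bool:
    """Crude success check: the expected fact appears in the answer."""
    return expected.lower() in model_answer.lower()


if __name__ == "__main__":
    filler = [f"Filler sentence number {i}." for i in range(1000)]
    needle = "The secret code is 4417."
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        document = build_haystack(filler, needle, depth)
        # A real run would send `document` plus a question to a model here
        # and pass the model's reply to retrieved(reply, "4417").
        print(depth, document.find(needle))
```

Sweeping the depth shows whether retrieval accuracy depends on where the fact sits. In NarrativeQA the "needle" is not a planted sentence but evidence that occurs naturally, and possibly in several places, within the story.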

NarrativeQA (Long) asks a model to consume the whole novel. It is the frontier benchmark of truly long-form document comprehension, where genuine understanding requires reading and integrating an entire book rather than finding and extracting a single relevant passage.
