SCROLLS (Standardized CompaRison Over Long Language Sequences) is a benchmark for evaluating long-context language models on realistic NLP tasks that require processing complete documents. Unlike Long-Range Arena's synthetic sequences, SCROLLS uses real-world text (government reports, TV scripts, legal contracts, scientific papers, and books), directly measuring the practical value of extended context windows for summarization, question answering, and natural language inference.
What Is SCROLLS?
- Origin: Shaham et al. (2022), designed to complement LRA with natural language tasks.
- Tasks: 7 NLP tasks spanning summarization, question answering, and natural language inference, each requiring a model to process a long natural language document.
- Context Length: 1,000 to 50,000+ words per input document.
- Modality: Pure natural language — no synthetic sequences, pixels, or byte input.
- Relevance: Directly tests capabilities needed by real AI applications (legal review, medical literature, book Q&A).
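The benchmark is distributed through the Hugging Face Hub, so individual tasks can be loaded programmatically. A minimal sketch, assuming the `datasets` library and the `tau/scrolls` dataset path with the per-task configuration names published by the authors:

```python
# Minimal sketch of loading SCROLLS tasks. Assumes the `datasets`
# library is installed and the benchmark lives at `tau/scrolls` on the
# Hugging Face Hub under these configuration names; depending on your
# `datasets` version, loading may require trust_remote_code=True.
from datasets import load_dataset

SCROLLS_CONFIGS = [
    "gov_report", "summ_screen_fd", "qmsum",   # summarization
    "narrative_qa", "qasper", "quality",       # question answering
    "contract_nli",                            # natural language inference
]

gov_report = load_dataset("tau/scrolls", "gov_report")
example = gov_report["train"][0]
print(example.keys())                  # e.g. id, pid, input, output
print(len(example["input"].split()))  # rough word count of one document
```

Each split serves examples as (`input`, `output`) text pairs, with the long document in `input` and the reference summary or answer in `output`.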
The 7 SCROLLS Tasks
Summarization Tasks (3):
- GovReport: Government report summarization. Documents: ~9,400 words on average; summaries: ~550 words. Source: the U.S. Government Accountability Office and the Congressional Research Service.
- SummScreen: TV show script summarization (SCROLLS uses the ForeverDreaming subset, SummScreenFD). Episodes range from 2,000 to 8,000 words; references are episode recaps written by fans.
- QMSum: Meeting transcript summarization with query-focused summaries, e.g., "summarize the discussion about budget constraints."
QA Tasks (3):
- NarrativeQA: Questions over full books or movie scripts (20,000-80,000 words). Questions are written from plot summaries, so answering them requires synthesizing information from across the whole document.
- QASPER: Answer specific questions about NLP papers from the full paper text, where the evidence can include tables and figures.
- QuALITY: Multiple-choice reading comprehension over stories and articles averaging around 5,000 tokens, with questions written to require reading the whole passage rather than skimming.
NLI Task (1):
- ContractNLI: Natural Language Inference over legal contracts (non-disclosure agreements): given a contract and a hypothesis such as "some obligations survive termination," determine whether the contract entails it, contradicts it, or does not mention it.
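All seven tasks share a unified text-to-text format, which is what makes a single seq2seq model comparable across the whole suite. A hedged sketch of how a QA example might be flattened into that format; the helper name and separator below are illustrative assumptions, not the benchmark's exact preprocessing:

```python
# Illustrative sketch of SCROLLS's unified text-to-text framing for a
# QA task like QASPER. The helper name and "\n\n" separator are
# assumptions for illustration, not the benchmark's exact code.
def build_qa_input(question: str, document: str) -> str:
    # Question first, then the full document, so the model knows what
    # to look for before reading thousands of words of context.
    return f"{question}\n\n{document}"

paper_text = "..."  # full text of an NLP paper, often 5,000+ words
prompt = build_qa_input(
    "Which datasets do the authors evaluate on?", paper_text
)
# The target `output` is the free-form answer string; for QuALITY the
# answer options would be appended to the input, and for ContractNLI
# the hypothesis takes the place of the question.
```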
Why SCROLLS Matters
- Real-World Validation: SCROLLS demonstrates whether longer context windows translate to better task performance on text humans actually produce — not synthetic sequences.
- Context Window Arms Race Driver: Long-document benchmarks like SCROLLS helped motivate and justify the industry's push from 8k contexts (early GPT-4) to 100k (Claude) and onward toward million-token windows; each extension is judged in part by gains on exactly these kinds of tasks.
- Retrieval vs. Full-Context: SCROLLS enables head-to-head comparison between RAG (retrieve relevant chunks) and full-context models (process the entire document); see the retrieval sketch after this list. For holistic summarization, full-context models tend to win; for specific fact retrieval, RAG is competitive.
- Legal AI: ContractNLI represents a commercially critical application — automated contract review for law firms, procurement, and compliance requires exactly the capabilities SCROLLS measures.
- Scientific AI: QASPER measures whether AI can serve as a research assistant, answering questions about specific papers from their full text.
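To make the retrieval-versus-full-context comparison concrete, here is a minimal RAG-style baseline sketch. TF-IDF (via scikit-learn) stands in for a learned embedding model, and the chunk size and top-k values are arbitrary illustration choices:

```python
# Minimal RAG-style baseline sketch for a SCROLLS-like task: instead
# of feeding the whole document to the model, retrieve the chunks most
# similar to the query and feed only those.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_context(document: str, query: str,
                     chunk_words: int = 200, top_k: int = 4) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    chunk_vecs = vectorizer.transform(chunks)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, chunk_vecs)[0]
    best = sorted(scores.argsort()[-top_k:])  # keep document order
    return "\n\n".join(chunks[i] for i in best)

# A full-context model receives `document` directly; this baseline
# feeds the model only ~chunk_words * top_k words of it, which is why
# it can miss information a holistic summary needs.
```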
Performance Trends
| Model (Context) | GovReport | SummScreen | ContractNLI | NarrativeQA |
|-----------------|-----------|-----------|-------------|-------------|
| BART (1k tokens) | 36.2 | 26.3 | 62.4 | 10.1 |
| LED (16k tokens) | 57.5 | 32.1 | 68.1 | 20.6 |
| GPT-4 (8k tokens) | 61.2 | 38.4 | 78.3 | 34.0 |
| Claude 2 (100k tokens) | 67.8 | 43.1 | 85.9 | 48.2 |
| GPT-4 Turbo (128k tokens) | 69.4 | 44.8 | 87.1 | 52.3 |
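Much of the gap in the BART row is a context-budget effect rather than a modeling one. A back-of-the-envelope check, assuming a rough rule of thumb of ~1.3 tokens per English word:

```python
# Back-of-the-envelope: what fraction of an average GovReport document
# (~9,400 words) fits into each model's context window? Assumes ~1.3
# tokens per English word, a rough heuristic, not an exact figure.
DOC_WORDS = 9_400
TOKENS_PER_WORD = 1.3
doc_tokens = DOC_WORDS * TOKENS_PER_WORD  # ~12,220 tokens

for name, window in [("BART", 1_024), ("LED", 16_384),
                     ("GPT-4", 8_192), ("Claude 2", 100_000)]:
    visible = min(1.0, window / doc_tokens)
    print(f"{name:8s} sees {visible:6.1%} of the document")
# BART sees ~8% of the report; truncation, not model quality,
# dominates its summarization score.
```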
Evaluation Metrics
- ROUGE-1/2/L: Summarization quality by overlap with reference summaries.
- Exact Match (EM) + F1: QA performance.
- Accuracy: Classification tasks (ContractNLI).
- Geometric Mean: The SCROLLS composite score is the geometric mean across task scores, so a model cannot hide a failure on one task behind high scores on easier ones.
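The QA metrics and the composite are simple enough to sketch directly. Below is a minimal version of SQuAD-style answer normalization, exact match, token-overlap F1, and the geometric-mean composite; the normalization here is the common SQuAD recipe, which SCROLLS QA scoring follows closely but may not match in every detail:

```python
# Minimal sketches of SCROLLS-style metrics. The normalization follows
# the common SQuAD recipe (lowercase, strip punctuation and articles);
# the official SCROLLS scripts may differ in details.
import re
import string
from collections import Counter
from math import prod


def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def scrolls_composite(task_scores: list[float]) -> float:
    # Geometric mean: a near-zero score on any one task drags the
    # whole composite down, so a model must do well everywhere.
    return prod(task_scores) ** (1 / len(task_scores))


print(token_f1("the GAO report", "GAO report"))    # 1.0 after normalization
print(scrolls_composite([69.4, 44.8, 87.1, 52.3]))  # ~61.3
```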
Limitations and Criticisms
- ROUGE Limitations: ROUGE correlates poorly with human judgments for abstractive summarization — good summaries can have low ROUGE if they use different vocabulary.
- Gold Standard Quality: Some reference summaries (SummScreen) are fan-written and may not represent ideal summarization.
- Fixed Contexts: SCROLLS inputs are static, pre-selected documents; the benchmark doesn't test dynamic context management (deciding what to keep in or drop from context) as models scale toward million-token windows.
SCROLLS is the "read the whole book" test for AI: a benchmark that checks whether long context windows deliver real-world value on the complete documents humans actually produce, and one of the evaluations behind the industry's multi-year investment in 32k, 128k, and million-token context architectures.