SCROLLS (Standardized CompaRison Over Long Language Sequences)

SCROLLS (Standardized CompaRison Over Long Language Sequences) is a benchmark that evaluates long-context language models on realistic NLP tasks that require processing complete documents. Unlike Long-Range Arena's largely synthetic sequences, SCROLLS uses real-world text: government reports, TV scripts, legal contracts, scientific papers, and books. It therefore measures the practical value of extended context windows for summarization, question answering, and natural language inference.

What Is SCROLLS?

The 7 SCROLLS Tasks

Summarization Tasks (3): GovReport, SummScreenFD, QMSum

QA and NLI Tasks (4): Qasper, NarrativeQA, QuALITY, ContractNLI

Why SCROLLS Matters

Performance Trends

Model (Context)            GovReport  SummScreen  ContractNLI  NarrativeQA
BART (1k tokens)           36.2       26.3        62.4         10.1
LED (16k tokens)           57.5       32.1        68.1         20.6
GPT-4 (8k tokens)          61.2       38.4        78.3         34.0
Claude 2 (100k tokens)     67.8       43.1        85.9         48.2
GPT-4 Turbo (128k tokens)  69.4       44.8        87.1         52.3
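
To make the trend in the table easier to read, the following is a minimal Python sketch that computes each model's unweighted mean over the four listed tasks and its per-task gain over the BART (1k tokens) row. The scores are copied from the table above. The helper names `mean_score` and `gains_over_baseline` are hypothetical, and the four-task mean is only an illustrative aggregate, not the official SCROLLS score.

```python
# Scores copied from the table above, in the order listed in TASKS.
TASKS = ["GovReport", "SummScreen", "ContractNLI", "NarrativeQA"]
SCORES = {
    "BART (1k tokens)": [36.2, 26.3, 62.4, 10.1],
    "LED (16k tokens)": [57.5, 32.1, 68.1, 20.6],
    "GPT-4 (8k tokens)": [61.2, 38.4, 78.3, 34.0],
    "Claude 2 (100k tokens)": [67.8, 43.1, 85.9, 48.2],
    "GPT-4 Turbo (128k tokens)": [69.4, 44.8, 87.1, 52.3],
}


def mean_score(scores):
    """Unweighted mean over the listed tasks (illustrative aggregate only)."""
    return sum(scores) / len(scores)


def gains_over_baseline(scores, baseline):
    """Per-task difference from the baseline row, rounded to one decimal."""
    return [round(s - b, 1) for s, b in zip(scores, baseline)]


baseline = SCORES["BART (1k tokens)"]
for model, scores in SCORES.items():
    gains = dict(zip(TASKS, gains_over_baseline(scores, baseline)))
    print(f"{model}: mean={mean_score(scores):.2f}, gain vs BART={gains}")
```

Under these numbers, the largest gains over the 1k-token baseline appear on NarrativeQA, the task whose source documents are whole books and scripts.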

Evaluation Metrics
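
SCROLLS scores summarization tasks with ROUGE and its QA and NLI tasks with token-level F1 or exact match. As a rough illustration of the latter two, here is a minimal Python sketch. The normalization used (lowercasing, stripping punctuation, splitting on whitespace) is an assumption made for this example, and the function names are hypothetical; the official SCROLLS scorer may normalize answers differently.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed normalization)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Entailment.", "entailment"))
print(token_f1("the cat sat", "the cat"))
```

Exact match suits tasks with a small closed label set, such as ContractNLI, while token F1 gives partial credit for free-form answers, as in NarrativeQA.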

Limitations and Criticisms

SCROLLS is the "read the whole book" test for AI: a benchmark that checks whether long context windows deliver real-world value on the complete documents humans produce. Results of this kind have helped motivate the multi-year industry investment in 32k, 128k, and million-token context architectures.
