Qasper is a question answering dataset over full NLP scientific papers: real questions asked by NLP researchers who had seen only a paper's title and abstract, with answers grounded in the complete paper text, including body paragraphs, figures, and tables. That design makes it a direct benchmark for AI research-assistant capabilities on technical scientific literature.
What Is Qasper?
- Origin: Dasigi et al. (2021), Allen Institute for AI (AI2).
- Scale: 5,049 questions over 1,585 NLP papers from the Semantic Scholar corpus.
- Format: Questions plus annotated answers with evidence spans; each answer is classified into one of four types. A loading sketch follows this list.
- Document Length: ~6,000 words per paper (abstract, methodology, experiments, results).
- Question Authors: NLP researchers who read only the title and abstract, which ensures the questions reflect genuine curiosity about the paper's content rather than trivial details.
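For hands-on use, the dataset is mirrored on the Hugging Face Hub as allenai/qasper. A minimal loading sketch, assuming the field names documented on that dataset card (title, qas, and per-question answer annotations):

```python
# Minimal loading sketch for Qasper via the Hugging Face `datasets` library.
# The dataset lives on the Hub as "allenai/qasper"; field names below follow
# its dataset card and should be treated as assumptions about the schema.
from datasets import load_dataset

papers = load_dataset("allenai/qasper", split="validation")

paper = papers[0]
print(paper["title"])
# `qas` is a sequence feature, so it arrives as a dict of parallel lists:
# one entry per question, aligned with its annotated answers.
for question in paper["qas"]["question"][:3]:
    print("-", question)
```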
Answer Types
Qasper classifies each answer into one of four types (a type-dispatch sketch follows the examples):
Type 1 — Extractive: The answer is a direct verbatim span from the paper.
- "What dataset do they use for training?" → "We train on the English Wikipedia dump from October 2018."
Type 2 — Abstractive: The answer synthesizes information from multiple passages in the annotator's own words.
- "How does their model compare to BERT on SQuAD?" → Requires integrating the results table with the conclusion paragraph.
Type 3 — Boolean: Yes/No question with supporting evidence.
- "Do they evaluate on multilingual datasets?" → Yes (supported by Table 3 and Section 4.2).
Type 4 — Unanswerable: The paper does not contain sufficient information to answer.
- "What is their training time?" → Not reported in the paper.
Why Qasper Is Challenging
- Technical Vocabulary: NLP jargon demands domain knowledge. Answering "Do they use byte-pair encoding?" means knowing what BPE is and recognizing where tokenization details typically appear in a paper.
- Diagram and Table References: Many answers require interpreting result tables (F1 scores, BLEU scores), which are dense numerical structures that models often misread.
- Paper Structure Navigation: Finding methodology details requires knowing that papers follow an Introduction → Related Work → Model → Experiments → Results structure; a toy section-ranking sketch follows this list.
- Abstract Reasoning: "Does their approach generalize to low-resource languages?" is rarely stated explicitly; the answer must be inferred from the experimental coverage.
- Unanswerable Classification: Correctly identifying that a question cannot be answered requires reading enough of the paper to be confident the information is absent.
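Navigation and evidence selection can be approximated crudely before any learned retrieval is applied. A hypothetical baseline-style sketch: rank a paper's sections by lexical overlap with the question to decide where to look first (the dict layout and section names are illustrative):

```python
def rank_sections(question: str, sections: dict[str, str]) -> list[str]:
    """Order section names by word overlap with the question.

    A toy stand-in for a learned retriever: real Qasper systems select
    evidence paragraphs with trained models, not raw lexical overlap.
    """
    q_tokens = set(question.lower().split())

    def overlap(item: tuple[str, str]) -> int:
        _, text = item
        return len(q_tokens & set(text.lower().split()))

    return [name for name, _ in sorted(sections.items(), key=overlap, reverse=True)]
```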
Performance Results
| Model | F1 (Overall) | Extractive F1 | Boolean Acc | Abstractive F1 |
|-------|-------------|--------------|-------------|----------------|
| Longformer baseline | 28.8% | 35.2% | 72.4% | 14.6% |
| LED (AI2) | 32.1% | 38.4% | 75.1% | 18.9% |
| GPT-3.5 (RAG) | 42.6% | 49.3% | 81.2% | 28.4% |
| GPT-4 (full paper) | 58.3% | 64.7% | 87.9% | 42.1% |
| Human annotator | 82.4% | 86.1% | 91.3% | 72.8% |
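The F1 columns are token-level F1 in the SQuAD style: predictions and references are normalized (lowercased, punctuation and articles removed), scored by token overlap, and the maximum is taken over multiple reference answers. A self-contained sketch of that metric:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and
    articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        # Two empty answers (e.g., both "unanswerable") count as a match.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, references: list[str]) -> float:
    """Qasper reports the maximum F1 over all reference answers."""
    return max(token_f1(prediction, r) for r in references)
```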
Why Qasper Matters
- Research Assistant AI: Qasper directly measures the capability of "AI scientist" tools — systems that help researchers understand papers, extract experimental details, and compare results across publications.
- Scientific Literature Scale: With over 200 million academic papers published, manual reading is infeasible. Qasper benchmarks how well AI can substitute for human reading of technical papers.
- Evidence-Grounded Answers: Unlike open-domain QA, Qasper answers must cite specific evidence spans — enforcing accountability and verifiability in scientific claims.
- Unanswerable Recognition: For research tools, correctly saying "this paper doesn't report that metric" is as important as correctly extracting a reported value — Qasper explicitly evaluates this capability.
- SCROLLS Integration: Qasper is one of the question-answering tasks in the SCROLLS long-document benchmark, which extends its reach into general long-context evaluation.
Applications This Enables
- Systematic Literature Review: AI tools that can answer "which papers evaluate on multilingual data?" across hundreds of papers; a minimal sketch of this pattern follows the list.
- Experimental Detail Extraction: automating meta-analysis across publications with questions like "What batch sizes did ImageNet papers from 2020 use?"
- Peer Review Assistance: Checking if a submitted paper answers questions that reviewers are likely to ask.
- Citation Recommendation: Understanding what specific claims a paper makes to recommend it for specific citation contexts.
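A minimal sketch of the literature-review pattern from the first bullet: pose one question to every paper and collect the answers. The helper ask_model is a hypothetical stand-in for any Qasper-style QA system, and the paper dict fields are illustrative:

```python
from typing import Callable

def survey(papers: list[dict], question: str,
           ask_model: Callable[[str, str], str]) -> dict[str, str]:
    """Map each paper's title to the model's answer for one question.

    Callers can filter out papers whose answer is "unanswerable", e.g.
    to find which papers evaluate on multilingual data.
    """
    answers = {}
    for paper in papers:
        text = paper["title"] + "\n" + paper["full_text"]
        answers[paper["title"]] = ask_model(question, text)
    return answers
```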
Qasper is the literature-review benchmark: it measures AI's ability to answer the specific technical questions that scientists ask about papers, grounded in the complete paper text, and it sets the standard for research-assistant tools that could transform how humans navigate and synthesize the scientific literature.