SQuAD (Stanford Question Answering Dataset) is the reading comprehension benchmark that defined the extractive QA paradigm. It consists of questions posed about Wikipedia passages where the answer must be a contiguous text span (a substring) of the passage. SQuAD fueled the development of BERT-era span-extraction architectures and established the reading comprehension task format that dominated NLP from 2016 to 2020.
Origins and Construction
SQuAD v1.1 was released in 2016 by Rajpurkar et al. at Stanford. Construction methodology:
- Source: 536 Wikipedia articles across diverse topics.
- Crowdsourcing: Amazon Mechanical Turk workers read each paragraph, wrote up to five factoid questions about it, and highlighted the answer span for each question.
- Scale: 107,785 question-answer pairs across 536 articles and 23,215 paragraphs.
- Guarantee: Every question in v1.1 is guaranteed to have an answer within the passage — the model's task is only to locate the answer, not determine answerability.
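For readers who want to poke at the data, the snippet below loads v1.1 through the Hugging Face `datasets` library and checks the span guarantee. The library, its "squad" dataset name, and the field names are assumptions external to the original release (which ships as raw JSON); this is a convenience sketch, not part of the dataset itself.

```python
# Sketch: inspecting SQuAD v1.1 records via the Hugging Face `datasets` library
# (assumes `pip install datasets`; field names follow that library's "squad" schema).
from datasets import load_dataset

squad = load_dataset("squad")      # splits: "train" (87,599) and "validation" (10,570)
example = squad["train"][0]

print(example["title"])            # source Wikipedia article title
print(example["context"])          # the passage (one paragraph)
print(example["question"])         # crowd-written factoid question
print(example["answers"])          # {"text": [...], "answer_start": [...]}

# v1.1 answers are literal substrings of the context, located by character offset:
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
print(example["context"][start:start + len(text)] == text)   # expected: True
```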
The Span Extraction Formulation
SQuAD established the standard output format for BERT-era QA:
- Input: Passage P (context) + Question Q.
- Output: Start token index and end token index within P that define the answer span.
- Model architecture: A linear layer over BERT token representations produces "start logits" and "end logits" for each token; the highest-scoring (start, end) pair with start ≤ end gives the predicted span.
This formulation is elegant: answering reduces to two classifications over token positions (which token starts the answer, which token ends it), enabling efficient fine-tuning on top of pre-trained language models.
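A minimal sketch of this head and the usual span decoding is shown below. It is written in PyTorch for illustration only; `SpanHead` and `decode_span` are made-up names, and real implementations add batching, padding masks, and n-best decoding.

```python
# Illustrative span-extraction head and greedy span decoding (not the reference BERT code).
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """One linear layer mapping each token's hidden state to a start score and an end score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size) from the pretrained encoder
        logits = self.qa_outputs(hidden_states)            # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)   # each (batch, seq_len)
        return start_logits, end_logits

def decode_span(start_logits: torch.Tensor, end_logits: torch.Tensor,
                max_answer_len: int = 30):
    """Pick the (start, end) pair with the highest summed logits, subject to start <= end."""
    # start_logits, end_logits: (seq_len,) for a single example
    scores = start_logits[:, None] + end_logits[None, :]   # scores[i, j] = start_i + end_j
    seq_len = scores.size(0)
    idx = torch.arange(seq_len)
    valid = (idx[None, :] >= idx[:, None]) & (idx[None, :] - idx[:, None] < max_answer_len)
    scores = scores.masked_fill(~valid, float("-inf"))
    flat = scores.argmax().item()
    return flat // seq_len, flat % seq_len                  # (start_index, end_index)
```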
Evaluation Metrics
Exact Match (EM): Fraction of predictions where the predicted span exactly matches one of the ground truth answer spans (normalized for punctuation and articles). A strict metric that penalizes minor paraphrasing.
F1 Score: Token-level F1 between predicted and ground truth answers, computed as the harmonic mean of precision (fraction of predicted tokens that are correct) and recall (fraction of correct tokens that are predicted). More forgiving than EM and the primary ranking metric.
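The sketch below re-implements both metrics in the spirit of the official evaluation script (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace, then token-level overlap). It is a simplified approximation for illustration, not the official scorer; in practice each question has several reference answers and the per-question score is the maximum over them.

```python
# Simplified EM and token-level F1 in the style of the SQuAD evaluation script.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop English articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Norman Dynasty", "Norman dynasty"))  # 1.0 after normalization
print(f1("in the city of Paris", "Paris"))                  # partial credit under F1
```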
Human Performance: Human annotators on SQuAD v1.1 achieve ~82.3 EM and ~91.2 F1. In late 2018, BERT-large exceeded human EM on the v1.1 development set (84.1 EM / 90.9 F1), and BERT ensembles soon surpassed human F1 on the test leaderboard, demonstrating that span extraction from well-formed passages was essentially solved by large pretrained transformers.
SQuAD 2.0 — The Answerability Challenge
SQuAD v2.0 (2018) added 53,775 unanswerable questions to the original v1.1 data — adversarially written to be plausible given the passage but not actually answerable from it.
"What color is the sky in this passage?" when the passage discusses atmospheric optics but never names the color.
The model must now make two decisions:
1. Is the question answerable from the passage? (Binary classification using [CLS] representation)
2. If yes, what is the answer span? (Start/end logit prediction)
SQuAD 2.0 is significantly harder: models must avoid extracting plausible-looking spans for unanswerable questions. Drawing the line between "answerable" and "unanswerable" requires understanding the passage at a semantic level, not just finding keyword-matching spans. Human performance: ~86 EM / ~89.5 F1. Top models: ~90 EM / ~92 F1 as of 2021.
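One common way to wire the two decisions together, following the BERT-style recipe, is to treat the start and end logits at the [CLS] position as a "no answer" score and compare it against the best non-null span, with a threshold tuned on the dev set. The sketch below illustrates that decision rule; the function and parameter names are illustrative, and specific systems differ in how they score the null option.

```python
# Sketch of a BERT-style no-answer decision for SQuAD 2.0, assuming per-token
# start/end logits as in the span head above. The threshold is tuned on dev data.
import torch

def predict_with_null(start_logits: torch.Tensor, end_logits: torch.Tensor,
                      null_threshold: float = 0.0, max_answer_len: int = 30):
    # Position 0 is the [CLS] token; its logits define the "no answer" score.
    null_score = (start_logits[0] + end_logits[0]).item()

    # Best non-null span: same decoding as v1.1, restricted to positions > 0.
    seq_len = start_logits.size(0)
    idx = torch.arange(seq_len)
    valid = (idx[None, :] >= idx[:, None]) & (idx[None, :] - idx[:, None] < max_answer_len)
    valid[0, :] = False
    valid[:, 0] = False
    scores = (start_logits[:, None] + end_logits[None, :]).masked_fill(~valid, float("-inf"))
    best_span_score = scores.max().item()
    start, end = divmod(scores.argmax().item(), seq_len)

    if null_score - best_span_score > null_threshold:
        return None            # predict "unanswerable"
    return start, end
```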
The BERT Revolution on SQuAD
SQuAD became the primary benchmark demonstrating BERT's superiority:
| Model | SQuAD v1.1 F1 | SQuAD v2.0 F1 |
|-------|--------------|--------------|
| BiDAF (2016) | 77.3 | — |
| R-NET (2017) | 86.0 | — |
| BERT-large (2018) | 93.2 | 83.0 |
| RoBERTa (2019) | 94.6 | 86.8 |
| ALBERT-xxlarge (2020) | 95.0 | 90.9 |
| Human | 91.2 | 89.5 |
BERT's roughly seven-point F1 improvement over R-NET (the previous state of the art), obtained simply by fine-tuning the pretrained model with a thin span-prediction head, established the transfer learning paradigm as the dominant approach to NLP tasks.
Limitations and Critiques
Span-Only Answers: SQuAD only tests questions answerable by text spans. It cannot evaluate questions requiring synthesis, arithmetic, temporal reasoning, or information not in the passage.
Simplified Passages: Wikipedia passages are well-structured, factual, and clearly written. Real-world QA involves noisy, ambiguous, or contradictory sources.
Short Passages: Passages average ~120 words. Long-document reading comprehension (books, reports, legal contracts) is not tested.
Train-Test Distribution: Training, development, and test questions are all drawn from the same pool of 536 Wikipedia articles (partitioned across splits), so models see the same domain and writing style at training and test time; topic- and style-specific shortcuts may inflate performance.
Legacy Datasets Inspired by SQuAD
SQuAD spawned a generation of reading comprehension datasets:
- TriviaQA: 650k question-answer-evidence triples built from trivia websites, with evidence documents gathered from web search results and Wikipedia via distant supervision.
- Natural Questions: Real Google search queries with long and short answer annotations from Wikipedia.
- HotpotQA: Multi-hop reasoning across two Wikipedia paragraphs required to answer each question.
- QuAC: Conversational QA where context accumulates across dialogue turns.
- DROP: Discrete reasoning requiring counting, arithmetic, and sorting over passage content.
SQuAD is the reading comprehension benchmark that launched the extractive QA era. It defined the span-extraction output format adopted by BERT, established that passage-grounded answering is achievable at near-human performance, and inspired a decade of increasingly challenging QA benchmarks.