NewsQA

Keywords: newsqa, evaluation

NewsQA is a machine reading comprehension dataset of 119,633 question-answer pairs based on CNN news articles. It is distinguished by its information-seeking construction methodology: crowdworkers wrote questions after seeing only the article headline and summary bullets, not the full article, so the questions represent genuine curiosity-driven information seeking rather than passage-scanning exercises.

Construction Methodology and Its Significance

Most reading comprehension datasets are constructed retrospectively: annotators read a passage and then write questions about what they just read. This produces questions whose answers are mentally available to the question writer, often leading to questions that can be answered by surface-level keyword matching rather than genuine comprehension.

NewsQA used a two-phase construction that separates question creation from answer annotation:

Phase 1 — Question Writing: Crowdworkers saw only the CNN article headline and the editorial highlight bullets (3–5 key facts). Without reading the full article, they wrote questions they would want answered — genuine information gaps relative to what the headline and bullets told them.

Phase 2 — Answer Annotation: A different set of crowdworkers received the full article and each question, then selected the answer span (or marked it as unanswerable). Multiple annotators provided answers; disagreements were adjudicated.

This separation produces questions that genuinely probe the article's informational content rather than surface features of the text — because question writers had no access to the surface form of the article.
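The output of the two phases can be pictured as span-annotated records. A minimal sketch in Python, with illustrative field names (the official release uses its own schema, and character offsets here are hand-computed for this toy story):

```python
# A hypothetical NewsQA-style record. Field names and the example story
# are illustrative, not the official release schema.
record = {
    "story_text": (
        "The city council approved the budget on Monday. "
        "Mayor Lee said the plan funds 40 new teachers."
    ),
    "question": "What does the budget fund?",
    # Character offsets (start, end) of the span selected in Phase 2.
    "answer_span": (78, 93),
    # Set by annotators when the article does not answer the question.
    "is_answer_absent": False,
}

def extract_answer(rec):
    """Return the annotated answer text, or None for null questions."""
    if rec["is_answer_absent"]:
        return None
    start, end = rec["answer_span"]
    return rec["story_text"][start:end]
```

Storing character offsets rather than answer strings keeps the annotation unambiguous when the same phrase appears more than once in an article.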

Dataset Characteristics

- Source: 12,744 CNN articles from the CNN/Daily Mail dataset.
- Scale: 119,633 question-answer pairs (9.4 questions per article on average).
- Answer format: Text spans from the article (extractive), or NULL (no answer).
- Null answers: ~9.5% of questions are marked as unanswerable from the article.
- Human F1: ~69.4 (reflecting genuine question difficulty and inter-annotator disagreement).
- Question types: Why (15%), Where (13%), Who (26%), What (31%), When (8%), How (7%).
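Extractive answers with a null option are conventionally scored with SQuAD-style token-overlap F1, taking the maximum over multiple annotators' answers. A simplified sketch (official evaluation scripts also normalize punctuation and articles, omitted here):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between two answer strings.
    None represents an unanswerable (null) answer."""
    if prediction is None or gold is None:
        # A null question is credited only when both sides agree on null.
        return float(prediction == gold)
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, references):
    """Max over annotator answers, the usual multi-reference rule."""
    return max(token_f1(prediction, ref) for ref in references)
```

The max-over-references rule is one reason reported human F1 sits below 100: each annotator's answer is scored against the other annotators' spans, so disagreement directly lowers the ceiling.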

Challenges for Models

Inverted Pyramid Reading: CNN news articles use the inverted pyramid structure — most important information at the top, supporting details below. NewsQA questions frequently probe the supporting detail sections rather than the lead paragraph, requiring reading the full article.

Multi-Sentence Evidence: Many NewsQA answers require integrating information across multiple non-adjacent sentences. "Why did the president veto the bill?" may require one sentence stating the veto and another giving the reason, separated by paragraphs of background.

Ambiguous and Null Answers: The information-seeking construction naturally produces questions that the article does not fully answer, reflecting the reality that news articles often raise more questions than they resolve. The ~9.5% null rate is far lower than SQuAD 2.0's (roughly half of its dev set), but NewsQA's nulls arise naturally from the information gap rather than from deliberately adversarial question writing.

Journalism-Specific Language: News writing uses specialized conventions: attributions ("according to officials"), hedging ("allegedly"), temporal markers ("last Tuesday"), and unnamed sources ("a senior official said"). Models must handle these conventions to extract accurate answers.
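A toy illustration of surfacing these conventions with pattern matching; the pattern lists below are small illustrative samples, not a complete inventory:

```python
import re

# Illustrative journalism-convention markers (a small sample).
ATTRIBUTION = re.compile(r"\b(according to|said|told reporters)\b", re.I)
HEDGES = re.compile(r"\b(allegedly|reportedly|purportedly)\b", re.I)

def flag_conventions(sentence):
    """Flag whether a sentence carries attribution or hedging markers."""
    return {
        "attributed": bool(ATTRIBUTION.search(sentence)),
        "hedged": bool(HEDGES.search(sentence)),
    }
```

A QA model that ignores such markers risks reporting an alleged or attributed claim as established fact when extracting an answer span.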

Comparison with SQuAD

| Aspect | SQuAD v1.1 | NewsQA |
|--------|-----------|--------|
| Source | Wikipedia (encyclopedia) | CNN news articles |
| Construction | Retrospective | Information-seeking |
| Article length | ~120 words/passage | ~600 words/article |
| Null answers | None | ~9.5% |
| Human F1 | ~91.2 | ~69.4 |
| Answer distribution | Roughly uniform within passage | Spread throughout article, including later sections |

The lower human F1 on NewsQA (69.4 vs. 91.2) reflects genuine ambiguity in news writing: multiple valid interpretations, partial answers, and questions that touch on information only implied rather than stated in the article.

Model Performance

| Model | NewsQA F1 |
|-------|----------|
| LSTM baseline | 50.1 |
| BERT-base | 65.9 |
| RoBERTa-large | 74.2 |
| Human | 69.4 |

RoBERTa-large surpasses the human baseline in F1, but human annotators produce more consistent and semantically valid answers at the individual-question level; the metric advantage reflects answer-span selection patterns rather than genuine comprehension superiority.

Information-Seeking QA and Downstream Applications

NewsQA's information-seeking design mirrors real-world applications:

News Search and Retrieval: Users searching for information about an event have seen headlines and want specific details — exactly the information gap that NewsQA questions model.

Automated Journalism: Systems that generate news summaries or answer questions about breaking events need the comprehension skills NewsQA tests.

Fact-Checking: Verifying claims against news articles requires reading journalism-style text and extracting specific factual claims.

Enterprise Knowledge Management: Internal news feeds and corporate communications require the same information-seeking QA pattern — employees who have seen an executive summary want details from the underlying report.

Legacy and Influence

NewsQA contributed to the understanding that:
- Construction methodology matters: Information-seeking construction produces harder, more naturalistic questions than retrospective construction.
- Human performance varies by domain: The ~69% human F1 demonstrated that "human-level" is domain-dependent — humans agree less on news QA than on encyclopedia QA because news is intentionally ambiguous.
- Transfer from related tasks helps: models first fine-tuned on related datasets such as SQuAD or MNLI before NewsQA consistently outperform models fine-tuned on NewsQA alone.

NewsQA is a news reading comprehension benchmark built around genuine curiosity, constructed so that questions reflect what a reader actually wants to know after seeing a headline. The result is a harder and more realistic reading comprehension challenge than passage-scanning exercises.
