SciTail is a textual entailment dataset derived from school-level science exam questions. It is constructed by converting multiple-choice exam questions into premise-hypothesis pairs and asking whether a retrieved science-domain sentence entails a candidate answer statement, making it a domain-specific NLI benchmark that tests scientific reasoning rather than general language inference.
Construction Methodology
SciTail's construction is distinctive: it derives NLI pairs from a QA task rather than directly annotating entailment relationships. The process:
Step 1 — Science QA Source: Questions come from ARC (AI2 Reasoning Challenge), a dataset of 8,000 multiple-choice science exam questions from grades 3–9, covering topics like biology, chemistry, physics, earth science, and astronomy.
Step 2 — Statement Conversion: Each multiple-choice question plus answer option is converted into a declarative statement (the hypothesis), as in the example below; a toy conversion sketch follows it:
- Question: "What organ produces insulin in the human body?"
- Answer option: "The pancreas"
- Hypothesis: "The pancreas produces insulin in the human body."
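The paper's actual conversion procedure is not reproduced here; a minimal rule-based sketch for simple wh-questions, with a hypothetical `question_to_hypothesis` helper, illustrates the shape of the transformation:

```python
import re

def question_to_hypothesis(question: str, answer: str) -> str:
    """Toy rule: swap the leading wh-phrase for the answer option.

    Illustrative only -- handles patterns like
    "What organ produces insulin in the human body?" + "The pancreas".
    """
    # Drop a leading "What <noun>" / "Which <noun>" phrase and the trailing "?"
    body = re.sub(r"^(What|Which)\s+\w+\s+", "", question).rstrip("?")
    return f"{answer} {body}."

print(question_to_hypothesis(
    "What organ produces insulin in the human body?", "The pancreas"))
# -> "The pancreas produces insulin in the human body."
```

A rule this simple breaks on fill-in-the-blank stems and negated questions; it is only meant to show the question-to-statement idea, not the dataset's actual pipeline.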
Step 3 — Evidence Retrieval: For each hypothesis, candidate premise sentences are retrieved from a large corpus of science-relevant web text using standard information retrieval.
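A minimal retrieval sketch using TF-IDF cosine similarity via scikit-learn; the three-sentence corpus is a toy stand-in, and the actual corpus and ranking function behind SciTail may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus standing in for the large science-text collection.
corpus = [
    "The pancreas secretes insulin, which regulates blood glucose.",
    "Tectonic plates move slowly over the Earth's mantle.",
    "Photosynthesis converts light energy into chemical energy.",
]
hypothesis = "The pancreas produces insulin in the human body."

# Rank corpus sentences by TF-IDF cosine similarity to the hypothesis.
vectorizer = TfidfVectorizer().fit(corpus + [hypothesis])
scores = cosine_similarity(
    vectorizer.transform([hypothesis]), vectorizer.transform(corpus))[0]

for i in scores.argsort()[::-1][:2]:  # top-2 candidate premises
    print(f"{scores[i]:.2f}  {corpus[i]}")
```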
Step 4 — Entailment Annotation: Human annotators determine whether each retrieved sentence (premise) entails the hypothesis (Entails / Neutral). The premise either clearly establishes the scientific fact stated in the hypothesis or does not.
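The paper's exact annotation protocol is not detailed here; as a rough sketch, aggregating crowd labels per pair by majority vote with an agreement threshold might look like this (the threshold and label strings are hypothetical):

```python
from collections import Counter

def aggregate(labels: list[str], min_agreement: float = 0.6) -> str | None:
    """Majority-vote a pair's crowd labels; drop pairs below the agreement bar.

    Hypothetical thresholds -- SciTail's actual protocol may differ.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None

print(aggregate(["entails", "entails", "neutral"]))  # 'entails'
print(aggregate(["entails", "neutral"]))             # None (no clear majority)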
Dataset Statistics
- Training set: 23,596 premise-hypothesis pairs.
- Development set: 1,304 pairs.
- Test set: 2,126 pairs.
- Class distribution: ~37% Entails vs. ~63% Neutral (there is no "Contradiction" label: premises are retrieved for topical relevance, and every non-entailing premise is labeled Neutral rather than distinguished as a contradiction).
- Labels: binary (Entails / Neutral), unlike standard three-class NLI; a short loading sketch follows.
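As referenced above, a loading sketch assuming the Hugging Face hub copy of SciTail; the `scitail` dataset name, the `tsv_format` configuration, and the `premise`/`hypothesis`/`label` field names come from its dataset card and may change, so check the card if this errors:

```python
from datasets import load_dataset

# Hub copy of SciTail; "tsv_format" exposes premise/hypothesis/label columns.
scitail = load_dataset("scitail", "tsv_format")
print(scitail)  # train / validation / test splits and their sizes

example = scitail["train"][0]
print(example["premise"], "=>", example["hypothesis"], "|", example["label"])
```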
Why SciTail Is Different from Standard NLI
Domain Specificity: Standard NLI datasets (SNLI, MNLI) draw from general text (image captions, news, fiction). SciTail's premises are science-domain prose: precise, technical, definitional writing that differs substantially from conversational or journalistic text.
No Contradiction Class: Because hypotheses are constructed from answer candidates (which are plausibly related to the question topic) and premises are retrieved by relevance, the retrieved evidence either entails the hypothesis or is merely tangentially related — deliberate contradictions are not generated.
Factual Accuracy Requirement: Scientific entailment requires accurate reasoning about facts, not just logical inference from premises. Recognizing that "Mitochondria produce ATP" entails "cells generate energy through organelles" requires both understanding the biological process and recognizing the paraphrase relationship.
Scientific Vocabulary: Specialized terminology (photosynthesis, mitosis, tectonic plates, Newton's laws) requires either pre-training on scientific text or domain adaptation to handle correctly.
Why SciTail Is Hard
Lexical Paraphrase Gap: Science textbooks often explain concepts using technical vocabulary, while exam questions use more accessible language. "The sun's gravitational pull keeps planets in orbit" must be recognized as entailing "the force of gravity from stars maintains planetary motion."
Conceptual Abstraction: Connecting specific facts to general principles:
- Premise: "Water expands when it freezes, which is why ice is less dense than liquid water."
- Hypothesis: "Solid water is less dense than liquid water."
- Relationship: Entails — but requires recognizing "ice" = "solid water" and understanding the density implication.
Multi-Step Inference: Some entailment relationships require implicit reasoning steps, as in the example below; the overlap check after it shows why surface matching misses them:
- Premise: "Plants use sunlight to convert CO2 and water into glucose."
- Hypothesis: "Photosynthesis requires light energy."
- Relationship: Entails — but requires connecting "sunlight" to "light energy" and recognizing "photosynthesis" as the process described.
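A crude content-word-overlap score, the kind of signal a bag-of-words baseline leans on, makes the difficulty concrete: the pair above shares essentially no content words, so surface matching alone cannot find the entailment. The stoplist here is ad hoc:

```python
def content_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis content words that also appear in the premise."""
    stop = {"the", "a", "an", "to", "and", "into", "requires"}  # ad hoc
    p = {w.strip(".,").lower() for w in premise.split()} - stop
    h = {w.strip(".,").lower() for w in hypothesis.split()} - stop
    return len(p & h) / len(h)

print(content_overlap(
    "Plants use sunlight to convert CO2 and water into glucose.",
    "Photosynthesis requires light energy."))
# -> 0.0: zero lexical overlap, despite a clear entailment.
```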
Model Performance
| Model | SciTail Accuracy |
|-------|----------------|
| DecompAtt (decomposable attention) | 72.3% |
| BiLSTM + attention | 75.2% |
| BERT-base | 94.0% |
| RoBERTa-large | 96.3% |
| Human | ~88% estimated |
The large jump from LSTM-based models to BERT (75% → 94%) shows what pre-training contributes: exposure to scientific facts and paraphrase patterns. BERT also exceeds the estimated human accuracy on SciTail, plausibly because crowd annotators working quickly make mistakes on technical content, while pre-trained models have absorbed vast amounts of scientific text.
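A sketch of the kind of fine-tuning behind the transformer rows, using the Hugging Face `transformers` Trainer; the hyperparameters are illustrative, not the settings that produced the numbers above, and the dataset/config/label names assume the hub copy of SciTail:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # binary: Entails vs. Neutral

ds = load_dataset("scitail", "tsv_format")
label2id = {"entails": 0, "neutral": 1}  # label strings per the dataset card

def encode(batch):
    # Premise and hypothesis are paired into a single [SEP]-joined input.
    enc = tok(batch["premise"], batch["hypothesis"],
              truncation=True, max_length=128)
    enc["labels"] = [label2id[l] for l in batch["label"]]
    return enc

ds = ds.map(encode, batched=True, remove_columns=ds["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scitail-bert",
                           per_device_train_batch_size=32,
                           num_train_epochs=3,
                           learning_rate=2e-5),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate(ds["test"]))
```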
SciTail in the NLP Ecosystem
SciTail serves several roles:
Domain Transfer Test: Models trained on MNLI or SNLI and then evaluated on SciTail measure how well NLI reasoning transfers to the science domain. BERT-based models transfer well; LSTM models with word embeddings show larger domain gaps.
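A zero-shot transfer sketch: score a SciTail-style pair with an off-the-shelf MNLI checkpoint (`roberta-large-mnli` is assumed here) and collapse its three classes down to SciTail's two:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

premise = ("Water expands when it freezes, which is why ice is "
           "less dense than liquid water.")
hypothesis = "Solid water is less dense than liquid water."

enc = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.softmax(dim=-1)[0]

# Collapse MNLI's three classes to SciTail's two: entailment stays "entails";
# contradiction and neutral both map to "neutral".
entail_id = model.config.label2id.get("ENTAILMENT", 2)  # verify via id2label
label = "entails" if probs[entail_id] > 0.5 else "neutral"
print(label, round(float(probs[entail_id]), 3))
```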
Retriever Evaluation: In open-domain science QA systems, the retrieval component must find passages that entail correct answers and not retrieve passages that are tangentially related. SciTail evaluates whether a retrieval-entailment pipeline correctly separates relevant from irrelevant evidence.
Science QA Pre-training: Using SciTail as an auxiliary training task improves downstream science QA (ARC, OpenBookQA) by explicitly teaching models the entailment relationship between retrieved evidence and science statements.
Cross-Domain NLI Analysis: Comparing SNLI/MNLI-trained model performance on SciTail vs. in-domain SciTail performance reveals how much domain-specific knowledge (vs. general entailment reasoning) drives performance differences.
SciTail is science-class logic: an entailment benchmark that tests whether models can determine when a retrieved science explanation proves a scientific claim, requiring both accurate world knowledge and the reasoning ability to bridge the paraphrase gap between source-text language and exam-question formulations.