FEVER (Fact Extraction and VERification) is a large-scale benchmark dataset and shared task for evaluating automated fact-checking systems. It is one of the most widely used benchmarks for systems that verify claims against textual evidence.
Dataset Structure
- Claims: 185,445 claims generated by altering sentences extracted from Wikipedia, then verified by annotators who did not see the sentence each claim was derived from.
- Evidence: The knowledge source is the full English Wikipedia (~5.4 million articles at time of creation).
- Labels: Each claim is labeled as:
  - SUPPORTED: Evidence in Wikipedia confirms the claim.
  - REFUTED: Evidence in Wikipedia contradicts the claim.
  - NOT ENOUGH INFO (NEI): Wikipedia doesn't contain sufficient evidence to verify or refute the claim.
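As a rough illustration of the structure above, here is a minimal sketch of reading one claim record. It assumes a JSON-lines layout in which each record carries an `id`, a `claim`, a `label`, and a list of evidence sets whose entries are `[annotation_id, evidence_id, page_title, sentence_index]`, and it assumes the label strings `SUPPORTS`, `REFUTES`, and `NOT ENOUGH INFO` for the three classes described above. The `Claim` class, the `parse_claim` helper, and the example record are hypothetical, so check the field names against the files you actually download.

```python
import json
from dataclasses import dataclass

# Assumed label strings for SUPPORTED / REFUTED / NEI in the data files.
LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}


@dataclass(frozen=True)
class Claim:
    id: int
    claim: str
    label: str
    # Tuple of evidence sets; each set is a tuple of (page_title, sentence_index).
    # Any one complete set is taken to be sufficient to justify the label.
    evidence_sets: tuple


def parse_claim(line: str) -> Claim:
    """Parse one JSON line into a Claim (hypothetical helper)."""
    raw = json.loads(line)
    if raw["label"] not in LABELS:
        raise ValueError(f"unknown label: {raw['label']!r}")
    sets = []
    for group in raw.get("evidence", []):
        # Drop the annotation bookkeeping ids and keep (page, sentence index).
        # NEI records are assumed to carry null pages, which are skipped.
        pairs = tuple((page, sent) for _, _, page, sent in group if page is not None)
        if pairs:
            sets.append(pairs)
    return Claim(raw["id"], raw["claim"], raw["label"], tuple(sets))


# Made-up record for illustration only; not taken from the dataset.
example = (
    '{"id": 1, "label": "SUPPORTS", "claim": "Example claim.", '
    '"evidence": [[[10, 20, "Example_Page", 0]]]}'
)
record = parse_claim(example)
print(record.label, record.evidence_sets)
```

Keeping evidence as a collection of alternative sets, rather than one flat list, matters later: a claim can have several independent justifications, and some of them need more than one sentence.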
The FEVER Task
- Step 1 — Document Retrieval: Given a claim, identify relevant Wikipedia documents.
- Step 2 — Sentence Selection: From retrieved documents, select the specific sentences that serve as evidence.
- Step 3 — Claim Verification: Using the selected evidence, classify the claim as SUPPORTED, REFUTED, or NEI.
- Evaluation Metric: FEVER Score. A claim counts as correctly verified only if the predicted label is correct and, for SUPPORTED/REFUTED claims, the predicted evidence contains at least one complete set of gold evidence sentences.
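The strict scoring rule can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the official scorer: the function names, the dictionary layout, and the page names are hypothetical, the label strings follow this article's spelling (data files may spell them differently), and the cap of five predicted evidence sentences per claim is an assumption about the shared-task setup.

```python
MAX_EVIDENCE = 5  # assumed cap on predicted evidence sentences per claim
NEI = "NOT ENOUGH INFO"


def is_strictly_correct(pred_label, pred_evidence, gold_label, gold_evidence_sets):
    """True if the label matches and, for non-NEI claims, the predicted
    evidence covers at least one complete gold evidence set."""
    if pred_label != gold_label:
        return False
    if gold_label == NEI:
        return True  # no evidence requirement for NEI claims
    predicted = set(pred_evidence[:MAX_EVIDENCE])
    return any(set(gold) <= predicted for gold in gold_evidence_sets)


def fever_score(predictions, gold):
    """Fraction of claims that are strictly correct."""
    hits = sum(
        is_strictly_correct(p["label"], p["evidence"], g["label"], g["evidence_sets"])
        for p, g in zip(predictions, gold)
    )
    return hits / len(gold)


# Made-up gold data: evidence is (page_title, sentence_index) pairs.
gold = [
    {"label": "SUPPORTED",
     "evidence_sets": [[("Page_A", 0)], [("Page_B", 2), ("Page_C", 1)]]},
    {"label": "REFUTED", "evidence_sets": [[("Page_D", 3)]]},
    {"label": NEI, "evidence_sets": []},
]
predictions = [
    # Covers the second (two-sentence) gold set completely: correct.
    {"label": "SUPPORTED", "evidence": [("Page_B", 2), ("Page_C", 1)]},
    # Right label, wrong sentence: counted as incorrect.
    {"label": "REFUTED", "evidence": [("Page_D", 0)]},
    # NEI needs only the right label: correct.
    {"label": NEI, "evidence": []},
]

score = fever_score(predictions, gold)
print(f"FEVER score: {score:.3f}")
```

In this toy run two of the three claims pass, so the score is 2/3. The second prediction shows why the metric is stricter than label accuracy: a correct label without a complete supporting evidence set earns no credit.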
Why FEVER Matters
- Standard Benchmark: Many automated fact-checking papers report results on FEVER, enabling direct comparison.
- Full Pipeline Evaluation: Tests the complete fact-checking pipeline, not just individual components.
- Research Impact: Has driven significant advances in evidence retrieval and natural language inference.
FEVER Shared Tasks
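- FEVER 1.0 (2018): First shared task. The baseline used TF-IDF retrieval, and top systems paired keyword- or entity-based document retrieval with pre-BERT neural NLI models such as ESIM. BERT-based verifiers appeared in later work.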
- FEVER 2.0 (2019): Added adversarial claim generation to test system robustness.
- Subsequent Work: Extensions such as Symmetric FEVER, multilingual FEVER variants, and FEVEROUS, which adds structured evidence such as tables.
State-of-the-Art Performance
- Top published systems reach FEVER Scores in roughly the mid-to-high 70s (percent) on the test set, with label accuracy a few points higher, leaving significant room for improvement.
- The hardest cases involve multi-hop reasoning (requiring evidence from multiple sources) and NEI classification (distinguishing "not enough info" from "refuted").
Limitations
- Wikipedia Only: Real-world fact-checking requires evidence from diverse sources beyond Wikipedia.
- Synthetic Claims: Claims were generated by altering Wikipedia sentences, which may not reflect natural misinformation patterns.
- Temporal: Based on a fixed Wikipedia snapshot, so it does not capture knowledge that has changed since.
FEVER is the foundational benchmark for automated fact-checking research — it established the standard evaluation framework that the field continues to build upon.