CommonsenseQA

CommonsenseQA is a multiple-choice question answering benchmark that tests an AI system's ability to apply implicit background knowledge about the world. This is the kind of everyday reasoning humans perform effortlessly but that standard NLP models find challenging, because the answers cannot be looked up in a provided passage and instead require knowing how the physical world works, what social norms apply, and how people typically behave. Constructed by Talmor et al. (2019) from crowd-sourced questions built on the ConceptNet knowledge graph, CommonsenseQA became one of the most widely used benchmarks for measuring progress toward AI systems with human-like general understanding.

What Commonsense Reasoning Means

Commonsense reasoning is the ability to apply obvious, unstated knowledge that nearly any human implicitly possesses, such as knowing that people are expected to stay quiet in a library.

None of this is typically written down explicitly. Few texts bother to state that quiet is important in libraries. Language models trained purely on text can learn statistical associations, but true commonsense goes deeper: it requires causal, spatial, temporal, and social reasoning grounded in world experience.

Dataset Construction Methodology

CommonsenseQA's construction pipeline using ConceptNet:

1. Start with a ConceptNet relation: e.g., (museum, AtLocation, city center)
2. Generate a seed question requiring knowledge of this relation: "Where is a museum typically located?"
3. Find answer candidates: Use ConceptNet graph traversal to find semantically related but conceptually different nodes (other AtLocation targets like "neighborhood," "rural area," "shopping mall")
4. Human validation: Crowd workers verify that only one answer is clearly correct and distractors are plausible but wrong
5. Result: 12,247 multiple-choice questions, 5 choices each, with train/validation/test splits
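
Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual tooling: `TRIPLES` is a tiny hand-written stand-in for ConceptNet (the "university campus" and `UsedFor` entries are invented for the example), `Question` and `build_question` are hypothetical names, and the human validation in step 4 is not modeled at all.

```python
import random
from dataclasses import dataclass

# Toy stand-in for ConceptNet: (source, relation, target) triples.
# Illustrative assumptions only, not real ConceptNet data.
TRIPLES = [
    ("museum", "AtLocation", "city center"),
    ("museum", "AtLocation", "neighborhood"),
    ("museum", "AtLocation", "rural area"),
    ("museum", "AtLocation", "shopping mall"),
    ("museum", "AtLocation", "university campus"),
    ("museum", "UsedFor", "learning"),
]


@dataclass
class Question:
    stem: str
    choices: list[str]
    answer: str


def build_question(source, relation, answer, stem, n_choices=5, seed=0):
    """Assemble one multiple-choice item from a seed triple (hypothetical helper)."""
    # Distractor candidates: other targets reached from the same source
    # through the same relation, so they are related but not the answer.
    candidates = [
        target
        for s, r, target in TRIPLES
        if s == source and r == relation and target != answer
    ]
    if len(candidates) < n_choices - 1:
        raise ValueError("not enough distractor candidates for this triple")

    rng = random.Random(seed)
    choices = rng.sample(candidates, n_choices - 1) + [answer]
    rng.shuffle(choices)  # avoid always placing the answer last
    return Question(stem=stem, choices=choices, answer=answer)


question = build_question(
    "museum", "AtLocation", "city center",
    "Where is a museum typically located?",
)
```

Sampling distractors from targets that share the source and relation is what keeps them plausible; a `UsedFor` target such as "learning" is excluded because it answers a different kind of question.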

Example Questions

Physical causation: "What happens when you flip a switch connected to a lamp?"
A. The lamp gets hot
B. The lamp turns on ✓
C. The switch breaks
D. Nothing happens
E. The room floods

Spatial reasoning: "Where would you go to buy fresh vegetables?"
A. Hardware store
B. Post office
C. Farmers market ✓
D. Car dealership
E. Police station

Social reasoning: "If someone is feeling cold, what might they ask for?"
A. More criticism
B. A blanket ✓
C. A math problem
D. A loud noise
E. Extra sunlight

Temporal reasoning: "What would happen to ice cream left outside on a hot day?"
A. It freezes solid
B. It becomes larger
C. It melts ✓
D. It turns blue
E. It becomes louder
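
To pose questions like these to a language model, each item is usually rendered as a lettered prompt and the model's reply is mapped back to a choice label. The sketch below shows one way to do that. The prompt layout and the helper names `format_prompt` and `parse_answer` are assumptions made for illustration, not a format prescribed by the benchmark, and the parser is deliberately naive.

```python
import re
from string import ascii_uppercase


def format_prompt(stem, choices):
    """Render a question and its choices as a lettered prompt (assumed layout)."""
    lines = [f"Question: {stem}"]
    for label, choice in zip(ascii_uppercase, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def parse_answer(reply, n_choices=5):
    """Return the first standalone capital choice letter in a reply, or None."""
    valid = ascii_uppercase[:n_choices]
    # \b keeps letters inside words (the "E" in "THE") from matching.
    match = re.search(rf"\b([{valid[0]}-{valid[-1]}])\b", reply)
    return match.group(1) if match else None


prompt = format_prompt(
    "Where would you go to buy fresh vegetables?",
    ["Hardware store", "Post office", "Farmers market",
     "Car dealership", "Police station"],
)
predicted = parse_answer("The answer is C.")
```

A reply that contains no standalone capital letter in the valid range yields `None`, which an evaluation script would normally count as incorrect.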

Model Performance Landscape

| System | Accuracy (Test) | Notes |
|---|---|---|
| Random baseline | 20% | 5-choice random |
| Human performance | ~89% | Crowd worker consensus |
| BERT-Large (2019) | 55.9% | First transformer results |
| RoBERTa-Large (2020) | 72.1% | Contextual pretraining improves |
| UnifiedQA (T5) (2020) | 78.0% | Multi-task QA model |
| GPT-3 (few-shot) (2021) | 73.0% | In-context learning |
| ChatGPT (GPT-3.5) (2023) | ~85% | RLHF-tuned improves commonsense |
| GPT-4 (2023) | ~90-95% | Near/at human level |
| Claude 3 Opus (2024) | ~95%+ | Exceeds human baseline |
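
The accuracy figures above are plain fractions of questions answered correctly, and the 20% random baseline follows from guessing uniformly among five choices (1/5). A minimal sketch of both calculations is shown here; `accuracy` and `random_baseline` are hypothetical helper names, and the gold labels are synthetic placeholders rather than real benchmark data.

```python
import random


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def random_baseline(gold, labels="ABCDE", seed=0):
    """Score a guesser that picks one of the labels uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.choice(labels) for _ in gold]
    return accuracy(guesses, gold)


# Four of five predictions match the gold labels.
score = accuracy(list("ABCDE"), list("ABCDA"))

# Synthetic gold labels, used only to show the baseline settles near 1/5.
synthetic_gold = ["A"] * 10_000
chance = random_baseline(synthetic_gold)
```

With enough questions the simulated guesser lands close to 0.2, which is why any reported score should be read against that floor rather than against zero.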

Modern frontier LLMs (GPT-4, Claude 3, Gemini Ultra) have essentially saturated CommonsenseQA, so it is now treated as a largely solved benchmark. The broader challenge of commonsense reasoning is far from solved, however: harder benchmarks such as HellaSwag, WinoGrande, and the more adversarial ANLI continue to probe commonsense failures.

Why CommonsenseQA Matters for AI Evaluation

Probing genuine understanding: Unlike reading comprehension datasets (SQuAD, TriviaQA), where answers appear verbatim in the provided text, CommonsenseQA requires knowledge stored in model weights rather than supplied in context. This tests whether a model has internalized world knowledge, not just learned to extract spans.

Benchmark diagnostic: Comparing a model's CommonsenseQA score against its reading comprehension and reasoning scores helps separate the knowledge component of its capability from the extraction and reasoning components.

Safety implications: Commonsense deficits can correlate with dangerous model behaviors.

Benchmark Suite Context

CommonsenseQA is typically evaluated alongside:

| Benchmark | Tests | Difficulty |
|---|---|---|
| CommonsenseQA | Everyday factual commonsense | Medium (saturated by GPT-4) |
| HellaSwag | Sentence completion requiring world model | Medium-Hard |
| WinoGrande | Pronoun resolution requiring commonsense | Hard |
| PIQA | Physical intuition QA | Medium |
| Social IQa (SIQA) | Social interaction reasoning | Medium |
| AlpacaEval/MT-Bench | Multi-turn instruction following | Holistic |

Limitations of CommonsenseQA

CommonsenseQA remains a historical milestone that demonstrated the gap between statistical language patterns and genuine world understanding. It spurred a generation of research into knowledge-grounded AI, neural-symbolic integration, and eventually the large-scale pre-training that allowed LLMs to internalize commonsense knowledge implicitly.
