CommonsenseQA is a multiple-choice question answering benchmark that tests an AI system's ability to apply implicit background knowledge about the world. This is the kind of everyday reasoning that humans perform effortlessly but that standard NLP models find challenging, because the answers cannot be retrieved from a provided text and instead require knowing how the physical world works, which social norms apply, and how people typically behave. Constructed by Talmor et al. (2019) from crowd-sourced questions built around relations in the ConceptNet knowledge graph, CommonsenseQA became one of the most widely used benchmarks for measuring progress toward AI systems with human-like general understanding.
What Commonsense Reasoning Means
Commonsense reasoning is the ability to apply obvious, unstated knowledge that any human implicitly possesses:
- "If you push something, it moves away from you" (physical causation)
- "A library requires quiet because people are reading" (situational awareness)
- "Putting a key in a lock takes a second; losing a key takes days to resolve" (time and consequence reasoning)
- "People buy sunscreen at the beach more than at a ski resort" (contextual appropriateness)
None of this is typically written down explicitly. There is no Wikipedia article saying "quiet is important in libraries." Language models trained purely on text can learn statistical associations, but true commonsense goes deeper — it requires causal, spatial, temporal, and social reasoning grounded in world experience.
Dataset Construction Methodology
CommonsenseQA's construction pipeline using ConceptNet:
1. Start with a ConceptNet relation: e.g., (museum, AtLocation, city center)
2. Generate a seed question requiring knowledge of this relation: "Where is a museum typically located?"
3. Find answer candidates: Use ConceptNet graph traversal to find semantically related but conceptually different nodes (other AtLocation targets like "neighborhood," "rural area," "shopping mall")
4. Human validation: Crowd workers verify that only one answer is clearly correct and distractors are plausible but wrong
5. Result: 12,247 multiple-choice questions, 5 choices each, with train/validation/test splits
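The first three steps of the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: `TOY_GRAPH` is a hand-written stand-in for ConceptNet (the "airport" edge is added only to reach five choices), `build_item` is a hypothetical helper, and the human validation in step 4 is not modeled.

```python
import random

# Toy stand-in for ConceptNet: (source, relation) -> target concepts.
# These edges are illustrative assumptions, not real ConceptNet data.
TOY_GRAPH = {
    ("museum", "AtLocation"): [
        "city center", "neighborhood", "rural area", "shopping mall", "airport",
    ],
}

def build_item(source, relation, answer, question, num_choices=5, seed=0):
    """Assemble one multiple-choice item from a seed relation."""
    targets = TOY_GRAPH[(source, relation)]
    if answer not in targets:
        raise ValueError("answer must be a target of the seed relation")
    # Distractors are other targets of the same relation (step 3).
    distractors = [t for t in targets if t != answer][: num_choices - 1]
    choices = distractors + [answer]
    random.Random(seed).shuffle(choices)  # avoid a fixed answer position
    labels = "ABCDE"[: len(choices)]
    return {
        "question": question,
        "choices": dict(zip(labels, choices)),
        "answer_key": labels[choices.index(answer)],
    }

item = build_item(
    "museum", "AtLocation", "city center",
    "Where is a museum typically located?",
)
```

In the real dataset, crowd workers decide whether the distractors are plausible but wrong, which is something a graph lookup alone cannot guarantee.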
Example Questions
Physical causation: "What happens when you flip a switch connected to a lamp?"
- A. The lamp gets hot
- B. The lamp turns on ✓
- C. The switch breaks
- D. Nothing happens
- E. The room floods
Spatial reasoning: "Where would you go to buy fresh vegetables?"
- A. Hardware store
- B. Post office
- C. Farmers market ✓
- D. Car dealership
- E. Police station
Social reasoning: "If someone is feeling cold, what might they ask for?"
- A. More criticism
- B. A blanket ✓
- C. A math problem
- D. A loud noise
- E. Extra sunlight
Temporal reasoning: "What would happen to ice cream left outside on a hot day?"
- A. It freezes solid
- B. It becomes larger
- C. It melts ✓
- D. It turns blue
- E. It becomes louder
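To evaluate a language model on items like these, each question is usually rendered as a text prompt and the model's reply is mapped back to a choice letter. The sketch below shows one possible way to do that; `format_prompt` and `parse_answer` are hypothetical helpers, and the prompt layout is an assumption rather than a format prescribed by the benchmark.

```python
import re

def format_prompt(question, choices):
    """Render a question and its labeled choices as a plain-text prompt."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in choices.items()]
    lines.append("Answer:")
    return "\n".join(lines)

def parse_answer(model_output, valid_labels="ABCDE"):
    """Return the leading choice letter of a reply, or None if there is none.

    The letter must be followed by '.', ')', ':' or the end of the reply,
    so a reply such as "A blanket" is not misread as choice A.
    """
    pattern = rf"\s*\(?([{valid_labels}])(?:[.):]|\s*$)"
    match = re.match(pattern, model_output)
    return match.group(1) if match else None

prompt = format_prompt(
    "What would happen to ice cream left outside on a hot day?",
    {
        "A": "It freezes solid",
        "B": "It becomes larger",
        "C": "It melts",
        "D": "It turns blue",
        "E": "It becomes louder",
    },
)
```

A reply that cannot be parsed is typically scored as incorrect, so the strictness of the parser directly affects reported accuracy.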
Model Performance Landscape
| System | Accuracy (Test) | Notes |
|---|---|---|
| Random baseline | 20% | 5-choice random |
| Human performance | ~89% | Crowd worker consensus |
| BERT-Large (2019) | 55.9% | First transformer results |
| RoBERTa-Large (2020) | 72.1% | Contextual pretraining improves |
| UnifiedQA (T5) (2020) | 78.0% | Multi-task QA model |
| GPT-3 (few-shot) (2021) | 73.0% | In-context learning |
| ChatGPT (GPT-3.5) (2023) | ~85% | RLHF-tuned improves commonsense |
| GPT-4 (2023) | ~90-95% | Near/at human level |
| Claude 3 Opus (2024) | ~95%+ | Exceeds human baseline |
Modern frontier LLMs (GPT-4, Claude 3, Gemini Ultra) score at or above the human baseline on CommonsenseQA, so the benchmark is effectively saturated. Commonsense reasoning itself is far from solved, however: benchmarks such as HellaSwag and WinoGrande, along with the adversarially constructed ANLI, continue to expose reasoning failures.
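The accuracy figures in the table are plain fractions of correctly answered questions, and the 20% baseline follows from guessing uniformly among five choices. The short sketch below makes both explicit and adds a chance-corrected score as one optional way to compare results; the function names and the toy predictions are illustrative assumptions.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answer keys."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def chance_accuracy(num_choices):
    """Expected accuracy of uniform random guessing."""
    return 1 / num_choices

def above_chance(acc, num_choices=5):
    """Rescale accuracy so that chance maps to 0 and a perfect score to 1."""
    chance = chance_accuracy(num_choices)
    return (acc - chance) / (1 - chance)

# Toy example: four questions, three answered correctly.
gold = ["B", "C", "B", "C"]
preds = ["B", "C", "A", "C"]
acc = accuracy(preds, gold)
```

On a 5-choice task the rescaling matters most near the bottom of the table: a score of 55.9% is well under halfway between chance and perfect.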
Why CommonsenseQA Matters for AI Evaluation
Probing genuine understanding: Unlike reading comprehension datasets (SQuAD, TriviaQA) where answers appear verbatim in provided text, CommonsenseQA requires knowledge stored in model weights — not provided in context. This tests whether a model has internalized world knowledge, not just learned to extract spans.
Benchmark diagnostic: Comparing a model's CommonsenseQA score against its reading comprehension and reasoning scores reveals the knowledge component versus the extraction/reasoning component of model capability.
Safety implications: Commonsense deficits can contribute to unsafe model behavior:
- Anticipating the likely side effects of an instructed action requires physical commonsense
- Interpreting a request in its social context requires social commonsense
- Early AI safety research used commonsense failures to demonstrate model brittleness
Benchmark Suite Context
CommonsenseQA is typically evaluated alongside:
| Benchmark | Tests | Difficulty |
|---|---|---|
| CommonsenseQA | Everyday factual commonsense | Medium (saturated by GPT-4) |
| HellaSwag | Sentence completion requiring world model | Medium-Hard |
| WinoGrande | Pronoun resolution requiring commonsense | Hard |
| PIQA | Physical intuition QA | Medium |
| Social IQa (SIQA) | Social interaction reasoning | Medium |
| AlpacaEval/MT-Bench | Multi-turn instruction following | Holistic |
Limitations of CommonsenseQA
- ConceptNet bias: Questions reflect the structural biases of ConceptNet, which overrepresents Western cultural contexts
- Multiple-choice format: Models can use answer option patterns and elimination strategies that don't require genuine understanding
- Saturation: State-of-the-art models score above human baselines — new benchmarks are needed for continued progress measurement
- English-only: Commonsense varies significantly across cultures and languages; CommonsenseQA does not capture this diversity
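The multiple-choice limitation can be probed with simple sanity checks before any model is involved. As one hedged example, the sketch below measures how evenly the correct answers are spread across the option letters and how well a "always pick the most common letter" strategy would score. The helper names and the toy answer keys are assumptions for illustration and say nothing about the actual label distribution in CommonsenseQA.

```python
from collections import Counter

def label_distribution(answer_keys):
    """Share of questions whose correct answer falls on each option letter."""
    counts = Counter(answer_keys)
    total = len(answer_keys)
    return {label: counts[label] / total for label in sorted(counts)}

def majority_label_baseline(answer_keys):
    """Accuracy of always guessing the most frequent gold label."""
    label, count = Counter(answer_keys).most_common(1)[0]
    return label, count / len(answer_keys)

# Toy gold labels, invented for this example.
keys = ["B", "C", "B", "C", "B"]
distribution = label_distribution(keys)
best_label, baseline_acc = majority_label_baseline(keys)
```

If such a baseline scores noticeably above 20% on a 5-choice set, part of a model's accuracy may reflect answer-position patterns rather than understanding.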
CommonsenseQA remains a historical milestone that demonstrated the gap between statistical language patterns and genuine world understanding. It spurred a generation of research into knowledge-grounded AI and neural-symbolic integration, and it served as a yardstick for the large-scale pre-training that eventually allowed LLMs to internalize commonsense knowledge implicitly.