PIQA (Physical Interaction: Question Answering)

Keywords: piqa, evaluation

PIQA (Physical Interaction: Question Answering) is a benchmark dataset that evaluates physical commonsense reasoning. It tests whether AI models understand how physical objects interact, what materials things are made of, how tools are used, and what happens when physical processes are applied. In doing so, it probes the implicit physical world model that humans acquire through embodied experience but that AI systems must learn from text alone.

The Physical Intuition Gap

Language models are trained on text — descriptions of the world written by humans. But human understanding of physics is embodied: we know that wet surfaces are slippery because we have slipped; we know that eggs are fragile because we have broken them; we know that magnets attract because we have played with them. This physical intuition, acquired through direct sensorimotor experience, is only partially encoded in text descriptions.

PIQA tests whether pre-training on text alone is sufficient to acquire this physical world model, and to what extent. The benchmark reveals systematic gaps between the physical knowledge implied by text and the physical knowledge humans take for granted.

Task Format

PIQA uses a binary-choice format specifically to avoid the complexity of open-ended generation evaluation:

Goal: "To sort laundry before washing it, you should..."
Solution 1: "Separate the clothes by color and fabric type." (Correct)
Solution 2: "Mix all clothes together in the machine." (Incorrect)

Goal: "To cool soup quickly..."
Solution 1: "Pour it into a shallow wide bowl and stir occasionally." (Correct)
Solution 2: "Pour it into a deep narrow container and cover it." (Incorrect)

Goal: "To remove a stripped screw..."
Solution 1: "Use a rubber band between the screwdriver and screw head for extra grip." (Correct)
Solution 2: "Apply more force with the same screwdriver." (Incorrect)

Each question presents a practical goal and two solutions. One solution applies correct physical reasoning; the other violates physical principles or uses physically ineffective methods. Annotation is crowdsourced with quality validation.
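The standard way to evaluate a causal language model in this two-choice setting is to score each candidate solution by its length-normalized log-likelihood conditioned on the goal and select the higher-scoring one. The sketch below illustrates the idea with a small GPT-2 model from Hugging Face `transformers`; the model choice, prompt template, and normalization are illustrative assumptions rather than any official evaluation protocol.

```python
# Minimal sketch: zero-shot PIQA scoring by comparing length-normalized
# log-likelihoods of the two candidate solutions, conditioned on the goal.
# Model, prompt template, and normalization are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def solution_logprob(goal: str, solution: str) -> float:
    """Average log-probability of the solution tokens, given the goal as context."""
    prompt_ids = tokenizer.encode(f"Goal: {goal}\nAnswer:", return_tensors="pt")
    solution_ids = tokenizer.encode(" " + solution, return_tensors="pt")
    input_ids = torch.cat([prompt_ids, solution_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                    # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    targets = input_ids[0, prompt_ids.shape[1]:]            # the solution tokens
    start = prompt_ids.shape[1] - 1                         # first position that predicts a solution token
    token_lp = log_probs[start:start + len(targets)].gather(1, targets.unsqueeze(1))
    return token_lp.mean().item()                           # length-normalized score

def predict(goal: str, sol1: str, sol2: str) -> int:
    """Return 0 if sol1 scores higher, else 1."""
    return 0 if solution_logprob(goal, sol1) >= solution_logprob(goal, sol2) else 1

print(predict("To cool soup quickly,",
              "pour it into a shallow wide bowl and stir occasionally.",
              "pour it into a deep narrow container and cover it."))
```

Averaging rather than summing token log-probabilities keeps the score from systematically favoring shorter solutions; since both options are typically fluent, the comparison hinges on how plausible each solution is given the goal rather than on surface perplexity.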

Dataset Statistics and Construction

- Training set: 16,113 examples.
- Development set: 1,838 examples.
- Test set: 3,084 examples (labels withheld for leaderboard evaluation).
- Human performance: ~95% accuracy.
- Majority baseline: ~53% (slightly above 50% due to class imbalance).
- Construction: Workers were asked to think of everyday physical tasks and write one correct and one plausible-but-incorrect solution procedure.
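For reference, the splits are distributed through the Hugging Face `datasets` hub. The minimal sketch below assumes the dataset id `piqa` and the fields `goal`, `sol1`, `sol2`, and `label` used in the public release; test labels are withheld, as noted above.

```python
# Minimal sketch: inspecting PIQA splits and label balance with Hugging Face `datasets`.
# The dataset id and field names are assumptions based on the public release.
from collections import Counter
from datasets import load_dataset

piqa = load_dataset("piqa")          # splits: train / validation / test
for split_name, split in piqa.items():
    print(split_name, len(split))

example = piqa["train"][0]
print(example["goal"])
print(example["sol1"])
print(example["sol2"])
print(example["label"])              # 0 = sol1 is correct, 1 = sol2 is correct

# Label distribution on the training split (the test split has no public labels).
print(Counter(piqa["train"]["label"]))
```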

Why PIQA Is Challenging for Language Models

Embodiment Gap: Models have never touched, lifted, heated, or cooled anything. Physical intuition from text is indirect — descriptions of physical processes rather than direct sensorimotor feedback.

Implicit Physics: Correct physical reasoning often relies on principles never explicitly stated in training data. That a rubber band increases friction with a screw head is not a fact typically written in text; it follows from implicit understanding of friction, materials, and grip mechanics.

Independence from Language Fluency: Both solutions in each PIQA question are linguistically fluent and grammatically correct, so language model perplexity alone cannot discriminate between them. The task requires semantic understanding of physical processes rather than surface linguistic quality.

Long-Tail Physical Knowledge: Many PIQA scenarios involve specialized knowledge (tool use, cooking techniques, household repairs) that appears infrequently in text corpora and may be systematically underrepresented in pre-training data.

Performance Benchmarks

| Model | PIQA Accuracy |
|-------|--------------|
| BERT-large | 70.2% |
| RoBERTa-large | 77.1% |
| GPT-3 (175B) | 82.8% |
| UnifiedQA-3B | 84.7% |
| Human performance | 94.9% |

For the benchmark's first few years, the persistent gap of ten or more points between the best models and human performance highlighted the depth of the physical reasoning deficit. More recent LLMs (GPT-4, Claude 3) have closed much of this gap, but the errors that remain still point to unresolved challenges in physical world modeling.

Relationship to Other Commonsense Benchmarks

PIQA occupies a distinct niche in the commonsense benchmarking landscape:

| Benchmark | Knowledge Type |
|-----------|---------------|
| PIQA | Physical interactions, materials, tools |
| HellaSwag | Activity continuations, temporal sequences |
| Winogrande | Pronoun resolution with commonsense inference |
| CommonsenseQA | General commonsense (social, physical, causal) |
| Social IQa | Social commonsense, interpersonal reasoning |
| ATOMIC | Causal commonsense about events and states |

PIQA's focus on specifically physical knowledge (as opposed to social, temporal, or causal) makes it a targeted probe for the embodiment gap in language models.

Applications Beyond Benchmarking

Physical commonsense reasoning is essential for:
- Robotics: Planning manipulation tasks requires knowing that objects are rigid, fragile, or deformable; that surfaces have friction; that gravity acts consistently.
- AI Assistants: Answering "How do I fix this?" questions requires physical reasoning about materials and mechanisms.
- Code Generation for Physical Simulations: Writing physically correct simulation code requires understanding physical principles.
- Safety Systems: Recognizing physically dangerous instructions or plans requires a model of physical cause and effect.

PIQA is the benchmark that measures the embodiment gap — quantifying how much physical world knowledge language models acquire from text alone, and revealing the systematic deficit between linguistic fluency and genuine physical understanding that remains one of the core challenges in AI.
