MAWPS (Math Word Problem Repository) is the unified testbed for evaluating arithmetic word problem solvers. It aggregates multiple elementary math datasets (AddSub, MultiArith, SingleOp, SingleEq) into a standardized repository that enabled systematic comparison of semantic parsing, neural seq2seq, and symbolic AI approaches to math reasoning.
What Is MAWPS?
- Scale: ~3,320 elementary school math word problems across multiple sub-datasets.
- Operations: Single- and multi-step arithmetic (addition, subtraction, multiplication, division).
- Difficulty: Grade school level (ages 6-12); no algebraic variables, no competition-level insight required.
- Format: Natural language problem statement → numeric answer.
- Sub-datasets Included:
  - AddSub: Single-step addition and subtraction (395 problems).
  - MultiArith: Multi-step problems requiring multiple operations (600 problems).
  - SingleOp: One-operation problems from diverse sources (562 problems).
  - SingleEq: Single-equation problems with one unknown (508 problems).
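Because each problem pairs a natural language statement with a numeric answer, scoring usually reduces to a numeric comparison. The following is a minimal sketch of that idea in Python. The `WordProblem` record, its field names, and the `is_correct` and `accuracy` helpers are hypothetical and are not the schema or tooling of the MAWPS release; the example problem is invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WordProblem:
    """Hypothetical record type; field names are illustrative only."""
    text: str      # natural language problem statement
    answer: float  # gold numeric answer
    source: str    # sub-dataset name, e.g. "AddSub" or "MultiArith"


def is_correct(predicted: float, problem: WordProblem, tol: float = 1e-4) -> bool:
    """Compare a prediction with the gold answer within an absolute tolerance."""
    return math.isclose(predicted, problem.answer, rel_tol=0.0, abs_tol=tol)


def accuracy(predictions: list[float], problems: list[WordProblem]) -> float:
    """Fraction of problems whose prediction matches the gold answer."""
    if len(predictions) != len(problems):
        raise ValueError("predictions and problems must have the same length")
    if not problems:
        return 0.0
    hits = sum(is_correct(p, q) for p, q in zip(predictions, problems))
    return hits / len(problems)


# Invented example in the style of a single-step addition problem.
example = WordProblem(
    text="Sam had 9 marbles and found 4 more. How many marbles does Sam have now?",
    answer=13.0,
    source="AddSub",
)
```

An absolute tolerance is used here so that division results with rounding noise still count as correct; the threshold itself is an assumption.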
The Semantic Parsing Tradition
MAWPS was created in an era when the dominant approach to math word problems was semantic parsing, that is, converting text into formal representations:
- Template Mapping: "John has X apples and gives Y to Mary. How many does John have?" → X - Y = ?
- Equation Trees: Represent the solution as a tree of arithmetic operations.
- Parse + Execute: Translate text to equation, then evaluate the equation.
The repository unified these approaches by providing standardized train/test splits across all sub-datasets, enabling direct comparison.
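The three ideas listed above (template mapping, equation trees, parse then execute) can be sketched together in a few lines of Python. This is an illustrative toy, not a reconstruction of any published solver: the `Node` class, the `evaluate` and `parse` functions, and the single regex template are all hypothetical.

```python
import operator
import re
from dataclasses import dataclass
from typing import Optional, Union

# Map operator symbols to the functions that execute them.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


@dataclass
class Node:
    """Internal node of an equation tree: an operator with two children."""
    op: str
    left: Union["Node", float]
    right: Union["Node", float]


Tree = Union[Node, float]


def evaluate(tree: Tree) -> float:
    """Execute step: recursively evaluate an equation tree."""
    if isinstance(tree, Node):
        return OPS[tree.op](evaluate(tree.left), evaluate(tree.right))
    return float(tree)  # leaf: a number taken from the problem text


# One hand-written template for "has X <things> and gives Y to ..." -> X - Y.
GIVE_TEMPLATE = re.compile(r"has (\d+(?:\.\d+)?) \w+ and gives (\d+(?:\.\d+)?) to")


def parse(text: str) -> Optional[Tree]:
    """Parse step: map text to an equation tree, or None if no template fits."""
    match = GIVE_TEMPLATE.search(text)
    if match is None:
        return None
    x, y = (float(group) for group in match.groups())
    return Node("-", x, y)


tree = parse("John has 7 apples and gives 3 to Mary. How many does John have?")
result = evaluate(tree) if tree is not None else None
```

Separating `parse` from `evaluate` mirrors the parse-then-execute split: a solver can be wrong in the mapping step while the arithmetic itself stays exact.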
Why MAWPS Was Strategically Important
- Baseline Establishment: Before MAWPS, each paper used different datasets with incompatible splits. MAWPS created a common ground for comparison.
- Saturation Demonstration: By 2020-2022, neural models (fine-tuned BERT, GPT-3) reached roughly 93-94% accuracy on MAWPS (see the table below), suggesting that elementary arithmetic word problems are close to "solved" for LLMs.
- Stepping Stone: MAWPS → GSM8K → MATH represents a progression: MAWPS confirmed arithmetic capability, motivating harder benchmarks.
- Neural vs. Symbolic: MAWPS was a key arena for comparing end-to-end neural approaches (seq2seq) against symbolic semantic parsers; the neural approaches won by a significant margin on simple problems.
Performance by Model Generation
| Model | MAWPS Accuracy |
|-------|---------------|
| SVM expression classifier (2015) | ~73% |
| Seq2Tree LSTM (2016) | ~88% |
| BERT fine-tuned (2020) | ~93% |
| GPT-3 few-shot (2022) | ~94% |
| GPT-4 (2023) | ~98%+ |
MAWPS in the Current Context
As a near-solved benchmark, MAWPS serves specific purposes:
- Regression Testing: Verify that new models do not lose basic arithmetic capability.
- Cross-lingual Transfer: Translate MAWPS into other languages to measure arithmetic transfer without algebraic complexity.
- Few-Shot Lower Bound: Measure how few examples a model needs to correctly solve grade-school arithmetic, which tests sample efficiency.
- Error Analysis: The remaining ~2-5% errors reveal systematic failure modes (negative numbers, implicit unit conversions, ambiguous plurals).
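The regression-testing and few-shot uses above might be wired up roughly as follows. This is a sketch under stated assumptions: `build_prompt`, `regression_check`, the `Q:`/`A:` prompt layout, the default `max_drop` threshold, and the `solve` callable are all hypothetical stand-ins for whatever model interface and evaluation policy are actually in use.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, float]  # (problem text, gold numeric answer)


def build_prompt(shots: Sequence[Example], question: str) -> str:
    """Assemble a k-shot prompt; k is simply len(shots)."""
    parts = [f"Q: {text}\nA: {answer:g}" for text, answer in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def regression_check(
    solve: Callable[[str], float],
    problems: Sequence[Example],
    baseline_accuracy: float,
    max_drop: float = 0.01,
    tol: float = 1e-4,
) -> bool:
    """Return True if accuracy has not fallen more than max_drop below baseline."""
    if not problems:
        raise ValueError("need at least one problem to evaluate")
    correct = sum(abs(solve(text) - answer) <= tol for text, answer in problems)
    accuracy = correct / len(problems)
    return accuracy >= baseline_accuracy - max_drop
```

Varying the number of entries passed as `shots` and re-running the same check gives a simple way to chart accuracy against shot count.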
Common Failure Patterns
- Implicit Units: "John bought 3 dozen eggs." Models sometimes fail to multiply by 12.
- Comparison to Reference: "Mary has 5 more apples than John, who has 8." Requires tracking two quantities.
- Multi-step Chaining: 4+ operation problems in MultiArith expose breakdown in intermediate result tracking.
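The first two patterns each hinge on one step that the text never states outright. A small worked sketch in Python makes that step explicit. The follow-up questions (total eggs, Mary's apples, combined apples) are assumed here, since the example sentences above stop before asking anything.

```python
DOZEN = 12  # the conversion factor the problem text leaves implicit

# Implicit units: "John bought 3 dozen eggs." (assumed question: how many eggs?)
eggs = 3 * DOZEN

# Comparison to reference: "Mary has 5 more apples than John, who has 8."
john = 8
mary = john + 5      # Mary's count is defined relative to John's
total = john + mary  # assumed question: how many apples do they have together?
```

Writing each intermediate quantity as a named value is also the habit that multi-step chaining problems reward, since every later step depends on an earlier result being carried forward correctly.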
Relationship to Other Benchmarks
| Benchmark | Difficulty | Focus |
|-----------|-----------|-------|
| MAWPS | Elementary | Arithmetic |
| GSM8K | Middle school | Multi-step arithmetic |
| SVAMP | Elementary + adversarial | Robustness |
| MATH | Competition level | Creative reasoning |
| AQuA-RAT | GRE/GMAT | Algebraic reasoning |
MAWPS is the elementary math class benchmark: historically essential for establishing arithmetic NLP baselines, now primarily serving as a sanity check confirming that modern LLMs have thoroughly mastered grade-school arithmetic word problems.