MAWPS (Math Word Problem Repository)

Keywords: mawps, evaluation

MAWPS (Math Word Problem Repository) is a unified testbed for evaluating arithmetic word problem solvers. It aggregates several elementary math datasets (AddSub, MultiArith, SingleOp, SingleEq) into a standardized repository that enabled systematic comparison of semantic parsing, neural seq2seq, and symbolic AI approaches to math reasoning.

What Is MAWPS?

- Scale: ~3,320 elementary school math word problems across multiple sub-datasets.
- Operations: Single and multi-step arithmetic (addition, subtraction, multiplication, division).
- Difficulty: Grade school level (ages 6-12); no algebraic variables, no competition-level insight required.
- Format: Natural language problem statement → numeric answer.
- Sub-datasets include (the four listed here total 2,065 problems, so the list is not exhaustive):
  - AddSub: Single-step addition and subtraction (395 problems).
  - MultiArith: Multi-step problems requiring multiple operations (600 problems).
  - SingleOp: One-operation problems from diverse sources (562 problems).
  - SingleEq: Single-equation problems with one unknown (508 problems).
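
To make the "problem statement → numeric answer" format concrete, here is a minimal sketch of how one record and its answer check might look. The class name, field names, and the example problem are invented for illustration; they are not the repository's actual schema or data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WordProblem:
    # Hypothetical field names, chosen for this sketch only.
    question: str  # natural language problem statement
    answer: float  # gold numeric answer
    source: str    # sub-dataset label, e.g. "AddSub"


def is_correct(predicted: float, gold: float, tol: float = 1e-4) -> bool:
    """Accept a prediction if it is within an absolute tolerance of the gold answer."""
    return math.isclose(predicted, gold, rel_tol=0.0, abs_tol=tol)


# Invented example in the style of a single-step addition problem.
example = WordProblem(
    question="Sam had 9 marbles and found 4 more. How many marbles does Sam have now?",
    answer=13.0,
    source="AddSub",
)

print(is_correct(13.0, example.answer))  # exact match
print(is_correct(12.0, example.answer))  # wrong answer
```

Comparing with a small tolerance rather than strict equality is one reasonable choice when answers come from division; the tolerance value here is an assumption.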

The Semantic Parsing Tradition

MAWPS was created in an era when the dominant approach to math word problems was semantic parsing, that is, converting text into formal representations:

- Template Mapping: "John has X apples and gives Y to Mary. How many does John have?" → X - Y = ?
- Equation Trees: Represent the solution as a tree of arithmetic operations.
- Parse + Execute: Translate text to equation, then evaluate the equation.
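
The three ideas above can be sketched together in a few lines. This is a toy illustration, not any published solver: the regular-expression template, the `Num`/`Op` node classes, and the `parse`/`evaluate` helpers are all hypothetical names introduced here.

```python
import operator
import re
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Op:
    symbol: str
    left: "Expr"
    right: "Expr"


# An equation tree is either a number leaf or an operation over two subtrees.
Expr = Union[Num, Op]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def evaluate(node: Expr) -> float:
    """Execute step: recursively evaluate an equation tree."""
    if isinstance(node, Num):
        return node.value
    return OPS[node.symbol](evaluate(node.left), evaluate(node.right))


# Hypothetical template for "<name> has X <things> and gives Y to <name>" -> X - Y.
GIVE_TEMPLATE = re.compile(r"has (\d+) \w+ and gives (\d+) to")


def parse(text: str) -> Optional[Expr]:
    """Parse step: map text to an equation tree, or None if no template matches."""
    match = GIVE_TEMPLATE.search(text)
    if match is None:
        return None
    x, y = (float(g) for g in match.groups())
    return Op("-", Num(x), Num(y))


tree = parse("John has 8 apples and gives 3 to Mary. How many does John have?")
print(tree)
print(evaluate(tree))
```

A single hand-written template only covers one phrasing; real template-based systems learned or enumerated many such mappings, which is part of why coverage was a limiting factor.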

The repository unified these approaches by providing standardized train/test splits across all sub-datasets, enabling direct comparison.

Why MAWPS Was Strategically Important

- Baseline Establishment: Before MAWPS, each paper used different datasets with incompatible splits. MAWPS created a common ground for comparison.
- Saturation Demonstration: By 2020-2022, neural models (fine-tuned BERT, few-shot GPT-3) reached roughly 93-94% accuracy on MAWPS, and GPT-4 later exceeded 98%, indicating that elementary arithmetic word problems are essentially "solved" for LLMs.
- Stepping Stone: MAWPS → GSM8K → MATH represents a progression: MAWPS confirmed arithmetic capability, motivating harder benchmarks.
- Neural vs. Symbolic: MAWPS was a key arena for comparing end-to-end neural approaches (seq2seq) against symbolic semantic parsers; neural won by a significant margin for simple problems.

Performance by Model Generation

| Model | MAWPS Accuracy |
|-------|---------------|
| SVM expression classifier (2015) | ~73% |
| Seq2Tree LSTM (2016) | ~88% |
| BERT fine-tuned (2020) | ~93% |
| GPT-3 few-shot (2022) | ~94% |
| GPT-4 (2023) | ~98%+ |

MAWPS in the Current Context

As a near-solved benchmark, MAWPS serves specific purposes:
- Regression Testing: Verify that new models do not lose basic arithmetic capability.
- Cross-lingual Transfer: Translate MAWPS into other languages to measure arithmetic transfer without algebraic complexity.
- Few-Shot Lower Bound: Measure how few examples a model needs to correctly solve grade-school arithmetic, which tests sample efficiency.
- Error Analysis: The remaining ~2-5% errors reveal systematic failure modes (negative numbers, implicit unit conversions, ambiguous plurals).
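
As a rough sketch of the regression-testing use, the snippet below computes accuracy per sub-dataset and flags any subset where a candidate model drops below a baseline by more than a chosen margin. The function names, the `max_drop` threshold, and the toy results are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_subset(results):
    """results: iterable of (subset_name, was_correct) pairs -> {subset: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subset, was_correct in results:
        totals[subset] += 1
        if was_correct:
            hits[subset] += 1
    return {subset: hits[subset] / totals[subset] for subset in totals}


def regressions(baseline, candidate, max_drop=0.01):
    """Return {subset: (baseline_acc, candidate_acc)} where accuracy fell by more than max_drop."""
    flagged = {}
    for subset, base_acc in baseline.items():
        cand_acc = candidate.get(subset, 0.0)  # a missing subset counts as 0 accuracy
        if base_acc - cand_acc > max_drop:
            flagged[subset] = (base_acc, cand_acc)
    return flagged


# Toy per-problem outcomes for two hypothetical model versions.
old_results = [("AddSub", True), ("AddSub", True), ("MultiArith", True), ("MultiArith", True)]
new_results = [("AddSub", True), ("AddSub", True), ("MultiArith", True), ("MultiArith", False)]

baseline = accuracy_by_subset(old_results)
candidate = accuracy_by_subset(new_results)
print(regressions(baseline, candidate))
```

Reporting per subset rather than one pooled number makes it easier to see whether a drop is concentrated in, say, multi-step problems.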

Common Failure Patterns

- Implicit Units: "John bought 3 dozen eggs." Models sometimes fail to multiply by 12.
- Comparison to Reference: "Mary has 5 more apples than John, who has 8." Requires tracking two quantities.
- Multi-step Chaining: 4+ operation problems in MultiArith expose breakdown in intermediate result tracking.
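
The arithmetic behind these patterns is simple once each quantity is made explicit, as the sketch below shows. The helper names are hypothetical, and the multi-step muffin problem is invented for illustration rather than taken from MultiArith.

```python
import operator

EGGS_PER_DOZEN = 12  # the implicit unit a solver must supply itself

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def eggs_from_dozens(dozens):
    """Implicit units: '3 dozen eggs' means 3 * 12 eggs."""
    return dozens * EGGS_PER_DOZEN


def mary_apples(john, extra):
    """Comparison to reference: Mary has `extra` more apples than John."""
    return john + extra


def run_chain(start, steps):
    """Multi-step chaining: apply (symbol, operand) steps in order, keeping every intermediate."""
    trace = []
    value = start
    for symbol, operand in steps:
        value = OPS[symbol](value, operand)
        trace.append(value)
    return trace


print(eggs_from_dozens(3))  # 3 dozen eggs
print(mary_apples(8, 5))    # John has 8, Mary has 5 more
# Invented example: 4 trays of 6 muffins, 10 sold, the rest packed in boxes of 2.
print(run_chain(4, [("*", 6), ("-", 10), ("/", 2)]))
```

Keeping the full trace of intermediate values is one way to locate the step at which a model's chain of operations first diverges from the reference solution.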

Relationship to Other Benchmarks

| Benchmark | Difficulty | Focus |
|-----------|-----------|-------|
| MAWPS | Elementary | Arithmetic |
| GSM8K | Middle school | Multi-step arithmetic |
| SVAMP | Elementary + adversarial | Robustness |
| MATH | Competition level | Creative reasoning |
| AQuA-RAT | GRE/GMAT | Algebraic reasoning |

MAWPS is the elementary-school benchmark for math word problems: historically essential for establishing arithmetic NLP baselines, it now serves primarily as a sanity check confirming that modern LLMs have largely mastered grade-school arithmetic word problems.
