MAWPS (Math Word Problem Repository)


MAWPS (Math Word Problem Repository) is a unified testbed for evaluating arithmetic word problem solvers. It aggregates several elementary math datasets (AddSub, MultiArith, SingleOp, SingleEq) into a standardized repository, which enabled systematic comparison of semantic parsing, neural seq2seq, and symbolic approaches to math reasoning.

What Is MAWPS?

Scale: ~3,320 elementary school math word problems across multiple sub-datasets.
Operations: Single and multi-step arithmetic — addition, subtraction, multiplication, division.
Difficulty: Grade school level (ages 6-12); at most one unknown per problem in the sub-datasets listed below, and no competition-level insight required.
Format: Natural language problem statement → numeric answer.
Sub-datasets Included:

AddSub: Single-step addition and subtraction (395 problems).

MultiArith: Multi-step problems requiring multiple operations (600 problems).

SingleOp: One-operation problems from diverse sources (562 problems).

SingleEq: Single-equation problems with one unknown (508 problems).
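The "problem statement → numeric answer" format can be pictured with a small record type. This is only a sketch: the field names (`text`, `answer`, `source`) are assumptions for illustration and may not match the keys used in the actual MAWPS files, and the example problem is invented rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordProblem:
    """One problem: natural-language text mapped to a numeric answer."""
    text: str
    answer: float
    source: str  # sub-dataset name (hypothetical field, for illustration)


# Sub-dataset sizes exactly as listed above.
SUBSET_SIZES = {
    "AddSub": 395,
    "MultiArith": 600,
    "SingleOp": 562,
    "SingleEq": 508,
}


def listed_total(sizes):
    """Sum the problem counts of the itemized sub-datasets."""
    return sum(sizes.values())


# Invented example in the AddSub style (not a real dataset entry).
example = WordProblem(
    text="Sam had 9 marbles and found 4 more. How many marbles does Sam have now?",
    answer=13.0,
    source="AddSub",
)

print(listed_total(SUBSET_SIZES))  # 2065
```

The four itemized sub-datasets add up to 2,065 problems, so the ~3,320 total presumably includes material that is not itemized in this list.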

The Semantic Parsing Tradition

MAWPS was created in an era when the dominant approach to math word problems was semantic parsing — converting text into formal representations:

Template Mapping: "John has X apples and gives Y to Mary. How many does John have?" → X - Y = ?
Equation Trees: Represent the solution as a tree of arithmetic operations.
Parse + Execute: Translate text to equation, then evaluate the equation.
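The three ideas above can be sketched together in a few lines. This is a toy illustration, not a reconstruction of any published solver: the `Node` class, the `template_x_minus_y` helper, and the "first two numbers in the text" heuristic are all assumptions made for the example.

```python
import operator
import re
from dataclasses import dataclass
from typing import Union

# Binary arithmetic operators allowed in an equation tree.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


@dataclass(frozen=True)
class Node:
    """Internal node of an equation tree: an operator with two children."""
    op: str
    left: "Tree"
    right: "Tree"


Tree = Union[Node, float]  # a leaf is just a number


def evaluate(tree):
    """Execute step: recursively evaluate an equation tree."""
    if isinstance(tree, Node):
        return OPS[tree.op](evaluate(tree.left), evaluate(tree.right))
    return float(tree)


def template_x_minus_y(text):
    """Parse step (toy): bind the first two numbers in the text to X - Y."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if len(numbers) < 2:
        raise ValueError("template needs two quantities")
    return Node("-", numbers[0], numbers[1])


problem = "John has 7 apples and gives 3 to Mary. How many does John have?"
tree = template_x_minus_y(problem)   # Node('-', 7.0, 3.0)
result = evaluate(tree)              # 4.0
```

Separating the parse step from the execute step means the arithmetic is always exact once the tree is right, so errors in this style of solver come from choosing the wrong template or binding the wrong quantities.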

The repository put these approaches on a common footing by providing standardized train/test splits across all sub-datasets, enabling direct comparison.

Why MAWPS Was Strategically Important

Baseline Establishment: Before MAWPS, each paper used different datasets with incompatible splits. MAWPS created a common ground for comparison.
Saturation Demonstration: By 2020-2022, neural models (fine-tuned BERT, few-shot GPT-3) reached over 90% accuracy on MAWPS, suggesting that elementary arithmetic word problems are close to "solved" for such models.
Stepping Stone: MAWPS→GSM8K→MATH represents a progression — MAWPS confirmed arithmetic capability, motivating harder benchmarks.
Neural vs. Symbolic: MAWPS was a key arena for comparing end-to-end neural approaches (seq2seq) against symbolic semantic parsers — neural won by a significant margin for simple problems.

Performance by Model Generation

Model	MAWPS Accuracy
SVM expression classifier (2015)	~73%
Seq2Tree LSTM (2016)	~88%
BERT fine-tuned (2020)	~93%
GPT-3 few-shot (2022)	~94%
GPT-4 (2023)	~98%+

MAWPS in the Current Context

As a near-solved benchmark, MAWPS serves specific purposes:

Regression Testing: Verify that new models do not lose basic arithmetic capability.
Cross-lingual Transfer: Translate MAWPS into other languages to measure arithmetic transfer without algebraic complexity.
Few-Shot Lower Bound: Measure how few examples a model needs to correctly solve grade-school arithmetic — tests sample efficiency.
Error Analysis: The remaining ~2-5% errors reveal systematic failure modes (negative numbers, implicit unit conversions, ambiguous plurals).
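For regression testing and error analysis, a per-subset accuracy breakdown is usually more informative than one aggregate number. The sketch below assumes predictions arrive as `(subset, predicted, gold)` tuples and that answers are compared with a small numeric tolerance; both choices, and the helper names, are illustrative rather than part of any official MAWPS evaluation script.

```python
import math
from collections import defaultdict


def is_correct(predicted, gold, rel_tol=1e-4, abs_tol=1e-6):
    """Treat answers as equal if they match within a small tolerance."""
    return math.isclose(predicted, gold, rel_tol=rel_tol, abs_tol=abs_tol)


def accuracy_by_subset(records):
    """records: iterable of (subset_name, predicted, gold) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subset, predicted, gold in records:
        totals[subset] += 1
        hits[subset] += is_correct(predicted, gold)
    return {name: hits[name] / totals[name] for name in totals}


def regressions(old_scores, new_scores, min_drop=0.01):
    """Names of subsets whose accuracy fell by at least min_drop."""
    return sorted(
        name
        for name in old_scores
        if name in new_scores and old_scores[name] - new_scores[name] >= min_drop
    )


# Invented predictions for illustration only.
records = [
    ("AddSub", 13.0, 13.0),
    ("AddSub", 5.0, 6.0),
    ("MultiArith", 36.0, 36.0),
]
scores = accuracy_by_subset(records)
# {'AddSub': 0.5, 'MultiArith': 1.0}
dropped = regressions({"AddSub": 0.9, "MultiArith": 1.0}, scores)
# ['AddSub']
```

Keeping the breakdown per sub-dataset makes it easier to see whether a drop comes from single-step or multi-step problems.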

Common Failure Patterns

Implicit Units: "John bought 3 dozen eggs." Models sometimes fail to multiply by 12.
Comparison to Reference: "Mary has 5 more apples than John, who has 8." Requires tracking two quantities.
Multi-step Chaining: 4+ operation problems in MultiArith expose breakdown in intermediate result tracking.
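The arithmetic behind these three patterns is simple once the hidden step is made explicit. The snippet below works through each one; the helper names and the four-step chain are made up for illustration and are not drawn from the dataset.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

EGGS_PER_DOZEN = 12  # the implicit unit conversion a solver must supply


def eggs_from_dozens(dozens):
    """Implicit units: '3 dozen eggs' means 3 * 12 eggs."""
    return dozens * EGGS_PER_DOZEN


def marys_apples(johns_apples, difference):
    """Comparison to reference: Mary has `difference` more than John."""
    return johns_apples + difference


def chain(start, steps):
    """Multi-step chaining: apply (op, value) steps, keeping every intermediate."""
    trace = [start]
    for op, value in steps:
        trace.append(OPS[op](trace[-1], value))
    return trace


eggs = eggs_from_dozens(3)   # 36
mary = marys_apples(8, 5)    # 13
trace = chain(20, [("-", 5), ("*", 3), ("+", 6), ("/", 3)])
# [20, 15, 45, 51, 17.0]
```

Recording every intermediate value, as `chain` does, is one way to locate the step at which a multi-step solution first goes wrong.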

Relationship to Other Benchmarks

Benchmark	Difficulty	Focus
MAWPS	Elementary	Arithmetic
GSM8K	Grade school to middle school	Multi-step arithmetic
SVAMP	Elementary + adversarial	Robustness
MATH	Competition level	Creative reasoning
AQuA-RAT	GRE/GMAT	Algebraic reasoning

MAWPS is the elementary-arithmetic benchmark among math word problem datasets: historically essential for establishing arithmetic NLP baselines, it now serves mainly as a sanity check that modern LLMs handle grade-school arithmetic word problems reliably.
