MATH is a competition-level mathematics benchmark of 12,500 problems drawn from AMC, AIME, and similar competitions, designed to probe whether language models can perform creative, multi-step mathematical reasoning far beyond grade-school arithmetic, using problems that challenge even gifted human students.
What Is the MATH Dataset?
- Scale: 12,500 problems — 7,500 training, 5,000 test.
- Source: Problems from AMC 8, AMC 10, AMC 12, AIME, and HMMT competitions.
- Format: Problems and full step-by-step solutions written in free-form LaTeX, with the final answer wrapped in \boxed{} (see the extraction sketch after this list).
- Subjects: Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus.
- Difficulty Levels: 1 (easiest) to 5 (hardest), where Level 5 problems require olympiad-level insight.
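To make the format concrete, the sketch below extracts the final boxed answer from a solution string. It assumes the dataset's published JSON layout (fields `problem`, `level`, `type`, `solution`); the manual brace matching matters because answers such as \frac{13}{21} nest braces inside \boxed{...}, which defeats a naive regex.

```python
def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX solution.

    Uses manual brace matching because the answer itself may contain
    nested braces, e.g. \\boxed{\\frac{13}{21}}.
    """
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    chars: list[str] = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces

# A record in the style of the dataset's JSON files:
example = {
    "problem": "What is $\\frac{1}{3} + \\frac{2}{7}$?",
    "level": "Level 1",
    "type": "Prealgebra",
    "solution": "Common denominator $21$ gives "
                "$\\frac{7}{21}+\\frac{6}{21}=\\boxed{\\frac{13}{21}}$.",
}
print(extract_boxed(example["solution"]))  # -> \frac{13}{21}
```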
Why MATH Is Fundamentally Hard
Unlike grade-school word-problem datasets (GSM8K, MAWPS), where the solution path is largely mechanical, MATH problems require:
- Insight Steps: "Notice that the expression is a perfect square" — non-obvious algebraic manipulations.
- Multiple Solution Strategies: Different approaches (substitution, induction, combinatorial argument) must be selected appropriately.
- Symbolic Precision: Output must be exactly correct LaTeX: "$\frac{3}{7}$", not "3/7" (a symbolic grading fallback is sketched after this list).
- Long Solution Chains: Competition problems routinely require 10-15 logical steps, each building on the previous.
- Elegant Tricks: AMC/AIME problems often have "trick" solutions that brute-force arithmetic misses entirely.
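Because exact string matching is brittle, graders often fall back to symbolic comparison when the strings differ. Below is a minimal sketch using sympy's LaTeX parser (an optional feature that requires the antlr4-python3-runtime package); treat it as an illustration of the idea, not the benchmark's official grader.

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def answers_match(pred: str, gold: str) -> bool:
    """Grade a predicted LaTeX answer against the reference.

    Exact string match first (cheap and unambiguous), then fall back to
    symbolic equivalence so that e.g. "3/7" and "\\frac{3}{7}" both pass.
    """
    if pred.strip() == gold.strip():
        return True
    try:
        return simplify(parse_latex(pred) - parse_latex(gold)) == 0
    except Exception:  # unparseable LaTeX: fall back to strict grading
        return False

print(answers_match(r"3/7", r"\frac{3}{7}"))        # True
print(answers_match(r"\frac{4}{7}", r"\frac{3}{7}"))  # False
```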
Performance Timeline
| Model | Year | MATH Accuracy |
|-------|------|--------------|
| GPT-3 | 2020 | ~4.5% |
| Minerva 540B | 2022 | 33.6% |
| GPT-4 | 2023 | ~52% |
| GPT-4 with CoT | 2023 | ~67% |
| o1 (reasoning model) | 2024 | ~94.8% |
| Expert human (AMC/AIME competitor) | — | ~90-95% |
The jump from GPT-4 (~52%) to o1 (~95%) suggests that extended chain-of-thought reasoning, essentially letting the model spend more compute thinking before it answers, is the key driver of breakthrough math performance.
Subject Breakdown (GPT-4 performance)
| Subject | Accuracy |
|---------|---------|
| Prealgebra | ~76% |
| Algebra | ~62% |
| Counting & Probability | ~50% |
| Number Theory | ~55% |
| Intermediate Algebra | ~42% |
| Precalculus | ~45% |
| Geometry | ~40% |
Geometry and the advanced algebra subjects remain the hardest: geometry problems encode diagrams as Asymptote code rather than images, and intermediate algebra and precalculus demand long chains of symbolic manipulation.
Why MATH Matters
- Genuine Reasoning Test: Math has unambiguous final answers, so grading involves no subjectivity: a correct solution is definitively correct.
- Failure Mode Diagnosis: Early models scored near 0% on Level 5 problems despite 50%+ on Level 1, showing that scale alone was insufficient and that hard problems demand qualitatively stronger reasoning.
- Training Data for Reasoning: MATH's 7,500 training problems with full solution chains became a key fine-tuning resource for math-capable models (Minerva, WizardMath, DeepSeekMath).
- Verifiable Generation: Math is one of the few domains where a model's final answer can be checked automatically, e.g. with a symbolic solver, enabling reinforcement learning from verified-correct solutions.
- Real-World Proxy: Mathematical reasoning ability correlates with performance on engineering, physics, and quantitative finance tasks.
Evaluation Techniques
- Majority Voting (Self-Consistency): Generate ~40 solutions at nonzero temperature and take the most common final answer; this typically improves accuracy by roughly 8-12 points (sketched after this list).
- Tool-Augmented: Allow code execution (Python sympy/numpy) — dramatically improves accuracy for algebraic manipulation.
- Process Reward Models (PRM): Train a verifier to score intermediate reasoning steps, not just final answers — enables beam search over solution paths.
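As an illustration of the first technique, here is a minimal self-consistency sketch. `sample_solution` is a hypothetical stand-in for one model call at nonzero temperature that returns the extracted final answer (for example via the boxed-answer extraction shown earlier); everything else is standard library.

```python
from collections import Counter
from collections.abc import Callable

def majority_vote(
    problem: str,
    sample_solution: Callable[[str], str | None],
    n: int = 40,
) -> str | None:
    """Self-consistency: sample n reasoning chains, return the modal answer.

    `sample_solution` runs the model once on the problem and returns its
    final answer, or None if no answer could be extracted.
    """
    answers = [sample_solution(problem) for _ in range(n)]
    tally = Counter(a for a in answers if a is not None)
    if not tally:
        return None
    return tally.most_common(1)[0][0]
```

Normalizing answers before tallying (for instance with an `answers_match`-style symbolic comparison like the one sketched earlier) keeps equivalent forms such as "3/7" and "\frac{3}{7}" from splitting the vote.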
Extensions and Variants
- MATH-500: A 500-problem subset of the test set, drawn from the held-out split used in OpenAI's "Let's Verify Step by Step" work, now a standard choice for faster evaluation.
- MATH-Odyssey: Harder 2024 extension with post-2022 competition problems (avoiding contamination).
- OlympiadBench: Extends to International Mathematical Olympiad (IMO) level problems.
MATH is the mathematical olympiad for AI — a dataset that separates models that perform arithmetic from models that genuinely reason, with a clear, verifiable correctness criterion that enables rigorous measurement of progress toward human-level mathematical problem solving.