SVAMP (Simple Variations on Arithmetic Math word Problems)

Keywords: svamp, evaluation

SVAMP (Simple Variations on Arithmetic Math word Problems) is an adversarial robustness benchmark for math word problem solvers. It was created by applying minimal, meaning-preserving perturbations to existing problems in order to expose models that rely on keyword-based shortcuts rather than a genuine understanding of mathematical problem structure.

What Is SVAMP?

- Scale: 1,000 math word problems derived from existing datasets (primarily ASDiv-A).
- Operations: Addition, subtraction, multiplication, and division — elementary school arithmetic only.
- Perturbation Types: Each problem is created by applying one of several "simple variations" to a source problem.
- Focus: Robustness testing — the required operation or answer can change across variations even when surface features remain nearly identical.

Representative Variation Types

Question Variation:
- Change "how many total?" to "how many more?" — changes the required operation from addition to subtraction.
- Change "what is the ratio?" to "how many times more?" — changes division framing.

Partition Variation:
- Restructure which entities are described in which clause.
- "John has 5 apples, Mary has 3. How many total?" → "Mary has 3 apples. John has 5 more than Mary. How many does John have?"

Irrelevant Information:
- Add a numerically distracting but irrelevant quantity to the problem.
- Forces the model to identify which numbers are actually needed.

Circular Variation:
- Present equivalent information in a different logical order.
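The variation types above are most informative when scored in pairs: a model should get both the original problem and its perturbation right. Below is a minimal sketch of such a paired evaluation. The pair format, field names, and the toy `always_add` solver are illustrative assumptions, not the official SVAMP data format.

```python
import re

def paired_accuracy(pairs, solve):
    """Score a solver on (original, variation) pairs.

    A pair counts only if BOTH versions are answered correctly --
    the consistency that SVAMP-style robustness audits care about.
    """
    both_correct = 0
    for original, variation in pairs:
        ok_orig = solve(original["question"]) == original["answer"]
        ok_var = solve(variation["question"]) == variation["answer"]
        both_correct += ok_orig and ok_var
    return both_correct / len(pairs)

# Toy pair: a question variation flips the operation from addition to subtraction.
pairs = [(
    {"question": "John has 5 apples and Mary has 3. How many in total?", "answer": 8},
    {"question": "John has 5 apples and Mary has 3. How many more does John have?", "answer": 2},
)]

# A "shortcut" solver that always adds every number it sees:
def always_add(question):
    return sum(int(n) for n in re.findall(r"\d+", question))

print(paired_accuracy(pairs, always_add))  # 0.0 -- correct on the original, fooled by the variation
```

The paired metric is deliberately stricter than plain accuracy: the shortcut solver scores 50% on the individual problems but 0% on pairs.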

Why Baseline Models Fail SVAMP

State-of-the-art models trained on standard datasets (ASDiv, MAWPS, MultiArith) showed catastrophic performance drops on SVAMP:

| Model | Standard Dataset | SVAMP |
|-------|-----------------|-------|
| GTS | 85.4% | 41.7% |
| Graph2Tree | 88.4% | 43.8% |
| NS-Solver | 89.1% | 47.1% |
| GPT-3 few-shot | ~75% | ~65% |

The gap reveals that models learned spurious correlations:
- "Gave" → Subtract: Problems containing "gave" usually involve transfer (subtraction), so models trigger subtraction on "gave" regardless of context.
- "Together/Total" → Add: Surface words signaling addition without reading the underlying mathematical relationship.
- Largest Number First: Many templates place the total or larger quantity first, causing models to learn positional rather than semantic cues.
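These shortcuts can be made concrete with a toy keyword solver. The rule set below is a hypothetical caricature of what shortcut-learning models pick up, not any published baseline; it succeeds on the template it "learned" from and fails the moment "gave" signals a gain rather than a loss.

```python
import re

def keyword_solver(question):
    """Toy solver that mimics spurious keyword correlations."""
    nums = [int(n) for n in re.findall(r"\d+", question)]
    q = question.lower()
    if "gave" in q:
        return max(nums) - min(nums)   # shortcut: "gave" triggers subtraction
    if "total" in q or "together" in q:
        return sum(nums)               # shortcut: "total"/"together" triggers addition
    return nums[0]                     # fallback: first number seen

# Works on the template the shortcut fits:
keyword_solver("Tom had 9 marbles and gave 4 away. How many are left?")
# -> 5, correct

# Fails when "gave" means the asked-about entity RECEIVED items:
keyword_solver("Mary gave Tom 4 marbles. Tom already had 9. How many does Tom have now?")
# -> 5, but the correct answer is 13
```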

Why SVAMP Matters

- Robustness Diagnosis: Reveals the difference between "learned the math" and "learned the dataset" — a critical distinction for real-world deployment.
- Minimal Variation Principle: SVAMP perturbations are semantically minimal — a human child can immediately solve both the original and variation. Models should too.
- Benchmark Inflation Problem: High accuracy on ASDiv/MAWPS was misleading. SVAMP showed those scores reflected dataset memorization, not arithmetic reasoning.
- Curriculum Design: SVAMP-style adversarial examples can be used during training to force models past shortcut learning.
- LLM Comparison: Even large LLMs (GPT-4) show non-trivial error rates on SVAMP, particularly on irrelevant information problems where distractor numbers appear.

Best Practices for Robust Math Models

- Operation Prediction: Train models to explicitly predict the required operation before generating the equation.
- Semantic Parsing: Parse problem structure into an equation tree rather than directly generating an answer.
- Data Augmentation: Include SVAMP-style perturbations during training to build robustness.
- Chain-of-Thought: Explicitly reasoning through which quantities are relevant dramatically reduces distractor-induced errors.
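The chain-of-thought practice above can be operationalized as a prompt template that forces the model to name the relevant quantities and the operation before computing. The wording below is an illustrative sketch, not a prescribed SVAMP evaluation protocol.

```python
# Hypothetical chain-of-thought prompt builder for math word problems.
COT_TEMPLATE = """Problem: {problem}

Step 1: List only the quantities needed to answer the question.
Step 2: State the arithmetic operation the question requires.
Step 3: Compute the answer using only the quantities from Step 1.
Answer:"""

def build_cot_prompt(problem: str) -> str:
    """Wrap a word problem in an explicit relevance-then-operation scaffold."""
    return COT_TEMPLATE.format(problem=problem)

prompt = build_cot_prompt(
    "A farmer has 12 cows and 7 chickens. He sells 5 cows. "
    "How many cows does he have left?"  # "7 chickens" is a distractor quantity
)
print(prompt)
```

Forcing Step 1 before any arithmetic is what targets the irrelevant-information variations: the distractor must be explicitly excluded rather than silently absorbed into the computation.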

Connection to Broader Robustness Research

SVAMP belongs to a family of adversarial robustness benchmarks:
- HANS (NLI) — linguistic heuristic stress tests.
- PAWS (paraphrase detection) — structural adversarial examples.
- FEVEROUS (fact-checking) — evidence perturbation.

All share the same insight: high accuracy on standard splits does not imply robust generalization when minimal, human-obvious variations are applied.

SVAMP is the trick-question test for arithmetic AI: a model demonstrates genuine mathematical understanding only when it handles simple problem variations, which reveal whether it mastered the underlying operations or merely memorized the surface patterns of its training data.
