GOAT (Good at Arithmetic Tasks) is a LLaMA-based language model fine-tuned specifically for arithmetic calculation. It demonstrates that targeted fine-tuning on synthetic data, built on a tokenizer that exposes individual digits, can fix the failures that make standard LLMs unreliable at basic math: GOAT achieves state-of-the-art accuracy on multi-digit addition, subtraction, multiplication, and division by training on carefully structured arithmetic examples with step-by-step solutions, and at the time of its release it outperformed even GPT-4 on certain large-number operations.
Why LLMs Fail at Arithmetic
- Tokenization Problem: Most LLMs use BPE tokenizers that split "12345" into subword chunks like "123" + "45" or "1" + "2345", destroying the digit-level alignment needed for columnar arithmetic; the model literally cannot see individual digits in consistent positions. LLaMA's tokenizer, by contrast, emits one token per digit, which is part of why GOAT builds on it (see the sketch after this list).
- Pattern vs. Computation: LLMs learn statistical patterns, not algorithms. They can recall "2+2=4" from training data but cannot generalize to "47293+81956", because that particular sum almost certainly never appeared in training.
- Carry Propagation: Multi-digit addition requires carrying across columns, a sequential, algorithmic process that autoregressive generation handles poorly without explicit training.
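To see the misalignment concretely, here is a minimal sketch using the `tiktoken` library to inspect how BPE encodings chunk a number (the encoding names are real; exact chunk boundaries vary by tokenizer):

```python
# Sketch: inspect how BPE tokenizers chunk digit strings.
# Assumes the `tiktoken` package is installed (pip install tiktoken).
import tiktoken

for name in ["gpt2", "cl100k_base"]:  # GPT-2- and GPT-4-era encodings
    enc = tiktoken.get_encoding(name)
    chunks = [enc.decode([tid]) for tid in enc.encode("12345")]
    print(f"{name:12s} -> {chunks}")  # e.g. ['123', '45']; boundaries vary

# LLaMA's SentencePiece tokenizer instead emits one token per digit,
# giving every digit a consistent position the model can attend to.
```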
The GOAT Solution
| Component | Approach | Result |
|-----------|----------|--------|
| Base Model | LLaMA-7B, fine-tuned with LoRA | Strong language understanding foundation |
| Training Data | Synthetic arithmetic dataset with step-by-step solutions | Teaches columnar computation |
| Format | "Q: 47293 + 81956 = ? A: Let me compute step by step..." | Chain-of-thought arithmetic |
| Operations | Addition, subtraction, multiplication, division | Full arithmetic coverage |
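The fine-tuning uses LoRA adapters rather than full-parameter updates, which keeps the trainable parameter count small enough for a single consumer GPU. A minimal sketch of such a setup, assuming the Hugging Face `transformers` and `peft` libraries; the model ID and hyperparameters here are illustrative, not the paper's exact values:

```python
# Sketch: LoRA fine-tuning setup for a LLaMA-7B base model.
# Assumes `transformers` and `peft` are installed; model ID and
# hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "huggyllama/llama-7b"  # any LLaMA-7B checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Low-rank adapters on the attention projections: only the small
# adapter matrices are trained, the 7B base weights stay frozen.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # confirms the tiny trainable fraction
```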
Key Innovation: GOAT's training data presents arithmetic problems with explicit intermediate steps, showing the model how to align digits, propagate carries, and decompose harder operations such as multi-digit multiplication into simpler subtasks. This transforms arithmetic from pattern-matching into learned algorithmic execution; a simplified generator for such examples is sketched below.
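As a concrete illustration of that data format, here is a simplified generator for addition examples with explicit carries. The prompt wording is invented for this sketch; the paper's actual templates and task decompositions differ in detail:

```python
# Sketch: generate a GOAT-style training example for multi-digit addition,
# spelling out each column's digit sum and carry.
import random

def addition_example(n_digits: int = 5) -> str:
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    steps, carry = [], 0
    # Walk the columns right to left, recording each digit sum and carry.
    for da, db in zip(reversed(str(a)), reversed(str(b))):
        s = int(da) + int(db) + carry
        steps.append(f"{da} + {db} + {carry} = {s}, write {s % 10}, carry {s // 10}")
        carry = s // 10
    if carry:
        steps.append(f"final carry {carry} becomes the leading digit")
    body = "\n".join(steps)
    return f"Q: {a} + {b} = ?\nA: Let me compute step by step.\n{body}\nSo {a} + {b} = {a + b}."

print(addition_example())
```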
Performance
| Task | GOAT-7B | GPT-4 | LLaMA-7B (base) |
|------|---------|-------|----------------|
| Large addition (10+ digits) | 99%+ | ~85% | <10% |
| Large multiplication | 95%+ | ~70% | <5% |
| Division with remainders | 90%+ | ~80% | <5% |
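Accuracies like these come from exact-match scoring on randomly sampled problems. A hedged sketch of such an evaluation harness, where `ask_model` is a hypothetical placeholder for whatever inference call you use:

```python
# Sketch: exact-match accuracy on random n-digit addition, the kind of
# measurement behind the table above. `ask_model` is a hypothetical
# stand-in, not a real API.
import random

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # e.g. generate from your fine-tuned checkpoint

def addition_accuracy(n_digits: int, trials: int = 100) -> float:
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    correct = 0
    for _ in range(trials):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        reply = ask_model(f"{a} + {b} = ")
        # Score the last integer appearing in the reply against the true sum.
        tokens = "".join(c if c.isdigit() else " " for c in reply).split()
        correct += bool(tokens and int(tokens[-1]) == a + b)
    return correct / trials
```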
Significance: GOAT proved that domain-specific fine-tuning on synthetic data can overcome what look like fundamental LLM limitations. The arithmetic failure is not inherent to the transformer architecture: with digit-level tokenization and targeted training, a 7B model computes reliably. This influenced subsequent math-specialized models (MAmmoTH, MetaMath, Llemma) and validated the approach of using synthetic datasets to teach LLMs algorithmic reasoning.
GOAT is a landmark demonstration that LLMs can learn genuine computation: fine-tuning on structured arithmetic examples yields reliable multi-digit calculation that base models, and even frontier systems, struggle with, and it established synthetic data as a practical route to teaching algorithmic skills.