Numeracy Analysis

Keywords: numeracy analysis, evaluation

Numeracy Analysis in NLP is the systematic study and evaluation of how well language models understand, represent, and generate numerical information. It covers magnitude comparison, unit semantics, arithmetic, and number formatting, and it addresses a foundational weakness of statistical models: they treat numbers as arbitrary token sequences rather than as quantities on a linear scale.

What Is Numeracy in NLP?

Numeracy is distinct from mathematical problem-solving. It asks whether a model has an internal sense of number as a quantity:

- Magnitude Sense: Does the model "know" that 1,000,000 is much larger than 100?
- Plausibility: "A human weighs 70 kg" is plausible; "A human weighs 7,000 kg" is not — does the model recognize this?
- Unit Semantics: Does the model understand that "70 mph" and "112 km/h" refer to the same speed?
- Arithmetic Grounding: Can the model verify that 15% of 80 is 12, not just generate a plausible number?
- Ordinal Reasoning: "Third fastest" implies a ranked ordering of speeds.
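The plausibility dimension above can be made concrete as a toy check. The entity-to-range table below is an illustrative assumption, not a published resource; real plausibility checks would be learned or mined from data.

```python
# Hypothetical plausible mass ranges in kg for a toy plausibility check.
# These ranges are illustrative assumptions chosen for this sketch.
PLAUSIBLE_MASS_KG = {
    "human": (30, 200),
    "car": (800, 3000),
    "smartphone": (0.1, 0.5),
}

def is_plausible_mass(entity: str, kg: float) -> bool:
    """Return True if the stated mass falls inside the assumed range."""
    lo, hi = PLAUSIBLE_MASS_KG[entity]
    return lo <= kg <= hi

print(is_plausible_mass("human", 70))    # True
print(is_plausible_mass("human", 7000))  # False
```

A model with good numeracy should implicitly perform this kind of range check when generating or verifying quantities.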

Why Tokenization Breaks Numeracy

Standard BPE tokenization fragments numbers in non-intuitive ways:
- "1234" might tokenize as ["12", "34"] or ["1", "234"] depending on the vocabulary.
- "10000" and "9999" — consecutive integers — may share no subword tokens and appear linguistically unrelated.
- Magnitude is entirely implicit — the model must learn from context that "million" after a number means ×10⁶.

This is fundamentally different from human number processing, where the digit positional system explicitly encodes magnitude.
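The fragmentation effect can be reproduced with a toy greedy longest-match tokenizer. The vocabulary below is an assumption chosen to mimic BPE behavior; real BPE vocabularies differ, but the phenomenon is the same.

```python
# Toy greedy longest-match subword tokenizer over an assumed vocabulary,
# illustrating how consecutive integers can share no subword tokens.
VOCAB = {"12", "34", "100", "00", "99", "0", "1", "2", "3", "4", "9"}

def greedy_tokenize(s: str) -> list:
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate substring first, as merge-based
        # tokenizers effectively do.
        for j in range(len(s), i, -1):
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token for {s[i]!r}")
    return tokens

print(greedy_tokenize("1234"))   # ['12', '34']
print(greedy_tokenize("10000"))  # ['100', '00']
print(greedy_tokenize("9999"))   # ['99', '99']
```

Note that "10000" and "9999" tokenize to disjoint token sets, so nothing in the model's input signals that they are consecutive integers.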

Key Research Findings

- Wallace et al. (2019) — "Do NLP Models Know Numbers?": Probed BERT embeddings for numeric knowledge. Found BERT has weak magnitude representations but can learn basic number comparison from fine-tuning.
- Thawani et al. (2021) — "Representing Numbers in NLP": Compared digit-by-digit encoding, scientific notation, numericalization (separate float embedding), and character models. No method dominates across all numeracy tasks.
- Berg-Kirkpatrick et al. — Scientific Numeracy: Models hallucinate scientific numbers (atomic masses, physical constants) with alarming frequency, suggesting that number facts in pretraining are not reliably memorized.

Numeracy Failure Modes in Deployed LLMs

- Unit Confusion: "The population of China is approximately 1.4 billion" — models sometimes confuse million/billion/trillion in generation.
- Year Arithmetic: "The policy was implemented 3 years after 2015" should resolve to 2018, yet models give inconsistent or wrong answers.
- Percentage Errors: Doubling 50% correctly yields 100%, but "increase 50% by 25%" is ambiguous and frequently miscalculated.
- Scale Blindness: Generating "the building is 500 miles tall" without triggering implausibility detection.
- Context-Inconsistent Numbers: Stating a statistic correctly in one paragraph and contradicting it in another.
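The percentage failure mode above stems partly from genuine ambiguity: "increase 50% by 25%" has a multiplicative reading and a percentage-point reading. A minimal sketch of the two interpretations (function names are mine):

```python
# Two valid readings of "increase 50% by 25%" that models often conflate.

def increase_relative(value_pct: float, by_pct: float) -> float:
    """Multiplicative reading: grow the value by 25% of itself."""
    return value_pct * (1 + by_pct / 100)

def increase_additive(value_pct: float, by_pct: float) -> float:
    """Percentage-point reading: add 25 points to the value."""
    return value_pct + by_pct

print(increase_relative(50, 25))  # 62.5
print(increase_additive(50, 25))  # 75
```

A numerate model should either disambiguate from context or flag the ambiguity, rather than silently committing to one reading.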

Evaluation Tasks for Numeracy

- Number Comparison: "Which is larger: 3/7 or 0.45?" — tests rational number comprehension.
- Magnitude Estimation: "A car weighs approximately ___ kg" — fill in a plausible range.
- Probing Classifiers: Train a linear probe on model embeddings to predict whether a number is in a range — reveals implicit representational quality.
- Arithmetic Verification: "Does 23 × 14 = 322?" — yes/no verification of calculation.
- NumGLUE (aggregated): Multi-task benchmark aggregating eight numerical-reasoning tasks across numeracy dimensions.
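The probing-classifier idea can be sketched end to end with synthetic stand-in embeddings. In a real probe the features would be frozen LM activations for number tokens; here the embedding construction is an assumption made only so the example runs self-contained.

```python
import numpy as np

# Synthetic stand-in for "embeddings of number tokens": a hidden
# log-magnitude signal projected into 32 dimensions plus noise.
rng = np.random.default_rng(0)
log_mag = rng.uniform(1, 6, size=500)          # stand-in for log10(number)
proj = rng.normal(size=(1, 32))                # fake embedding projection
X = log_mag[:, None] * proj + rng.normal(scale=0.1, size=(500, 32))

# Linear probe: least-squares map from embedding to log-magnitude.
coef, *_ = np.linalg.lstsq(X, log_mag, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((pred - log_mag) ** 2) / np.sum((log_mag - log_mag.mean()) ** 2)
print(round(r2, 3))  # high R^2 means magnitude is linearly decodable
```

With real model embeddings, a high probe R^2 indicates that magnitude information is present and linearly accessible, even if the model fails to use it during generation.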

Improvement Strategies

- Digit-by-Digit Tokenization: Represent "1234" as ["1", "2", "3", "4"] — preserves positional magnitude information.
- Scientific Notation Normalization: Convert all numbers to d.ddd × 10^n before tokenization.
- Number-Span Embeddings: Special embeddings that encode the parsed float value of a number token span.
- Tool Use: Route numeric computation to a calculator or code interpreter — sidestep the representation problem entirely.
- Pretraining Data Engineering: Include more mathematical and scientific text, tables, and spreadsheet data.
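The tool-use strategy can be sketched with a minimal calculator that evaluates a safe arithmetic subset exactly, instead of letting the model generate digits. This is an illustrative sketch of the routing idea, not a production tool interface.

```python
import ast
import operator

# Minimal calculator tool: evaluate a restricted arithmetic subset of
# Python expressions exactly, sidestepping the model's weak arithmetic.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

print(calc("23 * 14"))    # 322
print(calc("0.15 * 80"))  # 15% of 80
```

In a deployed system, the generation pipeline would detect arithmetic spans, route them to such a tool, and splice the exact result back into the output.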

Numeracy Analysis is number sense for AI: the research program ensuring that language models treat numbers as quantities with magnitude and units rather than as arbitrary text sequences. It addresses a foundational weakness behind systematic numerical hallucination in technical, financial, and scientific domains.
