Scaling laws are empirical relationships that predict how LLM performance improves as compute, parameter count, and training data increase. Loss follows approximate power-law curves, so larger models trained on more data reach systematically lower loss. That predictability lets teams plan training runs in advance and guides billion-dollar decisions in AI development.
What Are Scaling Laws?
- Definition: Mathematical relationships between scale (compute, params, data) and performance.
- Form: Power laws: Loss ∝ X^(-α) for scale factor X.
- Utility: Predict performance before training, optimize resource allocation.
- Origin: OpenAI (Kaplan et al., 2020), refined by DeepMind's Chinchilla work (Hoffmann et al., 2022).
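A minimal sketch of how a power law of this form might be fitted and then extrapolated, assuming a handful of (scale, loss) measurements from small runs. The function names and the synthetic data points are illustrative assumptions, not values from any published study. Published fits often add an irreducible-loss term, which this sketch omits.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares fit of loss = a * x**(-alpha), done as a line fit in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                      # slope of log(loss) vs. log(x)
    alpha = -slope                         # power-law exponent
    a = math.exp(mean_y - slope * mean_x)  # prefactor
    return a, alpha

def predict_loss(a, alpha, x):
    """Evaluate the fitted power law at scale x."""
    return a * x ** (-alpha)

# Synthetic measurements generated from a known law, purely for illustration.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in sizes]

a, alpha = fit_power_law(sizes, losses)
extrapolated = predict_loss(a, alpha, 1e11)  # one decade beyond the data
print(f"alpha = {alpha:.3f}, predicted loss at 1e11 = {extrapolated:.3f}")
```

Fitting in log-log space turns the power law into a straight line, which is why a plain linear regression is enough here.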
Why Scaling Laws Matter
- Investment Planning: Decide how much compute to buy.
- Model Sizing: Choose optimal parameter count for budget.
- Data Requirements: Know how much training data is needed.
- Performance Prediction: Forecast capability improvements.
- Research Direction: Understand what drives progress.
Key Scaling Relationships
Kaplan Scaling (2020):
L(N) ∝ N^(-0.076)    (loss vs. parameters)
L(D) ∝ D^(-0.095)    (loss vs. data tokens)
L(C) ∝ C^(-0.050)    (loss vs. compute)
Where:
- N = number of parameters
- D = dataset size (tokens)
- C = compute (FLOPs)
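To make these exponents concrete, the sketch below computes how much the loss shrinks for a 10× increase in each factor, and how much scale would be needed to halve the loss. It assumes each law is applied on its own, with the other two factors not acting as the bottleneck; the function names are illustrative.

```python
# Exponents quoted above for the Kaplan (2020) fits.
KAPLAN_EXPONENTS = {"parameters": 0.076, "data tokens": 0.095, "compute": 0.050}

def loss_ratio(scale_factor, alpha):
    """Multiplier applied to the loss when scale grows by scale_factor, if L ∝ X^(-alpha)."""
    return scale_factor ** (-alpha)

def scale_needed(target_ratio, alpha):
    """Scale multiplier needed to multiply the loss by target_ratio (a value below 1)."""
    return target_ratio ** (-1.0 / alpha)

for name, alpha in KAPLAN_EXPONENTS.items():
    ten_x = loss_ratio(10.0, alpha)
    halve = scale_needed(0.5, alpha)
    print(f"{name:12s} 10x scale -> loss x {ten_x:.3f}; halving loss needs {halve:.3g}x scale")
```

With an exponent of 0.076, a 10× increase in parameters multiplies the loss by about 0.84, which is the diminishing-returns behavior the small exponents imply.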
Chinchilla Scaling (2022):
Optimal compute allocation:
N_opt ∝ C^0.5 (parameters grow with sqrt of compute)
D_opt ∝ C^0.5 (data grows with sqrt of compute)
Ratio: ~20 tokens per parameter
Example:
7B params → 140B tokens optimal
70B params → 1.4T tokens optimal
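The allocation above can be sketched as a small calculation. This assumes the rough approximation C ≈ 6 × N × D for training FLOPs and the ~20 tokens-per-parameter rule of thumb; the function name and constants are illustrative, and the result is an estimate rather than an exact optimum.

```python
import math

TOKENS_PER_PARAM = 20.0        # rule-of-thumb ratio quoted above
FLOPS_PER_PARAM_TOKEN = 6.0    # assumes C ≈ 6 * N * D

def chinchilla_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a training compute budget into parameters N and tokens D.

    Solves C = 6 * N * D with D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to training a 70B model on 1.4T tokens under C ≈ 6ND.
budget = 6.0 * 70e9 * 1.4e12
n_opt, d_opt = chinchilla_allocation(budget)
print(f"C = {budget:.3g} FLOPs -> N ≈ {n_opt:.3g} params, D ≈ {d_opt:.3g} tokens")
```

Because both N and D scale with the square root of C, quadrupling the budget doubles each of them.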
Scaling Law Comparison
Approach   | Compute-optimal scaling                | Key Insight
-----------|----------------------------------------|--------------------------------
Kaplan     | N ∝ C^0.73, D ∝ C^0.27 (roughly 3:1)   | Scale params faster than data
Chinchilla | N ∝ C^0.5, D ∝ C^0.5 (1:1)             | Balance params and data equally
Practice   | Varies                                 | Over-train for inference efficiency
Compute-Optimal Training
Chinchilla-Optimal:
- Scale model size and data in equal proportion as compute grows.
- Roughly 20 tokens per parameter.
- Lowest loss for a given training compute budget.
Inference-Optimal (Modern Practice):
- Over-train smaller models (200+ tokens/param).
- Better quality per unit of inference cost.
- Llama-3-8B was trained on 15T tokens (about 1,875 tokens per parameter).
Practical Scaling Examples
Model | Params | Training Tokens | Tokens/Param
---------------|--------|-----------------|---------------
GPT-3 | 175B | 300B | 1.7
Chinchilla | 70B | 1.4T | 20
Llama-2-70B | 70B | 2T | 29
Llama-3-8B | 8B | 15T | 1,875
GPT-4 (est.) | 1.8T | ~15T+ | ~8
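The Tokens/Param column is simply training tokens divided by parameters. The sketch below recomputes it from the table and compares each model with the ~20 tokens-per-parameter rule of thumb. The GPT-4 row is left out because its figures are estimates; the dictionary and function names are illustrative.

```python
# (parameters, training tokens) taken from the table above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama-2-70B": (70e9, 2e12),
    "Llama-3-8B": (8e9, 15e12),
}
CHINCHILLA_RATIO = 20.0  # rule-of-thumb tokens per parameter

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

ratios = {name: tokens_per_param(p, t) for name, (p, t) in MODELS.items()}
for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_RATIO
    print(f"{name:12s} {ratio:8.1f} tokens/param ({multiple:.2f}x the ~20 rule of thumb)")
```

A multiple below 1 means the model saw fewer tokens per parameter than the rule of thumb suggests, and a multiple far above 1 indicates deliberate over-training.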
Emergent Capabilities
Loss scales smoothly, but capabilities can emerge suddenly:
Loss: 3.0 → 2.5 → 2.0 → 1.8 (smooth decline)
Capability: No → No → No → Yes! (step function)
Examples of emergence:
- Chain-of-thought reasoning: >~10B params
- Multi-step math: >~50B params
- Code generation: >~10B params
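One way to see how a smooth improvement can look like a step is a toy model in which a task only counts as solved when every one of its steps is correct. This is an illustrative assumption, not a claim about how any particular capability arises; the per-step accuracies, step count, and threshold below are hypothetical.

```python
def task_success_rate(per_step_accuracy, n_steps):
    """Probability that all n_steps independent steps are correct."""
    return per_step_accuracy ** n_steps

N_STEPS = 20      # hypothetical number of steps in the task
THRESHOLD = 0.5   # hypothetical bar for saying the capability is present

# Per-step accuracy improving smoothly across four hypothetical model sizes.
per_step_accuracies = [0.80, 0.90, 0.95, 0.99]

capable = []
for p in per_step_accuracies:
    rate = task_success_rate(p, N_STEPS)
    capable.append(rate > THRESHOLD)
    print(f"per-step accuracy {p:.2f} -> task success {rate:.3f} -> {'Yes' if capable[-1] else 'No'}")
```

Under these assumptions the per-step accuracy rises gradually while the all-or-nothing task metric stays low and then jumps, mirroring the No, No, No, Yes pattern above.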
Scaling Dimensions
Parameters (N):
- More parameters = more model capacity.
- Diminishing returns (power law).
- Memory and inference cost scale linearly.
Training Data (D):
- More data = better generalization.
- Quality matters as much as quantity.
- Data mixing is crucial (code, math, text).
Compute (C):
- C ≈ 6 × N × D (rough approximation).
- Can trade parameters for data at the same compute.
- Training time ≈ C / (sustained hardware FLOP/s), where sustained throughput is peak throughput times utilization.
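The relationships in this list can be sketched as a back-of-the-envelope estimate. The cluster size, per-accelerator peak throughput, and utilization below are hypothetical numbers chosen for illustration, and C ≈ 6 × N × D is itself a rough approximation.

```python
SECONDS_PER_DAY = 86_400

def training_flops(n_params, n_tokens):
    """Approximate training compute using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def training_days(total_flops, n_accelerators, peak_flops_per_s, utilization):
    """Wall-clock days, given sustained throughput = count * peak * utilization."""
    sustained = n_accelerators * peak_flops_per_s * utilization
    return total_flops / sustained / SECONDS_PER_DAY

# 70B parameters on 1.4T tokens, as in the Chinchilla example.
compute = training_flops(70e9, 1.4e12)

# Hypothetical cluster: 1,000 accelerators, 1e15 FLOP/s peak each, 40% utilization.
days = training_days(compute, n_accelerators=1000, peak_flops_per_s=1e15, utilization=0.4)
print(f"C ≈ {compute:.3g} FLOPs, about {days:.1f} days on the assumed cluster")
```

Halving N while doubling D leaves the estimate unchanged, which is the parameter-for-data trade at fixed compute.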
Implications for Practice
For Training:
- Know your compute budget → derive optimal N and D.
- Quality data is increasingly the bottleneck.
- Synthetic data can extend data scaling.
For Inference:
- Smaller models trained longer = better inference economics.
- Mixture-of-experts (MoE) decouples total parameter count from per-token compute.
- Distillation compresses the gains of a large model into a smaller one.
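A rough sketch of the inference-economics argument: a smaller, over-trained model costs more to train per parameter but less per served token. It assumes training costs about 6 × N × D FLOPs and that inference costs about 2 × N FLOPs per token for a dense model; both are approximations, and the comparison says nothing about whether the two models reach the same quality. The function names are illustrative.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def inference_flops(n_params, tokens_served):
    """Approximate inference compute, assuming about 2 * N FLOPs per token."""
    return 2.0 * n_params * tokens_served

def lifetime_flops(n_params, n_train_tokens, tokens_served):
    """Training plus inference compute over the model's deployment."""
    return training_flops(n_params, n_train_tokens) + inference_flops(n_params, tokens_served)

def break_even_tokens(small, large):
    """Served tokens beyond which the small model has lower lifetime compute.

    small and large are (n_params, n_train_tokens) tuples. Returns 0.0 when the
    small model is already cheaper to train.
    """
    extra_training = training_flops(*small) - training_flops(*large)
    saving_per_token = 2.0 * (large[0] - small[0])
    return max(0.0, extra_training / saving_per_token)

small = (8e9, 15e12)     # 8B parameters, 15T training tokens
large = (70e9, 1.4e12)   # 70B parameters, 1.4T training tokens
crossover = break_even_tokens(small, large)
print(f"Break-even after about {crossover:.3g} served tokens")
```

Under these assumptions the smaller model's extra training compute is repaid after roughly a trillion served tokens, after which every additional token widens its advantage.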
Scaling laws are the physics of AI development: they make progress forecastable rather than unpredictable, enable rational resource allocation, and explain why continued investment in larger models and more data yields systematic capability improvements.