Scaling laws are empirical relationships that predict how LLM performance improves with more compute, parameters, and training data. Loss follows power-law curves in each of these quantities, so larger models trained on more data systematically reach lower loss. Because the curves can be extrapolated from smaller runs, they are used to plan training runs and to guide billion-dollar decisions in AI development.

What Are Scaling Laws?

Why Scaling Laws Matter

Key Scaling Relationships

Kaplan Scaling (2020):

L(N) ∝ N^(-0.076)    Loss vs. parameters
L(D) ∝ D^(-0.095)    Loss vs. data tokens
L(C) ∝ C^(-0.050)    Loss vs. compute

Where:
- N = number of parameters
- D = dataset size (tokens)
- C = compute (FLOPs)
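
To make the exponents concrete, the sketch below converts each power law into the loss ratio obtained when one resource is multiplied by a factor k. It assumes a pure power law with no irreducible-loss term and assumes the other two resources are not the bottleneck; the function names are illustrative only.

```python
# Exponents as quoted in the Kaplan relationships above.
KAPLAN_EXPONENTS = {"params": 0.076, "data": 0.095, "compute": 0.050}


def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Return L(k * x) / L(x) for a pure power law L(x) ∝ x^(-exponent)."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor ** (-exponent)


def relative_loss_reduction(scale_factor: float, resource: str) -> float:
    """Fractional drop in loss when `resource` is scaled by `scale_factor`."""
    return 1.0 - loss_ratio(scale_factor, KAPLAN_EXPONENTS[resource])


if __name__ == "__main__":
    for resource in KAPLAN_EXPONENTS:
        for k in (2, 10, 100):
            drop = relative_loss_reduction(k, resource)
            print(f"{k:>4}x {resource:<8} -> loss falls by {drop:.1%}")
```

The small exponents are the point: under these laws, doubling the parameter count lowers loss by only about 5%, which is why meaningful gains require order-of-magnitude increases.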

Chinchilla Scaling (2022):

Optimal compute allocation:
N_opt ∝ C^0.5 (parameters grow with sqrt of compute)
D_opt ∝ C^0.5 (data grows with sqrt of compute)

Ratio: ~20 tokens per parameter

Example:
7B params → 140B tokens optimal
70B params → 1.4T tokens optimal
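
The 20 tokens-per-parameter rule can be turned into a rough budget calculator. The sketch below assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated in this article, and treats the 20:1 ratio as exact. Both are simplifications, so the output is an estimate rather than a prescription.

```python
import math

TOKENS_PER_PARAM = 20.0        # ratio quoted above
FLOPS_PER_PARAM_TOKEN = 6.0    # assumption: C ≈ 6 * N * D


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute under the C ≈ 6 * N * D assumption."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) at 20 tokens per parameter.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N^2,
    so N = sqrt(C / 120), which grows with the square root of compute.
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)  # the 70B example above
    n, d = chinchilla_optimal(budget)
    print(f"budget={budget:.3g} FLOPs -> N={n:.3g} params, D={d:.3g} tokens")
```

Because both N and D scale as the square root of C, a 100x larger budget yields a 10x larger model trained on 10x more tokens.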

Scaling Law Comparison

Approach   | Compute split (params : data) | Key Insight
-----------|-------------------------------|------------------------------------------
Kaplan     | ~3:1 (N ∝ C^0.73, D ∝ C^0.27) | Scale params faster than data
Chinchilla | 1:1 (N ∝ C^0.5, D ∝ C^0.5)    | Balance params and data equally
Practice   | Varies                        | Over-train smaller models for inference efficiency
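
The difference between the two allocation rules is easiest to see by asking how much the model and dataset grow when compute grows. The sketch below assumes Kaplan exponents of roughly 0.73 (parameters) and 0.27 (data), which is where the approximate 3:1 split comes from, and 0.5 each for Chinchilla; the dictionary and function names are illustrative.

```python
# (params exponent, data exponent) in N ∝ C^a, D ∝ C^b.
# Assumption: Kaplan values are the approximate exponents behind the "3:1" split.
ALLOCATION_EXPONENTS = {
    "kaplan": (0.73, 0.27),
    "chinchilla": (0.5, 0.5),
}


def growth_factors(compute_multiplier: float, approach: str) -> tuple[float, float]:
    """Return (params growth, data growth) for a given increase in compute."""
    a, b = ALLOCATION_EXPONENTS[approach]
    return compute_multiplier ** a, compute_multiplier ** b


if __name__ == "__main__":
    for approach in ALLOCATION_EXPONENTS:
        n_growth, d_growth = growth_factors(100.0, approach)
        print(f"{approach:<10} 100x compute -> params x{n_growth:.1f}, data x{d_growth:.1f}")
```

Under these assumptions, a 100x compute increase goes mostly into model size with the Kaplan rule, while the Chinchilla rule splits it evenly at 10x each.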

Compute-Optimal Training

Chinchilla-Optimal:

Inference-Optimal (Modern Practice):

Practical Scaling Examples

Model          | Params | Training Tokens | Tokens/Param
---------------|--------|-----------------|---------------
GPT-3          | 175B   | 300B            | 1.7
Chinchilla     | 70B    | 1.4T            | 20
Llama-2-70B    | 70B    | 2T              | 29
Llama-3-8B     | 8B     | 15T             | 1,875
GPT-4 (est.)   | 1.8T   | ~15T+           | ~8
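
The Tokens/Param column can be recomputed directly, along with how far each model sits from the ~20:1 Chinchilla ratio. This is a small sketch using the figures from the table; the GPT-4 row is left out because its numbers are estimates, and `overtraining_factor` is a hypothetical helper name.

```python
CHINCHILLA_RATIO = 20.0  # tokens per parameter, as quoted earlier

# (model, parameters, training tokens) taken from the table above.
MODELS = [
    ("GPT-3", 175e9, 300e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("Llama-2-70B", 70e9, 2e12),
    ("Llama-3-8B", 8e9, 15e12),
]


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(params: float, tokens: float) -> float:
    """Ratio to the Chinchilla rule: <1 is under-trained, >1 is over-trained."""
    return tokens_per_param(params, tokens) / CHINCHILLA_RATIO


if __name__ == "__main__":
    for name, n, d in MODELS:
        print(f"{name:<12} {tokens_per_param(n, d):>8.1f} tokens/param, "
              f"{overtraining_factor(n, d):.2f}x Chinchilla")
```

By this measure GPT-3 was trained on far fewer tokens than the Chinchilla rule suggests, while Llama-3-8B was trained on many times more, consistent with the inference-efficiency motivation noted in the comparison table.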

Emergent Capabilities

Loss scales smoothly, but capabilities can appear to emerge suddenly:

Loss: 3.0 → 2.5 → 2.0 → 1.8 (smooth decline)
Capability: No → No → No → Yes! (step function)

Examples of emergence (approximate, commonly cited thresholds):
- Chain-of-thought reasoning: >~10B params
- Multi-step math: >~50B params
- Code generation: >~10B params

Scaling Dimensions

Parameters (N):

Training Data (D):

Compute (C):

Implications for Practice

For Training:

For Inference:

Scaling laws act as an empirical physics of AI development: they make progress forecastable rather than unpredictable, enable rational resource allocation, and explain why continued investment in larger models and more data yields systematic capability improvements.
