Scaling laws | ChipFoundryServices

Scaling laws are empirical relationships that predict how LLM performance improves as compute, parameter count, and training data increase. They follow power-law curves, so loss falls predictably as models grow and train on more data. That predictability lets teams plan training runs before committing resources, and it guides billion-dollar decisions in AI development.

What Are Scaling Laws?

Definition: Mathematical relationships between scale (compute, params, data) and performance.
Form: Power laws: Loss ∝ X^(-α) for scale factor X.
Utility: Predict performance before training, optimize resource allocation.
Origin: Kaplan et al. at OpenAI (2020), refined by Hoffmann et al. at DeepMind (the Chinchilla paper, 2022).
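Because a power law is a straight line in log-log space, its exponent can be estimated with an ordinary linear fit. The sketch below assumes a pure power law with no irreducible-loss offset and uses synthetic, noise-free points; the function name `fit_power_law` and the numbers are illustrative, not taken from either paper.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~ a * x**(-alpha) by least squares in log-log space."""
    # log(loss) = log(a) - alpha * log(x), so the slope is -alpha.
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic points generated from a known law (illustrative only).
sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = 10.0 * sizes ** -0.08

a, alpha = fit_power_law(sizes, losses)

# Extrapolate one decade beyond the largest "measured" size.
predicted_loss = a * (1e11) ** -alpha
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e11: {predicted_loss:.3f}")
```

In practice the fit is run on small training runs and then extrapolated to the target scale, which is what makes prediction before training possible.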

Why Scaling Laws Matter

Investment Planning: Decide how much compute to buy.
Model Sizing: Choose optimal parameter count for budget.
Data Requirements: Know how much training data is needed.
Performance Prediction: Forecast capability improvements.
Research Direction: Understand what drives progress.

Key Scaling Relationships

Kaplan Scaling (2020):

L(N) ∝ N^(-0.076)    Loss vs. parameters
L(D) ∝ D^(-0.095)    Loss vs. data tokens
L(C) ∝ C^(-0.050)    Loss vs. compute

Where:
- N = number of parameters
- D = dataset size (tokens)
- C = compute (FLOPs)
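To get a feel for how shallow these exponents are, the following sketch computes how much loss changes when one factor is scaled while the others are not limiting. It assumes the pure power-law form above; the helper names `loss_ratio` and `scale_needed` are hypothetical.

```python
# Exponents quoted above for the Kaplan (2020) fits.
KAPLAN_EXPONENTS = {"parameters": 0.076, "data": 0.095, "compute": 0.050}

def loss_ratio(scale_factor, alpha):
    """New loss divided by old loss after multiplying X by scale_factor."""
    return scale_factor ** (-alpha)

def scale_needed(target_ratio, alpha):
    """Factor by which X must grow to multiply loss by target_ratio."""
    return target_ratio ** (-1.0 / alpha)

for name, alpha in KAPLAN_EXPONENTS.items():
    doubled = loss_ratio(2.0, alpha)
    print(f"2x {name}: loss falls by {(1 - doubled) * 100:.1f}%")

# How much larger must a model be to halve its loss, all else unconstrained?
print(f"params factor to halve loss: {scale_needed(0.5, 0.076):,.0f}x")
```

Doubling parameters reduces loss by only about 5%, and halving the loss through parameters alone would take a model thousands of times larger, which is why the returns are described as diminishing.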

Chinchilla Scaling (2022):

Optimal compute allocation:
N_opt ∝ C^0.5 (parameters grow with sqrt of compute)
D_opt ∝ C^0.5 (data grows with sqrt of compute)

Ratio: ~20 tokens per parameter

Example:
7B params → 140B tokens optimal
70B params → 1.4T tokens optimal
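The square-root relationships follow from combining the 20 tokens-per-parameter rule with the C ≈ 6 × N × D approximation this article gives under Scaling Dimensions: substituting D = 20N gives C ≈ 120 N², so N ≈ sqrt(C / 120). The sketch below applies that algebra; the function name is hypothetical, and both constants are rules of thumb rather than exact values.

```python
import math

TOKENS_PER_PARAM = 20        # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6    # from C ~ 6 * N * D

def chinchilla_optimal(compute_flops):
    """Return (params, tokens) for a compute budget under the 20:1 rule."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget corresponding to the 70B / 1.4T example above.
budget = 6 * 70e9 * 1.4e12
n, d = chinchilla_optimal(budget)
print(f"N = {n / 1e9:.0f}B params, D = {d / 1e12:.1f}T tokens")

# Quadrupling compute doubles both N and D (square-root scaling).
n4, d4 = chinchilla_optimal(4 * budget)
print(f"4x compute: N = {n4 / 1e9:.0f}B params, D = {d4 / 1e12:.1f}T tokens")
```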

Scaling Law Comparison

Approach   | Params vs. Data Growth          | Key Insight
-----------|---------------------------------|--------------------------------
Kaplan     | ~3:1 (N ∝ C^0.73, D ∝ C^0.27)   | Scale params faster than data
Chinchilla | 1:1 (N ∝ C^0.5, D ∝ C^0.5)      | Balance params and data equally
Practice   | Varies                          | Over-train for inference efficiency

Compute-Optimal Training

Chinchilla-Optimal:

Scale parameters and training tokens in equal proportion as compute grows.
About 20 tokens per parameter.
Best loss for a given compute budget.

Inference-Optimal (Modern Practice):

Over-train smaller models (200+ tokens/param).
Better quality per unit of inference cost.
Llama-3-8B was trained on 15T tokens (about 1,875 tokens per parameter).

Practical Scaling Examples

Model          | Params | Training Tokens | Tokens/Param
---------------|--------|-----------------|---------------
GPT-3          | 175B   | 300B            | 1.7
Chinchilla     | 70B    | 1.4T            | 20
Llama-2-70B    | 70B    | 2T              | 29
Llama-3-8B     | 8B     | 15T             | 1,875
GPT-4 (est.)   | 1.8T   | ~15T+           | ~8
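The Tokens/Param column is simply training tokens divided by parameter count, and dividing again by 20 shows how far each model sits from the Chinchilla rule of thumb. The sketch below recomputes it from the table's own figures; the GPT-4 row is left out because its numbers are estimates, and the helper names are hypothetical.

```python
# (parameters, training tokens) as listed in the table above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama-2-70B": (70e9, 2e12),
    "Llama-3-8B": (8e9, 15e12),
}

CHINCHILLA_RATIO = 20  # tokens per parameter

def tokens_per_param(params, tokens):
    return tokens / params

def relative_to_chinchilla(params, tokens):
    """Values above 1 mean over-trained relative to the 20:1 rule; below 1, under-trained."""
    return tokens_per_param(params, tokens) / CHINCHILLA_RATIO

for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    rel = relative_to_chinchilla(params, tokens)
    print(f"{name:<12} {ratio:>8.1f} tokens/param  ({rel:.2f}x Chinchilla)")
```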

Emergent Capabilities

Loss scales smoothly, but capabilities can emerge suddenly:

Loss: 3.0 → 2.5 → 2.0 → 1.8 (smooth decline)
Capability: No → No → No → Yes! (step function)

Examples of emergence:
- Chain-of-thought reasoning: >~10B params
- Multi-step math: >~50B params
- Code generation: >~10B params
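One way a smooth underlying improvement can look like a step is when a task only counts as solved if every step is right. The toy model below is an illustration under that assumption, not a claim about why any specific capability listed above emerges: it assumes independent steps with equal accuracy, and the accuracy values and step count are made up.

```python
def exact_match_rate(per_step_accuracy, n_steps):
    """Probability that all n_steps succeed, assuming independent steps."""
    return per_step_accuracy ** n_steps

N_STEPS = 30  # hypothetical length of a multi-step task

# Per-step accuracy improves gradually as the model scales (made-up values).
for p in [0.80, 0.90, 0.95, 0.99]:
    rate = exact_match_rate(p, N_STEPS)
    print(f"per-step accuracy {p:.2f} -> full-task success {rate:.3f}")
```

Per-step accuracy rises gently from 0.80 to 0.99, yet full-task success goes from well under 1% to above 70%, which reads as a sudden jump on an all-or-nothing benchmark.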

Scaling Dimensions

Parameters (N):

More parameters = more model capacity.
Diminishing returns (power law).
Memory and inference cost scale roughly linearly with N.

Training Data (D):

More data = better generalization.
Quality matters as much as quantity.
Data mixture is crucial (code, math, text).

Compute (C):

C ≈ 6 × N × D (rough approximation).
Can trade params for data at same compute.
Training time ≈ C / (sustained hardware FLOP/s across all devices).
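The two formulas above chain into a quick back-of-envelope estimate. The sketch below is a rough calculation only: the cluster size, per-device peak throughput, and 40% utilization are hypothetical values chosen for illustration, and the utilization factor is an added assumption to account for real hardware running below its peak.

```python
SECONDS_PER_DAY = 86_400

def training_flops(n_params, n_tokens):
    """Total training compute from the C ~ 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens

def training_days(total_flops, n_devices, peak_flops_per_device, utilization):
    """Wall-clock days, given the sustained throughput of the whole cluster."""
    sustained = n_devices * peak_flops_per_device * utilization
    return total_flops / sustained / SECONDS_PER_DAY

# 70B parameters on 1.4T tokens, as in the Chinchilla example.
compute = training_flops(70e9, 1.4e12)

# Hypothetical cluster: 1,024 devices at 1e15 peak FLOP/s each, 40% utilized.
days = training_days(compute, n_devices=1024,
                     peak_flops_per_device=1e15, utilization=0.40)
print(f"compute = {compute:.2e} FLOPs, about {days:.1f} days")
```

Holding compute fixed, halving N while doubling D leaves `training_flops` unchanged, which is the parameter-for-data trade mentioned above.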

Implications for Practice

For Training:

Know your compute budget → derive optimal N and D.
High-quality data is increasingly the bottleneck.
Synthetic data can extend data scaling.

For Inference:

Smaller models trained longer = better inference economics.
MoE to decouple parameters from compute.
Distillation to compress scaling gains.

Scaling laws are the physics of AI development: they make AI progress forecastable rather than unpredictable, enable rational resource allocation, and explain why continued investment in larger models and more data yields systematic capability improvements.
