Key finding

Scaling laws describe predictable relationships between model size, data, compute, and performance in neural networks.

Key finding: Loss decreases as a power law in model parameters, dataset size, and compute. For parameters, L is proportional to N^(-alpha), where L is the loss, N is the number of parameters, and alpha is a positive exponent fitted to experimental results.

Implications: Performance at scale can be predicted from smaller experiments, so investment decisions can be based on extrapolation.

Original work: Kaplan et al. (OpenAI, 2020) established these relationships for language models.

Variables: Model parameters (N), training tokens (D), and compute (C, in FLOPs) each show a power-law relationship with loss, provided the other two are not the bottleneck.

Practical use: Given a compute budget, predict the optimal model size and training duration, and plan training runs efficiently.

Limitations: Emergent abilities may not follow power laws; returns diminish at extreme scale; data quality matters beyond quantity.

Extensions: Chinchilla scaling (Hoffmann et al., 2022, which revised the compute-optimal ratio toward more training tokens per parameter), scaling laws for downstream tasks, and multimodal scaling.

Strategic importance: Drives multi-billion-dollar compute investments at AI labs.

Current status: Well established for pre-training loss; less clear for downstream task performance and emergent abilities.
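Example: The extrapolation and budgeting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in any particular paper. It assumes a pure power law L = a * N^(-alpha) with no irreducible-loss term, fits it by least squares in log-log space, and then splits a compute budget using two commonly quoted rules of thumb: C is roughly 6 * N * D, and roughly 20 training tokens per parameter (the ratio often associated with Chinchilla). All function names, the synthetic data points, and the constants TRUE_A and TRUE_ALPHA are hypothetical and chosen only for illustration.

```python
import math


def fit_power_law(sizes, losses):
    """Fit L = a * N^(-alpha) by ordinary least squares on log L vs. log N.

    Returns (a, alpha). Assumes a pure power law with no irreducible loss.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x    # intercept is log(a)
    return math.exp(intercept), -slope


def predict_loss(a, alpha, n_params):
    """Extrapolate the fitted power law to a model with n_params parameters."""
    return a * n_params ** (-alpha)


def compute_optimal_split(compute_flops, tokens_per_param=20.0,
                          flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)). Both constants are rules of
    thumb, not exact values.
    """
    n_params = math.sqrt(
        compute_flops / (flops_per_param_token * tokens_per_param)
    )
    return n_params, tokens_per_param * n_params


# Synthetic data generated from a known power law (NOT real measurements).
TRUE_A, TRUE_ALPHA = 10.0, 0.08
small_models = [1e6, 1e7, 1e8]
observed = [TRUE_A * n ** (-TRUE_ALPHA) for n in small_models]

a, alpha = fit_power_law(small_models, observed)
print(f"fitted a={a:.3f}, alpha={alpha:.3f}")
print(f"predicted loss at 1e10 params: {predict_loss(a, alpha, 1e10):.3f}")

n_opt, d_opt = compute_optimal_split(1.2e21)
print(f"budget 1.2e21 FLOPs -> ~{n_opt:.2e} params, ~{d_opt:.2e} tokens")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants. Real loss curves are noisy and usually flatten toward an irreducible loss, so an extrapolation like this is best treated as a rough planning estimate.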
