Grokking, or delayed generalization in neural networks, is the phenomenon where a network first memorizes its training data, achieving perfect training accuracy, and then, much later and only after continued training well past the point of overfitting, suddenly generalizes to unseen data. This challenges the conventional wisdom that test performance degrades monotonically once overfitting begins.
Discovery and Core Phenomenon
Grokking was first reported by Power et al. (2022) on algorithmic tasks (modular arithmetic, permutation groups). Networks achieved 100% training accuracy within ~100 optimization steps but required 10,000-100,000+ additional steps before test accuracy suddenly jumped from near-chance to near-perfect. The transition is sharp, a phase change rather than a gradual improvement, and it contradicts the classical bias-variance intuition that generalization should only degrade once overfitting begins.
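The kind of setup in which this delayed jump is typically observed can be sketched in a few lines. The architecture below (an MLP over learned embeddings) and every hyperparameter are illustrative assumptions, not the exact configuration of Power et al. (2022), who used a small decoder-only transformer; the point is only the combination of a small data fraction, weight decay, and training far beyond 100% training accuracy.

```python
# Minimal sketch of a grokking-style experiment on modular addition.
# Architecture and hyperparameters are illustrative, not the setup of
# Power et al. (2022), who trained a small decoder-only transformer.
import torch
import torch.nn as nn

p = 97                      # modulus for (a + b) mod p
train_frac = 0.3            # small training fraction promotes grokking

# Build the full (a, b) -> (a + b) mod p table and split it at random.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(train_frac * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(
    nn.Embedding(p, 128),   # shared embedding for both operands
    nn.Flatten(),           # concatenate the two operand embeddings
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, p),
)
# Weight decay supplies the regularization pressure usually credited with grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_000):   # keep training far past perfect train accuracy
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean().item()
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean().item()
        print(f"{step:>6}  train {train_acc:.3f}  test {test_acc:.3f}")
```

With settings in this spirit, the logged training accuracy saturates early while the test accuracy stays near chance for a long stretch before jumping.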
Mechanistic Understanding
- Representation phase transition: The network initially memorizes training examples using high-complexity lookup-table-like representations, then discovers compact algorithmic solutions during extended training
- Weight norm dynamics: Memorization solutions have large weight norms; generalization solutions have smaller, more structured weights
- Circuit formation: Mechanistic interpretability reveals that generalizing networks learn interpretable circuits (e.g., Fourier features for modular addition) that emerge gradually during training; see the sketch after this list
- Simplicity bias: Weight decay and other regularizers create pressure toward simpler solutions, but many optimization steps are needed before this pressure pulls the network out of the memorization basin
- Loss landscape: The memorization solution sits in a sharp minimum; the generalizing solution occupies a flatter, more robust region reached via continued optimization
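The Fourier-feature circuit mentioned above can be illustrated directly. The NumPy sketch below is purely illustrative: it shows that cosine/sine features of each operand, combined through the standard angle-addition identities, yield logits that peak exactly at (a + b) mod p. The specific frequencies are arbitrary choices; trained networks are reported to select a handful of "key" frequencies on their own.

```python
# Illustration of the Fourier-feature circuit reported for modular addition:
# represent each operand by cos/sin waves, combine them with trig identities,
# and read out the class whose wave aligns with (a + b).
import numpy as np

p = 97
freqs = [3, 11, 24]                      # hypothetical key frequencies
a, b = 45, 81                            # example operands

logits = np.zeros(p)
for k in freqs:
    w = 2 * np.pi * k / p
    # cos(w*(a+b)) and sin(w*(a+b)) computed only from per-operand features,
    # via cos(x+y) = cos x cos y - sin x sin y and sin(x+y) = sin x cos y + cos x sin y.
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    # The logit for class c is cos(w*(a+b-c)), which is maximal when c = (a+b) mod p.
    c = np.arange(p)
    logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)

print(logits.argmax(), (a + b) % p)      # both print 29
```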
Conditions That Promote Grokking
- Small datasets: Grokking is most pronounced when training data is limited relative to model capacity (high overparameterization ratio)
- Weight decay: Regularization strongly promotes grokking; without weight decay, grokking rarely occurs (or is greatly delayed) because optimization has little incentive to leave the memorization solution
- Algorithmic structure: Tasks with learnable underlying rules (modular arithmetic, group operations, polynomial regression) exhibit grokking more readily than purely random mappings
- Learning rate: Moderate learning rates promote grokking; very high rates cause instability, very low rates delay or prevent the transition
- Data fraction: Grokking time grows rapidly as the training set shrinks, so more data accelerates the transition; one way to quantify the delay is sketched after this list
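A simple way to study these conditions is to measure the gap between when the network fits the training set and when it finally generalizes. The helper below is an illustrative metric, not a standard definition: the 0.99 and 0.95 thresholds and the logging interval are arbitrary choices, and the accuracy curves are assumed to come from a logging loop like the one sketched earlier.

```python
# Illustrative helper for quantifying grokking delay from logged accuracy curves.
# Thresholds (0.99 / 0.95) and log_every are arbitrary choices, not a standard.
from typing import Optional, Sequence

def first_step_above(acc: Sequence[float], threshold: float,
                     log_every: int = 1000) -> Optional[int]:
    """Return the first logged step at which accuracy reaches the threshold."""
    for i, a in enumerate(acc):
        if a >= threshold:
            return i * log_every
    return None  # never crossed the threshold

def grokking_delay(train_acc: Sequence[float], test_acc: Sequence[float],
                   log_every: int = 1000) -> Optional[int]:
    """Steps between (near-)perfect training accuracy and test accuracy catching up."""
    fit = first_step_above(train_acc, 0.99, log_every)
    grok = first_step_above(test_acc, 0.95, log_every)
    if fit is None or grok is None:
        return None
    return grok - fit

# Example with made-up curves: training converges early, the test set
# sits near chance for a while, then jumps.
train = [0.5, 1.0, 1.0, 1.0, 1.0, 1.0]
test = [0.01, 0.02, 0.02, 0.03, 0.96, 0.99]
print(grokking_delay(train, test))   # 3000 steps with the default log_every
```

Repeating this measurement across training-set sizes (or weight-decay values) gives the scaling curves described in the bullets above.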
Relation to Double Descent
- Epoch-wise double descent: Test loss first decreases, then increases (overfitting), then decreases again—related to but distinct from grokking
- Model-wise double descent: Increasing model size past the interpolation threshold causes test loss to decrease again
- Grokking vs double descent: Grokking involves a dramatic delayed jump in accuracy; double descent shows gradual U-shaped recovery
- Interpolation threshold: Both phenomena relate to the transition from underfitting to memorization to generalization in overparameterized models
Theoretical Frameworks
- Lottery ticket connection: Grokking may involve discovering sparse subnetworks (winning tickets) that implement the correct algorithm within the dense memorizing network
- Information bottleneck: Generalization emerges when the network compresses its internal representations, discarding memorized noise while preserving task-relevant structure
- Slingshot mechanism: Loss oscillations during training can catapult the network out of memorization basins into generalizing regions of the loss landscape
- Phase diagrams: Mapping grokking as a function of dataset size, model size, and regularization strength reveals clear phase boundaries between memorization and generalization; a sketch of such a scan follows this list
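One way such a phase diagram could be assembled is sketched below. The `train_and_eval` callable is assumed to be supplied by the caller (for example, a wrapper around the training loop sketched earlier) and to return final train and test accuracy for a given data fraction and weight decay; the grid values and classification thresholds are illustrative.

```python
# Sketch of mapping a memorization/generalization phase diagram over a grid
# of (data fraction, weight decay) settings. `train_and_eval` is a
# caller-supplied function returning final (train_acc, test_acc); the grid
# values and thresholds below are illustrative choices.
from typing import Callable, Dict, Tuple

def phase_diagram(
    train_and_eval: Callable[[float, float], Tuple[float, float]],
    data_fracs=(0.2, 0.3, 0.4, 0.5, 0.7),
    weight_decays=(0.0, 0.01, 0.1, 1.0),
) -> Dict[Tuple[float, float], str]:
    diagram = {}
    for frac in data_fracs:
        for wd in weight_decays:
            train_acc, test_acc = train_and_eval(frac, wd)
            if train_acc < 0.99:
                phase = "underfit"          # never reached interpolation
            elif test_acc > 0.95:
                phase = "generalization"    # grokked, or generalized directly
            else:
                phase = "memorization"      # perfect train, near-chance test
            diagram[(frac, wd)] = phase
    return diagram
```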
Practical Implications
- Training duration: Standard early stopping (based on a validation-loss plateau) may halt training before grokking occurs; longer training with regularization can unlock generalization, as in the sketch after this list
- Curriculum learning: Presenting examples in structured order may accelerate the memorization-to-generalization transition
- Foundation models: Evidence suggests large language models may exhibit grokking-like behavior on reasoning tasks after extended pretraining
- Interpretability: Grokking provides a controlled setting to study how neural networks transition from memorization to understanding
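One simple, grokking-aware adjustment to early stopping is to add a minimum-step floor to an ordinary patience rule, so training is never halted before the regime in which delayed generalization could occur. The class below is a sketch; its name, defaults, and stopping rule are illustrative rather than a standard API.

```python
# Sketch of a patience-based early stopper with a minimum-step floor, one
# simple way to avoid halting before a possible grokking transition.
# Name, defaults, and the stopping rule are illustrative, not a standard API.
class PatientEarlyStopper:
    def __init__(self, patience: int = 10, min_steps: int = 50_000):
        self.patience = patience        # evaluations without improvement to tolerate
        self.min_steps = min_steps      # never stop before this many steps
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, step: int, val_metric: float) -> bool:
        if val_metric > self.best:
            self.best = val_metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        # A plain patience rule would stop as soon as patience runs out; the
        # min_steps floor keeps training alive long enough for a delayed jump.
        return self.bad_evals > self.patience and step >= self.min_steps
```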
Grokking reveals that the relationship between memorization and generalization in neural networks is far more nuanced than classical learning theory suggests, with profound implications for training schedules, regularization strategies, and our fundamental understanding of how deep networks learn.