Grokking and Delayed Generalization in Neural Networks

Grokking, or delayed generalization, is the phenomenon in which a neural network first memorizes its training data, reaching perfect training accuracy, and then, much later, suddenly generalizes to unseen data after training continues well past the point of overfitting. It challenges the conventional wisdom that test performance only degrades once overfitting begins.

Discovery and Core Phenomenon

Grokking was first reported by Power et al. (2022) on small algorithmic tasks such as modular arithmetic and permutation-group composition. Networks reached 100% training accuracy within roughly 100 optimization steps, but needed a further 10,000 to 100,000 or more steps before test accuracy jumped from near chance to near perfect. The transition is sharp: it looks like a phase change rather than a gradual improvement. This runs against the classical bias-variance intuition that prolonged overfitting should degrade generalization.
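To make the setup concrete, the following is a minimal sketch of a modular-addition dataset and a way to measure the gap between memorization and generalization from logged accuracies. It is not the code from Power et al.: the modulus `p=97`, the 50% training fraction, the 0.99 accuracy threshold, and the function names are all illustrative assumptions, and the accuracy logs in the usage section are invented numbers that only mimic the shape described above.

```python
import random


def modular_addition_dataset(p=97, train_fraction=0.5, seed=0):
    """Build every pair (a, b) with label (a + b) mod p, then split it.

    p, train_fraction and seed are illustrative defaults (assumptions),
    not values taken from the text above.
    """
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)  # random split over the full table of p * p pairs
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


def first_step_reaching(log, threshold):
    """Return the first logged step with accuracy >= threshold, else None."""
    for step, accuracy in log:
        if accuracy >= threshold:
            return step
    return None


def generalization_delay(train_log, test_log, threshold=0.99):
    """Steps between fitting the training set and generalizing to the test set.

    Each log is a list of (step, accuracy) pairs in increasing step order.
    Returns None if either curve never reaches the threshold.
    """
    train_step = first_step_reaching(train_log, threshold)
    test_step = first_step_reaching(test_log, threshold)
    if train_step is None or test_step is None:
        return None
    return test_step - train_step


# Usage with hypothetical logs (made-up numbers, not experimental results).
train_set, test_set = modular_addition_dataset()
train_log = [(10, 0.20), (100, 1.00), (1_000, 1.00), (10_000, 1.00), (100_000, 1.00)]
test_log = [(10, 0.01), (100, 0.01), (1_000, 0.02), (10_000, 0.05), (100_000, 0.99)]
delay = generalization_delay(train_log, test_log)
print(len(train_set), len(test_set), delay)
```

A large positive delay relative to the step at which training accuracy saturates is the signature of grokking. A delay near zero would correspond to ordinary training, where both curves rise together.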

Mechanistic Understanding

Conditions That Promote Grokking

Relation to Double Descent

Theoretical Frameworks

Practical Implications

Grokking reveals that the relationship between memorization and generalization in neural networks is far more nuanced than classical learning theory suggests, with profound implications for training schedules, regularization strategies, and our fundamental understanding of how deep networks learn.
