Primacy bias

Keywords: primacy bias, training phenomena

Primacy bias is a training dynamics phenomenon in machine learning where examples presented early in training exert a disproportionately large influence on learned representations and model behavior. The model develops feature detectors, decision boundaries, and internal representations biased toward the statistical structure of its early training data, and these biases can persist through the entire training run, even after the model has processed orders of magnitude more subsequent examples. The effect is particularly severe in reinforcement learning, where the composition of the replay buffer early in training shapes the value function landscape in ways that resist later correction.

Why Early Examples Have Outsized Influence

The primacy bias stems from the sequential nature of gradient-based optimization:

Gradient interference: Early examples push the network into regions of high loss-landscape curvature along certain directions. Subsequent examples that require updates in conflicting directions then face a "crowded" parameter space: the first examples have effectively claimed parameter capacity that later examples must compete for.

Representation anchoring: Neural networks learn hierarchical features incrementally. Early training examples shape the low-level features in early layers. These low-level features then become the "vocabulary" for all subsequent higher-level feature learning — making the representational basis path-dependent on what was seen first.

Learning rate decay interaction: Most training schedules use higher learning rates early and lower rates later (cosine annealing, linear warmup-decay). Higher early learning rates amplify the influence of early examples on the loss landscape, compounding the bias.
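To see how much this compounding matters, the following sketch (with illustrative, arbitrary hyperparameters) sums a standard cosine-annealing schedule and measures what share of the cumulative learning-rate budget the earliest steps receive:

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Standard cosine annealing from lr_max down to lr_min (no warmup)."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

total = 100_000
budget = sum(cosine_lr(s, total) for s in range(total))
early = sum(cosine_lr(s, total) for s in range(total // 10))

# Under this schedule the first 10% of steps receive roughly 20% of the
# cumulative learning rate, i.e. about twice their "fair share" of updates.
print(f"first 10% of steps: {early / budget:.1%} of the LR budget")
```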

Empirical Evidence

Studies demonstrate primacy bias across settings:

Supervised learning: Training CIFAR-10 classifiers with shuffled vs. class-sorted initial batches shows 2-5% accuracy differences even after identical total training. The sorted curriculum leaves residual biases in learned filters that persist despite later shuffling.
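A minimal sketch of this kind of ordering experiment, assuming integer class labels (the function name and epoch logic are illustrative, not the original study's code):

```python
import numpy as np

def epoch_order(labels, epoch, sorted_epochs=1, seed=0):
    """Index order for one epoch: class-sorted for the first
    `sorted_epochs` epochs, uniformly shuffled afterwards."""
    rng = np.random.default_rng(seed + epoch)
    if epoch < sorted_epochs:
        return np.argsort(labels, kind="stable")  # all of class 0, then 1, ...
    return rng.permutation(len(labels))

labels = np.random.default_rng(0).integers(0, 10, size=50_000)
sorted_start = epoch_order(labels, epoch=0)   # biased opening curriculum
shuffled = epoch_order(labels, epoch=5)       # normal shuffled epoch
```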

NLP language models: Pre-training data order affects downstream task performance measurably. Documents seen in the first training epoch influence tokenizer statistics, vocabulary prioritization, and early attention patterns in ways that shape all subsequent learning.

Reinforcement learning (most severe): In DQN and its variants, early replay buffer samples are drawn almost entirely from the initial random policy. A Q-network trained predominantly on random-behavior data develops value estimates tuned to random trajectories, and those estimates then guide the policy during the crucial early exploration phase. The result is a feedback loop: poor early estimates lead to poor early experiences, which in turn reinforce the poor estimates.
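The over-representation is a direct consequence of first-in-first-out buffering. A back-of-the-envelope sketch, with illustrative warmup and capacity values:

```python
def random_policy_fraction(step, warmup=10_000, capacity=100_000):
    """Fraction of a FIFO replay buffer still holding transitions from
    the initial random (warmup) policy after `step` environment steps."""
    fill = min(step, capacity)
    # Random-policy transitions still in the window [step - capacity, step):
    remaining = max(0, min(warmup, step) - max(0, step - capacity))
    return remaining / max(fill, 1)

for s in (10_000, 50_000, 100_000):
    # Prints 100%, 20%, 10%: random-policy data dominates exactly when
    # the Q-network is forming its first value estimates.
    print(s, f"{random_policy_fraction(s):.0%}")
```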

Nikishin et al. (2022): The Primacy Bias in Deep Reinforcement Learning

The defining study demonstrated that:
- Agents with periodic "network resets" (periodically reinitializing the final layers while retaining the replay buffer) dramatically outperform their standard counterparts on Atari games; a minimal sketch of the mechanism follows this list
- The improvement comes from breaking the primacy bias: the reset forces the network to relearn value estimates from scratch using the full current replay buffer, rather than preserving early-biased estimates
- The effect is related to plasticity loss in continual learning: early training reduces the network's ability to adapt to new information
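A minimal PyTorch sketch of the reset mechanism (the function name, layer selection, and reset interval are illustrative choices, not the paper's released code):

```python
import torch.nn as nn

def reset_final_layers(q_network: nn.Module, num_layers: int = 1) -> None:
    """Reinitialize the last `num_layers` Linear layers of a Q-network,
    leaving earlier feature layers and the replay buffer untouched."""
    linears = [m for m in q_network.modules() if isinstance(m, nn.Linear)]
    for layer in linears[-num_layers:]:
        layer.reset_parameters()

q_network = nn.Sequential(nn.Linear(84, 256), nn.ReLU(), nn.Linear(256, 6))

for step in range(100_000):
    # ... usual DQN update from the replay buffer would go here ...
    if step > 0 and step % 20_000 == 0:  # reset interval is a hyperparameter
        reset_final_layers(q_network, num_layers=1)
```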

Primacy Bias vs. Catastrophic Forgetting

These are related but distinct phenomena:
- Catastrophic forgetting: later learning overwrites earlier learning (the opposite failure mode)
- Primacy bias: earlier learning resists being overwritten by later learning

Both stem from the stability-plasticity dilemma: networks must be plastic enough to learn new information but stable enough to retain previously acquired knowledge. Primacy bias occurs when stability dominates early representations too strongly.

Mitigation Strategies

Data shuffling: The simplest intervention is to randomize data order so that consecutive examples do not share similar statistical structure. This reduces but does not eliminate primacy bias, since effective step sizes still decay over the course of training.
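In PyTorch this is a single flag; the random tensors below are placeholders for a real dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data; substitute the real training set.
train_dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                              torch.randint(0, 10, (1024,)))

# shuffle=True draws a fresh random permutation every epoch, so no fixed
# ordering of examples dominates the early gradient steps.
loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
```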

Curriculum design starting with diversity: Ensure that the first training batches contain diverse, representative samples across all classes and attribute distributions. This contrasts with "easy-first" curricula, which can exacerbate primacy bias.
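One way to construct such a diverse opening order is simple round-robin interleaving across classes; a sketch assuming roughly class-balanced integer labels (the helper name is illustrative):

```python
import numpy as np

def stratified_order(labels, seed=0):
    """Interleave classes round-robin (c0, c1, ..., cK, c0, c1, ...) so
    even the very first batch contains a representative class mix.
    Assumes roughly balanced classes; zip() truncates to the smallest."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    per_class = [rng.permutation(np.where(labels == c)[0])
                 for c in np.unique(labels)]
    return np.array([i for group in zip(*per_class) for i in group])

labels = np.random.default_rng(0).integers(0, 10, size=50_000)
order = stratified_order(labels)   # feed batches in this index order
```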

Experience replay with prioritization: In RL, prioritized experience replay (PER) upweights samples with high temporal-difference error, actively counteracting the over-representation of early random-policy samples. Reservoir sampling ensures the replay buffer maintains uniform coverage over all training history.
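A minimal sketch of the reservoir-sampling variant (classic Algorithm R); the class and method names are illustrative:

```python
import random

class ReservoirBuffer:
    """Replay buffer kept as a uniform sample over *all* transitions
    ever seen, so early random-policy data cannot dominate."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, transition):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            # Each new transition replaces a stored one with probability
            # capacity / n_seen, preserving uniformity over history.
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = transition

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)
```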

Periodic network resets / shrink-and-perturb: Reset subsets of network weights periodically while perturbing others slightly, forcing re-learning from the current data distribution while preserving general knowledge. Effective in deep RL and continual learning.
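A hedged sketch of shrink-and-perturb in PyTorch; the canonical form (Ash & Adams, 2020) interpolates weights toward a fresh random initialization, which the Gaussian noise below approximates for Gaussian-initialized layers:

```python
import torch

@torch.no_grad()
def shrink_and_perturb(model, shrink=0.8, noise_scale=0.01):
    """Scale every weight toward zero and add small fresh noise,
    restoring plasticity without discarding all learned structure."""
    for param in model.parameters():
        param.mul_(shrink)
        param.add_(noise_scale * torch.randn_like(param))

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
shrink_and_perturb(model)   # apply periodically during training
```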

Learning rate schedules: Cyclical learning rates (Smith, 2017) and warm restarts (SGDR) periodically increase learning rates, enabling the network to escape early-biased local minima and explore loss landscape regions shaped by later training data.
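A minimal PyTorch sketch using the built-in warm-restart scheduler; the model and cycle lengths are placeholders:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(128, 10)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Restart every T_0 steps, doubling each cycle (T_mult=2). Every restart
# returns the LR to its maximum, letting later data reshape minima that
# were carved out early in training.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1_000, T_mult=2)

for step in range(10_000):
    # ... forward / backward pass would go here ...
    optimizer.step()
    scheduler.step()
```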

Understanding primacy bias is essential for practitioners designing training pipelines for large-scale models, where the computational cost of full re-training makes it critical to get the data ordering and initialization strategy right from the start.
