Attention sink is the phenomenon where certain tokens attract a disproportionate share of attention mass, reducing the effective use of the rest of the context. Left unmanaged in prompt and model design, it can degrade long-context quality.
What Is an Attention Sink?
- Definition: A token-level imbalance in attention allocation where a few positions dominate attention flow.
- Typical Triggers: Can arise from special tokens, repetitive prefixes, or positional effects in long prompts.
- Observed Impact: Important evidence may be under-attended when sink tokens absorb model focus.
- Analytical Role: Used as a diagnostic concept in long-context behavior evaluation.
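The imbalance described above follows directly from softmax normalization: any mass absorbed by a sink token is mass unavailable to the rest of the context. A minimal sketch of this effect, using a toy single-head attention distribution (the scores are illustrative values, not taken from any real model):

```python
import numpy as np

def attention_weights(scores):
    """Softmax over raw attention scores for a single query."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy example: 8 context tokens, where position 0 plays the sink role
# (e.g., a special token or boilerplate prefix with an inflated score).
scores = np.array([6.0, 1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3])
weights = attention_weights(scores)

sink_mass = weights[0]             # share absorbed by the sink token
evidence_mass = weights[1:].sum()  # share left for all other tokens
```

Because the weights sum to one, the sink token here captures over 90% of the attention mass, leaving under 10% to be split across every remaining context token, regardless of how relevant those tokens are.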
Why Attention Sinks Matter
- Grounding Risk: Relevant retrieved passages can be ignored if attention concentrates elsewhere.
- Quality Drift: Responses may over-reference boilerplate text instead of factual evidence.
- Prompt Sensitivity: Minor formatting changes can shift attention allocation and output quality.
- Model Selection: Different architectures show different sink-token behavior under long inputs.
- Performance Debugging: Identifying sink patterns helps explain otherwise puzzling reasoning failures.
How It Is Used in Practice
- Attention Inspection: Use probing tools to visualize token attention distribution on representative prompts.
- Prompt Refactoring: Reduce repetitive scaffolding and reposition key evidence tokens.
- Mitigation Policies: Combine retrieval reordering and context compression to limit sink dominance.
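The inspection step above can be sketched as a simple diagnostic over exported attention maps. This is a hypothetical helper, not a library API: the array layout (heads, queries, keys) and the flagging threshold are assumptions you would adapt to your probing setup.

```python
import numpy as np

def sink_report(attn, threshold=0.3):
    """Flag key positions that absorb an outsized share of attention.

    attn: array of shape (heads, queries, keys), e.g. attention maps
    exported from a model probe. Returns the positions whose mean
    incoming attention, averaged over heads and queries, exceeds
    `threshold`.
    """
    mean_incoming = attn.mean(axis=(0, 1))  # average mass each key receives
    return [int(i) for i in np.where(mean_incoming > threshold)[0]]
```

Positions flagged by such a report are candidates for prompt refactoring: if a scaffolding token consistently dominates, repositioning evidence or trimming the repeated prefix can be tried and the report re-run to check whether the dominance shrinks.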
Attention sink is a critical diagnostic concept for long-context reliability. Monitoring and mitigating sink behavior improves evidence utilization in RAG workloads.