Attention sink is the phenomenon where certain tokens attract a disproportionate share of attention mass, reducing the effective use of the rest of the context. Left unmanaged in prompt and model design, it can degrade long-context quality.
What Is an Attention Sink?
- Definition: A token-level imbalance in attention allocation where a few positions dominate attention flow.
- Typical Triggers: Can arise from special tokens, repetitive prefixes, or positional effects in long prompts.
- Observed Impact: Important evidence may be under-attended when sink tokens absorb model focus.
- Analytical Role: Used as a diagnostic concept in long-context behavior evaluation.
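The imbalance described above follows directly from softmax normalization: any mass absorbed by a sink token is mass unavailable to the rest of the context. A minimal sketch of this effect, using a toy single-head attention distribution (the scores are illustrative values, not taken from any real model):

```python
import numpy as np

def attention_weights(scores):
    """Softmax over raw attention scores for a single query."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy example: 8 context tokens, where position 0 plays the sink role
# (e.g., a special token or boilerplate prefix with an inflated score).
scores = np.array([6.0, 1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3])
weights = attention_weights(scores)

sink_mass = weights[0]             # share absorbed by the sink token
evidence_mass = weights[1:].sum()  # share left for all other tokens
```

Because the weights sum to one, the sink token here captures over 90% of the attention mass, leaving under 10% to be split across every remaining context token, regardless of how relevant those tokens are.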
Why Attention Sinks Matter
- Grounding Risk: Relevant retrieved passages can be ignored if attention concentrates elsewhere.
- Quality Drift: Responses may over-reference boilerplate text instead of factual evidence.
- Prompt Sensitivity: Minor formatting changes can shift attention allocation and output quality.
- Model Selection: Different architectures show different sink-token behavior under long inputs.
- Performance Debugging: Identifying sink patterns helps explain otherwise puzzling reasoning failures.
How It Is Used in Practice
- Attention Inspection: Use probing tools to visualize token attention distribution on representative prompts.
- Prompt Refactoring: Reduce repetitive scaffolding and reposition key evidence tokens.
- Mitigation Policies: Combine retrieval reordering and context compression to limit sink dominance.
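The inspection step above can be sketched as a simple diagnostic over exported attention maps. This is a hypothetical helper, not a library API: the array layout (heads, queries, keys) and the flagging threshold are assumptions you would adapt to your probing setup.

```python
import numpy as np

def sink_report(attn, threshold=0.3):
    """Flag key positions that absorb an outsized share of attention.

    attn: array of shape (heads, queries, keys), e.g. attention maps
    exported from a model probe. Returns the positions whose mean
    incoming attention, averaged over heads and queries, exceeds
    `threshold`.
    """
    mean_incoming = attn.mean(axis=(0, 1))  # average mass each key receives
    return [int(i) for i in np.where(mean_incoming > threshold)[0]]
```

Positions flagged by such a report are candidates for prompt refactoring: if a scaffolding token consistently dominates, repositioning evidence or trimming the repeated prefix can be tried and the report re-run to check whether the dominance shrinks.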
Attention sink is a critical diagnostic concept for long-context reliability. Monitoring and mitigating sink behavior improves evidence utilization in RAG workloads.