Autoregressive Retrieval is a dynamic retrieval strategy that conditions each retrieval step on previously generated tokens: the model triggers document retrieval mid-generation when it encounters uncertainty or an information gap, then continues generating informed by the freshly retrieved context. This adaptive approach transforms retrieval from a one-shot preprocessing step into an iterative, generation-aware process that fetches exactly the information needed at the point it is needed.
What Is Autoregressive Retrieval?
- Definition: A generation paradigm where retrieval is interleaved with autoregressive token generation — the model generates tokens until a retrieval trigger fires, formulates a query from the generation context, retrieves relevant passages, and continues generating conditioned on both the partial generation and the retrieved information.
- Generation-Aware Queries: Unlike single upfront retrieval (where the query is the original input), autoregressive retrieval formulates queries from the generation context — the partial answer itself informs what information is needed next.
- Multi-Step Retrieval: Complex questions may trigger multiple retrieval steps — each step refines the query based on what has been generated and retrieved so far, enabling iterative knowledge acquisition.
- Retrieval Triggers: Retrieval is activated by: (1) fixed intervals (every N tokens), (2) model uncertainty (low confidence in next-token prediction), (3) learned special tokens ([RETRIEVE] token), or (4) explicit forward-looking assessment.
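The four trigger types above can be sketched as small predicate functions. This is a minimal illustration, not any particular system's API; the thresholds and the `[RETRIEVE]` token string are assumptions.

```python
def uncertainty_trigger(token_probs, threshold=0.4):
    """Trigger type 2: fire if any generated token's predicted
    probability falls below the confidence threshold."""
    return any(p < threshold for p in token_probs)

def interval_trigger(tokens_generated, interval=32):
    """Trigger type 1: fire every N generated tokens."""
    return tokens_generated > 0 and tokens_generated % interval == 0

def special_token_trigger(last_token, trigger_token="[RETRIEVE]"):
    """Trigger type 3: fire when the model emits a learned
    retrieval token (the exact string is model-specific)."""
    return last_token == trigger_token
```

A decoding loop would check these after each step and pause generation to retrieve when any fires.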
Why Autoregressive Retrieval Matters
- Answers Evolve During Generation: For multi-part questions, the information needed for the second part depends on the answer to the first part — upfront retrieval cannot anticipate this dependency, but autoregressive retrieval adapts.
- Multi-Hop Reasoning: Questions requiring chains of facts (A→B→C) need sequential retrieval — retrieve A, use A to formulate query for B, retrieve B, use A+B to find C.
- Self-Correcting: If early generation diverges from correct reasoning, subsequent retrieval can provide corrective information — the model has opportunities to "course-correct" mid-generation.
- Query Specificity: Queries formulated from partial generation are more specific than the original input — retrieving more targeted, relevant passages.
- Reduced Hallucination: Retrieval at the point of uncertainty prevents the model from confabulating when it lacks knowledge — it pauses and retrieves instead.
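The multi-hop pattern (A→B→C) can be sketched as a loop in which each hop's query is formulated from the previous hop's answer. The toy `corpus`, `retrieve` stub, and query templates below are illustrative stand-ins for a real retriever and query-formulation model.

```python
# Toy fact store standing in for a document index.
corpus = {
    "2020 Best Picture winner": "Parasite",
    "director of Parasite": "Bong Joon-ho",
    "nationality of Bong Joon-ho": "South Korean",
}

def retrieve(query: str) -> str:
    # Stand-in for a dense or sparse retriever over a document store.
    return corpus.get(query, "")

def multi_hop_answer(first_query: str, hop_templates: list) -> str:
    """Sequentially resolve A -> B -> C: each hop's query is built
    from the answer retrieved in the previous hop."""
    answer = retrieve(first_query)           # hop A
    for template in hop_templates:           # hops B, C, ...
        answer = retrieve(template.format(answer))
    return answer
```

For example, `multi_hop_answer("2020 Best Picture winner", ["director of {}", "nationality of {}"])` resolves Parasite, then Bong Joon-ho, then "South Korean"; an upfront retriever could not have formed the second query before knowing the first answer.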
Autoregressive Retrieval Implementations
FLARE (Forward-Looking Active Retrieval):
- Generate a tentative continuation; if any token's prediction probability falls below a threshold, trigger retrieval.
- Use the low-confidence sentence (with uncertain tokens masked) as the retrieval query.
- Re-generate the low-confidence span conditioned on retrieved passages.
- Forward-looking: retrieves information for what the model is about to say, not what it already said.
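One FLARE-style iteration can be sketched as below. This is a simplified reading of the method, not the reference implementation: `tentative` is a list of (token, probability) pairs from a stand-in language model, and `retrieve`/`regenerate` are caller-supplied stubs; query formulation by masking low-confidence tokens is one of the paper's strategies.

```python
def flare_step(tentative, retrieve, regenerate, threshold=0.5):
    """One forward-looking active-retrieval iteration over a
    tentative next sentence."""
    low_conf = [tok for tok, p in tentative if p < threshold]
    if not low_conf:
        # Confident everywhere: accept the tentative sentence as-is.
        return " ".join(tok for tok, _ in tentative)
    # Forward-looking query: the tentative sentence with its
    # uncertain tokens masked out.
    query = " ".join(tok for tok, p in tentative if p >= threshold)
    evidence = retrieve(query)
    # Re-generate the span conditioned on the retrieved evidence.
    return regenerate(query, evidence)
```

The key point the sketch shows: the query comes from what the model was *about* to say, so retrieval targets the upcoming claim rather than the original question.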
Self-RAG (Self-Reflective RAG):
- Model generates special tokens indicating: (1) whether retrieval is needed, (2) whether retrieved passage is relevant, (3) whether generation is supported by retrieval.
- Trained with reflection tokens via instruction tuning.
- Self-evaluating: the model itself decides retrieval necessity and assesses retrieval quality.
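The Self-RAG control flow can be sketched as follows. The `decide`, `generate`, and `retrieve` callables stand in for a single reflection-token-trained model plus a retriever; the token strings follow the paper's convention, but exact strings vary by checkpoint, and real Self-RAG ranks candidates by token probabilities rather than filtering.

```python
def self_rag_step(decide, generate, retrieve, question):
    """Sketch of one Self-RAG inference step driven by reflection tokens."""
    if decide(question) != "[Retrieve]":
        # Model judges retrieval unnecessary: answer from parametric memory.
        return generate(question, passage=None)
    for passage in retrieve(question):
        out = generate(question, passage=passage)
        # Critique tokens: keep a generation only if the model judges
        # the passage relevant and the generation supported by it.
        if "[Relevant]" in out and "[Fully supported]" in out:
            return out
    # No passage yielded a supported generation: fall back.
    return generate(question, passage=None)
```

The design point is that both the retrieval decision and the quality assessment are emitted by the model itself, not by an external controller.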
IRCoT (Interleaving Retrieval with Chain-of-Thought):
- Alternate between CoT reasoning steps and retrieval steps.
- Each reasoning step generates a sub-question; retrieval provides evidence for the next step.
- Combines structured reasoning with dynamic evidence gathering.
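The IRCoT-style interleaving can be sketched as a loop that alternates a reasoning step with a retrieval step. The `reason` and `retrieve` callables are illustrative stubs, and the `ANSWER:` convention for terminating the loop is an assumption of this sketch, not the paper's exact protocol.

```python
def ircot(reason, retrieve, question, max_steps=5):
    """Interleave chain-of-thought steps with retrieval.

    `reason` maps (question, evidence, thoughts) to the next thought;
    a thought starting with 'ANSWER:' ends the loop.
    """
    evidence, thoughts = [], []
    for _ in range(max_steps):
        thought = reason(question, evidence, thoughts)
        thoughts.append(thought)
        if thought.startswith("ANSWER:"):
            return thought[len("ANSWER:"):].strip()
        # The latest reasoning step doubles as the next retrieval
        # query, pulling in evidence for the following step.
        evidence.extend(retrieve(thought))
    return None  # step budget exhausted without a final answer
```

Each pass through the loop is one reason-then-retrieve cycle: the chain of thought drives what to fetch, and the fetched evidence grounds the next link in the chain.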
Autoregressive vs. Standard Retrieval
| Aspect | Single-Shot Retrieval | Autoregressive Retrieval |
|--------|----------------------|------------------------|
| Retrieval Timing | Before generation | During generation |
| Query Source | Original input only | Generation context |
| Retrieval Count | Once per query | Multiple per generation |
| Multi-Hop | Must anticipate all hops | Natural sequential discovery |
| Latency | Lower (one retrieval) | Higher (multiple retrievals) |
| Adaptiveness | Fixed context | Evolves with generation |
Autoregressive Retrieval is the paradigm shift from retrieve-then-generate to retrieve-as-you-generate. It recognizes that the information needs of a generation process are not fully knowable at the start and must be discovered dynamically as the response unfolds, enabling the kind of iterative knowledge gathering that characterizes expert human reasoning.