QuAC (Question Answering in Context) is a conversational reading comprehension benchmark in which a student who cannot see the article asks questions of a teacher who can. This design models genuine information-seeking dialogue and tests a model's ability to answer context-dependent follow-up questions that build on prior conversation turns, handle topic shifts, and recognize when a question cannot be answered from the provided text.
The Information-Seeking Design
Most QA benchmarks are constructed by crowdworkers who read a passage and then write questions about it. This retrospective process is answer-aware: the writer already knows where the answer is, which tends to produce stilted, unnatural questions.
QuAC inverts this with deliberate information asymmetry: a "student" who sees only the passage title and section heading asks questions of a "teacher" who reads the full Wikipedia passage. Because the student is genuinely information-seeking, asking about content they do not yet know, the resulting dialogues are more natural and coherent.
Dataset Construction
Setup: Two crowdworkers are paired. Worker 1 (the teacher) sees a Wikipedia passage about a person; Worker 2 (the student) sees only the person's name and the section heading.
Interaction: The student asks 7–12 questions in sequence to learn about the person. The teacher selects a span from the passage as the answer or marks the question as unanswerable. The student sees each answer before asking the next question.
Scale: 98,407 question-answer pairs across 13,594 dialogues. Each dialogue covers one section of a Wikipedia article about a person; topics include musicians, politicians, athletes, authors, and historical figures.
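The released JSON follows a SQuAD-style schema. A minimal sketch of iterating over it is below; the field names follow the public release but should be treated as assumptions, and the file path is hypothetical:

```python
import json

# Sketch of walking the QuAC JSON release. Field names follow the
# public distribution's SQuAD-like schema; verify against your copy.
with open("train_v0.2.json") as f:  # hypothetical local path
    data = json.load(f)["data"]

for article in data:
    section = article["section_title"]          # what the student saw
    for para in article["paragraphs"]:
        context = para["context"]               # full passage (teacher only)
        for qa in para["qas"]:
            question = qa["question"]
            answer = qa["answers"][0]["text"]   # span text, or "CANNOTANSWER"
            print(section, "|", question, "->", answer[:40])
```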
The Information-Seeking Flow
A typical QuAC dialogue about a musician:
Turn 1: "Where was she born?" → "Nashville, Tennessee."
Turn 2: "What genre of music does she play?" → "Country and pop."
Turn 3: "Did she have any early musical influences?" → "Her grandmother, who sang in church choirs."
Turn 4: "How old was she when she started performing?" → CANNOTANSWER.
Turn 5: "When did she release her first album?" → "2006."
Turn 6: "What was it called?" → "Taylor Swift." ("it" resolves to the first album mentioned in Turn 5.)
Context Dependence and Follow-Up Questions
QuAC's central challenge is context dependence across turns:
Pronoun Reference: "What did she do next?" — "she" refers to the article subject, and "next" is relative to whatever event was last discussed.
Implicit Topic: "Was it successful?" — "it" refers to whatever was discussed in the previous answer, without any explicit anchor in the current question.
Topic Shift: After several questions about early life, the student may ask about later career. The model must recognize the discourse is shifting and not continue reasoning about the previous topic.
Follow-Up Specificity: "Tell me more about that." — requires the model to expand on the most recently answered content rather than re-answering the question.
These context dependencies require maintaining dialogue state across turns rather than answering each question independently. One common baseline treatment, prepending recent turns to the current question so a standard extractive reader can resolve the references, is sketched below.
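A minimal sketch of that history-concatenation approach, assuming a simple `Q: ... A: ...` turn format (the function name and format are illustrative, not part of QuAC's official tooling):

```python
def build_input(history, question, max_turns=2):
    """Prepend the last `max_turns` QA pairs to the current question so a
    standard extractive reader sees the context needed to resolve
    'she', 'it', 'that', etc. Illustrative sketch only."""
    recent = history[-max_turns:]
    turns = [f"Q: {q} A: {a}" for q, a in recent]
    return " ".join(turns + [f"Q: {question}"])

history = [("When did she release her first album?", "2006")]
print(build_input(history, "What was it called?"))
# -> "Q: When did she release her first album? A: 2006 Q: What was it called?"
```

Truncating to the most recent turns is a practical trade-off: older turns rarely help resolve references, and unbounded history quickly exhausts a reader's input length.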
QuAC vs. CoQA
QuAC and CoQA (Conversational Question Answering) are the two dominant conversational QA benchmarks:
| Aspect | QuAC | CoQA |
|--------|------|------|
| Design | Information-seeking (student/teacher) | Collaborative reading |
| Answer format | Passage spans or CANNOTANSWER | Free-form + passage spans |
| Passage type | Wikipedia (persons) | Mixed domains |
| Turn count | 7–12 per dialogue | Variable |
| Key challenge | Context dependence | Abstraction and paraphrase |
| Scale | 98K questions | 127K questions |
QuAC questions are more naturally context-dependent because the student cannot see the passage; CoQA questions are more varied in answer format because annotators can freely abstract from the passage.
The CANNOTANSWER Label
A significant fraction of QuAC questions (roughly 20%) are marked CANNOTANSWER: questions the teacher determines cannot be answered from the passage. Handling them requires the model to:
1. Attempt to find evidence in the passage.
2. If no evidence exists, output CANNOTANSWER rather than confabulating an answer.
Recognizing unanswerability is challenging because some questions that seem unanswerable have subtle answers in the passage, and vice versa. This tests calibrated uncertainty: the model should abstain when the passage lacks the answer and commit to a span when one is present.
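One widely used mechanism for this decision in extractive readers, following the SQuAD 2.0-style recipe rather than anything QuAC itself mandates, is to compare the best span score against a no-answer score and abstain when the margin crosses a tuned threshold. A minimal sketch with illustrative inputs:

```python
CANNOTANSWER = "CANNOTANSWER"

def predict(best_span_score: float, null_score: float,
            best_span_text: str, threshold: float = 0.0) -> str:
    """Abstain when the null score beats the best span score by more than
    `threshold` (tuned on dev data). The scores are illustrative model
    outputs, e.g. summed start/end logits from an extractive reader."""
    if null_score - best_span_score > threshold:
        return CANNOTANSWER
    return best_span_text

# Weak span evidence relative to the null score -> abstain
print(predict(best_span_score=1.2, null_score=3.5, best_span_text="in 2006"))
```

The threshold directly trades false abstentions against confabulated answers, which is why it is tuned on held-out data rather than fixed at zero.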
Evaluation
QuAC is evaluated with token-level F1 plus the Human Equivalence Score (HEQ):
- F1: Token-level overlap between the predicted span and the reference answer.
- HEQ-Q: The fraction of individual questions on which the model's F1 matches or exceeds human F1.
- HEQ-D: The fraction of dialogues in which this holds for every question in the dialogue.
Human performance is F1 ≈ 81. Models typically reach F1 in the 60s and 70s, with context tracking the primary source of the remaining gap.
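A simplified sketch of the scoring logic (the official scorer additionally normalizes text and takes the max over multiple human references, so treat this as an illustration):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answer spans
    (simplified single-reference version)."""
    p, g = pred.split(), gold.split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def heq(dialogues):
    """dialogues: list of dialogues, each a list of
    (pred, gold, human_f1) triples, one per question."""
    q_hits = [token_f1(p, g) >= h for d in dialogues for p, g, h in d]
    d_hits = [all(token_f1(p, g) >= h for p, g, h in d) for d in dialogues]
    heq_q = sum(q_hits) / len(q_hits)   # per-question human equivalence
    heq_d = sum(d_hits) / len(d_hits)   # per-dialogue human equivalence
    return heq_q, heq_d
```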
Applications
QuAC models real assistant interaction patterns:
- Virtual Assistants: Users ask follow-up questions that reference previous answers without restating context.
- Customer Support: "What about the return policy?" requires knowing what product was being discussed.
- Educational Tutoring: Students ask sequential questions that build on previously understood concepts.
- Document-Grounded Dialogue: Enterprise chatbots that answer from a knowledge base must handle the same context-dependent follow-up patterns.
QuAC is information-seeking dialogue grounded in text. The benchmark tests whether models can sustain genuine multi-turn conversations in which each question depends on prior answers, handling the pronoun references, topic continuations, and answerability judgments that make real-world conversational QA fundamentally harder than isolated reading comprehension.