LogiQA is a logical reasoning benchmark sourced from the Chinese National Civil Service Examination (NCSE). It provides multiple-choice reading comprehension questions that require formal deductive and inductive reasoning, making it one of the most challenging standardized logic benchmarks for language models and a key test of whether a model can approximate a logical inference engine.
What Is LogiQA?
- Scale: 8,678 multiple-choice questions (4 options each), split into roughly 7,376 training, 651 validation, and 651 test examples (LogiQA 1.0); LogiQA 2.0 expands to ~35,000 examples.
- Source: Translated from the Chinese Civil Service Examination — a rigorous standardized test used for government employment in China.
- Format: Short passage + multiple-choice question requiring logical inference over the passage.
- Language: Originally Chinese, with an English translation; LogiQA 2.0 includes parallel bilingual versions.
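Concretely, each item bundles a short context, a question, four options, and a gold label. A minimal sketch of the record layout and prompt rendering (field names here are illustrative, not the official schema):

```python
# Illustrative LogiQA-style record. The field names ("context", "question",
# "options", "answer") are assumptions for this sketch, not the released schema.
sample = {
    "context": "All engineers are employees. Some employees are managers.",
    "question": "Which of the following must be true?",
    "options": [
        "Some engineers are managers.",
        "No engineers are managers.",
        "All managers are engineers.",
        "Some employees are engineers.",
    ],
    "answer": 3,  # index of the gold option
}

def format_prompt(item):
    """Render one item as a multiple-choice prompt string."""
    letters = "ABCD"
    lines = [item["context"], item["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item["options"])]
    return "\n".join(lines)

print(format_prompt(sample))
```

A model is then scored by whether its chosen letter matches the gold index.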
The Five Logic Types Covered
Categorical Logic (Class Inclusion/Exclusion):
- "All engineers are employees. Some employees are managers. Can some engineers be managers?" — Syllogistic reasoning.
Conditional Logic (If-Then Chains):
- "If A then B. If B then C. A is true. Is C true?" — Modus ponens, chain rules.
Disjunctive Reasoning (Either-Or):
- "Either X or Y must be true. X is false. Therefore Y." — Disjunctive syllogism.
Causal Analysis:
- "Sales dropped after the policy change. Which conclusion best explains this?" — Abductive inference.
Argument Evaluation:
- "Which fact most weakens the argument that..." — Requires understanding argument structure and finding defeating evidence.
Why LogiQA Is Hard for LLMs
- Non-Statistical Answers: The correct answer follows from logical necessity, not from what is statistically most plausible in pretraining text. A model cannot "guess" based on word frequencies.
- Negation Sensitivity: "Not all A are B" is fundamentally different from "No A are B." Models systematically confuse these.
- Multi-Premise Chaining: Many problems require holding 3-4 premises simultaneously and performing multi-step deductive closure.
- Distractor Quality: Wrong answer options in the NCSE are deliberately crafted to be plausible: they represent tempting but logically invalid conclusions, precisely the distinction the exam uses to measure human reasoning ability.
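The negation distinction above can be made concrete with quantifiers over a small domain (the domain and sets here are invented for illustration):

```python
# "Not all A are B" vs. "No A are B" over a toy domain.
A = {"x1", "x2", "x3"}
B = {"x1"}  # x1 is both A and B; x2 and x3 are A but not B

not_all_A_are_B = not all(a in B for a in A)   # not (for all a in A: B(a))
no_A_are_B      = all(a not in B for a in A)   # for all a in A: not B(a)

assert not_all_A_are_B        # True: x2 is an A that is not a B
assert not no_A_are_B         # False: x1 is an A that IS a B
# The two statements come apart whenever some, but not all, A are B.
```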
Performance Results
| Model | LogiQA 1.0 Accuracy |
|-------|-------------------|
| Random baseline | 25.0% |
| Human (NCSE examinees) | ~86% |
| RoBERTa-large | 35.3% |
| DAGN (graph-augmented) | 39.9% |
| GPT-3.5 | ~58% |
| GPT-4 | ~72% |
| GPT-4 + CoT | ~80% |
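The random baseline in the table follows directly from the format: with four options, uniform guessing scores 25% in expectation. A minimal scoring sketch that also simulates that baseline:

```python
import random

def accuracy(preds, golds):
    """Fraction of items where the predicted option index matches the gold index."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Simulated gold labels and a uniform random guesser over 4 options.
rng = random.Random(0)
golds = [rng.randrange(4) for _ in range(100_000)]
preds = [rng.randrange(4) for _ in range(100_000)]
print(f"random-guess accuracy: {accuracy(preds, golds):.3f}")  # close to 0.250
```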
LogiQA 2.0 Improvements
LogiQA 2.0 (2023) addresses weaknesses of the original:
- NLI Format: Each question is reframed as a natural language inference problem (entailment/contradiction/neutral).
- Bilingual: Chinese and English versions with consistent difficulty.
- Balanced Categories: Equal distribution across the 5 logic types.
- Expanded Scale: ~35,000 examples enabling larger-scale fine-tuning studies.
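The NLI reframing pairs the passage with each option as a hypothesis. A hedged sketch of the idea, using a binary labeling for simplicity (LogiQA 2.0's actual labels and schema may be more fine-grained):

```python
def mcq_to_nli(context, question, options, answer_idx):
    """Turn one multiple-choice item into (premise, hypothesis, label) triples.
    Simplification for illustration: the gold option becomes 'entailment',
    every other option 'not_entailment'."""
    premise = f"{context} {question}"
    return [
        (premise, opt, "entailment" if i == answer_idx else "not_entailment")
        for i, opt in enumerate(options)
    ]

pairs = mcq_to_nli(
    "If it rains, the ground gets wet. It rained.",
    "What must be true?",
    ["The ground is wet.", "The ground is dry."],
    0,
)
assert pairs[0][2] == "entailment" and pairs[1][2] == "not_entailment"
```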
ReClor Comparison
LogiQA is often paired with ReClor (from LSAT Logical Reasoning) for logic evaluation:
| Benchmark | Source | Scale | Focus |
|-----------|--------|-------|-------|
| LogiQA | Chinese NCSE | 8.7k | Formal deductive/inductive |
| ReClor | LSAT | 6.1k | Analytical argument evaluation |
| AR-LSAT | LSAT | 2.0k | Constraint satisfaction |
All three require multi-step logical reasoning but differ in style: LogiQA emphasizes categorical and conditional logic, ReClor focuses on argument analysis, and AR-LSAT centers on constraint satisfaction.
Why LogiQA Matters
- Cross-Cultural Logic Test: LogiQA demonstrates that rigorous logical reasoning is culturally universal; NCSE logic problems transfer cleanly to English.
- Government AI Applications: Civil service AI (policy analysis, legal reasoning, regulatory compliance) requires exactly the logical reasoning that LogiQA tests.
- Commonsense vs. Formal Logic: LogiQA highlights the gap between models' strong common-sense reasoning (commonsense QA benchmarks) and their weaker formal deductive reasoning.
- Compositional Reasoning: Each logic type tests a building block of compositional reasoning — the ability to chain simple rules into complex valid conclusions.
LogiQA is civil service logic for AI — adapting the rigorous deductive and inductive reasoning standards that governments use to select public administrators, providing language models with a demanding test of whether they can actually follow chains of formal logical argumentation.