LogiQA

Keywords: logiqa, evaluation

LogiQA is a logical reasoning benchmark sourced from the Chinese National Civil Service Examination (NCSE). It provides multiple-choice reading comprehension questions that require deductive and inductive reasoning, which makes it one of the more challenging standardized logic benchmarks for language models and a key test of whether models can approximate a logical inference engine.

What Is LogiQA?

- Scale: 8,678 multiple-choice questions (4 options each) in LogiQA 1.0, split into 7,376 training, 651 validation, and 651 test examples; LogiQA 2.0 expands to ~35,000 examples.
- Source: Translated from the Chinese Civil Service Examination — a rigorous standardized test used for government employment in China.
- Format: Short passage + multi-choice question requiring logical inference over the passage.
- Language: Originally Chinese, with an English translation; LogiQA 2.0 includes parallel bilingual versions.
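
The format described above can be sketched as a small record type. This is an illustrative representation only: the class name `LogiQAExample`, its field names, and the `format_prompt` helper are assumptions for this example, not the schema of any official release.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogiQAExample:
    """One multiple-choice item (hypothetical field names)."""
    passage: str
    question: str
    options: List[str]
    answer: int  # index of the correct option, 0 to 3

    def __post_init__(self) -> None:
        # LogiQA questions have exactly four options.
        if len(self.options) != 4:
            raise ValueError("expected exactly 4 options")
        if not 0 <= self.answer < 4:
            raise ValueError("answer index must be in 0..3")


def format_prompt(ex: LogiQAExample) -> str:
    """Render an example as a plain-text prompt with lettered options."""
    letters = "ABCD"
    lines = [f"Passage: {ex.passage}", f"Question: {ex.question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(ex.options)]
    lines.append("Answer:")
    return "\n".join(lines)
```

Storing the answer as an index rather than a letter keeps scoring independent of how the options are rendered in a prompt.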

The Five Logic Types Covered

Categorical Logic (Class Inclusion/Exclusion):
- "All engineers are employees. Some employees are managers. Can some engineers be managers?" — Syllogistic reasoning.
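
The engineer example can be checked mechanically by enumerating small "worlds". The sketch below is a brute-force illustration, not how any benchmark system works: each individual is one of eight types (engineer or not, employee or not, manager or not), and a world is any non-empty set of types that exist in it.

```python
from itertools import combinations, product

# An individual "type" is a triple (engineer, employee, manager).
TYPES = list(product([False, True], repeat=3))


def models():
    """Yield every non-empty set of individual types (255 in total)."""
    for r in range(1, len(TYPES) + 1):
        for world in combinations(TYPES, r):
            yield world


def premises(world):
    # "All engineers are employees" and "Some employees are managers".
    all_eng_are_emp = all(emp for eng, emp, mgr in world if eng)
    some_emp_are_mgr = any(emp and mgr for eng, emp, mgr in world)
    return all_eng_are_emp and some_emp_are_mgr


def conclusion(world):
    # "Some engineers are managers".
    return any(eng and mgr for eng, emp, mgr in world)


worlds = [w for w in models() if premises(w)]
possible = any(conclusion(w) for w in worlds)   # true in at least one world
necessary = all(conclusion(w) for w in worlds)  # true in every world

print(possible, necessary)
```

The conclusion is possible but not necessary: a world whose only manager-employees are non-engineers satisfies both premises while falsifying it. Distinguishing "can be true" from "must be true" is exactly what this kind of question turns on.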

Conditional Logic (If-Then Chains):
- "If A then B. If B then C. A is true. Is C true?" — Modus ponens, chain rules.
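
A minimal sketch of the chain above, using forward chaining over single-antecedent rules. The function name `forward_chain` and the pair-based rule encoding are choices made for this illustration.

```python
def forward_chain(facts, rules):
    """Apply modus ponens repeatedly until no new facts appear.

    rules is a list of (antecedent, consequent) pairs, read as
    "if antecedent then consequent".
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent in known and consequent not in known:
                known.add(consequent)
                changed = True
    return known


rules = [("B", "C"), ("A", "B")]  # deliberately out of order
derived = forward_chain({"A"}, rules)
print("C" in derived)
```

Starting from `C` instead of `A` derives nothing new, since the rules only run from antecedent to consequent. That asymmetry is the difference between modus ponens and the fallacy of affirming the consequent.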

Disjunctive Reasoning (Either-Or):
- "Either X or Y must be true. X is false. Therefore Y." — Disjunctive syllogism.
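
Validity of a propositional argument like this one can be confirmed with a truth table. The sketch below assumes an inclusive reading of "either X or Y"; the `entails` helper and the lambda encoding of premises are illustrative choices.

```python
from itertools import product


def entails(premises, conclusion, variables):
    """True if the conclusion holds in every assignment that satisfies
    all premises (checked by exhaustive truth table)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample found
    return True


# Disjunctive syllogism: X or Y, not X, therefore Y.
valid = entails(
    [lambda e: e["X"] or e["Y"], lambda e: not e["X"]],
    lambda e: e["Y"],
    ["X", "Y"],
)

# Tempting but invalid: X or Y, X, therefore not Y.
invalid = entails(
    [lambda e: e["X"] or e["Y"], lambda e: e["X"]],
    lambda e: not e["Y"],
    ["X", "Y"],
)

print(valid, invalid)
```

The second argument fails because an inclusive "or" allows both X and Y to be true, which is a typical shape for a plausible-looking wrong option.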

Causal Analysis:
- "Sales dropped after the policy change. Which conclusion best explains this?" — Abductive inference.

Argument Evaluation:
- "Which fact most weakens the argument that..." — Requires understanding argument structure and finding defeating evidence.

Why LogiQA Is Hard for LLMs

- Non-Statistical Answers: The correct answer follows from logical necessity, not from what is statistically most plausible in pretraining text, so guessing from word frequencies or surface overlap with the passage tends to land on a distractor.
- Negation Sensitivity: "Not all A are B" is fundamentally different from "No A are B": the first is compatible with some A being B, while the second rules that out. Models often conflate the two.
- Multi-Premise Chaining: Many problems require holding 3 to 4 premises simultaneously and carrying out several deductive steps to reach the conclusion.
- Distractor Quality: Wrong answer options in the NCSE are deliberately plausible. They represent tempting but invalid conclusions, which is how the exam separates careful reasoners from careless ones.
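
The negation point can be made concrete with sets. In this small sketch the helper names `not_all` and `no` and the toy sets are made up for illustration.

```python
def not_all(a, b):
    """'Not all A are B': at least one member of A lies outside B."""
    return any(x not in b for x in a)


def no(a, b):
    """'No A are B': A and B share no members."""
    return all(x not in b for x in a)


birds = {"sparrow", "penguin"}
fliers = {"sparrow", "bat"}

# Penguins do not fly, so not all birds are fliers;
# sparrows do, so it is false that no birds are fliers.
print(not_all(birds, fliers), no(birds, fliers))
```

For a non-empty A, "No A are B" implies "Not all A are B", but not the other way around, so treating the two as interchangeable accepts conclusions that do not follow.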

Performance Results

| Model | LogiQA 1.0 Accuracy |
|-------|-------------------|
| Random baseline | 25.0% |
| Human (NCSE examinees) | ~86% |
| RoBERTa-large | 35.3% |
| DAGN (graph-augmented) | 39.9% |
| GPT-3.5 | ~58% |
| GPT-4 | ~72% |
| GPT-4 + CoT | ~80% |
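
Accuracy figures like those in the table are the fraction of questions answered correctly, with 25% as the chance level for four options. The sketch below shows that computation along with a binomial standard error; the function names and the toy predictions are assumptions for illustration, not output from any model listed above.

```python
import math

CHANCE_ACCURACY = 1 / 4  # four options per question


def accuracy(predictions, gold):
    """Fraction of questions where the predicted option index matches gold."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def standard_error(acc, n):
    """Binomial standard error of an accuracy estimate over n questions."""
    return math.sqrt(acc * (1 - acc) / n)


# Toy predictions, not real model output.
print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
# Uncertainty of a 72% score on a 651-question test set.
print(standard_error(0.72, 651))
```

On a 651-question test set, the standard error near 72% accuracy is a little under 2 percentage points, so gaps of a point or two between systems should be read with caution.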

LogiQA 2.0 Improvements

LogiQA 2.0 (2023) addresses weaknesses of the original:
- NLI Format: Each question is reframed as a natural language inference problem (entailment/contradiction/neutral).
- Bilingual: Chinese and English versions with consistent difficulty.
- Balanced Categories: Equal distribution across the 5 logic types.
- Expanded Scale: ~35,000 examples enabling larger-scale fine-tuning studies.

ReClor Comparison

LogiQA is often paired with ReClor (from LSAT Logical Reasoning) for logic evaluation:

| Benchmark | Source | Scale | Focus |
|-----------|--------|-------|-------|
| LogiQA | Chinese NCSE | 8.7k | Formal deductive/inductive |
| ReClor | LSAT | 6.1k | Analytical argument evaluation |
| AR-LSAT | LSAT | 2.0k | Constraint satisfaction |

All three require multi-step logical reasoning but differ in reasoning style: LogiQA emphasizes categorical and conditional logic, ReClor focuses on argument analysis, and AR-LSAT centers on constraint satisfaction.

Why LogiQA Matters

- Cross-Cultural Logic Test: NCSE logic problems remain solvable after translation into English, which suggests that the reasoning they test does not depend on language or culture-specific knowledge.
- Government AI Applications: Civil service AI (policy analysis, legal reasoning, regulatory compliance) requires exactly the logical reasoning that LogiQA tests.
- Commonsense vs. Formal Logic: LogiQA highlights the gap between models' relatively strong performance on commonsense QA benchmarks and their weaker formal deductive reasoning.
- Compositional Reasoning: Each logic type tests a building block of compositional reasoning — the ability to chain simple rules into complex valid conclusions.

LogiQA is civil service logic for AI: it adapts the deductive and inductive reasoning standards that governments use to select public administrators into a demanding test of whether language models can actually follow chains of formal logical argument.
