BBH (BIG-Bench Hard) is a curated subset of 23 BIG-bench tasks on which state-of-the-art language models scored below the average human rater at the time of its release. It serves as a primary evaluation suite for Chain-of-Thought reasoning, probing the genuine reasoning boundaries of large language models beyond knowledge retrieval.
What Is BBH?
- Origin: Derived from BIG-bench (Beyond the Imitation Game Benchmark), a community effort with 204 tasks. BBH isolates the 23 tasks on which no prior model evaluation, including PaLM-540B, outperformed the average human rater.
- Scale: 6,511 examples in total. Counting the 3/5/7-object variants of two tasks separately gives 27 subtasks, most with 250 examples each (a loading sketch follows this list).
- Format: Mix of multiple-choice and free-form generation tasks.
- Purpose: Distinguishes models that reason from models that merely retrieve — the tasks require multi-step logical manipulation, not just knowledge lookups.
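For concreteness, here is a minimal sketch of loading one BBH task. It assumes the file layout of the public suzgunmirac/BIG-Bench-Hard GitHub repository (bbh/<task>.json files containing an "examples" list of input/target pairs); verify the URL and schema before relying on them.

```python
# Minimal sketch of loading one BBH task from the official GitHub release
# (https://github.com/suzgunmirac/BIG-Bench-Hard). Assumes the repo's
# bbh/<task>.json layout: {"examples": [{"input": ..., "target": ...}]}.
import json
import urllib.request

RAW = "https://raw.githubusercontent.com/suzgunmirac/BIG-Bench-Hard/main/bbh"

def load_bbh_task(task: str) -> list[dict]:
    """Fetch one BBH task file and return its list of {input, target} examples."""
    with urllib.request.urlopen(f"{RAW}/{task}.json") as resp:
        return json.load(resp)["examples"]

examples = load_bbh_task("boolean_expressions")
print(len(examples), examples[0]["input"], "->", examples[0]["target"])
```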
The 23 BBH Tasks
Logical Deduction:
- Logical Deduction (3/5/7 objects): "Alice is taller than Bob, Bob is taller than Carol. Who is tallest?" — scaled to 7 objects.
- Causal Judgement: Given a short vignette, judge whether a typical person would say a particular agent caused the outcome.
- Formal Fallacies: Identify whether a syllogism is valid or contains a named fallacy (affirming the consequent, circular reasoning, etc.).
- Web of Lies: Given a chain of people who either lie or tell the truth about one another, determine whether the final person tells the truth.
Symbolic and Algorithmic:
- Dyck Languages: Given an unfinished bracket sequence, produce the closing brackets that balance it (see the solver sketch after this list).
- Boolean Expressions: Evaluate compound boolean logic ("True AND (False OR NOT True)").
- Multi-step Arithmetic: Evaluate expressions with multiple operations and parentheses.
- Word Sorting: Sort a list of words alphabetically — tests character-level reasoning.
- Object Counting: Count the items of a given type in a described collection ("I have a chair, two potatoes, and a cauliflower. How many vegetables do I have?").
- Geometric Shapes: Name the shape drawn by an SVG path element.
- Navigate: Follow a series of movement instructions and decide whether they return you to the starting point.
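To make the algorithmic flavor of these tasks concrete, here is an illustrative solver for the Dyck Languages task, where the goal is to emit the closing brackets that balance a given prefix. This is a sketch for intuition, not benchmark code; the bracket types and spacing are assumptions modeled on the task's typical formatting.

```python
# Illustrative Dyck Languages solver (a sketch, not part of the benchmark):
# given the opening portion of a bracket sequence, produce the closing
# brackets that balance it.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def complete_dyck(prefix: str) -> str:
    """Return the closing-bracket suffix that balances `prefix`."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(PAIRS[ch])       # remember the matching closer
        elif ch in PAIRS.values():
            assert stack and stack[-1] == ch, "malformed prefix"
            stack.pop()                   # this opener is already closed
    return " ".join(reversed(stack))      # close innermost-first

print(complete_dyck("[ < > ( { }".replace(" ", "")))  # -> ") ]"
```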
Language and World Model:
- Disambiguation QA: Resolve pronoun references in ambiguous sentences.
- Salient Translation Error Detection: Given a source sentence and its machine translation, identify the type of salient error (e.g., swapped entities, wrong numbers, dropped negation).
- Penguins in a Table: Answer questions about structured data presented in natural language tables.
- Temporal Sequences: Given an account of a person's day, determine when they could have performed a particular activity.
- Tracking Shuffled Objects (3/5/7 objects): Track which object ends up where after a sequence of pairwise swaps (see the swap-tracking sketch after this list).
- Reasoning about Colored Objects: Answer questions about the colors and relative positions of objects in a described arrangement.
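The tracking task reduces to applying a permutation step by step. A minimal sketch, with hypothetical names and objects rather than examples drawn from the benchmark:

```python
# Illustrative solver for Tracking Shuffled Objects (a sketch, not benchmark
# code): apply a sequence of pairwise swaps and report the final assignments.
def track_swaps(initial: dict[str, str], swaps: list[tuple[str, str]]) -> dict[str, str]:
    """`initial` maps person -> object; each swap exchanges two people's objects."""
    holding = dict(initial)
    for a, b in swaps:
        holding[a], holding[b] = holding[b], holding[a]
    return holding

start = {"Alice": "red ball", "Bob": "green ball", "Claire": "blue ball"}
print(track_swaps(start, [("Alice", "Bob"), ("Bob", "Claire")]))
# {'Alice': 'green ball', 'Bob': 'blue ball', 'Claire': 'red ball'}
```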
Knowledge and Reasoning:
- Date Understanding: Calculate dates from relative descriptions ("What date is 3 weeks after March 15?"); see the datetime sketch after this list.
- Sports Understanding: Determine if a sports statement is plausible.
- Ruin Names: Pick the humorous edit that "ruins" a movie or musical-artist name.
- Hyperbaton: Choose the sentence with the correct (natural) ordering of adjectives in English.
- Snarks: Given two nearly identical statements, identify the sarcastic one.
- Movie Recommendation: Given a list of movies someone enjoyed, pick the most similar movie from the options.
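Date Understanding is essentially calendar arithmetic. A minimal sketch of the example above using Python's standard library; the year is an illustrative assumption, and BBH answer options are typically written MM/DD/YYYY.

```python
# Calendar arithmetic for the Date Understanding example above.
from datetime import date, timedelta

start = date(2024, 3, 15)           # "March 15" (year chosen for illustration)
answer = start + timedelta(weeks=3)
print(answer.strftime("%m/%d/%Y"))  # 04/05/2024, in BBH's usual MM/DD/YYYY form
```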
Why BBH Matters
- Chain-of-Thought Calibration: BBH is the primary benchmark showing that direct (answer-only) prompting falls short while Chain-of-Thought (CoT) prompting dramatically improves performance. Without CoT, GPT-3.5 achieves ~50% on BBH; with CoT, ~70%+ (see the prompt-construction sketch after this list).
- Reasoning vs. Retrieval Separation: Unlike MMLU (knowledge), BBH tasks have minimal knowledge requirements — they test symbolic manipulation, logical inference, and multi-step tracking.
- Model Discrimination: BBH separates GPT-4 from GPT-3.5 more cleanly than knowledge benchmarks, because reasoning ability scales differently from memorization capacity.
- Architecture Insights: In principle, transformer attention supports the tracking and comparison operations BBH requires, yet models empirically struggle unless CoT scaffolding externalizes the intermediate state.
- Few-Shot Sensitivity: BBH performance is highly sensitive to prompt format and few-shot example quality, making it a probe for instruction following robustness.
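Here is a sketch of how a 3-shot CoT prompt for the Boolean Expressions task might be assembled. The official BBH release ships fixed per-task CoT prompts in this general "Q:/A: Let's think step by step." style; the exemplars below are illustrative stand-ins, not the official ones.

```python
# Sketch of 3-shot CoT prompt construction for boolean_expressions.
# The exemplar questions and reasoning chains are illustrative, not the
# official BBH CoT prompts.
EXEMPLARS = [
    ("not ( True ) and ( True ) is",
     "not ( True ) is False. False and True is False. So the answer is False."),
    ("True or not False is",
     "not False is True. True or True is True. So the answer is True."),
    ("( False ) and not False is",
     "not False is True. False and True is False. So the answer is False."),
]

def build_cot_prompt(question: str) -> str:
    """Prepend three worked exemplars, then pose the test question."""
    shots = "\n\n".join(
        f"Q: {q}\nA: Let's think step by step.\n{cot}" for q, cot in EXEMPLARS
    )
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step.\n"

print(build_cot_prompt("not not ( not ( False ) ) is"))
```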
Performance Comparison
| Model | BBH (Direct) | BBH (CoT 3-shot) |
|-------|-------------|-----------------|
| PaLM 540B | ~40% | ~52% |
| GPT-3.5 | ~50% | ~70% |
| GPT-4 | ~65% | ~83% |
| Claude 3 Opus | — | ~86% |
| Human average | ~88% | ~88% |
Evaluation Protocol
- 3-shot CoT: Provide 3 examples with step-by-step reasoning chains before the test question.
- Exact Match: Answers must exactly match the gold label, normalized for case and whitespace (see the scoring sketch after this list).
- Macro-average: Average accuracy across all 23 tasks — prevents easy tasks from dominating.
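A minimal sketch of this protocol's scoring step. The normalization shown is the simple case/whitespace form described above; real harnesses typically also strip punctuation and extract the final answer from the CoT (e.g., the text after "the answer is") before comparing.

```python
# Sketch of BBH scoring: normalized exact match per example, accuracy per
# task, macro-average across tasks.
def normalize(ans: str) -> str:
    """Lowercase and collapse whitespace (simplified normalization)."""
    return " ".join(ans.strip().lower().split())

def task_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold label."""
    return sum(normalize(p) == normalize(g) for p, g in zip(preds, golds)) / len(golds)

def bbh_score(per_task: dict[str, float]) -> float:
    """Macro-average: every task counts equally regardless of example count."""
    return sum(per_task.values()) / len(per_task)

print(bbh_score({"boolean_expressions": 0.90, "dyck_languages": 0.55, "snarks": 0.70}))
# 0.7166...
```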
Limitations and Critiques
- Contamination Risk: Some BBH tasks (date understanding, boolean expressions) are generated from simple templates that are easy to regenerate, so training data may well contain similar examples.
- Task Diversity: The 23 tasks were selected by a specific metric (human > PaLM-540B) that may not reflect all important reasoning dimensions.
- English Only: No multilingual version, limiting cross-lingual reasoning assessment.
BBH is the reasoning filter for language models — isolating the 23 tasks that genuinely require thinking rather than knowing, making it the gold standard for evaluating Chain-of-Thought prompting and measuring how close AI comes to human-level logical reasoning.