BBH (BIG-bench Hard)

BBH (BIG-bench Hard) is a curated subset of 23 BIG-bench tasks on which the state-of-the-art language models of the time scored below average human performance. It is widely used to test Chain-of-Thought (CoT) reasoning and to probe how far large language models can reason beyond knowledge retrieval.

What Is BBH?

The 23 BBH Tasks

Logical Deduction:

Symbolic and Algorithmic:

Language and World Model:

Knowledge and Reasoning:

Why BBH Matters

Performance Comparison

| Model | BBH (Direct) | BBH (CoT, 3-shot) |
|---|---|---|
| PaLM 540B | ~40% | ~52% |
| GPT-3.5 | ~50% | ~70% |
| GPT-4 | ~65% | ~83% |
| Claude 3 Opus | not listed | ~86% |
| Human average | ~88% | ~88% |
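
The gap between the Direct and CoT columns is simple arithmetic, and a short sketch makes it explicit. The snippet below copies the approximate figures from the comparison table above and computes the absolute gain in percentage points. The `scores` dictionary and the `cot_gain` helper are illustrative names, not part of any benchmark tooling. Claude 3 Opus is left out because the table lists only one score for it.

```python
# Approximate scores (percent) copied from the comparison table: (direct, CoT 3-shot).
scores = {
    "PaLM 540B": (40, 52),
    "GPT-3.5": (50, 70),
    "GPT-4": (65, 83),
}


def cot_gain(direct, cot):
    """Return the absolute gain, in percentage points, from CoT prompting."""
    return cot - direct


# Map each model to its CoT gain over direct prompting.
gains = {model: cot_gain(direct, cot) for model, (direct, cot) in scores.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain} points")
```

Because the inputs are rounded, the gains (12, 20, and 18 points) are approximate as well.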

Evaluation Protocol
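
A minimal sketch of how a 3-shot CoT run could be scored, under stated assumptions: each exemplar is a (question, reasoning, answer) triple, the model ends its reasoning with a phrase of the form "the answer is X.", and scoring is exact match on the extracted answer. The function names `build_prompt`, `extract_answer`, and `exact_match_accuracy` are hypothetical, and the prompt wording is an assumption rather than a confirmed detail of any specific harness.

```python
import re

# Assumed answer format: the completion ends with "... the answer is X."
ANSWER_RE = re.compile(r"answer is\s+(.+?)\.?\s*$", re.IGNORECASE)


def build_prompt(exemplars, question):
    """Join worked exemplars (question, reasoning, answer) with the new question."""
    parts = []
    for q, reasoning, answer in exemplars:
        parts.append(
            f"Q: {q}\nA: Let's think step by step.\n"
            f"{reasoning} So the answer is {answer}."
        )
    # The final question is left open for the model to complete.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


def extract_answer(completion):
    """Return the text after 'answer is' at the end of the completion, or None."""
    match = ANSWER_RE.search(completion.strip())
    return match.group(1).strip() if match else None


def exact_match_accuracy(completions, targets):
    """Fraction of completions whose extracted answer equals the target exactly."""
    correct = sum(extract_answer(c) == t for c, t in zip(completions, targets))
    return correct / len(targets)
```

A completion with no recognizable final answer counts as wrong, so a model that reasons correctly but breaks the assumed format loses credit under exact match.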

Limitations and Critiques

BBH acts as a reasoning filter for language models: it isolates 23 tasks that depend on multi-step reasoning more than on recalled knowledge, which makes it a standard benchmark for evaluating Chain-of-Thought prompting and for measuring how close models come to human-level logical reasoning.
