CLUTRR (Compositional Language Understanding and Text-based Relational Reasoning) is a diagnostic benchmark for inductive reasoning over kinship relations. It tests whether models can learn compositional rules from text (mother of father = grandmother) and systematically generalize them to longer relationship chains never seen during training, directly probing the length-generalization failure of transformer architectures.
What Is CLUTRR?
- Origin: Developed by Sinha et al. (2019) at Mila/McGill University.
- Format: Short natural language stories describing family relationships → question about an unseen kinship relation.
- Key Property: Train on relationship chains of length 2-3, test on chains of length 4-10.
- Kinship Relations: Covers 20+ relations — parent, child, sibling, spouse, grandparent, grandchild, aunt, uncle, niece, nephew, cousin, and combinations thereof.
- Scale: Automatically generated — unlimited training examples by construction; test sets at each chain length.
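The automatic generation can be sketched as sampling a relation chain and folding it through a composition table. The table below is a hypothetical four-entry subset for illustration; the actual benchmark uses a full kinship rule set covering all composable pairs.

```python
import random

# Hypothetical subset of a kinship composition table:
# (rel1, rel2) -> composed relation, read as
# "A is rel1 of B and B is rel2 of C, so A is <composed> of C".
COMPOSE = {
    ("mother", "father"): "grandmother",
    ("mother", "mother"): "grandmother",
    ("father", "mother"): "grandfather",
    ("father", "father"): "grandfather",
}

def compose_chain(relations):
    """Fold a relation chain left-to-right through the table.
    The full benchmark table covers every composable pair;
    this subset handles only 2-hop parent chains."""
    result = relations[0]
    for rel in relations[1:]:
        result = COMPOSE[(result, rel)]
    return result

# A sampled 2-hop chain and its gold answer:
chain = [random.choice(["mother", "father"]),
         random.choice(["mother", "father"])]
print(chain, "->", compose_chain(chain))
```

Because the gold answer is derived from the rule table rather than annotated by hand, examples can be generated at any chain length, which is what makes the train-short/test-long split possible.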
Example (2-hop training vs. 5-hop testing)
2-hop training story:
"Sarah gives her son John a birthday card. John introduces Mary as his daughter."
Question: "What is Sarah to Mary?"
Answer: Grandmother.
Derivation: Sarah is the mother of John; John is the father of Mary; therefore Sarah is the grandmother of Mary. ✓
5-hop test story:
"Linda hugged her nephew Travis. Travis went to visit his son Robert. Robert's sister is Nina. Nina is married to Kevin. Kevin waved to his mother Carol."
Question: "What is Linda to Carol?"
Answer: Requires 5 composition steps: Linda is Travis's aunt, Travis is Robert's father, Robert is Nina's brother, Nina is Kevin's wife, and Kevin is Carol's son. Resolving the queried relation demands systematic rule application across the whole chain, not pattern matching on any single hop.
Why Length Generalization Fails
Transformers exhibit a well-documented failure mode: they can learn 2-3 hop compositions but fail catastrophically on 5-7 hops. Several factors contribute:
- Training Distribution Memorization: The model learns statistical associations between entity mentions and relation words, not general composition rules.
- Attention Dilution: As chain length grows, relevant attention heads must "bridge" across more intermediate mentions — attention weight diffuses.
- No Explicit State: The model has no external memory to track "current entity in the chain" — it must implicitly maintain this in residual stream activations.
- Exponential Rule Combinations: 20 base relations yield 20² = 400 two-hop patterns and 20³ = 8,000 three-hop patterns, so the model cannot memorize all compositions explicitly.
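The combinatorial point can be checked directly: the number of distinct k-hop relation sequences grows as 20^k, so surface-pattern memorization must cover exponentially many sequences while a rule learner only needs the pairwise composition table.

```python
# Count distinct k-hop relation sequences over 20 base relations.
BASE_RELATIONS = 20

for k in range(2, 6):
    print(f"{k}-hop patterns: {BASE_RELATIONS ** k:,}")
```

At 5 hops there are already 3,200,000 possible sequences, far more than any training set enumerates.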
Performance Results
| Model | 2-hop | 3-hop | 5-hop | 10-hop |
|-------|-------|-------|-------|--------|
| RoBERTa-large | ~98% | ~82% | ~48% | ~22% |
| Graph Neural Network | ~99% | ~95% | ~78% | ~45% |
| GPT-4 (few-shot CoT) | ~99% | ~97% | ~89% | ~68% |
| Symbolic solver | 100% | 100% | 100% | 100% |
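The symbolic solver's perfect score at every length follows from simple forward chaining over a relation-composition table. The sketch below is illustrative, not the benchmark's actual solver: the facts, rule entries, and function name are assumptions.

```python
# Forward-chaining kinship solver (illustrative sketch).
# Facts are triples (a, rel, b) meaning "a is rel of b";
# rules map (rel1, rel2) -> their composed relation.

def solve(facts, rules, query):
    """Derive new facts until a fixpoint, then answer the queried pair."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(derived):
            for b2, r2, c in list(derived):
                if b == b2 and (r1, r2) in rules:
                    new_fact = (a, rules[(r1, r2)], c)
                    if new_fact not in derived:
                        derived.add(new_fact)
                        changed = True
    # Return the derived relation for the queried (subject, object) pair.
    return next((r for a, r, c in derived if (a, c) == query), None)

# The 2-hop training story from above:
facts = {("Sarah", "mother", "John"), ("John", "father", "Mary")}
rules = {("mother", "father"): "grandmother"}
print(solve(facts, rules, ("Sarah", "Mary")))  # -> grandmother
```

Because each derivation step applies the same composition rule regardless of chain position, accuracy does not degrade with length, which is exactly the property the neural models in the table lack.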
Why CLUTRR Matters
- Systematic Generalization: The "Holy Grail" debate in cognitive AI — do deep networks learn rules or memorize instances? CLUTRR provides a clean empirical answer: they memorize, and fail to generalize on length.
- Compositional Intelligence: Human understanding of "my father's sister's son is my cousin" is immediate and generalizes to any chain length — CLUTRR quantifies how far AI falls short of this.
- Architecture Research Driver: CLUTRR results drove research into memory-augmented transformers, graph neural networks, and neuro-symbolic hybrids as alternatives to standard attention for relational reasoning.
- Inductive Rule Learning: Unlike deductive benchmarks (LogiQA), CLUTRR tests induction — learning the rule parent(X,Y) ∧ parent(Y,Z) → grandparent(X,Z) from text examples.
- Genealogy and Knowledge Graphs: Real-world applications in genealogy reconstruction, knowledge graph completion, and social network analysis require exactly this compositional kinship reasoning.
CLUTRR is automated genealogy as a reasoning stress test — using the universally understood domain of family relationships to precisely measure whether AI can learn logical composition rules that generalize to arbitrarily complex kinship chains, or whether it memorizes training configurations and fails when the chain grows longer than it has seen before.