CFQ (Compositional Freebase Questions)

Keywords: cfq, evaluation

CFQ (Compositional Freebase Questions) is a large-scale semantic parsing benchmark for measuring compositional generalization in natural-language-to-SPARQL translation over the Freebase knowledge graph. It introduced the Maximum Compound Divergence (MCD) split methodology, which maximizes the structural difference between training and test compounds, creating a rigorous compositional generalization test that exposed the limitations of standard seq2seq models and pretrained language models.

What Is CFQ?

- Origin: Keysers et al. (2020) from Google Research.
- Task: Map natural language questions to SPARQL queries over Freebase.
- "Who directed films produced by X?" → SELECT ?x WHERE { ?film prod:producer ns:X. ?film movie:director ?x }
- "Did M1 and M2 star the same set of actors?" → Multi-join SPARQL with overlap predicates.
- Scale: 239,357 question-query pairs; evaluated on 3 MCD splits (MCD1, MCD2, MCD3).
- Knowledge Base: Freebase — a large-scale knowledge graph with entities, relations, and types.
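The task format above is just question-query pairs. A minimal sketch of one such pair and of pulling the predicate "atoms" out of the SPARQL body; the field names, `ns:` identifiers, and regex here are illustrative assumptions, not the dataset's exact schema:

```python
# Sketch of a CFQ-style record plus predicate extraction.
# Field names and the ns: relation names are illustrative assumptions.
import re

example = {
    "question": "Who directed films produced by M0?",
    "sparql": "SELECT ?x WHERE { ?film ns:film.producer ns:M0 . "
              "?film ns:film.director ?x }",
}

def predicates(sparql: str) -> set[str]:
    """Extract predicate atoms (the middle token of each triple
    pattern) from a SPARQL WHERE body."""
    body = sparql.split("WHERE", 1)[1]
    triples = re.findall(
        r"(\?\w+|ns:[\w.]+)\s+(ns:[\w.]+)\s+(\?\w+|ns:[\w.]+)", body)
    return {pred for _, pred, _ in triples}

print(predicates(example["sparql"]))
# the two predicate atoms: ns:film.producer and ns:film.director
```

A model that has learned each predicate atom individually still has to compose them into the correct multi-clause query, which is exactly what the MCD splits stress-test.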

The MCD Split Innovation

Standard random train/test splits for semantic parsing are misleading — they allow the test set to contain the same predicate combinations as training, inflating accuracy estimates. MCD (Maximum Compound Divergence) creates splits that maximize structural novelty:

- Atom: Individual predicates, entities, and query patterns.
- Compound: Multi-predicate query patterns — e.g., a 3-join SPARQL pattern like ?film director ?x. ?film actor ?y. ?film producer ?z.
- MCD Principle: Training and test sets have similar atom distributions (same predicates appear) but maximally different compound distributions — test queries require combining predicates in ways absent from training.

This design means a model that perfectly memorizes training compounds will score near 0% on MCD splits — only models that learn reusable predicate-level rules will generalize.
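The paper quantifies "maximally different compound distributions" with a Chernoff-coefficient-based divergence between the two splits' distributions, using α = 0.5 for atoms and α = 0.1 for compounds. A minimal sketch over plain frequency distributions (the paper additionally weights compounds, which this omits):

```python
from collections import Counter

def divergence(train, test, alpha):
    """D_alpha(P || Q) = 1 - sum_c P(c)^alpha * Q(c)^(1-alpha),
    the Chernoff-coefficient-based divergence used by CFQ."""
    p, q = Counter(train), Counter(test)
    n_p, n_q = sum(p.values()), sum(q.values())
    coeff = sum((p[c] / n_p) ** alpha * (q[c] / n_q) ** (1 - alpha)
                for c in set(p) | set(q))
    return 1.0 - coeff

# Same atoms in both splits -> low atom divergence (alpha = 0.5) ...
atoms_train = ["director", "producer", "actor", "director"]
atoms_test  = ["director", "producer", "actor", "producer"]
# ... but disjoint compounds -> compound divergence of 1.0 (alpha = 0.1).
compounds_train = ["director+producer", "director+actor"]
compounds_test  = ["producer+actor"]

print(round(divergence(atoms_train, atoms_test, alpha=0.5), 3))  # 0.043
print(divergence(compounds_train, compounds_test, alpha=0.1))    # 1.0
```

The asymmetric α = 0.1 for compounds deliberately penalizes test compounds that are well covered by training more than the reverse, which is the property an MCD split maximizes.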

CFQ Results and the Generalization Gap

| Model | MCD1 | MCD2 | MCD3 | Average |
|-------|------|------|------|---------|
| Seq2Seq (LSTM) | 28.9% | 5.0% | 10.8% | 14.9% |
| Transformer | 34.9% | 8.2% | 10.6% | 17.9% |
| BERT fine-tuned | 42.0% | 9.6% | 14.3% | 22.0% |
| T5 large | 62.0% | 30.1% | 31.2% | 41.1% |
| Compositional Struct. (~2023) | 81.0% | 51.0% | 60.0% | 64.0% |
| Human equivalent | ~97%+ | ~97%+ | ~97%+ | ~97%+ |

The dramatic drop from random split (~97%) to MCD splits (~14-40%) demonstrates that standard models are "memorizing compounds, not learning rules."
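The accuracies above are exact-match: a prediction scores only if the whole SPARQL query matches the gold query. A minimal scoring sketch; the clause-order normalization here is an assumption for illustration (CFQ's official evaluation applies its own canonicalization), and it assumes clauses are separated by " . ":

```python
def normalize(sparql: str) -> str:
    """Sort WHERE-clause triples so queries that differ only in
    clause order compare equal (assumed, simplified canonicalization;
    clauses are assumed to be separated by ' . ')."""
    head, body = sparql.split("{", 1)
    body = body.rsplit("}", 1)[0]
    clauses = sorted(c.strip() for c in body.split(" . ") if c.strip())
    return head.strip() + " { " + " . ".join(clauses) + " }"

def exact_match_accuracy(preds, golds):
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

gold = "SELECT ?x WHERE { ?f ns:producer ns:M0 . ?f ns:director ?x }"
pred = "SELECT ?x WHERE { ?f ns:director ?x . ?f ns:producer ns:M0 }"
print(exact_match_accuracy([pred], [gold]))  # 1.0 (order-insensitive match)
```

Exact match makes the metric unforgiving: a single wrong or missing clause in a multi-join query counts as a complete failure, which is part of why MCD accuracies collapse so sharply.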

Why CFQ Matters

- Semantic Parsing Reliability: NL-to-SQL, NL-to-SPARQL, and NL-to-API systems deployed in production will encounter queries that combine predicates in novel ways. CFQ measures whether the underlying model will generalize or fail.
- Knowledge Graph QA: As KGs (Wikidata, Freebase, corporate knowledge graphs) become key AI infrastructure, CFQ evaluates whether neural semantic parsers can reliably translate complex natural language queries into correct graph traversals.
- Evaluation Methodology Contribution: The MCD split methodology is reusable — it can be applied to any semantic parsing dataset to create meaningful compositional generalization benchmarks.
- Pretraining Inefficiency: CFQ showed that massive pretrained language models (BERT, T5) still fail dramatically on compositional generalization — pretraining alone does not solve compositionality.
- Architecture Direction: CFQ results motivated LEAR, Compositional Transformers, and grammar-augmented models specifically designed to disentangle primitive representations from compositional rules.
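The reusable-methodology point can be made concrete with a toy split search: hill-climb a random train/test partition toward higher compound divergence by swapping examples across the boundary. This is a deliberately simplified sketch, not the paper's algorithm, which also constrains atom divergence to stay near zero:

```python
import random
from collections import Counter

def compound_divergence(train, test, alpha=0.1):
    # Chernoff-coefficient divergence over compound frequencies.
    p, q = Counter(train), Counter(test)
    n_p, n_q = sum(p.values()), sum(q.values())
    return 1.0 - sum((p[c] / n_p) ** alpha * (q[c] / n_q) ** (1 - alpha)
                     for c in set(p) | set(q))

def greedy_mcd_split(compounds, test_frac=0.2, iters=500, seed=0):
    """Hill-climb a train/test split toward maximal compound
    divergence by swapping one example across the boundary per step
    (toy version; the real MCD procedure also balances atoms)."""
    rng = random.Random(seed)
    idx = list(range(len(compounds)))
    rng.shuffle(idx)
    k = max(1, int(len(idx) * test_frac))
    test, train = idx[:k], idx[k:]

    def score(tr, te):
        return compound_divergence([compounds[i] for i in tr],
                                   [compounds[i] for i in te])

    best = score(train, test)
    for _ in range(iters):
        i, j = rng.randrange(len(train)), rng.randrange(len(test))
        train[i], test[j] = test[j], train[i]      # try a swap
        s = score(train, test)
        if s > best:
            best = s                               # keep improvement
        else:
            train[i], test[j] = test[j], train[i]  # revert
    return train, test, best

# Toy usage: each string is an example's predicate-combination signature.
data = ["d+p", "d+p", "d+a", "d+a", "p+a", "p+a", "d+p+a", "d+p+a"]
tr, te, div = greedy_mcd_split(data, test_frac=0.25)
print(round(div, 3))
```

Applied to any dataset with a compound signature per example (SQL join patterns, API call chains), the same search yields a compositional-generalization split in the spirit of MCD.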

Extensions

- ATIS-CFQ: Applying MCD splits to the classic ATIS flight booking SQL dataset.
- GeoQuery-CFQ: MCD evaluation on the geographic QA-to-SQL benchmark.
- CodeCFQ: Extending MCD splits to code generation tasks.

Comparison to COGS and SCAN

| Benchmark | Output | Graph/DB Coverage | Compound Type | Scale |
|-----------|--------|------------------|--------------|-------|
| SCAN | Action sequences | None | Verb+adverb | 20k |
| COGS | λ-calculus | None | Syntactic roles | 24k |
| CFQ | SPARQL | Freebase (large KB) | Multi-join query patterns | 239k |

CFQ is SPARQL composition for real-world knowledge graphs: it measures whether a model can parse complex natural language questions into database queries by combining learned predicate primitives in novel ways, and its MCD split methodology provides the most rigorous framework available for evaluating compositional generalization in semantic parsing.
