COGS (Compositional Generalization Challenge based on Semantic Interpretation)

Keywords: cogs, compositional generalization, semantic parsing, evaluation

COGS (Compositional Generalization Challenge based on Semantic Interpretation) is a semantic parsing benchmark for testing systematic compositional generalization. It maps English sentences to logical form representations (a lambda-calculus-inspired notation), using controlled splits that hold out specific lexical and structural combinations to measure whether models genuinely learn reusable syntactic and semantic rules or merely memorize training instances.

What Is COGS?

- Origin: Kim & Linzen (2020), motivated by formal theories of compositional semantics (Montague grammar).
- Task: Map English sentences to lambda-calculus-style logical forms, in which variable indices track token positions and definite NPs are fronted with a `*` marker (a minimal data-loading sketch follows this list).
- "The hedgehog ate the cake." → *hedgehog(x_1); *cake(x_4); eat.agent(x_2, x_1) AND eat.theme(x_2, x_4)
- "The girl was helped by the teacher." → *girl(x_1); *teacher(x_6); help.theme(x_3, x_1) AND help.agent(x_3, x_6) (passive)
- Scale: 24,155 training examples and a 21,000-example generalization set spanning 21 conditions (1,000 examples per condition).
- Coverage: Active/passive voice, dative alternation, PP modification, and recursive (CP and PP) embedding.
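
For concreteness, here is a minimal sketch of reading the COGS data files. It assumes the three-column TSV layout of the official release (sentence, logical form, and generalization type, tab-separated); the file path is a placeholder for wherever the split was downloaded.

```python
import csv
from collections import Counter

def load_cogs(path):
    """Read one COGS split (train.tsv / test.tsv / gen.tsv) into dicts.

    Assumes the three-column TSV layout of the official release:
    source sentence <TAB> target logical form <TAB> generalization type.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"source": src, "target": tgt, "type": gen_type}
            for src, tgt, gen_type in csv.reader(f, delimiter="\t")
        ]

# Count how many generalization-set examples each condition contributes
# ("gen.tsv" is a placeholder path for the downloaded generalization split).
gen = load_cogs("gen.tsv")
print(Counter(ex["type"] for ex in gen).most_common(5))
```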

The Generalization Conditions

COGS tests 21 distinct generalization types:

Lexical Generalization (18 conditions):
- A noun that only appeared as a subject in training appears as an object in test.
- A verb that only appeared in active voice in training appears in passive voice in test.
- A proper name that appeared in one syntactic role (subject) appears in another (indirect object).

Structural Generalization (3 conditions):
- Train on PPs that modify objects, test on PPs that modify subjects: "The cake on the table burned."
- Train on sentential-complement embedding up to depth 2, test on deeper nesting: "Emma said that Noah knew that the cat danced."
- Train on shallow PP recursion, test on longer chains: "Ava saw the ball in the bottle on the table." (see the depth-generation sketch after this list)
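
Depth generalization is easy to visualize by generating the test-time structures programmatically. A minimal sketch, using placeholder vocabulary rather than the exact COGS grammar:

```python
def pp_recursion(depth):
    """Build a sentence whose object NP nests `depth` PP modifiers.

    Illustrative only: the nouns and verb are placeholders, but the structure
    mirrors the COGS PP depth-generalization case (train shallow, test deep).
    """
    nouns = ["ball", "bottle", "table", "tray", "box"]
    np = f"the {nouns[0]}"
    for i in range(1, depth + 1):
        np += f" on the {nouns[i % len(nouns)]}"
    return f"Emma saw {np}."

for d in (1, 3):
    print(f"depth {d}: {pp_recursion(d)}")
# depth 1: Emma saw the ball on the bottle.
# depth 3: Emma saw the ball on the bottle on the table on the tray.
```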

The Core Claim

Compositional competence theoretically requires two types of generalization:

1. Lexical Generalization: Interpreting a known word in a novel role, e.g. mapping "dax" in "The dax was eaten" to dax(x) even though "dax" never appeared as the theme of a verb in training.
2. Structural Generalization: Parsing a construction deeper than anything seen in training, e.g. "Emma knew that Noah said that the hedgehog ate the cake," by applying known rules recursively (the toy grammar after this list makes the rule-reuse idea concrete).
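
The rule-reuse claim can be made concrete with a toy symbolic grammar. The fragment below is an illustrative assumption (a four-word lexicon and transitive-only syntax, not the actual COGS grammar): because its noun rule is role-independent, any known noun can fill the subject or object slot, including pairings never seen together.

```python
# Toy compositional fragment: role-independent lexical rules generalize
# to unseen noun-role pairings by construction. Lexicon and verb lemmas
# are illustrative assumptions, not the COGS grammar.
LEXICON = {"hedgehog": "hedgehog", "cake": "cake", "girl": "girl", "dax": "dax"}
VERBS = {"ate": "eat", "helped": "help"}

def parse(sentence):
    """Map 'the <noun> <verb> the <noun> .' to a COGS-style logical form."""
    toks = sentence.rstrip(" .").split()
    assert toks[0] == "the" and toks[3] == "the", "toy grammar: transitive only"
    subj, verb, obj = LEXICON[toks[1]], VERBS[toks[2]], LEXICON[toks[4]]
    # Variable indices follow the COGS convention: token position in the sentence.
    return (f"*{subj}(x_1); *{obj}(x_4); "
            f"{verb}.agent(x_2, x_1) AND {verb}.theme(x_2, x_4)")

print(parse("the hedgehog ate the cake ."))
print(parse("the dax helped the girl ."))  # unseen role pairing, parsed anyway
```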

Why Models Fail COGS

- Role-Specific Representations: Standard transformers learn "hedgehog" → {subject role features} and struggle to apply "hedgehog" as an object. True compositionality requires role-independent lexical representations.
- Depth Generalization: Train on depth-1 relative clauses, fail on depth-3 — same pattern as CLUTRR (length generalization), but in syntactic recursion rather than factual chains.
- Training Bias: The training distribution heavily over-represents simple active declarative sentences. Passive, recursive, and PP-modified forms are rarer, so a purely statistical learner under-encodes the rules that govern them (a rough check appears in the sketch below).
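
That skew can be checked directly against the training file. A rough sketch, reusing the load_cogs helper above and treating " was "/" were " as a passive cue (a crude heuristic, not a real parser):

```python
train = load_cogs("train.tsv")  # placeholder path; helper defined above
passive = sum((" was " in ex["source"]) or (" were " in ex["source"])
              for ex in train)
print(f"passive-cue sentences: {passive} / {len(train)} "
      f"({100 * passive / len(train):.1f}%)")
```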

Performance Results

| Model | Lexical Generalization | Structural Generalization | Overall |
|-------|----------------------|--------------------------|---------|
| LSTM seq2seq | ~65% | ~18% | ~35% |
| Transformer | ~75% | ~26% | ~45% |
| Pretrained BART | ~82% | ~41% | ~59% |
| LEAR (specialized) | ~97% | ~78% | ~85% |
| GPT-4 + CoT | ~92% | ~70% | ~82% |
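
The standard COGS metric behind these numbers is exact-match accuracy on the output logical form, reported per condition. A minimal sketch of that aggregation, assuming the load_cogs dicts from above and a parallel list of model output strings (predictions is a hypothetical name):

```python
from collections import defaultdict

def per_condition_accuracy(examples, predictions):
    """Exact-match accuracy per generalization condition."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        totals[ex["type"]] += 1
        hits[ex["type"]] += int(pred.strip() == ex["target"].strip())
    return {cond: hits[cond] / totals[cond] for cond in totals}
```

Per-condition reporting matters because an overall average hides exactly the lexical/structural gap visible in the table above.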

Why COGS Matters

- Formal Linguistic Grounding: Unlike SCAN (toy action commands), COGS uses realistic English grammar and targets logical form representations directly relevant to knowledge graph population, question answering, and text-to-database interfaces.
- Semantic Parsing Implications: COGS failure suggests that standard seq2seq models trained for NL→SQL generation are likely to break on sentences with novel syntactic structures, a critical reliability concern for text-to-database products.
- Cognitive Science Connection: COGS's generalization conditions map directly onto tests used in psycholinguistics to measure human compositional competence — enabling AI-human comparison.
- Transformer Architecture Insight: COGS results show that transformer attention heads can capture local dependencies well but struggle with long-distance structural dependencies — directly informing architectural improvements.

Connection to SCAN, CFQ, and gSCAN

| Benchmark | Modality | Output Type | Generalization Split Design |
|-----------|---------|------------|---------------------------|
| SCAN | Language | Action sequences | Lexical holdout (verb) |
| gSCAN | Language+Vision | Navigation actions | Concept combination |
| COGS | Language | Logical forms (λ-calculus) | Lexical + structural |
| CFQ | Language | SPARQL queries | Compound structure |

COGS stress-tests the syntax of meaning: it uses formal linguistic methods to determine whether AI models have internalized the syntactic rules that generate natural language structure, or have merely learned statistical co-occurrence patterns that collapse when presented with novel but grammatically valid constructions.
