gSCAN (grounded SCAN) is a benchmark for systematically testing compositional generalization in visually grounded instruction following. It places an agent in a grid world where it must execute commands like "walk to the small red circle," with test splits specifically designed so that novel concept combinations (e.g., "yellow circle" when yellow objects and circles were trained on separately) reveal whether the model truly understands each concept independently or merely memorizes training pairs.
What Is gSCAN?
- Origin: Developed by Ruis et al. (2020), extending the SCAN benchmark with visual grounding.
- Grid World: 6×6 grid containing colored shapes (circles, squares, cylinders) in multiple sizes (small, medium, large).
- Commands: Natural language instructions like "push the small red square cautiously" → action sequence in the grid world.
- Compositional Structure: Commands combine a verb (walk/push/pull), adverb (cautiously/hesitantly), size adjective, color adjective, and shape noun — allowing systematic manipulation of concept combinations.
- Scale: ~867,000 training examples; 6 test splits targeting different generalization conditions.
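Concretely, each gSCAN-style example pairs a command with a full scene description and a gold action sequence. The sketch below is illustrative only: the class and field names are ours, not the benchmark's released data format.

```python
from dataclasses import dataclass

# Illustrative schema (not the official gSCAN release format): one grounded
# example = tokenized command + scene state + gold action sequence.
@dataclass(frozen=True)
class GridObject:
    shape: str   # "circle" | "square" | "cylinder"
    color: str   # e.g. "red", "yellow", "blue"
    size: int    # discrete size level
    row: int     # 0-5 on the 6x6 grid
    col: int

@dataclass(frozen=True)
class Example:
    command: tuple[str, ...]         # tokenized instruction
    objects: tuple[GridObject, ...]  # full scene
    agent_pos: tuple[int, int]
    target_actions: tuple[str, ...]  # gold low-level action sequence

ex = Example(
    command=("walk", "to", "the", "small", "red", "circle"),
    objects=(GridObject("circle", "red", 1, 2, 3),
             GridObject("square", "blue", 3, 5, 0)),
    agent_pos=(0, 0),
    target_actions=("walk", "walk", "turn right", "walk", "walk", "walk"),
)
print(len(ex.target_actions))  # 6
```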
The 6 Generalization Splits
Split A — Random: Standard train/test split with no held-out combinations. Establishes an in-distribution baseline; scores here show the best a model could hope for on the harder splits.
Split B — Yellow Circles: Yellow objects and circles each appear in training, but never together as a yellow circle. Test commands reference "yellow circle" — testing attribute composition.
Split C — Red Squares: Same construction as B with a different held-out color-shape combination.
Split D — Novel Direction: The agent always starts facing south in training. Test has the agent facing north, east, or west — tests direction invariance.
Split E — Relative Clause: Commands with relative clauses ("push the circle to the right of the square") are held out from training.
Split F — Class Label Consistency: Objects of a specific class appear consistently on one side of the grid in training. Tests whether models exploit positional shortcuts rather than object identity.
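The compositional holdouts (B and C) can be sketched as a filtering rule over commands: any example combining both held-out attributes goes to test, while each attribute alone stays in training. The function name and token-level matching here are our illustration, not the benchmark's actual split-generation code.

```python
# Sketch of a compositional holdout like Split B: commands combining both
# held-out words go to test; "yellow" alone or "circle" alone stays in
# training, so the model sees each concept but never their combination.
def split_compositional(commands, held_out=("yellow", "circle")):
    train, test = [], []
    for command in commands:
        tokens = set(command.split())
        if all(word in tokens for word in held_out):
            test.append(command)   # novel combination: test only
        else:
            train.append(command)  # single attributes: training
    return train, test

commands = [
    "walk to the yellow square",
    "push the red circle",
    "walk to the yellow circle cautiously",
]
train, test = split_compositional(commands)
print(test)  # ['walk to the yellow circle cautiously']
```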
gSCAN Results Across Models
| Model | Split A | Split B (yellow circle) | Split D |
|-------|---------|------------------------|---------|
| Seq2Seq + attention | ~98% | ~15% | ~15% |
| Compositional Model | ~98% | ~83% | ~91% |
| GPT-4 (zero-shot) | ~75% | ~52% | ~63% |
The catastrophic failure on Split B (yellow circle) — a combination trivially understood by humans — is gSCAN's central finding.
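gSCAN-style evaluation is typically scored by exact match over the whole predicted action sequence, which makes compositional errors especially punishing: one wrong action fails the example. A minimal scorer (names are ours) looks like:

```python
# Exact-sequence-match accuracy: a prediction counts only if every action
# in the sequence matches the gold sequence. (Illustrative sketch.)
def exact_match(predictions, references):
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

preds = [["walk", "walk"], ["walk", "turn left"], ["push"]]
golds = [["walk", "walk"], ["walk", "turn right"], ["push"]]
print(f"{exact_match(preds, golds):.2%}")  # 66.67%
```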
Why gSCAN Matters
- Visual Compositionality: Combining a color and a shape should not require seeing the specific color-shape combination during training. gSCAN quantifies how far neural models fall short of this intuitive requirement.
- Grounding vs. Language-Only: Unlike SCAN (text-only), gSCAN grounds language in actual visual scenes, connecting the compositionality problem to robotics and embodied AI.
- Robotics Transfer: A household robot given "pick up the blue mug" when it only trained on "pick up the blue plate" and "pick up the red mug" should generalize. gSCAN measures this capacity.
- Shortcut Detection: The positional-bias split (F) reveals that models will exploit non-semantic regularities (objects are always on the left in training) rather than learning the underlying compositional semantics.
- Architecture Motivation: gSCAN failure drove development of modular networks, disentangled representation learning, and structured prediction architectures that explicitly separate attribute and relation representations.
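The positional shortcut mentioned above has a simple diagnostic: if object class can be predicted from grid position alone, a model can score well without ever grounding the class word. The majority-vote probe below is our own illustration on hypothetical data, not a procedure from the gSCAN paper.

```python
from collections import Counter, defaultdict

def positional_probe(scenes):
    """scenes: list of (col, class_label) pairs. Returns accuracy of a
    majority-vote predictor that sees only which half of a 6-wide grid
    the object occupies. High accuracy = exploitable positional shortcut."""
    by_side = defaultdict(Counter)
    for col, label in scenes:
        by_side["left" if col < 3 else "right"][label] += 1
    # Majority vote per side: count how many objects the vote gets right.
    correct = sum(counts.most_common(1)[0][1] for counts in by_side.values())
    return correct / len(scenes)

# Biased layout (circles always left, squares always right): probe is perfect,
# so position alone gives away the class.
biased = [(0, "circle"), (1, "circle"), (4, "square"), (5, "square")]
print(positional_probe(biased))  # 1.0
```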
Comparison to SCAN and COGS
| Benchmark | Grounded | Vision | Instruction Type | Size (examples) |
|-----------|---------|--------|-----------------|------|
| SCAN | No | No | Action sequences | 20k |
| gSCAN | Yes | Grid world | Navigation + manipulation | 867k |
| COGS | No | No | Semantic parsing (logical forms) | 24k |
gSCAN is the unobserved-combination test for embodied AI: it measures whether an agent that has learned "yellow objects" and "circles" separately can immediately understand instructions involving "yellow circles." In doing so, it directly probes the compositional generalization gap that separates human-like concept formation from statistical pattern matching in grounded neural agents.