Winogrande

Keywords: winogrande, evaluation

Winogrande is the large-scale, adversarially filtered commonsense reasoning benchmark — a 44,000-example successor to the Winograd Schema Challenge (WSC) that was specifically designed to eliminate the annotation artifacts and statistical shortcuts that allowed models to achieve high scores on the original WSC without genuine commonsense reasoning.

The Original Winograd Schema Challenge

The Winograd Schema Challenge (WSC), proposed by Levesque et al. (2011), was designed as an alternative to the Turing Test. Each schema presents a sentence with an ambiguous pronoun that can only be resolved through commonsense reasoning:

"The trophy didn't fit in the suitcase because it was too big. What was too big?" → The trophy (not the suitcase).
"The trophy didn't fit in the suitcase because it was too small. What was too small?" → The suitcase (not the trophy).

Resolving the pronoun correctly requires recognizing that "too big" must describe the object being placed inside (the trophy), while "too small" must describe the container (the suitcase): a subtle inference that depends on world knowledge about spatial containment and relative size.

The original WSC had only 273 examples — far too small for training neural networks and susceptible to memorization. More critically, models achieved high WSC accuracy by exploiting simple word co-occurrence statistics in training data rather than genuine reasoning.

Winogrande's Design Innovations

Scale: 44,000 examples created through crowdsourcing on Amazon Mechanical Turk — 160x more examples than the original WSC, enabling both training and evaluation at scale.

Fill-in-the-Blank Format: Unlike WSC (which asks which noun an ambiguous pronoun refers to), Winogrande presents a sentence with a blank and two candidate noun phrases; items are written as twin pairs that differ in a single trigger word, flipping the correct answer:
"Sarah was a much better athlete than Mary, so [_] often asked for advice."
Choices: (a) Sarah (b) Mary
Correct: (b) Mary, because the weaker athlete would ask the better one for advice, not the reverse.
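For concreteness, the sketch below shows one way to represent a single item and substitute each candidate into the blank. The field names (sentence, option1, option2, answer) are illustrative assumptions for this sketch, not an official schema.

```python
# Illustrative representation of one Winogrande-style item.
# Field names are assumptions for this sketch, not an official schema.
example = {
    "sentence": "Sarah was a much better athlete than Mary, so _ often asked for advice.",
    "option1": "Sarah",
    "option2": "Mary",
    "answer": 2,  # the weaker athlete (Mary) asks the better one for advice
}

def fill_blank(sentence: str, option: str) -> str:
    """Substitute a candidate noun phrase into the blank."""
    return sentence.replace("_", option, 1)

# Each item yields two complete candidate sentences to compare.
candidates = [fill_blank(example["sentence"], example["option1"]),
              fill_blank(example["sentence"], example["option2"])]
```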

AFLite (Adversarial Filtering Lite): The key innovation. After crowdsourcing 60,000+ raw examples, AFLite automatically identifies and removes examples that an ensemble of simple classifiers (lightweight linear models trained on precomputed feature representations) can answer reliably. Only examples that survive this filtering, those that resist statistical shortcuts and demand genuine reasoning, remain in the final dataset.

AFLite process (a simplified sketch appears after this list):
1. Train many simple linear classifiers on feature representations of the examples, each on a different random subset.
2. Score each example by how often the classifiers predict it correctly when it is held out of training; highly predictable examples are the ones exploitable through surface statistics.
3. Remove the most predictable examples and repeat, until the remaining set can no longer be solved reliably by such simple models.
4. Final dataset: ~44,000 examples on which these statistical shortcuts fail.
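The loop below is a simplified sketch of AFLite-style filtering, assuming each example comes with a precomputed feature embedding. The ensemble size, threshold, and stopping rule are illustrative placeholders rather than the published hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(X, y, n_partitions=64, train_frac=0.8,
                  tau=0.75, cutoff=500, min_size=10000, seed=0):
    """Simplified AFLite-style filtering (illustrative settings).

    X: precomputed feature embeddings, shape (n, d)
    y: binary labels (which of the two options is correct), shape (n,)
    Returns the indices of examples that survive filtering.
    """
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))

    while len(keep) > min_size:
        correct = np.zeros(len(keep))   # times predicted correctly when held out
        counted = np.zeros(len(keep))   # times the example was held out

        for _ in range(n_partitions):
            # Random train/held-out split over the currently kept examples.
            perm = rng.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            tr, ho = perm[:n_train], perm[n_train:]

            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[keep[tr]], y[keep[tr]])
            preds = clf.predict(X[keep[ho]])

            correct[ho] += (preds == y[keep[ho]])
            counted[ho] += 1

        # Predictability score: how often a simple linear model gets the
        # example right without ever training on it.
        score = np.divide(correct, np.maximum(counted, 1))

        # Remove the most predictable (easiest) examples above the threshold.
        easy = np.argsort(-score)[:cutoff]
        easy = easy[score[easy] > tau]
        if len(easy) == 0:
            break                        # nothing left to filter
        keep = np.delete(keep, easy)

    return keep
```

The key design point is that predictability is always measured on held-out examples, so an instance is discarded only when simple models can solve it without ever having trained on it.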

Task Format and Evaluation

- Input: Sentence with a blank (_) and two noun phrase choices.
- Output: Select the choice that correctly fills the blank based on commonsense inference.
- Metric: Binary accuracy (random baseline: 50%); see the scoring sketch after this list.
- Human performance: ~94% accuracy (crowdworkers who did not create the examples).
- Dataset splits: Training sets of various sizes (xs: 160, s: 640, m: 2,558, l: 10,234, xl: 40,398) to study data efficiency.
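Large language models are commonly scored on Winogrande by filling the blank with each option and comparing the likelihood the model assigns to the two resulting sentences. The sketch below, using the Hugging Face transformers library, illustrates this approach; the model choice (gpt2) is an arbitrary placeholder, and real evaluation harnesses differ in details such as length normalization or scoring only the tokens after the blank.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM can be scored the same way.
MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to a complete sentence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels=input_ids, the model returns the mean cross-entropy over
    # predicted tokens; multiply back to recover a total log-likelihood.
    loss = model(input_ids=ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

def predict(example: dict) -> int:
    """Return 1 or 2 for whichever option yields the more likely sentence."""
    filled = [example["sentence"].replace("_", opt, 1)
              for opt in (example["option1"], example["option2"])]
    scores = [sentence_logprob(s) for s in filled]
    return 1 if scores[0] >= scores[1] else 2

def accuracy(examples: list) -> float:
    """Fraction of items answered correctly; random guessing sits at 50%."""
    hits = sum(predict(ex) == ex["answer"] for ex in examples)
    return hits / len(examples)
```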

Benchmark Results and Scaling

| Model | Winogrande Accuracy |
|-------|-------------------|
| BERT-large | 73.9% |
| RoBERTa-large | 79.1% |
| GPT-3 (0-shot) | 70.2% |
| GPT-3 (few-shot) | 77.7% |
| UnifiedQA-11B | 84.9% |
| Human | 94.1% |

The persistent gap between model and human performance (even for very large models) demonstrates that Winogrande's adversarial filtering successfully created examples that require genuine reasoning.

What Winogrande Tests

Winogrande examples cluster into several commonsense categories (the illustrations below are abbreviated; actual items name both candidate referents in the sentence):
- Social and Motivational: "Because [_] was nervous, they spoke softly at the party." Requires understanding social dynamics.
- Physical: "The vase fell off the shelf because [_] was fragile." Physical causality.
- Causal: "The car started after [_] put in the key." Causal sequences.
- Comparative: "Amy is shorter than Beth, so [_] can fit in the small car more easily." Comparative reasoning.

AFLite and the Shortcut Learning Problem

Winogrande's most important contribution may be methodological: demonstrating that adversarial dataset filtering is a practical tool for creating harder, more genuine reasoning benchmarks. The AFLite algorithm showed:

- Standard crowdsourced datasets inevitably contain exploitable annotation artifacts.
- Simple classifiers can identify and remove these artifacts automatically.
- Models trained on AFLite-filtered data generalize better to novel examples than models trained on unfiltered data.

AFLite's approach has been applied to create harder variants of other benchmarks, making the methodology broadly influential beyond Winogrande itself.

Winogrande in the Context of Larger Benchmarks

Winogrande is included in:
- BIG-Bench: As one of 204 challenging tasks.
- SuperGLUE-inspired evaluations: Commonsense reasoning track.
- LLM evaluation suites: Standard component of evaluating GPT-4, Claude, Llama, and Gemini capabilities.

Winogrande is the adversarially hardened reasoning test — a fill-in-the-blank benchmark that uses automated filtering to eliminate statistical shortcuts, ensuring that high performance requires genuine commonsense inference rather than the exploitation of dataset-construction artifacts that plagued earlier WSC evaluations.
