Winogrande

Winogrande is a large-scale, adversarially filtered commonsense reasoning benchmark: a roughly 44,000-example successor to the Winograd Schema Challenge (WSC), designed to eliminate the annotation artifacts and statistical shortcuts that let models score well on the original WSC without genuine commonsense reasoning.

The Original Winograd Schema Challenge

The Winograd Schema Challenge (WSC), proposed by Levesque et al. (2011), was designed as an alternative to the Turing Test. Each schema presents a sentence with an ambiguous pronoun that can only be resolved through commonsense reasoning:

"The trophy didn't fit in the suitcase because it was too big. What was too big?" → The trophy (not the suitcase).
"The trophy didn't fit in the suitcase because it was too small. What was too small?" → The suitcase (not the trophy).

The correct resolution requires knowing that "too big" points to the object being packed (the trophy), while "too small" points to the container (the suitcase). Changing a single word flips the answer, so the inference depends on world knowledge about spatial containment rather than on surface word patterns.

The original WSC had only 273 examples — far too small for training neural networks and susceptible to memorization. More critically, models achieved high WSC accuracy by exploiting simple word co-occurrence statistics in training data rather than genuine reasoning.

Winogrande's Design Innovations

Scale: 44,000 examples created through crowdsourcing on Amazon Mechanical Turk — 160x more examples than the original WSC, enabling both training and evaluation at scale.

Two-Option Blank Format: Unlike WSC (which asks "what does the pronoun refer to?"), Winogrande uses a fill-in-the-blank format with a single blank and two candidate fillers:

"Sarah was a much better athlete than Mary, so [_] often asked for advice."
Choices: (a) Sarah (b) Mary
Correct: (b) Mary, because the weaker athlete is the one who asks the better athlete for advice, not the reverse.
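The format above can be sketched in a few lines of Python. This is an illustration only: the record layout, the field names, and the helpers `fill_blank`, `candidates`, and `accuracy` are assumptions made for this example, not the dataset's official schema or tooling. The idea is that each item expands into two complete sentences, and a model is scored on whether it picks the right one.

```python
BLANK = "[_]"  # blank marker as written in the example above

# Hypothetical record layout; field names are illustrative.
example = {
    "sentence": "Sarah was a much better athlete than Mary, so [_] often asked for advice.",
    "options": ("Sarah", "Mary"),
    "answer": 1,  # index into options, i.e. "Mary"
}

def fill_blank(sentence, option):
    """Return the sentence with its single blank replaced by one candidate."""
    if sentence.count(BLANK) != 1:
        raise ValueError("expected exactly one blank")
    return sentence.replace(BLANK, option)

def candidates(item):
    """Expand one item into its two fully written-out sentences."""
    return [fill_blank(item["sentence"], opt) for opt in item["options"]]

def accuracy(items, choose):
    """Fraction of items where choose(candidate_sentences) returns the gold index."""
    correct = sum(choose(candidates(item)) == item["answer"] for item in items)
    return correct / len(items)
```

Here `choose` stands in for any model: it receives the two filled-in sentences and returns 0 or 1. Because every item has exactly two options, a `choose` that guesses at random is expected to score about 50%.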

AFLite (Adversarial Filtering Lite): The key innovation. After crowdsourcing 60,000+ raw examples, AFLite automatically identifies and removes examples on which simple statistical models (an ensemble of linear classifiers trained on precomputed feature representations of the examples) achieve high accuracy. Only examples that survive this filtering, those that resist such statistical shortcuts, remain in the final dataset.

AFLite process:
1. Train multiple simple classifiers on feature representations of all examples.
2. Identify examples where classifiers achieve high agreement (easy examples exploitable by statistics).
3. Remove easy examples iteratively until the remaining set cannot be solved by statistical models above chance.
4. Final dataset: ~44,000 examples where simple shortcuts fail.
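The loop described in these steps can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: a nearest-centroid classifier stands in for the simple classifiers, feature vectors are assumed to be given, and the function name `aflite_sketch` and all parameter values (`n_models`, `train_frac`, `threshold`, `cut`, `min_size`) are assumptions chosen for a toy example.

```python
import random

def fit_centroids(rows):
    """Nearest-centroid model: one mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)

def aflite_sketch(data, n_models=20, train_frac=0.5, threshold=0.75,
                  cut=10, min_size=20, seed=0):
    """Return indices of the (features, label) pairs that survive filtering."""
    rng = random.Random(seed)
    kept = list(range(len(data)))
    while len(kept) > min_size:
        correct = dict.fromkeys(kept, 0)
        seen = dict.fromkeys(kept, 0)
        for _ in range(n_models):
            # Each model trains on a random subset and is checked on the rest.
            shuffled = kept[:]
            rng.shuffle(shuffled)
            split = max(1, int(len(shuffled) * train_frac))
            model = fit_centroids([data[i] for i in shuffled[:split]])
            for i in shuffled[split:]:
                seen[i] += 1
                correct[i] += predict(model, data[i][0]) == data[i][1]
        # Predictability score: share of held-out predictions that were right.
        scores = {i: correct[i] / seen[i] for i in kept if seen[i]}
        easy = sorted((i for i in scores if scores[i] >= threshold),
                      key=lambda i: -scores[i])
        easy = set(easy[:min(cut, len(kept) - min_size)])
        if not easy:
            break  # nothing left that the simple models predict reliably
        kept = [i for i in kept if i not in easy]
    return kept
```

On a toy set where one feature leaks the label, for example `[([1.0 if i % 2 else -1.0], i % 2) for i in range(60)]`, every example is predictable, so the sketch keeps discarding examples until only `min_size` remain. On data whose features carry no label signal, far fewer examples cross the threshold and most of the set survives.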

Task Format and Evaluation

Benchmark Results and Scaling

Model              Winogrande Accuracy
BERT-large         73.9%
RoBERTa-large      79.1%
GPT-3 (0-shot)     70.2%
GPT-3 (few-shot)   77.7%
UnifiedQA-11B      84.9%
Human              94.1%

The persistent gap between model and human performance in this table (about 9 points even for the 11B-parameter UnifiedQA) suggests that Winogrande's adversarial filtering produced examples that are hard to solve through statistical shortcuts alone.
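The size of that gap can be read off the table with a few lines of Python. The numbers below are copied from the table as listed; the variable names are only for illustration.

```python
# Accuracies in percent, exactly as listed in the results table.
results = {
    "BERT-large": 73.9,
    "RoBERTa-large": 79.1,
    "GPT-3 (0-shot)": 70.2,
    "GPT-3 (few-shot)": 77.7,
    "UnifiedQA-11B": 84.9,
}
human = 94.1

# Gap to human accuracy in percentage points, rounded to one decimal.
gaps = {model: round(human - acc, 1) for model, acc in results.items()}
best = max(results, key=results.get)

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:18s} {gap:5.1f} points below human")
```

Since each item has two options, chance accuracy is 50%, so the listed models sit well above chance but between roughly 9 and 24 points below the human figure.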

What Winogrande Tests

Winogrande examples cluster into commonsense categories:

AFLite and the Shortcut Learning Problem

Winogrande's most important contribution may be methodological: demonstrating that adversarial dataset filtering is a practical tool for creating harder, more genuine reasoning benchmarks. The AFLite algorithm showed:

AFLite's approach has been applied to create harder variants of other benchmarks, making the methodology broadly influential beyond Winogrande itself.

Winogrande in the Context of Larger Benchmarks

Winogrande is included in:

Winogrande is the adversarially hardened reasoning test — a fill-in-the-blank benchmark that uses automated filtering to eliminate statistical shortcuts, ensuring that high performance requires genuine commonsense inference rather than the exploitation of dataset-construction artifacts that plagued earlier WSC evaluations.
