SWAG (Situations With Adversarial Generations)

Keywords: swag, evaluation

SWAG (Situations With Adversarial Generations) is a grounded commonsense inference benchmark: a 113,000-example dataset in which a model must predict which of four sentence continuations most plausibly follows a premise drawn from video captions. It is historically significant as the benchmark that BERT effectively solved immediately upon its release in October 2018, demonstrating the transformative power of large-scale pre-training and directly motivating the creation of HellaSwag.

Task Definition

SWAG presents a partial sentence (the "activity context") and asks the model to select the most plausible continuation from four options. Examples come from video caption datasets:

Context: "She pours some oil into a pan and turns the stove on."
Choices:
(a) "She then stirs the oil with a spatula." (Correct)
(b) "She then eats the oil directly." (Wrong)
(c) "She then adds the pan to the oil." (Wrong)
(d) "She then turns off the stove and leaves." (Wrong)

The correct completion describes what physically and temporally follows in the activity sequence. Wrong answers are generated to be superficially plausible but physically, causally, or temporally implausible.
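
The task format above can be sketched as a small data structure plus an argmax over per-choice plausibility scores. This is an illustrative encoding, not the official dataset schema, and the scores below are hypothetical stand-ins for model output:

```python
from dataclasses import dataclass

@dataclass
class SwagExample:
    context: str          # the activity context (partial sentence)
    endings: list[str]    # four candidate continuations
    label: int            # index (0-3) of the correct ending

example = SwagExample(
    context="She pours some oil into a pan and turns the stove on.",
    endings=[
        "She then stirs the oil with a spatula.",
        "She then eats the oil directly.",
        "She then adds the pan to the oil.",
        "She then turns off the stove and leaves.",
    ],
    label=0,
)

def predict(scores: list[float]) -> int:
    """A model assigns one plausibility score per ending; argmax is the prediction."""
    return max(range(len(scores)), key=lambda i: scores[i])

# With hypothetical model scores favoring ending (a), the prediction is correct:
assert predict([2.1, -0.5, -1.3, 0.4]) == example.label
```

Because the answer is a single index, accuracy can be computed automatically with no human grading of free-form text.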

Dataset Construction: Adversarial Filtering

SWAG introduced a pioneering adversarial filtering methodology to avoid the annotation artifacts that plagued earlier commonsense benchmarks:

Step 1 — Activity Caption Collection: Captions from two large video datasets — LSMDC (Large Scale Movie Description Challenge) and ActivityNet Captions — provided grounded activity descriptions with naturally occurring temporal sequences.

Step 2 — Negative Generation: Given a correct continuation, a language model (LSTM-based at the time) generated plausible-sounding but incorrect alternative continuations.

Step 3 — Adversarial Filtering: Train a discriminative classifier on the proposed examples. Remove examples where the classifier easily identifies correct vs. incorrect completions. Only examples that survive — where the classifier cannot distinguish correct from incorrect — remain.

The intuition: if a simple model can distinguish correct from incorrect continuations based on superficial features (word frequency, length, style), human annotators might also be using those features rather than genuine inference. Adversarial filtering forces the remaining examples to require genuine commonsense reasoning.
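
The filtering loop can be sketched as follows. This is a simplified variant that drops whole examples the classifier solves; the actual SWAG procedure instead replaces easy negatives with freshly generated ones across many iterations. The `train_classifier` callable is an assumed interface, not a real API:

```python
import random

def adversarial_filter(examples, train_classifier, n_rounds=5, holdout_frac=0.2):
    """Simplified adversarial filtering: repeatedly train a discriminator on a
    random split and discard held-out examples it classifies correctly.

    `train_classifier` (assumed interface) takes a list of labeled examples
    and returns a predict(example) -> label function.
    """
    survivors = list(examples)
    for _ in range(n_rounds):
        random.shuffle(survivors)
        split = int(len(survivors) * (1 - holdout_frac))
        train, held_out = survivors[:split], survivors[split:]
        predict = train_classifier(train)
        # Keep only held-out examples the classifier gets WRONG;
        # the easy ones it solves are filtered out of the dataset.
        hard = [ex for ex in held_out if predict(ex) != ex["label"]]
        survivors = train + hard
    return survivors
```

Shuffling and re-splitting each round means every example is eventually tested against a discriminator trained on the rest, so only examples no such classifier can solve from surface features survive.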

The BERT Moment

SWAG is historically significant as the benchmark BERT solved before the paper's ink was dry. When Devlin et al. released BERT in October 2018, they evaluated on SWAG as part of the initial paper:

| Model | SWAG Accuracy |
|-------|---------------|
| ESIM + ELMo (prior SOTA) | 59.1% |
| BERT-base | 81.6% |
| BERT-large | 86.3% |
| Human | 88.0% |

BERT-large achieved 86.3%, approaching human performance (88%) with a single fine-tuning pass. The prior state of the art (ESIM + ELMo) had achieved 59.1%, well above the 25% random baseline for a 4-choice task but far from human level. BERT's 27-point jump over the previous best system was among the most dramatic single-model improvements NLP had seen at that time.
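
BERT's SWAG setup encodes each (context, ending) pair separately, then maps each pooled [CLS] representation through a single shared linear layer to one logit, with a softmax taken across the four choices. The mechanics can be shown with a toy NumPy stand-in (random vectors in place of actual BERT encodings; the hidden size here is illustrative, BERT-large's is 1024):

```python
import numpy as np

rng = np.random.default_rng(0)

def multiple_choice_logits(pooled: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """One shared linear layer scores every choice's pooled [CLS] vector;
    returns one logit per choice, shape (num_choices,)."""
    return pooled @ w + b

hidden = 8                              # toy hidden size for illustration
pooled = rng.normal(size=(4, hidden))   # stand-in for 4 pooled [CLS] vectors
w = rng.normal(size=hidden)             # shared scoring weights (toy values)

logits = multiple_choice_logits(pooled, w, 0.0)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the 4 choices
prediction = int(np.argmax(probs))             # index of the predicted ending
```

Because the scoring head is tiny relative to the pre-trained encoder, nearly all of BERT's SWAG performance comes from pre-training rather than task-specific architecture.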

The implication: the adversarial filtering used LSTM-based discriminators. When BERT (a Transformer pre-trained on billions of words) arrived, it could easily learn the residual patterns that the LSTM discriminator missed. SWAG's adversarial filtering was effective against LSTMs but not against BERT.

Why SWAG Was Solved and HellaSwag Was Born

The BERT result revealed a methodological flaw: the adversarial filter must be as strong as the models that will be evaluated on the benchmark. SWAG used LSTMs for filtering; BERT-era Transformers saw through the remaining patterns immediately.

Zellers et al. created HellaSwag (2019) to fix this:
- Used BERT itself as the adversarial discriminator to filter training examples.
- Generated longer, more detailed wrong continuations using a fine-tuned GPT model.
- Achieved 95%+ human accuracy while reducing BERT-large to roughly 47% on HellaSwag's test set, less than half the human level.
- HellaSwag proved that adversarial filtering with strong-enough discriminators creates genuinely hard examples.

SWAG's Enduring Contributions

Despite being quickly solved, SWAG made lasting contributions to NLP:

Benchmark Construction Methodology: Introduced adversarial filtering as a principled technique for benchmark construction, directly inspiring HellaSwag, Winogrande, and AFLite. The core idea — use a model to remove easy examples — became standard practice.

Grounded Commonsense: Established that video captions provide rich, naturalistic sources for activity-sequence commonsense reasoning, grounded in real-world physical and temporal regularities.

Four-Choice Format: Popularized the four-choice format for commonsense inference evaluation, enabling easy automatic scoring without human evaluation of free-form answers.

Scaling Revelation: SWAG's rapid saturation was one of the clearest demonstrations that pre-training scale was the key variable in NLP performance — more predictive than architectural innovations or task-specific engineering.

SWAG in the Evaluation Ecosystem

SWAG is included in many LLM evaluation suites as a historical reference point and for tracking how smaller models perform on commonsense tasks that larger models have saturated. It is often reported alongside HellaSwag to illustrate the difficulty spectrum and the progress of model scaling.

SWAG is the benchmark BERT broke in 2018: a commonsense inference dataset that documented one of the most dramatic benchmark saturation events in NLP history, directly motivating the adversarially hardened HellaSwag and establishing that benchmark difficulty must scale with model capability.
