Contamination
Benchmark contamination occurs when test data appears in a model's training set, inflating evaluation scores and producing misleading performance claims. This form of data leakage makes models appear more capable than they actually are.

Contamination sources include web scraping that captures benchmark datasets, training on data dumps that contain test sets, and temporal leakage, where future data leaks into the training set. Detection methods include n-gram overlap analysis, which checks for exact matches; embedding similarity, which finds near-duplicates; and manual inspection. Mitigation strategies include careful data filtering, temporal splits that ensure training data predates test data, and held-out private test sets.

Contamination is particularly problematic for language models trained on web-scale data, where test sets may be included inadvertently. It undermines trust in benchmarks and makes it difficult to compare models fairly. Best practices include documenting data sources, deduplicating training data, and evaluating on multiple diverse benchmarks. Contamination checking should be standard practice when evaluating models, especially on public benchmarks.