Data poisoning is an adversarial attack that corrupts machine learning models by injecting malicious examples into their training data. It exploits the dependence of ML systems on training data integrity to degrade model performance, embed backdoors, or steer predictions toward attacker-chosen targets, and it does not require access to the model at deployment time.
What Is Data Poisoning?
- Definition: An adversary with write access to the training data (or the ability to influence what data is collected) injects crafted malicious examples that cause the trained model to behave in attacker-desired ways — degrading accuracy, creating backdoors, or causing targeted misclassifications.
- Attack Surface: Training data collection via web scraping, crowdsourced labeling platforms (Amazon Mechanical Turk), public datasets, federated learning data contributions, or data marketplaces — any untrusted data source is a potential poisoning vector.
- Distinction from Adversarial Examples: Adversarial examples attack models at inference time. Data poisoning attacks models at training time — corrupting the model itself rather than individual inputs.
- Scale of Threat: LAION-5B (used to train Stable Diffusion and OpenCLIP models) contains billions of image-text pairs scraped from the public internet. An adversary who can host images and control the associated text can influence model training at scale.
Types of Data Poisoning Attacks
Availability Attacks (Denial of Service):
- Goal: Degrade overall model accuracy on clean test data.
- Method: Inject randomly labeled or adversarially crafted examples.
- Indiscriminate — reduces model utility for all users.
- Easiest to detect, provided a clean validation set is available (validation accuracy drops).
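The simplest form of the "randomly labeled examples" method is label flipping, which is also a common way to simulate an availability attack when stress-testing a training pipeline. The sketch below is illustrative only: `flip_labels` and its parameters are hypothetical names, and it assumes labels are integers in `range(num_classes)`.

```python
import random

def flip_labels(labels, fraction, num_classes, seed=0):
    """Return a copy of `labels` with a random `fraction` moved to a different class."""
    rng = random.Random(seed)
    poisoned = list(labels)
    n_flip = round(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), n_flip):
        # Choose any class except the true one, so every selected label really changes.
        wrong = [c for c in range(num_classes) if c != labels[i]]
        poisoned[i] = rng.choice(wrong)
    return poisoned

clean = [i % 3 for i in range(100)]
noisy = flip_labels(clean, fraction=0.2, num_classes=3)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed)  # 20
```

Training once on `clean` and once on `noisy`, then comparing accuracy on a trusted validation set, gives a rough picture of how sensitive a model is to this kind of noise.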
Integrity Attacks (Targeted):
- Goal: Cause specific misclassification on target inputs while maintaining clean accuracy.
- Method: Carefully craft poison examples that push decision boundaries toward desired misclassification.
- Subtle — validation accuracy remains high.
- Harder to detect.
Backdoor Attacks:
- Goal: Embed hidden trigger-activated behavior.
- Method: Poison training data with trigger+target label pairs.
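- Stealthy: the behavior activates only on inputs that contain the trigger, so accuracy on clean data is largely unaffected.
- Often considered the most dangerous variant, because standard validation on clean data does not reveal it.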
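To make the "trigger + target label" idea concrete, here is a minimal sketch of how a patch-style backdoor set is typically constructed when evaluating defenses. Everything in it is an assumption made for illustration: images are plain 2D lists of floats, the trigger is a small bright square in the bottom-right corner, and the function names are hypothetical.

```python
import copy
import random

def stamp_trigger(image, size=2, value=1.0):
    """Return a copy of a 2D image with a size x size patch set in the bottom-right corner."""
    out = copy.deepcopy(image)
    for r in range(len(out) - size, len(out)):
        for c in range(len(out[r]) - size, len(out[r])):
            out[r][c] = value
    return out

def poison_dataset(images, labels, target_label, rate, seed=0):
    """Stamp the trigger on a random `rate` of the images and relabel them as `target_label`."""
    rng = random.Random(seed)
    n_poison = round(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, y) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(stamp_trigger(img))
            new_labels.append(target_label)
        else:
            new_images.append(copy.deepcopy(img))
            new_labels.append(y)
    return new_images, new_labels, chosen

images = [[[0.0] * 4 for _ in range(4)] for _ in range(10)]
labels = [i % 2 for i in range(10)]
p_images, p_labels, chosen = poison_dataset(images, labels, target_label=7, rate=0.3)
print(len(chosen))  # 3
```

A model trained on such a set can be checked for a backdoor by comparing its predictions on clean test images with its predictions on the same images after `stamp_trigger` is applied.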
Poisoning in Specific Settings
Web-Scraped Pre-training Data:
- Carlini et al. (2023): Demonstrated that poisoning web-scale datasets such as LAION-400M is practical, for example by buying expired domains that the dataset index still points to and hosting malicious images there.
- "Nightshade" (Shan et al.): Artists can add imperceptible perturbations to their images that, when scraped into training data, cause generative models to associate concepts incorrectly.
- "Glaze" (Shan et al.): A related protective perturbation that cloaks an artist's style so that generative models cannot learn to mimic it.
Federated Learning Poisoning:
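- A compromised participant sends poisoned model updates to the aggregation server.
- Model poisoning: The participant directly manipulates (for example, scales) its submitted update to embed a backdoor in the global model (Bagdasaryan et al.).
- Data poisoning: The participant trains locally on poisoned data, and its resulting updates propagate the poison to the global model.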
LLM Training Data Poisoning:
- Instruction-tuning data collected from the internet can be poisoned by adversaries who control web content.
- "Shadow Alignment" (Yang et al. 2023): Showed that fine-tuning on as few as 100 malicious examples can subvert the safety behavior of safety-aligned LLMs.
- RAG Poisoning: Inject adversarial documents into retrieval databases to manipulate LLM responses.
Detection and Defense
Data Sanitization:
- Outlier detection: Remove training examples that are statistical outliers in feature space (high KNN distance from clean data).
- Clustering: Separate clean from poisoned examples using activation clustering (Chen et al.).
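- Spectral signatures: Backdoored examples tend to leave a detectable trace along the top singular direction of the feature covariance, so examples with large projections onto that direction can be removed (Tran et al.).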
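The outlier-detection idea can be sketched with a k-nearest-neighbour distance score. This is a simplified illustration, not a production filter: it assumes feature vectors are already extracted (for example, from a penultimate layer), it scores each point against the whole dataset rather than a separate trusted clean set, and `knn_outlier_scores` and `flag_outliers` are hypothetical names.

```python
import math

def knn_outlier_scores(features, k=3):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, x in enumerate(features):
        dists = sorted(math.dist(x, y) for j, y in enumerate(features) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def flag_outliers(features, k=3, num_flag=1):
    """Return the indices of the `num_flag` points with the highest kNN score."""
    scores = knn_outlier_scores(features, k)
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_flag])

features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(flag_outliers(features, k=2, num_flag=1))  # [4]
```

The brute-force pairwise loop is quadratic in the dataset size, so a real pipeline would likely use an approximate nearest-neighbour index. Carefully crafted poisons can also sit close to clean data in feature space, which limits what distance-based filters can catch.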
Certified Defenses:
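- Randomized smoothing over training labels (Rosenfeld et al.): Certifies that a prediction is unchanged under label-flipping attacks that alter up to a bounded number of training labels.
- DPA (Deep Partition Aggregation, Levine & Feizi): Trains an ensemble on disjoint partitions of the training set and certifies each prediction against a bounded number of inserted or removed training examples.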
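The partition-and-vote idea behind DPA can be sketched in a few lines. The code below is a simplified reading of the scheme, not a reference implementation: training the base classifiers is omitted, `votes` is assumed to hold one predicted class per base classifier, and the function names are hypothetical. Because each training example is assigned to exactly one partition by a hash of its content, inserting or removing one example can change at most one vote.

```python
import hashlib
from collections import Counter

def partition_index(sample_bytes, num_partitions):
    """Assign a sample to a partition using only its own content (deterministic)."""
    digest = hashlib.sha256(sample_bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def dpa_predict_and_certify(votes, num_classes):
    """Return (predicted class, number of poisoned samples the prediction tolerates)."""
    counts = Counter(votes)
    # Plurality vote; ties go to the smaller class index.
    pred = max(range(num_classes), key=lambda c: (counts[c], -c))
    gaps = []
    for c in range(num_classes):
        if c == pred:
            continue
        # A rival with a smaller index wins ties, so it needs one vote fewer to overtake.
        tie_bonus = 1 if c < pred else 0
        gaps.append(counts[pred] - (counts[c] + tie_bonus))
    # Each flipped vote can shrink a gap by 2 (one lost by pred, one gained by the rival).
    return pred, min(gaps) // 2

print(dpa_predict_and_certify([0] * 7 + [1] * 3, num_classes=2))  # (0, 2)
print(dpa_predict_and_certify([1] * 7 + [0] * 3, num_classes=2))  # (1, 1)
```

The certificate is per prediction and only covers a bounded number of modified training examples. Using more partitions raises the attainable bound but leaves each base classifier with less data.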
Data Provenance:
- Cryptographic hashing: Verify dataset integrity against signed checksums.
- Data lineage tracking: Record where each training example originated.
- SBOMs for AI: Software Bill of Materials extended to training data and model components.
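As a minimal sketch of the checksum step, the code below recomputes SHA-256 digests for dataset files and compares them with a manifest. The manifest format (a dict from relative path to hex digest) and the function names are assumptions made for illustration. Verifying a signature on the manifest itself is out of scope here and would need a separate signing tool.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dataset shards do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest does not match."""
    bad = []
    for rel_path, expected in manifest.items():
        p = Path(root) / rel_path
        if not p.is_file() or sha256_of_file(p) != expected:
            bad.append(rel_path)
    return sorted(bad)
```

An empty result from `verify_manifest` means every listed file matches its recorded digest. This detects tampering after the manifest was created, but it cannot detect poison that was already present when the dataset was first hashed.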
Poisoning Resistance through Training Design:
- Data-efficient training: Less data dependence reduces poisoning leverage.
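- Differential privacy (DP-SGD): Clips and adds noise to per-example gradients, limiting how much any single example can influence the model parameters. This gives a provable bound on the effect of a small number of poisoned examples, but the bound weakens as the poisoned fraction grows and usually costs some accuracy.
- Robust aggregation (in federated settings): Coordinate-wise median, Krum, and FLTrust are designed to tolerate Byzantine participant contributions.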
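Of the robust aggregation rules listed, coordinate-wise median is the easiest to illustrate. The sketch below assumes each client update is a flat list of floats of equal length, and `coordinate_wise_median` is a hypothetical helper name rather than an API from a federated learning library.

```python
from statistics import mean, median

def coordinate_wise_median(updates):
    """Aggregate client updates by taking the median of each coordinate independently."""
    if not updates:
        raise ValueError("need at least one update")
    dim = len(updates[0])
    if any(len(u) != dim for u in updates):
        raise ValueError("all updates must have the same length")
    return [median(u[d] for u in updates) for d in range(dim)]

updates = [
    [1.0, 1.0],       # honest client
    [1.2, 0.9],       # honest client
    [100.0, -100.0],  # malicious client sending an extreme update
]
print(coordinate_wise_median(updates))  # [1.2, 0.9]
print([mean(u[d] for u in updates) for d in range(2)])  # plain averaging is dragged far off
```

The median ignores the extreme update as long as honest clients form a majority in each coordinate, whereas plain averaging lets a single client shift the result arbitrarily far.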
Data poisoning is a training-time attack that corrupts AI at its foundation. Whereas adversarial examples require the attacker to submit inputs at inference time, data poisoning requires only the ability to influence what data enters the training pipeline. That makes it a realistic threat for any organization that relies on internet-scraped, crowdsourced, or federated training data without cryptographic integrity verification.