Adversarial Examples

Adversarial examples are inputs crafted with small, often imperceptible perturbations that reliably fool machine learning models into confident incorrect predictions. They suggest that neural networks classify based on brittle, often high-frequency statistical patterns rather than human-meaningful semantic features, which poses fundamental security and safety challenges for AI deployed in adversarial environments.

What Are Adversarial Examples?

Why Adversarial Examples Matter

Attack Types

White-Box Attacks (full model access):

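FGSM (Fast Gradient Sign Method, Goodfellow et al.): δ = ε · sign(∇_x L(f(x), y)), with adversarial input x' = x + δ. A single gradient step in the direction that increases the loss L, bounded by ε in every input dimension. Fast, but weak compared with iterative attacks.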
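The FGSM step can be sketched on a toy model. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes a hypothetical logistic-regression classifier (weights `w`, bias `b`) so the input gradient has a closed form, and inputs scaled to [0, 1]. The names `bce_loss`, `input_grad`, and `fgsm` are illustrative.

```python
import numpy as np

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model, computed stably from the logit."""
    z = w @ x + b
    return np.logaddexp(0.0, z) - y * z

def input_grad(w, b, x, y):
    """Gradient of bce_loss with respect to the input x: (sigmoid(z) - y) * w."""
    z = w @ x + b
    p = 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid
    return (p - y) * w

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps, then clip to the valid input range."""
    delta = eps * np.sign(input_grad(w, b, x, y))
    return np.clip(x + delta, 0.0, 1.0)

# Toy setup (assumed for illustration): random weights, input away from the [0, 1] edges.
rng = np.random.default_rng(0)
w, b = rng.normal(size=100), 0.0
x = rng.uniform(0.2, 0.8, size=100)
y = float(w @ x + b > 0)  # use the model's own clean prediction as the label
eps = 0.1

x_adv = fgsm(w, b, x, y, eps)
loss_clean = bce_loss(w, b, x, y)
loss_adv = bce_loss(w, b, x_adv, y)
print(f"max |delta| = {np.max(np.abs(x_adv - x)):.3f}")
print(f"loss: {loss_clean:.4f} -> {loss_adv:.4f}")
```

For a deep network the closed-form gradient would be replaced by automatic differentiation; the signed step itself is unchanged.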

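PGD (Projected Gradient Descent, Madry et al.): x_{t+1} = Π_{ε-ball}(x_t + α · sign(∇_x L(f(x_t), y))), where Π projects back onto the ε-ball around the original input x and α is the step size. In effect an iterated FGSM; stronger, and widely treated as the standard first-order attack for evaluating robustness.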
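The PGD update can be sketched the same way. This is a minimal sketch under assumptions: a hypothetical logistic-regression model with a closed-form input gradient, an L∞ ball, inputs in [0, 1], and a start at the clean input (Madry et al. start from a random point inside the ball, which is omitted here). The names `pgd_linf`, `alpha`, and `steps` are illustrative.

```python
import numpy as np

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model, computed stably from the logit."""
    z = w @ x + b
    return np.logaddexp(0.0, z) - y * z

def input_grad(w, b, x, y):
    """Gradient of bce_loss with respect to the input x."""
    z = w @ x + b
    p = 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid
    return (p - y) * w

def pgd_linf(w, b, x, y, eps, alpha, steps):
    """Repeat small signed-gradient steps, projecting onto the L-inf ball each time."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in the valid input range
    return x_adv

# Toy setup (assumed for illustration).
rng = np.random.default_rng(1)
w, b = rng.normal(size=100), 0.0
x = rng.uniform(0.2, 0.8, size=100)
y = float(w @ x + b > 0)
eps = 0.1

x_adv = pgd_linf(w, b, x, y, eps=eps, alpha=0.02, steps=10)
loss_clean = bce_loss(w, b, x, y)
loss_adv = bce_loss(w, b, x_adv, y)
print(f"max |delta| = {np.max(np.abs(x_adv - x)):.3f}")
print(f"loss: {loss_clean:.4f} -> {loss_adv:.4f}")
```

On this linear toy model PGD ends at the same corner of the ε-ball that a single FGSM step reaches; the iterations pay off on nonlinear models, where the gradient direction changes along the path.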

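C&W Attack (Carlini & Wagner): Searches for the smallest perturbation that causes misclassification, formulated as an optimization problem: min ||δ||_2 s.t. f(x + δ) ≠ y. In practice the constraint is replaced by a penalty term and the combined objective is minimized with a gradient-based optimizer. Among the strongest white-box attacks; it was introduced to show that defensive distillation can be broken.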

Black-Box Attacks (only query access):

Transfer attacks: Craft adversarial examples on a surrogate model, then apply them to the target model.

Query-based attacks: Use only model queries, either estimating gradients (e.g., SPSA) or searching directly without gradients (e.g., Square Attack).

Score-based vs. decision-based: Whether the attacker sees confidence scores or only the top-1 predicted class.
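Gradient estimation from queries can be sketched with an SPSA-style estimator. This is a minimal sketch under assumptions: the target is a hypothetical score-based black box `query_loss` that returns a loss value for any input (here a hidden logistic model stands in for it), and the sample count and step size `c` are arbitrary illustrative choices. The attacker code never reads the weights; they are used only afterwards to check the estimate.

```python
import numpy as np

def spsa_gradient(query_loss, x, c=1e-3, n_samples=2000, rng=None):
    """Estimate the gradient of query_loss at x from paired queries along random +/-1 directions."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=x.shape)  # random sign vector
        diff = query_loss(x + c * delta) - query_loss(x - c * delta)
        grad += (diff / (2.0 * c)) * delta             # two queries per sample
    return grad / n_samples

# Hidden model standing in for the black box (assumed for illustration).
rng = np.random.default_rng(0)
w_secret = rng.normal(size=20)
x = rng.uniform(0.2, 0.8, size=20)
y = 1.0

def query_loss(x_query):
    """Black-box score: cross-entropy loss of the hidden logistic model for label y."""
    z = w_secret @ x_query
    return np.logaddexp(0.0, z) - y * z

grad_est = spsa_gradient(query_loss, x, rng=rng)

# Compare against the true gradient, which a real attacker would not have.
p = 0.5 * (1.0 + np.tanh(0.5 * (w_secret @ x)))
grad_true = (p - y) * w_secret
cosine = grad_est @ grad_true / (np.linalg.norm(grad_est) * np.linalg.norm(grad_true))
print(f"cosine similarity to true gradient: {cosine:.3f}")
```

The estimate costs two queries per sample, so query budget is the main constraint. A decision-based attacker, who sees only the top-1 class, cannot use this estimator directly because no loss value is returned.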

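Targeted vs. Untargeted: An untargeted attack succeeds with any incorrect prediction, f(x + δ) ≠ y; a targeted attack must produce a specific class t chosen by the attacker, f(x + δ) = t.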

Why Neural Networks Are Vulnerable

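The Linearity Hypothesis (Goodfellow et al.): Deep networks behave approximately linearly in a local neighborhood of an input, and high-dimensional linear functions are inherently sensitive to adversarial perturbations. For a weight vector w, the perturbation η = ε · sign(w) changes each input dimension by only ε but shifts the activation w·x by ε · ||w||_1, which grows in proportion to the input dimension D. Many imperceptible per-dimension changes therefore add up to a large change in the output.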
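The accumulation argument can be checked numerically. This is a small sketch under assumptions: a single linear unit with random standard-normal weights, perturbed by η = ε · sign(w), so that every input dimension moves by exactly ε. The function name `activation_shift` and the chosen dimensions are illustrative.

```python
import numpy as np

def activation_shift(w, eps):
    """Change in w @ x when x is perturbed by eta = eps * sign(w); equals eps * ||w||_1."""
    eta = eps * np.sign(w)
    return w @ eta

rng = np.random.default_rng(0)
eps = 0.01  # per-dimension perturbation size, identical in every run
shifts = {}
for dim in (10, 1_000, 100_000):
    w = rng.normal(size=dim)  # assumed weight distribution, for illustration only
    shifts[dim] = activation_shift(w, eps)
    print(f"D = {dim:>7}: per-dimension change = {eps}, activation shift = {shifts[dim]:.3f}")
```

The per-dimension change stays fixed at ε while the shift in the activation grows roughly linearly with D, which is the effect the hypothesis relies on.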

Feature Statistics vs. Semantics: Neural networks often classify using statistical patterns (texture, frequency content) that humans largely ignore. Models trained with empirical risk minimization (ERM) learn the most predictive features, not the most robust ones.

Ilyas et al. (2019): Non-robust features are genuinely predictive of class labels. On this view, adversarial vulnerability is not a bug in the model but a consequence of features in the data distribution that are informative yet brittle under small perturbations.

Common Perturbation Norms

| Norm | Meaning | Typical ε |
|------|---------|-----------|
| L∞ | Max pixel change | 4/255 to 16/255 |
| L2 | Total pixel energy | 0.5 to 3.0 |
| L0 | Number of pixels changed | 1 to 100 pixels |
| Lp | General Minkowski | Task-dependent |
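The norms in the table can be computed directly from a clean input and its perturbed copy. The sketch below is illustrative: the helper names `perturbation_norms`, `project_linf`, and `project_l2` are assumptions, and the two projection helpers show how a perturbation can be forced back inside an L∞ or L2 budget.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Return the L-inf, L2, and L0 size of the perturbation x_adv - x."""
    delta = (np.asarray(x_adv) - np.asarray(x)).ravel()
    return {
        "linf": float(np.max(np.abs(delta))),  # largest single-entry change
        "l2": float(np.linalg.norm(delta)),    # Euclidean length of the perturbation
        "l0": int(np.count_nonzero(delta)),    # number of entries changed
    }

def project_linf(delta, eps):
    """Clip every entry of delta into [-eps, eps]."""
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    """Rescale delta onto the L2 ball of radius eps if it lies outside it."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

x = np.zeros(4)
x_adv = np.array([0.3, 0.0, -0.4, 0.0])
norms = perturbation_norms(x, x_adv)
print(norms)
```

Which norm to constrain is a modeling choice: L∞ bounds the worst single pixel, L2 bounds the overall size, and L0 bounds how many pixels may change at all.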

Adversarial examples expose a gap between human perception and model decision-making: networks that match patterns with high accuracy can still fail under perturbations a person would not notice. That gap drives an ongoing arms race between attack methods and defenses, and it remains one of the most active areas in trustworthy machine learning.
