Adversarial Robustness is the study and engineering of deep learning models that maintain correct predictions under small, carefully crafted adversarial perturbations: imperceptible input modifications designed to cause misclassification. The field encompasses attack methodologies that expose vulnerabilities, empirical defenses that harden models through adversarial training, and certified defenses that provide mathematical guarantees on worst-case performance.
Attack Taxonomy:
- FGSM (Fast Gradient Sign Method): Single-step attack adding the epsilon-scaled sign of the loss gradient to the input; fast but relatively weak
- PGD (Projected Gradient Descent): Multi-step iterative attack repeatedly applying small FGSM steps and projecting back onto the epsilon-ball; the standard benchmark attack (both are sketched in code after this list)
- C&W (Carlini & Wagner): Optimization-based attack minimizing perturbation magnitude while ensuring misclassification; effective against many defenses but computationally expensive
- AutoAttack: Ensemble of complementary attacks (APGD-CE, APGD-DLR, FAB, Square) providing a reliable, parameter-free robustness evaluation standard
- Patch Attacks: Modify a localized region (physical sticker, printed pattern) to cause misclassification in real-world settings
- Universal Adversarial Perturbations: Find a single perturbation that fools the model on most inputs, revealing systematic blind spots
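FGSM and PGD are compact enough to sketch directly. Below is a minimal PyTorch sketch of both, assuming `model` returns logits and inputs live in [0, 1]; the function names and the `eps`/`alpha`/`steps` defaults are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8/255):
    """Single-step FGSM: perturb x by eps * sign(grad of loss w.r.t. x)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Ascend the loss, then keep pixels in the valid [0, 1] range.
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Multi-step PGD: repeated small FGSM steps, each followed by
    projection back onto the L-infinity ball of radius eps around x."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    x_adv = x_adv.clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project: clip the total perturbation to the eps-ball, then to [0, 1].
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```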
Threat Models:
- Lp-Norm Bounded: Perturbations constrained within an Lp ball — L-infinity (max per-pixel change, typically epsilon=8/255 for CIFAR-10), L2 (Euclidean distance), or L1 (sparse perturbations); the projection step that enforces these constraints is sketched after this list
- Semantic Perturbations: Physically realizable changes like rotation, color shifts, lighting variations, or weather effects that preserve human interpretation
- Black-Box Attacks: Adversary has no access to model weights; relies on transfer attacks (craft adversarial examples on a surrogate model) or query-based attacks, which exploit either output scores or hard-label decisions
- White-Box Attacks: Full access to model architecture, weights, and gradients — the strongest threat model used for rigorous robustness evaluation
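As noted in the Lp-norm item above, attacks enforce the threat model through a projection step. A minimal PyTorch sketch of L-infinity and L2 projection (the `project` helper name is illustrative):

```python
import torch

def project(delta, eps, norm="linf"):
    """Project a batch of perturbations delta onto the eps-ball of the
    given Lp norm; this is the feasibility step that keeps an attack
    inside its threat model."""
    if norm == "linf":
        # Each coordinate independently clipped to [-eps, eps].
        return delta.clamp(-eps, eps)
    if norm == "l2":
        # Rescale each example's perturbation if its L2 norm exceeds eps.
        flat = delta.flatten(start_dim=1)
        norms = flat.norm(p=2, dim=1, keepdim=True).clamp_min(1e-12)
        factor = (eps / norms).clamp(max=1.0)
        return (flat * factor).view_as(delta)
    raise ValueError(f"unsupported norm: {norm}")
```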
Empirical Defenses — Adversarial Training:
- Standard Adversarial Training (Madry et al.): Replace clean training examples with PGD-adversarial examples; the most reliable empirical defense but incurs 3–10x training cost (a minimal training-loop sketch follows this list)
- TRADES: Decomposes the robust objective into a natural cross-entropy term and a robustness regularizer (the KL divergence between predictions on clean and adversarial inputs), balanced by a tunable tradeoff parameter beta
- AWP (Adversarial Weight Perturbation): Perturb model weights during adversarial training to flatten the loss landscape and improve generalization
- Friendly Adversarial Training (FAT): Use early-stopped PGD to find adversarial examples near the decision boundary rather than worst-case, reducing overfitting
- Accuracy-Robustness Tradeoff: Adversarially trained models typically sacrifice 5–15% clean accuracy for substantially improved robust accuracy
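A minimal sketch of one epoch of Madry-style adversarial training, reusing the `pgd_attack` sketch from the attack taxonomy above; the loop structure is standard, but names and hyperparameters are illustrative.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8/255):
    """One epoch of Madry-style adversarial training: each clean batch
    is replaced by its PGD counterpart before the usual gradient step."""
    model.train()
    for x, y in loader:
        # Inner maximization: ~10 extra forward/backward passes per batch,
        # which is where the 3-10x training cost comes from.
        x_adv = pgd_attack(model, x, y, eps=eps)
        # Outer minimization: standard cross-entropy on the adversarial batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```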
Certified Defenses:
- Randomized Smoothing: Create a smoothed classifier by averaging predictions over Gaussian noise perturbations of the input; provides L2 certified radii via the Neyman-Pearson lemma (a certification sketch follows this list)
- Interval Bound Propagation (IBP): Propagate interval bounds through each network layer to compute guaranteed output bounds for all inputs within the perturbation set (also sketched below)
- Linear Relaxation (CROWN, alpha-CROWN): Compute linear upper and lower bounds on network outputs using convex relaxations of nonlinear activations
- Lipschitz Networks: Constrain the Lipschitz constant of each layer (spectral normalization, orthogonal layers) to provably limit output change per unit input perturbation
- Certification Gap: Certified radii are typically much smaller than the perturbation sizes models withstand empirically; closing this gap remains an active research challenge
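The randomized-smoothing certificate reduces to a short Monte-Carlo procedure. The sketch below follows the shape of Cohen et al.'s CERTIFY procedure (2019) with a Clopper-Pearson lower confidence bound; using one sample set for both class selection and estimation is a simplification of the paper's two-phase procedure, and all names are illustrative.

```python
import torch
from scipy.stats import beta, norm

def certify(model, x, sigma=0.25, n=1000, alpha=0.001, num_classes=10):
    """Certify an L2 radius around x for the smoothed classifier.
    Returns (predicted class, certified radius), or (None, 0.0) to abstain."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
        counts = torch.bincount(model(noisy).argmax(dim=1), minlength=num_classes)
    top = counts.argmax().item()
    k = counts[top].item()
    # Clopper-Pearson lower (1 - alpha) confidence bound on
    # p_A = P(base classifier predicts `top` under Gaussian noise).
    p_lower = beta.ppf(alpha, k, n - k + 1)
    if p_lower <= 0.5:
        return None, 0.0  # abstain: majority class not confident enough
    # Cohen et al.: certified L2 radius = sigma * Phi^{-1}(p_lower),
    # using the bound p_B <= 1 - p_A for the runner-up class.
    return top, sigma * norm.ppf(p_lower)
```

IBP is even more compact: for an affine layer, splitting the weight matrix by sign gives exact interval arithmetic, and monotone activations like ReLU propagate bounds elementwise. A minimal sketch:

```python
import torch

def ibp_affine(lower, upper, weight, bias):
    """Propagate interval bounds [lower, upper] through y = x @ W.T + b.
    Positive weights pair lower-with-lower; negative weights flip the bounds."""
    w_pos, w_neg = weight.clamp(min=0), weight.clamp(max=0)
    new_lower = lower @ w_pos.T + upper @ w_neg.T + bias
    new_upper = upper @ w_pos.T + lower @ w_neg.T + bias
    return new_lower, new_upper

# ReLU is monotone, so its bounds are simply lower.relu(), upper.relu();
# composing these steps layer by layer bounds the logits over the whole eps-ball.
```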
Evaluation Best Practices:
- Use AutoAttack: The standard evaluation suite that prevents overestimating robustness due to gradient masking or obfuscated gradients (usage sketched after this list)
- Report Clean and Robust Accuracy: Always measure both natural accuracy and accuracy under attack at the specified epsilon
- Adaptive Attacks: Design attacks specifically targeting the defense mechanism's unique properties; generic attacks may miss exploitable weaknesses
- RobustBench: Standardized benchmark tracking adversarial robustness across models, datasets, and threat models with consistent evaluation protocols
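For reference, a standard AutoAttack evaluation with the authors' autoattack package looks like the snippet below; `model`, `x_test`, and `y_test` are assumed to be defined (a logits-returning model, inputs in [0, 1], integer labels), and the epsilon matches the common CIFAR-10 L-infinity setting.

```python
import torch
from autoattack import AutoAttack  # https://github.com/fra31/auto-attack

adversary = AutoAttack(model, norm='Linf', eps=8/255, version='standard')
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)

# Report both numbers: clean accuracy on x_test, robust accuracy on x_adv.
with torch.no_grad():
    clean_acc = (model(x_test).argmax(1) == y_test).float().mean().item()
    robust_acc = (model(x_adv).argmax(1) == y_test).float().mean().item()
print(f"clean: {clean_acc:.1%}  robust at eps=8/255: {robust_acc:.1%}")
```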
Adversarial robustness remains one of the fundamental open challenges in deploying deep learning to safety-critical domains. The gap between empirical defenses and provable guarantees, the inherent accuracy-robustness tradeoff, and the computational cost of robust training must all be navigated to build trustworthy AI systems.