Mechanism

Adversarial training improves model robustness by including adversarial examples in the training data.

Mechanism: Generate adversarial perturbations of training examples and add the perturbed examples to the training batch, so that the model learns to classify both clean and adversarial inputs correctly.

Process: For each batch, compute the loss on the clean inputs, generate an adversarial perturbation (for example with the fast gradient sign method, FGSM, or projected gradient descent, PGD), compute the loss on the perturbed inputs, and update the model on the combined loss.

PGD adversarial training: Uses multi-step projected gradient descent to produce stronger attacks during training. It is widely considered the gold standard.

Benefits: The most reliable empirical defense against gradient-based attacks; can help robustness certification; may improve generalization.

Trade-offs: Training is 2-10x slower, because every attack step needs an extra forward and backward pass. Accuracy on clean data drops slightly (the robustness-accuracy tradeoff), and the defense does not protect against all attack types.

For NLP: Data augmentation with adversarial text, TextFooler-augmented training, and synonym substitution during training.

Challenges: Robust overfitting (robust accuracy on held-out data decreases late in training), choosing the attack strength, and computational cost.

Best practices: Train against strong attacks, use early stopping based on robust accuracy rather than clean accuracy, and combine adversarial training with other defenses.
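The process described above can be illustrated with a minimal sketch. It is not a reference implementation: it assumes a tiny logistic-regression classifier written in pure Python so that the input gradient can be computed by hand, an L-infinity perturbation budget eps, and a made-up four-point dataset. All function names (fgsm, pgd, train, robust_accuracy) and hyperparameters are hypothetical choices for illustration. A real setup would typically use a deep learning framework with automatic differentiation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    # Probability of class 1 under a linear model (assumed for this sketch).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_grad(w, b, x, y):
    # Gradient of binary cross-entropy with respect to the input x: (p - y) * w.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    # Single-step attack: shift every feature by eps in the loss-increasing direction.
    g = input_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, step, iters):
    # Multi-step attack: signed-gradient steps, each projected back onto
    # the L-infinity ball of radius eps around the clean input x.
    x_adv = list(x)
    for _ in range(iters):
        g = input_grad(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def train(data, eps, epochs=200, lr=0.1, adversarial=True):
    # Full-batch gradient descent on the combined clean + adversarial loss.
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in data:
            inputs = [x]
            if adversarial:
                inputs.append(pgd(w, b, x, y, eps, step=eps / 4, iters=8))
            for xin in inputs:
                err = predict(w, b, xin) - y  # derivative of the loss w.r.t. the logit
                gw = [g + err * xi for g, xi in zip(gw, xin)]
                gb += err
        n = len(data) * (2 if adversarial else 1)
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def robust_accuracy(w, b, data, eps):
    # Accuracy on PGD-perturbed inputs; eps=0 gives plain clean accuracy.
    hits = 0
    for x, y in data:
        x_adv = pgd(w, b, x, y, eps, step=eps / 4, iters=8)
        hits += int((predict(w, b, x_adv) >= 0.5) == (y == 1))
    return hits / len(data)

# Hypothetical toy data: two well-separated classes in 2D.
DATA = [([2.0, 2.0], 1), ([1.5, 2.5], 1), ([-2.0, -2.0], 0), ([-2.5, -1.5], 0)]
w, b = train(DATA, eps=0.5)
print("clean accuracy: ", robust_accuracy(w, b, DATA, eps=0.0))
print("robust accuracy:", robust_accuracy(w, b, DATA, eps=0.5))
```

In this sketch the attack used during training is PGD, and the same attack is reused to measure robust accuracy. Tracking that robust accuracy on held-out data after each epoch would be one way to apply the early-stopping practice mentioned above. Note that for a linear model a single FGSM step already finds the worst-case L-infinity perturbation, so the extra PGD steps only start to matter for non-linear models.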
