Obfuscated gradients

Keywords: obfuscated gradients, adversarial defense, gradient attack

Obfuscated gradients are a class of adversarial defense mechanisms that make gradient-based attacks harder by breaking or masking the gradient signal used to craft adversarial examples. Typical ingredients include non-differentiable preprocessing, stochastic components, or deeply stacked defense networks that cause gradient computation to fail or to produce uninformative gradients. Such defenses are usually vulnerable to adaptive attacks that bypass or approximate the gradient computation, so they provide a false sense of robustness unless they are rigorously evaluated with adaptive attack methods.

Why Gradients Matter for Adversarial Attacks

The most effective adversarial attacks (PGD, C&W, AutoAttack) use the model's own gradients to find a small perturbation δ that causes misclassification. In the standard bounded formulation, the attacker maximizes the loss within an ε-ball around the input:

max_{||δ||≤ε} L(f(x + δ), y_true)

This is solved via projected gradient descent: δ_{t+1} = Π_{||δ||≤ε}[δ_t + α · sign(∇_δ L)].

The attack requires meaningful gradients ∇_δ L. Obfuscated gradient defenses aim to make this gradient signal uninformative or non-existent.
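
A minimal sketch of this update in PyTorch; `model` stands for any differentiable classifier returning logits, and inputs are assumed to live in [0, 1]:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: maximize the loss within an eps-ball around x."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            # Gradient ascent on delta, then project back into the eps-ball
            # and keep x + delta inside the valid pixel range [0, 1].
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x)
        delta.grad.zero_()
    return (x + delta).detach()
```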

Three Types of Obfuscated Gradients

Type 1 — Shattered Gradients: Non-differentiable preprocessing transforms the input before the classifier sees it, breaking the gradient path:
- JPEG compression (discrete quantization)
- Pixel value rounding or discretization
- Random bit-depth reduction
- Thermometer encoding

Because the true gradient of such an operation is zero (or undefined) almost everywhere, standard gradient-based attackers fail. Adaptive attackers recover informative gradients with straight-through gradient estimation: the non-differentiable operation is applied normally in the forward pass but treated as the identity during backpropagation, which works because these preprocessing steps still preserve a meaningful input-output relationship.
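
A sketch of a straight-through estimator for a hypothetical pixel-quantization defense, written as a PyTorch autograd function; the `QuantizeSTE` name and the bit depth are illustrative assumptions, not taken from any specific defense:

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Pixel quantization with a straight-through backward pass.

    Forward: round inputs to a coarse set of levels (non-differentiable,
    gradient zero almost everywhere). Backward: pretend the operation was
    the identity, so gradients flow through to the attacker.
    """

    @staticmethod
    def forward(ctx, x, levels=8):
        return torch.round(x * (levels - 1)) / (levels - 1)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient unchanged; None for `levels`.
        return grad_output, None

# Usage inside an attack: replace the defense's quantize(x) with
# QuantizeSTE.apply(x) so the forward pass still sees the defended input
# while backpropagation remains usable.
```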

Type 2 — Stochastic Defenses: Randomness in the defense prevents gradient ascent from converging:
- Random resizing and padding of input images
- Feature squeezing with random noise injection
- Randomized smoothing (deliberately adds Gaussian noise)
- Dropout active during inference
- Stochastic neural network ensembles

Expectation Over Transformation (EOT) attacks defeat stochastic defenses by optimizing the expected loss over the randomness, max_{||δ||≤ε} E_{t~T}[L(f(t(x + δ)), y)], estimating the gradient by averaging over many sampled transformations t.
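
A sketch of the EOT gradient estimate, assuming the stochastic defense `transform` is implemented with differentiable PyTorch operations (e.g. random resize and pad):

```python
import torch
import torch.nn.functional as F

def eot_gradient(model, transform, x, y, n_samples=30):
    """Monte Carlo estimate of grad_x E_t[L(f(t(x)), y)].

    `transform` is the stochastic defense; averaging single-sample
    gradients gives an estimate of the gradient of the expected loss,
    which a PGD-style attack can then follow as usual.
    """
    x = x.clone().requires_grad_(True)
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        loss = F.cross_entropy(model(transform(x)), y)
        grad += torch.autograd.grad(loss, x)[0]
    return grad / n_samples
```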

Type 3 — Exploding/Vanishing Gradients from Deep Defenses: Defense networks that are themselves deep (input transformers, purifiers, denoising networks) may produce vanishing or exploding gradients through their layers, making the end-to-end gradient uninformative:
- Deep input purification networks
- Defense-in-depth architectures
- Gradient masking through sigmoid/tanh saturation

BPDA (Backward Pass Differentiable Approximation) keeps the real defense in the forward pass but replaces it with a smooth approximation, often simply the identity, during the backward pass only, recovering meaningful gradients for the attack.
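
One common way to implement BPDA with the identity approximation is the detach trick sketched below; `defense` is a placeholder for any purification or preprocessing module, and the helper name is illustrative:

```python
import torch

def bpda_identity(defense, x):
    """BPDA with the identity approximation.

    Forward value: the real (possibly non-differentiable) defense output,
    since x + (defense(x) - x) == defense(x).
    Backward pass: gradients behave as if the defense were f(x) = x,
    because the difference term is detached from the autograd graph.
    """
    return x + (defense(x.detach()) - x).detach()

# In an attack loop, feed model(bpda_identity(purifier, x + delta))
# instead of model(purifier(x + delta)); when the purifier roughly
# preserves its input, the resulting gradients are informative enough
# for PGD to succeed.
```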

Athalye et al. (2018): Obfuscated Gradients Give False Security

The landmark paper examined nine ICLR 2018 defense papers and found that seven relied on obfuscated gradients for their apparent robustness. Using adaptive attacks (BPDA, EOT, or combinations of the two), the authors circumvented six of the seven defenses completely and the seventh partially, reducing accuracy under attack from the claimed 50-90% to roughly 0-20% in most cases.

Diagnostic signs that a defense relies on obfuscated gradients (a rough sanity-check sketch follows this list):
- Attack success rate decreases as the iteration count or perturbation budget increases (for a well-behaved attack, success should not decrease as the attack gets stronger)
- White-box attacks are less successful than black-box transfer attacks (the gradient-based attack fails while transferability remains)
- Random perturbations degrade accuracy nearly as much as gradient-crafted adversarial perturbations, indicating the gradients carry little useful attack signal
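
A rough sketch of the first and third checks; `attack` is a placeholder for any PGD-style attack function, and the helper name is hypothetical:

```python
import torch

def gradient_masking_checks(model, attack, x, y, eps=8/255):
    """Quick sanity checks for the warning signs listed above.

    `attack(model, x, y, eps, steps)` is assumed to return adversarial inputs.
    """
    accs = []
    for steps in (1, 10, 100):
        x_adv = attack(model, x, y, eps, steps)
        accs.append((model(x_adv).argmax(1) == y).float().mean().item())
    # Check 1: robust accuracy should not *increase* with more attack steps.
    print("robust accuracy vs. steps (1, 10, 100):", accs)

    # Check 3: uniform random noise in the eps-ball should hurt far less
    # than a gradient-based attack; if not, the gradients are uninformative.
    x_rand = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    print("accuracy under random noise:",
          (model(x_rand).argmax(1) == y).float().mean().item())
```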

Certified vs. Heuristic Defenses

The obfuscated gradients problem motivates the distinction:

| Defense Type | Robustness Guarantee | Representative Method |
|-------------|---------------------|----------------------|
| Certified defenses | Provable — guaranteed by a verification/certification procedure | Randomized Smoothing, Lipschitz constraints, IBP training |
| Heuristic defenses | Empirical — no worst-case guarantee | Adversarial training (PGD-AT), TRADES |
| Obfuscated gradient defenses | Apparent only — breaks under adaptive attacks | Input preprocessing, stochastic defenses without EOT evaluation |

Best Practices for Defense Evaluation

The adversarial ML community now requires:
1. Evaluate with AutoAttack (an ensemble of diverse attacks, including a black-box component); a usage sketch follows this list
2. Test with adaptive attacks specifically designed to break the defense
3. Provide certified accuracy bounds where possible
4. Release code for independent verification
5. Report against established benchmarks (RobustBench) rather than custom evaluation protocols
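
A sketch of point 1 using the reference AutoAttack package (fra31/auto-attack); the call below follows the usage shown in that repository, but check the current API before relying on it:

```python
import torch
from autoattack import AutoAttack  # pip install autoattack

def robust_accuracy(model, x_test, y_test, eps=8/255, batch_size=128):
    """Robust accuracy under the standard Linf AutoAttack ensemble."""
    model.eval()
    adversary = AutoAttack(model, norm='Linf', eps=eps, version='standard')
    x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=batch_size)
    with torch.no_grad():
        return (model(x_adv).argmax(1) == y_test).float().mean().item()
```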

Randomized Smoothing (Cohen et al., 2019) is one of the few certified defenses that scales to ImageNet, providing provable ℓ2-ball robustness guarantees at the cost of clean accuracy and a substantial sampling overhead at inference time.
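
A sketch of the smoothed classifier's prediction rule (Gaussian noise with standard deviation σ, majority vote over samples); the statistical step that turns the vote into a rigorous certificate is only noted in a comment:

```python
import torch

def smoothed_predict(model, x, sigma=0.25, n_samples=1000, num_classes=10):
    """Majority-vote prediction of the Gaussian-smoothed classifier.

    Assumes x is a single image with a leading batch dimension of 1.
    A certified L2 radius follows from a lower confidence bound on the
    top-class probability pA: R = sigma * Phi^{-1}(pA) (Cohen et al., 2019);
    the confidence-bounding step is omitted from this sketch.
    """
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x + sigma * torch.randn_like(x)).argmax(1).item()
            counts[pred] += 1
    return counts.argmax().item(), counts
```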
