Integrated Gradients

Integrated Gradients is an attribution method that assigns importance scores to input features by accumulating gradients along a straight-line path from a baseline to the actual input — satisfying key axioms (completeness, sensitivity) that vanilla gradients violate.

How Integrated Gradients Works

- Baseline: A reference input $x'$ (typically all zeros, black image, or PAD tokens).
- Path: Interpolate linearly from $x'$ to $x$: $x(alpha) = x' + alpha(x - x')$ for $alpha in [0,1]$.
- Integration: $IG_i = (x_i - x_i') int_0^1 frac{partial F(x(alpha))}{partial x_i} dalpha$ — accumulated gradient × input difference.
- Approximation: Approximate the integral with a Riemann sum using 20-300 interpolation steps.

Why It Matters

- Completeness Axiom: Attributions sum exactly to the difference $F(x) - F(x')$ — every bit of the prediction is accounted for.
- Sensitivity: If a feature matters (changing it changes the prediction), it gets non-zero attribution.
- Implementation: Simple to implement — just requires gradient computation at interpolated inputs.

Integrated Gradients is following the gradient along the path — accumulating feature importance from a baseline to the input for principled, complete attribution.

Want to learn more?