Integrated Gradients (IG) is an axiomatic attribution method that explains a neural network's prediction by integrating gradients along a path from a baseline input to the actual input. It satisfies two provable properties, Sensitivity and Implementation Invariance, that simpler gradient methods violate, which has made it a widely used choice for feature attribution in high-stakes applications.
What Are Integrated Gradients?
- Definition: An attribution method that assigns importance scores to input features by integrating (summing) the gradient of the prediction with respect to each feature along a linear interpolation path from a baseline input (e.g., black image, zero embedding) to the actual input.
- Publication: "Axiomatic Attribution for Deep Networks" — Sundararajan, Taly, Yan (Google, 2017).
- Formula: IG_i(x) = (x_i - x'_i) × ∫₀¹ [∂F(x' + α(x - x')) / ∂x_i] dα
Where x' = baseline, x = actual input, α parameterizes the interpolation path.
- Approximation: Discretize the integral as a Riemann sum with N steps (typically N=50–300): IG_i ≈ (x_i - x'_i) × (1/N) × Σ_{k=1}^{N} [∂F(x' + (k/N)(x - x')) / ∂x_i].
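The discretized sum can be sketched in plain Python. This is a minimal illustration, not a library implementation: the toy model `f`, its hand-derived gradient `grad_f`, and the input values are assumptions chosen so the exact answer is known in closed form (IG_1 = x_1², IG_2 = 3·x_2 for a zero baseline).

```python
def integrated_gradients(grad_f, x, baseline, n_steps=50):
    """Right-endpoint Riemann sum of the IG integral, one value per feature."""
    dims = range(len(x))
    totals = [0.0 for _ in dims]
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        # Point on the straight line from the baseline to the input
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in dims]
        grad = grad_f(point)
        for i in dims:
            totals[i] += grad[i]
    # Average the gradients, then scale by (input - baseline)
    return [(x[i] - baseline[i]) * totals[i] / n_steps for i in dims]


# Hypothetical toy model F(x) = x1^2 + 3*x2 and its gradient (illustration only)
def f(x):
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    return [2.0 * x[0], 3.0]


x, baseline = [2.0, 1.0], [0.0, 0.0]
ig_50 = integrated_gradients(grad_f, x, baseline, n_steps=50)
ig_300 = integrated_gradients(grad_f, x, baseline, n_steps=300)
exact = [4.0, 3.0]  # closed form for this toy model with a zero baseline
print(ig_50, ig_300, exact)
```

For this toy model the right-endpoint sum overshoots the first attribution by a factor of (N+1)/N, so the estimate moves toward the exact value as N grows; the second feature has a constant gradient and is recovered exactly.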
Why Integrated Gradients Matters
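- Axiom Satisfaction: Satisfies both Sensitivity (if the input and the baseline differ in one feature and produce different predictions, that feature receives non-zero attribution) and Implementation Invariance (two functionally equivalent networks receive identical attributions). The original paper further shows that, among path methods, IG is the unique one that also preserves symmetry.
- Vanilla Gradient Failure: Plain gradients fail Sensitivity. Where the network saturates (the output is flat around the input), the gradient is zero even though moving the feature away from its baseline value changes the output substantially. Integrated Gradients accumulates gradients along the whole path from the baseline, so it picks up the region where the output actually changed.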
- Completeness: Attributions sum exactly to the prediction score difference from baseline: Σ IG_i(x) = F(x) - F(x'). Every point of the output difference is "accounted for" by input features.
- Trustworthy in High Stakes: Medical, legal, and financial applications call for attributions backed by stated guarantees, rather than heuristics that look reasonable but may not reflect what the model actually computed.
- Standard in Industry: Offered as an attribution method in Google Cloud's Explainable AI service and implemented in open-source libraries such as Captum for PyTorch; commonly applied to transformer model predictions.
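The Sensitivity and Completeness points can be checked on a one-dimensional saturating function. The sketch below assumes a toy model F(x) = 1 - ReLU(1 - x), a zero baseline, and an input of 2; the function names are illustrative, and the midpoint rule is used here only as one convenient way to discretize the integral.

```python
def relu(z):
    return max(z, 0.0)


def f(x):
    # Saturating toy model: rises from 0 to 1 on [0, 1], then stays flat at 1
    return 1.0 - relu(1.0 - x)


def grad_f(x):
    # Hand-derived derivative: 1 on the rising part, 0 once saturated
    return 1.0 if x < 1.0 else 0.0


def integrated_gradient(x, baseline, n_steps=100):
    # Midpoint Riemann sum along the straight line from baseline to x
    total = 0.0
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        total += grad_f(baseline + alpha * (x - baseline))
    return (x - baseline) * total / n_steps


x, baseline = 2.0, 0.0
vanilla = grad_f(x)                    # gradient at the saturated input
ig = integrated_gradient(x, baseline)  # accumulated along the path
print(vanilla, ig, f(x) - f(baseline))
```

The plain gradient at the input is zero because the function is flat there, while the integrated gradient matches the output difference F(x) - F(x') that the feature is responsible for.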
The Baseline Choice
The baseline x' is the "neutral" input from which attribution is measured:
| Modality | Common Baseline | Rationale |
|---|---|---|
| Images | Black image (zeros) | No visual information |
| Text (embeddings) | Zero embedding vector | No semantic content |
| Text (tokens) | Padding token [PAD] | Empty/absent input |
| Tabular | Feature means | Average input |
| Audio | Silence (zeros) | No signal |
Baseline choice affects attributions significantly — different baselines answer different questions:
- Black image baseline: "Compared to no image, which pixels mattered?"
- Blurred image baseline: "Compared to a blurred version, which details mattered?"
- Choosing meaningful baselines is an application-specific decision.
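To see how the baseline changes the answer, the sketch below attributes the same input against two baselines. Everything here is an assumption made for illustration: a two-feature logistic model with made-up weights, an all-zeros baseline, and a stand-in "feature means" baseline.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical two-feature logistic model; weights are illustrative only
WEIGHTS = [1.5, -0.5]


def f(x):
    return sigmoid(sum(w * v for w, v in zip(WEIGHTS, x)))


def grad_f(x):
    p = f(x)
    return [w * p * (1.0 - p) for w in WEIGHTS]


def integrated_gradients(x, baseline, n_steps=200):
    # Midpoint Riemann sum along the straight line from baseline to x
    totals = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(v - b) * t / n_steps for v, b, t in zip(x, baseline, totals)]


x = [2.0, 1.0]
zero_baseline = [0.0, 0.0]
mean_baseline = [1.0, 1.0]  # stand-in for dataset feature means

ig_zero = integrated_gradients(x, zero_baseline)
ig_mean = integrated_gradients(x, mean_baseline)
print("zero baseline:", ig_zero)
print("mean baseline:", ig_mean)
```

Against the zero baseline the second feature receives a negative attribution; against the mean baseline it receives none, because the input already equals the baseline on that feature. Both results satisfy completeness relative to their own baseline, so neither is wrong; they answer different questions.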
Computing Integrated Gradients
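```python
import torch


def integrated_gradients(model, input_x, baseline_x, target=None, n_steps=300):
    """Riemann-sum approximation of IG for a single input.

    Assumes `model` accepts one input tensor and returns either a scalar
    or a vector of scores, in which case `target` selects the score to explain.
    """
    # Create interpolated inputs along the path (alphas broadcast over any input rank)
    alphas = torch.linspace(0, 1, n_steps, dtype=input_x.dtype, device=input_x.device)
    alphas = alphas.view(-1, *([1] * input_x.dim()))
    interpolated = baseline_x + alphas * (input_x - baseline_x)

    # Compute gradients at each interpolation step
    grads = []
    for interp in interpolated:
        interp = interp.detach().clone().requires_grad_(True)
        output = model(interp)
        score = output if target is None else output[target]  # must be a scalar
        grads.append(torch.autograd.grad(score, interp)[0])

    # Integrate: average gradients, scale by (input - baseline)
    avg_grads = torch.stack(grads).mean(dim=0)
    return (input_x - baseline_x) * avg_grads
```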
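A fixed step count such as `n_steps=300` is a rule of thumb. One way to choose it empirically is to use the completeness property as a convergence check: increase the number of steps until the attributions sum to F(x) - F(x') within a tolerance. The sketch below does this in plain Python; the tanh toy model, the starting step count, and the 1% tolerance are all assumptions for illustration.

```python
import math


def f(x):
    # Hypothetical smooth scalar model used only for illustration
    return math.tanh(x[0] + 2.0 * x[1])


def grad_f(x):
    s = 1.0 - math.tanh(x[0] + 2.0 * x[1]) ** 2
    return [s, 2.0 * s]


def integrated_gradients(x, baseline, n_steps):
    # Right-endpoint Riemann sum along the straight line from baseline to x
    totals = [0.0] * len(x)
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(v - b) * t / n_steps for v, b, t in zip(x, baseline, totals)]


def completeness_gap(x, baseline, n_steps):
    # |sum of attributions - (F(x) - F(x'))|; zero for the exact integral
    attributions = integrated_gradients(x, baseline, n_steps)
    return abs(sum(attributions) - (f(x) - f(baseline)))


x, baseline = [1.0, 0.5], [0.0, 0.0]
target_gap = 0.01 * abs(f(x) - f(baseline))  # 1% tolerance, an arbitrary choice
n_steps = 10
while completeness_gap(x, baseline, n_steps) > target_gap:
    n_steps *= 2  # double the step count until the gap is small enough
print(n_steps, completeness_gap(x, baseline, n_steps))
```

The same check applies to a real network: compute the gap from the model's outputs at the input and the baseline, and raise the step count if the attributions do not add up closely enough.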
Applications
- Medical Imaging: Attribute a cancer diagnosis to specific image regions, giving clinicians and reviewers localized evidence they can inspect.
- NLP Sentiment: Identify which words drove positive/negative classification — with completeness guarantees that simpler methods lack.
- Drug Discovery: Attribute molecular toxicity predictions to specific atoms — guiding medicinal chemists toward safer modifications.
- Code Generation: Identify which prompt tokens most influenced generated code — useful for prompt optimization.
Integrated Gradients vs. Other Attribution Methods
| Method | Sensitivity Axiom | Completeness | Baseline Required | Speed |
|---|---|---|---|---|
| Vanilla Gradient | Fails | No | No | Very fast |
| Gradient × Input | Partial | No | No | Very fast |
| Guided Backprop | Fails (faithless) | No | No | Fast |
| Integrated Gradients | Yes | Yes | Yes | Moderate |
| SHAP (KernelSHAP) | Yes | Yes | Yes | Slow |
| SHAP (GradientSHAP) | Approximate | Approximate | Yes | Moderate |
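The Completeness column can be illustrated on a small example. The sketch below compares three of the gradient-based rows on an assumed toy model F(x) = x_1 · x_2 with a zero baseline; the function names and input values are hypothetical, and the other methods in the table are not covered.

```python
def f(x):
    return x[0] * x[1]


def grad_f(x):
    # Hand-derived gradient of the toy model
    return [x[1], x[0]]


def vanilla_gradient(x):
    return grad_f(x)


def gradient_times_input(x):
    return [g * v for g, v in zip(grad_f(x), x)]


def integrated_gradients(x, baseline, n_steps=100):
    # Midpoint Riemann sum along the straight line from baseline to x
    totals = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(v - b) * t / n_steps for v, b, t in zip(x, baseline, totals)]


x, baseline = [3.0, 2.0], [0.0, 0.0]
delta = f(x) - f(baseline)  # output difference the attributions should explain

results = {
    "vanilla gradient": vanilla_gradient(x),
    "gradient x input": gradient_times_input(x),
    "integrated gradients": integrated_gradients(x, baseline),
}
for name, attr in results.items():
    print(f"{name}: {attr}, sum={sum(attr)}, F(x)-F(x')={delta}")
```

On this example only the integrated-gradients attributions sum to the output difference; gradient × input double counts the interaction between the two features.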
Integrated Gradients offers mathematical guarantees that many attribution methods lack: its attributions are implementation-invariant and sum to the change in the model's output relative to the baseline. Combined with a carefully chosen baseline, these properties give IG a rigorous footing for explaining neural networks deployed in medicine, law, and finance.