Interpretability and Explainability
Why Interpretability?
Understanding what models learn and why they make decisions is crucial for trust, debugging, and safety.
Interpretability Levels
| Level | What it Reveals |
|---|---|
| Global | Overall model behavior |
| Local | Individual prediction reasoning |
| Concept | High-level learned representations |
| Mechanistic | Specific circuits and algorithms |
Common Techniques
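Attention Visualization
See which tokens the model attends to:
from transformers import AutoModel, AutoTokenizer
# "bert-base-uncased" is only an example checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
# Get attention weights
outputs = model(**inputs, output_attentions=True)
# Tuple with one tensor per layer, each (batch, heads, seq_len, seq_len)
attentions = outputs.attentions
# Visualize with BertViz or similar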
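To make concrete what each attention matrix holds, here is a minimal sketch of scaled dot-product attention weights for a single head, written in plain Python. The query and key vectors are made-up toy values, and no causal mask is applied, so this is an illustration rather than any particular model's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return a seq_len x seq_len matrix; row i is how token i attends to every token."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    return weights

# Toy 2-dimensional query/key vectors for three tokens (made-up values)
tokens = ["the", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row sums to 1, which is why attention maps are usually drawn as one distribution per query token.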
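Feature Attribution
Which inputs influenced the output:
from captum.attr import IntegratedGradients
# `model` must accept embeddings as input here; for models that take token IDs,
# Captum's LayerIntegratedGradients on the embedding layer is the usual route.
ig = IntegratedGradients(model)
# The baseline defaults to all zeros when none is given.
attributions = ig.attribute(input_embeddings, target=output_class)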
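For intuition about what Integrated Gradients computes, the sketch below implements it by hand for a toy logistic model. The weights, input, all-zero baseline, and step count are assumptions chosen for illustration. The path integral is approximated with a midpoint Riemann sum, which is not necessarily the approximation Captum uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy logistic model with made-up weights: F(x) = sigmoid(w . x + b)
W = [2.0, -1.0, 0.5]
B = -0.25

def model(x):
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + B)

def gradient(x):
    """Analytic gradient of the toy model with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline)."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

print("attributions:", [round(a, 4) for a in attributions])
# Completeness: the attributions should sum to F(x) - F(baseline)
print("sum of attributions:", round(sum(attributions), 4))
print("F(x) - F(baseline):", round(model(x) - model(baseline), 4))
```

The last two printed values should agree closely. This completeness property is a useful sanity check on any Integrated Gradients result.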
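SHAP Values
Model-agnostic feature importance:
import shap
# shap.Explainer picks an algorithm based on the model type; many models
# also need background data or a masker as a second argument.
explainer = shap.Explainer(model)
shap_values = explainer(inputs)
# Waterfall plot for the first example
shap.plots.waterfall(shap_values[0])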
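SHAP values are Shapley values from cooperative game theory. The library estimates them efficiently, but with only a few features they can be computed exactly by enumerating coalitions. The sketch below does that for a made-up toy model. It assumes that "absent" features are replaced by a single baseline value, which is a simplification of the background-data averaging SHAP normally performs.

```python
from itertools import combinations
from math import factorial

def model(x):
    """Toy model with an interaction term (made up for illustration)."""
    return 3.0 * x[0] + 2.0 * x[1] * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition take the baseline value."""
    n = len(x)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(masked)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi += weight * (with_i - without_i)
        phis.append(phi)
    return phis

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phis = shapley_values(model, x, baseline)
print([round(p, 4) for p in phis])
```

Here the additive feature receives its full contribution of 3, while the credit of 12 for the interaction term is split evenly between the other two features. The values sum to `model(x) - model(baseline)`, which is the efficiency property that waterfall plots rely on.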
LLM-Specific Interpretability
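Logit Lens
See predictions at intermediate layers:
def logit_lens(model, input_ids, layer_num):
    # hidden_states[0] is the embedding output; entry i follows block i
    outputs = model(input_ids, output_hidden_states=True)
    hidden = outputs.hidden_states[layer_num]
    # Project to vocabulary. Most implementations first apply the model's
    # final layer norm (e.g. model.transformer.ln_f for GPT-2).
    logits = model.lm_head(hidden)
    return logits.argmax(-1)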
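The projection step can be illustrated without a real model. In this hedged sketch the vocabulary, the unembedding matrix, and the per-layer hidden states are all invented toy values for a single token position, and the final layer norm is omitted for brevity.

```python
# Toy vocabulary and unembedding matrix (one row per token, made-up values)
VOCAB = ["Paris", "London", "banana"]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

def project_to_vocab(hidden):
    """Logits are the unembedding matrix applied to the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]

def logit_lens(hidden_by_layer):
    """Return the top-scoring token at each layer for one position."""
    predictions = []
    for hidden in hidden_by_layer:
        logits = project_to_vocab(hidden)
        predictions.append(VOCAB[logits.index(max(logits))])
    return predictions

# Made-up hidden states for one position after layers 0, 1, and 2
hidden_by_layer = [[-0.5, -0.4], [0.2, 0.6], [1.5, 0.3]]
predictions = logit_lens(hidden_by_layer)
print(predictions)
```

Reading the predictions layer by layer shows how the model's guess at a position can shift as depth increases.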
Activation Patching
Test which components matter:
def patch_activation(model, clean_input, corrupt_input, layer, position):
    # get_activation and patch_hook are helper functions not defined here;
    # they could be built on PyTorch forward hooks, for example.
    # Run clean, get activation
    clean_activation = get_activation(model, clean_input, layer, position)
    # Run corrupt, patch with clean activation
    with patch_hook(model, layer, position, clean_activation):
        output = model(corrupt_input)
    return output
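The same procedure can be tried end to end on a tiny hand-written network. The weights and inputs below are made up, and the `patch` argument is a hypothetical stand-in for a hook mechanism. The sketch patches one hidden unit at a time and reports how much of the clean-versus-corrupt output gap each patch restores.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(matrix, v):
    return [sum(w * a for w, a in zip(row, v)) for row in matrix]

# Toy two-layer network with made-up weights: y = W2 . relu(W1 . x)
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [2.0, 0.1, -0.5]

def forward(x, patch=None):
    """Run the network; `patch` maps a hidden-unit index to an overriding value."""
    hidden = relu(matvec(W1, x))
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden

clean_x, corrupt_x = [1.0, 0.0], [0.0, 1.0]
clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch one hidden unit at a time with its clean-run value and measure
# the fraction of the clean-vs-corrupt output gap that is restored.
recovered = []
for unit in range(len(clean_hidden)):
    patched_out, _ = forward(corrupt_x, patch={unit: clean_hidden[unit]})
    recovered.append((patched_out - corrupt_out) / (clean_out - corrupt_out))

for unit, fraction in enumerate(recovered):
    print(f"unit {unit}: recovers {fraction:.2f} of the gap")
```

In this toy setup unit 0 accounts for essentially the whole effect, and unit 2 accounts for none because it takes the same value on both inputs.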
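Sparse Autoencoders
Learn interpretable features:
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()  # required before registering submodules
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        # ReLU keeps features non-negative; sparsity itself comes from an
        # L1 penalty on `features` in the training loss
        features = F.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return features, reconstruction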
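A forward pass alone does not make the features sparse. That pressure usually comes from the training objective. The sketch below shows one common form of that objective in plain Python: mean squared reconstruction error plus an L1 penalty on the feature activations. The coefficient `l1_coeff` and the example vectors are assumed values for illustration, not settings taken from any specific paper or library.

```python
def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    """Return (total, mse, l1) for one activation vector.

    mse: mean squared error between the input and its reconstruction
    l1:  sum of absolute feature activations (the sparsity penalty)
    """
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1, mse, l1

# Made-up example: a 3-dimensional activation encoded by 4 features
x = [1.0, -2.0, 0.5]
reconstruction = [0.9, -1.8, 0.5]
features = [0.0, 3.0, 0.0, 0.5]

total, mse, l1 = sae_loss(x, reconstruction, features)
print(f"mse={mse:.4f}  l1={l1:.2f}  total={total:.4f}")
```

Raising `l1_coeff` trades reconstruction quality for fewer active features, which is the main tuning decision when training a sparse autoencoder.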
Tools
| Tool | Focus |
|---|---|
| TransformerLens | Mechanistic interpretability |
| Captum | PyTorch attribution |
| SHAP | Feature importance |
| BertViz | Attention visualization |
| Neuroscope | Feature visualization |
Interpretability is an active research area with new methods emerging rapidly.