Interpretability and Explainability

Why Interpretability?

Understanding what a model has learned and why it makes a given decision is crucial for trust, debugging, and safety.

Interpretability Levels

| Level | What it Reveals |
|---|---|
| Global | Overall model behavior |
| Local | Individual prediction reasoning |
| Concept | High-level learned representations |
| Mechanistic | Specific circuits and algorithms |

Common Techniques

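Attention Visualization

See which tokens each attention head attends to:

from transformers import AutoModel, AutoTokenizer

# Load a model and tokenize a sentence ("bert-base-uncased" is only an example checkpoint)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
input_ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids

# Get attention weights
outputs = model(input_ids, output_attentions=True)
# Tuple with one tensor per layer, each shaped (batch, heads, seq_len, seq_len)
attentions = outputs.attentions

# Visualize with BertViz or similar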

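Feature Attribution

Estimate how much each input contributed to the output:

from captum.attr import IntegratedGradients

# model must accept the embedding tensor directly; when the model takes
# token IDs, LayerIntegratedGradients on the embedding layer is the usual route
ig = IntegratedGradients(model)
# The baseline defaults to all zeros; target selects the output index to explain
attributions = ig.attribute(input_embeddings, target=output_class)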
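
To make the idea behind integrated gradients concrete, here is a minimal pure-Python sketch on a toy model. It is not Captum's implementation: the `model`, its weights, the zero baseline, and the midpoint Riemann sum are all assumptions chosen for illustration. It accumulates gradients along the straight path from the baseline to the input and scales them by the input difference.

```python
import math

W = (1.5, -2.0, 0.5)  # hypothetical weights for the toy model


def model(x):
    """Toy differentiable model: sigmoid of a weighted sum."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of the toy model with respect to each input."""
    y = model(x)
    return [y * (1.0 - y) * w for w in W]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between baseline and input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    # Average gradient along the path, scaled by (input - baseline)
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline)
gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

The completeness check is a useful sanity test in practice: if the attributions do not sum to roughly the change in output, the number of integration steps is probably too small.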

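SHAP Values

Model-agnostic feature importance based on Shapley values:

import shap

# shap.Explainer selects an algorithm suited to the model type;
# some model types also require background data (a masker) as a second argument
explainer = shap.Explainer(model)
shap_values = explainer(inputs)
# Waterfall plot of per-feature contributions for the first example
shap.plots.waterfall(shap_values[0])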
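
The quantity that SHAP estimates can be computed exactly for a very small model. The sketch below is an illustration only, not the `shap` library's algorithm: the toy `model`, the zero baseline, and the choice to replace absent features with baseline values are all assumptions. It enumerates every coalition, which is exponential in the number of features and therefore only practical for tiny inputs.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Toy model with an interaction term between features 1 and 2."""
    return 2.0 * x[0] + x[1] * x[2] + 1.0


def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(masked)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability weight of a coalition of this size preceding feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

For this input the values come out to about 2, 3, and 3: the additive term is credited entirely to feature 0, and the interaction term is split evenly between features 1 and 2. Their sum equals `model(x) - model(baseline)`.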

LLM-Specific Interpretability

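Logit Lens

Decode intermediate-layer hidden states into vocabulary predictions:

def logit_lens(model, input_ids, layer_num):
    # hidden_states[0] is the embedding output; hidden_states[i] follows block i
    outputs = model(input_ids, output_hidden_states=True)
    hidden = outputs.hidden_states[layer_num]
    # Apply the final layer norm before projecting to the vocabulary.
    # transformer.ln_f is the GPT-2 attribute name and differs by architecture;
    # the last hidden_states entry is already normalized, so skip it there
    logits = model.lm_head(model.transformer.ln_f(hidden))
    return logits.argmax(-1)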

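Activation Patching

Test which components matter by copying an activation from a clean run into a corrupted run:

import torch

def patch_activation(model, clean_input, corrupt_input, layer, position):
    # `layer` is the module to patch; its output (or output[0]) is [batch, seq, d_model]
    cache = {}
    def save(mod, inp, out):
        cache["clean"] = (out[0] if isinstance(out, tuple) else out).detach().clone()
    def patch(mod, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        hidden[:, position] = cache["clean"][:, position]  # overwrite in place
    with torch.no_grad():
        handle = layer.register_forward_hook(save)   # run clean, cache activation
        model(clean_input)
        handle.remove()
        handle = layer.register_forward_hook(patch)  # run corrupt, patch with clean
        output = model(corrupt_input)
        handle.remove()
    return output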

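Sparse Autoencoders

Learn a sparse, typically overcomplete, set of interpretable features from model activations:

import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()  # required before registering submodules
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        # ReLU keeps feature activations non-negative; sparsity itself comes
        # from a penalty on `features` in the training loss
        features = F.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return features, reconstruction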
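
The forward pass above only yields sparse features if training pushes most activations toward zero. The source does not state a training objective; a common choice, assumed here, is a reconstruction error plus an L1 penalty on the feature activations, where $\lambda$ is a hypothetical sparsity coefficient, $f$ corresponds to `features`, and $\hat{x}$ corresponds to `reconstruction`:

```latex
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1,
\qquad
f = \mathrm{ReLU}(W_{\text{enc}}\, x + b_{\text{enc}}),
\quad
\hat{x} = W_{\text{dec}}\, f + b_{\text{dec}}
```

A larger $\lambda$ trades reconstruction accuracy for fewer active features per input.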

Tools

| Tool | Focus |
|---|---|
| TransformerLens | Mechanistic interpretability |
| Captum | PyTorch attribution |
| SHAP | Feature importance |
| BertViz | Attention visualization |
| Neuroscope | Feature visualization |

Interpretability is an active research area with new methods emerging rapidly.
