TCAV (Testing with Concept Activation Vectors) is a high-level explainability method that tests how much a neural network relies on human-interpretable concepts — going beyond pixel- or token-level attribution to reveal whether models use meaningful semantic concepts (stripes, wheels, medical symptoms) rather than arbitrary low-level patterns to make predictions.
What Is TCAV?
- Definition: An interpretability method that measures a model's sensitivity to a human-defined concept by learning a "Concept Activation Vector" (CAV) from concept examples and testing how strongly the model's predictions change when its internal activations are shifted along that concept direction.
- Publication: "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)" — Kim et al., Google Brain (ICML 2018).
- Core Question: Not "which pixels mattered?" but "does this model use the concept of stripes to classify zebras?"
- Input: A set of concept examples ("striped patterns"), a set of random non-concept examples, the model to explain, and a class of interest ("Zebra").
- Output: TCAV score (0–1) — how sensitive the model's prediction is to the concept direction.
Why TCAV Matters
- Human-Level Concepts: Pixel-level explanations (saliency maps) are unintuitive — "the model looked at these pixels" doesn't tell a domain expert whether the model uses relevant medical findings or spurious artifacts.
- Scientific Validation: Test whether AI systems use the same diagnostic concepts as expert humans — if a radiology model uses "mass with irregular border" (correct) vs. "image brightness" (spurious), TCAV distinguishes these.
- Bias Detection: Test whether models rely on protected concepts (skin tone, gender-coded features) rather than medically relevant findings.
- Model Comparison: Compare multiple models on the same concept — does Model A rely on "cellular morphology" more than Model B for cancer detection?
- Concept-Guided Debugging: If a model's TCAV score for a spurious concept is high, the training data likely has a spurious correlation that should be corrected.
How TCAV Works
Step 1 — Define a Human Concept:
- Collect 50–200 images/examples that clearly exhibit the concept (e.g., images of striped patterns, or medical images with a specific finding).
- Also collect random non-concept examples for contrast.
Step 2 — Learn the Concept Activation Vector (CAV):
- Run all concept and non-concept examples through the network.
- Extract activations at a chosen layer L for each example.
- Train a linear classifier (logistic regression) to distinguish concept vs. non-concept activations.
- The linear classifier's weight vector is the CAV — a direction in layer L's activation space corresponding to the concept.
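Step 2 can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the "activations" are fabricated 2-D vectors (in practice they come from a chosen layer of the real network), and the logistic regression is written out as plain gradient descent to keep the sketch self-contained.

```python
import numpy as np

# Toy sketch of CAV learning. Concept activations are shifted along
# dimension 0; random activations are centered at the origin. In a real
# setting these arrays would hold layer-L activations from the network.
rng = np.random.default_rng(0)
concept_acts = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))
random_acts = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)

# Logistic regression via plain gradient descent; the learned weight
# vector, normalized, is the CAV: a direction in activation space.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

cav = w / np.linalg.norm(w)
print(cav)  # points mostly along dimension 0, the synthetic "concept" axis
```

Because the concept only differs along dimension 0, the normalized weight vector ends up nearly axis-aligned with it.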
Step 3 — Compute TCAV Score:
- For a set of test images of class C (e.g., "Zebra"):
- Compute the directional derivative of the class prediction with respect to the CAV direction.
- TCAV score = fraction of test images where moving activations along the CAV direction increases class C probability.
- TCAV score ~0.5: concept irrelevant (random). TCAV score ~1.0: concept strongly drives prediction.
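The score computation above can be sketched as follows. This is again a toy: the class logit is a hand-made linear function of the activations (so its gradient is known in closed form), whereas in practice the gradient of the real class logit with respect to layer-L activations is obtained by backpropagation.

```python
import numpy as np

# Toy sketch of the TCAV score. For each test example, dot the gradient
# of the class logit (w.r.t. layer activations) with the CAV; the score
# is the fraction of examples where that directional derivative is > 0.
rng = np.random.default_rng(1)
cav = np.array([1.0, 0.0])     # concept direction (assumed, from Step 2)
head = np.array([0.8, 0.1])    # toy model head: logit(a) = a @ head

test_acts = rng.normal(size=(50, 2))  # activations for 50 "zebra" images
grads = np.tile(head, (50, 1))        # gradient of a linear logit is constant

directional = grads @ cav             # directional derivative per example
tcav_score = float(np.mean(directional > 0))
print(tcav_score)  # 1.0 here: every gradient aligns with the concept
```

With a linear head the gradient is identical for every example, so the score is exactly 1.0; with a real network the gradients vary per input and the score lands somewhere in [0, 1].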
Step 4 — Statistical Significance Testing:
- Train multiple CAVs for the real concept (using different random counter-example sets) and multiple CAVs from purely random example sets.
- Run a two-sided t-test: are the real-concept TCAV scores significantly different from the random-CAV scores?
- Only report concepts with statistically significant TCAV scores.
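The significance check can be sketched with numpy alone. The score samples below are synthetic stand-ins for scores from repeated CAV trainings, and the t-statistic is Welch's two-sample form written out by hand (in practice one would use a library routine and a proper p-value rather than a fixed threshold).

```python
import numpy as np

# Toy sketch of Step 4: compare real-concept TCAV scores against scores
# from random CAVs. The samples here are fabricated; real ones come from
# retraining the CAV many times.
rng = np.random.default_rng(2)
real_scores = rng.normal(0.85, 0.03, size=20)    # stand-in concept scores
random_scores = rng.normal(0.50, 0.05, size=20)  # random-CAV baseline

def welch_t(a, b):
    """Welch's two-sample t-statistic, written out with numpy."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

t = welch_t(real_scores, random_scores)
significant = abs(t) > 2.0  # rough cutoff standing in for a full p-value
print(t, significant)
```

A concept whose scores are indistinguishable from the random baseline fails this test and would not be reported.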
TCAV Discoveries
- Medical AI: A diabetic retinopathy model had high TCAV scores for "microaneurysm" (correct) and also for "image artifacts from specific camera model" (spurious) — revealing a camera-correlated bias.
- ImageNet Models: Models classify "doctor" using "stethoscope" concept (appropriate) and "white coat" concept (appropriate) but also "gender cues" concept (biased).
- Inception Classification: Zebra classification has very high TCAV score for "stripes" — confirming the model uses semantically meaningful features.
Concept Types
| Concept Type | Examples | Discovery Method |
|-------------|----------|-----------------|
| Visual texture | Stripes, dots, roughness | Curated image sets |
| Clinical findings | Microaneurysm, mass shape | Expert-labeled medical images |
| Demographic attributes | Skin tone, gender presentation | Controlled image sets |
| Semantic categories | "Outdoors", "people", "text" | Web images by category |
| Model-discovered | Via dimensionality reduction | Automated concept extraction |
Automated Concept-based Explanations (ACE):
- Extension of TCAV that automatically discovers concepts without human curation.
- Cluster image patches by similarity in activation space; each cluster becomes a candidate concept.
- Run TCAV with automatically discovered clusters to find high-importance concepts.
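The clustering step at the heart of this idea can be sketched with a minimal k-means over synthetic "patch activations" (names like `patch_acts` are illustrative, and real ACE segments images into patches and clusters their network activations):

```python
import numpy as np

# Toy sketch of ACE-style concept discovery: cluster patch activations;
# each resulting cluster is a candidate concept to score with TCAV.
# Two well-separated synthetic groups stand in for real activations.
rng = np.random.default_rng(3)
patch_acts = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(30, 2)),  # e.g. "background" patches
    rng.normal([4.0, 4.0], 0.3, size=(30, 2)),  # e.g. "stripe" patches
])

# Minimal k-means with k=2, seeded with one point from each group.
centers = np.vstack([patch_acts[0], patch_acts[30]])
for _ in range(10):
    dists = np.linalg.norm(patch_acts[:, None] - centers[None], axis=2)
    labels = np.argmin(dists, axis=1)
    centers = np.array([patch_acts[labels == k].mean(axis=0)
                        for k in range(2)])
print(centers)  # two candidate-concept centroids, near (0,0) and (4,4)
```

Each centroid's member patches would then be treated as a concept example set and fed back through Steps 2-4 to get a TCAV score.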
TCAV vs. Other Explanation Methods
| Method | Explanation Level | Human-Defined? | Causal? |
|--------|------------------|----------------|---------|
| Saliency Maps | Pixel | No | No |
| LIME | Feature | No | No |
| SHAP | Feature | No | No |
| Integrated Gradients | Pixel/token | No | No |
| TCAV | Concept | Yes | Approximate |
TCAV is the explanation method that speaks the language of domain experts. By testing whether AI systems use the same semantic concepts that radiologists, biologists, and engineers use to reason about their domains, it bridges the gap between machine activation patterns and human conceptual understanding, enabling experts to validate AI reasoning at the level of domain knowledge rather than raw pixel statistics.