Counterfactual fairness is a fairness criterion based on causal reasoning: it requires a model's prediction for an individual to remain the same in a counterfactual world where their protected attribute (race, gender, age) had been different. This yields one of the strongest individual-level fairness guarantees, asking in effect: "would this person have received the same decision if they had been a different race or gender, with everything causally downstream of that attribute adjusted appropriately?"
What Is Counterfactual Fairness?
- Definition: A prediction Ŷ is counterfactually fair if, for every outcome y and every alternative attribute value b, P(Ŷ_{A←a} = y | X=x, A=a) = P(Ŷ_{A←b} = y | X=x, A=a): the distribution of the prediction is unchanged in the counterfactual world where the individual's protected attribute had been different. A toy numerical check of this definition appears after this list.
- Core Framework: Uses causal models (structural equation models) to reason about what would change if a protected attribute were different.
- Key Innovation: Goes beyond statistical correlation to causal reasoning about fairness.
- Origin: Kusner et al. (2017), "Counterfactual Fairness," NeurIPS.
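To make the definition concrete, here is a minimal numerical sketch on a hand-specified linear structural equation model. The variables, coefficients, and helper functions are illustrative assumptions, not taken from Kusner et al.:

```python
# Toy linear SCM (all names and coefficients are illustrative assumptions):
#   A                          protected attribute (0 or 1)
#   U                          latent exogenous factor, e.g. underlying ability
#   X = 1.5*U + 0.8*A + eps    observed feature, causally affected by A
#   Yhat = predict(X)          prediction computed from the feature

def gen_x(a, u, eps):
    """Structural equation for the observed feature X."""
    return 1.5 * u + 0.8 * a + eps

def predict(x):
    """A simple predictor that uses only X."""
    return 2.0 * x

# One individual: factual A = 1, with (abducted) latents U and eps held fixed.
u, eps = 0.3, 0.1
x_factual = gen_x(1, u, eps)   # factual world, A = 1
x_counter = gen_x(0, u, eps)   # counterfactual world, A <- 0, same U and eps

print(predict(x_factual))  # 2.7
print(predict(x_counter))  # 1.1
# The two predictions differ, so this predictor is NOT counterfactually fair:
# flipping A propagates through X and changes the prediction. A predictor that
# used only U, a non-descendant of A, would satisfy the definition.
```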
Why Counterfactual Fairness Matters
- Individual Justice: Evaluates fairness at the individual level, not just across groups.
- Causal Reasoning: Distinguishes between legitimate and illegitimate influences of protected attributes.
- Path-Specific: Can identify which causal pathways from protected attributes to outcomes are fair and which are discriminatory.
- Intuitive Appeal: "Would the decision change if this person were a different race?" is naturally compelling.
- Legal Alignment: Closely matches legal concepts of "but-for" causation in discrimination law.
How Counterfactual Fairness Works
| Step | Action | Purpose |
|------|--------|---------|
| 1. Causal Model | Define causal graph relating attributes, features, and outcomes | Map relationships |
| 2. Identify Paths | Trace causal paths from protected attribute to prediction | Find influence channels |
| 3. Counterfactual | Compute prediction with protected attribute changed | Test fairness |
| 4. Compare | Check if prediction changes across counterfactuals | Measure unfairness |
| 5. Intervene | Modify model to equalize counterfactual predictions | Enforce fairness |
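A hedged end-to-end sketch of these five steps on simulated data follows. The causal graph (A → X ← U, X → Y), the least-squares fitting, and the assumption that the latent U is directly available for demonstration are all simplifications; in practice U must itself be inferred from a probabilistic causal model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Step 1 (causal model): assume the graph A -> X <- U, X -> Y, with U latent.
# We simulate data from it; in practice the graph comes from domain knowledge.
n = 5_000
A = rng.integers(0, 2, size=n).astype(float)
U = rng.normal(size=n)
X = 1.5 * U + 0.8 * A + 0.1 * rng.normal(size=n)
Y = 2.0 * X + 0.1 * rng.normal(size=n)

# Step 2 (identify paths): A reaches the prediction only through X here.
# Fit the structural equation for X so we can abduct its noise and intervene.
sem_x = LinearRegression().fit(np.column_stack([A, U]), X)

# A naive predictor trained on X, a causal descendant of A.
naive = LinearRegression().fit(X.reshape(-1, 1), Y)

def counterfactual_prediction(x, a, u, a_new):
    """Abduction-action-prediction for one individual."""
    eps = x - sem_x.predict([[a, u]])[0]           # abduction: recover noise
    x_cf = sem_x.predict([[a_new, u]])[0] + eps    # action: set A <- a_new
    return naive.predict([[x_cf]])[0]              # prediction: re-run model

# Steps 3-4 (counterfactual + compare) for one individual.
x0, a0, u0 = X[0], A[0], U[0]
factual = naive.predict([[x0]])[0]
counter = counterfactual_prediction(x0, a0, u0, 1.0 - a0)
print(factual, counter)  # these differ: `naive` is not counterfactually fair

# Step 5 (intervene): train only on U, a non-descendant of A, as Kusner et al.
# propose. Its prediction is invariant to interventions on A by construction.
fair = LinearRegression().fit(U.reshape(-1, 1), Y)
print(fair.predict([[u0]])[0])
```

Training only on non-descendants of A typically costs predictive accuracy, since any signal flowing through descendants of A is discarded; that trade-off is the price of the guarantee.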
Causal Pathways
- Direct Path: Protected attribute → Prediction (always unfair).
- Indirect Path via Proxy: Protected attribute → ZIP code → Prediction (typically unfair).
- Legitimate Path: Protected attribute → Qualification → Prediction (context-dependent).
- Resolving Path: Protected attribute → Effort → Achievement → Prediction (arguably fair); the sketch below shows how paths like these can be enumerated and labeled from a causal graph.
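To make this taxonomy operational, the sketch below enumerates all causal paths from the protected attribute to the prediction in a small illustrative graph and labels each one. The graph structure, the `networkx` representation, and the choice of "resolving" variables are assumptions for the example:

```python
import networkx as nx

# Illustrative causal graph (the edges are assumptions for this example).
G = nx.DiGraph([
    ("race", "prediction"),                                      # direct
    ("race", "zip_code"), ("zip_code", "prediction"),            # via proxy
    ("race", "qualification"), ("qualification", "prediction"),  # via mediator
])

RESOLVING = {"qualification"}  # mediators deemed legitimate in this context

for path in nx.all_simple_paths(G, source="race", target="prediction"):
    mediators = set(path[1:-1])
    if not mediators:
        label = "direct path: unfair"
    elif mediators & RESOLVING:
        label = "resolving path: context-dependent"
    else:
        label = "proxy path: typically unfair"
    print(" -> ".join(path), "|", label)
```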
Advantages Over Statistical Fairness
- Individual-Level: Evaluates fairness for each person, not just group averages.
- Causal Clarity: Distinguishes legitimate from illegitimate feature influences.
- Handles Proxies: Identifies and addresses proxy discrimination through causal paths.
- Compositional: Can allow some causal paths while blocking others.
Limitations
- Causal Model Required: Requires specifying a causal graph, which may be contested or unknown.
- Counterfactual Identity: "What would this person be like as a different race?" is philosophically complex.
- Computational Cost: Computing counterfactuals through structural equation models is expensive.
- Sensitivity: Results depend heavily on the assumed causal structure.
Counterfactual fairness is among the most principled approaches to individual-level algorithmic fairness. By grounding fairness in causal reasoning rather than statistical correlation, it offers an intuitive guarantee: the decision an individual receives would have been the same in a counterfactual world where their protected attribute was different.