Membership inference is a privacy attack that determines whether a specific data example was used in a machine learning model's training set. It exploits differences in how models behave on data they were trained on versus data they have never seen, posing a significant privacy risk for models trained on sensitive data.
How Membership Inference Works
- Key Insight: Models tend to be more confident on training data than on unseen data: they assign higher probability to the true label, incur lower loss, and produce sharper output distributions on examples they have memorized.
- Attack Setup: The attacker has access to the model's output (predictions, probabilities, or confidence scores) and wants to determine if a specific example was in the training set.
- Threshold Method: Compare the model's loss or confidence on the target example against a calibrated threshold: loss below the threshold (equivalently, confidence above it) → likely a training member. A minimal sketch follows this list.
- Shadow Model Method: Train multiple "shadow" models on datasets the attacker controls, observe their behavior on members vs. non-members, and train a binary classifier to distinguish the two (see the second sketch below).
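To make the threshold method concrete, here is a minimal sketch in Python with NumPy. The threshold value and the helper names (`cross_entropy_loss`, `loss_threshold_attack`) are illustrative assumptions rather than a reference implementation; in practice the attacker calibrates the threshold on examples known to be non-members.

```python
import numpy as np

def cross_entropy_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-example cross-entropy loss from predicted class probabilities."""
    eps = 1e-12  # avoid log(0)
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def loss_threshold_attack(probs: np.ndarray, labels: np.ndarray,
                          threshold: float) -> np.ndarray:
    """Predict membership: loss below the threshold -> likely a member."""
    return cross_entropy_loss(probs, labels) < threshold

# Toy example: the confident, low-loss prediction is flagged as a member.
probs = np.array([[0.95, 0.03, 0.02],   # confident -> low loss
                  [0.40, 0.35, 0.25]])  # uncertain -> high loss
labels = np.array([0, 0])
print(loss_threshold_attack(probs, labels, threshold=0.5))  # [ True False]
```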
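And a compressed sketch of the shadow model method, assuming scikit-learn and a synthetic classification task as stand-ins for the attacker's data and model architecture. The attack features (top confidence plus true-label probability), the four disjoint shadow splits, and the logistic-regression attack classifier are illustrative simplifications of the full pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

def attack_features(model, X, y):
    """Features the attack classifier sees: top confidence, true-label prob."""
    probs = model.predict_proba(X)
    return np.column_stack([probs.max(axis=1),
                            probs[np.arange(len(y)), y]])

feats, membership = [], []
for s in range(4):  # a handful of shadow models on disjoint splits
    idx = rng.permutation(len(X))[:1000]
    train_idx, out_idx = idx[:500], idx[500:]
    shadow = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                           random_state=s).fit(X[train_idx], y[train_idx])
    feats.append(attack_features(shadow, X[train_idx], y[train_idx]))
    membership.append(np.ones(len(train_idx)))   # members of shadow training
    feats.append(attack_features(shadow, X[out_idx], y[out_idx]))
    membership.append(np.zeros(len(out_idx)))    # held-out non-members

attack = LogisticRegression().fit(np.vstack(feats), np.concatenate(membership))
# The attack classifier is then applied to the target model's outputs
# on the example whose membership the attacker wants to infer.
```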
Attack Scenarios
- Healthcare: Determine if a patient's medical record was used to train a diagnostic model (revealing the patient's relationship with a medical institution).
- Legal: Provide evidence that copyrighted content was used for training without authorization (membership inference yields statistical evidence, not definitive proof).
- LLMs: Determine if specific text passages appear in the training data of GPT-4, Llama, or other models.
Defenses
- Differential Privacy: Add calibrated noise during training (as in DP-SGD: clip each per-example gradient, then add Gaussian noise) to bound the information any single example can leak; a sketch follows this list.
- Regularization: Dropout, weight decay, and early stopping reduce overfitting, which reduces the membership signal.
- Output Perturbation: Add noise to confidence scores or round probabilities before returning them.
- Temperature Scaling: Smooth output distributions to reduce the gap between member and non-member confidence (this and output perturbation are combined in the second sketch below).
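A minimal sketch of the differential privacy defense in the style of DP-SGD, assuming full-batch binary logistic regression in NumPy; the function name `dp_sgd_logreg` and the hyperparameter defaults are illustrative. A real deployment would use a purpose-built library (e.g., Opacus for PyTorch) and track the resulting privacy budget.

```python
import numpy as np

def dp_sgd_logreg(X, y, epochs=20, lr=0.1, clip=1.0, noise_mult=1.0, seed=0):
    """DP-SGD sketch: clip each per-example gradient, then add Gaussian noise.

    `clip` bounds any one example's influence on the update;
    `noise_mult` scales the noise relative to the clipping norm.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid
        grads = (preds - y)[:, None] * X                # per-example gradients
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip)   # clip to norm <= clip
        noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
        w -= lr * (grads.sum(axis=0) + noise) / len(X)  # noisy averaged update
    return w
```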
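Output perturbation and temperature scaling operate purely on the model's outputs, so both fit in a few lines. This sketch assumes NumPy; the noise scale, rounding precision, and temperature are illustrative choices that trade prediction utility for privacy.

```python
import numpy as np

def temperature_scale(logits: np.ndarray, T: float = 2.0) -> np.ndarray:
    """Softmax at temperature T > 1 flattens the output distribution."""
    z = logits / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def perturb_and_round(probs: np.ndarray, noise_scale: float = 0.05,
                      decimals: int = 2, seed: int = 0) -> np.ndarray:
    """Add noise and coarsely round scores before returning them."""
    rng = np.random.default_rng(seed)
    noisy = probs + rng.normal(0.0, noise_scale, size=probs.shape)
    noisy = np.clip(noisy, 0.0, None)
    noisy /= noisy.sum(axis=-1, keepdims=True)  # re-normalize to a distribution
    return np.round(noisy, decimals)

logits = np.array([[4.0, 1.0, 0.5]])
print(temperature_scale(logits, T=1.0))  # sharp: ~[0.93, 0.05, 0.03]
print(perturb_and_round(temperature_scale(logits, T=3.0)))  # smoothed + noisy
```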
Why It Matters
Membership inference demonstrates that simply training a model on data — without explicitly releasing that data — can still leak information about individual training examples. This is a fundamental challenge for privacy-preserving machine learning.