Differential Privacy (DP)

Keywords: differential privacy, DP, noise

Differential Privacy (DP) is the mathematical framework that provides a formal, quantifiable guarantee: an algorithm's output reveals negligibly different information whether or not any individual's data is included in the computation. It enables statistical analysis, model training, and data publishing with provable privacy protection, and has become the gold-standard privacy technology, adopted by Apple, Google, Microsoft, and the U.S. Census Bureau.

What Is Differential Privacy?

- Definition: A randomized algorithm M satisfies (ε, δ)-differential privacy if for all datasets D and D' differing in one record, and for all sets of outputs S:
P(M(D) ∈ S) ≤ e^ε × P(M(D') ∈ S) + δ
- Intuition: The probability distribution of outputs is nearly identical whether or not any individual's record is included — an adversary observing the output cannot determine with high confidence whether a specific person participated.
- Privacy Budget ε: The privacy-loss parameter; smaller ε means stronger privacy. ε=0 is perfect privacy (the output carries no information about any individual); ε=∞ offers no guarantee. Practical values range from ε=0.1 (strong) to ε=10 (weak, but often necessary for useful ML).
- δ (Failure Probability): The probability that the ε bound is violated. Typically set well below 1/n (e.g., 1/n²), where n is the dataset size. Pure DP: δ=0; approximate DP: δ>0.
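To make the inequality concrete, the following minimal Python sketch estimates both sides of the ε-DP bound for a noisy counting query on two neighboring datasets. The counts, the threshold event S, and the use of Laplace noise (defined in the mechanisms section below) are illustrative assumptions, not prescribed by the definition itself:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
samples = 1_000_000

# f(D) = 100 and f(D') = 101: neighboring datasets whose counts differ by one.
out_D = 100 + rng.laplace(0.0, 1.0 / eps, samples)   # Laplace scale = Δf/ε = 1/ε
out_Dp = 101 + rng.laplace(0.0, 1.0 / eps, samples)

# Pick an arbitrary output event S = {output > 102} and check both directions.
p_D = (out_D > 102).mean()
p_Dp = (out_Dp > 102).mean()
print(f"P(M(D) in S)  = {p_D:.4f} <= e^eps * P(M(D') in S) = {np.exp(eps) * p_Dp:.4f}")
print(f"P(M(D') in S) = {p_Dp:.4f} <= e^eps * P(M(D) in S)  = {np.exp(eps) * p_D:.4f}")
```

Up to Monte Carlo error, both directions of the inequality hold, and one of them is tight, which is why Laplace noise with scale Δf/ε is said to be exactly calibrated to ε.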

Why Differential Privacy Matters

- Legal Compliance: GDPR, CCPA, and emerging AI regulations increasingly recognize differential privacy as a best-practice technique for privacy-preserving data analysis; differentially private aggregate statistics can support the argument that released data is no longer personal data.
- Census Protection: The U.S. Census Bureau deployed DP for the 2020 Census, adding calibrated noise to defeat database-reconstruction attacks; the Bureau's own simulated attack on the 2010 Census reconstructed records and confirmed re-identification for roughly 17% of the population.
- Mobile Data Collection: Apple uses DP for emoji frequency, Health app data, and keyboard autocorrect improvements — collecting aggregate statistics without seeing individual user data.
- Federated Learning: Google uses DP-SGD in Gboard (next-word prediction) and other on-device ML — each client's gradient contribution is DP-protected before aggregation.
- Medical Research: DP lets hospital networks compute joint statistics without sharing patient records, supporting studies that strict HIPAA data-sharing rules would otherwise make impossible.

The Fundamental Mechanisms

Laplace Mechanism (for numeric queries):
- For a query f(D) with sensitivity Δf = max |f(D) - f(D')| over neighboring datasets D, D' (the most any single record can change the answer):
- M(D) = f(D) + Laplace(0, Δf/ε), i.e., add Laplace noise with scale Δf/ε.
- Result satisfies ε-DP.
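As a concrete illustration, here is a minimal Python sketch of the Laplace mechanism for a counting query (Δf = 1, since adding or removing one record changes a count by at most one); the toy dataset and predicate are assumptions for the example:

```python
import numpy as np

def laplace_count(data, predicate, eps, rng):
    """Release a counting query under ε-DP (counts have sensitivity Δf = 1)."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)  # scale = Δf/ε

rng = np.random.default_rng(0)
ages = [34, 29, 41, 58, 23, 47, 31]  # toy dataset, purely illustrative
print(f"noisy 40+ count: {laplace_count(ages, lambda a: a >= 40, 0.5, rng):.2f}")
```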

Gaussian Mechanism (for approximate DP):
- M(D) = f(D) + N(0, σ²) where σ = Δf √(2 ln(1.25/δ)) / ε, with Δf the L2 sensitivity (this classical calibration assumes ε < 1).
- Satisfies (ε, δ)-DP.
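A minimal sketch of this calibration in Python, assuming an illustrative bounded-mean query (replacing one of n values in [0, 1] moves the mean by at most 1/n):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng):
    """(ε, δ)-DP release via the classical calibration (assumes ε < 1)."""
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
# Mean of n = 1000 values bounded in [0, 1]: L2 sensitivity is 1/1000.
print(gaussian_mechanism(0.42, 1 / 1000, eps=0.9, delta=1e-6, rng=rng))
```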

Randomized Response (for local DP):
- Each user reports their true value with probability p = e^ε/(e^ε+1) and the flipped value otherwise (for a binary attribute).
- Enables local privacy — server never sees true individual responses.
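Below is a minimal Python sketch of binary randomized response with server-side debiasing; the attribute frequency and population size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0
p = np.exp(eps) / (np.exp(eps) + 1)        # probability of reporting truthfully

true_bits = (rng.random(100_000) < 0.3).astype(int)   # 30% hold the attribute
keep = rng.random(true_bits.size) < p
reports = np.where(keep, true_bits, 1 - true_bits)    # each user's ε-LDP report

# Server-side debiasing: E[report] = (1 - p) + f * (2p - 1), so solve for the
# true frequency f. The server never sees any individual's true bit.
f_hat = (reports.mean() - (1 - p)) / (2 * p - 1)
print(f"estimated frequency: {f_hat:.3f} (true: {true_bits.mean():.3f})")
```

The estimator is unbiased, and its error shrinks as the population grows, which is why local DP works well for large-scale telemetry but poorly for small cohorts.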

DP-SGD (for Machine Learning):
- Abadi et al. (2016) "Deep Learning with Differential Privacy" — extends DP to neural network training.
- For each mini-batch:
1. Compute per-example gradients g_i.
2. Clip: g_i ← g_i / max(1, ||g_i||₂/C) — bound L2 sensitivity.
3. Sum clipped gradients and add Gaussian noise: G = Σg_i + N(0, σ²C²I).
4. Update: θ ← θ - lr × G/|batch|.
- Privacy accounting: Track cumulative privacy loss ε across all training steps using moments accountant or RDP accountant.
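The following numpy sketch implements one DP-SGD step for logistic regression, making the four operations explicit. The model, clip norm C, and noise multiplier σ are illustrative assumptions; production systems would instead use a library such as Opacus or TensorFlow Privacy, which also handle the privacy accounting:

```python
import numpy as np

def dp_sgd_step(theta, X, y, C=1.0, noise_mult=1.0, lr=0.1, rng=None):
    """One DP-SGD step for logistic regression (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # 1. Per-example gradients of the logistic loss: g_i = (p_i - y_i) * x_i.
    preds = 1 / (1 + np.exp(-X @ theta))
    grads = (preds - y)[:, None] * X                     # shape: (batch, dim)
    # 2. Clip each gradient to L2 norm at most C (bounds the sensitivity).
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / C)
    # 3. Sum clipped gradients and add Gaussian noise with std σC per coordinate.
    noisy_sum = grads.sum(axis=0) + rng.normal(0, noise_mult * C, theta.shape)
    # 4. Average over the batch and take the gradient step. The ε consumed by
    #    this step is tracked separately by a privacy accountant.
    return theta - lr * noisy_sum / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = (rng.random(64) < 0.5).astype(float)
theta = dp_sgd_step(np.zeros(5), X, y, rng=rng)
```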

Privacy-Utility Trade-off

| Application | ε Used | Utility Cost |
|-------------|--------|-------------|
| Census (U.S. 2020) | 17.14 (total) | <5% accuracy loss on aggregate statistics |
| Apple Emoji (Local DP) | 4 | Moderate |
| Google Gboard | ~8-10 | Small |
| Medical ML (DP-SGD) | 1-3 | 5-15% accuracy loss |
| Strong ML privacy | ε<1 | 20-40% accuracy loss |

The privacy-utility trade-off is fundamental: smaller ε requires more noise, which yields less accurate models. Current DP-SGD models on CIFAR-10 achieve ~85% accuracy at ε=3 vs ~95% without DP.

Composition Theorems

Running M₁ and M₂ on the same dataset:
- Basic composition: (ε₁+ε₂, δ₁+δ₂)-DP.
- Advanced composition: Total privacy loss grows roughly with the square root of the number of mechanisms; tighter bounds still come from the moments accountant (MA), Rényi DP (RDP), or zero-concentrated DP (zCDP).
- Subsampling amplification: If M is (ε,δ)-DP, running M on a random subsample of fraction q gives approximately (qε, qδ)-DP when ε is small; privacy is amplified by subsampling (see the arithmetic sketch after this list).
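A minimal Python sketch of the budget arithmetic, using basic composition and the standard Poisson-subsampling amplification bound ε' = ln(1 + q(e^ε − 1)); the per-step budgets and sampling rate are illustrative:

```python
import math

def basic_composition(eps_list, delta_list):
    """Worst-case budget for running several mechanisms on the same data."""
    return sum(eps_list), sum(delta_list)

def subsampled_eps(eps, q):
    """Amplification bound for Poisson subsampling at rate q."""
    return math.log(1 + q * (math.exp(eps) - 1))

print(basic_composition([0.5, 0.5, 1.0], [1e-6, 1e-6, 1e-6]))  # ~ (2.0, 3e-06)
for eps in (0.1, 1.0, 5.0):
    print(f"eps={eps}: amplified={subsampled_eps(eps, 0.01):.5f}, "
          f"q*eps={0.01 * eps:.5f}")
```

Note how the qε approximation tracks the exact bound only for small ε; at ε=5 with q=0.01 the amplified loss is more than an order of magnitude larger than qε.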

Differential privacy is the mathematical guarantee that converts privacy from a vague aspiration into an engineering specification. By defining privacy loss as a precisely measurable quantity, DP lets organizations make explicit, auditable commitments about how much any individual's data can influence a computation's output, transforming privacy from a legal compliance checkbox into a rigorous engineering constraint.
