Model stealing

Keywords: model stealing, privacy

Model stealing (also called model extraction) is an adversarial attack that reconstructs a functional replica of a proprietary machine learning model by systematically querying its prediction API. Through carefully designed input queries and the observed outputs, the attacker builds a substitute model that approximates the target's decision boundaries, architecture, or parameters. This threatens intellectual property rights, enables cheaper adversarial attack generation, and undermines model watermarking and access-control revenue models.

Why Model Stealing Matters

Training large ML models costs millions of dollars in compute and months of engineering effort. Model APIs represent significant IP:
- OpenAI's GPT-4: estimated $78M+ training cost
- Google's Gemini: comparable scale
- Custom enterprise models: years of domain-specific data collection and fine-tuning

Model stealing attacks allow competitors to approximate this capability without the training cost, potentially:
- Violating terms of service and IP laws
- Bypassing access controls and rate limiting through bulk queries
- Creating "oracle" attacks — using the stolen model as a white-box stand-in for black-box adversarial attacks
- Extracting proprietary training data signals embedded in model behavior

Attack Categories

Equation-solving attacks (Tramer et al., 2016): For simple models (logistic regression, SVMs), the decision boundary is determined by a small number of parameters. Strategic queries near decision boundaries extract these parameters directly.

For a d-dimensional linear model, d+1 equations (obtained from d+1 strategic queries) uniquely determine all d weights and the bias, so complete extraction is possible with a minimal number of queries.
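
To make the counting argument concrete, here is a minimal sketch of equation-solving extraction against a simulated logistic-regression oracle. The `query_api` function and the chosen query points are illustrative stand-ins for a real prediction API, not anyone's actual implementation.

```python
# Equation-solving extraction sketch: recover (w, b) of a logistic regression
# oracle from d+1 queries, in the spirit of Tramer et al. (2016).
import numpy as np

rng = np.random.default_rng(0)
d = 5
true_w, true_b = rng.normal(size=d), 0.7          # the API's secret parameters

def query_api(x):
    """Black-box oracle: returns the positive-class probability for x."""
    return 1.0 / (1.0 + np.exp(-(true_w @ x + true_b)))

# d+1 linearly independent queries give d+1 equations in (w, b).
X = np.vstack([np.eye(d), np.zeros(d)])           # the d unit vectors plus the origin
logits = np.array([np.log(p / (1 - p)) for p in (query_api(x) for x in X)])

# Each logit equals w @ x + b, so solve the augmented linear system exactly.
A = np.hstack([X, np.ones((d + 1, 1))])
params = np.linalg.solve(A, logits)
stolen_w, stolen_b = params[:-1], params[-1]

print(np.allclose(stolen_w, true_w), np.isclose(stolen_b, true_b))  # True True
```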

Model distillation attacks: Query the target API to generate a large synthetic labeled dataset, then train a local substitute model using standard supervised learning:
1. Design query distribution (uniform random, adaptive sampling near boundaries, natural inputs)
2. Submit queries to target API, collect probability distributions (soft labels)
3. Train substitute model on (query, soft label) pairs using knowledge distillation
4. Iterate: use current substitute model to identify high-information query regions

Soft probability outputs (rather than hard labels) dramatically accelerate extraction — they contain richer information about the target's decision surface per query.
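
A minimal sketch of this loop (steps 1 to 3, without the adaptive iteration of step 4) is shown below, assuming a simulated two-class target standing in for the remote API and a logistic-regression substitute trained with plain gradient descent on the soft labels; all function and variable names are illustrative.

```python
# Distillation-style extraction sketch: query a black-box target, collect soft
# labels, and fit a local substitute to them.
import numpy as np

rng = np.random.default_rng(1)
d = 10

# Simulated target: a small fixed MLP acting as the black-box API.
W1, b1 = rng.normal(size=(d, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=16), 0.0

def query_api(X):
    """Return soft labels (positive-class probabilities) for a query batch."""
    h = np.tanh(X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

# Steps 1-2: design a query distribution and collect soft labels.
X_q = rng.normal(size=(5000, d))          # simple Gaussian query design
y_soft = query_api(X_q)                   # probabilities, not hard labels

# Step 3: train a logistic-regression substitute on (query, soft label) pairs.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X_q @ w + b)))
    grad = p - y_soft                      # gradient of soft-label cross-entropy
    w -= lr * (X_q.T @ grad) / len(X_q)
    b -= lr * grad.mean()

# Hard-label agreement between substitute and target on held-out inputs.
X_test = rng.normal(size=(2000, d))
agree = np.mean((query_api(X_test) > 0.5) == ((X_test @ w + b) > 0))
print(f"hard-label agreement: {agree:.2f}")
```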

Active learning attacks: Use uncertainty sampling to intelligently select query points that maximize information about the decision boundary, minimizing the number of API calls required for a given approximation quality.
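
A sketch of the query-selection step, under the assumption of a binary classifier and a pool-based setting; `select_queries` and the candidate pool are illustrative names, not a standard API.

```python
# Uncertainty-sampling query selection: from a candidate pool, pick the points
# the current substitute is least certain about and spend API queries on those.
import numpy as np

def select_queries(pool, substitute_proba, budget):
    """Return the `budget` pool points closest to the substitute's decision boundary."""
    p = substitute_proba(pool)                 # substitute's predicted P(y=1)
    uncertainty = -np.abs(p - 0.5)             # larger = closer to the boundary
    idx = np.argsort(uncertainty)[-budget:]
    return pool[idx]

# Example: with a linear substitute (w, b), query the 100 most ambiguous points.
rng = np.random.default_rng(2)
pool = rng.normal(size=(10000, 8))
w, b = rng.normal(size=8), 0.0
next_batch = select_queries(pool, lambda X: 1 / (1 + np.exp(-(X @ w + b))), 100)
```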

Side-channel attacks: Infer model properties from timing signals, memory access patterns, or power consumption during inference:
- Inference latency reveals layer count and approximate width
- Cache timing reveals model architecture and batch size
- Memory access patterns can leak weight sparsity structure

Extraction Metrics and Fidelity

| Metric | What It Measures |
|--------|-----------------|
| Accuracy agreement | Fraction of inputs where stolen model matches target's prediction |
| Label fidelity | Hard-label agreement on standard benchmarks |
| Soft-label fidelity | KL divergence between probability distributions |
| Adversarial transferability | Attack success rate using stolen model as surrogate |

High adversarial transferability is particularly dangerous — a stolen model with even modest accuracy agreement can serve as an effective surrogate for generating adversarial examples against the original API.
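
As one possible implementation, the sketch below computes accuracy agreement and a soft-label fidelity score (mean KL divergence) from matched output batches; the Dirichlet samples are stand-ins that merely illustrate the expected shapes.

```python
# Extraction-fidelity metrics sketch: compare target and stolen-model outputs
# on a shared evaluation set.
import numpy as np

def accuracy_agreement(p_target, p_stolen):
    """Fraction of inputs on which the two models predict the same class."""
    return np.mean(p_target.argmax(axis=1) == p_stolen.argmax(axis=1))

def soft_label_fidelity(p_target, p_stolen, eps=1e-12):
    """Mean KL(target || stolen) over the evaluation set; lower is better."""
    p, q = np.clip(p_target, eps, 1.0), np.clip(p_stolen, eps, 1.0)
    return np.mean(np.sum(p * np.log(p / q), axis=1))

rng = np.random.default_rng(3)
p_t = rng.dirichlet(np.ones(10), size=500)       # stand-in target outputs
p_s = rng.dirichlet(np.ones(10), size=500)       # stand-in stolen-model outputs
print(accuracy_agreement(p_t, p_s), soft_label_fidelity(p_t, p_s))
```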

Defenses

Output perturbation: Add calibrated noise to probability outputs. This reduces extraction fidelity but also degrades accuracy for legitimate users; differential privacy mechanisms provide provable bounds on this trade-off.
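
A toy version of output perturbation might look like the following; the Laplace noise and its scale are illustrative choices, not a calibrated differential-privacy mechanism.

```python
# Output-perturbation sketch: add noise to the probability vector before
# returning it, then clip and re-normalize so it is still a distribution.
import numpy as np

def perturb_probs(probs, scale=0.05, rng=np.random.default_rng()):
    noisy = probs + rng.laplace(0.0, scale, size=probs.shape)
    noisy = np.clip(noisy, 1e-6, None)
    return noisy / noisy.sum(axis=-1, keepdims=True)

print(perturb_probs(np.array([0.7, 0.2, 0.1])))
```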

Prediction rounding: Return top-k labels rather than full probability distributions. Dramatically reduces information per query but changes API semantics.
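
A correspondingly simple sketch of prediction rounding, returning only the top-k labels with coarsely rounded scores (names and values are illustrative):

```python
# Prediction-rounding sketch: truncate the API response to the top-k classes.
import numpy as np

def top_k_response(probs, k=1, decimals=1):
    idx = np.argsort(probs)[::-1][:k]
    return [(int(i), round(float(probs[i]), decimals)) for i in idx]

print(top_k_response(np.array([0.07, 0.61, 0.32]), k=2))  # [(1, 0.6), (2, 0.3)]
```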

Query rate limiting and anomaly detection: Flag accounts submitting statistically unusual query patterns (systematic boundary probing, high volume from single IP). Effective against naive attacks but not adaptive attackers using distributed infrastructure.
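
One very rough heuristic along these lines flags accounts that combine high volume with queries concentrated near the decision boundary, a common signature of boundary probing; the thresholds below are arbitrary examples, not recommended settings.

```python
# Toy query-anomaly check for a binary-classification API.
import numpy as np

def flag_account(query_probs, rate_per_hour, max_rate=1000, boundary_frac=0.5):
    """Flag if the account is high-volume or its queries cluster near p=0.5."""
    near_boundary = np.mean(np.abs(query_probs - 0.5) < 0.1)
    return rate_per_hour > max_rate or near_boundary > boundary_frac
```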

Model watermarking: Embed backdoor behaviors in the target model that transfer to extracted copies. If the stolen model exhibits the watermark behavior, theft is provable. Watermark design must resist removal by fine-tuning and standard training.
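
Verification itself can be as simple as measuring a suspect model's agreement with the secret trigger set; the sketch below assumes such a trigger set already exists, and the function names are illustrative.

```python
# Watermark-verification sketch: a match rate far above chance on the secret
# trigger set is evidence that the suspect model was extracted from the target.
import numpy as np

def watermark_match_rate(suspect_predict, trigger_inputs, trigger_labels):
    """Fraction of trigger inputs on which the suspect model reproduces the backdoor labels."""
    return np.mean(suspect_predict(trigger_inputs) == trigger_labels)
```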

Prediction API redesign: Return explanations or feature importances instead of raw probabilities — these may contain less information about decision boundaries while being more useful for legitimate users.

The model stealing threat has motivated the development of provably hard-to-extract models (cryptographic model protection) as an active research direction, though practical deployments remain elusive.
