Shadow mode deployment runs a new model alongside the production model without affecting the user experience: requests are sent to both models, their outputs are compared, and performance is validated against real production conditions before the new model serves any traffic. This enables safe validation of model changes on real workloads.
What Is Shadow Mode?
- Definition: New model receives production traffic but doesn't serve responses.
- Purpose: Validate model behavior with real data before launch.
- Mechanism: Duplicate requests to shadow model, compare results.
- Risk: None to users — only production model serves responses.
Why Shadow Mode Matters
- Real Traffic: Test patterns that synthetic data misses.
- Performance: Measure latency under production load.
- Quality: Compare outputs at scale.
- Confidence: Build evidence before full rollout.
- Rollback-Free: Issues don't affect users.
Shadow Mode Architecture
```
User Request
│
▼
┌─────────────────────────────────────────────────────────┐
│ API Gateway │
└─────────────────────────────────────────────────────────┘
│
├──────────────────────────┐
│ │ (async)
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Production Model │ │ Shadow Model │
│ (serves response) │ │ (logs only) │
└─────────────────────┘ └─────────────────────┘
│ │
▼ ▼
[Response] [Log for Analysis]
│ │
└──────────────────────────┘
│
▼
┌───────────────────┐
│ Comparison DB │
└───────────────────┘
```
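The comparison DB can be as simple as a table keyed by request, holding the shadow output alongside what production served. Below is a minimal sketch of the logging step, assuming a local SQLite file as the store; the table layout and the `log_shadow_result` helper (which the proxy code further down calls) are illustrative choices, not a required design.

```python
import json
import time
import sqlite3

# Minimal comparison store. SQLite and the column names are assumptions for
# this sketch; any database or logging pipeline can play the same role.
conn = sqlite3.connect("shadow_comparisons.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS shadow_results (
        request_body TEXT,
        shadow_output TEXT,
        logged_at REAL
    )
""")

async def log_shadow_result(request, result):
    """Persist one shadow response so it can later be joined with the
    production response (logged by the serving path) for offline comparison."""
    # Blocking sqlite calls are kept for brevity; a real service would use an
    # async client or a background writer.
    conn.execute(
        "INSERT INTO shadow_results VALUES (?, ?, ?)",
        (json.dumps(request), str(result), time.time()),
    )
    conn.commit()
```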
Implementation
Basic Shadow Proxy:
```python
import asyncio
import logging

from fastapi import FastAPI, Request

logger = logging.getLogger(__name__)
app = FastAPI()

# production_model, shadow_model, and log_shadow_result are assumed to be
# defined elsewhere (model client wrappers and the comparison logger).

async def call_production(request):
    """Call the production model and return its response."""
    return await production_model.generate(request)

async def call_shadow(request):
    """Call the shadow model and log the result; errors never reach the user."""
    try:
        result = await shadow_model.generate(request)
        await log_shadow_result(request, result)
    except Exception as e:
        logger.error(f"Shadow model error: {e}")

@app.post("/v1/generate")
async def generate(request: Request):
    body = await request.json()
    # Fire-and-forget shadow call (not awaited, so it adds no user-facing latency)
    asyncio.create_task(call_shadow(body))
    # Return the production response as usual
    return await call_production(body)
```
Traffic Splitting:
```python
import random

def should_shadow(request, shadow_percentage=10):
    """Randomly decide whether this request should also go to the shadow model."""
    return random.random() < shadow_percentage / 100

@app.post("/v1/generate")
async def generate(request: Request):
    body = await request.json()
    # Only shadow a sample of traffic to control cost
    if should_shadow(body, shadow_percentage=25):
        asyncio.create_task(call_shadow(body))
    return await call_production(body)
```
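If the same users or request keys should be shadowed consistently across calls, random sampling can be swapped for deterministic hashing. The sketch below illustrates the idea; the `user_id` field is an assumption about the request body, and any stable key works.

```python
import hashlib

def should_shadow_deterministic(request, shadow_percentage=10):
    """Shadow a stable subset of requests by hashing a request key.

    Assumes the request body carries a 'user_id' field (an illustrative
    assumption); the hash maps each key to a bucket in [0, 100).
    """
    key = str(request.get("user_id", ""))
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return bucket < shadow_percentage
```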
Comparison Analysis
Metrics to Compare:
```
Metric               | How to Compare
---------------------|----------------------------------
Latency              | Shadow P50/P95 vs. production
Output match         | Exact match rate
Semantic similarity  | Embedding similarity of outputs
Error rate           | Shadow failure rate
Token usage          | Cost comparison
Quality              | LLM-as-judge or human eval
```
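Semantic similarity is typically computed by embedding both outputs and taking cosine similarity. A minimal sketch follows, assuming sentence-transformers as the embedding backend; the model name is an arbitrary choice and any embedding model works.

```python
from sentence_transformers import SentenceTransformer, util

# The model name is an assumption; swap in whatever embedding model you use.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(prod_output: str, shadow_output: str) -> float:
    """Cosine similarity between embeddings of the two outputs.

    Values close to 1.0 indicate near-identical meaning.
    """
    embeddings = embedder.encode([prod_output, shadow_output])
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```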
Comparison Script:
```python
import numpy as np

def analyze_shadow_results():
    """Summarize logged production/shadow pairs.

    Assumes load_shadow_comparisons() returns a list of dicts with the
    fields used below.
    """
    results = load_shadow_comparisons()
    analysis = {
        "total_samples": len(results),
        "exact_match_rate": sum(r["exact_match"] for r in results) / len(results),
        "avg_similarity": sum(r["semantic_similarity"] for r in results) / len(results),
        "shadow_latency_p50": np.percentile([r["shadow_latency"] for r in results], 50),
        "shadow_latency_p95": np.percentile([r["shadow_latency"] for r in results], 95),
        "prod_latency_p50": np.percentile([r["prod_latency"] for r in results], 50),
        "shadow_error_rate": sum(r["shadow_error"] for r in results) / len(results),
    }
    return analysis
```
Automated Quality Check:
```python
async def evaluate_shadow_quality(prod_response, shadow_response, prompt):
    """Use an LLM judge to decide which response is better."""
    judge_prompt = f"""
    Compare these two responses to the prompt.

    Prompt: {prompt}

    Response A: {prod_response}
    Response B: {shadow_response}

    Which is better? Answer: A, B, or TIE
    Brief justification:
    """
    # judge_llm and parse_judgment are assumed to be defined elsewhere.
    judgment = await judge_llm.generate(judge_prompt)
    return parse_judgment(judgment)
```
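The `parse_judgment` helper is not shown above; a simple version just looks for the verdict token in the judge's reply. The sketch below is one assumed implementation, not a prescribed one.

```python
import re

def parse_judgment(judgment: str) -> str:
    """Extract the verdict ('A', 'B', or 'TIE') from the judge's reply.

    Assumes the judge leads with its verdict, as the prompt requests; only the
    first non-empty line is inspected, and unparseable replies count as ties
    so they never inflate a win rate.
    """
    first_line = next((line for line in judgment.splitlines() if line.strip()), "")
    match = re.search(r"\b(TIE|A|B)\b", first_line.upper())
    return match.group(1) if match else "TIE"
```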
Rollout Decision
Go/No-Go Criteria:
```
Metric               | Threshold
---------------------|------------------
Latency (P95)        | < 1.2x production
Error rate           | < production
Quality win rate     | > 50%
Semantic similarity  | > 0.95
Shadow coverage      | > 10K requests
```
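These thresholds can be checked automatically against the output of `analyze_shadow_results`. The sketch below assumes that dict plus production baselines and a quality win rate supplied by your own monitoring and eval pipeline; the parameter names are illustrative.

```python
def is_ready_to_promote(analysis, prod_latency_p95, prod_error_rate, quality_win_rate):
    """Apply the go/no-go criteria to the shadow analysis.

    analysis is the dict returned by analyze_shadow_results(); the baseline
    arguments are assumptions about what your monitoring exposes.
    """
    checks = {
        "latency": analysis["shadow_latency_p95"] < 1.2 * prod_latency_p95,
        "errors": analysis["shadow_error_rate"] < prod_error_rate,
        "quality": quality_win_rate > 0.50,
        "similarity": analysis["avg_similarity"] > 0.95,
        "coverage": analysis["total_samples"] > 10_000,
    }
    return all(checks.values()), checks
```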
Gradual Rollout:
```
Phase 1: Shadow 5% → validate
Phase 2: Shadow 25% → validate
Phase 3: Shadow 100% → validate
Phase 4: Canary 5% real traffic
Phase 5: Gradual 5% → 25% → 50% → 100%
```
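One lightweight way to drive this schedule is a static phase table that the gateway reads to set shadow and canary percentages, advancing only after the go/no-go criteria pass. The structure below is purely illustrative and the field names are assumptions.

```python
# Illustrative phase schedule mirroring the plan above; each step is validated
# against the go/no-go criteria before advancing to the next.
ROLLOUT_PHASES = [
    {"name": "shadow-5",     "shadow_pct": 5,   "canary_pct": 0},
    {"name": "shadow-25",    "shadow_pct": 25,  "canary_pct": 0},
    {"name": "shadow-100",   "shadow_pct": 100, "canary_pct": 0},
    {"name": "canary-5",     "shadow_pct": 0,   "canary_pct": 5},
    {"name": "canary-25",    "shadow_pct": 0,   "canary_pct": 25},
    {"name": "canary-50",    "shadow_pct": 0,   "canary_pct": 50},
    {"name": "full-rollout", "shadow_pct": 0,   "canary_pct": 100},
]
```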
Best Practices
- Sample Traffic: Don't shadow 100% if not needed.
- Async Execution: Shadow shouldn't slow production.
- Cost Awareness: Shadow traffic costs money.
- Time-Bound: Set duration for shadow experiment.
- Automated Alerts: Notify on significant differences.
Shadow mode deployment is the safest way to validate model changes — by running new models against real production traffic without user impact, teams can catch issues that testing missed and build confidence before committing to a full rollout.