Shadow Mode Deployment

Shadow mode deployment runs a new model alongside the production model without affecting the user experience. Each request is sent to both models, only the production response is returned to the user, and the two outputs are compared to validate the new model before switching. This makes it possible to validate model changes safely under real production conditions.

What Is Shadow Mode?

Why Shadow Mode Matters

Shadow Mode Architecture

User Request
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│                    API Gateway                          │
└─────────────────────────────────────────────────────────┘
     │
     ├──────────────────────────┐
     │                          │ (async)
     ▼                          ▼
┌─────────────────────┐  ┌─────────────────────┐
│  Production Model   │  │   Shadow Model      │
│  (serves response)  │  │   (logs only)       │
└─────────────────────┘  └─────────────────────┘
     │                          │
     ▼                          ▼
  [Response]              [Log for Analysis]
     │                          │
     └──────────────────────────┘
                  │
                  ▼
         ┌───────────────────┐
         │   Comparison DB   │
         └───────────────────┘

Implementation

Basic Shadow Proxy:

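import asyncio
import logging

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger(__name__)

# production_model, shadow_model, and log_shadow_result are assumed to be
# defined elsewhere in the service.

# Hold references to in-flight shadow tasks. The event loop keeps only weak
# references, so an unreferenced task can be garbage-collected mid-run.
shadow_tasks = set()

async def call_production(request):
    """Call production model and return response."""
    return await production_model.generate(request)

async def call_shadow(request):
    """Call shadow model and log result."""
    try:
        result = await shadow_model.generate(request)
        await log_shadow_result(request, result)
    except Exception:
        # Shadow failures must never reach the user; log with traceback.
        logger.exception("Shadow model error")

@app.post("/v1/generate")
async def generate(request: Request):
    body = await request.json()

    # Start the shadow call in the background (not awaited)
    task = asyncio.create_task(call_shadow(body))
    shadow_tasks.add(task)
    task.add_done_callback(shadow_tasks.discard)

    # Return production response
    response = await call_production(body)
    return response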

Traffic Splitting:

import random

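def should_shadow(request, shadow_percentage=10):
    """Return True for roughly shadow_percentage percent (0-100) of requests.

    The request argument is currently unused; sampling is purely random.
    """
    return random.random() < shadow_percentage / 100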

@app.post("/v1/generate")
async def generate(request: Request):
    body = await request.json()
    
    # Only shadow some traffic
    if should_shadow(body, shadow_percentage=25):
        asyncio.create_task(call_shadow(body))
    
    return await call_production(body)
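
Random sampling picks a different subset on every call, which can make a single request hard to trace. As an alternative sketch, the decision can be derived from a hash of a request identifier so that the same ID always lands in the same bucket. The `request_id` field and the function name `should_shadow_by_id` are assumptions for illustration and do not appear in the code above.

```python
import hashlib

def should_shadow_by_id(request_id: str, shadow_percentage: float = 10) -> bool:
    """Deterministically shadow a fixed slice of request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # shadow_percentage is 0-100, so 25 selects buckets 0..2499.
    return bucket < shadow_percentage * 100
```

Because the bucket depends only on the ID, raising the percentage from 5 to 25 keeps every request that was already shadowed at 5.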

Comparison Analysis

Metrics to Compare:

Metric               | How to Compare
---------------------|----------------------------------
Latency              | Shadow P50/P95 vs. production
Output match         | Exact match rate
Semantic similarity  | Embedding similarity of outputs
Error rate           | Shadow failure rate
Token usage          | Cost comparison
Quality              | LLM-as-judge or human eval
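
The first two output metrics in the table can be computed per request pair. The sketch below assumes the embeddings are already available as equal-length lists of floats; which embedding model produces them is left open, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def compare_outputs(prod_text, shadow_text, prod_vec, shadow_vec):
    """Build the per-request output metrics for one production/shadow pair."""
    return {
        # Ignore leading/trailing whitespace when testing for an exact match.
        "exact_match": prod_text.strip() == shadow_text.strip(),
        "semantic_similarity": cosine_similarity(prod_vec, shadow_vec),
    }
```

For generative models with sampling enabled, exact match is usually a weak signal, which is why it is paired with a similarity score here.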

Comparison Script:

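# numpy's percentile matches the (values, q) call shape used here.
from numpy import percentile

def analyze_shadow_results():
    # load_shadow_comparisons() is assumed to return a list of dicts,
    # one per shadowed request.
    results = load_shadow_comparisons()
    n = len(results)
    if n == 0:
        # Avoid dividing by zero when nothing has been shadowed yet.
        return {"total_samples": 0}

    shadow_latencies = [r["shadow_latency"] for r in results]
    prod_latencies = [r["prod_latency"] for r in results]

    analysis = {
        "total_samples": n,
        "exact_match_rate": sum(r["exact_match"] for r in results) / n,
        "avg_similarity": sum(r["semantic_similarity"] for r in results) / n,
        "shadow_latency_p50": percentile(shadow_latencies, 50),
        "shadow_latency_p95": percentile(shadow_latencies, 95),
        "prod_latency_p50": percentile(prod_latencies, 50),
        "prod_latency_p95": percentile(prod_latencies, 95),
        "shadow_error_rate": sum(r["shadow_error"] for r in results) / n,
    }

    return analysis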

Automated Quality Check:

async def evaluate_shadow_quality(prod_response, shadow_response, prompt):
    """Use LLM to judge which response is better."""
    judge_prompt = f"""
    Compare these two responses to the prompt.
    
    Prompt: {prompt}
    
    Response A: {prod_response}
    Response B: {shadow_response}
    
    Which is better? Answer: A, B, or TIE
    Brief justification:
    """
    
    judgment = await judge_llm.generate(judge_prompt)
    return parse_judgment(judgment)
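
The `parse_judgment` helper is not shown above. One possible sketch follows, assuming the judge starts its reply with the verdict token (optionally prefixed by "Answer:"). The `to_winner` helper is hypothetical: it maps the positional verdict back to a model name, which is useful if the A/B order is swapped per request to reduce position bias in the judge.

```python
import re

def parse_judgment(judgment: str):
    """Return 'A', 'B', or 'TIE', or None if the reply has no leading verdict."""
    match = re.match(r"\s*(?:ANSWER\s*:\s*)?(TIE|A|B)\b", judgment.upper())
    return match.group(1) if match else None

def to_winner(verdict, shadow_is_a=False):
    """Map a positional verdict to 'prod', 'shadow', 'tie', or None."""
    if verdict is None:
        return None
    if verdict == "TIE":
        return "tie"
    a_name, b_name = ("shadow", "prod") if shadow_is_a else ("prod", "shadow")
    return a_name if verdict == "A" else b_name
```

Returning `None` for unparseable replies keeps them out of the win-rate count instead of silently treating them as ties.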

Rollout Decision

Go/No-Go Criteria:

Metric               | Threshold
---------------------|------------------
Latency (P95)        | < 1.2x production
Error rate           | < production
Quality win rate     | > 50%
Semantic similarity  | > 0.95
Shadow coverage      | > 10K requests
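
The thresholds in this table can be checked mechanically. The sketch below takes a single `metrics` dict; every key name is an assumption for illustration, and values such as `prod_latency_p95`, `prod_error_rate`, and `quality_win_rate` must be supplied by the caller.

```python
def go_no_go(metrics):
    """Evaluate each criterion; return (overall_go, per_check_results)."""
    checks = {
        "latency_p95": metrics["shadow_latency_p95"] < 1.2 * metrics["prod_latency_p95"],
        "error_rate": metrics["shadow_error_rate"] < metrics["prod_error_rate"],
        "quality_win_rate": metrics["quality_win_rate"] > 0.50,
        "semantic_similarity": metrics["avg_similarity"] > 0.95,
        "coverage": metrics["total_samples"] > 10_000,
    }
    return all(checks.values()), checks
```

Returning the individual checks alongside the overall verdict shows which criterion blocked a rollout.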

Gradual Rollout:

Phase 1: Shadow 5% → validate
Phase 2: Shadow 25% → validate  
Phase 3: Shadow 100% → validate
Phase 4: Canary 5% real traffic
Phase 5: Gradual 5% → 25% → 50% → 100%
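
The phases above can be expressed as data so that a rollout controller advances through them one step at a time. This is only a sketch: the names `ROLLOUT_STEPS` and `next_step` are hypothetical, and holding at the current step when validation fails is an assumed policy, not something the phase list specifies.

```python
# (mode, percent of traffic), mirroring the five phases listed above.
ROLLOUT_STEPS = [
    ("shadow", 5),
    ("shadow", 25),
    ("shadow", 100),
    ("canary", 5),
    ("live", 25),
    ("live", 50),
    ("live", 100),
]

def next_step(index: int, validated: bool) -> int:
    """Advance to the next step only when the current one validated."""
    if not validated:
        return index  # hold at the current step until validation passes
    return min(index + 1, len(ROLLOUT_STEPS) - 1)
```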

Best Practices

Shadow mode deployment is one of the safest ways to validate model changes. By running new models against real production traffic without user impact, teams can catch issues that testing missed and build confidence before committing to a full rollout.

