Chaos Engineering is the discipline of intentionally injecting controlled failures into production or staging AI systems to discover weaknesses before unplanned outages expose them to users. It transforms reliability engineering from reactive incident response to proactive resilience building through structured experimentation.
What Is Chaos Engineering?
- Definition: The practice of deliberately introducing faults (network partitions, latency, resource exhaustion, service failures) into systems to verify that they can withstand turbulent real-world conditions and degrade gracefully rather than catastrophically.
- Origin: Pioneered by Netflix (2011) with "Chaos Monkey", a tool that randomly terminated EC2 instances in production to force engineers to build resilient, redundant systems.
- Hypothesis-Based: Chaos engineering is scientific — form a hypothesis ("If the vector DB becomes unavailable, the RAG pipeline will fall back to keyword search"), run the experiment, observe results, and either confirm resilience or discover a weakness to fix.
- Controlled Blast Radius: Unlike real incidents, chaos experiments are controlled — scope is limited, duration is bounded, rollback is instant, and monitoring is heightened.
Why Chaos Engineering Matters for AI Systems
- Complex Dependencies: AI production systems depend on vector databases, embedding services, LLM APIs, rerankers, and cache layers — any one failing can cascade.
- External API Risk: LLM providers (OpenAI, Anthropic) have outages — does your system have fallback models, cached responses, or graceful degradation when the primary API is unavailable?
- Model Serving Complexity: GPU out-of-memory, CUDA errors, and model loading failures are unique failure modes requiring specific recovery paths.
- Silent Degradation: AI systems can degrade silently — wrong retrieval context produces confident but wrong answers, invisible without semantic monitoring and chaos testing.
- Cold Start Validation: Chaos tests verify that systems recover correctly from cold starts (container restarts, autoscaling events) not just steady-state operation.
AI-Specific Chaos Scenarios
LLM API Failures:
- Inject: OpenAI API returns 503 for all requests.
- Hypothesis: System falls back to a local Llama model within 5 seconds.
- Measure: Fallback success rate, latency increase, response quality degradation.
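A minimal sketch of the fallback path this scenario exercises, assuming hypothetical `call_openai` and `call_local_llama` stand-ins for the real clients:

```python
import time

class PrimaryModelError(Exception):
    """Raised when the primary LLM API fails (e.g., HTTP 503)."""

def call_openai(prompt: str, timeout: float) -> str:
    # Placeholder for the real OpenAI client call; the chaos experiment
    # forces this to raise PrimaryModelError for every request.
    raise PrimaryModelError("503 Service Unavailable")

def call_local_llama(prompt: str) -> str:
    # Placeholder for a locally hosted Llama inference server.
    return "response from local fallback model"

def generate(prompt: str, fallback_budget_s: float = 5.0) -> tuple[str, str]:
    """Try the primary API, then fall back; return (response, model_used)."""
    start = time.monotonic()
    try:
        return call_openai(prompt, timeout=2.0), "openai"
    except PrimaryModelError:
        response = call_local_llama(prompt)
        elapsed = time.monotonic() - start
        # The hypothesis under test: fallback completes inside the 5s budget.
        assert elapsed < fallback_budget_s, f"fallback took {elapsed:.1f}s"
        return response, "llama-local"
```

Returning the model name alongside the response is what makes the fallback success rate directly measurable.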
Vector Database Unavailability:
- Inject: Block all connections to the vector DB.
- Hypothesis: RAG pipeline falls back to BM25 keyword search; users receive lower-quality but valid responses.
- Measure: Fallback activation rate, response relevance score, error rate.
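A sketch of the degradation path, assuming the `rank_bm25` package for lexical fallback; `vector_search` is a hypothetical stand-in for the real vector DB client:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

CORPUS = [
    "Chaos engineering injects controlled failures into production systems.",
    "Vector databases store embeddings for semantic retrieval.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]
_bm25 = BM25Okapi([doc.lower().split() for doc in CORPUS])

class VectorDBUnavailable(Exception):
    pass

def vector_search(query: str, k: int) -> list[str]:
    # Placeholder for the real vector DB client; the chaos experiment
    # blocks its connections, so every call raises here.
    raise VectorDBUnavailable("connection refused")

def retrieve(query: str, k: int = 2) -> tuple[list[str], str]:
    """Return (documents, retrieval_path) so fallback activation is measurable."""
    try:
        return vector_search(query, k), "vector"
    except VectorDBUnavailable:
        # Degraded but valid: lexical retrieval keeps the RAG pipeline answering.
        return _bm25.get_top_n(query.lower().split(), CORPUS, n=k), "bm25"
```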
Network Latency Injection:
- Inject: Add 500ms of latency to all calls from the API server to the embedding service.
- Hypothesis: p99 latency increases proportionally but timeout handling prevents cascading failures.
- Measure: Time-to-first-token (TTFT) distribution shift, timeout rate, circuit breaker activation.
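One way to apply this fault is Linux `tc`/netem (also listed in the tools table below), wrapped so rollback is guaranteed. A sketch, assuming root access and that `eth0` carries the relevant traffic:

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def injected_latency(interface: str = "eth0", delay: str = "500ms"):
    """Add fixed egress latency with tc/netem; always roll back on exit."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", delay],
        check=True,
    )
    try:
        yield
    finally:
        # Bounded blast radius: the qdisc is removed even if the probe fails.
        subprocess.run(
            ["tc", "qdisc", "del", "dev", interface, "root", "netem"],
            check=True,
        )

# Usage (measure_p99_and_timeouts is a hypothetical probe):
# with injected_latency():
#     measure_p99_and_timeouts()
```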
GPU Memory Pressure:
- Inject: Allocate 80% of available VRAM with a competing process.
- Hypothesis: Inference server queues requests rather than OOM-crashing; queue depth alert fires.
- Measure: OOM rate, graceful queuing behavior, alert latency.
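A sketch of the injection side using PyTorch, assuming a CUDA host; the `fraction` knob and the pass/fail observation are part of the experiment design, not fixed here:

```python
import torch  # assumes a CUDA-capable host with PyTorch installed

def occupy_vram(fraction: float = 0.8) -> torch.Tensor:
    """Pin roughly `fraction` of currently free VRAM to simulate a competitor."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    n_floats = int(free_bytes * fraction) // 4  # float32 = 4 bytes
    # Holding the returned tensor keeps the allocation alive for the
    # duration of the experiment; deleting it is the rollback.
    return torch.empty(n_floats, dtype=torch.float32, device="cuda")

ballast = occupy_vram(0.8)
# ... drive inference traffic; graceful queuing = pass, OOM crash = fail ...
del ballast
torch.cuda.empty_cache()  # rollback: release the pressure
```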
Embedding Service Failure:
- Inject: Return random (garbage) vectors from the embedding service.
- Hypothesis: Retrieval quality degrades detectably; quality monitoring alerts fire.
- Measure: Retrieval relevance score collapse, alert response time.
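A sketch of the failure and its detection, assuming a hypothetical 768-dimension model and a `RELEVANCE_FLOOR` alert threshold; in high dimensions, random unit vectors drive cosine similarity toward zero, which is exactly the collapse the monitor should catch:

```python
import numpy as np

EMBED_DIM = 768        # assumption: match the real embedding model's dimension
RELEVANCE_FLOOR = 0.5  # assumption: alert threshold from your quality monitor

def garbage_embed(texts: list[str]) -> np.ndarray:
    """Chaos stand-in for the embedding service: random unit vectors."""
    rng = np.random.default_rng()
    vecs = rng.normal(size=(len(texts), EMBED_DIM))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def mean_relevance(query_vec: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Mean cosine similarity between the query and retrieved documents."""
    return float(np.mean(doc_vecs @ query_vec))

query = garbage_embed(["what is chaos engineering?"])[0]
docs = garbage_embed(["doc a", "doc b", "doc c"])
# Relevance collapses well below the floor, so the quality alert must fire.
assert mean_relevance(query, docs) < RELEVANCE_FLOOR
```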
Chaos Engineering Tools
| Tool | Type | Best For |
|------|------|---------|
| Chaos Monkey | Netflix OSS | Random instance termination |
| Gremlin | Commercial SaaS | Fine-grained fault injection |
| Chaos Mesh | CNCF, Kubernetes-native | Pod failures, network chaos |
| Litmus | CNCF OSS | Kubernetes chaos experiments |
| tc (Linux) | Built-in | Network latency/packet loss injection |
| stress-ng | Linux | CPU/memory/IO stress |
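The last two rows are plain CLI tools and slot easily into a scripted harness. For example, a memory-pressure run with stress-ng (assuming it is installed; the worker count and percentage are illustrative):

```python
import subprocess

# Memory pressure: 2 workers touching 75% of RAM, bounded to 60 seconds
# so the fault rolls itself back even if the harness dies mid-experiment.
subprocess.run(
    ["stress-ng", "--vm", "2", "--vm-bytes", "75%", "--timeout", "60s"],
    check=True,
)
```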
Chaos Engineering Process
Step 1 — Define Steady State: Establish baseline metrics (error rate, latency, throughput) that define normal operation.
Step 2 — Hypothesize: "If X fails, the system will respond with Y behavior within Z seconds."
Step 3 — Plan the Experiment: Define fault injection method, blast radius, duration, and rollback procedure.
Step 4 — Inject Failure: Apply the fault in a controlled way (start in staging, graduate to production).
Step 5 — Observe: Monitor all relevant metrics throughout the experiment.
Step 6 — Analyze: Did actual behavior match the hypothesis? What weaknesses were revealed?
Step 7 — Fix and Repeat: Address discovered weaknesses and re-run to verify the fix.
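These steps can be encoded directly in a reusable harness. A minimal sketch, with hypothetical `steady_state`/`inject`/`rollback`/`verify` hooks standing in for real tooling:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str                   # "If X fails, the system does Y within Z"
    steady_state: Callable[[], bool]  # Step 1: baseline metrics look normal
    inject: Callable[[], None]        # Step 4: apply the fault
    rollback: Callable[[], None]      # bounded blast radius, instant restore
    verify: Callable[[], bool]        # Steps 5-6: observe and analyze

    def run(self) -> bool:
        if not self.steady_state():
            raise RuntimeError("refusing to start: system not in steady state")
        self.inject()
        try:
            confirmed = self.verify()
        finally:
            self.rollback()           # always restore, pass or fail
        verdict = "hypothesis confirmed" if confirmed else "weakness found"
        print(f"{self.name}: {verdict}")
        return confirmed              # Step 7: fix and re-run when False
```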
GameDay (Chaos Event)
A GameDay is a scheduled chaos event in which the whole team participates (SRE, engineering, product), practicing incident response on a failure that organizers plan in advance but do not announce to responders. GameDays build muscle memory for real incidents and reveal process gaps alongside technical weaknesses.
Chaos engineering is the reliability discipline that proves AI systems work under adversity before adversity arrives unplanned. By systematically exploring failure modes through controlled experiments, teams build genuine confidence in production resilience rather than the false assurance of "it worked in testing."