Error Budget

An error budget is the quantified allowance for unreliability derived from an SLO. Teams can "spend" it on risky deployments and experiments while it remains positive, and must conserve it by freezing changes once it is depleted. It is the SRE (Site Reliability Engineering) mechanism that turns reliability from a vague goal into a concrete resource governing the pace of innovation.

What Is an Error Budget?

Why Error Budgets Matter

Error Budget Calculation

For a 99.9% availability SLO over 30 days:

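Total requests in 30 days: assume 1,000,000 requests.
Allowed failures: 1,000,000 × 0.001 = 1,000 failed requests.
Budget remaining after 500 failures: 500 requests (50% remaining).
Budget burn rate: if those 500 failures are spread over the full 30 days, 500 / 30 ≈ 16.7 failures/day. That is below the 1,000 / 30 ≈ 33.3 failures/day that would exhaust the budget, so the service is on pace to stay within budget.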
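The arithmetic above can be sketched in a few lines of Python. The function names `request_error_budget` and `budget_remaining` are illustrative, not part of any library, and the inputs are the same assumed figures used in the worked example.

```python
def request_error_budget(total_requests: int, slo_target: float) -> int:
    """Number of failed requests the SLO allows over the window."""
    return round(total_requests * (1 - slo_target))


def budget_remaining(allowed_failures: int, observed_failures: int) -> tuple[int, float]:
    """Return (failures still allowed, fraction of the budget still unspent)."""
    left = allowed_failures - observed_failures
    return left, left / allowed_failures


# Assumed figures from the worked example: 1,000,000 requests, 99.9% SLO.
allowed = request_error_budget(1_000_000, 0.999)
left, fraction = budget_remaining(allowed, 500)
print(allowed, left, fraction)
```

A negative `left` would mean the budget is overspent, which corresponds to an SLO violation for the window.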

For a 99.9% latency SLO (p99 < 2s) over 30 days:

Allowed minutes above threshold: 30 × 24 × 60 × 0.001 = 43.2 minutes.
Budget remaining after 20 minutes of violations: 23.2 minutes (54% remaining).
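The time-based version of the calculation follows the same pattern. This is a minimal sketch; `time_error_budget_minutes` is a hypothetical helper name, and the 20 minutes of violations is the assumed figure from the example.

```python
def time_error_budget_minutes(window_days: int, slo_target: float) -> float:
    """Minutes in the window during which the SLO threshold may be violated."""
    return window_days * 24 * 60 * (1 - slo_target)


budget = time_error_budget_minutes(30, 0.999)  # minutes allowed over 30 days
violated = 20                                  # assumed minutes spent above threshold
remaining = budget - violated
percent_left = 100 * remaining / budget

# Rounded for display because the inputs are floating-point values.
print(round(budget, 1), round(remaining, 1), round(percent_left))
```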

Error Budget Policy

A formal Error Budget Policy defines what happens at different levels of remaining budget:

| Budget Remaining | Status | Allowed Actions |
|---|---|---|
| 100% - 50% | Healthy | All changes permitted; experiments encouraged |
| 50% - 25% | Caution | High-risk changes require additional review |
| 25% - 10% | Warning | Only critical bug fixes; feature freeze |
| < 10% | Critical | All changes frozen; reliability sprint |
| 0% (SLO violated) | Breach | Post-mortem required; SLA credits may be triggered where an SLA applies |
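A policy like this can be encoded as a simple lookup so that deploy tooling or dashboards report the same status everywhere. The sketch below is one possible encoding, not a prescribed one. The policy's bands share their endpoints (50%, 25%, 10%), so the code assumes each boundary value belongs to the healthier band; `policy_status` is a hypothetical name.

```python
def policy_status(remaining_fraction: float) -> tuple[str, str]:
    """Map the unspent budget fraction (1.0 = untouched) to (status, allowed actions).

    Assumption: a value exactly on a boundary falls into the healthier band.
    """
    if remaining_fraction <= 0:
        return "Breach", "Post-mortem required"
    if remaining_fraction < 0.10:
        return "Critical", "All changes frozen; reliability sprint"
    if remaining_fraction < 0.25:
        return "Warning", "Only critical bug fixes"
    if remaining_fraction < 0.50:
        return "Caution", "High-risk changes require additional review"
    return "Healthy", "All changes permitted; experiments encouraged"


for fraction in (0.8, 0.3, 0.15, 0.05, 0.0):
    status, actions = policy_status(fraction)
    print(f"{fraction:.0%} remaining: {status} ({actions})")
```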

Error Budget in AI/LLM Contexts

AI systems introduce complexity beyond traditional web services:

Model Deployment Risk: Swapping a model version (GPT-4o → GPT-4o-mini) may degrade response quality in ways that are hard to detect quickly — error budget should account for quality degradation, not just availability.

External API Dependencies: If OpenAI has an outage consuming your error budget, you've "spent" budget you didn't choose to spend — error budget policies should distinguish self-caused vs dependency-caused consumption.

Chaos Engineering Budget: Teams can deliberately consume error budget by running chaos experiments (kill a pod, inject network latency) — this "spends" budget but improves long-term resilience.

Seasonal Variance: AI services may have predictable load spikes (product launches, end-of-quarter) — error budgets can be seasonally adjusted to give teams more runway during known risk periods.
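To make the self-caused versus dependency-caused distinction concrete, budget consumption can be tallied per cause. The sketch below assumes each failure has already been labeled during incident review; the labels "self" and "dependency", the function name, and the counts are all hypothetical.

```python
from collections import Counter


def consumption_by_cause(failure_causes: list[str], allowed_failures: int) -> dict[str, float]:
    """Fraction of the error budget consumed by each labeled cause."""
    counts = Counter(failure_causes)
    return {cause: n / allowed_failures for cause, n in counts.items()}


# Illustrative data: 300 failures from our own changes, 200 from an upstream API outage.
causes = ["self"] * 300 + ["dependency"] * 200
shares = consumption_by_cause(causes, allowed_failures=1_000)
print(shares)
```

A policy could then apply change freezes based on the "self" share only, while still reporting total consumption against the SLO.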

Fast Burn vs Slow Burn

An incident consuming 10% of your monthly budget in 1 hour is a fast burn and should page the on-call engineer immediately. An incident consuming 5% per day is a slow burn: it is less urgent, but at that pace the whole budget is gone in 20 days, so it will breach the SLO before the month ends and needs attention within hours.

Alerting should fire on both: fast-burn for immediate response, slow-burn for proactive intervention before SLO breach.
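Both cases can be compared on one scale by expressing them as a burn rate: the share of budget consumed divided by the share of the window elapsed, where a rate of 1 spends the budget exactly at the end of the window. The sketch below assumes a 30-day window; the function names are illustrative, and no specific alert thresholds are implied.

```python
WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window


def burn_rate(budget_fraction_consumed: float, elapsed_hours: float,
              window_hours: float = WINDOW_HOURS) -> float:
    """Budget share consumed relative to the share of the window elapsed."""
    return budget_fraction_consumed * window_hours / elapsed_hours


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until a full budget is spent if the rate stays constant."""
    return window_hours / rate


fast = burn_rate(0.10, elapsed_hours=1)   # 10% of the budget in 1 hour
slow = burn_rate(0.05, elapsed_hours=24)  # 5% of the budget per day

for name, rate in (("fast", fast), ("slow", slow)):
    print(f"{name}: burn rate {rate:.1f}, budget exhausted in {hours_to_exhaustion(rate):.0f} h")
```

Under these assumptions the fast burn works out to a rate of about 72 (budget gone in roughly 10 hours) and the slow burn to about 1.5 (gone in roughly 20 days), which is why the two warrant different response urgency.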

Error budgets are the operational currency of reliable AI systems — by converting the abstract goal of reliability into a finite, spendable resource with explicit policies governing its use, error budgets enable AI teams to ship ambitious features rapidly when systems are healthy and enforce the discipline to fix foundations when reliability is under stress.
