An error budget is the quantified allowance for unreliability derived from an SLO. Teams can "spend" it on risky deployments and experiments while it remains positive, and must conserve it by freezing changes once it is depleted. It is the SRE (Site Reliability Engineering) mechanism that turns reliability from a vague goal into a concrete resource governing the pace of innovation.
What Is an Error Budget?
- Definition: The mathematical complement of an SLO — if your SLO is 99.9% availability, your error budget is 0.1% of requests or time that is allowed to fail without violating the SLO.
- Purpose: Error budgets give engineering teams a formal, data-driven framework for deciding when it is safe to ship risky changes vs when to prioritize reliability.
- Origin: Introduced by Google's SRE teams as a solution to the eternal conflict between development (move fast) and operations (don't break things).
- Calculation: Error budget = (1 - SLO target) × total requests (or total time) in the measurement window, which gives the allowed failure volume for that period.
Why Error Budgets Matter
- Ends the Reliability Debate: Without an error budget, "Is this deployment risky?" devolves into opinion. With an error budget, the answer is data-driven: "We have 35% of this month's error budget remaining — proceed."
- Aligns Incentives: Dev teams want to ship features; SRE teams want stability. Error budgets align both — dev teams are now incentivized to ensure reliability because depleting the budget freezes their own deployments.
- Permits Calculated Risk: Teams with healthy error budgets can experiment aggressively (new model versions, infrastructure changes) knowing they have margin for failure.
- Forces Prioritization: A depleted error budget mandates reliability work — no more "we'll fix the flaky deployment pipeline later."
- Provides Neutral Arbiter: Escalations about risk become data conversations: "Our error budget for the quarter is 40% depleted after two incidents — we're on pace to breach SLO if we ship the risky migration."
Error Budget Calculation
For a 99.9% availability SLO over 30 days:
Total requests in 30 days: assume 1,000,000 requests.
Allowed failures: 1,000,000 × 0.001 = 1,000 failed requests.
Budget remaining after 500 failures: 500 requests (50% remaining).
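Average burn rate if those 500 failures are spread over the full 30 days: 500 / 30 ≈ 16.7 failures/day, half the sustainable rate of 1,000 / 30 ≈ 33.3 failures/day, so the service stays within budget.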
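The request-based arithmetic above can be sketched in a few lines of Python. This is a minimal illustration using the same example numbers; the function name `request_error_budget` and its return keys are hypothetical, not part of any SRE tooling.

```python
def request_error_budget(slo_target, total_requests, failed_requests, window_days=30):
    """Summarize a request-based error budget for one measurement window."""
    # Allowed failures = (1 - SLO target) x total requests; rounded to hide float noise.
    allowed = round((1 - slo_target) * total_requests, 6)
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "remaining_fraction": remaining / allowed,
        # Average daily failures observed vs. the rate that would exactly use up the budget.
        "observed_per_day": failed_requests / window_days,
        "sustainable_per_day": allowed / window_days,
    }


summary = request_error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=500)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Comparing `observed_per_day` with `sustainable_per_day` is what makes the "on pace" judgment explicit: an observed rate below the sustainable rate leaves budget at the end of the window.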
For a time-based latency SLO over 30 days (p99 latency below 2s in 99.9% of minutes):
Allowed minutes above threshold: 30 × 24 × 60 × 0.001 = 43.2 minutes.
Budget remaining after 20 minutes of violations: 23.2 minutes (54% remaining).
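For the time-based case, one way to track the budget is to count the minutes whose p99 latency missed the threshold. The sketch below assumes per-minute p99 samples are already available as a list; the function name `latency_budget` and the sample data are hypothetical.

```python
def latency_budget(p99_by_minute, threshold_s=2.0, slo_target=0.999, window_days=30):
    """Compare minutes that violated the latency threshold with the allowed minutes."""
    window_minutes = window_days * 24 * 60
    allowed_minutes = (1 - slo_target) * window_minutes
    # A minute counts against the budget when its p99 latency is not below the threshold.
    bad_minutes = sum(1 for p99 in p99_by_minute if p99 >= threshold_s)
    remaining = allowed_minutes - bad_minutes
    return allowed_minutes, bad_minutes, remaining / allowed_minutes


# Hypothetical samples: 500 healthy minutes and 20 minutes above the 2 s threshold.
samples = [0.8] * 500 + [2.6] * 20
allowed, bad, remaining_fraction = latency_budget(samples)
print(f"allowed={allowed:.1f} min, bad={bad} min, remaining={remaining_fraction:.0%}")
```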
Error Budget Policy
A formal Error Budget Policy defines what happens at different burn levels:
| Budget Remaining | Status | Allowed Actions |
|-----------------|--------|-----------------|
| 100% - 50% | Healthy | All changes permitted; experiments encouraged |
| 50% - 25% | Caution | High-risk changes require additional review |
| 25% - 10% | Warning | Only critical bug fixes; feature freezes |
| < 10% | Critical | All changes frozen; reliability sprint |
| 0% (SLO violated) | Breach | Post-mortem required; SLA credits may apply if a customer-facing SLA is also breached |
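A policy table like this can be encoded as a small lookup so that deployment tooling can consult it automatically. The sketch below is one possible encoding; because the table's ranges share their boundary values, it assumes a boundary (for example exactly 50%) belongs to the healthier tier, and the function name `budget_status` is hypothetical.

```python
def budget_status(remaining_fraction):
    """Map the fraction of error budget remaining (0.0 to 1.0) to a policy status."""
    if remaining_fraction <= 0:
        return "Breach"      # budget exhausted, SLO violated
    if remaining_fraction < 0.10:
        return "Critical"    # all changes frozen
    if remaining_fraction < 0.25:
        return "Warning"     # only critical bug fixes
    if remaining_fraction < 0.50:
        return "Caution"     # high-risk changes need extra review
    return "Healthy"         # all changes permitted


for fraction in (0.80, 0.35, 0.15, 0.05, 0.0):
    print(f"{fraction:.0%} remaining -> {budget_status(fraction)}")
```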
Error Budget in AI/LLM Contexts
AI systems introduce complexity beyond traditional web services:
- Model Deployment Risk: Swapping a model or model version (for example, GPT-4o → GPT-4o-mini) may degrade response quality in ways that are hard to detect quickly, so the error budget should account for quality degradation, not just availability.
- External API Dependencies: If OpenAI has an outage that consumes your error budget, you've "spent" budget you didn't choose to spend, so error budget policies should distinguish self-caused from dependency-caused consumption.
- Chaos Engineering Budget: Teams can deliberately consume error budget by running chaos experiments (kill a pod, inject network latency). This "spends" budget but improves long-term resilience.
- Seasonal Variance: AI services may have predictable load spikes (product launches, end-of-quarter), so error budgets can be seasonally adjusted to give teams more runway during known risk periods.
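To separate self-caused from dependency-caused consumption, incidents can be tagged with a cause and the budget spend summed per tag. The sketch below assumes a hypothetical `Incident` record with a `cause` label of `"self"` or `"dependency"`; real incident data would need its own tagging scheme.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    failed_requests: int
    cause: str  # hypothetical labels: "self" or "dependency"


def consumption_by_cause(incidents, allowed_failures):
    """Return the fraction of the error budget consumed, grouped by cause."""
    totals = defaultdict(int)
    for incident in incidents:
        totals[incident.cause] += incident.failed_requests
    return {cause: count / allowed_failures for cause, count in totals.items()}


# Illustrative data against a budget of 1,000 allowed failures.
incidents = [
    Incident("bad model rollout", 300, "self"),
    Incident("upstream LLM API outage", 200, "dependency"),
]
shares = consumption_by_cause(incidents, allowed_failures=1000)
for cause, share in shares.items():
    print(f"{cause}: {share:.0%} of budget")
```

A policy could then, for instance, apply change freezes based only on the self-caused share while still reporting total consumption against the SLO.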
Fast Burn vs Slow Burn
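An incident consuming 10% of your monthly budget in 1 hour is a fast burn: at that rate the whole budget is gone in about 10 hours, so it should page someone immediately.
An incident consuming 5% of the budget per day is a slow burn: less urgent, but it exhausts the budget in 20 days and will breach the SLO before a 30-day window ends, so it needs attention within hours.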
Alerting should fire on both: fast-burn for immediate response, slow-burn for proactive intervention before SLO breach.
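Both cases can be expressed with a single burn-rate number: how fast the budget is being consumed relative to a pace that would use exactly 100% of it over the window. The sketch below assumes a 30-day window; the thresholds in `classify` are illustrative assumptions, not recommended values, and production alerting usually evaluates several windows at once.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(budget_fraction_consumed, elapsed_hours, window_hours=WINDOW_HOURS):
    """Budget consumption speed relative to a pace that uses exactly 100% per window."""
    return budget_fraction_consumed * window_hours / elapsed_hours


def classify(rate, fast_threshold=10.0, slow_threshold=1.0):
    """Map a burn rate to an alert class; both thresholds are illustrative assumptions."""
    if rate >= fast_threshold:
        return "fast-burn: page immediately"
    if rate > slow_threshold:
        return "slow-burn: open a ticket"
    return "ok"


for label, consumed, hours in [("10% in 1 hour", 0.10, 1), ("5% in 1 day", 0.05, 24)]:
    rate = burn_rate(consumed, hours)
    hours_left = WINDOW_HOURS / rate  # time to spend a full budget at this rate
    print(f"{label}: burn rate {rate:.1f}x, full budget in {hours_left:.0f} h, {classify(rate)}")
```

A burn rate above 1 means the budget will run out before the window ends; the larger the multiple, the less time there is to react.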
Error budgets are the operational currency of reliable AI systems. By converting the abstract goal of reliability into a finite, spendable resource with explicit policies governing its use, they let AI teams ship ambitious features rapidly when systems are healthy and enforce the discipline to fix foundations when reliability is under stress.