Alerting and Incident Response

Keywords: alerting, pagerduty, on-call

Alerting and Incident Response is the practice of defining threshold-based or anomaly-based rules that automatically notify on-call engineers when AI systems breach acceptable operating boundaries. It bridges the gap between observability data and human action, minimizing mean time to detection (MTTD) and mean time to resolution (MTTR) for production AI service failures.

What Is Alerting in AI Systems?

- Definition: Automated rules that evaluate metrics, logs, or traces against defined thresholds and trigger notifications (pages, Slack messages, emails) when conditions indicate a service degradation or failure requiring human intervention.
- On-Call Culture: Production AI services run 24/7 — alerting systems route incidents to the appropriate engineer based on scheduled rotations, ensuring someone is always responsible for critical failures even at 3 AM.
- Alert Quality: The goal is not maximum alerts but actionable alerts — every alert should represent a condition requiring immediate human decision-making, not background noise.
- Alert Fatigue: A critical failure mode where too many low-priority alerts train engineers to ignore notifications — the most dangerous state is an on-call engineer who assumes alerts are noise, missing a genuine critical incident.

Why Alerting Matters for AI Infrastructure

- LLM API Outages: When OpenAI or Anthropic APIs go down, downstream applications fail silently without proper alerting — users see generic errors while engineers are unaware.
- GPU Memory Leaks: Memory leak in serving code causes VRAM to fill gradually over hours — alerting catches it before OOM kills the inference server.
- Inference Degradation: A bad model deployment causes p99 latency to spike from 2s to 30s — alerting triggers within minutes, enabling rapid rollback before most users are affected.
- Cost Explosions: A prompt injection attack or buggy client sends millions of long requests — cost alerting catches billing anomalies before they become multi-thousand-dollar surprises.
- Data Pipeline Failures: Embedding pipeline fails to process new documents — alert fires when vector DB staleness exceeds acceptable threshold.

The Alerting Stack

Prometheus AlertManager:
- Receives alerts from the Prometheus server, which continuously evaluates PromQL alerting rules against its metrics.
- Deduplicates, groups, and routes alerts to the appropriate channels (see the routing sketch after the example rule below).
- Handles silences (planned maintenance windows) and inhibitions, which suppress dependent alerts while a related, higher-severity alert is firing.

Example rule:
groups:
  - name: inference
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency exceeds 5 seconds"
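
Firing the rule is only half the job; Alertmanager decides who hears about it. As a minimal routing sketch (receiver names, label values, and timings are illustrative, not a recommended production configuration), the route tree below groups alerts by service, pages on critical severity, and uses an inhibition rule to mute warning-level alerts while a critical alert for the same service is active:

route:
  receiver: slack-default              # catch-all for anything not matched below
  group_by: ['alertname', 'service']   # batch related alerts into one notification
  group_wait: 30s                      # wait briefly so grouped alerts arrive together
  repeat_interval: 4h                  # re-notify if the alert is still firing
  routes:
    - matchers: ['severity = critical']
      receiver: pagerduty-oncall       # defined in the receiver sketch below
      repeat_interval: 1h

inhibit_rules:
  # Suppress warning-level noise while a critical alert fires for the same service.
  - source_matchers: ['severity = critical']
    target_matchers: ['severity = warning']
    equal: ['service']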

PagerDuty:
- On-call schedule management — routes alerts to correct engineer based on time of day and rotation.
- Escalation policies — if primary on-call doesn't acknowledge within 5 minutes, escalate to secondary.
- Mobile app with phone calls + push notifications — guaranteed wake-up for critical incidents.
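
Connecting Alertmanager to PagerDuty's on-call rotation is a matter of pointing a receiver at a PagerDuty service's Events API v2 integration key. A minimal sketch, with the key and channel names as placeholders:

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<events-api-v2-integration-key>'   # placeholder; comes from the PagerDuty service integration
        severity: '{{ .CommonLabels.severity }}'         # map the alert's severity label to PagerDuty
        description: '{{ .CommonAnnotations.summary }}'
  - name: slack-default
    slack_configs:
      - channel: '#ai-infra-alerts'                      # assumes a global slack_api_url is configured
        send_resolved: true

Escalation from primary to secondary on-call then lives in the PagerDuty escalation policy itself, not in Alertmanager.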

OpsGenie: PagerDuty alternative with similar on-call management, popular with Atlassian (Jira/Confluence) shops.

Grafana Alerting: Evaluate Prometheus/Loki queries within Grafana and route to Slack/PagerDuty — consolidates alerting rules with dashboards.

Alert Design Principles

Symptom-Based (Correct):
- "Users cannot complete requests" (high error rate).
- "Response latency exceeds SLO" (p99 > 5s).
- "Service is down" (no successful health checks).

Cause-Based (Incorrect):
- "CPU is 90%" (may be fine — batch processing).
- "Memory is 80%" (may be normal — caching).
- "Disk is filling" (unless near 100%, not urgent).

Alert on symptoms that directly impact users. Cause-based alerts produce noise without actionable urgency.
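
As an illustration of the symptom-based approach, the rule below pages on the user-visible error ratio rather than on CPU or memory; the metric and label names are assumptions for the sketch, not standard ones:

groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{job="inference-api", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="inference-api"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of user requests are failing"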

Severity Levels for AI Systems

| Severity | Condition | Response | SLA |
|----------|-----------|----------|-----|
| Critical/P1 | Service down, 0% success rate | Wake on-call immediately | 15 min response |
| High/P2 | Error rate > 5%, p99 > SLO | Alert on-call within 5 min | 30 min response |
| Medium/P3 | Degraded performance, cost spike | Slack notification, next business day | 4 hours |
| Low/P4 | Approaching limits, minor anomalies | Email, weekly review | Best effort |

AI-Specific Alert Rules

- GPU memory > 90% for 5 minutes → High.
- Inference error rate > 1% for 2 minutes → Critical.
- TTFT p95 > 10s for 5 minutes → High.
- Cost per hour > 2x 7-day average → Medium.
- Vector DB staleness > 24 hours → Medium.
- Model serving pod restart count > 3/hour → High.
- Token generation rate drops > 50% from baseline → High.
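
Several of these translate directly into Prometheus rules. The sketch below assumes NVIDIA's dcgm-exporter for GPU memory and a hypothetical, application-exported cost counter, so the metric names are illustrative rather than standard:

groups:
  - name: ai-serving
    rules:
      - alert: GPUMemoryNearCapacity
        # DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE are per-GPU framebuffer memory gauges from dcgm-exporter.
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "GPU memory above 90% for 5 minutes"
      - alert: CostPerHourAnomaly
        # cost_dollars_total is a hypothetical counter maintained by the billing pipeline.
        expr: rate(cost_dollars_total[1h]) > 2 * avg_over_time(rate(cost_dollars_total[1h])[7d:1h])
        for: 30m
        labels:
          severity: medium
        annotations:
          summary: "Hourly spend is more than 2x the 7-day average"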

Alerting is the human-machine interface for production AI reliability. When designed around actionable symptoms rather than cause-based noise, alerting systems transform raw observability data into rapid incident response, protecting user experience and letting AI teams sleep soundly knowing that critical failures will be caught within minutes.
