AI Error Handling is the set of patterns and strategies for building reliable applications on top of probabilistic, sometimes-failing language model APIs — addressing the unique failure modes of AI systems including hallucination, format violations, safety refusals, rate limits, and context length overflows through defensive programming patterns like self-correction, validation, retry logic, and graceful degradation.
What Is AI Error Handling?
- Definition: Application-layer strategies for detecting, recovering from, and gracefully degrading when AI model calls fail — encompassing both API-level failures (network errors, rate limits, timeouts) and AI-specific failures (hallucination, wrong format, unexpected refusals).
- Unique Challenge: Unlike traditional API failures where errors are binary (success/failure), AI failures are often probabilistic — the model returns HTTP 200 but produces wrong, hallucinated, or incorrectly formatted content.
- Defensive Programming Requirement: AI applications must validate outputs, not just API responses — a successful API call that returns hallucinated JSON is an application-layer failure.
- Production Reality: Without error handling, AI applications fail in ways that are difficult to diagnose and damaging to user trust — unexpected refusals, JSON parse errors, and hallucinated facts all appear as silent failures.
AI-Specific Failure Categories
Hallucination: Model generates factually incorrect, fabricated, or internally inconsistent content.
- Detection: Fact checking against knowledge base; self-consistency checks; human review queues.
- Recovery: Retrieval augmentation (provide facts, ask model to use them); chain-of-thought prompting; self-critique loop.
Format Violations: Model returns prose when JSON was requested, markdown when plain text was needed, or JSON with syntax errors.
- Detection: Schema validation (Pydantic, jsonschema); regex matching for expected patterns.
- Recovery: Self-correction prompt ("Your response was not valid JSON. Please return only valid JSON matching this schema: [schema]"); retry with stronger format instruction; structured output API (function calling, JSON mode).
Safety Refusals: Model refuses legitimate request due to over-sensitive safety training.
- Detection: Check response for refusal phrases; measure refusal rate in monitoring.
- Recovery: Rephrase request with additional context; provide explicit authorization in system prompt; use different model or configuration.
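A minimal sketch of phrase-based refusal detection — the marker list and function name are illustrative, not from any library, and a production detector would need a broader, model-specific phrase set:

```python
# Hypothetical phrase-based refusal detector: flags responses that open
# with common refusal boilerplate so they can be retried or rephrased.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm sorry, but i can't",
    "as an ai, i am unable",
)

def looks_like_refusal(response: str) -> bool:
    opening = response.strip().lower()[:200]  # refusals usually appear up front
    return any(marker in opening for marker in REFUSAL_MARKERS)
```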
Context Overflow: Input exceeds context window, causing truncation or API error.
- Detection: Token count validation before API call; monitor for truncation warnings.
- Recovery: Chunk large inputs; summarize conversation history; use model with larger context window.
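The pre-flight token check above can be sketched as follows — the character-based estimate, limit, and budget values are illustrative assumptions; a real implementation would use the provider's tokenizer (e.g. tiktoken) and the actual model limits:

```python
# Rough pre-flight context check: approximates tokens as ~4 characters each.
CONTEXT_LIMIT = 128_000   # assumed model context window, in tokens
RESPONSE_BUDGET = 4_000   # tokens reserved for the model's reply

def estimate_tokens(text: str) -> int:
    return len(text) // 4 + 1

def fits_in_context(prompt: str) -> bool:
    # Reject before calling the API rather than letting the request fail
    return estimate_tokens(prompt) + RESPONSE_BUDGET <= CONTEXT_LIMIT
```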
Rate Limiting: API returns 429 (Too Many Requests) when request volume exceeds quota.
- Recovery: Exponential backoff with jitter; request queue with backpressure; per-user rate limiting.
Timeout: Model takes longer than acceptable latency budget.
- Recovery: Streaming responses (return partial output rather than nothing); request cancellation with fallback message; async processing with notification.
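The backoff-with-jitter recovery above can be sketched as a retry wrapper — `RateLimitError` here is a stand-in for the provider's 429 exception type, and `call` is any zero-argument API call:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's 429 exception type."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    # Exponential backoff with full jitter: sleep a random amount in
    # [0, base * 2^attempt) so concurrent clients don't retry in lockstep.
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```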
Error Recovery Patterns
Pattern 1 — Self-Correction Loop:
```python
import json
from jsonschema import validate, ValidationError

class MaxRetriesExceeded(Exception):
    pass

def generate_with_correction(prompt: str, schema: dict, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        response = llm.generate(prompt)
        try:
            result = json.loads(response)
            validate(result, schema)  # JSON Schema validation
            return result
        except (json.JSONDecodeError, ValidationError) as e:
            # Feed the error back to the model for self-correction
            prompt = f"""Previous response was invalid: {e}
Please provide a corrected response as valid JSON matching: {schema}"""
    raise MaxRetriesExceeded(f"Failed after {max_retries} correction attempts")
```
Pattern 2 — Structured Output API (Preferred):
Use model-native structured output to eliminate format errors:
```python
# OpenAI function calling / structured output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "output", "strict": True, "schema": output_schema},
    },
)
# Response is guaranteed to be valid JSON matching the schema
```
Pattern 3 — Ensemble and Majority Vote:
For high-stakes decisions, generate N responses and take the majority:
```python
from collections import Counter

def majority_vote(prompt: str, n: int = 5) -> str:
    responses = [llm.generate(prompt) for _ in range(n)]
    # For classification tasks, take the majority vote
    votes = Counter(responses)
    return votes.most_common(1)[0][0]
```
Majority voting measurably reduces error rates on factual questions, at the cost of N times the token spend.
Pattern 4 — Fallback Hierarchy:
```python
def robust_generate(prompt: str) -> str:
    try:
        return gpt4o.generate(prompt, timeout=5)  # Primary: fast, expensive
    except TimeoutError:
        try:
            return gpt4o_mini.generate(prompt, timeout=10)  # Fallback: slower, cheaper
        except Exception:
            return CANNED_FALLBACK_RESPONSE  # Last resort: canned response
```
Monitoring and Observability
Effective AI error handling requires measurement:
- Refusal rate: % of requests that triggered safety refusals — high rate indicates over-refusal or prompt issues.
- Format error rate: % of responses requiring correction — high rate indicates weak format instructions.
- Retry rate: % of requests requiring at least one retry — high rate indicates API reliability issues.
- Hallucination rate: Measured via fact-checking samples against ground truth — requires human or automated evaluation.
- P50/P95/P99 latency: Including retry overhead — critical for user experience SLAs.
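The rates above can be tracked with a simple in-process tally — this class and its outcome labels are an illustrative sketch, not a substitute for a real metrics backend:

```python
from collections import Counter

class ErrorMetrics:
    """Minimal in-process tally of per-outcome rates (illustrative)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome: str) -> None:
        self.counts["total"] += 1
        self.counts[outcome] += 1  # e.g. "refusal", "format_error", "retry"

    def rate(self, outcome: str) -> float:
        total = self.counts["total"]
        return self.counts[outcome] / total if total else 0.0
```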
AI error handling is the engineering discipline that bridges the gap between probabilistic AI systems and deterministic production reliability — by treating both API failures and AI-specific failures as first-class engineering concerns with explicit detection, recovery, and fallback strategies, developers build AI applications that maintain user trust and operational reliability even when underlying models misbehave.