AI Error Handling is the set of patterns and strategies for building reliable applications on top of probabilistic, sometimes-failing language model APIs — addressing the unique failure modes of AI systems including hallucination, format violations, safety refusals, rate limits, and context length overflows through defensive programming patterns like self-correction, validation, retry logic, and graceful degradation.
What Is AI Error Handling?
- Definition: Application-layer strategies for detecting, recovering from, and gracefully degrading when AI model calls fail — encompassing both API-level failures (network errors, rate limits, timeouts) and AI-specific failures (hallucination, wrong format, unexpected refusals).
- Unique Challenge: Unlike traditional API failures where errors are binary (success/failure), AI failures are often probabilistic — the model returns HTTP 200 but produces wrong, hallucinated, or incorrectly formatted content.
- Defensive Programming Requirement: AI applications must validate outputs, not just API responses — a successful API call that returns hallucinated JSON is an application-layer failure.
- Production Reality: Without error handling, AI applications fail in ways that are difficult to diagnose and damaging to user trust — unexpected refusals, JSON parse errors, and hallucinated facts all appear as silent failures.
AI-Specific Failure Categories
Hallucination: Model generates factually incorrect, fabricated, or internally inconsistent content.
- Detection: Fact checking against knowledge base; self-consistency checks; human review queues.
- Recovery: Retrieval augmentation (provide facts, ask model to use them); chain-of-thought prompting; self-critique loop.
Format Violations: Model returns prose when JSON was requested, markdown when plain text was needed, or JSON with syntax errors.
- Detection: Schema validation (Pydantic, jsonschema); regex matching for expected patterns.
- Recovery: Self-correction prompt ("Your response was not valid JSON. Please return only valid JSON matching this schema: [schema]"); retry with stronger format instruction; structured output API (function calling, JSON mode).
Safety Refusals: Model refuses legitimate request due to over-sensitive safety training.
- Detection: Check response for refusal phrases; measure refusal rate in monitoring.
- Recovery: Rephrase request with additional context; provide explicit authorization in system prompt; use different model or configuration.
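A minimal sketch of phrase-based refusal detection — the marker list and function name are illustrative, not from any library, and a production detector would need a broader, model-specific phrase set:

```python
# Hypothetical phrase-based refusal detector: flags responses that open
# with common refusal boilerplate so they can be retried or rephrased.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm sorry, but i can't",
    "as an ai, i am unable",
)

def looks_like_refusal(response: str) -> bool:
    opening = response.strip().lower()[:200]  # refusals usually appear up front
    return any(marker in opening for marker in REFUSAL_MARKERS)
```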
Context Overflow: Input exceeds context window, causing truncation or API error.
- Detection: Token count validation before API call; monitor for truncation warnings.
- Recovery: Chunk large inputs; summarize conversation history; use model with larger context window.
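The pre-flight token check above can be sketched as follows — the character-based estimate, limit, and budget values are illustrative assumptions; a real implementation would use the provider's tokenizer (e.g. tiktoken) and the actual model limits:

```python
# Rough pre-flight context check: approximates tokens as ~4 characters each.
CONTEXT_LIMIT = 128_000   # assumed model context window, in tokens
RESPONSE_BUDGET = 4_000   # tokens reserved for the model's reply

def estimate_tokens(text: str) -> int:
    return len(text) // 4 + 1

def fits_in_context(prompt: str) -> bool:
    # Reject before calling the API rather than letting the request fail
    return estimate_tokens(prompt) + RESPONSE_BUDGET <= CONTEXT_LIMIT
```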
Rate Limiting: API returns 429 (Too Many Requests) when request volume exceeds quota.
- Recovery: Exponential backoff with jitter; request queue with backpressure; per-user rate limiting.
Timeout: Model takes longer than acceptable latency budget.
- Recovery: Streaming responses (return partial output rather than nothing); request cancellation with fallback message; async processing with notification.
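The backoff-with-jitter recovery above can be sketched as a retry wrapper — `RateLimitError` here is a stand-in for the provider's 429 exception type, and `call` is any zero-argument API call:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's 429 exception type."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    # Exponential backoff with full jitter: sleep a random amount in
    # [0, base * 2^attempt) so concurrent clients don't retry in lockstep.
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```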
Error Recovery Patterns
Pattern 1 — Self-Correction Loop:
```python
import json
from jsonschema import validate, ValidationError

class MaxRetriesExceeded(Exception):
    pass

def generate_with_correction(prompt: str, schema: dict, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        response = llm.generate(prompt)
        try:
            result = json.loads(response)
            validate(result, schema)  # JSON Schema validation
            return result
        except (json.JSONDecodeError, ValidationError) as e:
            # Feed the error back to the model for self-correction
            prompt = f"""Previous response was invalid: {e}
Please provide a corrected response as valid JSON matching: {schema}"""
    raise MaxRetriesExceeded(f"Failed after {max_retries} correction attempts")
```
Pattern 2 — Structured Output API (Preferred):
Use model-native structured output to eliminate format errors:
```python
# OpenAI function calling / structured output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "output", "strict": True, "schema": output_schema},
    },
)
# Response is guaranteed to be valid JSON matching the schema
```
Pattern 3 — Ensemble and Majority Vote:
For high-stakes decisions, generate N responses and take the majority:
```python
from collections import Counter

def majority_vote(prompt: str, n: int = 5) -> str:
    responses = [llm.generate(prompt) for _ in range(n)]
    # For classification tasks, take the majority vote
    votes = Counter(responses)
    return votes.most_common(1)[0][0]
```
Majority voting measurably reduces error rates on factual questions, at the cost of N times the token spend.
Pattern 4 — Fallback Hierarchy:
```python
def robust_generate(prompt: str) -> str:
    try:
        return gpt4o.generate(prompt, timeout=5)  # Primary: fast, expensive
    except TimeoutError:
        try:
            return gpt4o_mini.generate(prompt, timeout=10)  # Fallback: slower, cheaper
        except Exception:
            return CANNED_FALLBACK_RESPONSE  # Last resort: canned response
```
Monitoring and Observability
Effective AI error handling requires measurement:
- Refusal rate: % of requests that triggered safety refusals — high rate indicates over-refusal or prompt issues.
- Format error rate: % of responses requiring correction — high rate indicates weak format instructions.
- Retry rate: % of requests requiring at least one retry — high rate indicates API reliability issues.
- Hallucination rate: Measured via fact-checking samples against ground truth — requires human or automated evaluation.
- P50/P95/P99 latency: Including retry overhead — critical for user experience SLAs.
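The rates above can be tracked with a simple in-process tally — this class and its outcome labels are an illustrative sketch, not a substitute for a real metrics backend:

```python
from collections import Counter

class ErrorMetrics:
    """Minimal in-process tally of per-outcome rates (illustrative)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome: str) -> None:
        self.counts["total"] += 1
        self.counts[outcome] += 1  # e.g. "refusal", "format_error", "retry"

    def rate(self, outcome: str) -> float:
        total = self.counts["total"]
        return self.counts[outcome] / total if total else 0.0
```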
AI error handling is the engineering discipline that bridges the gap between probabilistic AI systems and deterministic production reliability — by treating both API failures and AI-specific failures as first-class engineering concerns with explicit detection, recovery, and fallback strategies, developers build AI applications that maintain user trust and operational reliability even when underlying models misbehave.