Debugging LLM applications

Keywords: debugging llm, troubleshooting, hallucinations, eval sets, logging, tracing, langsmith, prompt engineering

Debugging LLM applications is the systematic process of identifying and fixing issues in AI-powered systems: addressing problems such as hallucinations, format errors, inconsistent behavior, and poor performance through logging, tracing, prompt iteration, and repeatable testing of LLM interactions.

What Is LLM Debugging?

- Definition: Finding and fixing problems in LLM-based applications.
- Challenge: Non-deterministic outputs make traditional debugging harder.
- Approach: Combine logging, tracing, eval sets, and prompt engineering.
- Goal: Reliable, high-quality AI application behavior.

Why LLM Debugging Is Different

- Non-Determinism: Same input can produce different outputs.
- Black Box: Can't step through model internals.
- Subjective Quality: "Good" responses are often judgment calls.
- Context Sensitivity: Behavior depends on full conversation history.
- Emergent Behaviors: Unexpected outputs from prompt combinations.

Common Issues & Solutions

Hallucinations:
```
Problem: Model confidently states incorrect information
Solutions:
- Add retrieval (RAG) for grounded answers
- Implement fact-checking step
- Add "say I don't know if uncertain" instruction
- Verify against source documents
```
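
A minimal sketch of the grounding-plus-instruction approach, assuming a hypothetical `llm_complete(prompt)` helper that wraps whatever client you use; `context` would come from your retrieval step:

```python
# Hypothetical helper: llm_complete(prompt) wraps your model client/SDK.
def answer_with_grounding(question: str, context: str) -> str:
    prompt = (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm_complete(prompt)
```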

Wrong Format:
```
Problem: Output doesn't match expected structure
Solutions:
- Provide explicit format examples
- Use JSON mode / structured output
- Include format specification in prompt
- Post-process to extract/validate
```
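
Post-processing to extract and validate structured output can be sketched with the standard library alone; the required keys below are illustrative:

```python
import json

def parse_json_response(raw: str, required_keys=("title", "summary")) -> dict:
    """Extract and validate a JSON object from a model response."""
    # Models often wrap JSON in markdown fences; strip them before parsing.
    cleaned = raw.strip()
    for fence in ("```json", "```"):
        cleaned = cleaned.removeprefix(fence)
    cleaned = cleaned.removesuffix("```").strip()
    data = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"Response missing keys: {missing}")
    return data
```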

Excessive Verbosity:
```
Problem: Responses are too long or include unwanted content
Solutions:
- Add "Be concise" instruction
- Specify word/sentence limits
- Use "Answer only with X" directive
- Truncate in post-processing
```
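
When prompt instructions alone do not keep answers short, truncating in post-processing is a crude but reliable fallback; the sentence limit here is arbitrary:

```python
import re

def truncate_to_sentences(text: str, max_sentences: int = 3) -> str:
    """Keep only the first few sentences of an over-long response."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])
```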

Inconsistent Behavior:
```
Problem: Different responses for similar inputs
Solutions:
- Lower temperature (more deterministic)
- More specific instructions
- Few-shot examples for consistency
- Validate outputs before returning
```
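
A sketch of combining low temperature with few-shot examples, shown here against the OpenAI chat API as one example; the model name and examples are placeholders:

```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system",
     "content": "Classify the sentiment as exactly one word: positive, negative, or neutral."},
    # Few-shot examples anchor the expected style and output format.
    {"role": "user", "content": "The product arrived broken."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Works exactly as described, very happy."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "It does the job."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=messages,
    temperature=0,         # reduce randomness for more consistent outputs
)
print(response.choices[0].message.content)
```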

Debugging Checklist

```
□ Check prompt formatting
- Correct template substitution?
- Special characters escaped?
- Proper message structure?

□ Verify model configuration
- Correct model version?
- Appropriate temperature?
- Sufficient max_tokens?

□ Test with minimal input
- Does simple case work?
- Isolate the failing component

□ Review context/history
- Is conversation history correct?
- Is too much context overwhelming the model?

□ Add explicit instructions
- Be more specific about desired behavior
- Provide examples of good/bad outputs
```

Debugging Tools

Tracing & Observability:
```
Tool           | Features
---------------|----------------------------------
LangSmith      | LangChain tracing, evals, testing
Langfuse       | Open source, self-hosted option
Phoenix        | Debugging for LLM apps
Helicone       | Logging, analytics
Custom logging | Request/response logging
```

Tracing Implementation:
```python
import logging

logging.basicConfig(level=logging.DEBUG)

def call_llm(prompt):
    # Log a truncated copy of the prompt so long contexts don't flood the logs.
    logging.debug(f"Prompt: {prompt[:200]}...")

    response = llm.invoke(prompt)  # `llm` is your model client (e.g., a LangChain chat model)

    logging.debug(f"Response: {str(response)[:200]}...")
    # Token usage lives in different places depending on the client;
    # LangChain chat models expose it as `usage_metadata` when available.
    logging.info(f"Tokens: {getattr(response, 'usage_metadata', 'n/a')}")

    return response
```

Systematic Debugging Process

```
┌──────────────────────────────────────────────┐
│ 1. Reproduce the Issue                        │
│    - Get exact input that caused problem      │
│    - Note model, temperature, system prompt   │
├──────────────────────────────────────────────┤
│ 2. Isolate the Component                      │
│    - Test LLM directly (bypass app logic)     │
│    - Test with minimal prompt                 │
│    - Add/remove context incrementally         │
├──────────────────────────────────────────────┤
│ 3. Hypothesize & Test                         │
│    - Form theory about cause                  │
│    - Test with modified prompt/params         │
│    - Validate fix works consistently          │
├──────────────────────────────────────────────┤
│ 4. Implement & Verify                         │
│    - Apply fix to production                  │
│    - Add to regression test set               │
│    - Monitor for recurrence                   │
└──────────────────────────────────────────────┘
```
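
Steps 1 and 2 usually amount to replaying the exact failing input against the model with the same configuration, outside the application. A minimal sketch, assuming a captured record from your logs and a hypothetical `llm_complete` helper that accepts the model parameters:

```python
# Hypothetical captured record (from logs or a tracing tool) for the failing request.
failing_case = {
    "system_prompt": "...",   # exact system prompt at the time of failure
    "user_input": "...",      # exact input that triggered the problem
    "model": "gpt-4o-mini",   # placeholder
    "temperature": 0.7,
}

def reproduce(case: dict, runs: int = 5) -> list[str]:
    """Replay the failing input directly against the model, bypassing app logic."""
    # Several runs show whether the failure is consistent or intermittent.
    return [
        llm_complete(
            case["user_input"],
            system=case["system_prompt"],
            model=case["model"],
            temperature=case["temperature"],
        )
        for _ in range(runs)
    ]
```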

Building Eval Sets

```python
eval_cases = [
    {
        "input": "What is 2+2?",
        "expected_contains": ["4"],
        "expected_not_contains": ["5", "3"]
    },
    {
        "input": "List 3 colors",
        "validator": lambda r: len(extract_list(r)) == 3
    }
]

def run_evals(llm_function):
    results = []
    for case in eval_cases:
        response = llm_function(case["input"])
        passed = validate(response, case)  # helper sketched below
        results.append({"case": case, "passed": passed})
    return results
```
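
The `validate` and `extract_list` helpers above are not defined in the snippet; one possible sketch consistent with the eval-case fields:

```python
def extract_list(response: str) -> list[str]:
    """Rough list extraction: one item per non-empty bulleted/numbered line,
    falling back to comma-separated values."""
    lines = [line.strip(" \t-*").lstrip("0123456789.) ") for line in response.splitlines()]
    items = [line for line in lines if line]
    if len(items) > 1:
        return items
    return [part.strip() for part in response.split(",") if part.strip()]

def validate(response: str, case: dict) -> bool:
    """Check a response against whichever expectations the eval case defines."""
    if any(s not in response for s in case.get("expected_contains", [])):
        return False
    if any(s in response for s in case.get("expected_not_contains", [])):
        return False
    if "validator" in case and not case["validator"](response):
        return False
    return True
```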

Prompt Debugging Techniques

- A/B Testing: Compare prompt variations (see the sketch after this list).
- Ablation: Remove components to find minimum working prompt.
- Chain-of-Thought: Force reasoning to understand model thinking.
- Self-Critique: Ask model to evaluate its own response.
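
The A/B testing technique can reuse the eval harness from the previous section; a minimal sketch assuming the `run_evals` function above and a hypothetical `llm_complete(prompt)` helper:

```python
prompt_a = "Answer the question briefly.\nQuestion: {q}"
prompt_b = "You are a precise assistant. Reply with only the final answer.\nQuestion: {q}"

def compare_prompts(variants: dict[str, str]) -> None:
    """Run the same eval set against each prompt variant and report pass rates."""
    for name, template in variants.items():
        results = run_evals(lambda q, t=template: llm_complete(t.format(q=q)))
        pass_rate = sum(r["passed"] for r in results) / len(results)
        print(f"Prompt {name}: {pass_rate:.0%} of eval cases passed")

compare_prompts({"A": prompt_a, "B": prompt_b})
```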

Debugging LLM applications requires a different mindset than traditional debugging — combining systematic testing, good observability, and iterative prompt refinement to achieve reliable behavior in systems that are inherently probabilistic.
