Observability is the ability to understand the internal state of a system by examining its external outputs — specifically its logs, metrics, and traces (the "three pillars"). For AI/ML systems, observability goes beyond traditional software monitoring to include model-specific signals like prediction quality, drift, and safety metrics.
The Three Pillars
- Logs: Structured records of discrete events — request received, model inference started, error occurred, safety filter triggered. Useful for debugging specific incidents.
- Metrics: Numerical measurements aggregated over time — request rate, p95 latency, GPU utilization, token throughput, error rate. Useful for dashboards and alerting.
- Traces: End-to-end request flows showing timing and causality across services — a user request → API gateway → preprocessing → model inference → postprocessing → response. Useful for diagnosing latency and identifying bottlenecks; see the tracing sketch below.
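As a concrete illustration of the trace pillar, here is a minimal sketch using the OpenTelemetry Python SDK with a console exporter; the span names (handle_request, preprocess, model_inference, postprocess) and the stub model call are illustrative assumptions, not a required convention.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-service")

def handle_request(prompt: str) -> str:
    # One parent span per request; child spans mark each stage so the
    # trace shows where latency is spent.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("request.prompt_chars", len(prompt))
        with tracer.start_as_current_span("preprocess"):
            cleaned = prompt.strip()
        with tracer.start_as_current_span("model_inference"):
            output = f"echo: {cleaned}"  # placeholder for the real model call
        with tracer.start_as_current_span("postprocess"):
            return output.upper()

print(handle_request("  hello observability  "))
```

In production you would swap the console exporter for an OTLP exporter pointed at your collector or tracing backend (Jaeger, Datadog, etc.).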
AI-Specific Observability
- Model Performance: Track accuracy, quality scores, and other evaluation metrics on production traffic, not just offline test sets.
- Data Drift: Monitor input distributions (features, prompts) for shifts away from the training or baseline data that can degrade model performance; see the drift-check sketch after this list.
- Concept Drift: Detect when the relationship between inputs and correct outputs changes over time.
- Token Usage: Track input/output tokens per request for cost monitoring and optimization.
- Safety Metrics: Monitor content filter trigger rates, refusal rates, and flagged outputs.
- Hallucination Detection: Track factuality scores or retrieval groundedness metrics.
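As a sketch of the data-drift check above, the snippet below compares live feature samples against a training-time baseline with a two-sample Kolmogorov-Smirnov test; the drift_report helper, the 0.05 significance threshold, and the per-column loop are illustrative assumptions rather than a standard method.

```python
# pip install numpy scipy
import numpy as np
from scipy.stats import ks_2samp

def drift_report(baseline: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> dict:
    """Compare each feature column in `live` against `baseline` with a
    two-sample KS test; a small p-value suggests the distribution shifted."""
    report = {}
    for col in range(baseline.shape[1]):
        stat, p_value = ks_2samp(baseline[:, col], live[:, col])
        report[col] = {"ks_stat": float(stat), "p_value": float(p_value), "drifted": p_value < alpha}
    return report

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=(5000, 3))  # training-time snapshot
live = baseline.copy()
live[:, 2] += 0.8                                # simulate drift in one feature
print(drift_report(baseline, live))
```

The same pattern extends to concept drift by comparing model error or agreement rates over time rather than raw inputs.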
Observability Tools for ML
- General: Datadog, Grafana + Prometheus, New Relic, Elastic Observability; a Prometheus instrumentation sketch follows this list.
- Distributed Tracing: OpenTelemetry, Jaeger, Zipkin for cross-service trace collection.
- ML-Specific: Arize AI, WhyLabs, Fiddler, Arthur for model monitoring and drift detection.
- LLM-Specific: LangSmith, Helicone, Portkey, Braintrust for prompt/response tracing, evaluation, and cost tracking.
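For the Grafana + Prometheus route, the model server typically exposes its own metrics endpoint for Prometheus to scrape. The sketch below uses the prometheus_client library; the metric names, labels, and port are assumptions chosen for illustration.

```python
# pip install prometheus-client
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Inference requests", ["model_version", "status"])
LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency", ["model_version"])
TOKENS = Counter("output_tokens_total", "Output tokens generated", ["model_version"])

def handle_request(prompt: str, model_version: str = "v1") -> str:
    start = time.perf_counter()
    try:
        output = "stub response"          # placeholder for the real model call
        TOKENS.labels(model_version).inc(len(output.split()))
        REQUESTS.labels(model_version, "ok").inc()
        return output
    except Exception:
        REQUESTS.labels(model_version, "error").inc()
        raise
    finally:
        LATENCY.labels(model_version).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)   # metrics served at http://localhost:9100/metrics
    while True:
        handle_request("hello")
        time.sleep(1)
```

Prometheus scrapes the /metrics endpoint on a schedule, and Grafana dashboards and alert rules are then built on the stored time series.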
Best Practices
- Structured Logging: Use JSON-formatted logs with consistent fields (request_id, model_version, latency_ms, token_count), as in the sketch after this list.
- Correlation IDs: Attach a unique request ID to every log entry and trace span so a single request can be followed end to end across services.
- Alerting: Set actionable alerts on key metrics with appropriate thresholds and severity levels.
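Putting the first two practices together, here is a minimal structured-logging sketch using only the Python standard library; the JsonFormatter class and the exact field names mirror the list above and are a stand-in for whatever your logging pipeline actually expects.

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, merging extra fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("inference")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(prompt: str) -> str:
    request_id = str(uuid.uuid4())   # correlation ID reused in every log line and trace span
    start = time.perf_counter()
    output = "stub response"         # placeholder for the real model call
    logger.info("inference complete", extra={"fields": {
        "request_id": request_id,
        "model_version": "v1",
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "token_count": len(output.split()),
    }})
    return output

handle_request("hello")
```

Because every line is a single JSON object with the same field names, log backends can index, filter, and join them against traces by request_id.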
Observability is the foundation of production reliability — you can't fix what you can't see, and AI systems have more dimensions to observe than traditional software.