Prometheus is an open-source monitoring and alerting toolkit that collects time-series metrics by scraping HTTP endpoints (a pull-based architecture). It is widely used as the metrics backend in observability stacks for AI infrastructure, Kubernetes clusters, and GPU monitoring, at companies from startups to hyperscalers.
What Is Prometheus?
- Definition: A pull-based time-series database and monitoring system that periodically scrapes /metrics HTTP endpoints from instrumented applications, stores metrics with labels, and evaluates alerting rules against the collected data.
- Created By: SoundCloud (2012); joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second hosted project after Kubernetes.
- Pull vs Push: Unlike push-based monitoring, where agents or applications send metrics to a central server (StatsD, for example), Prometheus pulls metrics from its targets. This makes it easy to see exactly what is being monitored, and a failed scrape is itself a signal (the built-in up metric drops to 0) that a target is unreachable.
- Data Model: Every metric is a time-series identified by a metric name plus a set of key-value label pairs — enabling multi-dimensional queries.
Why Prometheus Matters for AI Infrastructure
- GPU Monitoring: NVIDIA's DCGM Exporter exposes GPU temperature, memory usage, SM utilization, and NVLink bandwidth as Prometheus metrics — essential for detecting thermal throttling and memory leaks in training runs.
- Inference Metrics: vLLM, TGI (Text Generation Inference), and Triton Inference Server all expose Prometheus metrics natively, covering signals such as queue depth, time to first token (TTFT), and throughput; the exact metric set varies by server.
- Cost Attribution: Track token usage per model, per service, per user — enabling chargeback and cost optimization.
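- Kubernetes Integration: Prometheus Operator automates scrape configuration through ServiceMonitor and PodMonitor resources, so matching pods are picked up as they appear; this is critical for dynamic AI serving infrastructure.
- Alertmanager Integration: Prometheus evaluates alerting rules and sends firing alerts to Alertmanager, which routes them to PagerDuty or Slack, for example when GPU memory exceeds 90% or the inference error rate spikes.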
Core Concepts
Metric Types:
- Counter: Monotonically increasing value — requests total, tokens generated, errors. Use rate() to compute per-second rate.
- Gauge: Value that can go up or down — GPU memory in use, queue depth, batch size.
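- Histogram: Observations counted into cumulative buckets, plus a running sum and count. Percentiles such as request latency p50, p95, and p99 are estimated at query time with histogram_quantile().
- Summary: Quantiles calculated client-side, inside the instrumented application. Similar to a histogram, but the quantiles are fixed at instrumentation time and cannot be meaningfully aggregated across instances.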
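The first three metric types can be illustrated with a minimal, standard-library-only sketch. This is not the official client library; the class names and methods below are assumptions chosen for illustration, and the bucket bounds are hypothetical.

```python
import bisect


class Counter:
    """Monotonically increasing value; it may only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations into buckets with inclusive upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        pairs, running = [], 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, n in zip(bounds, self.bucket_counts):
            running += n
            pairs.append((bound, running))
        return pairs
```

The cumulative pairs mirror how a histogram is exposed: each bucket counts every observation less than or equal to its bound, so the +Inf bucket always equals the total count.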
Data Model Example: inference_request_duration_seconds{model="llama-3-70b", status="success", quantization="awq"} = 2.34
Labels enable slicing: query by model, by status, by quantization type independently.
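As a rough illustration of label-based slicing, the sketch below stores samples as (name, labels, value) tuples and filters them by exact label match. The select helper and every sample except the first are hypothetical; real PromQL selectors also support regex and negative matchers.

```python
# The first sample is the one from the text; the others are made up for illustration.
SAMPLES = [
    ("inference_request_duration_seconds",
     {"model": "llama-3-70b", "status": "success", "quantization": "awq"}, 2.34),
    ("inference_request_duration_seconds",
     {"model": "llama-3-70b", "status": "error", "quantization": "awq"}, 0.12),
    ("inference_request_duration_seconds",
     {"model": "llama-3-8b", "status": "success", "quantization": "none"}, 0.41),
]


def select(samples, name, **matchers):
    """Return (labels, value) pairs whose name matches and whose labels equal every matcher."""
    return [
        (labels, value)
        for metric, labels, value in samples
        if metric == name
        and all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]


by_model = select(SAMPLES, "inference_request_duration_seconds", model="llama-3-70b")
by_status = select(SAMPLES, "inference_request_duration_seconds", status="success")
```

Each label combination is its own time series, so the same metric name can be sliced along any dimension without defining separate metrics.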
PromQL — The Query Language
- rate(inference_requests_total[5m]) → requests per second, averaged over the last 5 minutes
- histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) → p99 latency
- sum by (model) (gpu_memory_used_bytes) → memory usage grouped by model name
- increase(token_generation_total[1h]) → total tokens generated in the last hour
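To make the arithmetic behind rate() and histogram_quantile() more concrete, here is a simplified sketch. It is not Prometheus's implementation: simple_rate skips the extrapolation to window boundaries that the real function performs, and histogram_quantile_sketch assumes 0 < q <= 1 and at least one finite bucket. Both function names and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) pairs, oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (e.g. a restart), so count from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def histogram_quantile_sketch(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # The quantile lies beyond the last finite bound; report that bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


counter_samples = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
latency_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (float("inf"), 100)]
```

Because the quantile is interpolated within a bucket, its accuracy depends on how well the bucket bounds bracket the latencies of interest.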
Key Exporters for AI
| Exporter | What It Monitors |
|---|---|
| DCGM Exporter | NVIDIA GPU metrics (temp, memory, utilization) |
| node_exporter | Host CPU, memory, disk, network |
| kube-state-metrics | Kubernetes pod/deployment health |
| vLLM built-in | LLM inference queue, TTFT, throughput |
| postgres_exporter | PostgreSQL performance (including pgvector-backed vector stores) |
| redis_exporter | Caching layer hit rate and latency |
Prometheus Architecture
Prometheus Server pulls metrics at a configurable scrape interval (15s is a common setting; the built-in default is 1m) from:
- Application /metrics endpoints (instrumented with client libraries).
- Exporters (translating non-Prometheus systems like MySQL, NVIDIA GPUs).
- Pushgateway (for short-lived batch jobs that cannot be scraped).
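An application /metrics endpoint is just an HTTP handler that returns plain text in the Prometheus exposition format. The sketch below uses only the Python standard library rather than an official client library; the metric names, values, and port are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values; a real service would update these as it works.
METRICS = {
    ("inference_requests_total", "counter", "Total inference requests."): 1027,
    ("inference_queue_depth", "gauge", "Requests waiting in the queue."): 4,
}


def render_metrics(metrics):
    """Render metrics as HELP, TYPE, and sample lines in the text exposition format."""
    lines = []
    for (name, metric_type, help_text), value in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Prometheus would then be configured to scrape that host and port. In practice an official client library handles rendering, label escaping, and thread safety.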
Storage: Local TSDB (time-series database) with efficient compressed blocks and 15 days default retention.
Remote Write: Stream metrics to long-term storage (Thanos, Cortex, Grafana Mimir) for years-long retention.
Setting Up GPU Monitoring
Deploy DCGM Exporter as DaemonSet on all GPU nodes. Prometheus scrapes it. Key metrics:
- DCGM_FI_DEV_GPU_UTIL → GPU compute utilization %
- DCGM_FI_DEV_MEM_COPY_UTIL → Memory bandwidth utilization %
- DCGM_FI_DEV_FB_USED → Framebuffer memory used (VRAM), reported in MiB
- DCGM_FI_DEV_GPU_TEMP → Temperature (alert > 80°C)
- DCGM_FI_DEV_POWER_USAGE → Power draw (alert near TDP)
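The alert thresholds above amount to simple comparisons against the latest sample for each GPU. In production these would normally be written as Prometheus alerting rules; the Python sketch below only illustrates the logic. The alert names, the 95% power fraction, and the sample values are assumptions, while the 80°C limit comes from the list above.

```python
def check_gpu(sample, tdp_watts, temp_limit_c=80.0, power_fraction=0.95):
    """Return the names of the alerts that a single GPU sample would trigger.

    sample maps DCGM metric names to their latest values.
    power_fraction is a hypothetical definition of "near TDP".
    """
    alerts = []
    if sample["DCGM_FI_DEV_GPU_TEMP"] > temp_limit_c:
        alerts.append("GPUHighTemperature")
    if sample["DCGM_FI_DEV_POWER_USAGE"] >= power_fraction * tdp_watts:
        alerts.append("GPUPowerNearTDP")
    return alerts


# Made-up readings for one GPU with an assumed 700 W TDP.
hot_gpu = {"DCGM_FI_DEV_GPU_TEMP": 85, "DCGM_FI_DEV_POWER_USAGE": 300}
busy_gpu = {"DCGM_FI_DEV_GPU_TEMP": 72, "DCGM_FI_DEV_POWER_USAGE": 690}
hot_alerts = check_gpu(hot_gpu, tdp_watts=700)
busy_alerts = check_gpu(busy_gpu, tdp_watts=700)
```

A real alerting rule would usually also require the condition to hold for some duration, so that a brief spike does not page anyone.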
Prometheus is the metrics backbone of much modern AI infrastructure. Its simple pull-based model, expressive query language, and large exporter ecosystem make it a common choice for monitoring everything from GPU temperatures during training runs to token throughput in production inference serving.