Monitoring and observability for production AI systems

Why traditional monitoring is not enough for AI systems

Standard infrastructure monitoring tells you that your services are up, your instances are healthy and your APIs are responding. For AI systems, that is not enough. A production AI system can be fully operational by every infrastructure metric while silently returning hallucinated answers, outdated retrieved context, or scores generated from a superseded prompt version.

AI observability requires additional monitoring specific to model behaviour, retrieval quality and pipeline integrity. Without it, you discover failures when users report them, not when your monitoring alerts you.

The four layers of AI observability

Layer 1: Infrastructure metrics

Start with the basics: uptime, latency, error rate, throughput. These are necessary but not sufficient. For AI workloads, add:

Model API response time (P50, P95, P99) tracked separately from application response time
API error rate by error type (rate limit, timeout, content filter, context length exceeded)
Queue depth for async processing pipelines — a growing queue signals processing is falling behind
GPU memory utilisation and inference throughput for self-hosted models

Alert thresholds for AI APIs should be tighter than for traditional APIs. A model API responding in 8 seconds instead of 2 seconds is a user experience problem even if it is technically "available".
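
As a rough illustration of the Layer 1 additions, the sketch below computes latency percentiles and an error rate broken down by error type from a batch of model API call records. The record shape (`latency_s`, `error`) and the function names are assumptions made for this example, not part of any particular monitoring product, and the nearest-rank percentile is only one of several reasonable definitions.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarise_model_calls(calls):
    """calls: list of dicts with 'latency_s' and 'error' (None or an error-type string).

    Latency percentiles cover successful calls only; error rates use all calls.
    """
    latencies = [c["latency_s"] for c in calls if c["error"] is None]
    errors = Counter(c["error"] for c in calls if c["error"] is not None)
    total = len(calls)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "error_rate_by_type": {kind: n / total for kind, n in errors.items()},
    }
```

Keeping this summary separate from application-level latency makes it easier to tell whether a slow response came from the model provider or from your own code.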

Layer 2: Model quality metrics

This is the layer that most teams build late, when they should build it first. Model quality metrics measure whether the AI is doing what it is supposed to do, not just whether it is responding:

Output structure validation: If the model is supposed to return structured JSON, alert when the output cannot be parsed. This catches prompt drift, model version changes and context length truncation.
Confidence signals: For scoring or classification use cases, track the distribution of confidence scores over time. A shift in the distribution indicates something has changed — in the data, the prompt or the model.
Refusal and filter rate: Track how often the model refuses to answer or is blocked by content filters. A spike in refusals often indicates a prompt has changed in a way the model interprets as unsafe.
Human feedback rate: If your application supports thumbs up/down or similar feedback, track it. A declining positive feedback rate is one of the most direct signals of model quality degradation available to you.
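
Two of the checks above can be sketched in a few lines. The example assumes a hypothetical output schema with `label` and `confidence` keys and confidence scores in the range 0 to 1. It uses the population stability index (PSI) as one possible way to quantify a shift in the confidence distribution; other drift measures would work equally well, and any alert threshold on the result needs calibrating against your own baseline.

```python
import json
import math

REQUIRED_KEYS = {"label", "confidence"}  # hypothetical schema for this example


def is_valid_output(raw):
    """True when raw parses as a JSON object containing the required keys."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()


def parse_failure_rate(raw_outputs):
    """Fraction of model outputs that fail structure validation."""
    if not raw_outputs:
        return 0.0
    failures = sum(1 for raw in raw_outputs if not is_valid_output(raw))
    return failures / len(raw_outputs)


def _bin_fractions(scores, bins):
    """Share of scores per equal-width bin over [0, 1], floored to avoid log(0)."""
    if not scores:
        raise ValueError("no scores")
    counts = [0] * bins
    for score in scores:
        index = min(max(int(score * bins), 0), bins - 1)
        counts[index] += 1
    return [max(count / len(scores), 1e-6) for count in counts]


def population_stability_index(baseline, current, bins=10):
    """PSI between a baseline and a current sample of confidence scores."""
    expected = _bin_fractions(baseline, bins)
    actual = _bin_fractions(current, bins)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

A PSI of zero means the two samples fall into the bins in identical proportions; larger values indicate a bigger shift.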

Layer 3: Retrieval quality metrics

For RAG systems, retrieval is a separate observability domain. The retrieval layer can degrade independently of the model layer, and the failure modes are different:

Retrieval hit rate: What percentage of queries return at least one document above the relevance threshold? A declining hit rate usually means the index is stale or the query patterns have shifted.
Top-k similarity distribution: Track the similarity scores of retrieved chunks over time. A sustained downward drift in scores suggests retrieval quality is degrading.
Index freshness by source: For each ingestion source, track the timestamp of the most recently indexed document. Alert when a source has not been updated in longer than its expected refresh interval.
Context utilisation: Periodically evaluate whether the LLM is actually using retrieved context or generating from parametric knowledge. This can be tested with a set of held-out questions whose correct answers appear only in the indexed documents.

A RAG system whose vector index has not been refreshed in three weeks is answering questions from stale data. Your observability should tell you this before a user does.
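
The hit rate and freshness checks can be sketched as follows. The inputs are assumptions for illustration: `top_scores` holds the best similarity score per query (or `None` when nothing was retrieved), and `last_indexed` and `expected_interval` are dictionaries keyed by a hypothetical source name. The `tolerance` multiplier defaults to 2.0 here, which is a choice made for this example.

```python
from datetime import datetime, timedelta, timezone


def retrieval_hit_rate(top_scores, threshold):
    """Fraction of queries whose best retrieved chunk meets the relevance threshold."""
    if not top_scores:
        return 0.0
    hits = sum(1 for score in top_scores if score is not None and score >= threshold)
    return hits / len(top_scores)


def stale_sources(last_indexed, expected_interval, now=None, tolerance=2.0):
    """Names of sources whose newest indexed document is older than
    tolerance x the expected refresh interval."""
    now = now or datetime.now(timezone.utc)
    stale = [
        source
        for source, newest in last_indexed.items()
        if now - newest > expected_interval[source] * tolerance
    ]
    return sorted(stale)
```

For example, a source expected to refresh daily but last indexed three weeks ago would appear in the result of `stale_sources`, while one indexed twelve hours ago would not.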

Layer 4: Pipeline and agent tracing

Multi-step AI pipelines and autonomous agents fail in ways that are difficult to diagnose without distributed tracing. A user reports a wrong answer — but which step in the five-step pipeline produced it? Which tool call did the agent make? What was the retrieved context at the point of failure?

Every AI pipeline in production should emit a trace that captures:

A unique trace ID propagated through every step of the pipeline
The input and output at each step (or a hash of the content if full logging is too expensive)
Token counts and latency per step
Tool calls and their results for agent workflows
The final output and any structured metadata about how it was produced
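
A minimal sketch of such a trace, using only the Python standard library, might look like the following. `PipelineTrace` and `content_hash` are hypothetical names, the records are kept in memory rather than exported, and the retriever and model calls are replaced by stand-in values. A production system would more likely rely on one of the tracing tools or an established tracing standard.

```python
import hashlib
import json
import time
import uuid
from contextlib import contextmanager


def content_hash(content):
    """SHA-256 of a JSON-serialisable value, for when full logging is too expensive."""
    payload = json.dumps(content, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PipelineTrace:
    """Hypothetical in-memory trace; a real system would export these records."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # copied into every step record
        self.steps = []

    @contextmanager
    def step(self, name, step_input):
        record = {
            "trace_id": self.trace_id,
            "step": name,
            "input_hash": content_hash(step_input),
        }
        start = time.perf_counter()
        try:
            yield record  # the caller attaches output, token counts, tool calls
        finally:
            record["latency_s"] = time.perf_counter() - start
            self.steps.append(record)


trace = PipelineTrace()
with trace.step("retrieve", {"query": "refund policy"}) as rec:
    chunks = ["chunk-17", "chunk-42"]  # stand-in for a real retriever call
    rec["output_hash"] = content_hash(chunks)
with trace.step("generate", {"chunks": chunks}) as rec:
    rec["output_hash"] = content_hash("stub answer")  # stand-in for a model call
    rec["tokens"] = {"prompt": 0, "completion": 0}  # placeholder counts
```

Because the step record is appended in a `finally` clause, a step that raises an exception still leaves a record with its latency, which is usually the step you most want to inspect.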

LangSmith, Helicone, Braintrust and Langfuse all provide purpose-built tracing for LLM applications. Most of them can also run evaluation suites against stored traces, enabling regression detection when you change a prompt or model.

Alerting strategy

Alert on these conditions at minimum:

Model API error rate above 2% for any 5-minute window
Output parse failure rate above 0.5% for structured-output workflows
P95 end-to-end latency exceeding the defined SLA for more than 3 consecutive minutes
Queue depth growing for more than 10 minutes without shrinking
Any ingestion source not refreshed within 2x its expected refresh interval
Daily token spend more than 20% above the 7-day moving average
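
Several of these conditions reduce to small predicates over recent samples. The sketch below shows three of them; the function names and input shapes are assumptions for illustration, and "growing without shrinking" is interpreted here as a window in which no sample is lower than the previous one and the last sample is higher than the first.

```python
def rate_exceeds(failures, total, limit):
    """True when failures / total is above limit (for example 0.02 for 2%)."""
    return total > 0 and failures / total > limit


def queue_growing(depth_samples):
    """True when the queue never shrank across the window and ended deeper than it started.

    depth_samples: queue depths in time order covering the alert window.
    """
    if len(depth_samples) < 2:
        return False
    never_shrank = all(b >= a for a, b in zip(depth_samples, depth_samples[1:]))
    return never_shrank and depth_samples[-1] > depth_samples[0]


def token_spend_alert(previous_days, today, margin=0.20, window=7):
    """True when today's spend is more than margin above the moving average.

    previous_days: daily token spend, oldest first, excluding today.
    """
    recent = previous_days[-window:]
    if len(recent) < window:
        return False  # not enough history to form the average
    average = sum(recent) / window
    return today > average * (1 + margin)
```

In practice these predicates would be evaluated by whatever alerting system you already run; the point is that each condition in the list has a precise, testable definition.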

Do not create alerts for everything — alert fatigue leads to ignored alerts. Tier your alerts: immediate page for anything that blocks users, daily digest for degraded-but-functional conditions, weekly review for trend-based signals.
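
The tiering rule can be written down explicitly so that every new alert is assigned a tier when it is created. This is a sketch under the assumption that each alert is tagged with two hypothetical flags, `blocks_users` and `trend_based`; real alert metadata will differ.

```python
from enum import Enum


class Tier(Enum):
    PAGE = "immediate page"
    DAILY_DIGEST = "daily digest"
    WEEKLY_REVIEW = "weekly review"


def route_alert(blocks_users, trend_based):
    """Map an alert to a tier: user-blocking first, then trends, then everything else."""
    if blocks_users:
        return Tier.PAGE
    if trend_based:
        return Tier.WEEKLY_REVIEW
    return Tier.DAILY_DIGEST
```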

The operational mindset shift

Traditional software operations asks: "Is the system up?" AI operations asks: "Is the system working correctly?" The gap between those two questions is where observability investment needs to go.

The teams that operate AI systems well have shifted from reactive incident response to continuous quality monitoring. They know their model quality baseline before users experience a degradation, and they have the traces to diagnose it before the post-mortem.