LLM Observability in Production: Tracing, Monitoring, and Debugging AI Applications
Tags: LLM, Observability, AI, Monitoring, OpenTelemetry, MLOps
Deploying an LLM is easy. Understanding what it’s doing in production is hard. Unlike traditional APIs where you can trace a database query or cache miss, LLM failures are subtle: hallucinations, token budget overruns, prompt injection attacks, latency spikes, and quality regressions that users notice before your metrics do.
This guide covers the full observability stack for production LLM applications in 2026.
The LLM Observability Problem
Traditional observability (logs + metrics + traces) covers infrastructure well. But LLM applications add new dimensions:
| Traditional | LLM-Specific |
|---|---|
| Response time | Time-to-first-token (TTFT) |
| Error rate | Hallucination rate |
| Throughput | Token consumption |
| Cache hit rate | Context window utilization |
| CPU/Memory | Cost per request |
| N/A | Prompt quality score |
| N/A | Output safety violations |
You need both layers working together.
The Four Pillars of LLM Observability
1. Traces — The Full Request Journey
Every LLM request spans multiple steps: retrieval, prompt construction, model call, response parsing, tool use. Distributed tracing connects them all.
OpenLLMetry (the de facto standard in 2026) instruments LLM calls automatically:
pip install traceloop-sdk
pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from traceloop.sdk import Traceloop
# Initialize with automatic instrumentation
Traceloop.init(
app_name="my-ai-app",
api_endpoint="https://otel-collector:4317",
headers={"Authorization": "Bearer YOUR_TOKEN"},
)
# Now all LLM calls are automatically traced
from anthropic import Anthropic
client = Anthropic()
@trace.get_tracer(__name__).start_as_current_span("chat_handler")
def handle_chat(user_message: str, user_id: str) -> str:
span = trace.get_current_span()
span.set_attribute("user.id", user_id)
span.set_attribute("user.message.length", len(user_message))
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": user_message}]
)
# Automatically captured by OpenLLMetry:
# - model name, prompt tokens, completion tokens
# - latency, TTFT
# - full prompt + response (if enabled)
return response.content[0].text
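Automatic instrumentation only covers the model call itself. For the surrounding steps (retrieval, prompt construction, response parsing) you can open manual child spans so the whole journey shows up as one trace. A minimal sketch, assuming a hypothetical `retrieve_documents` helper and the same `client` as above:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def answer_with_context(user_message: str) -> str:
    # Parent span for the whole request; child spans mark each pipeline step
    with tracer.start_as_current_span("rag_pipeline"):
        with tracer.start_as_current_span("retrieval") as span:
            docs = retrieve_documents(user_message)  # hypothetical retriever
            span.set_attribute("retrieval.document_count", len(docs))

        with tracer.start_as_current_span("prompt_construction") as span:
            context = "\n\n".join(docs)
            prompt = f"Context:\n{context}\n\nQuestion: {user_message}"
            span.set_attribute("prompt.length", len(prompt))

        # The model call below gets its own span from OpenLLMetry automatically
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )

        with tracer.start_as_current_span("response_parsing"):
            return response.content[0].text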
2. Metrics — Aggregate Health Signals
Track these core LLM metrics:
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
import time
meter = metrics.get_meter("llm-app")
# Core counters
request_counter = meter.create_counter(
"llm.requests.total",
description="Total LLM requests"
)
token_counter = meter.create_counter(
"llm.tokens.total",
description="Total tokens consumed"
)
# Histograms for latency
latency_histogram = meter.create_histogram(
"llm.request.duration",
description="LLM request duration in seconds",
unit="s"
)
ttft_histogram = meter.create_histogram(
"llm.time_to_first_token",
description="Time to first token",
unit="s"
)
# Cost tracking
cost_counter = meter.create_counter(
"llm.cost.usd",
description="Estimated cost in USD"
)
def tracked_llm_call(messages: list, model: str = "claude-sonnet-4-5"):
    start = time.time()
    try:
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=messages
        )
    except Exception:
        # The "status" label feeds the error-rate panels and alerts shown later
        request_counter.add(1, {"model": model, "env": "production", "status": "error"})
        raise
    request_counter.add(1, {"model": model, "env": "production", "status": "success"})
duration = time.time() - start
# Track metrics
latency_histogram.record(duration, {"model": model})
token_counter.add(
response.usage.input_tokens + response.usage.output_tokens,
{"model": model, "type": "total"}
)
# Estimate cost (claude-sonnet-4-5 pricing)
cost = (response.usage.input_tokens * 0.000003 +
response.usage.output_tokens * 0.000015)
cost_counter.add(cost, {"model": model})
return response
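The `ttft_histogram` above is only meaningful if you stream responses, since TTFT is the gap between sending the request and receiving the first token. A sketch of how you might record it with the Anthropic streaming helper, reusing `client` and the histograms defined above:
def tracked_streaming_call(messages: list, model: str = "claude-sonnet-4-5") -> str:
    start = time.time()
    first_token_at = None
    chunks = []

    # The streaming helper yields text deltas as they arrive
    with client.messages.stream(
        model=model,
        max_tokens=2048,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            if first_token_at is None:
                first_token_at = time.time()
                ttft_histogram.record(first_token_at - start, {"model": model})
            chunks.append(text)

    latency_histogram.record(time.time() - start, {"model": model})
    return "".join(chunks)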
3. Logs — Structured LLM Events
Structured logging with full context for debugging:
import structlog
import json
logger = structlog.get_logger()
def log_llm_interaction(
request_id: str,
model: str,
prompt_tokens: int,
completion_tokens: int,
latency_ms: float,
finish_reason: str,
user_id: str = None,
session_id: str = None,
# Enable prompt/response logging only in debug mode
log_content: bool = False,
prompt: str = None,
response: str = None,
):
log_data = {
"event": "llm_call",
"request_id": request_id,
"model": model,
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
"latency_ms": round(latency_ms, 2),
"finish_reason": finish_reason,
"estimated_cost_usd": round(
prompt_tokens * 0.000003 + completion_tokens * 0.000015, 6
),
}
if user_id:
log_data["user_id"] = user_id
if session_id:
log_data["session_id"] = session_id
# ⚠️ IMPORTANT: Be careful about PII in prompts
if log_content and prompt:
# Truncate and sanitize before logging
log_data["prompt_preview"] = prompt[:200]
log_data["response_preview"] = response[:200] if response else None
logger.info("llm_interaction", **log_data)
4. Evals — Quality Measurement
This is the LLM-specific layer most teams skip. Big mistake.
import json

from anthropic import Anthropic
from dataclasses import dataclass
from enum import Enum
class EvalResult(Enum):
PASS = "pass"
FAIL = "fail"
UNCLEAR = "unclear"
@dataclass
class QualityScore:
result: EvalResult
score: float # 0-1
reasoning: str
flags: list[str]
class LLMEvaluator:
"""Use a secondary LLM to evaluate primary LLM outputs."""
def __init__(self):
self.client = Anthropic()
def evaluate_response(
self,
user_query: str,
llm_response: str,
expected_properties: list[str] = None
) -> QualityScore:
eval_prompt = f"""You are a quality evaluator for an AI assistant.
Evaluate the following AI response for quality issues.
USER QUERY:
{user_query}
AI RESPONSE:
{llm_response}
Check for:
1. Factual accuracy and hallucinations
2. Relevance to the query
3. Completeness
4. Safety (harmful content, PII exposure)
5. Helpfulness
Return only a JSON object with these keys:
"result" ("pass", "fail", or "unclear"), "score" (a number from 0.0 to 1.0), "reasoning" (a short string), and "flags" (a list of issue strings)."""
eval_response = self.client.messages.create(
model="claude-haiku-3-5", # Use cheaper model for evals
max_tokens=512,
messages=[{"role": "user", "content": eval_prompt}]
)
        result = json.loads(eval_response.content[0].text)
        result["result"] = EvalResult(result["result"])  # map "pass"/"fail"/"unclear" onto the enum
        return QualityScore(**result)
def check_hallucination(
self,
response: str,
source_documents: list[str]
) -> float:
"""Returns hallucination score: 0 = grounded, 1 = hallucinated."""
sources_text = "\n\n---\n\n".join(source_documents[:5])
prompt = f"""Given these source documents and an AI response,
determine what fraction of the response's factual claims are NOT supported by the sources.
SOURCES:
{sources_text}
AI RESPONSE:
{response}
Return only a number between 0.0 (fully grounded) and 1.0 (fully hallucinated)."""
result = self.client.messages.create(
model="claude-haiku-3-5",
max_tokens=10,
messages=[{"role": "user", "content": prompt}]
)
try:
return float(result.content[0].text.strip())
except ValueError:
return 0.5 # Unknown
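For quality to be alertable (see the `LLMQualityDrop` rule later), eval scores also need to land in your metrics backend. A common pattern is to judge only a small sample of production traffic, off the hot path. A sketch that reuses the `meter` from the metrics section and the evaluator above; the `llm.eval.score` metric name is an assumption chosen to line up with that alert:
import random

eval_score_histogram = meter.create_histogram(
    "llm.eval.score",
    description="LLM-as-judge quality score (0-1)"
)

evaluator = LLMEvaluator()
EVAL_SAMPLE_RATE = 0.05  # judge roughly 5% of production traffic

def maybe_evaluate(user_query: str, llm_response: str, model: str) -> None:
    """Sampled, best-effort quality eval; must never break the user-facing path."""
    if random.random() > EVAL_SAMPLE_RATE:
        return
    try:
        score = evaluator.evaluate_response(user_query, llm_response)
        eval_score_histogram.record(
            score.score,
            {"model": model, "result": score.result.value}
        )
    except Exception:
        # A failed eval should be logged and skipped, never raised to the caller
        pass
In practice this runs on a background worker or queue. Note that a histogram exports `_sum` and `_count` series, so the quality alert would typically be written as their ratio rather than a bare gauge.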
The Full Observability Stack
Here’s the complete infrastructure setup using open-source tools:
# docker-compose.yml — LLM Observability Stack
version: '3.8'
services:
# OTel Collector — receives all telemetry
otel-collector:
image: otel/opentelemetry-collector-contrib:0.96.0
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8888:8888" # Prometheus metrics
volumes:
- ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
  # Jaeger — distributed tracing (receives OTLP directly from the collector)
  jaeger:
    image: jaegertracing/all-in-one:1.55
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686" # Jaeger UI
# Prometheus — metrics
prometheus:
image: prom/prometheus:v2.50.0
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
# Grafana — dashboards
grafana:
image: grafana/grafana:10.3.0
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
volumes:
grafana-data:
# otel-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
# Filter sensitive data from spans
attributes/redact:
actions:
- key: llm.prompts
action: delete # Don't store raw prompts in traces by default
exporters:
  # The legacy jaeger exporter was removed from recent collector-contrib releases,
  # so export OTLP straight to Jaeger's built-in OTLP receiver instead
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889" # 8888 is used by the collector's own telemetry
# Also send to Langfuse for LLM-specific analytics
otlphttp/langfuse:
endpoint: "https://cloud.langfuse.com/api/public/otel"
headers:
Authorization: "Basic ${LANGFUSE_API_KEY}"
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, attributes/redact]
      exporters: [otlp/jaeger, otlphttp/langfuse]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
Grafana Dashboard: Key Panels
LLM Overview Dashboard
{
"panels": [
{
"title": "Requests/minute",
"type": "stat",
"targets": [{
"expr": "sum(rate(llm_requests_total[1m]))"
}]
},
{
"title": "P95 Latency",
"type": "gauge",
"targets": [{
"expr": "histogram_quantile(0.95, sum(rate(llm_request_duration_bucket[5m])) by (le))"
}]
},
{
"title": "Token Cost (24h)",
"type": "stat",
"targets": [{
"expr": "sum(increase(llm_cost_usd_total[24h]))"
}]
},
{
"title": "Error Rate",
"type": "timeseries",
"targets": [{
"expr": "sum(rate(llm_requests_total{status='error'}[5m])) / sum(rate(llm_requests_total[5m]))"
}]
}
]
}
Alerting: What to Alert On
# prometheus-alerts.yaml
groups:
- name: llm-alerts
rules:
# High latency
- alert: LLMHighLatency
expr: histogram_quantile(0.95, sum(rate(llm_request_duration_bucket[5m])) by (le, model)) > 10
for: 2m
annotations:
summary: "LLM P95 latency above 10s for "
# Unexpected cost spike
- alert: LLMCostSpike
expr: sum(rate(llm_cost_usd_total[5m])) * 3600 > 50 # $50/hour
for: 5m
annotations:
summary: "LLM spending above $50/hour — possible runaway calls"
# High error rate
- alert: LLMHighErrorRate
expr: sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
for: 2m
annotations:
summary: "LLM error rate above 5%"
# Quality degradation (from eval pipeline)
- alert: LLMQualityDrop
expr: avg(llm_eval_score) < 0.7
for: 10m
annotations:
summary: "Average LLM response quality below 70%"
Prompt Debugging: Finding Bad Prompts
The hardest part of LLM debugging is identifying which prompts cause failures. Here’s a systematic approach:
import hashlib
import json
from collections import defaultdict
from datetime import datetime
from typing import Optional
class PromptAnalytics:
"""Track performance by prompt template."""
def __init__(self, db_client):
self.db = db_client
def get_prompt_hash(self, template: str) -> str:
"""Create stable hash for a prompt template (before variable substitution)."""
return hashlib.md5(template.encode()).hexdigest()[:8]
async def record_interaction(
self,
prompt_template: str,
variables: dict,
response: str,
latency_ms: float,
tokens: int,
eval_score: Optional[float] = None,
error: Optional[str] = None
):
template_hash = self.get_prompt_hash(prompt_template)
await self.db.insert("llm_interactions", {
"template_hash": template_hash,
"variables_json": json.dumps(variables),
"response_preview": response[:500],
"latency_ms": latency_ms,
"tokens": tokens,
"eval_score": eval_score,
"error": error,
"timestamp": datetime.utcnow()
})
async def get_worst_performing_templates(self, hours: int = 24):
"""Find prompt templates with lowest quality scores."""
return await self.db.query("""
SELECT
template_hash,
COUNT(*) as call_count,
AVG(eval_score) as avg_quality,
AVG(latency_ms) as avg_latency,
SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) as error_count
FROM llm_interactions
WHERE timestamp > NOW() - INTERVAL '%s hours'
GROUP BY template_hash
ORDER BY avg_quality ASC
LIMIT 10
""", hours)
LLM Observability Tools Comparison (2026)
| Tool | Traces | Metrics | Evals | Cost | Self-host |
|---|---|---|---|---|---|
| Langfuse | ✅ | ✅ | ✅ | Free tier | ✅ |
| Helicone | ✅ | ✅ | ✅ | $20/mo | ✅ |
| Phoenix (Arize) | ✅ | ✅ | ✅ | Free | ✅ |
| LangSmith | ✅ | ✅ | ✅ | $39/mo | ❌ |
| Braintrust | ✅ | ✅ | ✅ | Free tier | ❌ |
| Custom OTel stack | ✅ | ✅ | ⚠️ | Infrastructure only | ✅ |
Recommendation in 2026: Start with Langfuse (open-source, self-hostable, excellent UI) and add custom metrics to Prometheus for infrastructure-level alerting.
Quick Start: 5-Minute Setup
# 1. Start Langfuse locally
git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up -d
# 2. Install Python SDK
pip install langfuse opentelemetry-sdk traceloop-sdk
# 3. Instrument your app
cat > observability_setup.py << 'EOF'
from langfuse import Langfuse
from traceloop.sdk import Traceloop
# Initialize Langfuse
langfuse = Langfuse(
public_key="pk-lf-...",
secret_key="sk-lf-...",
host="http://localhost:3000"
)
# Auto-instrument all LLM calls
Traceloop.init(app_name="my-app")
print("✅ LLM observability initialized")
EOF
python observability_setup.py
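Then make one instrumented call so there is something to look at. A minimal sketch, assuming `ANTHROPIC_API_KEY` is set in your environment and the same auto-instrumentation as the setup script:
# first_trace.py
from anthropic import Anthropic
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my-app")  # same initialization as observability_setup.py

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Give me one LLM observability tip."}],
)
print(response.content[0].text)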
Open http://localhost:3000 — you’ll see traces from your first LLM call.
Conclusion
LLM observability is not optional in production. The teams winning in 2026 are those who can answer:
- Which prompts are causing quality issues?
- Which users are experiencing degraded responses?
- How much is each feature actually costing us in tokens?
- Are we approaching context window limits?
The good news: the tooling is mature. OpenLLMetry + Langfuse + Prometheus/Grafana gives you a complete stack for free. The investment in instrumentation pays off the first time you need to debug a hallucination at 2 AM.
Measure everything. Trust the data, not your gut.