Observability in the Age of AI: How OpenTelemetry is Evolving for LLM Applications
Traditional observability was built on a simple contract: deterministic systems produce predictable outputs from defined inputs, and deviations from expectations indicate bugs. Your p99 latency spiked? Find the slow query. Your error rate went up? Find the exception.
AI-powered applications break this contract. The same prompt, called twice, produces different outputs. “Correctness” is fuzzy. A response can be technically successful (200 OK, valid JSON, no exception) and still be wrong — the model hallucinated, gave outdated information, or misunderstood the intent.
Observability for AI requires new primitives. This post covers what’s changed and how to instrument your LLM applications effectively.
OpenTelemetry Semantic Conventions for LLMs
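The OpenTelemetry project has shipped semantic conventions for GenAI (LLM) instrumentation in the gen_ai semantic convention group, as of OTel semantic conventions 1.26+. They are not yet marked stable, which is why the Python constants used below live in an _incubating module, so expect some attribute names to change between releases. If you’re adding tracing to AI calls, use these conventions rather than inventing your own: that is what lets ecosystem tooling understand your telemetry.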
Key Span Attributes
from opentelemetry import trace
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes

tracer = trace.get_tracer("my-ai-service")

with tracer.start_as_current_span("llm.chat") as span:
    # Standard GenAI request attributes
    span.set_attribute(gen_ai_attributes.GEN_AI_SYSTEM, "openai")
    span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_MODEL, "gpt-5")
    span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_MAX_TOKENS, 2000)
    span.set_attribute(gen_ai_attributes.GEN_AI_REQUEST_TEMPERATURE, 0.7)

    # call_llm is your own wrapper around the provider SDK
    response = call_llm(prompt)

    # Response attributes, read from the actual response
    span.set_attribute(gen_ai_attributes.GEN_AI_RESPONSE_MODEL, response.model)
    span.set_attribute(gen_ai_attributes.GEN_AI_USAGE_INPUT_TOKENS, response.usage.prompt_tokens)
    span.set_attribute(gen_ai_attributes.GEN_AI_USAGE_OUTPUT_TOKENS, response.usage.completion_tokens)
    span.set_attribute(
        gen_ai_attributes.GEN_AI_RESPONSE_FINISH_REASONS,
        [choice.finish_reason for choice in response.choices],
    )
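Setting attributes one call at a time gets repetitive once you have more than one call site. Here is a minimal sketch of a helper that collects the same attribute keys into a dict you could hand to span.set_attributes(...). The FakeUsage and FakeResponse classes, the genai_attributes function, and the model names are hypothetical stand-ins for illustration, not part of any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class FakeUsage:
    """Hypothetical stand-in for an SDK usage object."""
    prompt_tokens: int
    completion_tokens: int


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an OpenAI-style chat response."""
    model: str
    usage: FakeUsage
    finish_reasons: list = field(default_factory=lambda: ["stop"])


def genai_attributes(system: str, request_model: str, response) -> dict:
    """Build GenAI span attributes from a request/response pair."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response.model,
        "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
        "gen_ai.usage.output_tokens": response.usage.completion_tokens,
        "gen_ai.response.finish_reasons": list(response.finish_reasons),
    }


# Illustrative values only
attrs = genai_attributes(
    "openai",
    "gpt-5",
    FakeResponse(model="gpt-5-snapshot", usage=FakeUsage(120, 45)),
)
```

Keeping the mapping in one pure function also makes it easy to unit test without a tracer configured.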
Auto-Instrumentation
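For Python with the OpenAI SDK, the opentelemetry-instrumentation-openai package handles most of this automatically. It comes from the community OpenLLMetry project; the OpenTelemetry contrib repository publishes a separate opentelemetry-instrumentation-openai-v2 package with different options.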
pip install opentelemetry-instrumentation-openai
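from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# The enrich_* flags are constructor options in the OpenLLMetry package;
# verify the names against the version you install. Capturing
# prompt/completion content is configured separately and can expose
# sensitive data.
OpenAIInstrumentor(
    enrich_assistant=True,
    enrich_token_usage=True,
).instrument()

# All OpenAI SDK calls are now automatically traced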
Auto-instrumentation hooks cover OpenAI, Anthropic, Google Vertex, Cohere, and Bedrock via separate packages.
What to Instrument Beyond Basic Tracing
1. Token Usage as a First-Class Metric
Token costs are your primary AI cost driver. Treat them like memory or CPU — track them as metrics, not just trace attributes.
from openai import OpenAI
from opentelemetry import metrics

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
meter = metrics.get_meter("ai-service")

# The GenAI conventions define gen_ai.client.token.usage as a histogram
# with unit "{token}": record one measurement per call and token type.
token_histogram = meter.create_histogram(
    name="gen_ai.client.token.usage",
    description="Token usage for GenAI API calls",
    unit="{token}",
)

# Custom metric: cost is not part of the GenAI conventions.
cost_histogram = meter.create_histogram(
    name="gen_ai.client.operation.cost",
    description="Estimated cost per GenAI API call",
    unit="usd",
)


def tracked_llm_call(model: str, messages: list, **kwargs):
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs,
    )
    usage = response.usage
    labels = {"gen_ai.system": "openai", "gen_ai.request.model": model}
    token_histogram.record(usage.prompt_tokens, {**labels, "gen_ai.token.type": "input"})
    token_histogram.record(usage.completion_tokens, {**labels, "gen_ai.token.type": "output"})

    # Model-specific pricing; calculate_cost is an application-defined helper
    estimated_cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
    cost_histogram.record(estimated_cost, labels)
    return response
This gives you dashboards like:
- Token usage by model/endpoint/user over time
- Cost per feature (by adding feature labels)
- Token budget alerts before month-end surprise bills
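The calculate_cost helper referenced in the snippet above is not defined in this post. One possible sketch, assuming per-million-token pricing kept in a lookup table, follows. The model names and prices are placeholders and not real pricing, so swap in your provider's current price sheet.

```python
# Hypothetical prices in USD per 1M tokens. These numbers are placeholders,
# not real pricing for any provider.
PRICES_PER_MILLION = {
    "example-large": {"input": 10.00, "output": 30.00},
    "example-small": {"input": 0.50, "output": 1.50},
}


def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    try:
        price = PRICES_PER_MILLION[model]
    except KeyError:
        # Fail loudly so an unpriced model never shows up as "free"
        raise ValueError(f"no pricing configured for model {model!r}") from None
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000


cost = calculate_cost("example-large", prompt_tokens=2_000, completion_tokens=500)
```

Raising on unknown models is a deliberate choice here: silently returning zero would hide spend on a newly adopted model from the cost dashboards.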
2. Latency Distribution by Reasoning Effort
With models that expose reasoning controls (GPT-5, Claude 3.7+), latency distributions shift dramatically based on reasoning_effort or thinking parameters. Track them separately.
latency_histogram = meter.create_histogram(
    name="gen_ai.client.operation.duration",
    description="End-to-end latency for GenAI API calls",
    # Note: the GenAI conventions specify seconds ("s") for this metric;
    # whichever unit you record, keep alert thresholds in the same unit.
    unit="ms",
)

# Include reasoning_effort as a label
latency_histogram.record(
    elapsed_ms,
    {
        "gen_ai.request.model": model,
        "reasoning_effort": reasoning_effort,  # "low", "medium", "high"
        "endpoint": endpoint_name,
    },
)
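You’ll often find that high reasoning effort is 3-5x slower at p50 and also has fat tails: the p99 can be 10x the median. This matters for timeout configuration, because a timeout tuned to the median will cut off a meaningful share of legitimate high-effort requests.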
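Because the tails differ so much by effort level, a single global timeout tends to be either too tight or too loose. Here is a minimal sketch of deriving a per-effort timeout from observed latencies, assuming you can export raw latency samples grouped by the reasoning_effort label. The helper names, the sample values, and the 1.5x headroom factor are illustrative assumptions.

```python
import math


def percentile(samples: list, q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 1."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(math.ceil(q * len(ordered)), 1)  # 1-based rank
    return ordered[rank - 1]


def suggest_timeouts_ms(latencies_by_effort: dict, q: float = 0.99,
                        headroom: float = 1.5) -> dict:
    """Per-effort timeout: the q-th percentile latency times a headroom factor."""
    return {
        effort: percentile(samples, q) * headroom
        for effort, samples in latencies_by_effort.items()
    }


# Illustrative latency samples in milliseconds
observed = {
    "low": [400, 450, 500, 520, 900],
    "high": [1500, 1800, 2100, 2400, 15000],
}
timeouts = suggest_timeouts_ms(observed)
```

With real traffic you would compute this over a much larger window; five samples per bucket is only enough to show the shape of the calculation.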
3. Quality Signals as Metrics
This is the new frontier — instrumenting correctness, not just performance.
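from opentelemetry.metrics import CallbackOptions, Observation

# Custom metric names: quality scores are not part of the GenAI conventions.
# endpoint -> latest score, updated by your evaluation job
latest_quality: dict[str, float] = {}


def observe_quality(options: CallbackOptions):
    # Observable gauges report through a callback at collection time
    for endpoint, score in latest_quality.items():
        yield Observation(score, {"endpoint": endpoint})


quality_gauge = meter.create_observable_gauge(
    name="gen_ai.response.quality",
    callbacks=[observe_quality],
    description="AI response quality scores from evaluation",
    unit="score",
)
faithfulness_histogram = meter.create_histogram(
    name="gen_ai.response.faithfulness",
    description="Faithfulness score for RAG responses (0-1)",
    unit="score",
)

# After LLM-as-judge evaluation of a response
faithfulness_histogram.record(
    faithfulness_score,
    {
        "endpoint": "product-qa",
        "retrieval_strategy": "hybrid",
        "model": "gpt-5",
    },
)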
Tracking quality metrics over time lets you detect when a model change, prompt update, or retrieval change degraded response quality — even if latency and error rates look fine.
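Detecting that kind of degradation can be as simple as comparing a recent window of scores against a baseline window. The sketch below assumes you can pull faithfulness scores for both windows from your metrics backend. The quality_dropped function, the 0.05 threshold, and the sample scores are illustrative assumptions.

```python
from statistics import mean


def quality_dropped(baseline_scores: list, recent_scores: list,
                    max_drop: float = 0.05) -> bool:
    """True when the recent mean falls more than max_drop below the baseline mean."""
    if not baseline_scores or not recent_scores:
        raise ValueError("both windows need at least one score")
    return mean(baseline_scores) - mean(recent_scores) > max_drop


# Illustrative faithfulness scores before and after a prompt change
before_prompt_change = [0.91, 0.88, 0.93, 0.90]
after_prompt_change = [0.79, 0.83, 0.80, 0.82]
degraded = quality_dropped(before_prompt_change, after_prompt_change)
```

A mean comparison is the bluntest possible check. With noisy judge scores you would likely want larger windows or a statistical test before paging anyone.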
Distributed Tracing for Agent Pipelines
Multi-step agent systems are where tracing becomes critical. Without it, debugging a failed 8-step agent run is guesswork.
from typing import Optional

from opentelemetry import trace

tracer = trace.get_tracer("agent-pipeline")
MAX_STEPS = 8  # illustrative upper bound on agent iterations

# llm_decide_action, execute_tool, and available_tools are application-defined.


async def run_agent_pipeline(user_query: str) -> Optional[str]:
    with tracer.start_as_current_span(
        "agent.pipeline",
        attributes={
            "agent.type": "react",
            "user.query": user_query[:200],  # Truncated; may still contain PII
        },
    ) as pipeline_span:
        result = None
        history = []  # prior actions and tool results fed back to the LLM
        step = 0
        while step < MAX_STEPS:
            step += 1
            with tracer.start_as_current_span(
                "agent.step",
                attributes={"agent.step.number": step},
            ):
                # LLM call to decide next action
                with tracer.start_as_current_span("agent.think") as think_span:
                    action = await llm_decide_action(
                        query=user_query,
                        history=history,
                        tools=available_tools,
                    )
                    think_span.set_attribute("agent.action.type", action.type)

                if action.type == "final_answer":
                    result = action.answer
                    break

                # Tool execution
                with tracer.start_as_current_span(
                    "agent.tool_call",
                    attributes={
                        "agent.tool.name": action.tool,
                        "agent.tool.input": str(action.input)[:500],
                    },
                ) as tool_span:
                    tool_result = await execute_tool(action)
                    tool_span.set_attribute("agent.tool.success", tool_result.success)
                history.append((action, tool_result))

        # Recorded on every run, including ones that hit MAX_STEPS
        pipeline_span.set_attribute("agent.steps.total", step)
        return result
This trace structure gives you a waterfall view of every step — which tools were called, how long each LLM call took, where the agent got stuck or looped.
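Spotting a looping agent does not have to be a manual read of the waterfall. Here is a minimal sketch, assuming you have exported the tool-call steps of one run as simple records (for example from the agent.tool_call span attributes). The record shape, the find_repeated_tool_calls function, and the threshold of three are illustrative assumptions.

```python
from collections import Counter


def find_repeated_tool_calls(steps: list, threshold: int = 3) -> list:
    """Return (tool, input) pairs that occur at least `threshold` times in a run."""
    counts = Counter((s["tool"], s["input"]) for s in steps)
    return [call for call, n in counts.items() if n >= threshold]


# Hypothetical records for a single agent run
steps = [
    {"tool": "search", "input": "refund policy"},
    {"tool": "search", "input": "refund policy"},
    {"tool": "fetch_page", "input": "/policies"},
    {"tool": "search", "input": "refund policy"},
]
looping = find_repeated_tool_calls(steps)
```

Identical tool calls with identical inputs are the cheapest loop signal to compute. Near-duplicate inputs would need fuzzier matching.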
Logging: What to Capture and What to Skip
Capture
import logging
from opentelemetry.sdk._logs import LoggingHandler

# Correlate logs with traces automatically: the handler attaches the active
# trace and span IDs to each record. Records are only exported once a
# LoggerProvider has been configured.
handler = LoggingHandler()
logging.getLogger().addHandler(handler)

# Structured fields for each LLM call
logger = logging.getLogger(__name__)
logger.info("LLM call completed", extra={
    "model": model,
    "prompt_tokens": usage.prompt_tokens,
    "completion_tokens": usage.completion_tokens,
    "finish_reason": finish_reason,
    "latency_ms": latency_ms,
    "cache_hit": was_cached,
})
Always log:
- Model, tokens, latency, finish reason
- Request ID / trace ID for correlation
- Cache hit/miss status
- Tool calls and their outcomes
Skip (or Gate on Debug Level)
Never log in production:
- Full prompt text (may contain PII, user queries)
- Full response text (volume is enormous, cost is high)
- API keys, auth tokens
Log only on error or with explicit opt-in:
- Truncated prompt (first 200 chars max)
- Truncated response
- User IDs (anonymized/hashed)
# Truncated prompt logging: bounds log volume but does not remove PII
def safe_log_prompt(prompt: str, max_chars: int = 200) -> str:
    if len(prompt) > max_chars:
        return prompt[:max_chars] + f"... [truncated, total {len(prompt)} chars]"
    return prompt
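For the "anonymized/hashed" user IDs mentioned above, a keyed hash is one common approach: the same user always maps to the same token, so you can still correlate log lines, but the raw ID never appears. This is a minimal sketch using only the standard library. The pseudonymize_user_id name, the 16-character truncation, and the placeholder key are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Keyed hash of a user ID: stable for correlation, hard to link
    back to the raw ID without the key."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for log readability


# Hypothetical key; in practice load it from a secret store, never from source
LOG_HASH_KEY = b"replace-with-a-real-secret"
token = pseudonymize_user_id("user-42", LOG_HASH_KEY)
```

An unkeyed hash of a small ID space can be reversed by brute force, which is why this sketch uses HMAC with a secret rather than a bare SHA-256.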
Dashboards: The Essential Views
Your Grafana/Datadog setup should have these views as a baseline:
Operational Dashboard:
- Request rate, error rate, latency (p50/p95/p99) — per endpoint
- Token usage rate (input/output) — per model
- Estimated cost (hourly/daily) — per model, per feature
- Cache hit rate
Quality Dashboard:
- Faithfulness scores over time (for RAG)
- LLM-as-judge quality scores by endpoint
- User feedback signals (thumbs up/down rates)
- Retry rates (implicit quality signal)
Cost Optimization Dashboard:
- Cost per successful request by endpoint
- Token efficiency (output tokens / input tokens ratio — low ratios may mean verbose prompts)
- Model cost comparison (GPT-5 vs GPT-4o vs local model)
- Top token-consuming endpoints/users
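Two of the cost panels above reduce to simple arithmetic over per-request records. The sketch below assumes you can assemble such records from your telemetry. The RequestRecord shape and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request record assembled from your telemetry."""
    endpoint: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    success: bool


def cost_per_successful_request(records: list) -> float:
    """Total spend (failures included) divided by the number of successes."""
    successes = sum(1 for r in records if r.success)
    if successes == 0:
        raise ValueError("no successful requests in window")
    return sum(r.cost_usd for r in records) / successes


def token_efficiency(records: list) -> float:
    """Output tokens divided by input tokens across the window."""
    total_in = sum(r.input_tokens for r in records)
    if total_in == 0:
        raise ValueError("no input tokens in window")
    return sum(r.output_tokens for r in records) / total_in


# Illustrative records: two successes and one failed call that still cost money
records = [
    RequestRecord("product-qa", 1000, 250, 0.02, True),
    RequestRecord("product-qa", 3000, 250, 0.04, True),
    RequestRecord("product-qa", 2000, 0, 0.03, False),
]
unit_cost = cost_per_successful_request(records)
efficiency = token_efficiency(records)
```

Counting failed calls in the numerator is intentional: retries and errors are part of what a successful request actually costs you.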
Alert Thresholds for AI Services
Unlike traditional services, AI service SLOs need to account for quality:
# Example Prometheus alerting rules
# Metric names are illustrative: match them to what your exporter emits.
groups:
  - name: ai_service_alerts
    rules:
      - alert: HighLLMErrorRate
        expr: rate(gen_ai_client_errors_total[5m]) / rate(gen_ai_client_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.95, rate(gen_ai_client_operation_duration_ms_bucket[5m])) > 10000
        for: 5m
        labels:
          severity: warning
      - alert: TokenCostSpike
        # increase() gives dollars over the window; rate() would be dollars per second
        expr: increase(gen_ai_client_token_cost_usd_total[1h]) > 50  # $50/hour
        for: 5m
        labels:
          severity: critical
      - alert: LowRAGFaithfulness
        expr: histogram_quantile(0.5, rate(gen_ai_response_faithfulness_score_bucket[30m])) < 0.7
        for: 10m
        labels:
          severity: warning
Summary
The shift to AI-powered services doesn’t eliminate the need for observability — it deepens it. The standard stack (OTel traces, Prometheus metrics, structured logs) is still the right foundation. But you need to extend it:
- Adopt OTel GenAI semantic conventions — use standard attribute names, enable ecosystem tooling
- Track tokens as a first-class metric — it’s your primary cost signal
- Instrument quality, not just performance — LLM-as-judge scores tell you what error rates can’t
- Trace agent pipelines step-by-step — black-box AI calls are debuggable if you instrument them
- Build cost dashboards proactively — token costs surprise teams that don’t watch them
The teams shipping reliable AI products in 2026 all have one thing in common: they can answer “what exactly did my system do, and how well did it do it?” Observability is how you build that answer.
