OpenTelemetry in Production: A Practical Guide to Observability in 2026
Introduction
In 2026, OpenTelemetry (OTel) has emerged as the undisputed standard for cloud-native observability. What began as a CNCF incubating project has graduated to become the second-largest CNCF project by contributor count (behind only Kubernetes itself). Every major cloud vendor, APM tool, and service mesh now speaks OpenTelemetry natively.
This guide covers practical OTel deployment — not just the theory, but what actually works in production at scale.
Why OpenTelemetry Won
Before OTel, instrumentation was a vendor lock-in nightmare:
- Datadog agent → Datadog only
- Jaeger client → Jaeger only
- CloudWatch agent → AWS only
OpenTelemetry solved this with a vendor-neutral wire format (OTLP) and standardized SDKs across 11+ languages. You instrument once, export anywhere.
The tipping point came when all major vendors converged:
- AWS X-Ray now accepts OTLP natively
- Datadog, New Relic, Honeycomb all support OTLP ingest
- Grafana’s LGTM stack (Loki, Grafana, Tempo, Mimir) is built around it
The OTel Architecture: Three Signals
OpenTelemetry handles three observability signals:
1. Traces (Distributed Tracing)
Traces follow a request as it propagates across microservices. Each unit of work is a span; related spans form a trace.
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Instrumentation
tracer = trace.get_tracer(__name__)

def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.source", "api")
        result = fetch_order_from_db(order_id)
        span.set_attribute("order.value", result.total)
        return result
```
2. Metrics
Metrics are numerical measurements over time — counters, gauges, histograms.
```python
import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Counter: monotonically increasing value
request_counter = meter.create_counter(
    "http.requests.total",
    description="Total HTTP requests",
    unit="1",
)

# Histogram: distribution of values
request_duration = meter.create_histogram(
    "http.request.duration",
    description="HTTP request duration",
    unit="ms",
)

def handle_request(method: str, path: str):
    start = time.time()
    try:
        response = process_request()
        request_counter.add(1, {"method": method, "path": path, "status": "200"})
        return response
    finally:
        duration_ms = (time.time() - start) * 1000
        request_duration.record(duration_ms, {"method": method, "path": path})
```
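Under the hood, a histogram instrument doesn't store every sample — it aggregates recorded values into per-bucket counts that the backend later turns into percentiles. A minimal sketch of that aggregation (the bucket boundaries here are illustrative, not the SDK's defaults):

```python
from bisect import bisect_left

# Illustrative latency bucket boundaries in ms (not the SDK defaults)
BOUNDARIES = [5, 10, 25, 50, 100, 250, 500, 1000]

def bucketize(samples):
    """Count how many samples land in each bucket.

    Bucket i holds values <= BOUNDARIES[i]; the final bucket is overflow.
    """
    counts = [0] * (len(BOUNDARIES) + 1)
    for value in samples:
        counts[bisect_left(BOUNDARIES, value)] += 1
    return counts

counts = bucketize([3, 7, 7, 40, 120, 2000])
```

This is why histograms stay cheap at high request rates: the memory cost is fixed by the number of buckets, not the number of recorded values.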
3. Logs (Structured Logging)
OTel’s log signal connects logs to traces via trace context propagation, giving you correlated logs without a separate log correlation system.
```python
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

LoggingInstrumentor().instrument(set_logging_format=True)
logger = logging.getLogger(__name__)

# Logs now automatically include trace_id and span_id
logger.info("Processing payment", extra={
    "payment.amount": 99.99,
    "payment.currency": "USD",
    "user.id": user_id,
})
# Output: [trace_id=abc123 span_id=def456] Processing payment ...
```
The OTel Collector: Your Observability Router
The OTel Collector is the key infrastructure component — a vendor-agnostic proxy that receives telemetry, processes it, and routes it to one or more backends.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Also scrape Prometheus metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod

processors:
  # Drop high-cardinality labels
  attributes:
    actions:
      - key: http.user_agent
        action: delete
  # Sample traces intelligently
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic-10pct
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
  # Add resource attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # Send to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Send metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Also send errors to Datadog
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, tail_sampling, resource]
      exporters: [otlp/tempo, datadog]
    metrics:
      receivers: [otlp, prometheus]
      processors: [resource]
      exporters: [prometheus]
```
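The tail_sampling processor above buffers all spans of a trace for decision_wait seconds, then keeps the trace if any policy matches. The decision logic amounts to roughly this (a simplified sketch — the real processor evaluates OTLP spans, not dicts):

```python
import random

def keep_trace(spans, sample_rate=0.10, latency_threshold_ms=1000, rng=random.random):
    """Mirror the three policies above: errors, slow traces, 10% of the rest."""
    if any(span.get("status") == "ERROR" for span in spans):
        return True                     # errors-policy: keep every failed trace
    if any(span.get("duration_ms", 0) > latency_threshold_ms for span in spans):
        return True                     # slow-traces: keep anything over 1s
    return rng() < sample_rate          # probabilistic-10pct: sample the remainder
```

The practical upshot: you record 100% of errors and slow requests while paying storage for only ~10% of healthy traffic — something head-based sampling can't do, since it must decide before the outcome is known.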
Auto-Instrumentation: The Easy Win
OTel’s killer feature for production adoption is zero-code auto-instrumentation. For most frameworks, you get traces and metrics for free:
Python (FastAPI, SQLAlchemy, Redis, etc.)
```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install

# Run your app with auto-instrumentation
opentelemetry-instrument \
  --service_name=order-service \
  --exporter_otlp_endpoint=http://otel-collector:4317 \
  python app.py
```
Java (Spring Boot, Hibernate, gRPC, etc.)
```dockerfile
# Dockerfile
FROM eclipse-temurin:21-jre
COPY --from=build /app/app.jar /app.jar
COPY otel-javaagent.jar /otel-javaagent.jar
ENV JAVA_TOOL_OPTIONS="-javaagent:/otel-javaagent.jar"
ENV OTEL_SERVICE_NAME="payment-service"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
CMD ["java", "-jar", "/app.jar"]
```
Kubernetes-Wide (via Operator)
```yaml
# Install the OTel Operator, then enable auto-instrumentation per namespace
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
```
Common Production Pitfalls
1. Cardinality Explosion
High-cardinality labels (user IDs, request IDs in metric labels) can destroy your metrics backend.
```python
# ❌ Bad: unbounded cardinality
request_counter.add(1, {"user_id": user_id, "path": request.path})

# ✅ Good: bounded labels only
request_counter.add(1, {"path": normalize_path(request.path), "method": request.method})
```
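Note that normalize_path is not part of any OTel API — it's a helper you write yourself. A minimal sketch that collapses numeric IDs and UUIDs into placeholders so `/orders/12345` and `/orders/67890` count toward the same series:

```python
import re

_UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)
_NUM = re.compile(r"\d+")

def normalize_path(path: str) -> str:
    """Replace ID-like path segments so the metric label set stays bounded."""
    parts = []
    for segment in path.split("/"):
        if _UUID.fullmatch(segment):
            parts.append("{uuid}")
        elif _NUM.fullmatch(segment):
            parts.append("{id}")
        else:
            parts.append(segment)
    return "/".join(parts)
```

If your framework exposes route templates (FastAPI and Spring both do), prefer those over regex normalization — the template is the bounded label by construction.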
2. Synchronous Exporters in Hot Paths
```python
# ❌ Bad: blocks your request handler on every span export
SimpleSpanProcessor(exporter)

# ✅ Good: async batching
BatchSpanProcessor(exporter, max_export_batch_size=512, schedule_delay_millis=5000)
```
3. Missing Context Propagation
If you use thread pools, async queues, or background jobs, you must propagate trace context manually:
```python
from opentelemetry import context, propagate

# Before putting work on a queue
carrier = {}
propagate.inject(carrier)
queue.put({"task": task_data, "otel_context": carrier})

# When processing from the queue
carrier = message["otel_context"]
ctx = propagate.extract(carrier)
token = context.attach(ctx)
try:
    process_task(message["task"])
finally:
    context.detach(token)
```
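The reason manual propagation is needed at all: OTel's Python context rides on the stdlib's contextvars, and a ContextVar set in one thread is invisible to a freshly started worker thread. A stdlib-only sketch of the failure mode and the fix (trace_ctx here stands in for OTel's real context variable):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

trace_ctx = contextvars.ContextVar("trace_ctx", default=None)

def current_trace():
    return trace_ctx.get()

trace_ctx.set("trace-abc123")

with ThreadPoolExecutor(max_workers=1) as pool:
    # ❌ Plain submit: the worker thread starts with an empty context
    lost = pool.submit(current_trace).result()

    # ✅ Capture the caller's context and run the task inside it
    ctx = contextvars.copy_context()
    kept = pool.submit(ctx.run, current_trace).result()
```

Here `lost` comes back as None while `kept` carries the value — the same copy-and-run pattern is what `propagate.inject`/`extract` perform across process boundaries.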
The LGTM Stack: Open Source Observability
For teams wanting full control without vendor costs, the Grafana LGTM stack is now production-ready:
- Loki — Log aggregation (indexes only labels, not log content, so it runs far cheaper than Elasticsearch)
- Grafana — Dashboards and alerting
- Tempo — Distributed tracing (OTLP native)
- Mimir — Horizontally scalable Prometheus-compatible metrics
All four are OTel-native and deployable on Kubernetes with official Helm charts. At moderate scale (< 1000 RPS), a 3-node cluster handles all four services comfortably.
Conclusion
OpenTelemetry in 2026 is no longer an emerging technology — it’s infrastructure. If you’re not using it, you’re probably using something that will eventually migrate to it.
The key takeaway: start with auto-instrumentation, add manual spans for business context, run a collector, and pick backends later. The investment in OTel is portable. Your instrumentation code won’t need to change when you switch from Jaeger to Tempo, or from Datadog to Grafana Cloud.
Observability isn’t optional at scale. OTel makes it achievable without vendor lock-in.
