OpenTelemetry in Production: The Complete Observability Stack for 2026
Why Observability Matters More Than Ever
Distributed systems are the norm. A single user request might touch dozens of microservices, spawn async jobs, and hit three different databases before a response is returned. When something goes wrong at 2 AM, you need to know exactly what happened — not guess.
OpenTelemetry (OTel) has become the industry-standard answer. In 2026, it’s no longer a “nice to have” — it’s the default instrumentation layer for every serious engineering team.
This guide covers the complete OTel stack: what it is, how it works, and how to deploy it in production without drowning in YAML.
What Is OpenTelemetry?
OpenTelemetry is a vendor-neutral, open-source observability framework that unifies the three pillars of observability:
- Traces — record the path a request takes through your system
- Metrics — quantitative measurements over time (latency, error rate, throughput)
- Logs — structured event records correlated with traces
The key word is vendor-neutral. Instrument your code once with OTel SDKs, then send data to any backend: Jaeger, Zipkin, Tempo, Prometheus, Datadog, Honeycomb, New Relic — your choice, switchable without code changes.
The OTel Architecture
Your Services
    │ (OTel SDK)
    ▼
OTel Collector (agent or gateway)
    │
    ├── Prometheus (metrics)
    ├── Tempo / Jaeger (traces)
    └── Loki / Elasticsearch (logs)
            │
            ▼
    Grafana (dashboards + alerts)
The OTel Collector is the linchpin. It receives telemetry from your services, processes/transforms it, and exports to backends. Running a collector means your services never talk directly to backends — swapping backends is a collector config change, not a code change.
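To make the receive, process, export flow concrete, here is a toy model of a collector pipeline in plain Python. It is only a sketch of the idea, not collector code: `MiniCollector`, `add_environment`, and the two list "backends" are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Span = dict  # a span is modelled as a plain dict for illustration
Processor = Callable[[list[Span]], list[Span]]
Exporter = Callable[[list[Span]], None]


@dataclass
class MiniCollector:
    """Toy stand-in for a collector pipeline: receive -> process -> export."""
    processors: list[Processor] = field(default_factory=list)
    exporters: list[Exporter] = field(default_factory=list)

    def receive(self, spans: list[Span]) -> None:
        # Run every processor in order, then hand the result to each exporter.
        for process in self.processors:
            spans = process(spans)
        for export in self.exporters:
            export(spans)


def add_environment(spans: list[Span]) -> list[Span]:
    # Comparable in spirit to a processor that upserts an attribute.
    return [{**span, "deployment.environment": "production"} for span in spans]


backend_a: list[Span] = []  # hypothetical first backend
backend_b: list[Span] = []  # hypothetical replacement backend

collector = MiniCollector(processors=[add_environment], exporters=[backend_a.extend])
collector.receive([{"name": "GET /orders"}])

# "Migrating" backends only changes the collector's wiring, not the sender.
collector.exporters = [backend_b.extend]
collector.receive([{"name": "POST /orders"}])
```

The code that calls `receive` is identical before and after the switch, which is the property the collector gives your services.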
Automatic Instrumentation (Zero-Code)
For most common frameworks, OTel provides automatic instrumentation that requires no changes to your application code:
Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Detect installed libraries and install the matching instrumentation packages
opentelemetry-bootstrap --action=install
# Run your app with auto-instrumentation (note: the CLI flags use underscores)
opentelemetry-instrument \
    --service_name my-python-app \
    --exporter_otlp_endpoint http://localhost:4317 \
    python app.py
Node.js
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
// instrumentation.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-node-service',
});
sdk.start();
Run with:
node --require ./instrumentation.js server.js
Automatic instrumentation captures:
- HTTP client/server spans
- Database queries (PostgreSQL, MySQL, MongoDB, Redis)
- gRPC calls
- Message queue operations (Kafka, RabbitMQ)
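Conceptually, auto-instrumentation works by wrapping library entry points at startup so that each call produces a span. The following is a rough stdlib-only sketch of that wrapping idea, not the actual OTel instrumentation code; `traced`, `recorded_spans`, and `FakeDbClient` are hypothetical names used only here.

```python
import functools
import time

recorded_spans = []  # stand-in for an exporter; real spans go through the SDK


def traced(span_name):
    """Wrap a callable so each call records a name, duration, and error flag."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = False
            try:
                return func(*args, **kwargs)
            except Exception:
                error = True
                raise
            finally:
                recorded_spans.append({
                    "name": span_name,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    "error": error,
                })
        return wrapper
    return decorator


class FakeDbClient:
    """Hypothetical library class standing in for a database driver."""
    def query(self, sql):
        return [("row",)]


# Patch the library method in place; application code keeps calling query().
FakeDbClient.query = traced("db.query")(FakeDbClient.query)

rows = FakeDbClient().query("SELECT 1")
```

The caller never references `traced`, which is why this style of instrumentation needs no edits to application code.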
Manual Instrumentation: Adding Business Context
Auto-instrumentation covers infrastructure. But the most valuable traces include business context: which user made the request, what product they were buying, why a decision was made.
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("order-service")

# get_user_region, validate_inventory, charge_payment and InventoryError
# are the application's own code, not part of OTel.
def process_order(order_id: str, user_id: str):
    with tracer.start_as_current_span("process_order") as span:
        # Add business attributes to the span
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.region", get_user_region(user_id))
        try:
            # Validate inventory
            with tracer.start_as_current_span("check_inventory"):
                items = validate_inventory(order_id)
                span.set_attribute("order.item_count", len(items))
            # Charge payment
            with tracer.start_as_current_span("charge_payment") as payment_span:
                result = charge_payment(order_id)
                payment_span.set_attribute("payment.method", result.method)
                payment_span.set_attribute("payment.amount_cents", result.amount)
            span.set_status(StatusCode.OK)
            return {"status": "success"}
        except InventoryError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, str(e))
            raise
Metrics with OTel
OTel metrics replace the need for manual Prometheus instrumentation in many cases:
import time

from opentelemetry import metrics

meter = metrics.get_meter("payment-service")

# Counter: monotonically increasing
payment_counter = meter.create_counter(
    "payments.total",
    description="Total number of payment attempts",
    unit="1"
)

# Histogram: for distributions (latency, sizes)
payment_duration = meter.create_histogram(
    "payments.duration",
    description="Payment processing duration",
    unit="ms"
)

# UpDownCounter: a value that can rise and fall, such as a count of open sessions
active_sessions = meter.create_up_down_counter(
    "sessions.active",
    description="Currently active user sessions"
)

def process_payment(amount: float, method: str):
    start = time.perf_counter()
    try:
        result = _do_payment(amount, method)  # the application's own payment call
        payment_counter.add(1, {"status": "success", "method": method})
        return result
    except Exception:
        payment_counter.add(1, {"status": "error", "method": method})
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        payment_duration.record(duration_ms, {"method": method})
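It can help to see what a histogram does with each recorded value. The sketch below models an explicit-bucket histogram in plain Python; the boundaries are an assumption chosen for illustration, and `BucketHistogram` is a hypothetical class, not the SDK's aggregation code.

```python
from bisect import bisect_left

# Assumed bucket upper bounds in milliseconds; choose ones that fit your latencies.
BOUNDARIES_MS = [5, 10, 25, 50, 100, 250, 500, 1000]


class BucketHistogram:
    def __init__(self, boundaries):
        self.boundaries = list(boundaries)
        # One bucket per boundary plus an overflow bucket.
        self.counts = [0] * (len(self.boundaries) + 1)
        self.total = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left puts a value equal to a boundary into that boundary's
        # bucket, so each bucket covers (lower, upper].
        self.counts[bisect_left(self.boundaries, value)] += 1
        self.total += value
        self.count += 1


hist = BucketHistogram(BOUNDARIES_MS)
for duration_ms in [3, 5, 42, 180, 2500]:
    hist.record(duration_ms)
```

Only the bucket counts, sum, and count are kept rather than every raw value, so the cost stays fixed however many payments are recorded. Percentiles are later estimated from the buckets, and their accuracy depends on how well the boundaries match the real latency range.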
The OTel Collector Configuration
The collector is configured via YAML. Here’s an example to use as a starting point for production:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Batch for efficiency
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Add resource attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  # Keep every error and slow trace, plus 10% of everything else
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  # Redact PII from spans
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: payment.card_number
        action: delete

exporters:
  # Traces → Tempo
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  # Metrics → Prometheus (Prometheus scrapes this endpoint)
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Logs → Loki over OTLP. Assumes Loki 3.x with native OTLP ingestion;
  # the dedicated loki exporter is deprecated in collector-contrib.
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample before batching so batches contain only the spans that are kept
      processors: [resource, tail_sampling, attributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp/loki]
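The three tail_sampling policies in this config combine as "keep the trace if any policy matches". As a rough illustration of that decision, here is a plain-Python sketch. It is not the processor's implementation: `should_keep` and `keep_by_hash` are hypothetical helpers, spans are modelled as dicts with assumed `status`, `start_ms`, and `end_ms` keys, and the SHA-256 hash is just one way to get a stable pseudo-random choice per trace.

```python
import hashlib

LATENCY_THRESHOLD_MS = 500   # mirrors slow-traces-policy
SAMPLING_PERCENTAGE = 10     # mirrors probabilistic-policy


def keep_by_hash(trace_id, percentage):
    """Deterministic pseudo-random choice derived from the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage


def should_keep(trace_id, spans):
    """Decide once the whole trace has been buffered; any matching policy keeps it."""
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # Trace duration: earliest span start to latest span end.
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start > LATENCY_THRESHOLD_MS:
        return True
    return keep_by_hash(trace_id, SAMPLING_PERCENTAGE)


failed = [{"status": "ERROR", "start_ms": 0, "end_ms": 40}]
slow = [
    {"status": "OK", "start_ms": 0, "end_ms": 300},
    {"status": "OK", "start_ms": 250, "end_ms": 900},
]
keep_failed = should_keep("trace-a", failed)
keep_slow = should_keep("trace-b", slow)
```

Because the decision needs every span of a trace, the collector has to buffer spans for `decision_wait` before deciding, and all spans of a given trace must reach the same collector instance.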
Kubernetes Deployment with the OTel Operator
In Kubernetes, the OpenTelemetry Operator automates instrumentation injection:
# Define how Python workloads in this namespace get instrumented
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: production
spec:
  exporter:
    # Python auto-instrumentation exports OTLP over HTTP, so target port 4318
    endpoint: http://otel-collector:4318
  python:
    env:
      - name: OTEL_SERVICE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.labels['app']
---
# Annotate a deployment to inject instrumentation
# (fragment: selector, containers and other required fields omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production  # same namespace as the Instrumentation resource
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
No code changes, no Dockerfile modifications — just an annotation.
Grafana: Connecting the Dots
With Grafana as your visualization layer:
- Explore traces in Tempo, filtered by service or error status
- Click a trace → see all spans with timing and attributes
- Click a log line in Loki → jump to the correlated trace in Tempo
- Metrics dashboards in Grafana with alerts to PagerDuty/Slack
The Grafana LGTM stack (Loki + Grafana + Tempo + Mimir) has become the self-hosted OTel golden path in 2026.
Common Pitfalls
- Not sampling — sending 100% of traces at scale will bankrupt you in storage costs. Use tail-based sampling.
- Missing context propagation — async queues (Kafka, SQS) need explicit context propagation headers or you’ll get broken trace trees.
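- Too many high-cardinality attributes — span.set_attribute("user.id", user_id) is fine; span.set_attribute("request.body", raw_body) is a time bomb: unbounded values inflate span size and storage, and explode series cardinality if they are ever turned into metric attributes.
- Skipping the collector — sending directly from services to backends couples every service to that backend, so a migration means reconfiguring and redeploying each service instead of editing one collector config.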
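For the context propagation pitfall, the fix is to carry the trace identity inside the message itself. The sketch below hand-rolls the W3C `traceparent` header format (`00-<trace id>-<span id>-<flags>`) to show what travels with a message. The `message` dict is a hypothetical stand-in for a Kafka or SQS message, and real propagators also handle `tracestate` and stricter validation.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: write the current span's identity into message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Consumer side: recover the remote parent, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Producer: attach the context to the message before publishing it.
message = {"headers": {}, "body": b'{"order_id": "123"}'}
trace_id, span_id = secrets.token_hex(16), secrets.token_hex(8)
inject(message["headers"], trace_id, span_id)

# Consumer: continue the same trace instead of starting a new one.
parent = extract(message["headers"])
```

In the OTel Python API the equivalent calls are `inject` and `extract` from `opentelemetry.propagate`, which read and write a carrier such as a headers dict. The consumer then starts its span with the extracted context as parent.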
Conclusion
OpenTelemetry in 2026 is the closest thing we have to a universal observability standard. The auto-instrumentation story is mature, the ecosystem is rich, and the vendor-neutral model means you own your data and your choices.
Start with auto-instrumentation to get immediate value, add manual spans for business context, and deploy the OTel Collector to stay flexible. Your future on-call self will thank you.
