OpenTelemetry in Production: A Complete Observability Guide for 2026
Observability has become a first-class engineering concern. With distributed systems spanning dozens of services across multiple clouds, understanding what is happening requires more than dashboards and alerts. OpenTelemetry (OTel) — now the second-most-active CNCF project after Kubernetes — has become the industry standard for collecting and exporting telemetry data. This guide shows you how to instrument your services, collect telemetry through the Collector, and visualize it at scale.
Why OpenTelemetry Won
Before OTel, every observability vendor had its own SDK. Switching from Datadog to Honeycomb meant re-instrumenting your entire codebase. OpenTelemetry solved this with:
- Vendor-neutral instrumentation — instrument once, export anywhere
- Unified data model — traces, metrics, and logs in one framework
- Auto-instrumentation — zero-code instrumentation for popular frameworks
- The Collector — a powerful pipeline component for data processing
The Three Pillars + Profiles
Traces
Distributed traces track a request’s journey across services. Each trace contains spans — units of work with timing, attributes, and status.
Metrics
Numerical measurements over time: counters, gauges, histograms. OTel metrics bridge the gap between Prometheus-style pull metrics and push-based systems.
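The three instrument kinds aggregate differently, which matters when you pick one. A toy model in plain Python — illustrating the semantics only, not the SDK API:

```python
import bisect
from collections import Counter

# Counter: a monotonic sum. Gauge: only the last observed value.
# Histogram: counts of observations in upper-bound-inclusive buckets,
# like OTel's explicit-bucket histogram.
requests_total = 0            # counter
queue_depth = 0               # gauge
bounds = [100, 250, 500]      # histogram bucket upper bounds (ms)
buckets = Counter()

def observe_request(duration_ms: int, depth: int) -> None:
    global requests_total, queue_depth
    requests_total += 1                                   # counter goes up
    queue_depth = depth                                   # gauge overwrites
    buckets[bisect.bisect_left(bounds, duration_ms)] += 1 # histogram bins

for duration, depth in [(80, 3), (120, 5), (700, 2)]:
    observe_request(duration, depth)

assert requests_total == 3             # every event summed
assert queue_depth == 2                # only the latest value kept
assert buckets == {0: 1, 1: 1, 3: 1}   # 80→≤100, 120→≤250, 700→overflow
```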
Logs
Structured log records, now linkable to traces via trace ID correlation.
Profiles (2026 addition)
Continuous profiling — CPU, memory, goroutine — is now part of the OTel spec, enabling correlation between trace slowness and actual CPU hotspots.
Architecture: The OTel Collector Pipeline
The Collector is the heart of a production OTel setup:
```
Services → OTel SDK → OTel Collector → Backends
                           ↕
              (the Collector can also scrape
               Prometheus and receive Jaeger
               and Zipkin data)
```
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s   # required: how often memory usage is checked
    limit_mib: 512
    spike_limit_mib: 128
  resourcedetection:
    # Detector names must match the processor's registry;
    # eks covers Kubernetes on AWS.
    detectors: [env, ec2, eks]
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, attributes, tail_sampling, batch]
      exporters: [otlp/tempo, otlp/honeycomb]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [loki]
```
Instrumenting a Python Service
Auto-Instrumentation (recommended starting point)
```shell
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py
```
This automatically instruments Flask, FastAPI, Django, SQLAlchemy, Redis, boto3, and 80+ other libraries.
Manual Instrumentation for Custom Business Logic
```python
import time

from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Custom metrics
payment_counter = meter.create_counter(
    "payments.processed",
    description="Total payments processed",
    unit="1",
)
payment_duration = meter.create_histogram(
    "payments.duration",
    description="Payment processing duration",
    unit="ms",
)
payment_amount = meter.create_histogram(
    "payments.amount",
    description="Payment amounts",
    unit="USD",
)

def process_payment(payment_request: dict) -> dict:
    # validate(), charge_stripe(), and PaymentDeclinedError are your
    # application's own code, shown here only as call sites.
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attributes({
            "payment.currency": payment_request["currency"],
            "payment.method": payment_request["method"],
            "payment.amount_cents": payment_request["amount_cents"],
        })
        start_time = time.time()
        try:
            # Validate
            with tracer.start_as_current_span("validate_payment"):
                validate(payment_request)
                span.set_attribute("payment.validation_passed", True)

            # Charge
            with tracer.start_as_current_span("charge_payment") as charge_span:
                charge_span.set_attribute("payment.gateway", "stripe")
                charge_result = charge_stripe(payment_request)
                charge_span.set_attribute("payment.transaction_id", charge_result["id"])

            # Record metrics
            duration_ms = (time.time() - start_time) * 1000
            payment_counter.add(1, {
                "currency": payment_request["currency"],
                "status": "success",
            })
            payment_duration.record(duration_ms, {
                "currency": payment_request["currency"],
            })
            payment_amount.record(
                payment_request["amount_cents"] / 100,
                {"currency": payment_request["currency"]},
            )
            span.set_status(Status(StatusCode.OK))
            return charge_result
        except PaymentDeclinedError as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            payment_counter.add(1, {
                "currency": payment_request["currency"],
                "status": "declined",
            })
            raise
```
Instrumenting a Java Service (Spring Boot)
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.12.0</version>
</dependency>
```
```yaml
# application.yml
otel:
  service:
    name: order-service
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
  traces:
    sampler: parentbased_traceidratio
  # YAML cannot nest a key under the scalar `sampler` value, so pass the
  # sampler argument via the flat property form (Spring relaxed binding):
  # otel.traces.sampler.arg: "0.1"
  instrumentation:
    micrometer:
      enabled: true  # Bridge Spring Actuator metrics to OTel
```
```java
@Service
public class OrderService {

    private final Tracer tracer;
    private final LongCounter orderCounter;
    private final DoubleHistogram orderValueHistogram;

    public OrderService(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("order-service");
        Meter meter = openTelemetry.getMeter("order-service");
        this.orderCounter = meter.counterBuilder("orders.created")
                .setDescription("Number of orders created")
                .build();
        this.orderValueHistogram = meter.histogramBuilder("orders.value")
                .setDescription("Order value in USD")
                .setUnit("USD")
                .build();
    }

    public Order createOrder(CreateOrderRequest request) {
        Span span = tracer.spanBuilder("create_order")
                .setAttribute("order.user_id", request.getUserId())
                .setAttribute("order.item_count", request.getItems().size())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            Order order = processOrder(request);
            span.setAttribute("order.id", order.getId());
            span.setAttribute("order.total_usd", order.getTotalUsd());
            orderCounter.add(1, Attributes.of(
                    AttributeKey.stringKey("status"), "success",
                    AttributeKey.stringKey("region"), request.getRegion()
            ));
            orderValueHistogram.record(order.getTotalUsd());
            return order;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}
```
The Grafana Stack: Open Source Observability Backend
The most common open-source backend for OTel data in 2026:
```yaml
# docker-compose.yml for local development
version: '3.8'
services:
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor,metricsSummary
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
  tempo:
    image: grafana/tempo:2.6.0
    ports:
      - "3200:3200"   # Query API
    # Tempo's OTLP gRPC port (4317) stays internal: only the Collector
    # talks to it over the compose network, which avoids a host-port
    # clash with the Collector's own 4317 below.
  loki:
    image: grafana/loki:3.2.0
    ports:
      - "3100:3100"
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.116.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
```
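Once the stack is up, you can smoke-test the Collector by posting a single span to its OTLP/HTTP endpoint on port 4318. The payload follows the standard OTLP JSON encoding for `/v1/traces`; the `smoke-test` service name is made up for this demo:

```python
import json
import os
import time
import urllib.request

# Build a minimal, valid OTLP/HTTP JSON trace payload: one resource,
# one scope, one span. Trace/span IDs are hex strings; timestamps are
# uint64 nanoseconds serialized as strings per protobuf JSON rules.
now = time.time_ns()
payload = {
    "resourceSpans": [{
        "resource": {"attributes": [{
            "key": "service.name",
            "value": {"stringValue": "smoke-test"},
        }]},
        "scopeSpans": [{
            "scope": {"name": "manual-smoke-test"},
            "spans": [{
                "traceId": os.urandom(16).hex(),   # 32 hex chars
                "spanId": os.urandom(8).hex(),     # 16 hex chars
                "name": "smoke-span",
                "kind": 1,                         # SPAN_KIND_INTERNAL
                "startTimeUnixNano": str(now),
                "endTimeUnixNano": str(now + 1_000_000),
            }],
        }],
    }],
}

req = urllib.request.Request(
    "http://localhost:4318/v1/traces",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with the compose stack running
```

If the span reaches Tempo, it will show up in Grafana's Explore view under the `smoke-test` service within a few seconds.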
Correlating Traces, Metrics, and Logs
The power of OTel is cross-signal correlation. In Grafana, you can:
- See a spike in the payments.duration histogram
- Click “View Traces” → see the slow traces during that window
- Click a trace → see spans and find the slow SQL query
- Click “View Logs” → see the logs from that exact request
This requires linking logs to traces via trace context injection:
```python
import logging

from opentelemetry import trace

class OtelFormatter(logging.Formatter):
    def format(self, record):
        span = trace.get_current_span()
        if span.is_recording():
            ctx = span.get_span_context()
            record.trace_id = format(ctx.trace_id, '032x')
            record.span_id = format(ctx.span_id, '016x')
            record.trace_flags = ctx.trace_flags
        else:
            # No active span: emit the all-zero IDs defined by W3C Trace Context
            record.trace_id = "00000000000000000000000000000000"
            record.span_id = "0000000000000000"
            record.trace_flags = 0
        return super().format(record)

# Configure the root logger to emit JSON lines with trace context
handler = logging.StreamHandler()
handler.setFormatter(OtelFormatter(
    '{"timestamp":"%(asctime)s","level":"%(levelname)s","message":"%(message)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s"}'
))
logging.getLogger().addHandler(handler)
```
Tail Sampling: Sample Smart, Not Random
Head-based sampling (randomly sampling X% of requests at the start) loses your most valuable traces — errors and slow requests are rare but critical. Tail sampling makes the sampling decision after the trace completes:
The OTel Collector’s tail sampling processor (shown in config above) lets you:
- Always sample errors — 100% of traces with ERROR spans
- Always sample slow requests — 100% of traces > 500ms
- Probabilistically sample the rest — 5–10% of successful fast traces
This dramatically reduces storage costs while guaranteeing you capture all anomalous behavior.
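The decision logic is easy to model. This toy function mirrors the three policies from the Collector config above (an illustration of the idea, not Collector code):

```python
import random

def keep_trace(spans: list[dict], ratio: float = 0.10) -> bool:
    """Decide after the whole trace is buffered, like tail sampling."""
    if any(s["status"] == "ERROR" for s in spans):
        return True                     # errors-policy: keep all errors
    if max(s["duration_ms"] for s in spans) > 500:
        return True                     # slow-traces-policy: keep slow traces
    return random.random() < ratio      # probabilistic-policy: ~10% of the rest

fast_ok   = [{"status": "OK", "duration_ms": 40}]
slow_ok   = [{"status": "OK", "duration_ms": 900}]
has_error = [{"status": "OK", "duration_ms": 40},
             {"status": "ERROR", "duration_ms": 10}]

assert keep_trace(slow_ok) and keep_trace(has_error)   # always kept
random.seed(7)
kept = sum(keep_trace(fast_ok) for _ in range(10_000))
assert 800 < kept < 1200                               # roughly 10% kept
```

The `decision_wait` in the real processor exists precisely because this decision needs the *whole* trace buffered before any policy can evaluate it.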
SLO Tracking with OTel Metrics
OpenSLO + OTel metrics = automated SLO dashboards:
```yaml
# openslo/payment-slo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: payment-availability
spec:
  service: payment-service
  sloIndicator:
    metadata:
      name: payment-success-rate
    spec:
      type: Ratio
      ratioMetric:
        counter: true
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(payments_processed_total{status="success"}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(payments_processed_total[5m]))
  objectives:
    - displayName: "99.9% availability"
      target: 0.999
      timeSliceTarget: 0.95
      timeSliceWindow: 1m
  timeWindow:
    - duration: 30d
      isRolling: true
  alertPolicies:
    - fast-burn-alert
```
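The arithmetic behind a 99.9% / 30-day objective is worth internalizing. The 14.4x fast-burn threshold below is the common convention from the SRE literature (it corresponds to consuming 2% of the budget in one hour):

```python
# Error-budget arithmetic for a 99.9% target over a rolling 30 days.
target = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes

budget_fraction = 1 - target             # 0.1% of requests may fail
budget_minutes = window_minutes * budget_fraction
assert round(budget_minutes, 1) == 43.2  # ~43 min of full outage per 30d

# Burn rate = observed failure ratio / allowed failure ratio.
error_rate = 0.0144                      # failure ratio over the last hour
burn_rate = error_rate / budget_fraction
assert round(burn_rate, 1) == 14.4

# At burn rate 14.4, one hour (1/720 of the window) consumes
# 14.4/720 = 2% of the monthly budget — the classic fast-burn alert.
assert round(14.4 / 720, 3) == 0.02
```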
Key Takeaways
- Start with auto-instrumentation — get 80% of value with zero code changes
- Deploy the Collector — never send directly to backends; the Collector gives you flexibility
- Use tail sampling in production — head sampling wastes your most valuable data
- Correlate everything — link logs to traces via context propagation
- Define SLOs before dashboards — know what “good” looks like before you build alerts
OpenTelemetry is the foundation of modern observability. Instrument it right once, and you’ll never be blind to what’s happening in your system again. 🔭
