OpenTelemetry in Production: A Complete Observability Guide for 2026




Observability has become a first-class engineering concern. With distributed systems spanning dozens of services across multiple clouds, understanding what is happening requires more than dashboards and alerts. OpenTelemetry (OTel) — now the second-most-active CNCF project after Kubernetes — has become the industry standard for collecting and exporting telemetry data. This guide shows how to instrument services, collect their telemetry, and visualize it at scale.



Why OpenTelemetry Won

Before OTel, every observability vendor had its own SDK. Switching from Datadog to Honeycomb meant re-instrumenting your entire codebase. OpenTelemetry solved this with:

  1. Vendor-neutral instrumentation — instrument once, export anywhere
  2. Unified data model — traces, metrics, and logs in one framework
  3. Auto-instrumentation — zero-code instrumentation for popular frameworks
  4. The Collector — a powerful pipeline component for data processing

The Three Pillars + Profiles

Traces

Distributed traces track a request’s journey across services. Each trace contains spans — units of work with timing, attributes, and status.

Metrics

Numerical measurements over time: counters, gauges, histograms. OTel metrics bridge the gap between Prometheus-style pull metrics and push-based systems.

Logs

Structured log records, now linkable to traces via trace ID correlation.

Profiles (2026 addition)

Continuous profiling — CPU, memory, goroutine — is now part of the OTel spec, enabling correlation between trace slowness and actual CPU hotspots.


Architecture: The OTel Collector Pipeline

The Collector is the heart of a production OTel setup:

Services → OTel SDK → OTel Collector → Backends
                            ↕
                    (Collector can also
                     scrape Prometheus,
                     receive Jaeger, Zipkin)
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resourcedetection:
    detectors: [env, system, eks]
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, attributes, tail_sampling, batch]
      exporters: [otlp/tempo, otlp/honeycomb]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [loki]

Instrumenting a Python Service

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run your app with auto-instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py

This automatically instruments Flask, FastAPI, Django, SQLAlchemy, Redis, boto3, and 80+ other libraries.

Manual Instrumentation for Custom Business Logic

from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode
import time

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Custom metrics
payment_counter = meter.create_counter(
    "payments.processed",
    description="Total payments processed",
    unit="1",
)
payment_duration = meter.create_histogram(
    "payments.duration",
    description="Payment processing duration",
    unit="ms",
)
payment_amount = meter.create_histogram(
    "payments.amount",
    description="Payment amounts",
    unit="USD",
)

def process_payment(payment_request: dict) -> dict:
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attributes({
            "payment.currency": payment_request["currency"],
            "payment.method": payment_request["method"],
            "payment.amount_cents": payment_request["amount_cents"],
        })

        start_time = time.time()

        try:
            # Validate
            with tracer.start_as_current_span("validate_payment"):
                validate(payment_request)  # raises on invalid input
                span.set_attribute("payment.validation_passed", True)

            # Charge
            with tracer.start_as_current_span("charge_payment") as charge_span:
                charge_span.set_attribute("payment.gateway", "stripe")
                charge_result = charge_stripe(payment_request)
                charge_span.set_attribute("payment.transaction_id", charge_result["id"])

            # Record metrics
            duration_ms = (time.time() - start_time) * 1000
            payment_counter.add(1, {
                "currency": payment_request["currency"],
                "status": "success",
            })
            payment_duration.record(duration_ms, {
                "currency": payment_request["currency"],
            })
            payment_amount.record(
                payment_request["amount_cents"] / 100,
                {"currency": payment_request["currency"]}
            )

            span.set_status(Status(StatusCode.OK))
            return charge_result

        except PaymentDeclinedError as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            payment_counter.add(1, {
                "currency": payment_request["currency"],
                "status": "declined",
            })
            raise

Instrumenting a Java Service (Spring Boot)

<!-- pom.xml -->
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
    <version>2.12.0</version>
</dependency>
# application.yml
otel:
  service:
    name: order-service
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: "0.1"
  instrumentation:
    micrometer:
      enabled: true  # Bridge Spring Actuator metrics to OTel
@Service
public class OrderService {

    private final Tracer tracer;
    private final LongCounter orderCounter;
    private final DoubleHistogram orderValueHistogram;

    public OrderService(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("order-service");
        Meter meter = openTelemetry.getMeter("order-service");

        this.orderCounter = meter.counterBuilder("orders.created")
            .setDescription("Number of orders created")
            .build();

        this.orderValueHistogram = meter.histogramBuilder("orders.value")
            .setDescription("Order value in USD")
            .setUnit("USD")
            .build();
    }

    public Order createOrder(CreateOrderRequest request) {
        Span span = tracer.spanBuilder("create_order")
            .setAttribute("order.user_id", request.getUserId())
            .setAttribute("order.item_count", request.getItems().size())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            Order order = processOrder(request);
            span.setAttribute("order.id", order.getId());
            span.setAttribute("order.total_usd", order.getTotalUsd());

            orderCounter.add(1, Attributes.of(
                AttributeKey.stringKey("status"), "success",
                AttributeKey.stringKey("region"), request.getRegion()
            ));
            orderValueHistogram.record(order.getTotalUsd());

            return order;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}

The Grafana Stack: Open Source Observability Backend

The most common open-source backend for OTel data in 2026:

# docker-compose.yml for local development
version: '3.8'
services:
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor,metricsSummary

  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"

  tempo:
    image: grafana/tempo:2.6.0
    ports:
      - "3200:3200"  # Query API; OTLP traffic arrives via the Collector over the compose network, so 4317 is published only on the Collector

  loki:
    image: grafana/loki:3.2.0
    ports:
      - "3100:3100"

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.116.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"

Correlating Traces, Metrics, and Logs

The real power of OTel is cross-signal correlation. In Grafana, you can:

  1. See a spike in payments.duration histogram
  2. Click “View Traces” → see the slow traces during that window
  3. Click a trace → see spans and find the slow SQL query
  4. Click “View Logs” → see the logs from that exact request

This requires linking logs to traces via trace context injection:

import logging
from opentelemetry import trace

class OtelFormatter(logging.Formatter):
    def format(self, record):
        span = trace.get_current_span()
        if span.is_recording():
            ctx = span.get_span_context()
            record.trace_id = format(ctx.trace_id, '032x')
            record.span_id = format(ctx.span_id, '016x')
            record.trace_flags = ctx.trace_flags
        else:
            record.trace_id = "00000000000000000000000000000000"
            record.span_id = "0000000000000000"
            record.trace_flags = 0
        return super().format(record)

# Configure logger
handler = logging.StreamHandler()
handler.setFormatter(OtelFormatter(
    '{"timestamp":"%(asctime)s","level":"%(levelname)s","message":"%(message)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s"}'
))

Tail Sampling: Sample Smart, Not Random

Head-based sampling (randomly sampling X% of requests at the start) loses your most valuable traces — errors and slow requests are rare but critical. Tail sampling defers the decision until the trace completes. Note that the Collector can only judge traces it sees in full: if you rely on tail sampling, configure the SDKs to send 100% of spans (e.g. drop the 10% head sampler shown earlier) and let the Collector decide what to keep.

The OTel Collector’s tail sampling processor (shown in config above) lets you:

  • Always sample errors — 100% of traces with ERROR spans
  • Always sample slow requests — 100% of traces > 500ms
  • Probabilistically sample the rest — 5–10% of successful fast traces

This dramatically reduces storage costs while guaranteeing you capture all anomalous behavior.
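The policy evaluation above amounts to a simple OR over per-trace checks; roughly this logic (illustrative only, not actual Collector code):

```python
import random

def keep_trace(has_error: bool, duration_ms: float,
               sample_rate: float = 0.10) -> bool:
    """Toy version of the tail-sampling decision made per completed trace."""
    if has_error:                  # errors-policy: keep every failure
        return True
    if duration_ms > 500:          # slow-traces-policy: keep every slow trace
        return True
    # probabilistic-policy: keep a fixed fraction of the healthy remainder
    return random.random() < sample_rate
```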


SLO Tracking with OTel Metrics

OpenSLO + OTel metrics = automated SLO dashboards:

# openslo/payment-slo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: payment-availability
spec:
  service: payment-service
  sloIndicator:
    metadata:
      name: payment-success-rate
    spec:
      type: Ratio
      ratioMetric:
        counter: true
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(payments_processed_total{status="success"}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(payments_processed_total[5m]))
  objectives:
    - displayName: "99.9% availability"
      target: 0.999
      timeSliceTarget: 0.95
      timeSliceWindow: 1m
  timeWindow:
    - duration: 30d
      isRolling: true
  alertPolicies:
    - fast-burn-alert
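The 99.9% target above leaves a 0.1% error budget over the 30-day window, and burn-rate alerting (the fast-burn policy referenced) compares the observed error ratio against that budget. The arithmetic, sketched below; the 14.4 threshold is the common Google SRE value, not something defined in the SLO file:

```python
TARGET = 0.999
ERROR_BUDGET = 1 - TARGET        # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """1.0 means burning the budget exactly on schedule."""
    return observed_error_ratio / ERROR_BUDGET

# A classic fast-burn alert fires at burn rate 14.4: at that pace the
# entire 30-day budget is exhausted in 30 / 14.4 ≈ 2.1 days.
print(round(burn_rate(0.0144), 1))  # prints 14.4
```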

Key Takeaways

  1. Start with auto-instrumentation — get 80% of value with zero code changes
  2. Deploy the Collector — never send directly to backends; the Collector gives you flexibility
  3. Use tail sampling in production — head sampling wastes your most valuable data
  4. Correlate everything — link logs to traces via context propagation
  5. Define SLOs before dashboards — know what “good” looks like before you build alerts

OpenTelemetry is the foundation of modern observability. Instrument it right once, and you’ll never be blind to what’s happening in your system again. 🔭
