OpenTelemetry in Production: The Complete Observability Stack for 2026



Why Observability Matters More Than Ever

Distributed systems are the norm. A single user request might touch dozens of microservices, spawn async jobs, and hit three different databases before a response is returned. When something goes wrong at 2 AM, you need to know exactly what happened — not guess.

OpenTelemetry (OTel) has become the industry-standard answer. In 2026, it’s no longer a “nice to have” — it’s the default instrumentation layer for every serious engineering team.

This guide covers the complete OTel stack: what it is, how it works, and how to deploy it in production without drowning in YAML.



What Is OpenTelemetry?

OpenTelemetry is a vendor-neutral, open-source observability framework that unifies the three pillars of observability:

  • Traces — record the path a request takes through your system
  • Metrics — quantitative measurements over time (latency, error rate, throughput)
  • Logs — structured event records correlated with traces

The key word is vendor-neutral. Instrument your code once with OTel SDKs, then send data to any backend: Jaeger, Zipkin, Tempo, Prometheus, Datadog, Honeycomb, New Relic — your choice, switchable without code changes.


The OTel Architecture

Your Services
     │ (OTel SDK)
     ▼
OTel Collector (agent or gateway)
     │
     ├── Prometheus (metrics)
     ├── Tempo / Jaeger (traces)
     └── Loki / Elasticsearch (logs)
            │
            ▼
         Grafana (dashboards + alerts)

The OTel Collector is the linchpin. It receives telemetry from your services, processes/transforms it, and exports to backends. Running a collector means your services never talk directly to backends — swapping backends is a collector config change, not a code change.
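
To make that decoupling concrete, here is a toy Python model of the receive, process, export flow. It only illustrates the data flow and is not how the Collector is implemented; `Pipeline`, `add_environment`, and the two exporter functions are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable

Span = dict  # a span is modelled as a plain dict in this sketch
Processor = Callable[[list], list]
Exporter = Callable[[list], None]


@dataclass
class Pipeline:
    """Toy stand-in for one collector pipeline: processors, then one exporter."""
    processors: list
    exporter: Exporter

    def receive(self, spans: list) -> None:
        # Run every processor in order, then hand the result to the exporter.
        for process in self.processors:
            spans = process(spans)
        self.exporter(spans)


def add_environment(spans: list) -> list:
    # Mirrors the idea of stamping telemetry with a resource attribute.
    return [{**span, "deployment.environment": "production"} for span in spans]


# Two in-memory lists stand in for two different trace backends.
tempo_backend: list = []
jaeger_backend: list = []


def export_to_tempo(spans: list) -> None:
    tempo_backend.extend(spans)


def export_to_jaeger(spans: list) -> None:
    jaeger_backend.extend(spans)


# What the service emits never changes...
service_output = [{"name": "GET /orders", "service": "order-service"}]

pipeline = Pipeline([add_environment], export_to_tempo)
pipeline.receive(service_output)

# ...switching backends only touches the pipeline definition.
pipeline = Pipeline([add_environment], export_to_jaeger)
pipeline.receive(service_output)
```

The service-side data (`service_output`) is identical in both runs; only the pipeline wiring differs, which is the property the real Collector gives you through its config file.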


Automatic Instrumentation (Zero-Code)

For most common frameworks, OTel provides automatic instrumentation that requires no code changes:

Python

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install

# Run your app with auto-instrumentation
opentelemetry-instrument \
  --service_name my-python-app \
  --exporter_otlp_endpoint http://localhost:4317 \
  python app.py

Node.js

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

// instrumentation.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-node-service',
});

sdk.start();

Run with:

node --require ./instrumentation.js server.js

Automatic instrumentation captures:

  • HTTP client/server spans
  • Database queries (PostgreSQL, MySQL, MongoDB, Redis)
  • gRPC calls
  • Message queue operations (Kafka, RabbitMQ)

Manual Instrumentation: Adding Business Context

Auto-instrumentation covers infrastructure. But the most valuable traces include business context: which user made the request, what product they were buying, why a decision was made.

from opentelemetry import trace
from opentelemetry.trace import StatusCode

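tracer = trace.get_tracer("order-service")

# get_user_region, validate_inventory, charge_payment, and InventoryError
# are the application's own code, assumed to be defined elsewhere.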

def process_order(order_id: str, user_id: str):
    with tracer.start_as_current_span("process_order") as span:
        # Add business attributes to the span
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.region", get_user_region(user_id))
        
        try:
            # Validate inventory
            with tracer.start_as_current_span("check_inventory"):
                items = validate_inventory(order_id)
                # Recorded on the outer process_order span, not the check_inventory child
                span.set_attribute("order.item_count", len(items))
            
            # Charge payment
            with tracer.start_as_current_span("charge_payment") as payment_span:
                result = charge_payment(order_id)
                payment_span.set_attribute("payment.method", result.method)
                payment_span.set_attribute("payment.amount_cents", result.amount)
            
            span.set_status(StatusCode.OK)
            return {"status": "success"}
            
        except InventoryError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, str(e))
            raise

Metrics with OTel

OTel metrics replace the need for manual Prometheus instrumentation in many cases:

import time

from opentelemetry import metrics

meter = metrics.get_meter("payment-service")

# Counter — monotonically increasing
payment_counter = meter.create_counter(
    "payments.total",
    description="Total number of payment attempts",
    unit="1"
)

# Histogram — for distributions (latency, sizes)
payment_duration = meter.create_histogram(
    "payments.duration",
    description="Payment processing duration",
    unit="ms"
)

# UpDownCounter: a running total that can rise and fall (add +1 / -1 as sessions open and close)
active_sessions = meter.create_up_down_counter(
    "sessions.active",
    description="Currently active user sessions"
)

def process_payment(amount: float, method: str):
    start = time.time()
    try:
        result = _do_payment(amount, method)
        payment_counter.add(1, {"status": "success", "method": method})
        return result
    except Exception:
        payment_counter.add(1, {"status": "error", "method": method})
        raise
    finally:
        duration_ms = (time.time() - start) * 1000
        payment_duration.record(duration_ms, {"method": method})
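
To see why a histogram suits latency data, it helps to look at what is kept after each record() call. The sketch below is a simplified, standard-library model of explicit-bucket aggregation, not the SDK's implementation; the class name and the bucket bounds are assumptions chosen for illustration.

```python
from bisect import bisect_left

# Illustrative bucket upper bounds in milliseconds (an assumption for this sketch).
BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]


class ExplicitBucketHistogram:
    """Keeps per-bucket counts plus sum and count, never the raw values."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One bucket per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left returns the index of the first bound >= value,
        # so bucket i holds values in (bounds[i-1], bounds[i]].
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1


hist = ExplicitBucketHistogram(BOUNDS_MS)
for duration_ms in [3.2, 7.0, 48.0, 50.0, 120.0, 1800.0]:
    hist.record(duration_ms)

print(hist.counts)  # per-bucket counts, last entry is the overflow bucket
```

Memory use stays fixed no matter how many payments are recorded, and a backend can estimate percentiles from the bucket counts. The trade-off is resolution: two values in the same bucket are indistinguishable afterwards.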

The OTel Collector Configuration

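The collector is configured via YAML. Here’s a starting point for a gateway collector. Note that it disables TLS toward Tempo (insecure: true), which is only appropriate on a trusted internal network: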

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Batch for efficiency
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  # Add resource attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  
  # Sample high-volume, low-value traces
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  
  # Redact PII from spans
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: payment.card_number
        action: delete

exporters:
  # Traces → Tempo
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  
  # Metrics → Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
  
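  # Logs → Loki (assumes Loki 3.x with native OTLP ingestion;
  # the dedicated loki exporter is deprecated in newer collector releases)
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample and redact first, batch last so batches only contain kept spans
      processors: [resource, tail_sampling, attributes, batch]
      exporters: [otlp/tempo]
    
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
    
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp/loki]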
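
The tail_sampling policies in this config can be read as "keep the trace if any policy says so". Here is a minimal Python sketch of that decision under simplifying assumptions: the `Span` class and `keep_trace` function are hypothetical, trace latency is approximated by the longest span, and the probabilistic policy is modelled by hashing the trace ID. It is not the Collector's actual algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=500, sampling_percentage=10):
    """Decide whether to keep one complete trace; any matching policy keeps it."""
    if not spans:
        return False

    # Policy 1: keep every trace that contains an error span.
    if any(span.is_error for span in spans):
        return True

    # Policy 2: keep slow traces (approximated here by the longest span).
    if max(span.duration_ms for span in spans) > latency_threshold_ms:
        return True

    # Policy 3: keep a fixed percentage of everything else. Hashing the
    # trace ID makes the decision repeatable for the same trace.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sampling_percentage


checkout = [Span("trace-1", 35.0), Span("trace-1", 20.0, is_error=True)]
slow_search = [Span("trace-2", 800.0), Span("trace-2", 640.0)]

print(keep_trace(checkout))     # kept by the error policy
print(keep_trace(slow_search))  # kept by the latency policy
```

Note that the function needs all spans of a trace before it can decide. That is what decision_wait buys in the config above: the collector buffers spans for that long before evaluating the policies.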

Kubernetes Deployment with the OTel Operator

In Kubernetes, the OpenTelemetry Operator automates instrumentation injection:

# Define how Python pods in this namespace are instrumented (pods opt in via the annotation below)
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: production
spec:
  exporter:
    # Python auto-instrumentation exports OTLP over HTTP/protobuf by default, so target the HTTP port
    endpoint: http://otel-collector:4318
  python:
    env:
      - name: OTEL_SERVICE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.labels['app']
---
# Annotate a deployment to inject instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"

No code changes, no Dockerfile modifications — just an annotation.


Grafana: Connecting the Dots

With Grafana as your visualization layer, you can:

  1. Explore traces in Tempo, filtered by service or error status
  2. Click a trace → see all spans with timing and attributes
  3. Click a log line in Loki → jump to the correlated trace in Tempo
  4. Build metrics dashboards with alerts routed to PagerDuty/Slack

The Grafana LGTM stack (Loki + Grafana + Tempo + Mimir) has become the self-hosted OTel golden path in 2026.
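
The log-to-trace jump only works if each log line carries the trace ID. Below is a standard-library sketch of that idea. The `current_trace_id` context variable is a hypothetical stand-in for the active span context, which OTel's logging integration would normally supply, and the JSON field names are assumptions.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID; with OTel this value
# would come from the current span context instead.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log backend can index trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Pretend a request with this trace ID is being handled.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment declined for order %s", "A-1001")

log_line = json.loads(stream.getvalue())
print(log_line["trace_id"])
```

With the ID present in every line, a Grafana data source link can turn the field into a clickable trace lookup.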


Common Pitfalls

  1. Not sampling — sending 100% of traces at scale will bankrupt you in storage costs. Use tail-based sampling.
  2. Missing context propagation — async queues (Kafka, SQS) need explicit context propagation headers or you’ll get broken trace trees.
  3. Too many high-cardinality attributes: span.setAttribute("user.id", user_id) is fine on a span; span.setAttribute("request.body", raw_body) bloats every span, and unbounded values like these on metric attributes are a cardinality time-bomb.
  4. Skipping the collector — sending directly from services to backends makes backend migration impossible without code changes.
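
Pitfall 2 is easier to avoid once you see what propagation means on the wire. The sketch below hand-rolls a W3C traceparent header for a queue message using only the standard library. In real code you would normally let OTel's propagation API do the injecting and extracting; the message shape and helper names here are assumptions, and the validation is deliberately minimal.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Producer side: write the current span's identity into message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Consumer side: return (trace_id, parent_span_id, sampled), or None if absent/invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, (int(flags, 16) & 1) == 1


# Producer: a span is active while the message is published.
producer_trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
message = {"headers": {}, "body": b'{"order_id": "A-1001"}'}
inject(message["headers"], producer_trace_id, producer_span_id)

# Consumer: continue the same trace instead of starting a new one.
context = extract(message["headers"])
if context is None:
    # No usable header: this is where a broken trace tree would begin.
    consumer_trace_id, parent_span_id = secrets.token_hex(16), None
else:
    consumer_trace_id, parent_span_id, _sampled = context
consumer_span_id = secrets.token_hex(8)

print(consumer_trace_id == producer_trace_id)
```

If the producer skips the inject step, the consumer falls into the `None` branch and starts a fresh trace ID, which is exactly the disconnected tree the pitfall describes.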

Conclusion

OpenTelemetry in 2026 is the closest thing we have to a universal observability standard. The auto-instrumentation story is mature, the ecosystem is rich, and the vendor-neutral model means you own your data and your choices.

Start with auto-instrumentation to get immediate value, add manual spans for business context, and deploy the OTel Collector to stay flexible. Your future on-call self will thank you.

