OpenTelemetry in Production: End-to-End Observability for Microservices in 2026

Observability has crossed the chasm from “nice to have” to “table stakes” for production systems. But instrumenting dozens of microservices across multiple languages used to mean maintaining separate integrations for each observability backend. OpenTelemetry (OTel) changed that permanently. In 2026, OTel is the universal standard — vendor-neutral, runtime-agnostic, and stable across all three signal types: traces, metrics, and logs.

This guide walks you through setting up production-grade observability from scratch using OpenTelemetry with Grafana’s LGTM stack (Loki, Grafana, Tempo, Mimir).



The Three Pillars of Observability

Signal  | What It Tells You                        | OTel Component
------- | ---------------------------------------- | -------------------------------
Traces  | How a request flows through your system  | OTLP traces → Tempo
Metrics | Aggregated system health over time       | OTLP metrics → Mimir/Prometheus
Logs    | Timestamped event records                | OTLP logs → Loki

The OTel promise: instrument once, export anywhere.
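That promise is concrete: the SDKs read standardized OTEL_* environment variables defined in the OpenTelemetry specification, so retargeting telemetry is a deployment change, not a code change. A minimal sketch (the endpoint value is illustrative):

```shell
# Standard OTel SDK env vars — point the same binary at any OTLP backend.
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318  # OTLP/HTTP port
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
echo "exporting to $OTEL_EXPORTER_OTLP_ENDPOINT"
```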


Architecture Overview

Your Services (Node/Python/Java/Go)
    ↓ OTLP (gRPC/HTTP)
OTel Collector (sidecar or daemonset)
    ↓ ↓ ↓
  Tempo  Mimir  Loki
    ↓ ↓ ↓
    Grafana (unified UI)

The OTel Collector is the backbone: it receives telemetry from your services, transforms it, and routes it to the appropriate backends.


Setting Up the Collector

docker-compose.yml

version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.115.0
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
      - "8888:8888"    # Collector metrics
      - "8889:8889"    # Prometheus scrape endpoint
    depends_on:
      - tempo
      - loki
      - mimir

  tempo:
    image: grafana/tempo:2.6.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"

  loki:
    image: grafana/loki:3.3.0
    command: ["-config.file=/etc/loki/local-config.yaml"]
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  mimir:
    image: grafana/mimir:2.14.0
    command: ["--config.file=/etc/mimir/mimir.yaml"]
    volumes:
      - ./mimir-config.yaml:/etc/mimir/mimir.yaml
      - mimir-data:/data
    ports:
      - "9009:9009"

  grafana:
    image: grafana/grafana:11.4.0
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor,metricsSummary
    volumes:
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  tempo-data:
  loki-data:
  mimir-data:
  grafana-data:

otel-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  # Add resource attributes to all signals
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

  # Memory limiter prevents OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100

  # Tail sampling: policies are OR-combined — keep all errors,
  # all slow traces, and 10% of everything else
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces-policy
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sampling-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true

  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

Instrumenting a Node.js Service

Auto-Instrumentation (Zero Code Changes)

// tracing.ts — import BEFORE anything else
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-grpc";
import { OTLPLogExporter } from "@opentelemetry/exporter-logs-otlp-grpc";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { BatchLogRecordProcessor } from "@opentelemetry/sdk-logs";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "order-service",
    [ATTR_SERVICE_VERSION]: "2.1.0",
    "deployment.environment": process.env.NODE_ENV ?? "development",
  }),

  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4317",
  }),

  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://otel-collector:4317",
    }),
    exportIntervalMillis: 10_000,
  }),

  logRecordProcessor: new BatchLogRecordProcessor(
    new OTLPLogExporter({ url: "http://otel-collector:4317" })
  ),

  // Auto-instruments: express, http, pg, redis, grpc, and 40+ more
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-http": {
        ignoreIncomingRequestHook: (req) => req.url === "/health",
      },
      "@opentelemetry/instrumentation-pg": {
        enhancedDatabaseReporting: true, // Include SQL queries in spans
      },
    }),
  ],
});

sdk.start();
process.on("SIGTERM", () => sdk.shutdown());
Compile tracing.ts to JavaScript, then preload it with Node's -r flag so it runs before your application code:

# package.json start command
node -r ./tracing.js dist/server.js

That’s it. Express routes, database queries, outbound HTTP calls, and Redis operations are all automatically traced.

Manual Instrumentation for Business Logic

import { trace, metrics, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");
const meter = metrics.getMeter("order-service");

// Custom metrics
const orderCounter = meter.createCounter("orders.created", {
  description: "Number of orders created",
  unit: "{order}",
});

const processingTime = meter.createHistogram("order.processing.duration", {
  description: "Time to process an order",
  unit: "ms",
});

export async function processOrder(orderId: string, items: OrderItem[]) {
  // Create a custom span
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      // Add semantic attributes
      span.setAttribute("order.id", orderId);
      span.setAttribute("order.item_count", items.length);
      span.setAttribute("order.total_value", calculateTotal(items));

      const start = performance.now();

      // Validate inventory
      span.addEvent("inventory_check_started");
      await checkInventory(items);
      span.addEvent("inventory_check_completed");

      // Create order in DB
      const order = await db.orders.create({ orderId, items });
      
      // Record metrics
      const duration = performance.now() - start;
      orderCounter.add(1, { "order.status": "success", "order.region": "us-east" });
      processingTime.record(duration, { "order.item_count": items.length });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      // Record error details
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      orderCounter.add(1, { "order.status": "error" });
      throw error;
    } finally {
      span.end();
    }
  });
}

Structured Logging with Trace Correlation

The real power of OTel comes from correlating logs with traces. When a log message includes trace_id and span_id, Grafana can link directly from a log line to the trace that produced it.

import { logs, SeverityNumber } from "@opentelemetry/api-logs";
import { context, trace } from "@opentelemetry/api";

const logger = logs.getLogger("order-service");

function log(
  severity: SeverityNumber,
  message: string,
  attributes?: Record<string, string | number | boolean>
) {
  const activeSpan = trace.getActiveSpan();
  const spanContext = activeSpan?.spanContext();

  logger.emit({
    severityNumber: severity,
    severityText: SeverityNumber[severity],
    body: message,
    attributes: {
      ...attributes,
      // Auto-correlate with active trace
      "trace.id": spanContext?.traceId,
      "span.id": spanContext?.spanId,
    },
  });
}

// Usage
log(SeverityNumber.INFO, "Order created successfully", {
  "order.id": order.id,
  "user.id": userId,
});

Grafana Dashboards and Alerts

Provisioned Datasource Configuration

# grafana/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: mimir

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace.id":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Mimir
    type: prometheus
    uid: mimir
    url: http://mimir:9009/prometheus

Key Alerts to Configure

# Grafana alert rule (via API or dashboard)
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_duration_seconds_count{
            job="order-service",
            http_response_status_code=~"5.."
          }[5m])) /
          sum(rate(http_server_request_duration_seconds_count{
            job="order-service"
          }[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% for order-service"
          
      - alert: SlowP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_request_duration_seconds_bucket{
              job="order-service"
            }[5m])) by (le, http_route)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 2s on {{ $labels.http_route }}"

Python Service Instrumentation

# tracing.py
import os

from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, DEPLOYMENT_ENVIRONMENT
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

def setup_telemetry(service_name: str):
    # Traces
    tracer_provider = TracerProvider(
        resource=Resource.create({
            SERVICE_NAME: service_name,
            DEPLOYMENT_ENVIRONMENT: os.getenv("ENVIRONMENT", "dev"),
        })
    )
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(tracer_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317"),
        export_interval_millis=10_000,
    )
    meter_provider = MeterProvider(
        resource=tracer_provider.resource,
        metric_readers=[metric_reader],
    )
    metrics.set_meter_provider(meter_provider)

    # Auto-instrumentations
    FastAPIInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()

Production Checklist

  • Sampling configured: Don’t send 100% of traces in high-traffic environments
  • Sensitive data scrubbed: Remove PII from span attributes (credit card numbers, passwords)
  • Collector resource limits set: Memory limiter prevents cascade failures
  • Trace context propagated: traceparent header forwarded through all service calls
  • Baseline alerts defined: Error rate, latency P99, saturation
  • Log-trace correlation enabled: trace_id in structured logs
  • Retention policies configured: Don’t keep 13 months of raw traces
  • Cost monitored: High-cardinality metrics can blow up storage costs
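The trace-context bullet above refers to the W3C traceparent header, which auto-instrumentation forwards for you on HTTP calls. To illustrate what actually travels between services, here is a small, hypothetical helper that parses and validates that header (pure TypeScript, no SDK required):

```typescript
// Hypothetical helper: parse a W3C `traceparent` header.
// Format: version "-" trace-id "-" parent-id "-" trace-flags (lowercase hex)
interface TraceParent {
  version: string;
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Example: a sampled traceparent as produced by an upstream service
const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
console.log(parsed?.sampled); // true
```

If this returns null for a header one of your services emits, that service is breaking context propagation and its spans will appear as orphaned root traces in Tempo.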

Conclusion

OpenTelemetry in 2026 has reached a level of maturity that makes “observability as an afterthought” inexcusable. The collector-based architecture means you can switch backends without changing application code. Auto-instrumentation handles the boilerplate, and manual instrumentation fills in the business-logic gaps. With the LGTM stack, you get a fully integrated observability platform at near-zero licensing cost. Start with auto-instrumentation, add custom metrics for your key business events, configure alert thresholds for your SLOs, and you’ll have production-grade observability in a day.
