OpenTelemetry in 2026: The Observability Standard That Finally Won
Introduction
For years, the observability space was fragmented chaos. Want traces? Pick between Jaeger, Zipkin, or a vendor-specific SDK. Metrics? Prometheus or StatsD or something proprietary. Logs? Whatever you felt like. Every tool had different instrumentation libraries, different data formats, different export pipelines.
Then OpenTelemetry (OTel) arrived — and by 2026, it has definitively won. OTel is now the default instrumentation layer for virtually every new distributed system, and the migration from legacy systems is largely complete.
This post covers what OTel is, why it won, and — critically — how to implement it correctly in a production system.
What OpenTelemetry Actually Is
OTel is not a monitoring tool. It is not a database for your telemetry data. It is an instrumentation framework and wire protocol — it defines how to collect observability data from your applications and how to export it to wherever you want to analyze it.
The three pillars:
- Traces — Distributed request traces showing the journey of a request through your system
- Metrics — Numerical measurements over time (counters, gauges, histograms)
- Logs — Structured event records, correlated with traces and metrics via context propagation
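Correlation between the pillars depends on every signal carrying the same trace identifiers. As a rough illustration (the interface and helper below are invented for this sketch, not part of the OTel SDK), a log record correlated with a trace looks something like this:

```typescript
// Sketch of a structured log record correlated with a trace, loosely
// following the OTel log data model. The hex IDs used below are
// made-up placeholders.
interface CorrelatedLogRecord {
  timestamp: string;
  severityText: string;
  body: string;
  traceId: string; // 32 hex chars: links this log line to a whole trace
  spanId: string;  // 16 hex chars: links it to one specific span
  attributes: Record<string, string>;
}

function makeLogRecord(
  body: string,
  traceId: string,
  spanId: string,
  attributes: Record<string, string> = {}
): CorrelatedLogRecord {
  return {
    timestamp: new Date().toISOString(),
    severityText: 'INFO',
    body,
    traceId,
    spanId,
    attributes,
  };
}

const rec = makeLogRecord(
  'payment validated',
  '4bf92f3577b34da6a3ce929d0e0e4736',
  '00f067aa0ba902b7'
);
console.log(JSON.stringify(rec));
```

Because the backend can index `traceId`, clicking a slow trace can pull up exactly the log lines emitted while that trace was in flight.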
The key architectural insight of OTel: separate instrumentation from backends. You instrument your code once with OTel, then route your data to any compatible backend — Grafana, Datadog, Honeycomb, Jaeger, Prometheus, whatever.
The OTel Architecture
┌────────────────────────────────────────────────────────┐
│                    Your Application                    │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   OTel SDK   │  │   OTel SDK   │  │   OTel SDK   │  │
│  │ (Service A)  │  │ (Service B)  │  │ (Service C)  │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  │
└─────────┼─────────────────┼─────────────────┼──────────┘
          │                 │                 │
          └─────────────────┼─────────────────┘
                            │  OTLP (gRPC or HTTP)
                            ▼
               ┌─────────────────────────┐
               │      OTel Collector     │
               │  ┌──────────────────┐   │
               │  │    Receivers     │   │
               │  │    Processors    │   │
               │  │    Exporters     │   │
               │  └──────────────────┘   │
               └────────────┬────────────┘
                            │
                ┌───────────┼───────────┐
                ▼           ▼           ▼
             Grafana     Datadog    Honeycomb
            Tempo/Prom     APM      (Traces)
The Collector Is Not Optional
One of the most common OTel mistakes: skipping the Collector and exporting directly from your SDK to a backend. The Collector is where you gain:
- Tail-based sampling — Sample based on whether a trace had errors, not randomly
- Data transformation — Redact PII, add environment tags, normalize attribute names
- Multi-export fan-out — Send the same data to multiple backends simultaneously
- Buffering and retry — Resilient delivery even if your backend is temporarily down
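To make the tail-based sampling bullet concrete, here is a rough sketch of the decision the Collector makes once every span of a trace has been buffered. The names and structure are invented for illustration, not the Collector's actual internals: keep errored traces, keep slow traces, and keep only a small random fraction of the healthy rest.

```typescript
// Minimal model of a tail-sampling decision over a complete trace.
interface SpanSummary {
  durationMs: number;
  hasError: boolean;
}

function sampleTrace(
  spans: SpanSummary[],
  latencyThresholdMs = 1000, // mirrors a "slow traces" policy
  baselineRate = 0.05,       // mirrors a 5% probabilistic policy
  rng: () => number = Math.random
): boolean {
  // Errors policy: any failed span means the whole trace is kept
  if (spans.some((s) => s.hasError)) return true;
  // Latency policy: approximate trace latency by the longest span
  const latencyMs = spans.reduce((max, s) => Math.max(max, s.durationMs), 0);
  if (latencyMs >= latencyThresholdMs) return true;
  // Otherwise keep only a probabilistic baseline
  return rng() < baselineRate;
}

// An errored trace is kept regardless of the random draw:
console.log(sampleTrace([{ durationMs: 20, hasError: true }], 1000, 0.05, () => 0.99)); // true
```

This decision is only possible after the trace is complete, which is exactly why it has to live in the Collector rather than in each SDK.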
Instrumentation in Practice
Auto-Instrumentation (The Right Starting Point)
Most frameworks have zero-code OTel auto-instrumentation. Start here:
Node.js:
// instrumentation.js — loaded before your app
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 10000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
# Start your app with auto-instrumentation
node --require ./instrumentation.js app.js
This automatically instruments: HTTP calls, database queries (pg, mysql, mongodb), Redis, gRPC, message queues, and more. Zero code changes to your business logic.
Python:
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run with auto-instrumentation
opentelemetry-instrument \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py
Java (Spring Boot):
<!-- pom.xml -->
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>2.12.0</version>
</dependency>
# application.yml
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
  service:
    name: my-spring-service
Manual Instrumentation: Adding Business Context
Auto-instrumentation gets you infrastructure telemetry. For business-level context, you need manual spans:
import { trace, metrics, context, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service', '1.0.0');
const meter = metrics.getMeter('payment-service', '1.0.0');
// Business-level metrics
const paymentCounter = meter.createCounter('payment.processed', {
  description: 'Number of payments processed',
});
const paymentDuration = meter.createHistogram('payment.duration', {
  description: 'Payment processing duration in milliseconds',
  unit: 'ms',
});
const paymentValue = meter.createHistogram('payment.value_usd', {
  description: 'Payment value in USD',
  unit: 'USD',
});
async function processPayment(order: Order): Promise<PaymentResult> {
  return tracer.startActiveSpan('process-payment', async (span) => {
    const startTime = Date.now();

    // Add business attributes to the trace
    span.setAttributes({
      'payment.order_id': order.id,
      'payment.amount_usd': order.totalUSD,
      'payment.method': order.paymentMethod,
      'payment.currency': order.currency,
      'customer.tier': order.customer.tier,
    });

    try {
      // Validate — end the child span even if validation throws
      await tracer.startActiveSpan('validate-payment', async (validateSpan) => {
        try {
          await validatePaymentDetails(order);
        } finally {
          validateSpan.end();
        }
      });

      // Charge
      const result = await tracer.startActiveSpan('charge-card', async (chargeSpan) => {
        try {
          chargeSpan.setAttributes({
            'payment.processor': 'stripe',
            'payment.stripe_customer_id': order.customer.stripeId,
          });
          const res = await stripeClient.charges.create({
            amount: Math.round(order.totalUSD * 100),
            currency: 'usd',
            customer: order.customer.stripeId,
          });
          chargeSpan.setAttributes({ 'payment.stripe_charge_id': res.id });
          return res;
        } finally {
          chargeSpan.end();
        }
      });

      // Record metrics
      const duration = Date.now() - startTime;
      paymentCounter.add(1, {
        'payment.method': order.paymentMethod,
        'payment.status': 'success',
        'customer.tier': order.customer.tier,
      });
      paymentDuration.record(duration, { 'payment.method': order.paymentMethod });
      paymentValue.record(order.totalUSD, { 'customer.tier': order.customer.tier });

      span.setStatus({ code: SpanStatusCode.OK });
      span.end();
      return { success: true, chargeId: result.id };
    } catch (error: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      paymentCounter.add(1, {
        'payment.method': order.paymentMethod,
        'payment.status': 'error',
        'payment.error_type': error.constructor.name,
      });
      span.end();
      throw error;
    }
  });
}
The OTel Collector Configuration
The Collector is the most powerful and most underutilized part of the OTel stack:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Batch for efficiency
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Memory protection
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128

  # Add environment context
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: cloud.provider
        value: aws
        action: upsert

  # Tail-based sampling — only sample slow/errored traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces-policy
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-otherwise
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

  # Redact PII
  transform:
    trace_statements:
      - context: span
        statements:
          # Redact credit card numbers from attributes
          - replace_pattern(attributes["payment.card_number"], "\\d{12}(\\d{4})", "************$1")
          # Remove auth tokens
          - delete_key(attributes, "http.request.header.authorization")

exporters:
  # Grafana Tempo for traces
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Prometheus for metrics
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  # Loki for logs
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "environment"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, transform, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
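As a sanity check on the redaction statement in the transform processor above, the `replace_pattern` call behaves like the following regex substitution. This is a standalone TypeScript sketch of the same pattern, not the Collector's OTTL engine:

```typescript
// Mirror of the \d{12}(\d{4}) pattern in the Collector config:
// mask the first 12 digits of a 16-digit card number, keep the last 4.
function redactCardNumber(value: string): string {
  return value.replace(/\d{12}(\d{4})/, '************$1');
}

console.log(redactCardNumber('4242424242424242')); // "************4242"
```

Running a check like this in your test suite against sample attribute values is a cheap way to catch regex mistakes before real PII leaves the network.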
Connecting Traces, Metrics, and Logs
The real power of OTel is correlation. With proper context propagation, you can jump from a slow trace → to the metric spike that caused it → to the log lines that explain it.
The key: always propagate context through async boundaries.
import { context, propagation, trace } from '@opentelemetry/api';
// Publishing to a message queue — carry the trace context
async function publishOrderEvent(order: Order) {
  const carrier: Record<string, string> = {};

  // Inject current trace context into message headers
  propagation.inject(context.active(), carrier);

  await messageQueue.publish('order.created', {
    data: order,
    headers: carrier, // W3C TraceContext headers
  });
}

// Consuming from the queue — restore the trace context
async function handleOrderEvent(message: QueueMessage) {
  // Extract trace context from message headers
  const parentContext = propagation.extract(context.active(), message.headers);

  // Run handler within the extracted context — the span will be a child of the original trace
  await context.with(parentContext, async () => {
    const tracer = trace.getTracer('order-processor');
    await tracer.startActiveSpan('process-order-event', async (span) => {
      await processOrder(message.data);
      span.end();
    });
  });
}
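What `propagation.inject` actually writes into the carrier is, by default, a W3C `traceparent` header: a version byte, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte. A minimal parser makes the format concrete (this is an illustrative sketch; the OTel SDK's own propagator handles this for you):

```typescript
// Parse a W3C Trace Context `traceparent` header of the form:
//   00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // An all-zero trace ID or span ID is invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

const tp = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(tp?.sampled); // true
```

Because the format is a plain string, it survives any transport that carries key-value metadata: HTTP headers, queue message headers, even gRPC metadata.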
Semantic Conventions: The Hidden Key to OTel Success
The most underappreciated part of OTel is the Semantic Conventions specification — standardized attribute names for common operations.
Instead of everyone using different attribute names for the same concepts:
db.query vs sql.query vs database.statement
OTel standardizes on:
- db.statement for database queries
- http.request.method for HTTP methods
- net.peer.name for remote hostnames
- rpc.system for the RPC framework
Following semantic conventions means your dashboards, alerts, and queries work across services and languages without customization. This is what makes OTel backends like Grafana and Honeycomb so powerful — they can build generic UI because the data is consistently structured.
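One practical way to adopt the conventions in a codebase that grew up with home-grown attribute names is a normalization shim at the instrumentation boundary. The alias map below is an invented example, not an official OTel artifact:

```typescript
// Map legacy attribute names onto OTel semantic-convention names so that
// dashboards built against the standard names work for every service.
const SEMCONV_ALIASES: Record<string, string> = {
  'sql.query': 'db.statement',
  'database.statement': 'db.statement',
  'http.verb': 'http.request.method',
  'remote.host': 'net.peer.name',
};

function normalizeAttributes(
  attrs: Record<string, string>
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(attrs)) {
    // Unknown keys pass through unchanged
    out[SEMCONV_ALIASES[key] ?? key] = value;
  }
  return out;
}

const normalized = normalizeAttributes({ 'sql.query': 'SELECT 1', 'http.verb': 'GET' });
// normalized['db.statement'] is 'SELECT 1', normalized['http.request.method'] is 'GET'
```

The same normalization can also live in the Collector's transform processor, which keeps the fix out of application code entirely.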
Production Checklist
Before going live with OTel, verify:
- Sampling is configured — Never send 100% of traces to production backends
- Tail-based sampling is used for production (not head-based)
- Memory limiter is configured in the Collector
- PII is redacted before traces leave your network
- Context propagation works across async boundaries (queues, scheduled jobs)
- Semantic conventions are followed for database, HTTP, and messaging spans
- Resource attributes (service.name, deployment.environment) are set
- Collector is deployed as a sidecar or DaemonSet (not shared across clusters)
- Exemplars are enabled (links from Prometheus metrics to Jaeger traces)
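The back-of-envelope arithmetic behind the tail-based sampling item is worth spelling out. With head-based sampling, the keep decision happens before the outcome is known, so error traces are kept at the same rate as everything else; tail-based sampling keeps every error and samples only the healthy remainder. A small calculation sketch (invented function, illustrative numbers):

```typescript
// Compare how many error traces survive head-based vs tail-based sampling.
function expectedKeptTraces(
  total: number,        // total traces in the window
  errorRate: number,    // fraction of traces with errors
  baselineRate: number  // probabilistic sampling rate
): { headBasedErrorsKept: number; tailBasedErrorsKept: number; tailBasedTotalKept: number } {
  const errors = total * errorRate;
  return {
    headBasedErrorsKept: errors * baselineRate,               // errors sampled blindly
    tailBasedErrorsKept: errors,                              // every error kept
    tailBasedTotalKept: errors + (total - errors) * baselineRate,
  };
}

console.log(expectedKeptTraces(1_000_000, 0.001, 0.05));
// At 1M traces, 0.1% errors, 5% baseline: head-based keeps only about 50
// of the 1,000 error traces; tail-based keeps all 1,000.
```

The total kept volume barely changes, but the traces you actually need during an incident go from a 1-in-20 lottery to a guarantee.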
Conclusion
OpenTelemetry has done something rare in infrastructure: it created a true standard that the entire industry aligned on. Vendors compete on their analysis UI and query engines, not on proprietary instrumentation lock-in. Developers instrument once and switch backends freely.
The learning curve is real — OTel has a lot of moving parts. But the investment pays off immediately: better debugging, faster incident response, and the freedom to change your observability stack without re-instrumenting everything.
In 2026, “I don’t have observability” is a choice, not a technical limitation. OTel removed that excuse.
Resources
- OpenTelemetry Official Docs
- OTel Semantic Conventions
- OpenTelemetry Collector
- Grafana LGTM Stack
- Honeycomb OTel Guide
