OpenTelemetry: The Complete Guide to Modern Observability
In the world of distributed systems, understanding what’s happening across your services is crucial. OpenTelemetry (OTel) has emerged as the industry standard for observability, providing a unified approach to traces, metrics, and logs.
What is OpenTelemetry?
OpenTelemetry is a vendor-neutral, open-source observability framework that provides:
- Traces: Follow requests across service boundaries
- Metrics: Measure system and application performance
- Logs: Contextual event data linked to traces
Why OpenTelemetry?
- Vendor Agnostic: Export to any backend (Jaeger, Zipkin, Datadog, etc.)
- Industry Standard: CNCF project with broad industry adoption
- Unified SDK: One instrumentation for all telemetry types
- Auto-Instrumentation: Minimal code changes required
Core Concepts
Traces and Spans
A trace represents an end-to-end request journey. Each operation is a span:
Trace: user-checkout
├── Span: api-gateway (50ms)
│   ├── Span: auth-service (10ms)
│   └── Span: order-service (35ms)
│       ├── Span: inventory-check (15ms)
│       └── Span: payment-process (18ms)
└── Total: 50ms
Metrics Types
- Counter: Cumulative values (requests, errors)
- Gauge: Point-in-time values (CPU usage, queue size)
- Histogram: Distribution of values (latency buckets)
Getting Started with Python
Installation
pip install opentelemetry-api \
    opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp
Basic Setup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
# Configure resource (identifies your service)
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})
# Set up tracer provider
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# Get tracer
tracer = trace.get_tracer(__name__)
Manual Instrumentation
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    span = trace.get_current_span()

    # Add attributes
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.type", "standard")

    # Create a child span
    with tracer.start_as_current_span("validate_payment") as child:
        child.set_attribute("payment.method", "credit_card")
        result = validate_payment(order_id)
        if not result:
            child.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(PaymentError("Validation failed"))

    return complete_order(order_id)
Auto-Instrumentation
The real power of OTel—automatic instrumentation with zero code changes:
# Install auto-instrumentation
pip install opentelemetry-distro
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation
opentelemetry-instrument \
    --service_name order-service \
    --exporter_otlp_endpoint localhost:4317 \
    python app.py
This automatically instruments:
- HTTP frameworks (FastAPI, Flask, Django)
- Database clients (SQLAlchemy, psycopg2)
- HTTP clients (requests, httpx)
- Message queues (Celery, Kafka)
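The settings passed as flags to opentelemetry-instrument above can also be supplied through the standard OTel environment variables, which is often more convenient in containers. A minimal sketch, mirroring the command above:

# Equivalent configuration via standard OTel environment variables
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp

opentelemetry-instrument python app.py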
FastAPI Integration
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

app = FastAPI()

# Instrument FastAPI
FastAPIInstrumentor.instrument_app(app)

# Instrument outgoing HTTP calls
HTTPXClientInstrumentor().instrument()

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    # This span is automatically created by the FastAPI instrumentation;
    # we only add custom attributes to it
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    return {"order_id": order_id, "status": "completed"}
Metrics Implementation
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Set up metrics
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="localhost:4317"),
    export_interval_millis=60000
)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)

# Create instruments
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total HTTP requests",
    unit="1"
)

latency_histogram = meter.create_histogram(
    name="http_request_duration_seconds",
    description="HTTP request latency",
    unit="s"
)

active_connections = meter.create_up_down_counter(
    name="active_connections",
    description="Number of active connections"
)

# Use the instruments
def handle_request(method: str, path: str):
    start = time.time()
    active_connections.add(1)
    try:
        result = process_request()
        request_counter.add(1, {"method": method, "path": path, "status": "200"})
        return result
    except Exception:
        request_counter.add(1, {"method": method, "path": path, "status": "500"})
        raise
    finally:
        duration = time.time() - start
        latency_histogram.record(duration, {"method": method, "path": path})
        active_connections.add(-1)
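One metric type from the earlier list is not shown above: the gauge. In the Python SDK a gauge is typically an observable instrument that reports its value from a callback at each collection, rather than being updated inline. A minimal sketch, using a hypothetical in-memory queue as the observed value:

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter(__name__)

order_queue = []  # illustrative in-memory queue whose depth we report

def observe_queue_size(options: CallbackOptions):
    # Called by the SDK at each collection interval
    yield Observation(len(order_queue), {"queue.name": "orders"})

queue_size_gauge = meter.create_observable_gauge(
    name="order_queue_size",
    callbacks=[observe_queue_size],
    description="Current depth of the order queue"
)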
Logging Integration
Connect your logs to traces:
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# Set up log provider
log_provider = LoggerProvider(resource=resource)
log_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="localhost:4317"))
)
set_logger_provider(log_provider)

# Add OTel handler to Python logging
handler = LoggingHandler(level=logging.INFO)
logging.getLogger().addHandler(handler)

# Logs are now correlated with traces!
logger = logging.getLogger(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    logger.info(f"Processing order {order_id}")  # Linked to current span
    # ...
OpenTelemetry Collector
The OTel Collector is a proxy that receives, processes, and exports telemetry:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
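A common way to try the Collector locally is the contrib Docker image, mounting this file over its default config path. This is a sketch: image tags and the mount path are assumptions to adapt to your setup, and exporter availability varies by Collector version (newer releases drop the dedicated jaeger and loki exporters in favour of plain OTLP export, since both backends now ingest OTLP directly).

# Run the Collector (contrib distribution) with the config above
docker run --rm \
    -p 4317:4317 -p 4318:4318 \
    -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
    otel/opentelemetry-collector-contrib:latest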
Best Practices
1. Semantic Conventions
Follow OTel semantic conventions for consistent attributes:
from opentelemetry.semconv.trace import SpanAttributes
span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
span.set_attribute(SpanAttributes.HTTP_URL, "/api/orders")
span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
2. Sampling Strategy
Control trace volume with sampling:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased
# Sample 10% of traces, but always sample if parent was sampled
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler, resource=resource)
3. Context Propagation
Ensure trace context flows across services:
import httpx

from opentelemetry.propagate import inject, extract

# Inject context into outgoing requests
headers = {}
inject(headers)
response = httpx.get("http://service-b/api", headers=headers)

# Extract context from incoming requests
context = extract(request.headers)
with tracer.start_as_current_span("handle", context=context):
    ...  # Process the request
Conclusion
OpenTelemetry provides the foundation for modern observability. With unified instrumentation for traces, metrics, and logs, you can gain deep insights into your distributed systems without vendor lock-in.
Start with auto-instrumentation, add custom spans where needed, and export to your preferred backends. Your future self debugging a production incident will thank you.
Implementing observability in your stack? OpenTelemetry is the way forward!
