Observability in 2026: OpenTelemetry, eBPF Auto-Instrumentation, and the Three Pillars Evolved




Observability has undergone a quiet revolution. The “three pillars” — logs, metrics, traces — are still relevant, but how we collect, correlate, and query them has changed fundamentally. OpenTelemetry has become the universal standard. eBPF is eliminating the need for code-level instrumentation. AI is transforming anomaly detection and root cause analysis.

This post covers the state of observability in 2026 and what it means for your platform.



The State of OpenTelemetry

OpenTelemetry (OTel) has graduated from “promising standard” to “de facto industry standard.” In 2026:

  • Every major cloud provider has first-class OTel support
  • Every major language has a stable OTel SDK
  • Every major observability vendor (Datadog, Honeycomb, Grafana, New Relic) accepts OTLP natively
  • The OpenTelemetry Collector has replaced proprietary agents in most modern stacks

The value of this standardization is enormous: instrument once, send anywhere.

Architecture: The OTel Collector Sidecar Pattern

┌─────────────────────────────────────────────────┐
│                    Pod                          │
│                                                 │
│  ┌────────────┐    OTLP    ┌─────────────────┐  │
│  │ App (SDK)  │──────────► │  OTel Collector │  │
│  └────────────┘            │   (Sidecar)     │  │
│                            └────────┬────────┘  │
└─────────────────────────────────────┼───────────┘
                                      │ OTLP/gRPC
                         ┌────────────▼────────────┐
                         │   OTel Collector        │
                         │   (Gateway/Aggregator)  │
                         └────────────┬────────────┘
                                      │
              ┌───────────────┬───────┴──────────┐
              ▼               ▼                  ▼
         Prometheus       Tempo/Jaeger       Loki/ES
         (Metrics)        (Traces)           (Logs)

The Collector handles batching, retrying, sampling, and routing — your application just emits OTLP.
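The gateway Collector in the diagram is just configuration. A minimal sketch of that fan-out, assuming the Grafana-style backends shown above (exporter names come from the Collector contrib distribution and backend endpoints are placeholders — verify both against your versions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

Swapping a backend means changing an exporter block, not re-instrumenting any service.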

Setting Up Auto-Instrumentation

For Node.js, zero-code instrumentation:

// instrument.js — loaded before your application
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 10000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

# Start your app with automatic instrumentation
node --require ./instrument.js app.js

This automatically traces HTTP requests, database queries, and cache operations — zero application code changes.


eBPF: Observability Without Code Changes

eBPF (extended Berkeley Packet Filter) has matured into a production-grade observability technology. It lets you run safe programs in the Linux kernel, capturing telemetry data at the system level without modifying applications.

What eBPF-Based Observability Can Capture

Application Layer:     ┌──────────────┐
                       │ Your App     │  ← No modification needed
                       └──────┬───────┘
                              │ syscalls
Kernel Layer:          ┌──────▼───────────────────────────────┐
                       │           eBPF Programs              │
                       │  • TCP connections & latency         │
                       │  • HTTP/gRPC request/response        │
                       │  • DNS queries                       │
                       │  • File I/O                          │
                       │  • Memory allocations                │
                       │  • CPU profiling (flame graphs)      │
                       └──────────────────────────────────────┘

Cilium/Hubble for Network Observability

# Enable Hubble (Cilium's eBPF-based network observability)
helm upgrade cilium cilium/cilium \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

Once deployed, you get automatic visibility into all network flows — no service mesh required:

# Observe live traffic
hubble observe --pod default/api-server --follow

# Output:
# May 14 12:34:01.234 [verdict: FORWARDED] 10.0.0.5:52341 -> 10.0.0.8:5432 TCP Flags: SYN
# May 14 12:34:01.235 [verdict: FORWARDED] 10.0.0.8:5432 -> 10.0.0.5:52341 TCP Flags: SYN, ACK
# May 14 12:34:01.460 [verdict: FORWARDED] 10.0.0.5:52341 -> 10.0.0.8:5432 postgres query

Parca/Pyroscope: Continuous Profiling

eBPF-based continuous profiling captures CPU flame graphs from all your services — in production, all the time, with near-zero overhead:

# Parca Agent DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: parca-agent
spec:
  selector:
    matchLabels:
      app: parca-agent
  template:
    metadata:
      labels:
        app: parca-agent  # must match the selector above
    spec:
      hostPID: true
      containers:
        - name: parca-agent
          image: ghcr.io/parca-dev/parca-agent:latest
          securityContext:
            privileged: true  # Required for eBPF
          env:
            - name: NODE_NAME  # referenced by --node below
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          args:
            - /bin/parca-agent
            - --node=$(NODE_NAME)
            - --remote-store-address=parca.parca.svc.cluster.local:7070

No JVM agents, no async profilers, no application changes — just instant flame graphs for any process.


The Four Signals: Adding Profiles

The classic three pillars (logs, metrics, traces) have been joined by a fourth: profiles.

Signal     What it answers                  Tool
─────────  ───────────────────────────────  ───────────────────────────────
Metrics    Is the system healthy?           Prometheus, OTel Metrics
Logs       What happened?                   Loki, ELK, CloudWatch
Traces     Where did latency come from?     Tempo, Jaeger, Honeycomb
Profiles   Why is this using CPU/memory?    Parca, Pyroscope, Polar Signals

The Grafana stack now unifies all four: Grafana + Prometheus + Loki + Tempo + Pyroscope, all queried from a single UI with correlation links between signals.


Correlating Signals: The Magic of Trace IDs Everywhere

The real power of modern observability is correlation. A trace ID should flow through:

// Middleware: inject trace ID into every log
const { trace } = require('@opentelemetry/api');

app.use((req, res, next) => {
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;

  req.log = logger.child({ traceId, requestId: req.id });
  next();
});

// Every log now has the trace ID
req.log.info({ userId, action: "checkout" }, "User initiated checkout");
// Output: {"level":"info","traceId":"abc123...","requestId":"req_456","userId":"usr_789","action":"checkout","msg":"User initiated checkout"}

Now in Grafana, you can:

  1. Find a slow trace in Tempo
  2. Click “Logs” → filtered automatically by trace ID
  3. Click “Profiles” → shows CPU breakdown for that time window
  4. See the full picture in one workflow
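That click-through is wired up via datasource provisioning on the Tempo datasource. A sketch under the assumption of Loki and Pyroscope datasources with uids `loki` and `pyroscope` (the `jsonData` field names vary across Grafana versions — treat them as assumptions to verify against your Grafana's provisioning docs):

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3100
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # span → logs, filtered by trace ID
        filterByTraceID: true
      tracesToProfiles:
        datasourceUid: pyroscope   # span → flame graph for that window
```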

SLOs: Connecting Telemetry to Business Value

Service Level Objectives are how you translate raw metrics into reliability commitments:

# OpenSLO specification
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-api-availability
  namespace: ecommerce
spec:
  description: "Checkout API must be available 99.9% of the time"
  service: checkout-api
  indicator:
    spec:
      ratioMetric:
        counter: true
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total{service="checkout-api",status!~"5.."}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total{service="checkout-api"}[5m]))
  timeWindow:
    - duration: 28d
      isRolling: true
  objectives:
    - displayName: "Good"
      target: 0.999

With this definition, you can calculate:

  • Error budget remaining (how much can break before you miss the SLO)
  • Burn rate alerts (e.g., you’re consuming the error budget 14x faster than the sustainable rate)
  • Toil reduction priorities (which SLO is most at risk?)
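Those numbers fall out of simple arithmetic. A sketch in plain JavaScript (the helper functions are hypothetical, not from any SLO library; the 14x figure above corresponds to a 1.4% error rate against a 0.1% budget):

```javascript
// Error-budget math for a ratio-based SLO like the one above.
// target = 0.999 → the error budget is 0.1% of requests (or of time).

// Minutes of allowed unavailability in a rolling window
function errorBudgetMinutes(target, windowMinutes) {
  return (1 - target) * windowMinutes;
}

// How fast the budget is being consumed relative to an exactly-on-target
// error rate (1.0 = on pace to exhaust the budget right at window's end)
function burnRate(observedErrorRatio, target) {
  return observedErrorRatio / (1 - target);
}

// 99.9% over 28 days ≈ 40 minutes of error budget
console.log(errorBudgetMinutes(0.999, 28 * 24 * 60).toFixed(1)); // ~40.3

// A 1.4% error rate burns a 0.1% budget at 14x
console.log(burnRate(0.014, 0.999).toFixed(0)); // 14
```

A burn rate of 14 sustained for one hour consumes about 2% of a 28-day budget, which is why multi-window burn-rate alerts key on thresholds like this.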

Sampling Strategies

At high throughput, storing every trace is expensive. Smart sampling is essential:

Head-Based Sampling (Decisions at Request Start)

Head-based sampling decides up front, typically by hashing the trace ID, so it cannot account for how the request turns out:

# OTel Collector — keep roughly 10% of traces
processors:
  probabilistic_sampler:
    sampling_percentage: 10

Tail-Based Sampling (Decisions After Request Completes)

Tail sampling lets you make sampling decisions after you know if a request was slow or errored — much smarter than head-based sampling. The OTel Collector’s tail sampling processor buffers spans and decides based on the complete trace:

# Keep 100% of errors, 100% of slow traces, and 10% of everything else
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}


Practical Stack Recommendations

Self-Hosted (Cost-Conscious)

Metrics:  Victoria Metrics (Prometheus-compatible, more efficient)
Logs:     Grafana Loki (cost-efficient, label-based)
Traces:   Grafana Tempo (object storage, S3/GCS backend)
Profiles: Pyroscope (open source, OTel-native)
UI:       Grafana (unified)
Agent:    OTel Collector (DaemonSet)
Cost:     ~$50-200/month on cloud object storage for most teams

Managed (Ops-Minimal)

Grafana Cloud:  All four signals, generous free tier
Honeycomb:      Best-in-class trace analysis, $0.50/GB
Datadog:        Full featured, expensive at scale
Axiom:          Logs + traces, cost-efficient
New Relic:      Full stack, consumption-based pricing

Conclusion

Observability in 2026 is fundamentally better than it was five years ago. OpenTelemetry has eliminated vendor lock-in. eBPF has made instrumentation optional for many use cases. The four signals (metrics, logs, traces, profiles) correlate seamlessly in modern UIs.

The shift from “I hope I logged the right thing” to “I can see exactly what happened” is now achievable for teams of any size. The tooling is mature, the costs are manageable, and the patterns are well-established.

Invest in observability early. The first time you diagnose a production incident in minutes instead of hours, you’ll understand why.

If you found this post helpful, a like (and an ad click) would be appreciated :)