Distributed Tracing with OpenTelemetry

How I rolled out OpenTelemetry across NestJS and Rails services, with W3C context propagation, OTLP, sampling that doesn't lie, and trace-log correlation for SLI-driven debugging.

The first time OpenTelemetry actually paid for itself, I was staring at a frozen public leaderboard during a live combat-sports broadcast at the federation platform I CTO’d in London. Standings stopped updating at 14:32 local. Pages went out within two minutes. The Kafka consumer group for standings-projector was rebalancing every thirty seconds and we couldn’t tell which pod was the bad citizen from the broker side alone. The thing that ended the incident wasn’t a Kafka tool. It was a trace, end to end, from the HTTP request that pushed a match event all the way through to the projector handler on the slow pod. That moment is the reason I now treat OTel as table stakes on any system with more than two services.

This post is the rollout I’d do today, sharpened by every place I got it wrong the first time.

Why OTel, not a vendor agent

I’ve shipped vendor agents (Datadog APM, New Relic) in plenty of services. They work. They also bind you to one backend, one tagging convention, and one billing model. The thing I like about OTel is the boring part: it’s a wire format and a set of SDKs. You instrument once, you export OTLP, and you swap the backend later without rewriting code. At the creator-economy platform I worked at most recently, we had services exporting to Datadog and to a Grafana Tempo cluster in parallel for a quarter while we evaluated. That migration would have been months with a vendor agent in place. With OTel it was an env var.

So: OTel SDK in the service, OTLP over gRPC to a Collector, Collector forwards to whatever your backend is. Don’t ship to your APM vendor directly from the app. The Collector is non-negotiable.

Auto-instrumentation in NestJS

For Node services I start with the auto-instrumentations and only reach for manual spans when I need them. The boot file is short, and it has to be imported before anything app-level, because the instrumentations patch modules as they are loaded.

// otel.ts (imported first in main.ts, before AppModule)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? 'unknown',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA ?? 'dev',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 'dev',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel-collector:4317',
  }),
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(Number(process.env.OTEL_SAMPLE_RATIO ?? '0.05')),
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false },
  })],
});

sdk.start();

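process.on('SIGTERM', () => {
  // Flush pending spans, then exit either way. Once a SIGTERM listener
  // is registered, Node no longer exits on the signal by itself.
  sdk.shutdown().then(() => process.exit(0), () => process.exit(1));
});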

Two things worth flagging here. First, ParentBasedSampler is what keeps traces whole in a distributed system. The frontend or gateway makes the sampling decision, the decision travels downstream in the traceparent flags, and every downstream service honors it. If each service makes its own decision instead, you’ll get traces missing the middle three hops and you’ll waste a week wondering why. Second, kill the filesystem instrumentation. It’s loud and useless for most services.
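To make the sampler behavior concrete, here’s a minimal, dependency-free sketch of the decision logic. It is not the SDK’s implementation: the ratio check is simplified, and ratioDecision, shouldSample, and the trace IDs are names I made up for illustration. The point is only that a root decides from the trace ID and a ratio, while every child copies its parent’s sampled flag.

```typescript
// Sketch of parent-based sampling. Not the OpenTelemetry SDK code;
// all names and IDs here are hypothetical.

interface ParentContext {
  traceId: string;  // 32 hex chars
  sampled: boolean; // the "sampled" bit from the incoming traceparent flags
}

// Simplified ratio decision. Deterministic: the same trace ID and ratio
// always give the same answer.
function ratioDecision(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

// Parent-based rule: with a parent, copy its decision; without one, this
// service is the root and the ratio decides.
function shouldSample(traceId: string, ratio: number, parent?: ParentContext): boolean {
  if (parent) return parent.sampled;
  return ratioDecision(traceId, ratio);
}

const lowId = '00000000aaaaaaaaaaaaaaaaaaaaaaaa';  // lands in the lowest bucket
const highId = 'ffffffffaaaaaaaaaaaaaaaaaaaaaaaa'; // lands in the highest bucket

const rootKeeps = shouldSample(lowId, 0.05);  // root, low bucket: kept
const rootDrops = shouldSample(highId, 0.05); // root, high bucket: dropped

// A downstream hop honors the gateway's flag even though its own ratio
// decision for this trace ID would have gone the other way.
const childKeeps = shouldSample(highId, 0.05, { traceId: highId, sampled: true });
const childDrops = shouldSample(lowId, 0.05, { traceId: lowId, sampled: false });

console.log({ rootKeeps, rootDrops, childKeeps, childDrops });
```

The last two calls are the interesting ones: the child never consults its own ratio, which is why the middle hops stop going missing.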

Rails side, same trace

For the Rails monoliths I’ve shipped against, the Ruby OTel auto-instrumentations cover Rack, ActiveRecord, Net::HTTP, Sidekiq, and Faraday out of the box. The initializer lives in config/initializers so it runs at boot.

# config/initializers/opentelemetry.rb
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = ENV.fetch('OTEL_SERVICE_NAME', 'rails-app')
  c.service_version = ENV.fetch('GIT_SHA', 'dev')

  c.use_all(
    # db_statement is an option on the database driver instrumentation
    # (PG, Mysql2, Trilogy), not on ActiveRecord. PG is assumed here;
    # swap in whichever driver your app uses.
    'OpenTelemetry::Instrumentation::PG' => { db_statement: :include },
    'OpenTelemetry::Instrumentation::Sidekiq' => { span_naming: :job_class },
  )

  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new
    )
  )
end

The piece I always have to remember the first day is db_statement: :include, and that it is set on the database driver instrumentation rather than on ActiveRecord. By default, current versions obfuscate the SQL in spans for PII reasons. Fair default. Awful for debugging an N+1 across a service boundary. Turn it on in non-prod environments at minimum.

W3C context, not vendor headers

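This is where I see most teams trip. OTel uses the W3C traceparent header. Some older vendor agents, and AWS X-Ray, inject x-datadog-trace-id or X-Amzn-Trace-Id instead. If you have a mixed fleet, you need to enable multiple propagators so a trace started by an older service is picked up by a newer one. The example keeps W3C as the primary format and adds the X-Ray propagator as one legacy format; Datadog headers need a propagator of their own, which isn’t shown here.

// in otel.ts: pass this as textMapPropagator in the NodeSDK config
import { CompositePropagator, W3CTraceContextPropagator, W3CBaggagePropagator } from '@opentelemetry/core';
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray';

const textMapPropagator = new CompositePropagator({
  propagators: [
    // On extract, later propagators win, so the legacy format goes first
    // and W3C traceparent takes precedence when both headers are present.
    new AWSXRayPropagator(),
    new W3CTraceContextPropagator(),
    new W3CBaggagePropagator(),
  ],
});

// new NodeSDK({ ...existing options, textMapPropagator })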

For async transports it’s even more important. Kafka producers and consumers need explicit context injection and extraction. The auto-instrumentation for kafkajs does this, but if you’re using a thin custom wrapper, you have to wire it manually. Inject into headers on the producer side, extract on the consumer side, and remember that traceparent is plain ASCII so it travels fine in any broker that supports message headers.
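If you do end up wiring it manually, this is roughly what crosses the broker. The sketch below is hand-rolled to show the wire format only; in a real service I’d call propagation.inject and propagation.extract from @opentelemetry/api instead. injectTraceparent, extractTraceparent, and the IDs are hypothetical, and only version 00 of the header is handled.

```typescript
// Hand-rolled sketch of traceparent injection and extraction for message
// headers. Helper names and IDs are hypothetical.

interface SpanIds {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

// Kafka clients commonly deliver header values as bytes, so accept both.
type Headers = Record<string, string | Uint8Array | undefined>;

// version "00", then trace ID, parent span ID, and trace flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: add the header without mutating the caller's object.
function injectTraceparent(ids: SpanIds, headers: Headers): Headers {
  const flags = ids.sampled ? '01' : '00';
  return { ...headers, traceparent: `00-${ids.traceId}-${ids.spanId}-${flags}` };
}

// Consumer side: return undefined on anything malformed, so a bad header
// starts a fresh trace instead of crashing the handler.
function extractTraceparent(headers: Headers): SpanIds | undefined {
  const raw = headers.traceparent;
  if (raw === undefined) return undefined;
  const text = typeof raw === 'string' ? raw : new TextDecoder().decode(raw);
  const match = TRACEPARENT.exec(text);
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const sent = injectTraceparent(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7', sampled: true },
  { 'content-type': 'application/json' },
);

// Simulate the broker round trip: the consumer sees bytes, not a string.
const wire: Headers = { traceparent: new TextEncoder().encode(sent.traceparent as string) };
const received = extractTraceparent(wire);

console.log(sent.traceparent, received);
```

The bytes-versus-string handling is the part thin wrappers tend to forget on the consumer side.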

The Collector matters more than the SDK

The Collector is where you decide what’s expensive and what’s not: tail sampling, attribute scrubbing, batching, retries. Without it, every app pod carries its own export queue and retry budget against your backend, and the first time the backend is slow, your app pods are the ones absorbing the backpressure.

# otel-collector-config.yaml
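receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      # The Ruby OTLP exporter sends OTLP over HTTP, so the Rails side needs this too.
      http:
        endpoint: 0.0.0.0:4318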

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, attributes/scrub, batch]
      exporters: [otlp/tempo]

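Tail sampling is the killer feature. Head sampling at 5% means you’ll miss roughly 95% of slow requests, because the decision is made before anyone knows the request is slow. Tail sampling at 5% baseline plus errors plus latency means you keep every slow or failing trace and a representative sample of the rest. One catch: the Collector can only keep traces it receives. When you tail sample, set OTEL_SAMPLE_RATIO to 1 in the services; leave it at 0.05 and the Collector is only choosing among the 5% that survived the head. That ratio held up well at the scale of a multi-terabyte Aurora writer and thousands of pods. Your mileage may vary, but the principle is solid.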
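Spelled out as plain logic, the three policies in that config amount to something like the sketch below. The Collector’s tail_sampling processor is the real implementation; this only mirrors the shape of the decision. decide, traceDurationMs, and the sample spans are hypothetical, and the random roll is passed in as a parameter so the example stays deterministic.

```typescript
// Sketch of the errors / slow / baseline policies as one decision function.
// Not the Collector's code; names and data are hypothetical.

interface FinishedSpan {
  startMs: number;
  endMs: number;
  status: 'OK' | 'ERROR' | 'UNSET';
}

type Reason = 'errors' | 'slow' | 'baseline' | 'dropped';

const LATENCY_THRESHOLD_MS = 1000; // matches threshold_ms in the config
const BASELINE_RATIO = 0.05;       // matches sampling_percentage: 5

// Trace duration: earliest span start to latest span end.
function traceDurationMs(spans: FinishedSpan[]): number {
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  return end - start;
}

// Called once per trace, after the whole trace has had time to arrive.
// `roll` is a number in [0, 1).
function decide(spans: FinishedSpan[], roll: number): Reason {
  if (spans.some((s) => s.status === 'ERROR')) return 'errors';
  if (traceDurationMs(spans) > LATENCY_THRESHOLD_MS) return 'slow';
  if (roll < BASELINE_RATIO) return 'baseline';
  return 'dropped';
}

const failing = decide([{ startMs: 0, endMs: 120, status: 'ERROR' }], 0.9);
const slow = decide(
  [
    { startMs: 0, endMs: 300, status: 'OK' },
    { startMs: 250, endMs: 1800, status: 'OK' },
  ],
  0.9,
);
const luckyFast = decide([{ startMs: 0, endMs: 80, status: 'OK' }], 0.01);
const fast = decide([{ startMs: 0, endMs: 80, status: 'OK' }], 0.5);

console.log({ failing, slow, luckyFast, fast });
```

Errors and slow traces are kept regardless of the roll; only the healthy, fast ones are subject to the 5%.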

Correlating logs with traces

A trace without a log line is half a story. The trick is putting trace_id and span_id into your structured logger. In Node:

import { trace, context } from '@opentelemetry/api';
import pino from 'pino';

const logger = pino({
  base: undefined,
  mixin() {
    const span = trace.getSpan(context.active());
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

Now every log line emitted inside an active span carries the IDs. In Grafana or your log backend of choice, you can pivot from a log to its trace and back. That’s the move that turns observability from “I have data” into “I can answer the question.”
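If it’s not obvious why the mixin sees the right span, the sketch below shows the mechanism with nothing but Node’s standard library. AsyncLocalStorage stands in for OTel’s context manager here, and activeSpan, mixin, and logLine are hypothetical names; this is an illustration of the idea, not how pino or the SDK are implemented.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Sketch of context-based log correlation. AsyncLocalStorage plays the
// role of the OTel context; all names and IDs are hypothetical.

interface ActiveSpan {
  traceId: string;
  spanId: string;
}

const activeSpan = new AsyncLocalStorage<ActiveSpan>();

// Plays the role of the pino mixin: looked up on every log call.
function mixin(): Record<string, string> {
  const span = activeSpan.getStore();
  return span ? { trace_id: span.traceId, span_id: span.spanId } : {};
}

function logLine(msg: string): string {
  return JSON.stringify({ msg, ...mixin() });
}

// Outside any span: no IDs are attached.
const outside = logLine('boot');

// Inside a span's scope: the same logLine call picks up the IDs without
// anyone passing them as arguments.
const inside = activeSpan.run(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' },
  () => logLine('standings updated'),
);

console.log(outside);
console.log(inside);
```

The handler code never mentions trace IDs, which is what makes this cheap to roll out across an existing codebase.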

The story that made me believe

Back to the federation platform. Twelve minutes of stale standings during a live broadcast. Hundreds of microservices in flight, Kafka as the async backbone. The first wrong fix was kubectl rollout restart deployment/standings-projector. Consumers re-joined cleanly. Then they triggered another rebalance about forty seconds later. We were doing the same dance the group was already doing on its own.

The real fix came from pulling pod logs side by side. One pod out of six had a different max.poll.interval.ms: 300s on five of them, 60s on the sixth. The sixth pod was running a stale container image because someone had pushed a config-touching fix without bumping the image tag and the deployment had pulled :latest. The pod’s handler did a slow downstream call to a federation-rules service that occasionally took around seventy seconds, past its max.poll.interval.ms, so it kept getting kicked out of the group, causing rebalances for everyone. Cordoned the bad pod, storm drained in about ninety seconds.

Here’s the thing. If we’d had OTel spans on the consumer handler, with messaging.kafka.consumer.group as a span attribute and service.instance.id as a resource attribute, we’d have seen the bad pod in the trace view inside two minutes. Span durations grouped by service.instance.id would have lit up the slow one immediately. We didn’t, so we shelled into six pods reading logs manually. That incident is why I now ship OTel before Kafka. Not after.
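The grouping I mean is nothing fancy. Here’s a sketch of it over exported span data, assuming each span record carries the service.instance.id of the pod that produced it. The record shape, the function names, the instance names, and every number are invented for illustration; none of it is data from the incident.

```typescript
// Sketch: group handler span durations by service.instance.id and flag
// the instances whose worst span exceeds a threshold. All data is invented.

interface SpanRecord {
  instanceId: string; // value of the service.instance.id resource attribute
  durationMs: number;
}

// Worst (longest) handler span seen per instance.
function worstDurationByInstance(spans: SpanRecord[]): Record<string, number> {
  const worst: Record<string, number> = {};
  for (const span of spans) {
    worst[span.instanceId] = Math.max(worst[span.instanceId] ?? 0, span.durationMs);
  }
  return worst;
}

// Instances with at least one span longer than thresholdMs, sorted by name.
function instancesOver(spans: SpanRecord[], thresholdMs: number): string[] {
  const worst = worstDurationByInstance(spans);
  return Object.keys(worst)
    .filter((id) => worst[id] > thresholdMs)
    .sort();
}

const handlerSpans: SpanRecord[] = [
  { instanceId: 'projector-a', durationMs: 180 },
  { instanceId: 'projector-b', durationMs: 220 },
  { instanceId: 'projector-f', durationMs: 70000 },
  { instanceId: 'projector-f', durationMs: 240 },
  { instanceId: 'projector-c', durationMs: 310 },
];

// 60000 ms stands in for a 60s poll budget.
const suspects = instancesOver(handlerSpans, 60000);

console.log(worstDurationByInstance(handlerSpans), suspects);
```

Any trace backend that can group by a resource attribute gives you the same view without code; the sketch just shows why the attribute has to be there in the first place.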

Takeaways

Default to OTel SDK plus a Collector. Don’t ship straight to a vendor.
ParentBasedSampler plus tail sampling at the Collector. Head sampling alone lies to you.
Auto-instrumentations cover 80% of what you need. Manual spans only at boundaries the auto layer misses.
W3C traceparent everywhere. Run a composite propagator if you have legacy headers in the fleet.
Inject trace_id and span_id into your structured logger. Logs and traces are the same story.
Resource attributes like service.instance.id are the difference between “we have traces” and “we found the bad pod in two minutes.”

Thanks for reading. If you’ve got thoughts, send them my way.