Microservice Logging Aggregation

Why I default to Grafana Loki over ELK for shipping microservice logs, with Fluent Bit configs, trace-ID correlation, and sampling rules from production.

It was a Wednesday afternoon at the creator economy platform I worked at, and a Slack thread had been going for forty minutes trying to find one failing request across thousands of pods. The trace ID was in a Datadog span. The logs we actually needed lived in a different tool because someone three years ago wired the billing service to a separate Elasticsearch cluster. The on-call bounced between two UIs, copy-pasting an ID, getting partial hits, going back. That afternoon is why I have strong opinions about log aggregation.

Short version. For a new system I’d run Grafana Loki with Fluent Bit as the shipper, structured JSON with a trace ID on every line, aggressive sampling, and log-based alerts that fire on shape, not volume. ELK is fine if you already run it. At the scale I work at, its cost curve and indexing pain stopped being worth it.

ELK vs Loki, briefly

Elasticsearch indexes every field. That’s the feature and the problem. Full-text search across hundreds of services is fast, but you pay for it in CPU, disk, and the hours someone spends tuning shard counts. The team that owned our ES cluster spent a real chunk of every week on rebalancing.

Loki flipped that for me. It indexes labels, not content. Logs sit as compressed chunks in S3. You query with LogQL, which feels like PromQL with grep bolted on. The honest trade-off: queries that scan raw text are slower, but the cost drop made me stop caring. For the queries you actually run on-call, like “everything for trace ID X in the last fifteen minutes”, labels narrow the search to a handful of streams before any text gets scanned.

One thing nobody tells you about Loki: label cardinality is the whole game. Every distinct combination of label values is its own stream, so putting user_id in as a label turns Loki into a worse Elasticsearch. Keep labels coarse: service, environment, level. Let the body carry the rest.
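To put rough numbers on that, here is a small sketch. It assumes the worst case where every combination of label values actually occurs, so the stream count is bounded by the product of distinct values per label. The label names and counts are invented for illustration, not measurements from any system in this post.

```typescript
// Upper bound on Loki streams: the product of distinct values per label.
// All counts below are illustrative assumptions.
export function maxStreams(distinctValuesPerLabel: Record<string, number>): number {
  return Object.values(distinctValuesPerLabel).reduce((acc, n) => acc * n, 1);
}

const coarse = maxStreams({ service: 200, env: 3, level: 4 });
const withUserId = maxStreams({ service: 200, env: 3, level: 4, user_id: 50000 });

console.log(coarse);     // 2400
console.log(withUserId); // 120000000
```

One extra high-cardinality label multiplies everything that came before it, which is why it belongs in the log body.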

Fluent Bit on every node

Fluent Bit is small enough to run as a DaemonSet without complaining and flexible enough to handle the weird stuff. Filebeat is fine too. I’ve shipped both. Fluent Bit wins when you want to do parsing and enrichment at the node, not centrally.

# fluent-bit.conf.yaml, mounted into the DaemonSet
service:
  flush: 1
  log_level: info
  parsers_file: parsers.conf

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      parser: cri
      tag: kube.*
      mem_buf_limit: 50MB
      skip_long_lines: on

  filters:
    - name: kubernetes
      match: kube.*
      kube_url: https://kubernetes.default.svc:443
      merge_log: on
      keep_log: off
      k8s-logging.parser: on

    - name: modify
      match: kube.*
      # YAML maps cannot repeat a key, so list the fields to drop
      remove:
        - stream
        - time

    - name: grep
      match: kube.*
      exclude: log /healthz

  outputs:
    - name: loki
      match: kube.*
      host: loki-gateway.observability.svc
      port: 3100
      labels: service=$kubernetes['labels']['app'], env=$kubernetes['namespace_name'], level=$level
      # trace_id stays in the JSON body; as a label it would create one stream per trace
      line_format: json
      auto_kubernetes_labels: off

That grep filter excluding /healthz is not optional. Liveness probes can easily be half your log volume. I learned that on a side product I CTO, where Loki ingestion tripled in a month and the diff was almost entirely health check noise. Took an hour to fix.

auto_kubernetes_labels: off is also deliberate. Kubernetes labels are great metadata but they blow up cardinality if you push them all into Loki. Promote only what you filter on.

Structured JSON and the trace ID

Every service emits JSON logs. Every line has a trace ID. No exceptions, no “we’ll add it later”. If you take one thing away from this post, take this one.

import pino from 'pino';
import { context, trace } from '@opentelemetry/api';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: process.env.SERVICE_NAME,
    env: process.env.NODE_ENV,
  },
  mixin() {
    const span = trace.getSpan(context.active());
    if (!span) return {};
    const sc = span.spanContext();
    return {
      trace_id: sc.traceId,
      span_id: sc.spanId,
    };
  },
  redact: {
    paths: ['req.headers.authorization', 'req.body.password', 'req.body.card.*'],
    censor: '[REDACTED]',
  },
});

mixin runs on every log call and pulls the active span from OpenTelemetry. That’s the part most setups get half right. They tag the request log line but miss the error log from deep inside a service call. With mixin you get it on every line that runs in the active context, no manual passing.

On the NestJS side I wire pino into an interceptor so request-scoped fields ride along:

import { Injectable, NestInterceptor, ExecutionContext, CallHandler } from '@nestjs/common';
import { Observable, tap, catchError, throwError } from 'rxjs';
import { logger } from './logger';

@Injectable()
export class RequestLogInterceptor implements NestInterceptor {
  intercept(ctx: ExecutionContext, next: CallHandler): Observable<unknown> {
    const req = ctx.switchToHttp().getRequest();
    const start = Date.now();
    const child = logger.child({
      method: req.method,
      path: req.route?.path ?? req.url,
      request_id: req.headers['x-request-id'],
    });

    return next.handle().pipe(
      tap(() => child.info({ duration_ms: Date.now() - start, status: 'ok' })),
      catchError((err) => {
        child.error({ duration_ms: Date.now() - start, err }, 'request failed');
        return throwError(() => err);
      }),
    );
  }
}

The trace ID comes from the W3C traceparent header, so a request that crosses six services shows up in one query. Because the lines are JSON, filter on the raw ID first and then parse: {service=~".+"} |= "abc123" | json | trace_id="abc123" and you’re done.
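For anyone who hasn’t looked inside that header, here is a minimal sketch of reading it by hand. In a real service the OpenTelemetry propagator does this for you. The function and type names are hypothetical, and the parser only handles the four-field layout (version, trace ID, parent span ID, flags).

```typescript
// Sketch of reading a W3C traceparent header by hand.
// Names here are illustrative; OpenTelemetry normally handles propagation.
export interface TraceParent {
  traceId: string;  // 32 lowercase hex chars
  spanId: string;   // 16 lowercase hex chars (the caller's span)
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

export function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // Version ff is invalid, and all-zero IDs are not allowed.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const parsed = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(parsed?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

The traceId field is the value that ends up as trace_id on every log line, which is what makes the single cross-service query possible.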

Retention tiers and trace-aware sampling

Two knobs. Retention is how long you keep logs. Sampling is how many you keep in the first place.

I run a tiered policy. Errors and warnings live 30 days. Info lives 7 days. Debug lives 24 hours and only ships when a feature flag is on. The vast majority of volume is debug and info, and almost nobody queries them past day three.
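Written down as data, the policy is small. This is only a sketch of the tiers as a lookup table, the kind of thing you might feed into a config generator or a unit test. Enforcement itself lives in Loki’s retention settings, not in application code, and the debugShippingEnabled parameter stands in for whatever feature-flag system you use.

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error';

// Retention per level, in hours, matching the tiers described above.
const RETENTION_HOURS: Record<Level, number> = {
  error: 30 * 24,
  warn: 30 * 24,
  info: 7 * 24,
  debug: 24,
};

export function retentionHours(level: Level): number {
  return RETENTION_HOURS[level];
}

// Debug lines only leave the process when the (hypothetical) flag is on.
export function shouldShip(level: Level, debugShippingEnabled: boolean): boolean {
  return level !== 'debug' || debugShippingEnabled;
}
```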

For sampling, the thing to sample isn’t random lines, it’s high-volume repetitive ones. The request log on a service doing thousands of RPS isn’t useful at full fidelity. Sample at the edge.

import crypto from 'crypto';

export function shouldSampleAccessLog(traceId: string, rate = 0.05): boolean {
  // hash-based sampling so a trace either logs every hop or none
  const h = crypto.createHash('sha1').update(traceId).digest();
  // divide by 2^32 so bucket lands in [0, 1) and rate = 1 keeps everything
  const bucket = h.readUInt32BE(0) / 0x100000000;
  return bucket < rate;
}

Hash on the trace ID, not on Math.random(). That way a sampled trace logs at every service it touches, or at none. Random sampling gives you swiss cheese, and swiss cheese is worse than no logs because it makes you think you have data when you have a fraction of it.
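A quick way to convince yourself of the all-or-nothing property is to ask the same question from several pretend services. The sketch below repeats the sampler so it runs on its own. The service names and the trace ID are made up.

```typescript
import { createHash } from 'node:crypto';

// Same idea as the sampler above, repeated so this snippet stands alone.
function sampleByTrace(traceId: string, rate = 0.05): boolean {
  const h = createHash('sha1').update(traceId).digest();
  // Divide by 2^32 so the bucket falls in [0, 1).
  const bucket = h.readUInt32BE(0) / 0x100000000;
  return bucket < rate;
}

// Every service computes the decision independently from the trace ID alone.
function decisionsAcrossServices(traceId: string, services: string[], rate: number): boolean[] {
  return services.map(() => sampleByTrace(traceId, rate));
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
const decisions = decisionsAcrossServices(exampleTraceId, ['gateway', 'billing', 'ledger'], 0.05);
const consistent = decisions.every((d) => d === decisions[0]);
console.log(consistent); // true: every hop reaches the same decision
```

No coordination between services is needed. The only shared state is the trace ID they already pass around.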

Two stories that shaped this

The federation platform I was acting CTO at. Saturday afternoon, a live combat-sports tournament being broadcast publicly. Around the third bout, the standings-projector consumer group started rebalancing every thirty seconds. The standings page froze at 14:32 during the broadcast. Logs existed, sort of, unstructured strings going to a flat file collector, with no way to correlate the rebalances with the slow handler call actually causing them.

First wrong move, kubectl rollout restart on the projector deployment. It kicked off another rebalance forty seconds later. Real fix came from pulling pod logs side by side and noticing one pod out of six was logging max.poll.interval.ms of 60s while the other five logged 300s. That pod was running a stale image. Twelve minutes of stale standings during a live broadcast. The lesson was structural. If we’d had a kafka.config.applied event logged once at boot per pod, the bad pod would have been a one-query find. After that, every Kafka service got structured boot logs with a config hash, plus a Loki alert that fires when a consumer group has more than one config hash in five minutes.
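Here is roughly what such a boot event can look like. Treat it as a sketch: only the kafka.config.applied event name comes from the story above, while the other field names, the helper names, and the session.timeout.ms value are assumptions for illustration. The hash sorts keys first so that property order can’t produce a false drift.

```typescript
import { createHash } from 'node:crypto';

type FlatConfig = Record<string, string | number | boolean>;

// Stable hash of a flat config object: sort keys so property order does not matter.
export function configHash(config: FlatConfig): string {
  const canonical = Object.keys(config)
    .sort()
    .map((k) => `${k}=${String(config[k])}`)
    .join('\n');
  return createHash('sha256').update(canonical).digest('hex').slice(0, 12);
}

// Shape of the boot event. Field names other than `event` are assumptions.
export function kafkaBootEvent(consumerGroup: string, config: FlatConfig) {
  return {
    event: 'kafka.config.applied',
    consumer_group: consumerGroup,
    config_hash: configHash(config),
    config,
  };
}

// Log once at startup, e.g. logger.info(kafkaBootEvent(...), 'kafka config applied').
const good = kafkaBootEvent('standings-projector', {
  'max.poll.interval.ms': 300000,
  'session.timeout.ms': 45000,
});
const stale = kafkaBootEvent('standings-projector', {
  'max.poll.interval.ms': 60000,
  'session.timeout.ms': 45000,
});
console.log(good.config_hash !== stale.config_hash); // true
```

With that in place, the stale pod is the one whose config_hash differs from its five siblings.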

The trading platform I architected, market open on the Tuesday after a bank-holiday weekend. Seventy-four seconds after the open, connections started dropping en masse. Within ninety seconds the reconnect storm had every gateway pod pinned at 100% CPU. p99 tick fan-out went from around 80ms to around 3s. First wrong fix was scaling pods, which fed the fire. Real fix was a jittered exponential backoff push to clients plus a per-IP rate limit at nginx.

The logging gap: we had structured access logs but no log-based alert on reconnection rate per IP, just CPU and connection-count alerts. Both fired late. What stuck after the incident was a Loki rule that watches reconnect events per IP on a fifteen-second window. It has fired twice since, both times early enough to matter.

Log-based alerts for shape, not volume

Treat logs as a signal source for alerts, not just a forensic archive. Loki ruler does this well.

groups:
  - name: gateway-rules
    interval: 30s
    rules:
      - alert: GatewayReconnectStorm
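        # client_ip is parsed from the JSON body at query time, not stored as a label
        expr: |
          sum by (client_ip) (
            count_over_time(
              {service="gateway", level="warn"}
                |= "client_reconnect"
                | json client_ip="client_ip"
              [1m]
            )
          ) > 20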
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Reconnect storm from {{ $labels.client_ip }}"

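      - alert: KafkaConsumerConfigDrift
        # event, consumer_group and config_hash are parsed from the JSON boot log.
        # The inner count yields one series per (group, hash); the outer count
        # is the number of distinct hashes per group.
        expr: |
          count by (consumer_group) (
            count by (consumer_group, config_hash) (
              count_over_time(
                {service=~".+"}
                  |= "kafka.config.applied"
                  | json
                  | event="kafka.config.applied"
                [5m]
              )
            )
          ) > 1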

Alert on the shape of the log stream, not the volume. Volume alerts page you when nothing is wrong. Shape alerts page you when something is actually different.

Takeaways

Default to Loki with Fluent Bit. ELK only if you already run it and the team is happy.
Structured JSON, trace ID on every line, OpenTelemetry context as the source of truth.
Keep labels coarse. Body carries the rest. Cardinality is the whole bill.
Drop health-check noise at the collector. Sample high-volume lines by hashing trace ID.
Tier retention by level. Errors live longer than info, debug barely lives at all.
Build log-based alerts on shape, like config drift or reconnect spikes. Volume alone lies.

Thanks for reading. If you’ve got thoughts, send them my way.