The Kafka Consumer Lag Nobody Noticed

A Monday morning support ticket taught me that Kafka consumer health is about event freshness, not throughput. Manual commits, idempotency keys, and a real consumer SLO.

A Monday morning. The first signal that something was off wasn’t a page, wasn’t a dashboard. It was a Slack message from support: “ranking page is showing yesterday’s standings”. The consumer was running. The Kafka broker was happy. The events were sitting in the topic, twelve hours stale, and not one alert had fired.

This was at the combat-sports tournament platform I CTO’d in London. The rankings page was one of the most-watched surfaces on the product. The consumer that fed it had been quietly green on every dashboard we owned. It was also quietly wrong, and had been since the previous afternoon.

The Monday morning Slack message

Here’s the setting. PostgreSQL was the system of record. Elasticsearch was the read layer. A consumer called rankings-indexer read off Kafka and projected events into the ES index. Standard CQRS-shaped thing.

Saturday night a federation tournament finished. A new champion took the top spot. Eight hours later the rankings page still had the old number one. The athlete in question noticed before we did and tweeted a screenshot tagging the federation. By the time my phone buzzed it was Monday and we were already publicly embarrassed.

I sshed into the indexer pod. Logs were quiet. No errors, no warnings. The consumer was sitting on a poll loop, happily reporting healthy. So I did the dumb thing first. I restarted it.

It came back up, picked up at its last checkpoint, started projecting again for new events, and did not backfill the eight hours of stale state. Restarting was the wrong move. It made the gap permanent.

The real fix took twenty-five minutes. A one-shot reindex job read current rankings from PostgreSQL, bulk-wrote them to a new ES index, and atomically swapped the read alias over to it. The page caught up at the end. Root cause: the bulk-write client had silently flipped to “circuit open” after a transient ES blip the night before, and our circuit breaker had no path back to closed without a restart. Patched it to retry half-open every sixty seconds.
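The breaker patch is easier to see in code. This is a minimal sketch, not our production client: the class name HalfOpenBreaker, the failure threshold, and the injected clock are all illustrative assumptions. The point is the one transition we were missing, from open back to half-open once a retry window has passed.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

// Hypothetical breaker; names, threshold, and clock injection are illustrative.
class HalfOpenBreaker {
  private readonly failureThreshold: number
  private readonly retryAfterMs: number
  private readonly now: () => number
  private state: BreakerState = 'closed'
  private consecutiveFailures = 0
  private openedAt = 0

  constructor(
    failureThreshold: number,
    retryAfterMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold
    this.retryAfterMs = retryAfterMs
    this.now = now
  }

  // Returns false only while the breaker is open and the retry window has not elapsed.
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.retryAfterMs) {
      // Allow probe traffic again instead of staying open until a restart.
      this.state = 'half-open'
    }
    return this.state !== 'open'
  }

  recordSuccess(): void {
    // Any success, including a half-open probe, closes the circuit.
    this.state = 'closed'
    this.consecutiveFailures = 0
  }

  recordFailure(): void {
    this.consecutiveFailures += 1
    // A failed probe reopens immediately; otherwise open after N consecutive failures.
    if (this.state === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open'
      this.openedAt = this.now()
    }
  }

  currentState(): BreakerState {
    return this.state
  }
}
```

A bulk writer would call allowRequest() before each write and recordSuccess() or recordFailure() after it. With retryAfterMs set to 60000, a transient blip costs at most a minute of probes rather than a night of silence.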

But that’s the technical fix. The bigger lesson came out of the postmortem.

Why throughput is the wrong metric

Our consumer “health” check was, embarrassingly, this:

import type { Consumer } from 'kafkajs'

export class ConsumerHealthCheck {
  constructor(private readonly consumer: Consumer) {}

  isHealthy(): boolean {
    const groupDescription = this.consumer.describeGroup
    return groupDescription !== undefined
  }
}

That returns true as long as the consumer exists. The consumer existed. The consumer was even polling. It just wasn’t writing anything downstream because its sink was circuit-broken. Our health check was measuring “is the process alive” and calling that “healthy”.

This is the consumer monitoring trap. Throughput and “is the consumer running” answer one question: are messages being read from the broker. They do not answer: are messages being processed into downstream state on a fresh enough cadence to be useful.

Useful is the operative word. A consumer that reads a million events a minute but writes zero into the index is not healthy. A consumer that reads ten events a minute but keeps the index current is healthy. The metric you actually care about is event age at the point of successful processing, not events per second.
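For contrast with the check above, here is a minimal sketch of a health check keyed on downstream progress rather than process liveness. The class name FreshnessHealthCheck, the maxStalenessMs parameter, and the injected clock are assumptions for illustration, and it assumes one message in flight at a time, as with a sequential eachMessage handler.

```typescript
// Hypothetical health check: healthy means "no event has been waiting on the
// sink for longer than maxStalenessMs", not "the process is alive".
class FreshnessHealthCheck {
  private readonly maxStalenessMs: number
  private readonly now: () => number
  private pendingSince: number | null = null

  constructor(maxStalenessMs: number, now: () => number = () => Date.now()) {
    this.maxStalenessMs = maxStalenessMs
    this.now = now
  }

  // Call when the handler picks up a message.
  markReceived(): void {
    if (this.pendingSince === null) this.pendingSince = this.now()
  }

  // Call only after the downstream write has succeeded.
  markProcessed(): void {
    this.pendingSince = null
  }

  isHealthy(): boolean {
    // An idle consumer with nothing pending is healthy.
    if (this.pendingSince === null) return true
    return this.now() - this.pendingSince <= this.maxStalenessMs
  }
}
```

With a circuit-broken sink, markProcessed() never runs, so the check goes red once the oldest unwritten event has waited past the limit. An idle topic stays green because nothing is pending.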

Event age beats lag count

Partition lag is the thing every Kafka tutorial points you at. And it’s fine, sort of. The problem is partition lag is offset-based, which means a small-volume partition can show “lag: 4” and be twelve hours stale, while a high-volume partition can show “lag: 8000” and be three seconds behind. Lag without volume context is noisy.
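To make the mismatch concrete, here is a small sketch that computes both numbers from the same snapshot. The PartitionSnapshot shape and the sample values are made up for illustration; in practice the offsets come from the broker and the timestamp from the oldest message the consumer has not yet processed.

```typescript
// Illustrative snapshot of one partition; field names are assumptions.
interface PartitionSnapshot {
  logEndOffset: number // next offset the broker will assign
  committedOffset: number // next offset the consumer group will read
  oldestPendingTimestampMs: number | null // null when fully caught up
}

// Offset lag: how many messages behind, with no notion of time.
function offsetLag(p: PartitionSnapshot): number {
  return p.logEndOffset - p.committedOffset
}

// Staleness: how old the oldest unprocessed message is, in wall-clock terms.
function stalenessMs(p: PartitionSnapshot, nowMs: number): number {
  return p.oldestPendingTimestampMs === null ? 0 : nowMs - p.oldestPendingTimestampMs
}

const HOUR_MS = 60 * 60 * 1000
const nowMs = 1700000000000 // fixed "now" so the example is deterministic

const quietPartition: PartitionSnapshot = {
  logEndOffset: 104,
  committedOffset: 100,
  oldestPendingTimestampMs: nowMs - 12 * HOUR_MS,
}

const busyPartition: PartitionSnapshot = {
  logEndOffset: 9008000,
  committedOffset: 9000000,
  oldestPendingTimestampMs: nowMs - 3000,
}

console.log(offsetLag(quietPartition), stalenessMs(quietPartition, nowMs)) // 4 43200000
console.log(offsetLag(busyPartition), stalenessMs(busyPartition, nowMs)) // 8000 3000
```

Any lag threshold that tolerates the busy partition will never fire on the quiet one, even though the quiet one is the outage.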

What I want instead is the wall-clock age of the event at the moment my handler finishes processing it. That number is unambiguous. If it’s growing, the consumer is falling behind reality. If it’s bounded, I’m keeping up.

Here’s the wrapper we ship now, in TypeScript with kafkajs. The handler is wrapped, age is computed from the message timestamp (set at produce time by default), and we emit to OpenTelemetry. The actual handler doesn’t need to know about any of this.

import type { EachMessagePayload } from 'kafkajs'
import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('kafka-consumer')
const eventAgeHist = meter.createHistogram('kafka.event_age_ms', {
  description: 'Wall-clock age of event at successful handler completion',
  unit: 'ms',
})

type Handler = (payload: EachMessagePayload) => Promise<void>

export function withFreshnessTracking(
  topic: string,
  handler: Handler,
): Handler {
  return async (payload) => {
    const producedAt = Number(payload.message.timestamp)
    await handler(payload)
    const ageMs = Date.now() - producedAt
    eventAgeHist.record(ageMs, {
      topic,
      partition: String(payload.partition),
    })
  }
}

The handler runs first. Age is recorded only after success. That ordering matters. If the handler throws, we don’t emit a misleading “this event was processed at age 200ms” when in fact it wasn’t processed at all.

In production we alert on the worst case of this histogram, not an average. The current rule is: max(event_age_ms) > 60s for 5m pages someone. That alert is what would have caught the rankings indexer drift on Saturday night, hours before a federation athlete noticed.

Idempotent consumers, manual commits

The second thing the postmortem changed was how we commit offsets.

The kafkajs default is autoCommit: true. It’s convenient. It’s also a foot-gun. Offsets get committed in the background for every message whose handler promise resolved. Resolved is not the same as written. If the handler swallows a sink error, or kicks off the downstream write without awaiting it, the offset is committed anyway. Crash, restart, and the broker thinks you’ve consumed it. You haven’t. The message is gone from your perspective and the downstream state is missing it.

There’s an even more painful version of this I’ve lived through, on a totally different surface. The branded mobile apps pipeline at the creator economy platform I worked at had a renewal-notification handler that took payments from Apple’s server-to-server callbacks. We returned 200 OK once the request hit the endpoint, then did the work async. The work occasionally took longer than Apple’s thirty-second deadline. Apple retried. We had no idempotency check. Every retry created a new subscription row. A few thousand customers got double-billed before the creator screenshot-tweeted us about it.

The pattern is the same in both stories. The upstream thinks the message was handled because we acknowledged. The downstream state says otherwise. The gap between acknowledgement and durability is where the bug lives.

The fix is two parts. Manual commits, and idempotency keys derived from the message, not the offset. Here’s a NestJS-shaped Kafka consumer that does both.

import { Inject, Injectable } from '@nestjs/common'
import type { Consumer, EachMessagePayload, KafkaMessage } from 'kafkajs'
import { DataSource, EntityManager } from 'typeorm'

@Injectable()
export class OrderEventConsumer {
  constructor(
    @Inject('KAFKA_CONSUMER') private readonly consumer: Consumer,
    private readonly ds: DataSource,
  ) {}

  // Must be run with autoCommit: false, otherwise kafkajs also commits on its own.
  async handle(payload: EachMessagePayload): Promise<void> {
    const { topic, partition, message } = payload
    const idempotencyKey = message.headers?.['event-id']?.toString()
    if (!idempotencyKey) {
      throw new Error(`event without event-id header on ${topic}`)
    }

    await this.ds.transaction(async (tx) => {
      // Already applied on an earlier delivery: skip the write,
      // but still fall through to the offset commit below.
      const seen = await tx.query(
        'SELECT 1 FROM processed_events WHERE event_id = $1',
        [idempotencyKey],
      )
      if (seen.length > 0) return

      await this.applyOrderEvent(tx, message)

      // Same transaction as the domain write, so both land or neither does.
      await tx.query(
        'INSERT INTO processed_events (event_id, topic) VALUES ($1, $2)',
        [idempotencyKey, topic],
      )
    })

    // Commit the next offset to read, only after the DB transaction has committed.
    await this.consumer.commitOffsets([
      { topic, partition, offset: (Number(message.offset) + 1).toString() },
    ])
  }

  private async applyOrderEvent(
    tx: EntityManager,
    message: KafkaMessage,
  ): Promise<void> {
    // domain write goes here
  }
}

The order is the contract. Transaction commits the downstream write plus the idempotency row atomically. Only then do we commit the Kafka offset. If the process dies between the DB commit and the offset commit, no harm done: the next poll re-delivers the message, the processed_events row says we’ve seen it, we skip and commit.
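The redelivery case is worth walking through once. The sketch below swaps PostgreSQL and Kafka for in-memory stand-ins, so InMemoryStore, FakePartition, processNext, and the crashBeforeOffsetCommit flag are all hypothetical. It simulates the process dying between the DB commit and the offset commit.

```typescript
interface OrderEvent {
  id: string // stands in for the event-id header
  offset: number
  amount: number
}

// Stand-in for the database: applyOnce plays the role of one transaction that
// writes the domain change and the processed_events row together.
class InMemoryStore {
  total = 0
  readonly processedEvents = new Set<string>()

  applyOnce(event: OrderEvent): boolean {
    if (this.processedEvents.has(event.id)) return false
    this.total += event.amount
    this.processedEvents.add(event.id)
    return true
  }
}

// Stand-in for one partition: poll() always re-reads from the committed offset.
class FakePartition {
  committedOffset = 0
  private readonly log: OrderEvent[]

  constructor(log: OrderEvent[]) {
    this.log = log
  }

  poll(): OrderEvent | undefined {
    return this.log[this.committedOffset]
  }
}

function processNext(
  store: InMemoryStore,
  partition: FakePartition,
  crashBeforeOffsetCommit = false,
): 'applied' | 'skipped' | 'idle' {
  const event = partition.poll()
  if (!event) return 'idle'
  const applied = store.applyOnce(event)
  // Simulated crash: the DB write is durable, the offset commit never happens.
  if (crashBeforeOffsetCommit) return applied ? 'applied' : 'skipped'
  partition.committedOffset = event.offset + 1
  return applied ? 'applied' : 'skipped'
}
```

Run it with a crash on the first event and the second call sees the same event again, skips the write, and commits the offset. The total is counted once, which is the whole point.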

This is not novel. It’s the textbook inbox pattern, the consumer-side half of outbox-inbox. The reason I’m laboring it is that most teams I’ve worked with start with auto-commit because it’s easier, then have a production incident that looks suspiciously like ours did, and then switch. Better to start here.
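Starting here mostly comes down to one flag at run time. A minimal sketch: RunnableConsumer is a local structural stand-in for the slice of the kafkajs Consumer used here, so the snippet type-checks without the library, and startManualCommitConsumer is a hypothetical helper name. The autoCommit and eachMessage options are the real kafkajs run options.

```typescript
// Local stand-ins for the kafkajs types, trimmed to what this sketch touches.
interface MessagePayload {
  topic: string
  partition: number
  message: { offset: string }
}

type MessageHandler = (payload: MessagePayload) => Promise<void>

interface RunnableConsumer {
  run(config: { autoCommit: boolean; eachMessage: MessageHandler }): Promise<void>
}

// Hypothetical helper: starts the consumer with auto-commit off, so the only
// offset commits are the explicit ones the handler makes after its DB transaction.
function startManualCommitConsumer(
  consumer: RunnableConsumer,
  handle: MessageHandler,
): Promise<void> {
  return consumer.run({
    autoCommit: false,
    eachMessage: (payload) => handle(payload),
  })
}
```

Leave the flag out and the explicit commitOffsets call still works, but kafkajs keeps committing on its own as well, which quietly brings back the gap between acknowledgement and durability.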

The alert that should have fired

We pair two monitors now.

Burrow watches consumer-group committed offsets and evaluates lag over a sliding window, so it flags a group that has stalled or is falling steadily behind without a hand-tuned threshold. It’s been around forever and it works. Then on the consumer side we emit event_age_ms and alert on freshness. Both are needed. Burrow tells you the consumer fell off the broker. Freshness tells you the consumer is on the broker but not actually doing useful work.

The Burrow config is unromantic. Here’s the relevant slice.

client-profile:
  rankings-cluster:
    client-id: burrow-rankings
    kafka-version: 3.6.0

cluster:
  rankings-cluster:
    class-name: kafka
    servers:
      - kafka-0.kafka.svc.cluster.local:9092
      - kafka-1.kafka.svc.cluster.local:9092
    client-profile: rankings-cluster
    topic-refresh: 60
    offset-refresh: 30

consumer:
  rankings-indexer:
    class-name: kafka
    cluster: rankings-cluster
    servers:
      - kafka-0.kafka.svc.cluster.local:9092
    client-profile: rankings-cluster
    offsets-topic: __consumer_offsets

notifier:
  pagerduty:
    class-name: http
    url-open: https://events.pagerduty.com/v2/enqueue
    threshold: 2  # minimum group status to notify on; 2 is WARN
    interval: 60
    timeout: 10

And the freshness monitor on the Datadog side, expressed roughly:

max(last_5m):max:kafka.event_age_ms{topic:match-events} > 60000

That’s the one that pages. Sixty seconds of event age is the SLO line. Above that, the consumer is not healthy, regardless of what the lag chart says.

A consumer SLO you can defend

The rankings incident also gave me the language. We started writing consumer SLOs the way we wrote API SLOs. Pick a freshness target, write it down, alert on the burn rate.

The format I land on now: “99% of events on topic X processed within 30 seconds of broker arrival, measured over a rolling 7 day window.”

That wording matters. It’s testable: anyone can pull the histogram and check. And it’s negotiable with product. Tighter freshness needs more capacity. Looser saves money. The vague “is Kafka working” question becomes a budget conversation.

When the budget burns, the rule is unsexy. Stop shipping consumer features on that topic. Fix freshness. Earn the budget back. The first time we held that line it took a week. It hurt. It also worked, and nobody has questioned the SLO since.
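The budget arithmetic is simple enough to sketch. Assuming you can pull the event ages for the window as a plain array (the function names sloCompliance and budgetConsumed, and the sample numbers in the note below, are illustrative), compliance is the fraction of events under the target, and the error budget is whatever the objective leaves over.

```typescript
// Fraction of events whose age at successful processing met the target.
function sloCompliance(agesMs: number[], targetMs: number): number {
  if (agesMs.length === 0) return 1 // no events, nothing violated
  const good = agesMs.filter((age) => age <= targetMs).length
  return good / agesMs.length
}

// Share of the error budget used over the window: 1.0 means exactly spent,
// anything above 1.0 means the SLO is breached.
function budgetConsumed(compliance: number, objective: number): number {
  const allowedBadFraction = 1 - objective
  const actualBadFraction = 1 - compliance
  return actualBadFraction / allowedBadFraction
}
```

With a 99% objective and a 30 second target, a window of 1000 events where 5 took 45 seconds has a compliance of 0.995 and has used about half the budget. Twenty slow events in the same window would use about twice the budget, and that is when feature work on the topic stops.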

Takeaways

Measure event age at the handler, not partition lag. Lag is offset-relative and lies on low-volume topics.
Auto-commit will lose you messages. Use manual commits paired with idempotency keys derived from the message.
Derived indexes need their own freshness signal. “Consumer is consuming” is not a health check.
Pair Burrow with a consumer-side freshness monitor. Different failure modes, different alerts.
Write the consumer SLO down. 99% of events processed within N seconds. Anything else is hand-waving.

Thanks for reading. If you’ve got thoughts, send them my way.