CQRS in Microservices

An opinionated take on splitting command and query services with Postgres writes, Elasticsearch and Redis reads, and event-driven projections.

The first time I split commands from queries across two services was at the combat-sports tournament platform I CTO’d in London. Rankings page was on fire every Saturday during live broadcasts. The write path was fine, the read path was choking on joins across half a dozen tables, and the obvious move was to stop reading from the writer. So we did. Postgres stayed as the system of record. Elasticsearch became the rankings read store. A Kafka topic between them. That separation is what people mean when they say CQRS in microservices, and it is the most overhyped, most over-applied pattern in the backend canon.

I still think it pays off. I also think most teams reach for it about two years too early.

When the split pays off

The bar is pretty narrow. You want CQRS across services when the read shape is fundamentally different from the write shape, when you have multiple consumers each needing different projections of the same writes, and when the freshness budget is loose enough to tolerate eventual consistency. If you answer “no” to any of those, you probably want a denormalized table on the same database and a saved query.

The rankings page is the canonical example. Writes are tiny match-result events. Reads are a sorted leaderboard with athlete metadata, division, weight class, recent form, joined across many tables. Doing that join on every read against the writer would have killed it. Projecting into Elasticsearch and serving the page from there cost us maybe a hundred lines of indexer code and bought a read path that could survive a live broadcast.

The shape I keep returning to

Here is roughly the shape I keep coming back to. Command service writes to Postgres and emits an event. Query service consumes the event, projects into a read store. Reads never touch the writer.

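import { Body, Controller, Inject, Post } from '@nestjs/common';
import { ClientKafka } from '@nestjs/microservices';
import { lastValueFrom } from 'rxjs';
// MatchService and RecordMatchResultDto are app-specific; these paths are illustrative.
import { MatchService } from './match.service';
import { RecordMatchResultDto } from './dto/record-match-result.dto';

@Controller('matches')
export class MatchCommandController {
  constructor(
    private readonly matches: MatchService,
    // 'KAFKA_CLIENT' stands in for whatever token the client is registered under.
    @Inject('KAFKA_CLIENT') private readonly kafka: ClientKafka,
  ) {}

  @Post('result')
  async recordResult(@Body() body: RecordMatchResultDto) {
    // The Postgres write is committed by the time recordResult resolves.
    const match = await this.matches.recordResult(body);
    // Await the emit so a broker failure surfaces here instead of being swallowed.
    await lastValueFrom(
      this.kafka.emit('match-events', {
        type: 'MatchCompleted',
        matchId: match.id,
        winnerId: match.winnerId,
        divisionId: match.divisionId,
        occurredAt: match.completedAt.toISOString(),
      }),
    );
    return { ok: true };
  }
}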

That emit is the contract. The write transaction is closed before the event leaves. If the broker is down, I queue the event in an outbox table and let a background worker drain it. Never publish to the broker from inside the write transaction, and never rely on the broker as a transactional resource. You will get burned.
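A minimal sketch of that outbox drain, under stated assumptions: `InMemoryOutbox` stands in for an outbox table living in the same Postgres database as the write, and `Publisher` is a hypothetical wrapper around the Kafka client. None of these names come from NestJS or a Kafka library, and a real version would insert the outbox row as part of the write transaction.

```typescript
// Hypothetical shapes for illustration; not part of NestJS or any Kafka client.
interface OutboxRow {
  id: number;
  topic: string;
  payload: unknown;
  publishedAt: Date | null;
}

interface Publisher {
  publish(topic: string, payload: unknown): Promise<void>;
}

// In-memory stand-in for the outbox table. In a real system the insert runs
// with the write, so the row exists if and only if the write committed.
class InMemoryOutbox {
  private rows: OutboxRow[] = [];
  private nextId = 1;

  add(topic: string, payload: unknown): OutboxRow {
    const row: OutboxRow = { id: this.nextId++, topic, payload, publishedAt: null };
    this.rows.push(row);
    return row;
  }

  pending(limit: number): OutboxRow[] {
    return this.rows.filter((r) => r.publishedAt === null).slice(0, limit);
  }

  markPublished(id: number, at: Date): void {
    for (const row of this.rows) {
      if (row.id === id) row.publishedAt = at;
    }
  }
}

// Drains pending rows in insertion order and stops at the first failure, so a
// broker outage leaves rows pending for the next run instead of dropping them.
async function drainOutbox(
  outbox: InMemoryOutbox,
  publisher: Publisher,
  batchSize = 100,
): Promise<number> {
  let published = 0;
  for (const row of outbox.pending(batchSize)) {
    try {
      await publisher.publish(row.topic, row.payload);
    } catch {
      break; // broker unavailable; retry on the next tick
    }
    outbox.markPublished(row.id, new Date());
    published++;
  }
  return published;
}
```

Delivery here is at-least-once: a crash between `publish` and `markPublished` re-sends the row on the next drain, so the projector on the other side has to tolerate duplicates.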

The projection side reads off the topic and writes to the read store. For rankings, I picked Elasticsearch because the read shape needed full-text, filters, sorting, pagination, all served from one query.

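import { Controller } from '@nestjs/common';
import { EventPattern, Payload } from '@nestjs/microservices';
import { Client } from '@elastic/elasticsearch';
// RankingsService and MatchCompletedEvent are app-specific; these paths are illustrative.
import { RankingsService } from './rankings.service';
import { MatchCompletedEvent } from './events';

@Controller()
export class RankingsProjector {
  constructor(
    private readonly rankings: RankingsService,
    // Assumes the Elasticsearch Client is registered as a provider in this module.
    private readonly es: Client,
  ) {}

  @EventPattern('match-events')
  async onMatchEvent(@Payload() event: MatchCompletedEvent) {
    // The topic can carry other event types; this projector handles only one.
    if (event.type !== 'MatchCompleted') return;
    const ranking = await this.rankings.recompute(event.divisionId);
    await this.es.index({
      index: `rankings-${event.divisionId}`,
      id: ranking.athleteId,
      document: ranking, // v8 client parameter name; the v7 client used `body`
      // Wait until a refresh makes the document searchable before returning.
      refresh: 'wait_for',
    });
  }
}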

For hotter, simpler reads I use Redis. A creator dashboard summary, a feed counter, a “currently active” flag. Hash or sorted set, pay no attention to schemas, accept that you might lose it and reproject from Postgres. That choice of read store is the most consequential one in the whole pattern.

A war story about freshness

Saturday night, a federation tournament wrapped up around midnight. New champion’s ranking should have updated within a few minutes. Eight hours later the rankings page still showed the old number one. The athlete noticed before we did. He tweeted a screenshot of our broken rankings page tagging the federation. Not the kind of escalation you want.

The indexer was still consuming from Kafka. It just was not writing to Elasticsearch. The ES client had silently entered a circuit-open state after a transient cluster blip the night before, and the breaker did not have an automated retry path back to closed. So the consumer kept pulling, kept advancing offsets, kept dropping writes on the floor. No alarms fired because we were measuring “is the consumer lagging” and not “is the projection fresh.”

First wrong fix was a restart. The pod came up, resumed from the saved offset, and started doing the right thing for new events. The old wrong rankings were still in the index, because the offsets had already advanced past the dropped writes and nothing was going to replay them. The fix that actually worked: a full reindex from Postgres into a new ES index, then an atomic alias swap to point reads at it. The reindex took about twenty-five minutes. The page caught up at the end.

The lesson from that one is the lesson of CQRS in microservices generally. Your derived index is a separate system with its own health. Measure freshness, not throughput. I have a Datadog metric on every projection now that compares the latest event timestamp in the source to the latest in the projection. If it drifts past my SLA, I get paged.
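As a rough sketch of that freshness comparison, assuming the two timestamps have already been fetched (for example the newest event time in Postgres and the newest one reflected in the read store): `projectionFreshness` and `FreshnessReading` are hypothetical names, and nothing here is tied to Datadog's API.

```typescript
// Hypothetical helper for illustration; the names are not from any library.
interface FreshnessReading {
  lagSeconds: number;
  breached: boolean;
}

// sourceLatest: timestamp of the newest event committed in the system of record.
// projectionLatest: timestamp of the newest event reflected in the read store,
// or null if the projection is empty.
function projectionFreshness(
  sourceLatest: Date,
  projectionLatest: Date | null,
  slaSeconds: number,
): FreshnessReading {
  if (projectionLatest === null) {
    // An empty projection is treated as infinitely stale.
    return { lagSeconds: Number.POSITIVE_INFINITY, breached: true };
  }
  const lagMs = sourceLatest.getTime() - projectionLatest.getTime();
  // Clamp at zero so clock skew never reports a negative lag.
  const lagSeconds = Math.max(0, lagMs / 1000);
  return { lagSeconds, breached: lagSeconds > slaSeconds };
}
```

One caveat with this shape: after a quiet period, a single new event makes the lag jump to the full length of the gap until the projector catches up, so it is probably worth alerting on a sustained breach and not on a single reading.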

Where the cost actually lives

Operating two services and a broker is not the expensive part. The expensive part is owning the projection rebuild story. Every projection you ship needs a one-shot job that can rebuild it from Postgres, an alias-swap deploy story, a freshness metric, and a runbook for “the projector is dead, what do we do.” Without those four things you do not have CQRS, you have a distributed liability.
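The rebuild job and the alias swap can be sketched together. This is an outline under assumptions, not a drop-in job: `ReadStoreAdmin` is a hypothetical port that an adapter would map onto the real search client, and `nextBatch` is a hypothetical reader over the system of record that returns an empty array when it is exhausted.

```typescript
// Hypothetical port over the search cluster; not an actual Elasticsearch
// client interface. An adapter would implement these with the real client.
interface ReadStoreAdmin<T> {
  createIndex(name: string): Promise<void>;
  bulkIndex(name: string, docs: T[]): Promise<void>;
  count(name: string): Promise<number>;
  // Must be atomic: readers see the old index or the new one, never neither.
  swapAlias(alias: string, from: string | null, to: string): Promise<void>;
}

async function rebuildProjection<T>(
  admin: ReadStoreAdmin<T>,
  alias: string,
  currentIndex: string | null,
  newIndex: string,
  nextBatch: () => Promise<T[]>, // reads from the system of record
): Promise<number> {
  await admin.createIndex(newIndex);
  let expected = 0;
  for (;;) {
    const batch = await nextBatch();
    if (batch.length === 0) break;
    await admin.bulkIndex(newIndex, batch);
    expected += batch.length;
  }
  // Refuse to swap if the new index is short; the old one keeps serving reads.
  const actual = await admin.count(newIndex);
  if (actual !== expected) {
    throw new Error(`rebuild incomplete: expected ${expected}, indexed ${actual}`);
  }
  await admin.swapAlias(alias, currentIndex, newIndex);
  return actual;
}
```

In Elasticsearch, `swapAlias` would map to a single update-aliases request carrying a remove action and an add action, which the cluster applies atomically. The count check assumes every source row maps to a distinct document id.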

If you are at a stage where you can write that runbook and feel good about it, ship the split. If you are not, put a denormalized table behind a service-local repository and revisit when the read shape actually diverges.

Takeaways

Split commands and queries across services only when the read shape genuinely differs from the write shape, you have multiple consumers, and the freshness budget can tolerate eventual consistency.
The write transaction closes before the event leaves. Use an outbox or accept that you will lose events.
Pick the read store on shape, not on hype. Elasticsearch for searchable sorted reads, Redis for hot simple lookups, a denormalized Postgres table for everything in between.
Measure projection freshness, not consumer throughput. The consumer can be healthy while the projection is silently rotten.
Own the rebuild runbook before you ship the projection. A projector you cannot reindex is technical debt with extra steps.
CQRS does not paper over writer-side incidents. It defers them.

Thanks for reading. If you’ve got thoughts, send them my way.