How to design read model projections in DDD systems: sync versus async, multiple projections per event stream, safe rebuilds, and lag monitoring.
A federation tournament wrapped up on a Saturday night at the combat-sports tournament platform I CTO’d in London. The new champion should have shown up at the top of the public rankings within minutes. Eight hours later he was still ranked second. The athlete himself noticed before we did. He posted a screenshot of our rankings page on Twitter, tagging the federation. The rankings consumer was happily pulling events off Kafka the whole time. It just wasn’t writing them anywhere useful.
That’s the bit about read model projections nobody tells you up front. The consumer can be healthy and the read model can still be lying.
In a DDD system the aggregate is shaped for invariants, not for queries. An Order aggregate cares about state transitions, totals, refunds. The customer’s order history page cares about a flat list ordered by date, paginated, filterable by status. Those two shapes drift the moment you have any real traffic. You can paper over it with joins and views for a while. Then you hit the day when one query needs five joins and a window function and you realize you’re treating the write model like a query layer it was never meant to be.
Read model projections are the answer DDD has been suggesting since the start. Domain events are the seam. The write side emits OrderPlaced, OrderShipped, OrderRefunded. The read side consumes those and shapes whatever it needs.
The honest version of the rule: in any non-trivial DDD system, the read side belongs in projections built from domain events, not in clever joins on aggregate tables. And async beats sync almost every time.
A sync projector lives inside the same transaction as the command. Aggregate writes, projection writes, all-or-nothing. You get strong consistency, your user reads their own write, no caveats.
You also get a slower write path, coupling between two storage technologies, and the projector becomes part of the command’s blast radius. If the projection store is slow, the command is slow. If it’s down, the command fails.
I use sync in exactly one case: when a user must read their own write on the next page load, and there’s no UI affordance to fake it (no “your refund is being processed” banner). Otherwise async.
Here’s a transactional outbox writer, which is the version of sync I actually ship. The aggregate change and the domain event go into the database in one transaction. A separate worker drains the outbox and publishes to Kafka.
// src/orders/infrastructure/order.repository.ts
import { DataSource, EntityManager } from 'typeorm';
import { Order } from '../domain/order.entity';
import { DomainEvent } from '../../shared/domain/domain-event';
import { OutboxRecord } from '../../shared/infrastructure/outbox.entity';
export class OrderRepository {
constructor(private readonly ds: DataSource) {}
async save(order: Order): Promise<void> {
await this.ds.transaction(async (tx: EntityManager) => {
await tx.getRepository(Order).save(order);
const events = order.pullDomainEvents();
if (events.length === 0) return;
await tx.getRepository(OutboxRecord).insert(
events.map((e: DomainEvent) => ({
aggregateId: order.id,
aggregateType: 'Order',
eventType: e.constructor.name,
payload: e.toJSON(),
occurredAt: e.occurredAt,
processedAt: null,
})),
);
});
}
}
The reason this works and a “publish-then-write” pattern doesn’t: the database is the only thing that can give you transactional guarantees across the aggregate and the event log. Kafka can’t. Your in-process event bus can’t.
A small side note while we’re here. A read replica is not a read model. I once watched an on-call engineer scale up Aurora reader instances to fix what they thought was a read-side problem on the community product at the creator economy platform I worked at. Replica lag had ballooned past 14 minutes during a Tuesday morning spike. Bumping instance class didn’t help. The real cause was an ANALYZE job holding write-side locks and starving WAL emission. About 22 minutes of degraded reads. A projection wouldn’t have prevented the WAL issue, but it would have meant the user-facing query path wasn’t sitting on the same physical lineage as the write path. Different problem, related lesson: the further your read model is from your aggregate’s storage, the more room you have when one of them misbehaves.
The thing I see junior teams get wrong here: they build one read model called “OrderView” and shove every query into it. Six months later it’s got 40 columns, half of them nullable, and it’s serving four pages with conflicting requirements.
One stream feeds many projections. Each projection is shaped for one query. They share an event log, not a table.
// src/orders/projections/order.projector.ts
import { Kafka, Consumer } from 'kafkajs';
import { Pool } from 'pg';
type OrderEvent =
| { type: 'OrderPlaced'; eventId: string; orderId: string; customerId: string; total: number; occurredAt: string }
| { type: 'OrderShipped'; eventId: string; orderId: string; trackingNo: string; occurredAt: string }
| { type: 'OrderRefunded'; eventId: string; orderId: string; amount: number; occurredAt: string };
export class OrderProjector {
private readonly name = 'order_projector_v3';
constructor(private readonly kafka: Kafka, private readonly db: Pool) {}
async start(): Promise<void> {
const consumer: Consumer = this.kafka.consumer({ groupId: this.name });
await consumer.connect();
await consumer.subscribe({ topic: 'orders.events', fromBeginning: false });
await consumer.run({
autoCommit: false,
eachMessage: async ({ message, partition, topic }) => {
const event = JSON.parse(message.value!.toString()) as OrderEvent;
const client = await this.db.connect();
try {
await client.query('BEGIN');
const seen = await client.query(
'SELECT 1 FROM projection_events WHERE projector = $1 AND event_id = $2',
[this.name, event.eventId],
);
if (seen.rowCount && seen.rowCount > 0) {
await client.query('COMMIT');
return;
}
await this.applyCustomerOrderList(client, event);
await this.applyFulfillmentDashboard(client, event);
await client.query(
'INSERT INTO projection_events (projector, event_id, processed_at) VALUES ($1, $2, now())',
[this.name, event.eventId],
);
await client.query('COMMIT');
} catch (err) {
await client.query('ROLLBACK');
throw err;
} finally {
client.release();
}
},
});
}
private async applyCustomerOrderList(client: any, e: OrderEvent): Promise<void> {
if (e.type === 'OrderPlaced') {
await client.query(
`INSERT INTO rm_customer_order_list (order_id, customer_id, total, status, placed_at)
VALUES ($1, $2, $3, 'placed', $4)
ON CONFLICT (order_id) DO NOTHING`,
[e.orderId, e.customerId, e.total, e.occurredAt],
);
} else if (e.type === 'OrderShipped') {
await client.query(
`UPDATE rm_customer_order_list SET status = 'shipped', shipped_at = $2 WHERE order_id = $1`,
[e.orderId, e.occurredAt],
);
} else if (e.type === 'OrderRefunded') {
await client.query(
`UPDATE rm_customer_order_list SET status = 'refunded', refunded_amount = $2 WHERE order_id = $1`,
[e.orderId, e.amount],
);
}
}
private async applyFulfillmentDashboard(client: any, e: OrderEvent): Promise<void> {
if (e.type === 'OrderPlaced') {
await client.query(
`INSERT INTO rm_fulfillment_queue (order_id, queued_at) VALUES ($1, $2)
ON CONFLICT (order_id) DO NOTHING`,
[e.orderId, e.occurredAt],
);
} else if (e.type === 'OrderShipped') {
await client.query(`DELETE FROM rm_fulfillment_queue WHERE order_id = $1`, [e.orderId]);
}
}
}
Two read models, one consumer group, one event log. Add a third one tomorrow and the existing ones don’t change. That’s the property you’re paying for.
The idempotency check is doing real work. Kafka guarantees at-least-once. Your projector will see the same event twice eventually, and it has to be safe.
The thing that makes projections worth the architectural cost: you can blow them away and rebuild from the event log. They are derived state. The aggregate is the source of truth.
The pattern I use, every time:
rm_customer_order_list_v4.This is exactly what saved us during the federation rankings incident I opened with. After the consumer drift, the read index was full of wrong rankings. We didn’t try to patch the index. We bulk-wrote a fresh one from PostgreSQL and atomic-aliased the Elasticsearch read alias. About 25 minutes start to finish. The page caught up cleanly when the alias swapped.
// scripts/rebuild-order-projection.ts
import { Pool } from 'pg';
import { Kafka } from 'kafkajs';
import { OrderProjector } from '../src/orders/projections/order.projector';
async function rebuild(version: string): Promise<void> {
const db = new Pool({ connectionString: process.env.DATABASE_URL });
const kafka = new Kafka({ brokers: process.env.KAFKA_BROKERS!.split(',') });
const tableNew = `rm_customer_order_list_${version}`;
await db.query(`CREATE TABLE ${tableNew} (LIKE rm_customer_order_list INCLUDING ALL)`);
const projector = new OrderProjector(kafka, db);
await projector.runFromBeginning({ targetTable: tableNew, fromOffset: 0 });
await db.query('BEGIN');
try {
await db.query(`ALTER TABLE rm_customer_order_list RENAME TO rm_customer_order_list_old`);
await db.query(`ALTER TABLE ${tableNew} RENAME TO rm_customer_order_list`);
await db.query('COMMIT');
} catch (err) {
await db.query('ROLLBACK');
throw err;
}
}
rebuild(process.argv[2]).catch((err) => {
console.error(err);
process.exit(1);
});
And the projection state schema, because this is the part that gets forgotten:
CREATE TABLE projection_events (
projector TEXT NOT NULL,
event_id UUID NOT NULL,
processed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (projector, event_id)
);
CREATE INDEX projection_events_processed_at_idx
ON projection_events (projector, processed_at DESC);
Composite primary key on (projector, event_id) is doing all the idempotency work. Different projectors get to apply the same event independently.
This is the part I learned from the rankings incident. The original consumer had a circuit breaker around its bulk-write client. The Elasticsearch cluster had a transient blip overnight. The circuit opened. It never closed. The consumer kept consuming from Kafka and committing offsets and reporting healthy. It just stopped writing.
“Is the consumer up” is the wrong metric. The right one is freshness: how far behind is the read model from the events the write side has emitted.
// src/orders/projections/order.projector.ts (additional method)
async emitLagMetric(): Promise<void> {
const result = await this.db.query<{ oldest_pending: Date | null }>(
`SELECT MIN(occurred_at) AS oldest_pending
FROM outbox
WHERE processed_at IS NULL
AND aggregate_type = 'Order'`,
);
const oldest = result.rows[0]?.oldest_pending;
const lagSeconds = oldest ? (Date.now() - oldest.getTime()) / 1000 : 0;
metrics.gauge('projection.lag_seconds', lagSeconds, {
projector: this.name,
aggregate: 'Order',
});
}
Page on projection.lag_seconds > 60 per projector, not per consumer group. The consumer can be 0 lag on Kafka and your read model can still be 8 hours stale. Don’t trust throughput. Measure freshness.
DDD purity is not the goal. Pragmatic adoption is.
If your domain is small, your queries are simple, and your write model already serves your reads at p99 you can live with, don’t build projections. You’ll add a Kafka topic and a consumer process and an idempotency table for what could have been SELECT * FROM orders WHERE customer_id = ?. I’ve watched teams introduce CQRS in domains that didn’t need it, and the maintenance cost outran the architectural benefit by month three.
Start with projections when one of these is true: your read shape genuinely diverges from your aggregate shape, you have more than one read model needing the same events, or your read path is starting to dictate write-side concessions. Until then, query the write table.
Thanks for reading. If you’ve got thoughts, send them my way.