How I think about aggregate sizing in DDD: small aggregates, one per transaction, optimistic concurrency, and how to split a god-aggregate without breaking invariants.
It was past midnight UTC, only about 6 p.m. Pacific, and login error rate at the creator economy platform I worked at had just hit 100%. A schema migration on users had grabbed an ACCESS EXCLUSIVE lock and was holding it for what would end up being 87 seconds. I was 8 minutes into a Slack DM when the war room channel filled up with the FIRE emoji.
The post-mortem looked, on paper, like a Rails problem. add_column_with_default on a table with hundreds of millions of rows is just slow on Aurora. That’s the surface diagnosis.
The actual problem was older. User was a god-aggregate. We loaded the whole thing to update one field. Years earlier, “user has subscriptions has invoices” had felt like good modelling. Eventually it was the table everything wanted a lock on.
So this post is about aggregate sizing. Where do I draw the boundary? When do I split? And how do I recognize I’ve drawn it in the wrong place?
Vaughn Vernon’s heuristic still holds. Prefer small aggregates. One root, the minimum entities needed to enforce a real invariant, and IDs (not references) to everything else.
The trap is treating “real-world hierarchy” as the modelling input. A customer has orders, an order has line items, line items have discounts. That’s a tree. It’s not an aggregate. The aggregate exists to protect a consistency rule. If the only rule is “an order’s line items sum to its total”, then Order is the aggregate and customer is just an ID on it.
// src/domain/order/order.ts
import { randomUUID } from "node:crypto";
import { LineItem } from "./line-item";
import { OrderPlaced } from "./events/order-placed";

export class Order {
  private _events: unknown[] = [];

  private constructor(
    public readonly id: string,
    public readonly customerId: string,
    private readonly items: LineItem[],
    private _status: "draft" | "placed" | "cancelled",
    public readonly version: number,
  ) {}

  static create(customerId: string, items: LineItem[]): Order {
    if (items.length === 0) {
      throw new Error("Order must have at least one line item");
    }
    return new Order(randomUUID(), customerId, items, "draft", 0);
  }

  place(): void {
    if (this._status !== "draft") {
      throw new Error(`Cannot place order in status ${this._status}`);
    }
    this._status = "placed";
    this._events.push(new OrderPlaced(this.id, this.customerId, this.totalCents()));
  }

  totalCents(): number {
    return this.items.reduce((s, li) => s + li.subtotalCents(), 0);
  }

  pullEvents(): unknown[] {
    const out = this._events;
    this._events = [];
    return out;
  }
}
Notice what’s missing. No customer reference. No payments collection. No shippingAddress entity hanging off. The customer’s billing profile, auth credentials, preferences. They live in their own aggregates. Order only knows the customer’s ID.
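The Order above imports a LineItem that this post never shows. As a rough sketch of what such a value object might look like: only subtotalCents() comes from the code above, while the sku, unitPriceCents and quantity fields and the validation rules are assumptions made for illustration.

```typescript
// Hypothetical LineItem value object. Only subtotalCents() is taken from the
// Order code above; the fields and validation rules are assumptions.
class LineItem {
  constructor(
    public readonly sku: string,
    public readonly unitPriceCents: number,
    public readonly quantity: number,
  ) {
    // Money is kept in integer cents so sums never pick up float rounding.
    if (!Number.isInteger(unitPriceCents) || unitPriceCents < 0) {
      throw new Error("unitPriceCents must be a non-negative integer");
    }
    if (!Number.isInteger(quantity) || quantity <= 0) {
      throw new Error("quantity must be a positive integer");
    }
  }

  subtotalCents(): number {
    return this.unitPriceCents * this.quantity;
  }
}
```

Because the item validates itself on construction, Order.totalCents() can sum subtotals without re-checking each one.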
The rule that earns its keep more than any other: one aggregate root mutated per transaction.
Anything else, you do via a domain event the aggregate emits, picked up by a handler that opens its own transaction against its own aggregate. Yes, that means eventual consistency between aggregates. State that out loud to your team. It’s not a bug. It’s the boundary you chose.
Break this rule and you get the symptoms in waves. Cross-aggregate deadlocks. Lost updates because two aggregates fought for the same row. Invariants that pretend to be transactional but actually aren’t, because the second save() happened in a different commit half a second later.
// src/infra/order/order.repository.ts
import { PrismaClient } from "@prisma/client";
import { Order } from "../../domain/order/order";

export class ConcurrencyError extends Error {}

export class OrderRepository {
  constructor(private readonly prisma: PrismaClient) {}

  // Update path only; inserting a brand-new order is not shown here.
  async save(order: Order, expectedVersion: number): Promise<void> {
    await this.prisma.$transaction(async (tx) => {
      // Matches zero rows if another writer has already bumped the version.
      const updated = await tx.order.updateMany({
        where: { id: order.id, version: expectedVersion },
        data: { /* mapped fields */ version: expectedVersion + 1 },
      });
      if (updated.count === 0) {
        throw new ConcurrencyError(
          `Order ${order.id} was modified since version ${expectedVersion}`,
        );
      }
      for (const evt of order.pullEvents()) {
        await tx.outbox.create({
          data: {
            aggregateId: order.id,
            // pullEvents() returns unknown[], so narrow before reading the class name.
            type: (evt as object).constructor.name,
            payload: JSON.stringify(evt),
          },
        });
      }
    });
  }
}
The updateMany with version: expectedVersion is the optimistic-concurrency check. Zero rows updated means somebody else saved between when I loaded and when I wrote, and I get a ConcurrencyError instead of silently overwriting. The outbox row goes inside the same transaction as the aggregate write, so the event publication is atomic with the state change. No two-phase commit, no separate message bus call mid-transaction.
I prefer optimistic concurrency over pessimistic locking for almost every domain mutation. Pessimistic locks on a hot row at scale fail the same way the migration I opened with did: every other writer queues behind one lock holder. A version column and a retry on conflict are cheaper, behave better under load, and push the conflict-resolution decision back into the domain layer where it belongs.
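The "retry on conflict" half of that can be sketched as a small helper. This is a minimal sketch, not the code we ran: withOptimisticRetry and its maxAttempts default are hypothetical names, and ConcurrencyError is redeclared locally only so the block stands alone.

```typescript
// Local stand-in for the repository's ConcurrencyError so this sketch is self-contained.
class ConcurrencyError extends Error {}

/**
 * Hypothetical helper: runs `attempt` and retries it when it fails with a
 * ConcurrencyError. Each attempt is expected to reload the aggregate,
 * re-apply the command and save, so domain rules are re-checked against
 * fresh state rather than blindly replayed.
 */
async function withOptimisticRetry<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let i = 1; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      // Rethrow anything that is not a version conflict, and give up
      // once the attempt budget is spent.
      if (!(err instanceof ConcurrencyError) || i >= maxAttempts) throw err;
    }
  }
}
```

The important design choice is that the whole load-mutate-save cycle sits inside attempt. Retrying only the save would write a decision made against stale state, which is the lost update the version check exists to prevent.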
Symptoms you can feel without running a profiler:
That last one. That was us. User owned credentials, subscription state, billing profile, notification preferences, marketing flags, feature opt-ins. Adding one non-null column touched everyone. The migration used strong_migrations and its safer add_column_with_default helper, and it still held the lock for close to 90 seconds at our row count. Rolling it back mid-flight would have been worse than letting it finish, so we waited. 87 seconds. Logins recovered ~15 seconds after the lock released because the dependent service had a tight retry loop. Maybe 30K affected sign-in attempts.
What I changed afterwards wasn’t the migration policy. It was the aggregate. The migration would have been trivial if User had been three aggregates instead of one.
The refactor isn’t glamorous. Three steps I’ve done enough times to trust.
List the invariants the aggregate enforces. Not the fields, the invariants. “Email must be unique per active account” is an invariant. “A user has a name and a billing address” is not; it’s a join.
Group the invariants by the smallest set of fields they touch. Each group is a candidate aggregate. If “email uniqueness” only needs email, status, id, that’s UserAccount. If “subscription state matches receipts” needs userId, plan, status, renewedAt, that’s UserSubscription. They reference each other by ID, not by object.
The cross-aggregate workflows become domain events. Account suspends, subscription cancels. Subscription renews, account gets a feature flag flipped. Async, idempotent, through the outbox.
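To make the second step concrete, here is a rough sketch of what the UserSubscription candidate might look like once it is split out. The field list (userId, plan, status, renewedAt) comes from the step above; the SubscriptionCancelled event, the cancel() and renew() methods and their rules are assumptions for illustration.

```typescript
type SubscriptionStatus = "active" | "cancelled";

// Hypothetical event; its name and payload are assumptions for this sketch.
class SubscriptionCancelled {
  constructor(
    public readonly subscriptionId: string,
    public readonly userId: string,
  ) {}
}

class UserSubscription {
  private _events: unknown[] = [];

  constructor(
    public readonly id: string,
    public readonly userId: string, // UserAccount is referenced by ID, never by object
    public readonly plan: string,
    private _status: SubscriptionStatus,
    private _renewedAt: Date,
    public readonly version: number,
  ) {}

  get status(): SubscriptionStatus {
    return this._status;
  }

  get renewedAt(): Date {
    return this._renewedAt;
  }

  cancel(): void {
    if (this._status === "cancelled") return; // idempotent, like UserAccount.suspend()
    this._status = "cancelled";
    this._events.push(new SubscriptionCancelled(this.id, this.userId));
  }

  // Assumed rule: only an active subscription renews, and only forward in time.
  renew(paidAt: Date): void {
    if (this._status !== "active") {
      throw new Error("Cannot renew a cancelled subscription");
    }
    if (paidAt.getTime() <= this._renewedAt.getTime()) return; // stale or duplicate receipt
    this._renewedAt = paidAt;
  }

  pullEvents(): unknown[] {
    const out = this._events;
    this._events = [];
    return out;
  }
}
```

Nothing in this class can load or mutate the account. An "account suspended, so cancel the subscription" workflow would reach it only through an event handler calling cancel() in its own transaction.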
// src/domain/user/user-account.ts
import { AccountSuspended } from "./events/account-suspended";

export class UserAccount {
  private _events: unknown[] = [];

  constructor(
    public readonly id: string,
    public readonly email: string,
    private _status: "active" | "suspended" | "deleted",
    public readonly version: number,
  ) {}

  suspend(reason: string): void {
    if (this._status !== "active") return; // idempotent
    this._status = "suspended";
    this._events.push(new AccountSuspended(this.id, reason));
  }

  pullEvents(): unknown[] {
    const out = this._events;
    this._events = [];
    return out;
  }
}

// src/application/billing/on-account-suspended.ts
import { UserBillingProfileRepository } from "../../infra/user/user-billing-profile.repository";

export class OnAccountSuspended {
  constructor(private readonly repo: UserBillingProfileRepository) {}

  async handle(evt: { userId: string }): Promise<void> {
    const profile = await this.repo.findByUserId(evt.userId);
    if (!profile) return;
    profile.pauseAutoRenew();
    await this.repo.save(profile, profile.version);
  }
}
UserAccount doesn’t know UserBillingProfile exists. It announces what happened. A handler in the billing context reacts. Different transaction, different aggregate, different team can own the code.
Derived state rots silently. That’s the thing nobody tells you. If a projection exists, it needs its own freshness metric, not the consumer’s lag.
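One way to read "its own freshness metric" is sketched below. It assumes each projection records the timestamp of the newest event it has actually applied, and compares that against the newest change in the source of truth. ProjectionCheckpoint, projectionStalenessMs and isStale are hypothetical names, not something from our codebase.

```typescript
// Hypothetical checkpoint the projection writes every time it applies an event.
interface ProjectionCheckpoint {
  /** Timestamp of the newest event the projection has actually applied. */
  lastAppliedEventAt: Date;
}

/**
 * Staleness is how far the projection trails the newest change in the source
 * of truth, in milliseconds. Zero means it has caught up.
 */
function projectionStalenessMs(
  latestSourceEventAt: Date,
  checkpoint: ProjectionCheckpoint,
): number {
  const gap = latestSourceEventAt.getTime() - checkpoint.lastAppliedEventAt.getTime();
  return Math.max(0, gap);
}

/** True when the projection trails the source by more than the allowed threshold. */
function isStale(stalenessMs: number, thresholdMs: number): boolean {
  return stalenessMs > thresholdMs;
}
```

Consumer lag can read zero while a handler quietly drops or skips events. A number measured on the projection itself catches that case, because the checkpoint stops advancing.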
// src/infra/outbox/relay.ts
import { PrismaClient } from "@prisma/client";

export async function relayBatch(
  prisma: PrismaClient,
  publish: (e: unknown) => Promise<void>,
): Promise<void> {
  // Lowest ids first, so rows go out in id order.
  const batch = await prisma.outbox.findMany({
    where: { publishedAt: null },
    orderBy: { id: "asc" },
    take: 100,
  });
  for (const row of batch) {
    try {
      await publish(JSON.parse(row.payload));
      // If this update fails after a successful publish, the row is sent
      // again on the next tick: delivery is at-least-once.
      await prisma.outbox.update({
        where: { id: row.id },
        data: { publishedAt: new Date() },
      });
    } catch {
      // Leave unpublished, next tick retries. Don't poison the batch.
      break;
    }
  }
}
Boring on purpose. The thing that makes this work, that everyone forgets, is that the consumer on the other side must be idempotent. Same event, same effect, no matter how many times it arrives.
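A minimal sketch of one way to get that idempotency is to deduplicate on an event ID. It assumes every event carries a unique eventId, which the outbox rows shown above do not include, and it uses an in-memory set as a stand-in for a processed-events table. Envelope, ProcessedEventStore and idempotent are hypothetical names.

```typescript
// Assumption: each delivered event is wrapped with a unique eventId.
interface Envelope<T> {
  eventId: string;
  payload: T;
}

// Stand-in for a processed-events table keyed by event ID.
interface ProcessedEventStore {
  has(eventId: string): Promise<boolean>;
  record(eventId: string): Promise<void>;
}

class InMemoryProcessedEventStore implements ProcessedEventStore {
  private readonly seen = new Set<string>();

  async has(eventId: string): Promise<boolean> {
    return this.seen.has(eventId);
  }

  async record(eventId: string): Promise<void> {
    this.seen.add(eventId);
  }
}

/**
 * Wraps a handler so a redelivered event is skipped. With a real database,
 * the handler's write and the record() call should commit in one transaction;
 * otherwise a crash between them reintroduces duplicates.
 */
function idempotent<T>(
  store: ProcessedEventStore,
  handler: (payload: T) => Promise<void>,
): (evt: Envelope<T>) => Promise<void> {
  return async (evt) => {
    if (await store.has(evt.eventId)) return; // already applied, nothing to do
    await handler(evt.payload);
    await store.record(evt.eventId);
  };
}
```

Handlers whose effect is naturally idempotent, like suspend() above returning early when the account is not active, may not need the dedup table at all. The wrapper is for effects that are not safe to repeat, such as sending an email or charging a card.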
Thanks for reading. If you’ve got thoughts, send them my way.