When a DDD bounded context earns its own service, and when splitting it just gives you a distributed monolith with extra steps.
Thursday afternoon at the digital product agency I led engineering at in London. We’d spent months migrating a portfolio of legacy projects to a DDD-flavored architecture. Bounded contexts mapped, ubiquitous language docs per domain. And one of the senior engineers walked in with a slide deck titled “Splitting the Billing Context into 4 Microservices.”
I said no. He was kind of right: the context had grown and the team was stepping on each other’s PRs. But what he wanted was to take one bounded context with a clean internal model and shatter it into four network-coupled services because that’s what the article he’d read on the train said to do.
That meeting is the prompt for this post.
A bounded context is a linguistic boundary. Inside it, words mean exactly one thing. Outside, the same word probably means something else. Order in checkout is not Order in fulfillment, even if they share an id.
A microservice is a deployment and ownership boundary. Its own pipeline, its own database ideally, its own pager rotation.
Related but not equal. A bounded context says where the model has to stay consistent. A microservice says where the operational responsibility lives. You can have a bounded context living happily inside a monolith. You can also have a microservice that, sadly, contains three bounded contexts mashed together because someone shipped it on a deadline.
The trap is treating bounded context discovery as if it produced a service map. It doesn’t. It produces a candidate list.
Before you split anything, classify the subdomain. This is the part the DDD books cover and most engineering decks skip.
The classification changes the answer to “should this be its own service” more than any architecture diagram will.
// src/contexts/registry.ts
export type SubdomainType = "core" | "supporting" | "generic";

export interface BoundedContextSpec {
  name: string;
  type: SubdomainType;
  ubiquitousLanguageDoc: string;
  ownedBy: string;
  publishesIntegrationEvents: string[];
  consumesIntegrationEvents: string[];
}

export const contexts: BoundedContextSpec[] = [
  {
    name: "checkout",
    type: "core",
    ubiquitousLanguageDoc: "docs/contexts/checkout.md",
    ownedBy: "team-commerce",
    publishesIntegrationEvents: ["OrderPlaced", "PaymentAuthorized"],
    consumesIntegrationEvents: ["InventoryReserved"],
  },
  {
    name: "fulfillment",
    type: "supporting",
    ubiquitousLanguageDoc: "docs/contexts/fulfillment.md",
    ownedBy: "team-ops",
    publishesIntegrationEvents: ["ShipmentDispatched"],
    consumesIntegrationEvents: ["OrderPlaced"],
  },
  {
    name: "identity",
    type: "generic",
    ubiquitousLanguageDoc: "docs/contexts/identity.md",
    ownedBy: "team-platform",
    publishesIntegrationEvents: ["UserRegistered"],
    consumesIntegrationEvents: [],
  },
];
This registry gives architecture reviews a concrete artifact instead of a whiteboard photo.
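One thing a registry like this makes cheap is a consistency check in CI. The sketch below is an assumption about how you might use it, not something from our codebase: `ContextEvents`, `findOrphanedSubscriptions`, and `sampleRegistry` are hypothetical names, and the rule itself (every consumed event should have a publisher somewhere in the registry) may be too strict if some publishers live outside it.

```typescript
// Minimal stand-in for the registry shape above; only the fields this check needs.
interface ContextEvents {
  name: string;
  publishesIntegrationEvents: string[];
  consumesIntegrationEvents: string[];
}

interface OrphanedSubscription {
  context: string;
  event: string;
}

// Returns every (context, event) pair where the event is consumed
// but no context in the registry declares that it publishes it.
function findOrphanedSubscriptions(
  registry: ContextEvents[],
): OrphanedSubscription[] {
  const published = new Set<string>();
  for (const ctx of registry) {
    for (const event of ctx.publishesIntegrationEvents) {
      published.add(event);
    }
  }

  const orphans: OrphanedSubscription[] = [];
  for (const ctx of registry) {
    for (const event of ctx.consumesIntegrationEvents) {
      if (!published.has(event)) {
        orphans.push({ context: ctx.name, event });
      }
    }
  }
  return orphans;
}

// Same publish/consume lists as the three contexts in the registry above.
const sampleRegistry: ContextEvents[] = [
  {
    name: "checkout",
    publishesIntegrationEvents: ["OrderPlaced", "PaymentAuthorized"],
    consumesIntegrationEvents: ["InventoryReserved"],
  },
  {
    name: "fulfillment",
    publishesIntegrationEvents: ["ShipmentDispatched"],
    consumesIntegrationEvents: ["OrderPlaced"],
  },
  {
    name: "identity",
    publishesIntegrationEvents: ["UserRegistered"],
    consumesIntegrationEvents: [],
  },
];

const orphans = findOrphanedSubscriptions(sampleRegistry);
```

Run against the three contexts listed here, the check flags checkout's `InventoryReserved` subscription, because no listed context publishes it. Whether that is a missing registry entry or a real gap is exactly the question an architecture review should be asking.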
My rule, after a lot of scar tissue, is short. A context graduates to its own service when at least two of these are true.
If only one is true, keep it in a modular monolith with a clear internal boundary. Same repo, separate modules, separate schemas if you can swing it.
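// src/contexts/checkout/domain/order.ts
import { AggregateRoot } from "../../shared/aggregate-root";
import { Money } from "../../shared/money";
import { OrderPlaced } from "./events/order-placed";
// Assumed module paths: the listing uses these types without showing where they live.
import { EmptyOrderError } from "./errors/empty-order-error";
import { OrderLine } from "./order-line";
import { OrderStatus } from "./order-status";

export class Order extends AggregateRoot {
  // Private constructor: the only way to create an Order is through a named factory.
  private constructor(
    private readonly id: string,
    private readonly customerId: string,
    private readonly lines: OrderLine[],
    private status: OrderStatus,
  ) {
    super();
  }

  static place(input: {
    id: string;
    customerId: string;
    lines: OrderLine[];
  }): Order {
    // Invariant: an order must contain at least one line.
    if (input.lines.length === 0) {
      throw new EmptyOrderError(input.id);
    }
    const order = new Order(input.id, input.customerId, input.lines, "placed");
    // Record the integration event; whoever persists the aggregate publishes it.
    order.record(
      new OrderPlaced({
        orderId: order.id,
        customerId: order.customerId,
        total: order.total(),
        placedAt: new Date(),
      }),
    );
    return order;
  }

  total(): Money {
    return this.lines.reduce(
      (sum, line) => sum.plus(line.subtotal()),
      Money.zero("USD"),
    );
  }
}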
Notice. Nothing in this aggregate cares whether the fulfillment context is across an HTTP boundary or in the same process. The integration event is the seam. Move the seam to Kafka later if you have to.
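To make the seam concrete, here is a minimal sketch of what sits on the other side of `record`. Everything in it is hypothetical (`IntegrationEvent`, `IntegrationEventBus`, `InProcessEventBus`); the post's real bus is not shown. The idea is only that application code depends on the interface, so a broker-backed implementation can replace the in-process one without touching the aggregate.

```typescript
// Hypothetical shapes, for illustration only.
interface IntegrationEvent {
  type: string;
  payload: Record<string, unknown>;
}

type Handler = (event: IntegrationEvent) => Promise<void> | void;

// The seam: contexts depend on this interface, never on a transport.
interface IntegrationEventBus {
  publish(event: IntegrationEvent): Promise<void>;
  subscribe(type: string, handler: Handler): void;
}

// In-process implementation for the modular monolith.
// A broker-backed class implementing the same interface could replace it later.
class InProcessEventBus implements IntegrationEventBus {
  private readonly handlers = new Map<string, Handler[]>();

  subscribe(type: string, handler: Handler): void {
    const existing = this.handlers.get(type) ?? [];
    this.handlers.set(type, [...existing, handler]);
  }

  async publish(event: IntegrationEvent): Promise<void> {
    // Handlers run one after another, in subscription order.
    for (const handler of this.handlers.get(event.type) ?? []) {
      await handler(event);
    }
  }
}

// Fulfillment reacts to checkout's event without importing anything from checkout.
const bus: IntegrationEventBus = new InProcessEventBus();
const dispatchedFor: string[] = [];

bus.subscribe("OrderPlaced", (event) => {
  dispatchedFor.push(String(event.payload.orderId));
});

void bus.publish({ type: "OrderPlaced", payload: { orderId: "order-1" } });
```

The in-process version gives you none of a broker's durability or retry semantics, so the swap is not free. What it buys you is that the swap is a change of implementation, not a change of model.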
At the real-time trading platform I architected, we ran one fat Node.js service for the entire data plane. Quotes, charts, sessions, watchlists, alerts, all in one process. Then preopen on a Tuesday after a long bank-holiday weekend hit us with a reconnection storm.
The setting. Sockets over a gateway tier behind nginx. At 09:31:14, 74 seconds after the open, connections started dropping en masse. Clients reconnected, dropped, reconnected. Within 90 seconds every gateway pod was pinned at 100% CPU. p99 tick fan-out went from around 80 ms to around 3 s. Stale charts on a trading product is the worst failure mode there is.
First wrong fix. Scaled gateway pods 3x via the autoscaler override. New pods came online, hit the storm, went CPU-bound within 20 seconds. I was feeding the fire.
Real fix. Two things in parallel. An emergency client-side config push with jittered exponential backoff, min 200ms, max 30s, factor 2, jitter plus or minus 50%. And a per-IP rate limiter at nginx, tight at 3 new connections per second per IP. Around 8 minutes later the pool stabilized and tick fan-out came back under 200 ms.
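For reference, the backoff parameters above translate into something like the sketch below. It is a reconstruction under assumptions, not the client code we pushed: `BackoffConfig` and `reconnectDelayMs` are hypothetical names, and I am assuming the 30 s cap applies to the base delay before jitter, so a jittered delay can land anywhere between 15 s and 45 s once the cap is reached.

```typescript
interface BackoffConfig {
  minDelayMs: number;
  maxDelayMs: number;
  factor: number;
  jitterRatio: number; // 0.5 means plus or minus 50%
}

// The values quoted in the incident write-up.
const reconnectBackoff: BackoffConfig = {
  minDelayMs: 200,
  maxDelayMs: 30000,
  factor: 2,
  jitterRatio: 0.5,
};

// attempt is zero-based: 0 for the first reconnect after a drop.
// random must return a value in [0, 1); it is injectable so the function can be tested.
function reconnectDelayMs(
  attempt: number,
  config: BackoffConfig = reconnectBackoff,
  random: () => number = Math.random,
): number {
  const exponential = config.minDelayMs * Math.pow(config.factor, attempt);
  // Assumption: cap the base delay first, then apply jitter around it.
  const capped = Math.min(exponential, config.maxDelayMs);
  // Map random() from [0, 1) to a multiplier in [1 - jitterRatio, 1 + jitterRatio).
  const multiplier = 1 + (random() * 2 - 1) * config.jitterRatio;
  return Math.round(capped * multiplier);
}
```

The jitter is the part that matters in a storm. Without it, every client that dropped at the same moment retries at the same moment, and the backoff only spaces out the stampedes instead of dissolving them.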
Cost and lesson. Around 14 minutes of degraded tick delivery during market open. The lesson for service boundaries. The bounded context for the gateway was clean, the events were in place, the deploy boundary was the only thing missing. If it had been its own service, we’d have shipped the limiter and the config push without dragging the entire data plane along.
That was the day I started taking the modular-monolith-as-stepping-stone idea seriously.
The other place teams get this wrong. They draw the bounded contexts on the whiteboard, then ignore the team shape that’s actually shipping the code.
At the combat-sports tournament platform I CTO’d in London, the rankings context originally lived inside the tournaments service. Then the team split, and rankings got handed to a new squad that did not know Kafka.
The setting. Rankings page on Elasticsearch, fed by a rankings-indexer consumer reading off Kafka, with PostgreSQL as system of record. A federation tournament finished on a Saturday night.
What went wrong. Eight hours later, the rankings page still showed the old number one. The athlete had a verified social account and tweeted a screenshot tagging the federation. The indexer had stopped projecting events overnight but no one paged, because it was still consuming from Kafka, just not writing to Elasticsearch.
First wrong fix. Logs were quiet, restart the indexer. Restart cleared the offset and reprojected from a checkpoint 12 hours stale. New events fine, old wrong rankings still in the index.
Real fix. Full reindex from PostgreSQL via a one-shot job, then atomic-alias the read index to the new one. Around 25 minutes to catch up. Root cause. The bulk-write client had silently entered a “circuit open” state after a transient cluster blip the night before, and the breaker had no path back to closed. Patched it to attempt half-open every 60 seconds.
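The breaker bug is easier to see in code. This is a minimal sketch of a breaker that does have a path back to closed, not the bulk-write client we patched: `CircuitBreaker`, `failureThreshold`, and `retryAfterMs` are hypothetical names, and the calls are synchronous to keep the state machine readable, where a real Elasticsearch write would be async.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly retryAfterMs: number,
    // Injectable clock so the transitions can be tested without waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  currentState(): BreakerState {
    // The missing path in the incident: an open breaker must eventually allow a probe.
    if (
      this.state === "open" &&
      this.now() - this.openedAtMs >= this.retryAfterMs
    ) {
      this.state = "half-open";
    }
    return this.state;
  }

  call<T>(operation: () => T): T {
    if (this.currentState() === "open") {
      throw new Error("circuit open: call rejected");
    }
    try {
      const result = operation();
      // Any success, including a half-open probe, closes the breaker.
      this.consecutiveFailures = 0;
      this.state = "closed";
      return result;
    } catch (error) {
      this.consecutiveFailures += 1;
      // A failed probe reopens immediately; otherwise open at the threshold.
      if (
        this.state === "half-open" ||
        this.consecutiveFailures >= this.failureThreshold
      ) {
        this.state = "open";
        this.openedAtMs = this.now();
      }
      throw error;
    }
  }
}
```

With `retryAfterMs` set to 60000, one transient blip costs at most about a minute of rejected writes before the next probe, rather than staying open until someone notices.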
Cost and lesson. 8 hours of stale rankings during a publicly visible competition window. We’d drawn the bounded context correctly but assigned it to a team that didn’t own its operational tooling. The context shouldn’t have been extracted into its own pipeline until the team had the on-call shape to back it. Derived indexes need a freshness metric, not just “is the consumer consuming.”
# datadog/monitors/rankings_freshness.yaml
name: "rankings_index_freshness"
type: metric alert
query: >
  max(last_5m):
  max:rankings.indexer.event_to_es_lag_seconds{env:prod} > 120
message: |
  Rankings index is falling behind. Page Tournament Platform on-call.
  Runbook: docs/runbooks/rankings-freshness.md
options:
  evaluation_delay: 60
  no_data_timeframe: 10
  notify_no_data: true
  thresholds:
    critical: 120
    warning: 30
The point isn’t the YAML. If you split a bounded context off into its own service, the on-call team has to own its observability before the cutover, not after the incident.
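A monitor is only as good as the metric behind it, so here is one way the indexer could compute a lag value that catches "consuming but not writing." This is a sketch under assumptions, not how our `event_to_es_lag_seconds` gauge was actually implemented: `IndexFreshnessTracker` and its methods are hypothetical, and a real version would need a bound on the pending map.

```typescript
// Tracks events that have been consumed but not yet written to the index.
class IndexFreshnessTracker {
  // eventId -> time the event occurred, in epoch milliseconds
  private readonly pending = new Map<string, number>();

  constructor(private readonly now: () => number = () => Date.now()) {}

  // Call when the consumer reads an event off the log.
  recordConsumed(eventId: string, occurredAtMs: number): void {
    this.pending.set(eventId, occurredAtMs);
  }

  // Call only after the index write for this event has been acknowledged.
  recordIndexed(eventId: string): void {
    this.pending.delete(eventId);
  }

  // Age in seconds of the oldest event still waiting to be indexed; 0 when caught up.
  lagSeconds(): number {
    let oldest = Infinity;
    this.pending.forEach((occurredAtMs) => {
      if (occurredAtMs < oldest) oldest = occurredAtMs;
    });
    if (oldest === Infinity) return 0;
    return Math.max(0, (this.now() - oldest) / 1000);
  }
}
```

The design choice is that the lag keeps growing while writes fail, even if consumption looks healthy. Emit `lagSeconds()` as the gauge on a timer and a silently open breaker trips the 120-second threshold in about two minutes instead of eight hours.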
Thanks for reading. If you’ve got thoughts, send them my way.