A failed decomposition that made latency worse and outages louder. Why I now default to a modular monolith and only extract services when there's a real scaling axis.
We celebrated the day we shipped the split. Champagne in the Slack channel, screenshots of the new service mesh, the whole thing. Two weeks later I was on a 2 a.m. call because the login flow now needed seven services to agree before a user could see their dashboard, and two of them were timing out on each other.
That was the moment I admitted out loud that we’d built a distributed monolith. We hadn’t decomposed anything. We’d taken a working monolith, sliced it into pieces, and re-introduced every line of coupling over a network. Same blast radius. Worse latency. Way worse debugging.
I’ve thought about that engagement a lot since. It was at a startup I worked at a few years back, not anything recent. The lesson stuck though, and it’s shaped how I architect things at every place I’ve been since.
The monolith we were breaking up wasn’t actually in trouble. It was a Rails app, maybe ten years old at the time, running on Aurora. Some hot paths were ugly. There were a few endpoints that wanted background processing. Nothing existential.
But the org had decided we were going to do microservices. The CTO had read the right blog posts. A platform team had been hired. The Helm charts looked great.
So we cut. User service, billing service, notifications service, search service, profile service, social-graph service, audit service. Seven services. All talking over HTTP and a shared Aurora cluster, because nobody wanted to also do the database split in the same quarter.
Honestly, that should have been the first warning sign.
Here’s my working definition now. A distributed monolith is a set of services that have to deploy together to ship a feature. If you can’t change one without coordinating a PR in another, it’s one service wearing three name tags.
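One rough way to put a number on that definition is to look at recently shipped features and count how many needed changes in more than one service. The sketch below is illustrative only: the `ShippedFeature` shape, the feature names, and the sample data are invented for the example, not taken from the project.

```typescript
// Hypothetical record of one shipped feature and the services it touched.
interface ShippedFeature {
  name: string;
  servicesChanged: string[];
}

// Fraction of features that needed coordinated changes in two or more services.
function coDeployRatio(features: ShippedFeature[]): number {
  if (features.length === 0) return 0;
  const coupled = features.filter((f) => {
    // De-duplicate so a service listed twice still counts once.
    const distinct = f.servicesChanged.filter((s, i, all) => all.indexOf(s) === i);
    return distinct.length > 1;
  });
  return coupled.length / features.length;
}

// Made-up sample data.
const recent: ShippedFeature[] = [
  { name: 'add display name field', servicesChanged: ['user', 'profile', 'search'] },
  { name: 'invoice footer text', servicesChanged: ['billing'] },
  { name: 'mute notifications', servicesChanged: ['notifications', 'user'] },
  { name: 'audit export', servicesChanged: ['audit', 'user'] },
];

// Three of these four features crossed a service boundary.
console.log(coDeployRatio(recent));
```

A ratio that stays high over months suggests the service boundaries are not where the changes actually happen.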
The symptoms are pretty consistent.
p99 climbs and it climbs for boring reasons. Every user-facing read becomes a chain of synchronous RPCs. Login was the loudest one for us. Auth service called user service called billing service called permissions service. Four hops. Each one healthy at the median, each one occasionally spiking. Compound the spike probability across four hops, since any one slow hop makes the whole request slow, and you get a p99 that looks like a heartbeat monitor.
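The arithmetic behind that is short. Here is a minimal sketch, assuming each hop is slow independently and with the same probability; real services are rarely that tidy, and the 1% figure is an illustrative number, not a measurement from this system.

```typescript
// Probability that a request crossing `hops` sequential calls hits at least
// one slow call, assuming each call is slow independently with probability p.
function chanceOfSlowRequest(pSlowPerHop: number, hops: number): number {
  return 1 - Math.pow(1 - pSlowPerHop, hops);
}

// Illustrative assumption: 1% of calls to each service are slow.
const perHop = 0.01;
for (const hops of [1, 4, 7]) {
  const pct = (chanceOfSlowRequest(perHop, hops) * 100).toFixed(2);
  console.log(`${hops} hop(s): ${pct}% of requests see at least one slow call`);
}
```

Under those assumptions, four hops turn a 1% per-service tail into roughly a 3.9% request-level tail, so behavior that used to sit at p99 in each service shows up around p96 for the user.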
Schema changes get scary. We had a shared Aurora cluster sitting behind those services, with cross-service foreign keys nobody had inventoried. A schema migration in user service could lock a table that profile service was reading. We hit one of those later: a NOT NULL column added to a hot user table with hundreds of millions of rows. Login error rate went to 100% for about 85 seconds. PagerDuty woke half the senior engineers in the timezone. The lock didn’t recognize service boundaries.
Deploys turn into ceremonies. Want to add a new field that the frontend needs? PR in the API gateway. PR in the service that owns the data. PR in the contract repo. PR in the consumer. Four PRs, three reviewers each, lined up across two days of standup discussion. A junior engineer on my team timed it once. The end-to-end “add a string field” feature took eleven days, almost all of it waiting.
That’s not microservices. That’s a monolith you have to fax between racks.
We tried to fix it with more service-y things, which is the move I’d warn anyone against now.
The thinking was: the problem is that the services are too coupled, so let’s make them more independent. Contract tests. Schema registries. Queues between everything that’s currently synchronous. We standardized on Kafka as the async backbone, the same way I had a few years earlier at a federation platform where I was acting CTO.
It bought us about a month of better-feeling deploys.
Then the queues turned into hidden coupling too, and the failure modes got more interesting. We had a consumer group on a hot topic, six pods, and one of them had a different max.poll.interval.ms value because someone shipped a config-only change without bumping the image tag and the deployment had pulled :latest. The pod’s handler did a slow downstream call that occasionally took 70 seconds. Past the poll interval. So it kept getting kicked out of the group, which triggered a rebalance, which paused every other consumer. The page froze for users during what should have been a quiet Saturday afternoon. Twelve minutes of stale data on a visible surface. Cordoned the bad pod, drained the storm in about ninety seconds. SHA-pinned every Kafka-touching deployment after that.
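The two ingredients of that incident, an unpinned image and one pod with a different max.poll.interval.ms, are both things a small check can surface before a rebalance storm does. This is a minimal sketch, assuming you can already collect each pod’s image reference and effective consumer config. The `PodConfig` shape, the pod names, the registry, and the 60000 ms value are all invented for illustration; the only numbers taken as given are Kafka’s 300000 ms default and the fact that the odd pod’s interval was shorter than its 70-second call.

```typescript
// Hypothetical snapshot of what each consumer pod is actually running.
interface PodConfig {
  pod: string;
  image: string;             // image reference the pod was started from
  maxPollIntervalMs: number; // effective max.poll.interval.ms
}

interface DriftReport {
  unpinnedImages: string[];  // pods not running a digest-pinned image
  oddPollInterval: string[]; // pods whose interval differs from the group's most common value
}

function findDrift(pods: PodConfig[]): DriftReport {
  // Count how often each poll interval appears to find the group norm.
  const counts: Record<string, number> = {};
  for (const p of pods) {
    const key = String(p.maxPollIntervalMs);
    counts[key] = (counts[key] || 0) + 1;
  }
  let common = NaN;
  let best = 0;
  for (const key of Object.keys(counts)) {
    if (counts[key] > best) {
      best = counts[key];
      common = Number(key);
    }
  }
  return {
    unpinnedImages: pods.filter((p) => p.image.indexOf('@sha256:') === -1).map((p) => p.pod),
    oddPollInterval: pods.filter((p) => p.maxPollIntervalMs !== common).map((p) => p.pod),
  };
}

// Made-up data: five digest-pinned pods on the default, one pod on :latest.
const pods: PodConfig[] = [0, 1, 2, 3, 4].map((i) => ({
  pod: `consumer-${i}`,
  image: 'registry.example/consumer@sha256:<digest>',
  maxPollIntervalMs: 300000,
}));
pods.push({ pod: 'consumer-5', image: 'registry.example/consumer:latest', maxPollIntervalMs: 60000 });

console.log(findDrift(pods));
```

A check like this complements SHA-pinning rather than replacing it: pinning stops the drift from being introduced, while the check reports drift that has already happened.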
The point isn’t the Kafka bug. We hadn’t decoupled anything. We’d taken a synchronous chain and added a delay loop in front of it. The coupling moved. It didn’t disappear.
We pulled back.
Not back to one giant Rails app, exactly. Back to a modular monolith. One deploy unit. One database. Modules that owned their own tables and exposed a Port-style interface to other modules. No cross-module foreign keys. No service mesh. No contract repo.
In code, it looked something like this. This is the NestJS variant we ended up using later on a separate product.
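// modules/billing/billing.port.ts
import { Money } from '../shared/money';

// The only surface other modules are allowed to depend on.
export interface BillingPort {
  chargeCustomer(input: {
    customerId: string;
    amount: Money;
    idempotencyKey: string; // lets billing de-duplicate retries of the same charge
  }): Promise<{ chargeId: string; status: 'succeeded' | 'pending' | 'failed' }>;

  getOutstandingBalance(customerId: string): Promise<Money>;
}

// Injection token: interfaces are erased at runtime, so Nest needs a value to key on.
export const BILLING_PORT = Symbol('BILLING_PORT');

// modules/checkout/checkout.service.ts
import { Inject, Injectable } from '@nestjs/common';
import { BILLING_PORT, BillingPort } from '../billing/billing.port';
import { OrdersRepository } from './orders.repository';
// The two imports below are assumed locations; the original snippet used
// these names without showing where they live.
import { PlaceOrderInput } from './checkout.types';
import { PaymentFailed } from './checkout.errors';

@Injectable()
export class CheckoutService {
  constructor(
    @Inject(BILLING_PORT) private readonly billing: BillingPort,
    private readonly orders: OrdersRepository,
  ) {}

  async placeOrder(input: PlaceOrderInput) {
    // Persist the order first so its id can anchor the idempotency key.
    const order = await this.orders.createPending(input);

    const charge = await this.billing.chargeCustomer({
      customerId: input.customerId,
      amount: order.total,
      idempotencyKey: `order:${order.id}`,
    });

    if (charge.status === 'failed') {
      await this.orders.markFailed(order.id, charge.chargeId);
      throw new PaymentFailed(order.id);
    }

    // Note: a 'pending' charge is confirmed here too; a stricter flow might
    // keep the order pending until billing reports 'succeeded'.
    return this.orders.markConfirmed(order.id, charge.chargeId);
  }
}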
The Checkout module doesn’t import anything from modules/billing/ except the port. It doesn’t know which table billing writes to. It doesn’t have a foreign key to a billing row. When the billing team changes their internal schema, Checkout doesn’t care. That’s the contract.
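That rule is easy to state and easy to erode, so it helps to check it mechanically. The following is a minimal sketch of such a check on import specifiers, not a replacement for a real lint rule. It assumes the layout from the snippet above (files sit directly in modules/<name>/, ports are named *.port) and assumes modules/shared/ is fair game for everyone. The function name and the offending import are hypothetical.

```typescript
// Assumption for this sketch: modules/shared/ may be imported by any module.
const SHARED_MODULE = 'shared';

// True when an import specifier written in a file directly inside
// modules/<fromModule>/ reaches into a sibling module through anything
// other than that module's top-level *.port file.
function crossesBoundary(fromModule: string, importPath: string): boolean {
  const match = /^\.\.\/([^/]+)\/(.+)$/.exec(importPath);
  if (match === null) return false; // same-module or package import
  const targetModule = match[1];
  const file = match[2];
  if (targetModule === fromModule || targetModule === SHARED_MODULE) return false;
  return !/^[^/]+\.port$/.test(file);
}

const checkoutImports = [
  '@nestjs/common',
  '../billing/billing.port',
  './orders.repository',
  '../billing/billing.repository', // hypothetical offending import
];

// Only the import that bypasses the port is reported.
const violations = checkoutImports.filter((p) => crossesBoundary('checkout', p));
console.log(violations);
```

In practice you would more likely reach for an existing lint rule that restricts import paths; the point of the sketch is only that the boundary is a checkable property of the source tree, with no network involved.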
The deploy shape went back to boring too. One web app, one worker, one database.
# docker-compose.yml (production parity, abridged)
services:
  app:
    image: ghcr.io/acme/platform:${GIT_SHA}
    command: bin/server
    ports: ['3000:3000']
    environment:
      DATABASE_URL: postgres://app:${PG_PASS}@db:5432/platform
      REDIS_URL: redis://redis:6379/0
    depends_on: [db, redis]
  worker:
    image: ghcr.io/acme/platform:${GIT_SHA}
    command: bin/worker
    environment:
      DATABASE_URL: postgres://app:${PG_PASS}@db:5432/platform
      REDIS_URL: redis://redis:6379/0
    depends_on: [db, redis]
  db:
    image: postgres:16
    # The official postgres image refuses to start without a password.
    # These values are filled in to match DATABASE_URL above.
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: ${PG_PASS}
      POSTGRES_DB: platform
    volumes: ['pgdata:/var/lib/postgresql/data']
  redis:
    image: redis:7
volumes: { pgdata: {} }
The worker is the same image as the app, just a different command. No contract repo because there’s no wire contract to version. There’s a database, and modules that share it under explicit rules.
If you’re in a Rails world, you can get a similar effect with engines or a Packwerk-style boundary file. Statically checked isolation, no service mesh required.
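# packwerk.yml
require:
  # Privacy checks ship in the packwerk-extensions gem.
  - packwerk/privacy/checker
package_paths:
  - 'app/packages/*'

# app/packages/billing/package.yml
enforce_dependencies: strict
enforce_privacy: strict
# Resolved relative to the package root, i.e. app/packages/billing/public/
public_path: public/
dependencies:
  - app/packages/identity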
Strict dependencies, strict privacy, every package declares what it can talk to. The static check (bin/packwerk check, usually wired into CI) tells you when someone reaches across a boundary. Try doing that across seven HTTP services without a Slack thread.
p99 stopped doing the heartbeat-monitor thing. The chained-RPC tax went away. Deploys went from four-PR ceremonies to single-PR ships. That eleven-day “add a string field” change now shipped the same morning.
Outages got quieter. We still had incidents, of course. But the blast radius was contained to one deploy unit. One log stream to read. One runbook. The on-call rotation became sleepable again. Honestly that’s the metric I care about most.
I don’t think microservices are wrong. I’ve shipped on a hundreds-of-microservices topology before, at a federation platform where I was acting CTO. It worked there. It also worked at a real-time trading platform I architected, where the data plane handled an order of magnitude more concurrent connections than the rest of the product combined and needed its own service.
But notice what those two have in common. There was a named, specific reason. A scaling axis the rest of the system didn’t have, or a hard team boundary I could draw on the org chart without squinting.
The version that goes wrong is the one where the only reason is “we should be on microservices.” That’s an org-chart decision dressed up as a technical one. My rule now: extract a service when you can answer two questions without hand-waving. What’s the scaling axis? Which team owns it end to end? If either answer is “well, kind of,” stay in the monolith and put a module boundary there instead. You can always extract later. You almost never un-extract.
Thanks for reading. If you’ve got thoughts, send them my way.