Microservice Antipatterns and How to Fix Them

Distributed monoliths, nano-services, and shared-library coupling. Detection signals and remediation steps pulled from real architecture reviews.

The first time I really understood “distributed monolith” was on a Saturday afternoon at the combat-sports tournament platform I CTO’d in London. A live broadcast was on. The standings page froze at 14:32 local time. I’d hired and built the engineering org over three years and we had hundreds of microservices in production. None of that mattered at 14:32. One stale pod in one consumer group had taken down a public-facing leaderboard during a federation event. That afternoon taught me the same thing every microservice antipattern teaches you eventually. The boundary you drew on the whiteboard isn’t the boundary you got in production.

I want to walk through three antipatterns I keep seeing in architecture reviews. Distributed monoliths. Nano-services. Shared-library coupling. They’re not the only three, but they cover most of the pain. For each one I’ll give you the signal I look for, the actual fix, and a real moment where it bit me.

Distributed monolith, the usual one

A distributed monolith looks like microservices on the architecture diagram and ships like a monolith in CI. You can’t deploy service A without service B going out the same hour. A schema change in users requires a coordinated release across four other services. Your “service mesh” is really one big release train pretending to be independent.

The signal I look for first is the shape of your deploys. If most PRs touch two or more services, you don’t have services. You have one app that happens to live in seven repos. Second signal is your test pipeline. If your CI for service A spins up services B, C, and D to run “integration tests,” you’ve already coupled them at the API boundary. Real services are tested against a contract, not against a live neighbor.
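One way to put a number on that first signal is to count how many merged PRs touch more than one service. This is a minimal sketch, assuming you can export each PR as a list of changed file paths and that services live under a top-level services/<name>/ directory. The directory layout and both function names are hypothetical; in a multi-repo setup you would group by linked PRs across repos instead.

```typescript
// Hypothetical layout: services/<service-name>/...; adjust the regex to your repo.
const SERVICE_PATH = /^services\/([^/]+)\//;

// Returns the set of services a single PR touches, based on its changed paths.
function servicesTouched(changedPaths: string[]): Set<string> {
  const touched = new Set<string>();
  for (const path of changedPaths) {
    const match = SERVICE_PATH.exec(path);
    if (match) touched.add(match[1]);
  }
  return touched;
}

// Fraction of PRs (0..1) that touch two or more services.
function crossServiceRatio(prs: string[][]): number {
  if (prs.length === 0) return 0;
  const coupled = prs.filter((paths) => servicesTouched(paths).size >= 2).length;
  return coupled / prs.length;
}

// Sample data, invented for illustration.
const samplePrs: string[][] = [
  ['services/orders/src/handler.ts', 'services/users/src/schema.ts'],
  ['services/orders/src/handler.ts'],
  ['docs/README.md'],
  ['services/billing/src/a.ts', 'services/orders/src/b.ts'],
];
console.log(crossServiceRatio(samplePrs)); // 0.5
```

The threshold is a judgment call. What matters is the trend: if the ratio climbs release after release, the boundaries are eroding.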

The fix is unglamorous. Find the chatty boundaries and either merge them back or stabilize the contract between them with a versioned schema. I like Protobuf or JSON Schema in a shared contracts/ repo that producers publish to and consumers pin to a version. Here’s the consumer side I usually start with.

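import { Injectable, Logger, OnModuleDestroy, OnModuleInit } from '@nestjs/common';
import { Kafka, Consumer, EachMessagePayload } from 'kafkajs';
import { validateOrderCreatedV2, OrderCreatedV2 } from '@org/contracts';

@Injectable()
export class OrderEventConsumer implements OnModuleInit, OnModuleDestroy {
  private readonly logger = new Logger(OrderEventConsumer.name);
  // Assigned in onModuleInit, hence the definite-assignment assertion.
  private consumer!: Consumer;

  async onModuleInit() {
    const kafka = new Kafka({
      clientId: 'fulfillment-service',
      brokers: process.env.KAFKA_BROKERS!.split(','),
    });
    // The group id carries the contract version, so a future v3 consumer
    // can run side by side with its own offsets.
    this.consumer = kafka.consumer({
      groupId: 'fulfillment.order-events.v2',
      maxWaitTimeInMs: 200,
      sessionTimeout: 30_000,
    });

    await this.consumer.connect();
    await this.consumer.subscribe({ topic: 'order.created', fromBeginning: false });
    await this.consumer.run({ eachMessage: (p) => this.handle(p) });
  }

  async onModuleDestroy() {
    // Leave the group cleanly so partitions rebalance without waiting
    // for the session timeout.
    await this.consumer.disconnect();
  }

  private async handle({ message }: EachMessagePayload) {
    // Kafka allows null values (tombstones); there is nothing to fulfill.
    if (!message.value) return;

    const raw = JSON.parse(message.value.toString());
    if (raw.schemaVersion !== 2) {
      this.logger.warn({ msg: 'skipping non-v2 event', version: raw.schemaVersion });
      return;
    }
    // Validate against the pinned contract before touching domain logic.
    const event: OrderCreatedV2 = validateOrderCreatedV2(raw);
    await this.fulfill(event);
  }

  private async fulfill(event: OrderCreatedV2): Promise<void> {
    // Placeholder: the fulfillment logic itself is out of scope here.
  }
}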

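The version check is the part that buys you independence. The producer can roll v3 out, consumers ignore it until they’re ready, and nobody schedules a four-team deploy meeting on a Friday. The catch is that the producer has to keep emitting v2 alongside v3 until the last consumer has moved. Otherwise “ignore” means “drop.”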
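The @org/contracts package itself isn’t shown here, so as a rough idea of what sits behind validateOrderCreatedV2, here is a hand-rolled sketch. The field names (orderId, totalCents) are assumptions for illustration, and in practice the validator would more likely be generated from the Protobuf or JSON Schema definition than written by hand. The last few lines double as a contract test: a recorded fixture stands in for the live producer.

```typescript
// Hypothetical shape of the v2 contract; field names are illustrative only.
interface OrderCreatedV2 {
  schemaVersion: 2;
  orderId: string;
  totalCents: number;
}

// Throws on any payload that does not match the v2 contract.
function validateOrderCreatedV2(raw: unknown): OrderCreatedV2 {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('event must be an object');
  }
  const { schemaVersion, orderId, totalCents } = raw as Record<string, unknown>;
  if (schemaVersion !== 2) {
    throw new Error('expected schemaVersion 2');
  }
  if (typeof orderId !== 'string' || orderId === '') {
    throw new Error('orderId must be a non-empty string');
  }
  if (typeof totalCents !== 'number' || !Number.isInteger(totalCents)) {
    throw new Error('totalCents must be an integer');
  }
  return { schemaVersion: 2, orderId, totalCents };
}

// Contract test: validate a recorded fixture instead of calling a live neighbor.
const orderFixture: unknown = JSON.parse(
  '{"schemaVersion":2,"orderId":"ord_1","totalCents":1999}',
);
const parsedOrder = validateOrderCreatedV2(orderFixture);
console.log(parsedOrder.orderId); // ord_1
```

The producer runs the same fixtures in its own CI, so a breaking change fails there before it ever reaches a consumer.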

War story. At a creator economy platform I worked at, we had a renewal webhook handler that called three internal services in sequence before responding to Apple. One day Apple’s SubscriptionRenewal server-to-server notification started arriving while one of those three services was slow, and we returned 200 OK slightly after Apple’s 30-second deadline. Apple retried. We had no idempotency on the renewal handler. Every retry created a new subscription row. A few thousand customers across dozens of branded apps got billed twice. The frontend “fix” the on-call shipped within an hour just hid the duplicate row, which did nothing about the duplicate charges. The real fix was to enqueue the work asynchronously, return 200 OK within 5 seconds, and put a unique constraint on (apple_original_transaction_id, notification_uuid) so Apple’s retries became idempotent at the queue. That’s a synchronous distributed monolith in action. Three “independent” services chained behind a webhook with no contract for what failure looks like.
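The shape of that fix, acknowledge fast, dedupe on a unique key, do the work off the request path, can be sketched without any infrastructure. This is an illustration only: the in-memory Set stands in for the database unique constraint, and the type and function names are hypothetical, not the code we shipped.

```typescript
// Hypothetical notification shape; only the two key fields matter here.
interface RenewalNotification {
  originalTransactionId: string;
  notificationUuid: string;
}

// In-memory stand-in for a table with a unique constraint on
// (apple_original_transaction_id, notification_uuid).
class RenewalInbox {
  private readonly seen = new Set<string>();
  private readonly queue: RenewalNotification[] = [];

  // Returns true if the notification was new and got enqueued,
  // false if it was a retry of one already accepted.
  accept(n: RenewalNotification): boolean {
    const key = JSON.stringify([n.originalTransactionId, n.notificationUuid]);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.queue.push(n);
    return true;
  }

  // Number of notifications waiting for a background worker.
  pending(): number {
    return this.queue.length;
  }
}

// The webhook handler only records the notification and acknowledges.
// Duplicate or not, the sender gets the same status code.
function handleRenewalWebhook(inbox: RenewalInbox, n: RenewalNotification): number {
  inbox.accept(n);
  return 200;
}

const renewalInbox = new RenewalInbox();
const renewal = { originalTransactionId: 'txn_1', notificationUuid: 'uuid_1' };
handleRenewalWebhook(renewalInbox, renewal);
handleRenewalWebhook(renewalInbox, renewal); // simulated retry
console.log(renewalInbox.pending()); // 1
```

With a real database the Set becomes an insert that is allowed to fail on the unique constraint, and that failure is treated as success.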

Nano-services, the other extreme

The opposite mistake. Someone read a Martin Fowler post, decided every domain noun deserves its own service, and now you have a discount-calculator service that does one multiplication. You pay the full operational cost: a deploy pipeline, a Datadog dashboard, an on-call rotation, alerting, log retention. And for what? A function.

Signal one: a service whose entire job could be a 40-line module in the caller. Signal two: the service has no schema of its own. It reads someone else’s data and returns a derived value. Signal three: the on-call doc is one sentence.

The fix is to merge it back into the caller. Yes, this feels like going backwards. Do it anyway. A service exists to encapsulate state and a team. If it owns neither, it’s a function pretending to be infrastructure.

// before: a network hop for a multiplication
const discount = await axios
  .post(`${DISCOUNT_SVC}/calc`, { items, tier })
  .then((r) => r.data.amount);

// after: same logic, in-process
import { calculateDiscount } from '@org/pricing-core';
const discount = calculateDiscount(items, tier);

The library version is faster, easier to debug, easier to test, and removes a failure mode. If pricing-core gets complicated later and grows real state, you split it back out. Splitting later is cheap. Splitting too early is expensive.
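For a sense of scale, here is roughly what the inlined module could look like. The tier names and percentages are invented for illustration, since the actual pricing rules aren’t part of this post. The point is only that the whole “service” fits in a pure function.

```typescript
// Hypothetical types and rates; the real pricing rules are not specified here.
type Tier = 'standard' | 'gold';

interface LineItem {
  unitPriceCents: number;
  quantity: number;
}

// Whole-number percentages keep the arithmetic in integers.
const DISCOUNT_PERCENT: Record<Tier, number> = { standard: 0, gold: 10 };

// Pure function: no I/O, no state, no network hop to fail.
function calculateDiscount(items: LineItem[], tier: Tier): number {
  const subtotalCents = items.reduce(
    (sum, item) => sum + item.unitPriceCents * item.quantity,
    0,
  );
  return Math.round((subtotalCents * DISCOUNT_PERCENT[tier]) / 100);
}

console.log(calculateDiscount([{ unitPriceCents: 1000, quantity: 2 }], 'gold')); // 200
```

Testing this needs no mocks, no containers, and no network. That is most of the argument.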

Shared-library coupling, the silent one

Now the trickier antipattern. You have real services with real boundaries. Good. They all depend on @org/common, a shared library with helpers, types, logging setup, auth middleware, and “just a few” data access utilities. The minute that library touches data or business logic, you’ve recoupled everything at the runtime layer. A breaking change in @org/common means a coordinated upgrade across every service.

The signal here is your dependency graph. Run npm ls @org/common across every service. If they’re all on different versions, you already have the bug. If they’re all on the same version because nobody can upgrade independently, you have the worse version of the bug.
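Once you’ve collected the resolved version per service, from npm ls output or from lockfiles, classifying the result takes a few lines. A minimal sketch follows; the service names and versions in the sample are made up, and how you gather the input is left out on purpose.

```typescript
type SkewReport =
  | { kind: 'empty' }
  | { kind: 'lockstep'; version: string }
  | { kind: 'skewed'; byVersion: Record<string, string[]> };

// Input: service name -> resolved version of the shared library.
function classifySkew(resolved: Record<string, string>): SkewReport {
  const byVersion: Record<string, string[]> = {};
  for (const [service, version] of Object.entries(resolved)) {
    if (!byVersion[version]) byVersion[version] = [];
    byVersion[version].push(service);
  }
  const versions = Object.keys(byVersion);
  if (versions.length === 0) return { kind: 'empty' };
  if (versions.length === 1) return { kind: 'lockstep', version: versions[0] };
  return { kind: 'skewed', byVersion };
}

// Sample data, invented for illustration.
const skewReport = classifySkew({
  orders: '3.1.0',
  billing: '3.1.0',
  fulfillment: '2.8.4',
});
console.log(skewReport.kind); // skewed
```

A lockstep result is not proof of a problem on its own. It becomes one when the reason everyone is on the same version is that nobody can move alone.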

The remediation is boring. Split the library by concern. Pure types and contracts in one package. HTTP middleware in another. Logging in a third. Any business logic gets its own service or moves into the caller. Then make every package strictly additive between major versions, the way real OSS libraries do.

# .github/workflows/contracts-publish.yml
name: contracts-publish
on:
  push:
    branches: [main]
    paths: ['packages/contracts/**']

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
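      - uses: actions/checkout@v4
      # Assumes package.json has a "packageManager" field so this step
      # can pick the pnpm version.
      - uses: pnpm/action-setup@v3
      - run: pnpm install --frozen-lockfile
      - run: pnpm --filter @org/contracts build
      - run: pnpm --filter @org/contracts test
      # Assumes the repo .npmrc reads ${NPM_TOKEN} for registry auth.
      - run: pnpm --filter @org/contracts publish --no-git-checks
        env:
          NPM_TOKEN: ${{ secrets.NPM_TOKEN }}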

War story two. A real-time trading platform I architected on Socket.io and Node.js had a “shared utils” package that included the reconnect-config helper. The market opened on a Tuesday after a long weekend and the reconnect storm hit at 09:31:14. Every gateway pod was pinned at 100% CPU within 90 seconds, and p99 fan-out latency went from around 80 ms to about 3 seconds. The fix was a remote-config push of jittered exponential backoff (min 200 ms, max 30 seconds, factor 2), plus an nginx rate limit of three new connections per second per IP. It took about 8 minutes to stabilize once we shipped both. But the deeper problem was that the backoff config sat in a shared library every service imported. A “fix” to that library would have required a coordinated client release. We routed around it with remote config, which is fine for an incident. The real fix afterwards was pulling reconnect policy out of the shared lib and into a per-service config served at startup.
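For reference, jittered exponential backoff with those numbers fits in a few lines. This is a sketch of one common variant, not the exact code from the incident: the delay is drawn uniformly between the minimum and an exponentially growing cap. The policy object is plain data, which is what makes it easy to serve from config. The function name and the injectable random parameter are my own additions for testability.

```typescript
interface BackoffPolicy {
  minMs: number;
  maxMs: number;
  factor: number;
}

// Values from the incident above; in practice these come from per-service config.
const RECONNECT_POLICY: BackoffPolicy = { minMs: 200, maxMs: 30_000, factor: 2 };

// Returns the delay before reconnect attempt number `attempt` (0-based).
// The cap grows as minMs * factor^attempt, clamped to maxMs, and the
// actual delay is picked uniformly between minMs and that cap so that
// clients do not reconnect in lockstep.
function reconnectDelayMs(
  attempt: number,
  policy: BackoffPolicy = RECONNECT_POLICY,
  random: () => number = Math.random,
): number {
  const cap = Math.min(policy.maxMs, policy.minMs * Math.pow(policy.factor, attempt));
  return Math.round(policy.minMs + random() * (cap - policy.minMs));
}

// Deterministic example: fix the random source at 0.5.
console.log(reconnectDelayMs(3, RECONNECT_POLICY, () => 0.5)); // 900
```

Without the jitter, every client that dropped at 09:31:14 retries at the same instants, and the storm simply repeats on a schedule.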

Takeaways

If two services can’t deploy independently, they aren’t two services.
Versioned contracts in a separate repo beat shared runtime libraries every time.
A service that owns no state is a function. Inline it until it grows state.
Test against contracts, not live neighbors.
Backoff, reconnect, retry policy belongs in config, not in shared code.
Splitting late is cheaper than splitting early.

Thanks for reading. If you’ve got thoughts, send them my way.