Service Mesh in Production

An opinionated take on Istio versus Linkerd, when sidecars earn their resource overhead, and when a service mesh is just expensive YAML.

The combat-sports tournament platform I CTO’d in London ran on hundreds of microservices in production. Kafka was the backbone, the team had grown over a few years, and one Saturday afternoon during a live broadcast a consumer group rebalanced itself into the floor and froze the public standings for twelve minutes. Someone in the postmortem the next week asked the obvious thing. Would a service mesh have caught this? Honest answer: no, not even a little. I’ve been chewing on that question ever since, across a real-time trading platform I architected, a live-video creator startup I led, and the creator economy platform I spent the last few years at.

So here’s where I land.

My position up front

Run Linkerd if you actually need a mesh. Skip Istio unless you’re a platform team with the headcount to operate it as a product. And honestly, most NestJS shops do not need a mesh at all. They need a decent HTTP client with retries, timeouts, and a circuit breaker, plus Datadog APM and structured logs. Sidecars are not free. Injecting a proxy next to every pod costs you CPU, memory, latency on the hot path, and a non-trivial chunk of your on-call attention. If you can’t name three problems a mesh solves that you have right now, you don’t have a mesh problem.

What a mesh actually buys you

Three things, really. mTLS between services with rotation handled for you. L7 traffic management, meaning retries, timeouts, circuit breaking, traffic splitting, canary routing, all configured outside your app code. And uniform observability, golden signals on every hop without instrumenting each service.

The first one matters in regulated industries. The second matters when you have heterogeneous languages and no single shared HTTP client library. The third matters everywhere, but most teams already have a decent slice of it through Datadog APM or OpenTelemetry SDKs.

If you’re a TypeScript and NestJS shop with maybe a Python worker tier on the side, you can get the L7 stuff in app code for less operational cost than a mesh. Like this.

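import { HttpException, HttpStatus, Injectable, Module } from '@nestjs/common';
import { HttpModule, HttpService } from '@nestjs/axios';
import type { AxiosResponse } from 'axios';
import axiosRetry from 'axios-retry';
import CircuitBreaker from 'opossum';
import { firstValueFrom } from 'rxjs';

@Injectable()
export class RankingsClient {
  // Typed so that fire() resolves to an AxiosResponse instead of unknown.
  private readonly breaker: CircuitBreaker<[string], AxiosResponse>;

  constructor(private readonly http: HttpService) {
    // Retry network errors and 5xx on idempotent requests, plus any 503,
    // with capped exponential backoff.
    axiosRetry(this.http.axiosRef, {
      retries: 3,
      retryDelay: (attempt) => Math.min(200 * 2 ** attempt, 2000),
      retryCondition: (err) =>
        axiosRetry.isNetworkOrIdempotentRequestError(err) ||
        err.response?.status === 503,
    });

    // Open the circuit once half of the calls fail; probe again after 10s.
    // Note: this timeout wraps the whole call, retries included.
    this.breaker = new CircuitBreaker(
      (athleteId: string) =>
        firstValueFrom(
          this.http.get(`/rankings/${athleteId}`),
        ),
      {
        timeout: 1500,
        errorThresholdPercentage: 50,
        resetTimeout: 10_000,
      },
    );
  }

  async getRank(athleteId: string) {
    try {
      const { data } = await this.breaker.fire(athleteId);
      return data;
    } catch {
      // Retries exhausted, breaker timeout, or circuit open: one stable error.
      throw new HttpException('rankings_unavailable', HttpStatus.SERVICE_UNAVAILABLE);
    }
  }
}

// Declared after the client so the provider reference resolves at load time.
@Module({
  imports: [
    HttpModule.registerAsync({
      useFactory: () => ({
        // The rankings service baseURL belongs here; the relative path in
        // RankingsClient relies on it.
        timeout: 1500,
        maxRedirects: 0,
      }),
    }),
  ],
  providers: [RankingsClient],
  exports: [RankingsClient],
})
export class RankingsClientModule {}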

That’s retries, timeouts, and a circuit breaker per dependency, in a NestJS HTTP client. No sidecar. No CRD. If every team in your org wires up the same module, you have 80% of what an L7 mesh gives you, with no extra infra. It also fails in ways your engineers already know how to debug.
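
If you haven’t looked inside a breaker before, the state machine behind errorThresholdPercentage and resetTimeout is small. Here’s a simplified sketch of it. The class and method names are mine, not opossum’s, and opossum tracks failures over a rolling window rather than the lifetime counters used here, so treat this as an illustration of the idea and not a replacement.

```typescript
// Simplified three-state circuit breaker: closed -> open -> halfOpen -> closed.
// Illustrative only: names are hypothetical and the counters are lifetime
// totals, where a real library would use a rolling window.
type BreakerState = 'closed' | 'open' | 'halfOpen';

class TinyBreaker {
  private state: BreakerState = 'closed';
  private successes = 0;
  private failures = 0;
  private openedAt = 0;
  private readonly errorThresholdPercentage: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    errorThresholdPercentage: number,
    resetTimeoutMs: number,
    now: () => number = Date.now, // injectable clock, handy in tests
  ) {
    this.errorThresholdPercentage = errorThresholdPercentage;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Should the caller attempt the downstream call right now?
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'halfOpen'; // reset timeout elapsed: let a trial call through
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    if (this.state === 'halfOpen') {
      // Trial call worked: close the circuit and start counting afresh.
      this.state = 'closed';
      this.successes = 0;
      this.failures = 0;
      return;
    }
    this.successes += 1;
  }

  recordFailure(): void {
    this.failures += 1;
    const errorPct = (this.failures / (this.successes + this.failures)) * 100;
    if (this.state === 'halfOpen' || errorPct >= this.errorThresholdPercentage) {
      this.state = 'open'; // fail fast until the reset timeout elapses
      this.openedAt = this.now();
    }
  }
}
```

The caller checks allowRequest() before going downstream and reports the outcome afterwards. While the circuit is open, nothing reaches the struggling dependency, which is the whole point.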

Istio versus Linkerd

I’ll just say it. Istio is over-engineered for almost every team I’ve seen reach for it. The control plane is a beast, the CRD surface is enormous, the upgrade story is rough, and the Envoy sidecar adds real per-request latency. I’ve watched a platform team spend six months getting Istio production-ready and another three months stabilizing it after the first major version bump.

Linkerd is the opposite shape. Rust-based proxy, much smaller sidecar footprint, mTLS on by default, sane defaults, smaller CRD surface, and an operational burden you can hold in one engineer’s head. You give up some of Istio’s traffic-management bells, but the ones you give up are mostly the ones you shouldn’t be running anyway.

Here’s a Linkerd service profile that does retries and timeouts at the mesh layer for a NestJS rankings service.

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: rankings.federation.svc.cluster.local
  namespace: federation
spec:
  routes:
    - name: GET /rankings/{id}
      condition:
        method: GET
        pathRegex: /rankings/[^/]+
      timeout: 1500ms
      isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 5
    ttl: 10s
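
The retryBudget block is worth a second look, because it behaves differently from a fixed retry count. As I understand Linkerd’s semantics, retries are capped at retryRatio times the original request rate, plus minRetriesPerSecond on top. The sketch below is just that arithmetic, with made-up function and type names, and it ignores the ttl window over which Linkerd does the accounting.

```typescript
// Rough arithmetic for a retry budget. Assumption: the allowed retry rate is
// retryRatio * (original requests per second) + minRetriesPerSecond.
// Names are illustrative, not a Linkerd API.
interface RetryBudget {
  retryRatio: number;
  minRetriesPerSecond: number;
}

function maxRetriesPerSecond(budget: RetryBudget, requestsPerSecond: number): number {
  return budget.retryRatio * requestsPerSecond + budget.minRetriesPerSecond;
}

// Same numbers as the service profile above.
const rankingsBudget: RetryBudget = { retryRatio: 0.2, minRetriesPerSecond: 5 };

const quietRetries = maxRetriesPerSecond(rankingsBudget, 10);   // low-traffic service
const busyRetries = maxRetriesPerSecond(rankingsBudget, 1000);  // hot path

console.log({ quietRetries, busyRetries });
```

The design point: at 1000 requests per second the budget tops out at a couple of hundred retries per second, so a failing dependency sees a bounded bump in load. A flat "retry three times" policy on every request can multiply load on that same dependency by up to four.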

To get the same behavior in Istio, including the mTLS that Linkerd gives you by default, you’re writing a VirtualService plus a DestinationRule, plus a PeerAuthentication, plus possibly an EnvoyFilter to make the retries behave the way you actually want. Three or four CRDs versus one. Multiply that across forty services and ask yourself who’s owning that config drift.

When sidecars don’t pay

Here’s the thing nobody puts in the Istio marketing. The sidecar runs in every pod. At the creator economy platform I spent the last few years at, the production cluster had thousands of pods. Stick an Envoy sidecar next to every one of them with the default CPU/memory request and, when the app containers are small themselves, you’ve roughly doubled the cluster’s requested resources before you’ve added a feature. That’s not a hypothetical. That’s the line you go defend in the cost-review meeting.
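
The cost-review arithmetic is simple enough to script before the meeting. A rough sketch follows. The pod count and the per-sidecar requests (100m CPU, 128Mi memory) are assumptions for illustration, so plug in whatever your injector actually sets.

```typescript
// Cluster-wide resource requests added by one sidecar per pod.
// All figures below are illustrative assumptions, not measurements.
interface SidecarRequest {
  cpuMillicores: number;
  memoryMiB: number;
}

function sidecarOverhead(podCount: number, perSidecar: SidecarRequest) {
  return {
    cpuCores: (podCount * perSidecar.cpuMillicores) / 1000,
    memoryGiB: (podCount * perSidecar.memoryMiB) / 1024,
  };
}

// Hypothetical cluster: 2000 pods, each sidecar requesting 100m CPU / 128Mi.
const overhead = sidecarOverhead(2000, { cpuMillicores: 100, memoryMiB: 128 });

console.log(overhead); // { cpuCores: 200, memoryGiB: 250 }
```

Two hundred cores and 250 GiB of requests that run no business logic. If your typical app container requests about the same as the sidecar, that is where the doubling comes from.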

Linkerd’s proxy is meaningfully smaller, so the math gets easier. But it’s still not free. And here’s the harder question. If you have around twenty services, you probably do not have a mesh-shaped problem. You have a “we should standardize our HTTP client” problem, which is a Tuesday afternoon of work, not a quarter.

A useful smell test. If your team can name every running service from memory, you do not need a mesh. If service discovery is a real problem, mTLS is an audit requirement, and you have at least three runtimes in production, then yeah, install Linkerd.

A mesh would not have saved this

Back to the Saturday afternoon I opened with. The combat-sports platform, third bout of the day, the standings-projector consumer group rebalancing every thirty seconds or so. The match-events Kafka topic kept growing on the broker, but updates stopped reaching the public leaderboard. Page froze at 14:32 local time during a live broadcast. Within two minutes our SRE Slack had three PagerDuty pages and the federation’s tech contact pinged me directly.

First wrong fix was a kubectl rollout restart deployment/standings-projector, hoping consumers would re-join cleanly. They did. Then they triggered another rebalance about forty seconds later. We were doing the same dance the group was already doing on its own.

Real fix took a side-by-side log diff. One pod out of six had a different max.poll.interval.ms value, 300s on five of them, 60s on the sixth. The sixth pod was running a stale container image because someone had pushed a config-touching fix without bumping the image tag and the deployment had pulled :latest. The pod’s handler did a slow downstream call that occasionally took around seventy seconds, past its max.poll.interval.ms, so it kept getting kicked out of the group, causing rebalances for everyone. Pulled the bad pod out of the group, storm drained in about ninety seconds.

Twelve minutes of stale standings during a live broadcast. The standing deploy rule from that day: pin image SHAs, never tags, on anything that touches a Kafka consumer group. A service mesh sits on the network, not on Kafka’s consumer protocol. It would have given me a prettier dashboard during the outage, not a faster fix.
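
A rule like that only holds if something enforces it. Here’s a minimal sketch of the check you could run in CI over the image references in your manifests. The function names and image names are hypothetical, and it only looks at the shape of the reference, not whether the digest exists in the registry.

```typescript
// Sketch of a CI lint for the "digests, never tags" rule.
// An image reference counts as pinned only if it ends in @sha256:<64 hex chars>.
const DIGEST_SUFFIX = /@sha256:[a-f0-9]{64}$/;

function isDigestPinned(image: string): boolean {
  return DIGEST_SUFFIX.test(image);
}

// Returns the image references that can still float with a tag.
function unpinnedImages(images: string[]): string[] {
  return images.filter((image) => !isDigestPinned(image));
}

// Placeholder digest and hypothetical registry, for illustration only.
const exampleDigest = 'sha256:' + 'a'.repeat(64);
const offenders = unpinnedImages([
  `registry.example.com/standings-projector@${exampleDigest}`,
  'registry.example.com/standings-projector:latest',
  'registry.example.com/standings-projector:1.4.2',
]);

console.log(offenders);
```

Fail the build when the list is non-empty and the :latest surprise never reaches a consumer group.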

Another war story, same shape

A real-time trading and charting platform I architected, designed for very high concurrent connection counts at peak. Socket.io on a Node.js gateway tier behind nginx. The Tuesday after a long bank-holiday weekend, the market opened at 09:30 local and seventy-four seconds later connections started dropping en masse. Clients reconnected immediately, were dropped again, reconnected again. Within ninety seconds the reconnect storm had every gateway pod pinned at 100% CPU. p99 tick fan-out latency went from around 80ms to around 3s.

First wrong fix: kubectl scale straight to nine pods through the autoscaler’s manual override. The new pods came online, hit the reconnect storm head-on, and went CPU-bound within twenty seconds of joining the pool. I was feeding the fire.

Real fix in two parallel moves. A client-side config push for jittered exponential backoff, min: 200ms, max: 30s, factor: 2, jitter: 50%. And a per-IP connection-rate limiter at the nginx layer at three new connections per second per IP. Pool stabilized in about eight minutes.
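
For reference, this is roughly what that backoff policy computes. It’s a standalone sketch with my own function and option names, not the client library’s API, and the choice to clamp after applying jitter is an assumption.

```typescript
// Jittered exponential backoff: min 200ms, max 30s, factor 2, jitter 50%.
// Names are illustrative; the random source is injectable for testing.
interface BackoffOptions {
  minMs: number;
  maxMs: number;
  factor: number;
  jitter: number; // 0.5 means +/- 50% around the base delay
}

function backoffDelayMs(
  attempt: number, // 0-based reconnect attempt
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  // Exponential growth, capped at the maximum.
  const base = Math.min(opts.minMs * opts.factor ** attempt, opts.maxMs);
  // Spread uniformly over [base * (1 - jitter), base * (1 + jitter)].
  const offset = (random() * 2 - 1) * opts.jitter * base;
  // Clamp again so jitter never pushes the delay past the maximum.
  return Math.min(Math.round(base + offset), opts.maxMs);
}

const reconnectPolicy: BackoffOptions = { minMs: 200, maxMs: 30_000, factor: 2, jitter: 0.5 };

// Print one sample schedule for the first six attempts.
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(attempt, backoffDelayMs(attempt, reconnectPolicy));
}
```

The jitter is what breaks the storm. Without it, every client that dropped at the same moment comes back at the same moment, and the gateways eat synchronized waves instead of a spread-out trickle.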

Could a mesh have intermediated this? Technically yes, Istio has connection-rate limiting via Envoy. Practically no, because the bug was on the client and a mesh between the gateway and the client does not exist. The mesh would have given me retries on the wrong leg of the call. Worth remembering. A mesh is for east-west traffic inside the cluster. It does not help you north-south, where most of your customer pain actually lives.

Health checks and observability

If you do run a mesh, your services still need proper health checks, otherwise the mesh’s load balancer is happily routing to half-dead pods. NestJS gives you @nestjs/terminus. Use it.

import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';

@Controller('healthz')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
  ) {}

  // Liveness: the process is up and can answer HTTP. No dependency checks,
  // so a slow database never gets the pod restarted.
  @Get('liveness')
  @HealthCheck()
  liveness() {
    return this.health.check([]);
  }

  // Readiness: only take traffic while Postgres answers a ping within 500ms.
  @Get('readiness')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('postgres', { timeout: 500 }),
    ]);
  }
}

Liveness is “am I alive”, readiness is “should I get traffic”. Conflate them and your mesh will keep sending requests to a pod that’s drained its DB pool. Whether you’re on Linkerd, Istio, or no mesh at all, that one’s on you.

Takeaways

Default to no mesh. A NestJS HTTP client with retries, timeouts, and a circuit breaker covers most teams.
If you do need a mesh, pick Linkerd. Smaller proxy, smaller CRD surface, sane defaults, mTLS on by default.
Istio is for platform teams that can own it as a product. If that’s not you, the operational cost will eat you.
A mesh fixes east-west, not north-south. Reconnect storms, broker rebalances, and database lag are not network problems.
Keep liveness and readiness probes honest. The mesh trusts them blindly.
If your team can list every running service from memory, you do not have a mesh problem yet.

Thanks for reading. If you’ve got thoughts, send them my way.