Choosing a NestJS Microservice Transport

A senior engineer's opinionated take on TCP, Redis, NATS, gRPC, and Kafka transports in NestJS, and when Kafka actually earns its complexity.

At the combat-sports tournament platform I CTO’d in London, the call to standardize async comms on Kafka took us about three weeks of arguing. Hundreds of microservices in flight, a team I’d hired and grown over a few years, and a rankings page that had to update standings in real time during live broadcasts. We tried Redis pub/sub first for two services. Then RabbitMQ for one. The thing that finally got everyone in the same room was a Saturday afternoon outage I’ll tell you about further down. After that the answer was Kafka, and the answer has been Kafka ever since.

NestJS gives you a transport-agnostic microservice abstraction. @MessagePattern, @EventPattern, ClientProxy, the works. The abstraction is good. It also makes it easy to pick the wrong transport because every docs example looks roughly the same. So let me cut through it.

My one-sentence opinion

Default to Redis. Reach for NATS when you outgrow Redis’s semantics. Use gRPC when you need typed sync calls between services your team owns. Pay the Kafka tax only when you need durable replay, ordering per key, or multiple independent consumers reading the same stream. TCP is for demos and unit tests.

That’s it. The rest of this post is why.

TCP

NestJS ships a TCP transport that’s the default in every Getting Started example. Fine for a local dev loop. Not a production transport.

// main.ts
import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.createMicroservice<MicroserviceOptions>(
    AppModule,
    {
      transport: Transport.TCP,
      options: { host: '0.0.0.0', port: 8877 },
    },
  );
  await app.listen();
}
bootstrap();

No backpressure handling, no broker, no replay, no load balancing across consumer instances unless you put a proxy in front of it. You’ll spend a week reinventing a worse Redis. Skip it.

Redis

For most NestJS services I’ve shipped, Redis pub/sub on the existing cache cluster handles 90% of the async work and adds no new infra. On a couple of side products I CTO, I leaned on it for everything that wasn’t a billing or audit event.

import { Module } from '@nestjs/common';
import { ClientsModule, Transport } from '@nestjs/microservices';

@Module({
  imports: [
    ClientsModule.register([
      {
        name: 'EVENTS',
        transport: Transport.REDIS,
        options: {
          host: process.env.REDIS_HOST,
          port: 6379,
          retryAttempts: 5,
          retryDelay: 1000,
        },
      },
    ]),
  ],
})
export class EventsModule {}

Where Redis stops being right: you need durable replay (Redis pub/sub forgets the moment a message is fanned out), you need ordering guarantees per partition key, or you need three different services to read the same stream independently without one of them eating the others’ messages. Redis Streams covers some of this. Once you’re shopping for Streams plus consumer groups plus DLQ tooling, you’re already shopping for Kafka or NATS, you just don’t know it yet.
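
To make the "forgets the moment a message is fanned out" point concrete, here is a minimal in-memory sketch of the two delivery models. FireAndForgetBus and ReplayableLog are hypothetical names made up for illustration; neither is a Redis or NestJS API, and real brokers add networking, persistence, and retention on top.

```typescript
// Illustrative only: in-memory stand-ins, not Redis or NestJS APIs.
type Handler = (msg: string) => void;

// Pub/sub semantics: a message reaches whoever is subscribed right now.
class FireAndForgetBus {
  private handlers: Handler[] = [];

  subscribe(handler: Handler): void {
    this.handlers.push(handler);
  }

  publish(msg: string): void {
    for (const handler of this.handlers) handler(msg);
    // Nothing is stored, so a later subscriber never sees this message.
  }
}

// Log semantics: messages are appended and each reader picks its own offset.
class ReplayableLog {
  private entries: string[] = [];

  append(msg: string): void {
    this.entries.push(msg);
  }

  // Read everything from a given offset onward.
  readFrom(offset: number): string[] {
    return this.entries.slice(offset);
  }
}

const bus = new FireAndForgetBus();
bus.publish('order-1'); // no subscribers yet, so this one is lost
const seenOnBus: string[] = [];
bus.subscribe((msg) => seenOnBus.push(msg));
bus.publish('order-2');

const log = new ReplayableLog();
log.append('order-1');
log.append('order-2');
const seenOnLog = log.readFrom(0); // a late reader can still start at offset 0
```

The late subscriber on the bus only ever sees order-2, while the late reader on the log gets both messages. That gap is what durable replay buys you.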

NATS

If I had to pick a single transport for a brand-new NestJS service today, it’d be NATS. JetStream gives you durable streams and consumer groups. The operational footprint is a fraction of Kafka’s. The NestJS adapter is solid.

import { Controller } from '@nestjs/common';
import { EventPattern, Payload, Ctx, NatsContext } from '@nestjs/microservices';
// Hypothetical application service that does the actual work.
import { OrderHandler } from './order.handler';

interface OrderCreated {
  orderId: string;
  customerId: string;
  totalCents: number;
}

@Controller()
export class OrderEventsController {
  // Injected by Nest's DI container.
  constructor(private readonly handler: OrderHandler) {}

  @EventPattern('orders.created')
  async onOrderCreated(
    @Payload() event: OrderCreated,
    @Ctx() ctx: NatsContext,
  ) {
    const subject = ctx.getSubject();
    // idempotency key comes from headers, not the body
    const messageId = ctx.getHeaders()?.get('Nats-Msg-Id');
    if (!messageId) {
      throw new Error(`Refusing to process ${subject} without idempotency key`);
    }
    // Pass the ID along so the handler can skip duplicates.
    await this.handler.process(event, messageId);
  }
}

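The thing nobody warns you about with NATS is core NATS versus JetStream. Core NATS is fire-and-forget, like Redis pub/sub with better ergonomics. JetStream is the durable one. Picking core NATS for a payment notification path is the kind of mistake you make once. Check which one your adapter actually speaks: as far as I know, the built-in Transport.NATS is core NATS with queue groups, so consuming from JetStream usually means a community or custom transport.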
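
Durable delivery usually means at-least-once delivery, so the handler has to tolerate the same message arriving twice. Here is a minimal sketch of a dedupe guard keyed on a message ID. IdempotentProcessor is a hypothetical name, and the in-memory Set is an assumption for illustration; a real service would keep seen IDs in a shared store with an expiry.

```typescript
// Hypothetical sketch: an in-memory dedupe guard keyed on a message ID.
// Assumption: a real service would keep seen IDs in a shared store with an expiry.
class IdempotentProcessor<T> {
  private readonly seen = new Set<string>();
  private readonly apply: (event: T) => Promise<void>;

  constructor(apply: (event: T) => Promise<void>) {
    this.apply = apply;
  }

  // Resolves to true if the event was applied, false if it was a duplicate.
  async process(event: T, messageId: string): Promise<boolean> {
    if (this.seen.has(messageId)) return false;
    await this.apply(event);
    // Record the ID only after apply succeeds, so a failed attempt can be redelivered.
    this.seen.add(messageId);
    return true;
  }
}

// Usage: a second delivery with the same ID is skipped.
const appliedOrders: string[] = [];
const processor = new IdempotentProcessor<{ orderId: string }>(async (event) => {
  appliedOrders.push(event.orderId);
});
```

Two concurrent deliveries of the same ID can both pass the check in this version. A production guard would need an atomic check-and-set in whatever store holds the IDs.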

gRPC

I reach for gRPC when two services my team owns need to talk synchronously and I want a real contract. Strongly typed messages, generated clients, deadlines that actually propagate. Not async. Not events. Just RPC done right.

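import { Controller } from '@nestjs/common';
import { GrpcMethod } from '@nestjs/microservices';
import { Observable, from } from 'rxjs';
// Hypothetical provider; lookup() is assumed to return a Promise<RankingsResponse>.
import { RankingsStore } from './rankings.store';

interface RankingsRequest { athleteId: string; }
interface RankingsResponse { rank: number; division: string; }

@Controller()
export class RankingsController {
  // Injected by Nest's DI container.
  constructor(private readonly rankings: RankingsStore) {}

  // Service and method names must match the definitions in the .proto file.
  @GrpcMethod('RankingsService', 'GetRank')
  getRank(req: RankingsRequest): Observable<RankingsResponse> {
    // from() turns the promise into an Observable that emits the response once.
    return from(this.rankings.lookup(req.athleteId));
  }
}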

Where it falls down: cross-team service calls in big orgs. The .proto file becomes a coupling point, and unless you have schema review discipline, breaking changes leak through. For service-to-service in a single team’s bounded context, gRPC is great. Across teams I’d default to HTTP/JSON with a published OpenAPI spec and accept the typing tax.

Kafka

Saturday afternoon, the federation platform I was CTO at. A live combat-sports tournament being broadcast publicly. Federations and commentators watching the standings page in real time. Around the third bout of the day, the standings-projector consumer group started rebalancing every thirty seconds or so. The match-events topic kept growing on the broker, but standings updates stopped reaching the public leaderboard. The page froze at 14:32 local time during a live broadcast. Within two minutes our SRE Slack had three PagerDuty pages and the federation’s tech contact pinged me directly.

First wrong fix: kubectl rollout restart deployment/standings-projector, hoping consumers would re-join cleanly. They did. Then they triggered another rebalance about forty seconds later. We were doing the same dance the group was already doing on its own.

Real fix: pulled pod logs side by side. One pod out of six had a different max.poll.interval.ms value: 300s on five of them, 60s on the sixth. The sixth pod was running a stale container image. Someone had pushed a config-touching fix without bumping the image tag, and the deployment had pulled :latest. That pod's handler made a slow downstream call to a federation-rules service that occasionally took around seventy seconds, which is past its 60s max.poll.interval.ms. So it kept getting kicked out of the group, and every kick triggered a rebalance for everyone else. Isolated the bad pod, storm drained in about ninety seconds.
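
A general guard against this failure mode is to bound slow downstream calls so a handler cannot sit past its poll interval. The sketch below is an illustration of that idea, not the fix applied that afternoon. withTimeout is a hypothetical helper, and the 60s interval and quarter-of-the-interval budget are assumed numbers.

```typescript
// Hypothetical helper: reject if `work` does not settle within `ms` milliseconds.
// Note: this stops waiting, it does not cancel the underlying call.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Assumed numbers for illustration: keep the budget well below the poll interval.
const MAX_POLL_INTERVAL_MS = 60_000;
const DOWNSTREAM_BUDGET_MS = MAX_POLL_INTERVAL_MS / 4;
```

A handler would wrap the slow call as withTimeout(slowCall(), DOWNSTREAM_BUDGET_MS) and route a timeout to a retry or dead-letter path. Failing one message is cheaper than getting the whole pod evicted from the group.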

Twelve minutes of stale standings during a live broadcast. Federation was understanding. Commentators were less so. The standing deploy rule from that day: pin image SHAs, never tags, on anything that touches a Kafka consumer group. CI fails the deploy if the manifest references :latest for any consumer pod.
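
The CI rule boils down to a string check on image references. A minimal sketch, assuming the pipeline has already extracted the references from the manifest; isPinnedByDigest and findUnpinnedImages are hypothetical names, and registry.example.com is a placeholder host.

```typescript
// Hypothetical CI check: accept only image references pinned by digest.
// Assumes references end in "@sha256:" followed by 64 lowercase hex characters.
const DIGEST_PINNED = /@sha256:[a-f0-9]{64}$/;

function isPinnedByDigest(image: string): boolean {
  return DIGEST_PINNED.test(image.trim());
}

// Returns the references that should fail the deploy.
function findUnpinnedImages(images: string[]): string[] {
  return images.filter((image) => !isPinnedByDigest(image));
}

// Placeholder references for illustration.
const pinned = 'registry.example.com/standings-projector@sha256:' + 'a'.repeat(64);
const floating = 'registry.example.com/standings-projector:latest';
const offenders = findUnpinnedImages([pinned, floating]);
```

Here offenders contains only the :latest reference, and a non-empty result is what should fail the build. Checking for a digest rather than just banning :latest also catches mutable tags like :stable.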

The workload in that story is also why I picked Kafka in the first place. Replay. Per-key ordering on match_id. Three independent consumer groups reading the same match-events topic: one for standings, one for analytics, one for the public feed. None of the other transports give you all of that without major contortions.
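
Per-key ordering falls out of one property: the same key always lands on the same partition, and a partition is read in order. The sketch below shows that property with a toy hash. partitionFor is a hypothetical function, not Kafka's partitioner (real clients ship their own, typically murmur2-based), and the partition count of 3 is an assumption.

```typescript
// Illustrative hash only. Kafka clients use their own partitioner;
// the point here is determinism, not the exact function.
function partitionFor(key: string, partitionCount: number): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash % partitionCount;
}

interface MatchEvent {
  matchId: string;
  seq: number;
}

const PARTITIONS = 3; // assumed partition count
const partitions: MatchEvent[][] = Array.from({ length: PARTITIONS }, () => []);

// Events from two matches, interleaved the way a live feed would produce them.
const sent: MatchEvent[] = [
  { matchId: 'm-1', seq: 1 },
  { matchId: 'm-2', seq: 1 },
  { matchId: 'm-1', seq: 2 },
  { matchId: 'm-2', seq: 2 },
  { matchId: 'm-1', seq: 3 },
];

for (const event of sent) {
  partitions[partitionFor(event.matchId, PARTITIONS)].push(event);
}

// Everything for m-1 sits in one partition, in the order it was sent.
const m1Events = partitions[partitionFor('m-1', PARTITIONS)].filter(
  (event) => event.matchId === 'm-1',
);
```

Ordering across different matches is not guaranteed, which is fine for standings: each match only needs its own events in sequence.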

import { Controller } from '@nestjs/common';
import { EventPattern, Payload, Ctx, KafkaContext } from '@nestjs/microservices';
// Hypothetical provider that applies events to the standings read model.
import { StandingsProjector } from './standings.projector';

@Controller()
export class MatchEventsController {
  // Injected by Nest's DI container.
  constructor(private readonly projector: StandingsProjector) {}

  @EventPattern('match-events')
  async onMatchEvent(
    @Payload() event: { matchId: string; type: string; payload: unknown },
    @Ctx() ctx: KafkaContext,
  ) {
    const topic = ctx.getTopic();
    const partition = ctx.getPartition();
    const { offset } = ctx.getMessage();
    // Pass the position along so the projector can log or dedupe on it.
    await this.projector.apply(event, { topic, partition, offset });
    // Offsets are auto-committed by the underlying kafkajs consumer by default,
    // but you can also do manual commits via ctx.getConsumer()
  }
}

The cost is real. You’re operating brokers, ZooKeeper or KRaft, a schema registry if you’re disciplined, consumer-lag dashboards, DLQ tooling, and a runbook for the rebalance storm I just told you about. If you don’t need durable replay, per-key ordering, or independent fan-out, you’re paying that tax for nothing.
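
Of those, the consumer-lag dashboard is the simplest to reason about: lag is the gap between where the log ends and where the group has committed. A minimal sketch of that arithmetic follows. The PartitionOffsets shape is hypothetical, offsets are modelled as strings, and Number is assumed to be safe because the values stay below 2^53.

```typescript
// Hypothetical shape for per-partition offsets, as a dashboard might collect them.
interface PartitionOffsets {
  partition: number;
  logEndOffset: string; // next offset the broker will assign
  committedOffset: string; // next offset the consumer group will read
}

// Lag for one partition. Assumes offsets fit in a Number (below 2^53).
function lagFor(offsets: PartitionOffsets): number {
  const lag = Number(offsets.logEndOffset) - Number(offsets.committedOffset);
  return Math.max(lag, 0);
}

// Total lag for a consumer group across all partitions of a topic.
function totalLag(all: PartitionOffsets[]): number {
  return all.reduce((sum, offsets) => sum + lagFor(offsets), 0);
}

// Made-up sample values for illustration.
const sample: PartitionOffsets[] = [
  { partition: 0, logEndOffset: '1500', committedOffset: '1500' },
  { partition: 1, logEndOffset: '1720', committedOffset: '1200' },
  { partition: 2, logEndOffset: '990', committedOffset: '980' },
];
const sampleLag = totalLag(sample);
```

A rebalance storm shows up as the log-end offsets climbing while the committed offsets stand still, so total lag grows steadily. Alerting on that slope is more useful than alerting on any single value.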

Takeaways

Default to Redis. It’s already in your stack and it does most async jobs well.
Reach for NATS when you need durable streams or consumer groups without paying for Kafka.
Use gRPC for typed sync calls inside a single team’s bounded context. Not across teams.
Pay the Kafka tax only when you need durable replay, per-key ordering, or independent fan-out. All three, ideally.
Pin image SHAs on every consumer pod. :latest will cost you a live broadcast someday.
TCP is for examples. Don’t ship it.

Thanks for reading. If you’ve got thoughts, send them my way.