Service Discovery in Microservices

An opinionated take on Consul, Eureka, and Kubernetes DNS-based discovery, with a real migration path off Consul and the health-check failure modes that bite in production.

It was a Saturday at the combat-sports tournament platform I CTO’d in London. Live federation broadcast going out, hundreds of microservices in flight, one consumer pod quietly running a stale image off :latest. The standings page froze at 14:32. Commentators noticed before we did. The root cause turned out to be a service-discovery problem dressed up as a Kafka rebalance, which is how a lot of these go. Discovery isn’t only about how A finds B. It’s about how the platform decides who’s alive, who gets traffic, how fast that opinion changes. Get those wrong and you ship outages.

Here’s my one-line opinion. For anything net-new on Kubernetes, use the cluster’s DNS and call it done. Reach for Consul only if you’re running cross-cluster, mixed VM-and-k8s fleets, or you genuinely need the KV store. Eureka I’d skip unless you’re already deep in JVM land.

The three flavors in play

Consul is a coordination plane. KV store, health checks, multi-datacenter gossip, ACLs. Agents on every node plus a server quorum, services talk to it directly or via a sidecar like Envoy. Powerful, operationally heavy. You’ll spend chunks of someone’s time keeping the cluster healthy.

Eureka came out of the JVM world and is built around client-side discovery: each service polls a registry, caches, load-balances locally. Tolerates registry outages by design. Also showing its age, and the tooling assumes a stack I don’t ship in.

Kubernetes DNS-based discovery is the boring default, and that’s a compliment. Declare a Service, the control plane tracks its endpoints, cluster DNS answers for payments.default.svc.cluster.local, and kube-proxy routes the Service IP to a ready pod. No agents. No extra cluster. Health checks come from the readiness and liveness probes you already write.
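
For reference, the name your app resolves follows a fixed shape: service, namespace, svc, cluster domain. A minimal sketch, assuming the default cluster domain of cluster.local (some clusters override it); serviceHost and serviceUrl are hypothetical helper names, not part of any library.

```typescript
// Hypothetical helper: composes the in-cluster DNS name for a Service.
// Assumes the default cluster domain "cluster.local".
function serviceHost(
  service: string,
  namespace = "default",
  clusterDomain = "cluster.local",
): string {
  return `${service}.${namespace}.svc.${clusterDomain}`;
}

// Hypothetical helper: builds a plain-HTTP base URL for that Service.
function serviceUrl(service: string, namespace = "default", port = 80): string {
  const host = serviceHost(service, namespace);
  // Leave the port off for 80 so the URL matches what callers usually write.
  return port === 80 ? `http://${host}` : `http://${host}:${port}`;
}
```

Calling serviceUrl("payments") gives the same base URL the NestJS client further down hardcodes.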

Client-side vs server-side discovery

This is the split that decides how your discovery layer behaves under stress.

Server-side discovery: the caller talks to a stable virtual IP, the platform routes to a healthy backend. Kubernetes Service objects are this. So is an AWS ALB in front of a target group. Caller stays simple. Platform owns membership.

Client-side discovery: each caller pulls the live backend list and picks one itself. Eureka, or Consul with a client library. Smarter caller, more failure-prone. If your discovery library has a bug, every service gets it at once.
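
To make "smarter caller" concrete, here is roughly what every caller ends up carrying. A minimal sketch, not Eureka’s or Consul’s actual client: Backend, BackendCache, and pollRegistry are hypothetical names, and fetchBackends stands in for whatever registry call your library makes.

```typescript
type Backend = { host: string; port: number };

class BackendCache {
  private backends: Backend[] = [];
  private cursor = 0;

  // Called by the polling loop with whatever the registry returned.
  update(fresh: Backend[]): void {
    // Design choice (an assumption): ignore empty answers so a flaky
    // registry cannot wipe out a list that was working a moment ago.
    if (fresh.length > 0) this.backends = fresh;
  }

  // Round-robin over the cached list; no network call on the hot path.
  pick(): Backend {
    if (this.backends.length === 0) throw new Error("no backends known");
    const backend = this.backends[this.cursor % this.backends.length];
    this.cursor += 1;
    return backend;
  }
}

// One polling tick. If the registry is down, the last known list stays in use.
async function pollRegistry(
  cache: BackendCache,
  fetchBackends: () => Promise<Backend[]>,
): Promise<void> {
  try {
    cache.update(await fetchBackends());
  } catch {
    // Registry unreachable: keep serving from the cache.
  }
}
```

Every line of that is code you now own in every service, which is the failure surface I’m pointing at.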

I’ll take server-side every time unless I have a reason. At a creator-economy platform I worked at, we ran thousands of pods on AWS EKS and most service-to-service traffic resolved through Kubernetes DNS. Control plane owned membership, nobody wrote client-side load-balancer code. Right ratio: most traffic on the boring path, exceptions only where they were earned.

Code: the Kubernetes default

A NestJS service calling another, end of story:

import { Module } from '@nestjs/common';
import { HttpModule } from '@nestjs/axios';

@Module({
  imports: [
    HttpModule.register({
      baseURL: 'http://payments.default.svc.cluster.local',
      timeout: 2000,
      maxRedirects: 0,
    }),
  ],
})
export class PaymentsClientModule {}

The Service manifest that makes that DNS name real:

apiVersion: v1
kind: Service
metadata:
  name: payments
  namespace: default
spec:
  selector:
    app: payments
  ports:
    - name: http
      port: 80
      targetPort: 3000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: ghcr.io/acme/payments:sha-9f1c2a3
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet: { path: /healthz/ready, port: 3000 }
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet: { path: /healthz/live, port: 3000 }
            periodSeconds: 10
            failureThreshold: 3

That’s the whole pattern. Readiness gates traffic, liveness restarts wedged pods, and the DNS name resolves to the Service’s stable cluster IP, which only routes to endpoints that are passing readiness. The orchestrator is the discovery system. Note the image tag: SHA, not :latest. We’ll come back to that.

Health checks are the actual product

Discovery is only as good as the signal it routes on. Most common failure mode I see: a readiness probe that returns 200 when the process is up but the dependencies aren’t.

A readiness probe I’d actually ship, in Nest:

import { Controller, Get, HttpCode, ServiceUnavailableException } from '@nestjs/common';
import { InjectDataSource } from '@nestjs/typeorm';
import { DataSource } from 'typeorm';
import { Redis } from 'ioredis';

@Controller('healthz')
export class HealthController {
  constructor(
    @InjectDataSource() private readonly db: DataSource,
    private readonly redis: Redis,
  ) {}

  @Get('live')
  @HttpCode(200)
  live() {
    return { status: 'ok' };
  }

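  @Get('ready')
  async ready() {
    // Probe every dependency in parallel; allSettled never throws, so one failure cannot mask another.
    const checks = await Promise.allSettled([
      this.db.query('select 1'),
      this.redis.ping(),
    ]);
    const failed = checks.filter((c) => c.status === 'rejected');
    if (failed.length > 0) {
      // A 503 counts as a failed probe; after failureThreshold failures the pod stops receiving traffic.
      throw new ServiceUnavailableException({ status: 'degraded' });
    }
    return { status: 'ready' };
  }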
}

Liveness is “the process is responsive” and should not check dependencies. If your liveness probe verifies the database, you’ll restart pods during a database hiccup and turn a small incident into a cascade. Readiness is “this pod can serve traffic right now,” so dependency checks belong here. Confusing the two is one of the more expensive mistakes I see.
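
One refinement I’d consider on the readiness side: bound each dependency check so a hung connection fails the probe quickly instead of stalling it, and report which dependency failed. A minimal sketch under assumptions: withTimeout and readinessVerdict are hypothetical helpers, and the timeout you pass should sit below the probe’s timeoutSeconds, which defaults to 1 second in Kubernetes.

```typescript
// Hypothetical helper: rejects if a dependency check takes longer than `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical helper: turns allSettled results into a verdict that names
// the failing dependencies. A missing result is treated as a failure.
function readinessVerdict(
  names: string[],
  results: PromiseSettledResult<unknown>[],
): { ready: boolean; failed: string[] } {
  const failed = names.filter((_name, i) => results[i]?.status !== "fulfilled");
  return { ready: failed.length === 0, failed };
}
```

In the controller above, that would mean wrapping each call, for example withTimeout(this.redis.ping(), 500, 'redis'), before handing the array to Promise.allSettled. The 500ms figure is illustrative, not a recommendation.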

War story: the stale-image rebalance

Back to that Saturday at the federation platform. The standings-projector consumer group started rebalancing every thirty seconds. Standings updates stopped reaching the leaderboard. The on-call’s first move was a kubectl rollout restart, which dropped us right back into the same rebalance loop.

A discovery problem with a Kafka-shaped symptom. One pod out of six ran a stale image because someone had pushed a config-touching fix without bumping the tag and the deployment had pulled :latest. That pod had max.poll.interval.ms set to 60s where the other five had 300s. Its handler did a downstream call that occasionally took around 70 seconds. So that pod kept getting kicked out of the group, forcing a rebalance for everyone.

Real fix: cordon the bad pod, drain the storm, ship a CI check that fails any deploy referencing :latest on a Kafka consumer manifest. About 12 minutes of stale standings on a public broadcast. Standing rule since: pin SHAs, never tags, on anything in a discovery or consumer group. Discovery and identity are the same problem in disguise.
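
The CI check is small. This is a sketch of the idea rather than the script we shipped: findUnpinnedImages is a hypothetical name, it scans manifest text line by line instead of parsing YAML, and it treats a missing tag the same as :latest because that is what the runtime pulls by default.

```typescript
// Returns image references that use :latest or carry no tag or digest at all.
function findUnpinnedImages(manifest: string): string[] {
  const offenders: string[] = [];
  for (const line of manifest.split("\n")) {
    // Matches "image: <ref>" with optional list dash and optional quotes.
    const match = line.match(/^\s*(?:-\s*)?image:\s*["']?([^\s"'#]+)/);
    if (!match) continue;
    const ref = match[1];
    if (ref.includes("@sha256:")) continue; // digest-pinned, always fine
    // Only look for a tag after the last slash, so registry ports are ignored.
    const lastSegment = ref.slice(ref.lastIndexOf("/") + 1);
    const tag = lastSegment.includes(":") ? lastSegment.split(":")[1] : "";
    if (tag === "" || tag === "latest") offenders.push(ref);
  }
  return offenders;
}
```

In CI you would read each manifest file, call this, and exit non-zero when the returned list is not empty.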

War story: client side as the bug

A few years earlier at a real-time trading platform I architected, we had close to 10 million concurrent connections at peak. Socket.io, Node gateway tier, nginx in front. Tuesday after a long weekend, market open at 09:30. At 09:31:14 the connection pool started thrashing. Clients dropped, reconnected, dropped again, and within 90 seconds every gateway pod was pinned at 100% CPU.

I scaled gateway pods 3x with kubectl scale. Made it worse. New pods came online, saturated, and the higher count meant more partial-success reconnects, which clients read as “service is back” and slammed harder. Feeding the fire.

Real fix in two parallel moves. An emergency client-config push through a remote-config channel we’d built for this, with jittered exponential backoff, min 200ms, max 30s, factor 2, jitter +/- 50%. And a per-IP connection-rate limiter at nginx, set tight. About 8 minutes later, the pool stabilized. Server-side scaling can’t fix a self-amplifying client-side bug. Backoff and jitter live on the caller.
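
The backoff itself fits in a few lines. A minimal sketch using the parameters above; reconnectDelayMs and DEFAULT_BACKOFF are hypothetical names, and applying the jitter after the cap is my assumption here, which means the longest possible delay is 45s rather than 30s.

```typescript
interface BackoffOptions {
  minMs: number;  // delay for the first retry
  maxMs: number;  // cap on the un-jittered delay
  factor: number; // growth per attempt
  jitter: number; // 0.5 means +/- 50%
}

const DEFAULT_BACKOFF: BackoffOptions = { minMs: 200, maxMs: 30_000, factor: 2, jitter: 0.5 };

// attempt is zero-based: 0 for the first reconnect, 1 for the second, and so on.
// `random` is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number,
  opts: BackoffOptions = DEFAULT_BACKOFF,
  random: () => number = Math.random,
): number {
  // Exponential growth from the minimum, capped at the maximum.
  const base = Math.min(opts.maxMs, opts.minMs * Math.pow(opts.factor, attempt));
  // Spread each delay uniformly across [base * (1 - jitter), base * (1 + jitter)].
  const spread = 1 + opts.jitter * (2 * random() - 1);
  return Math.round(base * spread);
}
```

The jitter is what breaks the synchronized reconnect waves. Backoff alone just makes every client come back at the same later moment.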

Migrating off Consul to DNS

If you’re on Consul on Kubernetes and the agents-plus-quorum tax is hurting, the path I’ve seen work is incremental. First, give every service proper readiness and liveness endpoints. Then create a Service for each one in-cluster and run both discovery paths in parallel behind a feature flag. Move callers over one consumer at a time, watching error rates. Tear down Consul agents in a namespace once everything in it has migrated. Keep Consul only for the cross-cluster or VM-resident services that need it. Discovery cutovers are the kind of change you cannot fully rehearse, so no big-bang.
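
The dual path can be as dumb as a lookup. A minimal sketch, assuming flags are keyed per caller-to-service edge; DualPathTarget, MigrationFlags, and resolveBaseUrl are hypothetical names, and the flag store behind them is whatever you already run.

```typescript
type DiscoveryPath = "consul" | "k8s-dns";

interface DualPathTarget {
  consulUrl: string; // however the caller reaches the service through Consul today
  k8sUrl: string;    // the in-cluster Service DNS name
}

// Hypothetical flag shape: one entry per "caller->service" edge that has been moved.
type MigrationFlags = Record<string, DiscoveryPath>;

function resolveBaseUrl(
  caller: string,
  service: string,
  target: DualPathTarget,
  flags: MigrationFlags,
): string {
  // Anything not explicitly flagged stays on Consul, so rollback is deleting a flag.
  const path = flags[`${caller}->${service}`] ?? "consul";
  return path === "k8s-dns" ? target.k8sUrl : target.consulUrl;
}
```

Defaulting to the old path is the point: a missing or broken flag leaves a caller where it already worked.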

Takeaways

Default to Kubernetes DNS for new microservices. Stop adding layers you don’t need.
Server-side discovery handles the boring 99 percent. Client-side libraries should only exist where you have a reason.
Readiness probes check dependencies. Liveness probes do not. Don’t confuse them.
Pin image SHAs on anything that participates in a discovery or consumer group.
Backoff and jitter live on the client. The platform cannot save you from a reconnect storm.
Migrate off Consul incrementally with a feature-flag-driven dual path. No big-bang cutovers.

Thanks for reading. If you’ve got thoughts, send them my way.