An opinionated take on Consul, Eureka, and Kubernetes DNS-based discovery, with a real migration path off Consul and the health-check failure modes that bite in production.
It was a Saturday at the combat-sports tournament platform I CTO’d in London. Live federation broadcast going out, hundreds of microservices in flight, one consumer pod quietly running a stale image off :latest. The standings page froze at 14:32. Commentators noticed before we did. The root cause turned out to be a service-discovery problem dressed up as a Kafka rebalance, which is how a lot of these go. Discovery isn’t only about how A finds B. It’s about how the platform decides who’s alive, who gets traffic, how fast that opinion changes. Get those wrong and you ship outages.
Here’s my one-line opinion. For anything net-new on Kubernetes, use the cluster’s DNS and call it done. Reach for Consul only if you’re running cross-cluster, mixed VM-and-k8s fleets, or you genuinely need the KV store. Eureka I’d skip unless you’re already deep in JVM land.
Consul is a coordination plane. KV store, health checks, multi-datacenter gossip, ACLs. Agents on every node plus a server quorum, services talk to it directly or via a sidecar like Envoy. Powerful, operationally heavy. You’ll spend chunks of someone’s time keeping the cluster healthy.
Eureka came out of the JVM world and is built around client-side discovery: each service polls a registry, caches, load-balances locally. Tolerates registry outages by design. Also showing its age, and the tooling assumes a stack I don’t ship in.
Kubernetes DNS-based discovery is the boring default, and that’s a compliment. Declare a Service, the control plane wires it up, kube-proxy does the rest, your app resolves payments.default.svc.cluster.local. No agents. No extra cluster. Health checks come from the readiness and liveness probes you already write.
Client-side versus server-side is the split that decides how your discovery layer behaves under stress.
Server-side discovery: the caller talks to a stable virtual IP, the platform routes to a healthy backend. Kubernetes Service objects are this. So is an AWS ALB in front of a target group. Caller stays simple. Platform owns membership.
Client-side discovery: each caller pulls the live backend list and picks one itself. Eureka, or Consul with a client library. Smarter caller, more failure-prone. If your discovery library has a bug, every service gets it at once.
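To make "smarter caller" concrete, here is a minimal sketch of the caller-side half, assuming some poller hands it the registry's latest backend list. BackendPicker, update, and pick are hypothetical names for illustration, not the Eureka or Consul client API.

```typescript
// Hypothetical shape of a registry entry; real clients carry more metadata.
type Backend = { host: string; port: number };

class BackendPicker {
  private backends: Backend[] = [];
  private cursor = 0;

  // Called by whatever polls the registry. A failed poll (null) or an empty
  // list keeps the last known backends, so a registry outage does not
  // immediately take the caller down with it.
  update(latest: Backend[] | null): void {
    if (latest !== null && latest.length > 0) {
      this.backends = latest;
    }
  }

  // Plain round-robin over the cached list.
  pick(): Backend {
    if (this.backends.length === 0) {
      throw new Error('no known backends');
    }
    const backend = this.backends[this.cursor % this.backends.length];
    this.cursor += 1;
    return backend;
  }
}
```

Every caller carries a copy of this logic, so a bug in it ships to every service that uses the library.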
I’ll take server-side every time unless I have a reason. At a creator-economy platform I worked at, we ran thousands of pods on AWS EKS and most service-to-service traffic resolved through Kubernetes DNS. Control plane owned membership, nobody wrote client-side load-balancer code. Right ratio.
A NestJS service calling another, end of story:
import { Module } from '@nestjs/common';
import { HttpModule } from '@nestjs/axios';

@Module({
  imports: [
    HttpModule.register({
      // Cluster DNS name of the payments Service; no discovery client involved.
      baseURL: 'http://payments.default.svc.cluster.local',
      timeout: 2000, // milliseconds
      maxRedirects: 0,
    }),
  ],
})
export class PaymentsClientModule {}
The Service and Deployment manifests that make that DNS name real:
apiVersion: v1
kind: Service
metadata:
  name: payments
  namespace: default
spec:
  selector:
    app: payments
  ports:
    - name: http
      port: 80
      targetPort: 3000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: ghcr.io/acme/payments:sha-9f1c2a3
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet: { path: /healthz/ready, port: 3000 }
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet: { path: /healthz/live, port: 3000 }
            periodSeconds: 10
            failureThreshold: 3
That’s the whole pattern. Readiness gates traffic, liveness restarts wedged pods, and the DNS name resolves to the Service’s cluster IP, which only routes to endpoints that are passing readiness. The orchestrator is the discovery system. Note the image tag: SHA, not :latest. We’ll come back to that.
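How fast the platform changes its mind about a pod falls out of the probe numbers in that manifest. A back-of-envelope sketch, assuming detection takes roughly periodSeconds times failureThreshold and ignoring probe timeouts and the time endpoint updates take to propagate (detectionBudgetSeconds is a hypothetical helper, not a Kubernetes API):

```typescript
type Probe = { periodSeconds: number; failureThreshold: number };

// Rough budget only: consecutive failed probes needed, times the probe period.
function detectionBudgetSeconds(probe: Probe): number {
  return probe.periodSeconds * probe.failureThreshold;
}

// Values copied from the Deployment manifest above.
const readinessProbe: Probe = { periodSeconds: 5, failureThreshold: 2 };
const livenessProbe: Probe = { periodSeconds: 10, failureThreshold: 3 };

const secondsUntilOutOfRotation = detectionBudgetSeconds(readinessProbe); // 10
const secondsUntilRestart = detectionBudgetSeconds(livenessProbe); // 30
```

So a pod that loses its dependencies drops out of rotation in about ten seconds, well before the liveness probe would restart it at around thirty. That ordering is what you want: stop sending traffic first, restart later.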
Discovery is only as good as the signal it routes on. Most common failure mode I see: a readiness probe that returns 200 when the process is up but the dependencies aren’t.
A readiness probe I’d actually ship, in Nest:
import { Controller, Get, HttpCode, ServiceUnavailableException } from '@nestjs/common';
import { InjectDataSource } from '@nestjs/typeorm';
import { DataSource } from 'typeorm';
import { Redis } from 'ioredis';

@Controller('healthz')
export class HealthController {
  constructor(
    @InjectDataSource() private readonly db: DataSource,
    private readonly redis: Redis,
  ) {}

  // Liveness: the process can answer HTTP. No dependency checks here.
  @Get('live')
  @HttpCode(200)
  live() {
    return { status: 'ok' };
  }

  // Readiness: every dependency this pod needs to serve traffic must respond.
  @Get('ready')
  async ready() {
    const checks = await Promise.allSettled([
      this.db.query('select 1'),
      this.redis.ping(),
    ]);
    const failed = checks.filter((c) => c.status === 'rejected');
    if (failed.length > 0) {
      // 503 fails the probe, so the pod is pulled from rotation without a restart.
      throw new ServiceUnavailableException({ status: 'degraded' });
    }
    return { status: 'ready' };
  }
}
Liveness is “the process is responsive” and should not check dependencies. If your liveness probe verifies the database, you’ll restart pods during a database hiccup and turn a small incident into a cascade. Readiness is “this pod can serve traffic right now,” so dependency checks belong here. Confusing the two is one of the more expensive mistakes I see.
Back to that Saturday at the federation platform. The standings-projector consumer group started rebalancing every thirty seconds. Standings updates stopped reaching the leaderboard. The on-call’s first move was a kubectl rollout restart, which put the restarted pods straight back into the same rebalance loop.
A discovery problem with a Kafka-shaped symptom. One pod out of six ran a stale image because someone had pushed a config-touching fix without bumping the tag and the deployment had pulled :latest. That pod had max.poll.interval.ms set to 60s where the other five had 300s. Its handler did a downstream call that occasionally took around 70 seconds. So that pod kept getting kicked out of the group, forcing a rebalance for everyone.
Real fix: cordon the bad pod, drain the storm, ship a CI check that fails any deploy referencing :latest on a Kafka consumer manifest. About 12 minutes of stale standings on a public broadcast. Standing rule since: pin images to SHAs, never mutable tags, on anything that sits behind discovery or joins a consumer group. Discovery and identity are the same problem in disguise.
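A CI check along those lines can be small. This is a sketch, not the check we shipped: findUnpinnedImages is a hypothetical helper that scans manifest text line by line, and treating an untagged image as a failure (because it defaults to :latest) is my assumption on top of the rule.

```typescript
// Returns every image reference in the manifest text that is not pinned.
function findUnpinnedImages(manifest: string): string[] {
  const offenders: string[] = [];
  for (const line of manifest.split('\n')) {
    // Matches "image: ref" and "- image: ref", with optional quotes.
    const match = line.match(/^\s*(?:-\s*)?image:\s*["']?([^"'\s#]+)/);
    if (!match) continue;
    const image = match[1];
    if (image.includes('@sha256:')) continue; // digest-pinned, always fine
    // Look for the tag only after the last "/", so registry ports are ignored.
    const lastSegment = image.slice(image.lastIndexOf('/') + 1);
    const tag = lastSegment.includes(':') ? lastSegment.split(':')[1] : undefined;
    if (tag === undefined || tag === 'latest') offenders.push(image);
  }
  return offenders;
}
```

In a pipeline you would read each manifest, call this, and fail the job if the returned list is non-empty. A real check should parse the YAML rather than match lines, but the line scan is enough to show the rule.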
A few years earlier at a real-time trading platform I architected, we had close to 10 million concurrent connections at peak. Socket.io, Node gateway tier, nginx in front. Tuesday after a long weekend, market open at 09:30. At 09:31:14 the connection pool started thrashing. Clients dropped, reconnected, dropped again, and within 90 seconds every gateway pod was pinned at 100% CPU.
I scaled gateway pods 3x with kubectl scale. Made it worse. New pods came online, saturated, and the higher count meant more partial-success reconnects, which clients read as “service is back” and slammed harder. Feeding the fire.
Real fix in two parallel moves. An emergency client-config push through a remote-config channel we’d built for this, with jittered exponential backoff, min 200ms, max 30s, factor 2, jitter +/- 50%. And a per-IP connection-rate limiter at nginx, set tight. About 8 minutes later, the pool stabilized. Server-side scaling can’t fix a self-amplifying client-side bug. Backoff and jitter live on the caller.
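The backoff parameters above translate to something like the following sketch. The numbers are the ones we pushed; the function name, the config shape, and the choice to apply jitter after the cap are my assumptions for illustration, not the client code we shipped.

```typescript
type BackoffConfig = { minMs: number; maxMs: number; factor: number; jitter: number };

// min 200ms, max 30s, factor 2, jitter +/- 50%
const reconnectBackoff: BackoffConfig = { minMs: 200, maxMs: 30000, factor: 2, jitter: 0.5 };

// attempt is zero-based: 0 is the first reconnect after a drop.
// random is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  cfg: BackoffConfig = reconnectBackoff,
  random: () => number = Math.random,
): number {
  // Exponential growth from minMs, capped at maxMs.
  const base = Math.min(cfg.maxMs, cfg.minMs * Math.pow(cfg.factor, attempt));
  // Spread is uniform in [-jitter, +jitter), applied to the capped base
  // (assumption: jitter may push a delay past maxMs or below minMs).
  const spread = (random() * 2 - 1) * cfg.jitter;
  return Math.round(base * (1 + spread));
}
```

The jitter is the part that matters in a reconnect storm. Without it every client that dropped at 09:31:14 retries on the same schedule and the gateways see the same spike again, just spaced further apart.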
If you’re on Consul on Kubernetes and the agents-plus-quorum tax is hurting, the path I’ve seen work is incremental. Every service gets proper readiness and liveness endpoints. Create a Service for it in-cluster and run both discovery paths in parallel, feature-flagged. Move callers one consumer at a time, watching error rates. Tear down Consul agents on migrated namespaces. Keep Consul only for cross-cluster or VM-resident services that need it. Discovery cutovers are the kind of change you cannot rehearse, so no big-bang.
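The "both paths in parallel, feature-flagged" step can be as small as a per-caller flag that picks the base URL. A sketch under assumptions: the flag map, the caller names, and the helper names are hypothetical, and the Consul branch assumes callers can already resolve Consul's DNS interface under the default .consul domain.

```typescript
type DiscoveryPath = 'consul' | 'k8s-dns';

// Hypothetical flag store: one entry per caller that has been moved over.
const discoveryFlags: Record<string, DiscoveryPath> = {
  'standings-api': 'k8s-dns',
};

// Callers without a flag stay on Consul until someone moves them deliberately.
function pathFor(caller: string, flags: Record<string, DiscoveryPath>): DiscoveryPath {
  return flags[caller] ?? 'consul';
}

function baseUrlFor(service: string, namespace: string, path: DiscoveryPath): string {
  return path === 'k8s-dns'
    ? `http://${service}.${namespace}.svc.cluster.local`
    : `http://${service}.service.consul`;
}
```

Defaulting to the old path keeps rollback to a one-line flag change, which is the property you want while watching error rates on each migrated caller.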
Thanks for reading. If you’ve got thoughts, send them my way.