Our WebSocket Server Melted at 500K Connections

A live sports final pushed our Socket.io gateway past the point of recovery. Here's how we rebuilt it on AnyCable, topic-sharded fan-out, and Redis Streams.

It was a Saturday evening, the final of a tournament our product was streaming live, and somewhere around 500K concurrent connections the gateway tier started shedding clients in waves. This was at a real-time trading and charting platform I architected a few years back, except the tick stream that night wasn’t market data. It was a live sports broadcast we’d onboarded as a new vertical for the same gateway. Same Socket.io on top of a Node.js/TypeScript stack behind nginx. Same on-call rotation. Different content, same failure mode.

I’d love to tell you we caught it on a dashboard. We caught it on Slack. The federation’s tech contact pinged me directly within two minutes of the leaderboard going stale, and the war room channel hit the fire-emoji wall about ten seconds later.

The night the gateway tier folded

The pattern was a textbook reconnection storm. Clients dropping en masse, reconnecting immediately, dropping again. Within 90 seconds every gateway pod was pinned at 100% CPU. p99 tick fan-out latency went from around 80 ms to roughly 3 seconds. Charts and scoreboards on the client started showing stale data, which is the worst failure mode for a live event because nobody knows if they’re seeing reality or a frozen frame.

I was on-call that day. I remember thinking, alright, we built for around 10M concurrent connections in theory, and we’re folding at 500K in practice. That’s not a capacity problem. That’s a topology problem.

Why Socket.io stopped fitting

The setup was conventional. Sticky-session affinity at the load balancer, each gateway pod terminating WebSockets, a Redis pub/sub channel per topic for fan-out, the app process doing both connection management and business logic on the same event loop. When the room is small this is fine. When the room is half a million people watching the same finals stream, the same event loop that handles your WebSocket frames is also doing your auth, your rate limiting, your message serialization, and your downstream calls.

// the old fan-out shape, simplified
import { createAdapter } from '@socket.io/redis-adapter'
import { createClient } from 'redis'
import { Server } from 'socket.io'

const pubClient = createClient({ url: process.env.REDIS_URL })
const subClient = pubClient.duplicate()
await Promise.all([pubClient.connect(), subClient.connect()])

const io = new Server({
  adapter: createAdapter(pubClient, subClient),
  transports: ['websocket'],
})

io.on('connection', (socket) => {
  socket.on('subscribe', (topic) => socket.join(topic))
})

// every emit goes through pub/sub. fan-out scales with the largest room.
function broadcastTick(topic: string, payload: unknown) {
  io.to(topic).emit('tick', payload)
}

Redis pub/sub is also fire-and-forget. No durability, no consumer groups, no replay. If a pod blips and a client reconnects, anything that flew past during the blip is just gone. For a chart you can squint at, that’s annoying. For a live scoreboard, that’s a credibility hit.
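To make the difference concrete, here is a toy in-memory model, not Redis itself. `FireAndForgetBus` and `ReplayLog` are hypothetical names invented for this sketch. The first delivers only to whoever is attached at publish time. The second keeps entries so a reader with a cursor can catch up after a gap.

```typescript
type Tick = { seq: number; score: string }

// Fire-and-forget: a message only reaches subscribers attached at publish time.
class FireAndForgetBus {
  private subscribers = new Set<(t: Tick) => void>()

  // Returns an unsubscribe function.
  subscribe(fn: (t: Tick) => void): () => void {
    this.subscribers.add(fn)
    return () => {
      this.subscribers.delete(fn)
    }
  }

  publish(t: Tick): void {
    for (const fn of this.subscribers) fn(t)
  }
}

// Append-only log: readers keep a cursor and can ask for everything after it.
class ReplayLog {
  private entries: Tick[] = []

  append(t: Tick): void {
    this.entries.push(t)
  }

  readAfter(lastSeq: number): Tick[] {
    return this.entries.filter((e) => e.seq > lastSeq)
  }
}
```

A client that detaches from the bus between tick 1 and tick 3 never sees tick 2. The same client calling `readAfter(1)` on the log gets ticks 2 and 3 back.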

The bad first fix we shipped

First instinct was to scale gateway pods 3x via the autoscaler’s manual override. kubectl scale straight from six to eighteen. New pods came up, hit the reconnect storm head-on, and went CPU-bound within around 20 seconds of joining the pool. I was feeding the fire.

Worse, the higher pod count meant more partial-success reconnects. Clients landed on a healthy pod, got the connection-established signal, then dropped again when that pod saturated. Same client, three pods, three failed sessions in the span of a minute. Each failure triggered another reconnect. The storm got louder.

The real fix during the incident itself was a two-step. First, an emergency client-side config push through the remote-config channel we’d built for moments like this. Jittered exponential backoff: min 200ms, max 30s, factor 2, jitter plus-or-minus 50%. Second, a per-IP connection-rate limiter at the nginx layer, set deliberately tight at three new connections per second per IP. Within about eight minutes the pool stabilized and tick fan-out came back under 200 ms.
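A minimal sketch of that backoff policy, using the numbers quoted above. The names `BackoffConfig` and `reconnectDelayMs` are illustrative, and applying the cap before the jitter is an assumption, not necessarily what the shipped client did.

```typescript
// Defaults mirror the figures in the text: min 200ms, max 30s, factor 2, jitter +/-50%.
type BackoffConfig = { minMs: number; maxMs: number; factor: number; jitter: number }

const DEFAULT_BACKOFF: BackoffConfig = { minMs: 200, maxMs: 30_000, factor: 2, jitter: 0.5 }

// Delay before reconnect attempt `attempt` (0-based).
// `rand` is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number,
  cfg: BackoffConfig = DEFAULT_BACKOFF,
  rand: () => number = Math.random,
): number {
  // Exponential growth, capped at maxMs before jitter is applied (assumption).
  const base = Math.min(cfg.maxMs, cfg.minMs * Math.pow(cfg.factor, attempt))
  // Map rand() in [0, 1) to a multiplier in [1 - jitter, 1 + jitter).
  const multiplier = 1 + (rand() * 2 - 1) * cfg.jitter
  return Math.round(base * multiplier)
}
```

A reconnect loop would call this with an attempt counter that resets after a connection has stayed open for a while. Because jitter is applied after the cap, the longest wait here is 1.5 times `maxMs`. That keeps capped clients from all retrying at exactly 30 seconds.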

That bought us the night. It didn’t fix the topology.

Moving to AnyCable

In the postmortem we made the call to stop running production WebSockets out of the application process. We picked AnyCable. The pitch is simple. AnyCable terminates WebSockets in a Go process, communicates with your app server over gRPC, and lets the app do auth and channel logic without owning the socket lifecycle. The Go layer holds the connections cheaply. The app server stops being a WebSocket server and goes back to being an app server.

This wasn’t a Rails shop, but we had a sister product on the creator platform I worked at that ran Rails on AnyCable, and the operational story there was boring in the good way. So we adapted the model. The gateway became a thin AnyCable terminator, and the business logic lived in a separate Node service the Go layer talked to over gRPC.

A minimal cable config and docker-compose excerpt look like this:

# config/cable.yml
production:
  adapter: any_cable
  url: ws://anycable:8080/cable

# config/anycable.yml (AnyCable's own settings live here, not in cable.yml)
production:
  broadcast_adapter: redisx
  redis_url: <%= ENV['REDIS_URL'] %>

# docker-compose excerpt
services:
  anycable-go:
    image: anycable/anycable-go:1.5
    ports:
      - "8080:8080"
    environment:
      ANYCABLE_RPC_HOST: rpc:50051
      ANYCABLE_REDIS_URL: redis://redis:6379/0
      ANYCABLE_HEADERS: cookie,authorization
    depends_on:
      - rpc
      - redis

  rpc:
    image: our-app-rpc:latest
    environment:
      ANYCABLE_RPC_HOST: 0.0.0.0:50051
      RAILS_ENV: production
    expose:
      - "50051"

The trade-off is a gRPC hop on the auth path, which adds a small amount of latency on the initial handshake. We measured it. Worth it.

Topic-based sharding for the fan-out

The other thing we got wrong the first time was treating all topics as equal. They aren’t. Most rooms are quiet. A few rooms are everyone. When the final goes live, one topic carries more than the rest of the platform combined, and if that topic lives on the same shard as everyone else’s, you’re back to the original problem with extra steps.

We added a small routing layer in front of the gateway that hashes topic IDs into shard buckets, and we drain shards on deploy instead of doing rolling restarts that re-trigger reconnects.

import { createHash } from 'node:crypto'

type Shard = { id: string; host: string }

export class TopicShardRouter {
  private ring: { hash: number; shard: Shard }[] = []

  constructor(shards: Shard[], replicas = 128) {
    for (const shard of shards) {
      for (let i = 0; i < replicas; i++) {
        const h = this.hash(`${shard.id}:${i}`)
        this.ring.push({ hash: h, shard })
      }
    }
    this.ring.sort((a, b) => a.hash - b.hash)
  }

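  resolve(topicId: string): Shard {
    if (this.ring.length === 0) throw new Error('empty shard ring')
    const h = this.hash(topicId)
    // binary search for the first ring entry at or after h
    let lo = 0
    let hi = this.ring.length
    while (lo < hi) {
      const mid = (lo + hi) >>> 1
      if (this.ring[mid].hash < h) lo = mid + 1
      else hi = mid
    }
    // past the last entry means wrap around to the start of the ring
    return this.ring[lo % this.ring.length].shard
  }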

  private hash(input: string): number {
    const buf = createHash('sha1').update(input).digest()
    return buf.readUInt32BE(0)
  }
}

Consistent hashing so a single shard going away doesn’t reshuffle the entire fleet. The hot finals topic ends up isolated on its own shard, and a problem in that shard doesn’t leak.
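The "doesn’t reshuffle the entire fleet" property can be checked directly. The sketch below rebuilds a ring the same way as the router above (SHA-1, 128 virtual nodes per shard), then counts which topics change shard when one shard is removed. `buildRing`, `resolveShard`, and `remapReport` are hypothetical helper names for this illustration.

```typescript
import { createHash } from 'node:crypto'

type RingEntry = { hash: number; shardId: string }

function hash32(input: string): number {
  return createHash('sha1').update(input).digest().readUInt32BE(0)
}

function buildRing(shardIds: string[], replicas = 128): RingEntry[] {
  const ring: RingEntry[] = []
  for (const shardId of shardIds) {
    for (let i = 0; i < replicas; i++) {
      ring.push({ hash: hash32(`${shardId}:${i}`), shardId })
    }
  }
  return ring.sort((a, b) => a.hash - b.hash)
}

function resolveShard(ring: RingEntry[], topicId: string): string {
  if (ring.length === 0) throw new Error('empty shard ring')
  const h = hash32(topicId)
  const found = ring.find((e) => e.hash >= h)
  return (found ?? ring[0]).shardId
}

// Count topics whose shard changes when one shard is removed, split by
// whether the topic lived on the removed shard or on a surviving one.
function remapReport(shardIds: string[], removed: string, topicIds: string[]) {
  const before = buildRing(shardIds)
  const after = buildRing(shardIds.filter((id) => id !== removed))
  let movedFromRemoved = 0
  let movedFromSurvivors = 0
  for (const topic of topicIds) {
    const a = resolveShard(before, topic)
    const b = resolveShard(after, topic)
    if (a === b) continue
    if (a === removed) movedFromRemoved++
    else movedFromSurvivors++
  }
  return { movedFromRemoved, movedFromSurvivors }
}
```

For any topic list, `movedFromSurvivors` stays at zero: only topics that lived on the removed shard get a new home. Note that hashing only decides placement. Keeping one hot topic alone on a shard would need an explicit pin or override on top of the ring, which this sketch does not model.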

Redis Streams instead of pub/sub

Pub/sub stayed in the stack for one release after AnyCable went in, and we paid for it once. A gateway pod restarted during a quiet moment, a small slice of clients reconnected, and the seconds of fan-out they missed were just gone. Nobody’s fault, that’s the contract.

We moved fan-out to Redis Streams with consumer groups. Durable per-shard streams, at-least-once delivery, replay window we control. Each gateway shard reads from its own stream, acks per message, and idempotency keys catch the duplicates.

import { createClient } from 'redis'

const STREAM = (shardId: string) => `tick:shard:${shardId}`
const GROUP = 'gateway-fanout'

export class TickConsumer {
  // derive the type from createClient so it matches whatever the installed node-redis returns
  private client: ReturnType<typeof createClient>
  constructor(private shardId: string, private consumerName: string) {
    this.client = createClient({ url: process.env.REDIS_URL })
  }

  async start(handler: (payload: TickPayload) => Promise<void>) {
    await this.client.connect()
    await this.ensureGroup()
    while (true) {
      const res = await this.client.xReadGroup(
        GROUP,
        this.consumerName,
        [{ key: STREAM(this.shardId), id: '>' }],
        { COUNT: 256, BLOCK: 2000 },
      )
      if (!res) continue
      for (const stream of res) {
        for (const msg of stream.messages) {
          try {
            await handler(this.parse(msg.message))
            await this.client.xAck(STREAM(this.shardId), GROUP, msg.id)
          } catch (err) {
            // leave unacked so the entry stays in the PEL. reading with '>' will not
            // redeliver it; a separate XAUTOCLAIM/XCLAIM pass has to. surface to alerting.
            console.error('tick handler failed', { id: msg.id, err })
          }
        }
      }
    }
  }

  private async ensureGroup() {
    try {
      await this.client.xGroupCreate(STREAM(this.shardId), GROUP, '$', { MKSTREAM: true })
    } catch (err: any) {
      if (!String(err.message).includes('BUSYGROUP')) throw err
    }
  }

  private parse(fields: Record<string, string>): TickPayload {
    return { id: fields.id, topic: fields.topic, payload: JSON.parse(fields.payload) }
  }
}

type TickPayload = { id: string; topic: string; payload: unknown }
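At-least-once delivery means the handler can see the same tick twice, which is where the idempotency keys come in. One possible shape, as a sketch: a bounded in-memory set of recently handled ids wrapped around the handler. `SeenIds` and `idempotent` are hypothetical names, and a per-process window is an assumption. A shared store would be needed if duplicates can land on different processes.

```typescript
// Bounded in-memory record of recently handled tick ids.
// Assumption: ids are unique per tick and a per-process window is enough.
class SeenIds {
  private seen = new Set<string>()
  private capacity: number

  constructor(capacity = 10_000) {
    this.capacity = capacity
  }

  has(id: string): boolean {
    return this.seen.has(id)
  }

  remember(id: string): void {
    this.seen.add(id)
    if (this.seen.size > this.capacity) {
      // A Set iterates in insertion order, so the first value is the oldest id.
      const oldest = this.seen.values().next().value
      if (oldest !== undefined) this.seen.delete(oldest)
    }
  }
}

type Tick = { id: string; topic: string; payload: unknown }

// Wrap a handler so redelivered ticks are skipped instead of re-broadcast.
function idempotent(
  handler: (tick: Tick) => Promise<void>,
  seen: SeenIds = new SeenIds(),
): (tick: Tick) => Promise<void> {
  return async (tick) => {
    if (seen.has(tick.id)) return // duplicate delivery, already handled
    await handler(tick)
    // Remember only after success, so a failed tick can still be retried.
    seen.remember(tick.id)
  }
}
```

The wrapped function has the same signature as the handler passed to `TickConsumer.start`, so it could be dropped in as `consumer.start(idempotent(fanOut))`, with `fanOut` standing in for whatever does the broadcast.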

The pending-entries list, PEL, is the part that makes this work under load. A consumer that crashes mid-message doesn’t drop it. The entry stays in the PEL until it is acked, and another consumer in the group can take it over with XAUTOCLAIM (or XCLAIM) once it has sat idle long enough. The idempotency key on the handler dedupes if both somehow processed it.
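Redis does not reassign pending entries by itself. A consumer reading with `>` only ever sees entries that were never delivered, so something has to claim the stale ones explicitly. The sketch below shows one way to run that pass with XAUTOCLAIM (Redis 6.2+). `ClaimClient` is a hypothetical interface modelled on node-redis v4’s `xAutoClaim` and `xAck`. Treat the exact shapes as an assumption and check them against your client version. `reclaimStale` and the 30-second idle threshold are likewise illustrative.

```typescript
type StreamMessage = { id: string; message: Record<string, string> }

// Minimal client surface this sketch needs (assumed, modelled on node-redis v4).
interface ClaimClient {
  xAutoClaim(
    key: string,
    group: string,
    consumer: string,
    minIdleMs: number,
    start: string,
    options?: { COUNT?: number },
  ): Promise<{ nextId: string; messages: Array<StreamMessage | null> }>
  xAck(key: string, group: string, id: string): Promise<number>
}

// Claim entries that have sat unacknowledged for at least minIdleMs,
// run the handler on each, and ack the ones that succeed.
async function reclaimStale(
  client: ClaimClient,
  key: string,
  group: string,
  consumer: string,
  handler: (fields: Record<string, string>) => Promise<void>,
  minIdleMs = 30_000,
): Promise<number> {
  let cursor = '0-0'
  let handled = 0
  do {
    const page = await client.xAutoClaim(key, group, consumer, minIdleMs, cursor, { COUNT: 64 })
    for (const msg of page.messages) {
      if (msg === null) continue // entry no longer exists in the stream
      try {
        await handler(msg.message)
        await client.xAck(key, group, msg.id)
        handled++
      } catch {
        // still pending, now owned by this consumer; a later pass can retry it
      }
    }
    cursor = page.nextId
  } while (cursor !== '0-0') // XAUTOCLAIM returns 0-0 once the scan is complete
  return handled
}
```

A pass like this would typically run on a timer next to the main read loop. An entry that keeps failing stays pending forever in this sketch, so a real deployment would also want a delivery-count limit and a dead-letter path.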

What it felt like after

The next big live event ran at around 2.2M sustained concurrent connections without anyone touching the autoscaler. p99 tick fan-out stayed under 200 ms through the finals window. No reconnection-storm signature on the dashboards. The gateway pods looked bored.

There’s a separate story I’ll keep short here. On the creator platform I worked at, we had a derived index, an Elasticsearch projection of PostgreSQL data, that fell silently behind for hours during a publicly-visible event. The consumer was alive, the throughput metric looked fine, the data was stale. We learned the hard way that derived layers need their own freshness metric, not just a liveness one. We baked that lesson into the new Redis Streams plane from day one. Per-stream lag, per-consumer pending-entries count, alert on freshness, not just on whether the consumer is still consuming.
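One way to turn "freshness, not liveness" into a number, sketched under assumptions: auto-generated stream ids have the form `<unix-ms>-<sequence>`, so the gap between the newest appended id and the newest delivered id approximates how far a consumer group trails. The inputs would come from `XINFO STREAM` (`last-generated-id`) and `XINFO GROUPS` (`last-delivered-id`, `pending`). The function names and thresholds here are illustrative, not the metrics the team shipped.

```typescript
// Redis auto-generated stream ids look like "<unix-ms>-<sequence>".
// Assumption: producers use XADD with "*" so the id prefix is a server timestamp.
function streamIdToMs(id: string): number {
  const ms = Number(id.split('-')[0])
  if (!Number.isFinite(ms)) throw new Error(`not a stream id: ${id}`)
  return ms
}

type FreshnessInput = {
  lastDeliveredId: string // newest id the consumer group has handed out
  lastGeneratedId: string // newest id appended to the stream
  pendingCount: number // entries delivered but not yet acked
  nowMs: number
}

type FreshnessReport = { lagMs: number; idleMs: number; stale: boolean }

// lagMs: how far delivery trails the newest entry, in stream time.
// idleMs: how long since anything was appended at all.
function freshness(
  input: FreshnessInput,
  maxLagMs = 1_000,
  maxPending = 1_000,
): FreshnessReport {
  const generated = streamIdToMs(input.lastGeneratedId)
  const delivered = streamIdToMs(input.lastDeliveredId)
  const lagMs = Math.max(0, generated - delivered)
  const idleMs = Math.max(0, input.nowMs - generated)
  const stale = lagMs > maxLagMs || input.pendingCount > maxPending
  return { lagMs, idleMs, stale }
}
```

The point of reporting `idleMs` separately is that a quiet stream and a stalled producer look identical from the consumer side. During a live event, a large `idleMs` is itself worth an alert.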

The new stack has more moving parts. A Go layer to babysit, a routing tier with its own deploy story, Streams instead of pub/sub. We pay for it in observability surface. We get it back in not getting paged at 9 p.m. on a finals night.

Takeaways

Autoscale will not fix a self-amplifying client-side bug. Backoff lives on the client, with jitter
Pub/sub is fan-out, not delivery. Reach for Streams the moment durability matters
Shard by topic, not by user, when one room can carry more load than the rest combined
Stop running production WebSockets out of the application process. Terminate them in a dedicated layer
Derived planes need a freshness metric, not just a throughput one
The fix is rarely “more pods”

Thanks for reading. If you’ve got thoughts, send them my way.