How I choose between WebSocket Gateways, SSE, and long polling in NestJS. Redis adapter scaling, JWT-on-handshake auth, and the reconnect storm that taught me backoff lives on the client.
It was 09:31:14 on a Tuesday at a real-time trading platform I architected, 74 seconds after the London open. Tick prices were fanning out over Socket.io to retail and institutional clients, charts ticking, everything fine for about a minute. Then connections started dropping en masse. Clients reconnected immediately, got dropped again, reconnected again. Within 90 seconds every gateway pod was pinned at 100% CPU and tick fan-out p99 had gone from around 80 ms to 3 seconds. The worst failure mode a trading product can ship: stale charts during the most-watched window of the week.
I was on-call. I’ll come back to what I did wrong first.
That morning is the reason I write real-time NestJS code the way I do now. The transport choice matters less than people think. Reconnect behavior, auth at handshake, and horizontal fan-out are where things actually break.
I default to NestJS WebSocket Gateways when both sides need to push. Live dashboards, chat, presence, anything bidirectional. The moment it goes one-way (server pushing notifications, build logs, slow AI token streams) I reach for SSE first. It’s HTTP, it survives corporate proxies, and the browser’s EventSource already does reconnect for you.
Long polling is the fallback I leave in the code, not the path I optimize. If a customer’s network kills both WS and SSE, long polling will still answer. It’s expensive and it’s slow. It earns its keep maybe twice a year, for users behind firewalls I’ll never see.
Concrete rule I keep: if you’re tempted to use WebSockets for one-way fan-out, you almost certainly want SSE. Half the bugs I’ve debugged on real-time products are people using a duplex transport for a simplex problem.
A single NestJS gateway pod can hold a lot of sockets. Not enough for anything serious. The moment you have more than one pod, two clients connected to different pods can’t see each other’s broadcasts without a pub/sub layer between them. That’s what @socket.io/redis-adapter is for.
import { IoAdapter } from '@nestjs/platform-socket.io';
import { ServerOptions } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';
import { INestApplicationContext, Logger } from '@nestjs/common';

export class RedisIoAdapter extends IoAdapter {
  private readonly logger = new Logger(RedisIoAdapter.name);
  private adapterConstructor!: ReturnType<typeof createAdapter>;

  constructor(private readonly app: INestApplicationContext) {
    super(app);
  }

  // Must complete before createIOServer runs, or there is no adapter to attach.
  async connectToRedis(url: string): Promise<void> {
    const pub = createClient({ url });
    const sub = pub.duplicate();
    pub.on('error', (e) => this.logger.error(`redis pub error: ${e.message}`));
    sub.on('error', (e) => this.logger.error(`redis sub error: ${e.message}`));
    await Promise.all([pub.connect(), sub.connect()]);
    this.adapterConstructor = createAdapter(pub, sub);
  }

  createIOServer(port: number, options?: ServerOptions): unknown {
    const server = super.createIOServer(port, {
      ...options,
      transports: ['websocket'], // no polling fallback at the gateway tier
      pingInterval: 20_000,
      pingTimeout: 25_000,
      maxHttpBufferSize: 1_000_000,
    });
    server.adapter(this.adapterConstructor);
    return server;
  }
}
Two things matter. transports: ['websocket'] only. Allowing the polling fallback at the gateway tier is what lets a misbehaving client thrash you with HTTP requests when its socket dies. If you need long polling, run it as a separate service with its own scaling rules. The other thing is the ping interval. The defaults are too aggressive for cross-region traffic. We tuned ours after watching healthy iPad clients on flaky office wifi get dropped every minute.
The gateway itself stays small. Rooms, broadcast, that’s it.
import {
  WebSocketGateway,
  WebSocketServer,
  SubscribeMessage,
  OnGatewayConnection,
  OnGatewayDisconnect,
  MessageBody,
  ConnectedSocket,
} from '@nestjs/websockets';
import { UseGuards, Logger } from '@nestjs/common';
import { Server, Socket } from 'socket.io';
import { WsJwtGuard } from './ws-jwt.guard';

@WebSocketGateway({ namespace: '/quotes' })
@UseGuards(WsJwtGuard)
export class QuotesGateway implements OnGatewayConnection, OnGatewayDisconnect {
  @WebSocketServer() server!: Server;
  private readonly logger = new Logger(QuotesGateway.name);

  async handleConnection(client: Socket): Promise<void> {
    // Drop any socket that reached this hook without an authenticated userId,
    // rather than joining it to a room named "user:undefined".
    const userId = client.data.userId as string | undefined;
    if (!userId) {
      client.disconnect(true);
      return;
    }
    await client.join(`user:${userId}`);
    this.logger.log(`connect ${client.id} user=${userId}`);
  }

  handleDisconnect(client: Socket): void {
    this.logger.log(`disconnect ${client.id}`);
  }

  @SubscribeMessage('subscribe.symbol')
  async onSubscribeSymbol(
    @ConnectedSocket() client: Socket,
    @MessageBody() body: { symbol: string },
  ): Promise<{ ok: true }> {
    await client.join(`symbol:${body.symbol}`);
    return { ok: true };
  }
}
The single most common mistake I see in NestJS WebSocket code is doing auth as the first message after connection. Don’t. By then the socket has already been accepted and you’re already paying for a TCP slot, so a malicious client can hold it open forever and never identify itself. Verify the JWT during the handshake, before the connection is accepted.
import { CanActivate, ExecutionContext, Injectable, Logger } from '@nestjs/common';
import { JwtService } from '@nestjs/jwt';
import { Socket } from 'socket.io';

@Injectable()
export class WsJwtGuard implements CanActivate {
  private readonly logger = new Logger(WsJwtGuard.name);

  constructor(private readonly jwt: JwtService) {}

  canActivate(ctx: ExecutionContext): boolean {
    const client = ctx.switchToWs().getClient<Socket>();
    // Prefer the Socket.io auth payload; fall back to an Authorization header.
    const token =
      (client.handshake.auth?.token as string | undefined) ??
      (client.handshake.headers.authorization?.toString().replace(/^Bearer\s+/i, ''));
    if (!token) {
      this.logger.warn(`reject ${client.id}: missing token`);
      client.disconnect(true);
      return false;
    }
    try {
      const payload = this.jwt.verify<{ sub: string; exp: number }>(token);
      // Stash identity and expiry on the socket for later handlers and sweeps.
      client.data.userId = payload.sub;
      client.data.tokenExp = payload.exp;
      return true;
    } catch (err) {
      this.logger.warn(`reject ${client.id}: bad token`);
      client.disconnect(true);
      return false;
    }
  }
}
The client.handshake.auth.token channel matters. It’s where Socket.io clients put tokens during the connection handshake, instead of a query param. Query strings end up in nginx access logs and CDN traces. Tokens in auth don’t.
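One caveat worth checking against your Nest version: as far as I know, guards attached with @UseGuards run for @SubscribeMessage handlers and not for the handleConnection hook, so a guard on its own may not fire until the first message arrives. A minimal sketch of moving the same check into a Socket.io middleware follows. The names makeHandshakeAuth, SocketLike, and the injected verify function are illustrative and not part of Nest or Socket.io; the structural types stand in for the real Socket so the snippet runs on its own.

```typescript
// Minimal structural stand-in for a Socket.io socket (assumption: only the
// fields used here; the real Socket type has many more).
interface SocketLike {
  handshake: {
    auth?: Record<string, unknown>;
    headers: Record<string, string | string[] | undefined>;
  };
  data: Record<string, unknown>;
}

type Claims = { sub: string; exp: number };
type Verify = (token: string) => Claims; // e.g. a thin wrapper around JwtService.verify
type Next = (err?: Error) => void;

// Hypothetical helper: builds a middleware with the (socket, next) shape
// that Socket.io's use() accepts on a server or namespace.
function makeHandshakeAuth(verify: Verify) {
  return (socket: SocketLike, next: Next): void => {
    const fromAuth = socket.handshake.auth?.token;
    const header = socket.handshake.headers.authorization;
    const fromHeader =
      typeof header === 'string' ? header.replace(/^Bearer\s+/i, '') : undefined;
    const token = typeof fromAuth === 'string' ? fromAuth : fromHeader;

    if (!token) {
      next(new Error('missing token')); // passing an error refuses the connection
      return;
    }
    try {
      const claims = verify(token);
      socket.data.userId = claims.sub;
      socket.data.tokenExp = claims.exp;
      next();
    } catch {
      next(new Error('bad token'));
    }
  };
}
```

In a gateway this could be registered from afterInit, for example `server.use(makeHandshakeAuth((t) => jwt.verify(t)))`. When the middleware calls next with an error, the client sees a connect_error event instead of a connected socket.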
One more thing. Tokens expire. A long-lived socket has to handle that. We refresh server-side on a timer and disconnect anyone whose token has been expired for more than a grace period. The client reconnects with a fresh token and rejoins its rooms. The reconnect path is part of the design, not an afterthought.
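The expiry sweep described above can be sketched as a pure function driven by a timer. This is a sketch under assumptions, not the production code: sweepExpired and TrackedSocket are hypothetical names, the grace period is a parameter, and tokenExp is the seconds-since-epoch exp claim that the guard stores on client.data.

```typescript
// Structural stand-in for the parts of a socket the sweep needs.
interface TrackedSocket {
  id: string;
  data: { tokenExp?: number }; // JWT exp claim, seconds since epoch
  disconnect(close?: boolean): unknown;
}

// Disconnects every socket whose token expired more than graceSec ago
// and returns the ids that were dropped.
function sweepExpired(
  sockets: Iterable<TrackedSocket>,
  nowSec: number,
  graceSec: number,
): string[] {
  const dropped: string[] = [];
  for (const socket of sockets) {
    const exp = socket.data.tokenExp;
    // Assumption: a socket with no recorded expiry is treated as expired.
    if (exp === undefined || nowSec - exp > graceSec) {
      socket.disconnect(true);
      dropped.push(socket.id);
    }
  }
  return dropped;
}
```

Each pod only needs to sweep its own connections, so this can run on a plain setInterval over the pod’s local socket map, with something like `sweepExpired(localSockets, Date.now() / 1000, 60)`, where localSockets and the 60-second grace are placeholders.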
When the dashboard at the creator economy platform I worked at started shipping live metrics widgets a couple of years back, half the team’s instinct was to add a Socket.io connection. I pushed for SSE on the read-only widgets. NestJS makes it almost free.
import { Controller, Sse, UseGuards, Req, MessageEvent } from '@nestjs/common';
import { Observable, interval, map, switchMap, from } from 'rxjs';
import { Request } from 'express';
import { JwtAuthGuard } from '../auth/jwt-auth.guard';
import { MetricsService } from './metrics.service';

@Controller('metrics')
export class MetricsController {
  constructor(private readonly metrics: MetricsService) {}

  @Sse('stream')
  @UseGuards(JwtAuthGuard)
  stream(@Req() req: Request): Observable<MessageEvent> {
    const userId = (req.user as { id: string }).id;
    // Every 5 s, fetch a fresh snapshot and emit it as a named SSE event.
    return interval(5_000).pipe(
      switchMap(() => from(this.metrics.snapshotForUser(userId))),
      map((snapshot) => ({ data: snapshot, id: snapshot.version.toString(), type: 'metrics' })),
    );
  }
}
The browser does the reconnect. The id field lets the server resume from where the client left off via the Last-Event-ID header. No Redis adapter, no rooms, no namespaces, just an HTTP response that stays open. For one-way streams, this is almost always the right call.
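The handler above emits an id but does not read Last-Event-ID itself, so resuming takes a little server-side state. A minimal sketch of that piece, assuming numeric, monotonically increasing ids and a small in-memory buffer. ReplayBuffer and its capacity are hypothetical; in a Nest controller the header value could be read with something like @Headers('last-event-id').

```typescript
interface BufferedEvent<T> {
  id: number; // monotonically increasing version (assumption)
  data: T;
}

// Keeps the most recent `capacity` events so a reconnecting client
// can be sent whatever it missed.
class ReplayBuffer<T> {
  private readonly events: BufferedEvent<T>[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(event: BufferedEvent<T>): void {
    this.events.push(event);
    if (this.events.length > this.capacity) {
      this.events.shift(); // evict the oldest event
    }
  }

  // Events newer than the client's Last-Event-ID value.
  // A missing or unparseable header yields only the latest event.
  since(lastEventId: string | undefined): BufferedEvent<T>[] {
    const last = lastEventId === undefined ? NaN : Number.parseInt(lastEventId, 10);
    if (Number.isNaN(last)) {
      return this.events.slice(-1);
    }
    return this.events.filter((e) => e.id > last);
  }
}
```

On reconnect, the stream would first emit `buffer.since(header)` and then continue with live events. Anything older than the buffer is simply gone, which is usually acceptable for dashboard metrics.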
Back to that Tuesday on the trading platform. Connections dropping, every gateway pod pinned. My first move was to triple the gateway pod count via the autoscaler’s manual override. Pure muscle memory. New pods came online, hit the storm head-on, and went CPU-bound within 20 seconds of joining the pool. The higher pod count made it worse because clients now got more partial-success reconnects, briefly seeing “connected” before getting dropped again.
The real fix was two things in parallel. An emergency client-side config push through a remote-config channel we’d built for exactly this kind of moment: jittered exponential backoff on reconnects, min 200 ms, max 30 s, factor 2, jitter plus or minus 50 percent. And a per-IP connection rate limiter at nginx, three new connections per second per IP. The pool stabilized in about 8 minutes and tick fan-out latency dropped back under 200 ms.
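The backoff numbers above translate into a small delay function. This is a sketch under those parameters (min 200 ms, max 30 s, factor 2, jitter plus or minus 50 percent), not the config that was actually pushed; backoffDelayMs, STORM_BACKOFF, and the injectable random parameter are illustrative names.

```typescript
interface BackoffOptions {
  minMs: number;
  maxMs: number;
  factor: number;
  jitter: number; // 0.5 means plus or minus 50 percent
}

const STORM_BACKOFF: BackoffOptions = { minMs: 200, maxMs: 30_000, factor: 2, jitter: 0.5 };

// Delay before reconnect attempt number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions = STORM_BACKOFF,
  random: () => number = Math.random,
): number {
  // Exponential growth, capped at maxMs.
  const base = Math.min(opts.maxMs, opts.minMs * Math.pow(opts.factor, attempt));
  // Map random() in [0, 1) to a multiplier in [1 - jitter, 1 + jitter).
  const multiplier = 1 + (random() * 2 - 1) * opts.jitter;
  // Cap again so jitter never pushes the delay past maxMs.
  return Math.min(opts.maxMs, Math.round(base * multiplier));
}
```

The jitter is the part that breaks the storm: without it, every client dropped at the same instant retries at the same instant. Socket.io’s client exposes comparable knobs (reconnectionDelay, reconnectionDelayMax, randomizationFactor), so with that client much of this policy is configuration rather than code.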
Cost was around 14 minutes of degraded tick delivery during market open. The lesson is the one I should have known before the alarm fired. Autoscale isn’t a fix for a self-amplifying client bug. Backoff lives on the client, not the server. The server can only meter the damage.
Scale past one pod with @socket.io/redis-adapter. Disable the polling transport at the gateway tier.
Pass tokens through client.handshake.auth, never query strings.
For one-way streams, use SSE with id and Last-Event-ID. The browser will do the work.
Thanks for reading. If you’ve got thoughts, send them my way.