GraphQL Federation with NestJS

A practical walk-through of rolling out an Apollo Federation 2 subgraph in NestJS, with entity resolution, gateway auth, and the path off schema stitching.

It was a Wednesday afternoon at the combat-sports tournament platform I CTO’d in London. We were three weeks into rolling federation across a chunk of our backend, replacing a tired Apollo Gateway that had been stitching schemas together for over a year. The first subgraph going live was Athletes. The second was Rankings. They had to talk through the gateway, and the gateway had to keep auth coherent across both.

I’d shaped the migration over a long weekend with the backend lead. Apollo Federation 2, NestJS subgraphs, phased cutover. Eight days later Rankings was in prod. Here’s how we wired it, the parts that broke, and why I’d take federation over stitching every time on a multi-service backend.

Why stitching had to go

Schema stitching gets you started fast. It also gets you a single brittle process that owns every downstream schema, re-introspecting at boot, blowing up if any subschema is unreachable. We had hundreds of microservices on the platform, async comms standardized on Kafka, and the stitched gateway had become the slowest part of the read path. p99 sat around 480ms for a query touching three services. Half of that was the gateway re-resolving the same entity twice because stitching couldn’t share keys.

Federation flips this. Each service owns its slice of the graph. The gateway serves a supergraph composed from the subgraph schemas, ideally at build time rather than at runtime. Entity keys are first-class. Cross-service joins happen through @key and @ResolveReference, not ad-hoc resolver delegation. It’s a real protocol, not a workaround.
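To make the entity-key idea concrete, here’s a rough sketch of what a subgraph does when the gateway hands it entity references. It is a simplified model of the federation _entities field, not Apollo’s implementation. The registry, the resolveEntities name, and the fake Athlete data are all hypothetical.

```typescript
// A representation is what the gateway sends for an entity it needs resolved:
// the __typename plus the fields named in that entity's @key.
interface Representation {
  __typename: string;
  [keyField: string]: unknown;
}

type ReferenceResolver = (ref: Representation) => unknown;

// Hypothetical registry: one reference resolver per entity type this subgraph owns.
// In NestJS, the method decorated with @ResolveReference plays this role.
const referenceResolvers: Record<string, ReferenceResolver> = {
  Athlete: (ref) => ({ id: ref.id, name: `athlete-${String(ref.id)}` }),
};

// Simplified model of `_entities`: route each representation to the resolver
// for its __typename and return results in the same order they arrived.
function resolveEntities(representations: Representation[]): unknown[] {
  return representations.map((ref) => {
    const resolver = referenceResolvers[ref.__typename];
    if (!resolver) {
      throw new Error(`No reference resolver for ${ref.__typename}`);
    }
    return resolver(ref);
  });
}
```

The point of the sketch is that the only thing Rankings needs to know about an Athlete is its key. Everything else is resolved by the subgraph that owns the type.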

If you’re on stitching and your gateway is the slowest hop, you won’t optimize your way out. Cut.

A federated subgraph in NestJS

Here’s the Athletes subgraph stripped down. NestJS, Apollo driver in federation mode, code-first schema. The @key directive is the contract: any other subgraph can reference an Athlete by id.

import { Module } from '@nestjs/common';
import { GraphQLModule } from '@nestjs/graphql';
import { ApolloFederationDriver, ApolloFederationDriverConfig } from '@nestjs/apollo';
import { AthletesResolver } from './athletes.resolver';
import { AthletesService } from './athletes.service';

@Module({
  imports: [
    GraphQLModule.forRoot<ApolloFederationDriverConfig>({
      driver: ApolloFederationDriver,
      autoSchemaFile: { federation: 2 },
      context: ({ req }) => ({ req }),
      playground: false,
      introspection: process.env.NODE_ENV !== 'production',
    }),
  ],
  providers: [AthletesResolver, AthletesService],
})
export class AthletesModule {}

The resolver is where federation earns its keep. @ResolveReference is the hook the gateway calls when another subgraph hands it an Athlete reference. Implement it like a bulk loader, not a one-off lookup, or you’ll re-introduce the N+1 the gateway was supposed to kill.

import { Args, ID, Query, Resolver, ResolveReference } from '@nestjs/graphql';
import DataLoader from 'dataloader';
import { Athlete } from './athlete.entity';
import { AthletesService } from './athletes.service';

@Resolver(() => Athlete)
export class AthletesResolver {
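  // The batch function returns null for ids with no row, so the value type allows it.
  private readonly loader: DataLoader<string, Athlete | null>;

  constructor(private readonly athletes: AthletesService) {
    this.loader = new DataLoader<string, Athlete | null>(async (ids) => {
      // DataLoader expects results in the same order as the requested ids.
      const rows = await this.athletes.findByIds([...ids]);
      const byId = new Map(rows.map((r) => [r.id, r] as const));
      return ids.map((id) => byId.get(id) ?? null);
    });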
  }

  @Query(() => Athlete, { nullable: true })
  athlete(@Args('id', { type: () => ID }) id: string) {
    return this.loader.load(id);
  }

  @ResolveReference()
  resolveReference(reference: { __typename: string; id: string }) {
    return this.loader.load(reference.id);
  }
}

The DataLoader here is built in the resolver’s constructor, so its lifetime matches the resolver’s. That is only safe if the resolver is wired as a request-scoped provider, so every request gets a fresh loader and a fresh cache. We learned the hard way that a singleton DataLoader leaks cached entities across requests. I’ll come back to that in the takeaways.
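Here’s a minimal sketch of the per-request pattern, with no framework involved. TinyLoader is a hypothetical, stripped-down stand-in for the dataloader package so the example runs on its own. AthleteRow, findByIds, batchCalls, and createRequestContext are illustrative names too. The idea is to build the loader in a factory that runs once per request, for example from the GraphQL context function.

```typescript
// Hypothetical row shape for illustration.
interface AthleteRow {
  id: string;
  name: string;
}

type BatchFn<K, V> = (keys: readonly K[]) => Promise<(V | null)[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V | null) => void;
  reject: (err: unknown) => void;
}

// Tiny stand-in for `dataloader`: keys requested in the same tick are
// collected into one batch call, and results are cached per key.
class TinyLoader<K, V> {
  private readonly batchFn: BatchFn<K, V>;
  private readonly cache = new Map<K, Promise<V | null>>();
  private queue: Pending<K, V>[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V | null> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const promise = new Promise<V | null>((resolve, reject) => {
      // The first key of a batch schedules the flush on the microtask queue.
      if (this.queue.length === 0) {
        void Promise.resolve().then(() => this.flush());
      }
      this.queue.push({ key, resolve, reject });
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush(): Promise<void> {
    const pending = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(pending.map((p) => p.key));
      pending.forEach((p, i) => p.resolve(values[i] ?? null));
    } catch (err) {
      pending.forEach((p) => p.reject(err));
    }
  }
}

// Simulated data access. The counter makes batching observable.
let batchCalls = 0;
const table = new Map<string, AthleteRow>([
  ['a1', { id: 'a1', name: 'Ada' }],
  ['a2', { id: 'a2', name: 'Bea' }],
]);

async function findByIds(ids: readonly string[]): Promise<(AthleteRow | null)[]> {
  batchCalls += 1;
  return ids.map((id) => table.get(id) ?? null);
}

// Called once per incoming request, so no cache outlives the request.
function createRequestContext() {
  return { athleteLoader: new TinyLoader<string, AthleteRow>(findByIds) };
}
```

Two contexts never share a cache, so nothing one caller loaded can be served to another. Within a single context, every load issued in the same tick collapses into one findByIds call.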

Auth at the gateway

This is the part most teams underbake. The gateway has to authenticate the caller, then propagate identity into every subgraph call. Subgraphs trust the gateway, not the client. We put a NestJS guard on the gateway and a header-only auth strategy on each subgraph.

import { CanActivate, ExecutionContext, Injectable, UnauthorizedException } from '@nestjs/common';
import { GqlExecutionContext } from '@nestjs/graphql';
import { JwtService } from '@nestjs/jwt';

@Injectable()
export class GatewayAuthGuard implements CanActivate {
  constructor(private readonly jwt: JwtService) {}

  async canActivate(context: ExecutionContext): Promise<boolean> {
    const ctx = GqlExecutionContext.create(context);
    const req = ctx.getContext().req;
    const token = req.headers.authorization?.replace(/^Bearer\s+/i, '');
    if (!token) throw new UnauthorizedException('missing token');

    try {
      const payload = await this.jwt.verifyAsync(token);
      req.user = { id: payload.sub, scopes: payload.scopes ?? [] };
      return true;
    } catch {
      throw new UnauthorizedException('invalid token');
    }
  }
}

The gateway then forwards user context into subgraph requests through a RemoteGraphQLDataSource override. Plain header pass-through, no re-signing, with the token verified once at the edge:

import { IntrospectAndCompose, RemoteGraphQLDataSource } from '@apollo/gateway';

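class ContextForwarder extends RemoteGraphQLDataSource {
  willSendRequest({ request, context }: any) {
    // context can be empty for the gateway's own schema-loading requests.
    const req = context?.req;
    const user = req?.user;
    if (user) {
      request.http?.headers.set('x-user-id', String(user.id));
      request.http?.headers.set('x-user-scopes', user.scopes.join(','));
    }
    // Forward the trace id on every call, authenticated or not.
    const requestId = req?.headers?.['x-request-id'];
    if (requestId) request.http?.headers.set('x-request-id', String(requestId));
  }
}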

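export const gatewayConfig = {
  // IntrospectAndCompose composes at gateway startup and again on each poll.
  // A prebuilt supergraph SDL removes that runtime dependency on the subgraphs.
  supergraphSdl: new IntrospectAndCompose({
    subgraphs: [
      { name: 'athletes', url: process.env.ATHLETES_URL! },
      { name: 'rankings', url: process.env.RANKINGS_URL! },
    ],
    pollIntervalInMs: 30_000,
  }),
  buildService: ({ url }: { url?: string }) => new ContextForwarder({ url }),
};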

Two non-obvious calls here. We verify the JWT once at the gateway and forward identity as plain headers internally because the network between gateway and subgraphs is private. And we always forward x-request-id so a query that fans out across three subgraphs shows up as one trace in Datadog APM, not three orphaned spans.
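On the subgraph side, header-only auth comes down to reading those forwarded headers back into a user object. This is a minimal sketch under that assumption. userFromHeaders, hasScope, and the scope names in the usage note are hypothetical. It is only safe when nothing but the gateway can reach the subgraph.

```typescript
interface ForwardedUser {
  id: string;
  scopes: string[];
}

// Node lowercases incoming header names; repeated headers arrive as arrays.
type IncomingHeaders = Record<string, string | string[] | undefined>;

function firstValue(value: string | string[] | undefined): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}

// Rebuilds the identity the gateway forwarded, or returns null when no
// x-user-id header is present (an anonymous or unauthenticated call).
function userFromHeaders(headers: IncomingHeaders): ForwardedUser | null {
  const id = firstValue(headers['x-user-id']);
  if (!id) return null;
  const rawScopes = firstValue(headers['x-user-scopes']) ?? '';
  const scopes = rawScopes
    .split(',')
    .map((scope) => scope.trim())
    .filter((scope) => scope.length > 0);
  return { id, scopes };
}

// Scope check a subgraph resolver or guard could call before doing work.
function hasScope(user: ForwardedUser | null, scope: string): boolean {
  return user !== null && user.scopes.includes(scope);
}
```

A subgraph guard would call userFromHeaders(req.headers) and reject the request when hasScope(user, 'rankings:read') is false. The subgraph never touches the JWT.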

The Kafka consumer that ate our Saturday

We rolled the Rankings subgraph behind a feature flag. It read its data from a projection populated off our Kafka match-events topic by a consumer called standings-projector. Federation didn’t break here. The consumer did, and federation made the blast radius visible.

Mid-afternoon, during a live broadcast of a sporting federation’s tournament, the public leaderboard froze. The standings-projector consumer group started rebalancing every thirty seconds. GraphQL federation fans queries out; if one subgraph stalls, the gateway timeout fires and the whole query degrades. I’d reviewed the deploy. I’d ack’d it. PagerDuty had three pages in under two minutes.

First instinct was operational. kubectl rollout restart deployment/standings-projector. Consumers re-joined. Forty seconds later they rebalanced again. We were doing the same dance the group was already doing on its own.

The real fix came from reading pod logs side by side. One pod out of six had a different max.poll.interval.ms. Five pods at 300s. One pod at 60s. The sixth had been deployed with a stale image because someone had pushed a config-touching fix without bumping the image tag and the manifest referenced :latest. That pod did a slow downstream call to a federation-rules service that occasionally took 70s, past its poll interval, kicked itself out, and the whole group rebalanced. Cordoned the bad pod, storm drained in 90 seconds. Over the weekend we SHA-pinned every Kafka-touching deployment and added a CI check that fails the deploy if a consumer manifest references :latest.
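A CI check along those lines can be very small. The sketch below is an illustration, not the actual check: it scans manifest text for image: lines and flags any reference whose tag is latest or missing, since a missing tag resolves to latest. findFloatingImages, isFloating, and the image names in the usage note are hypothetical.

```typescript
// Matches `image: <ref>` lines in a Kubernetes manifest, allowing an
// optional list dash and optional quotes around the reference.
const IMAGE_LINE = /^\s*(?:-\s*)?image:\s*["']?([^"'\s#]+)["']?/;

// A reference floats when it has no digest and its tag is absent or `latest`.
function isFloating(imageRef: string): boolean {
  if (imageRef.includes('@sha256:')) return false; // digest-pinned
  // Only look at the last path segment so a registry port is not read as a tag.
  const lastSegment = imageRef.slice(imageRef.lastIndexOf('/') + 1);
  const colon = lastSegment.indexOf(':');
  const tag = colon === -1 ? '' : lastSegment.slice(colon + 1);
  return tag === '' || tag === 'latest';
}

// Returns every floating image reference found in the manifest text.
function findFloatingImages(manifest: string): string[] {
  const floating: string[] = [];
  for (const line of manifest.split('\n')) {
    const match = IMAGE_LINE.exec(line);
    if (match && isFloating(match[1])) floating.push(match[1]);
  }
  return floating;
}
```

In a pipeline, the job would read each consumer manifest, call findFloatingImages, and exit non-zero if the list is not empty. A reference like registry.example.com/standings-projector:latest gets flagged. One tagged with a commit SHA or pinned by digest passes.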

Twelve minutes of stale standings during a live broadcast. The federation’s tech contact was understanding. Commentators less so.

Lesson: in a federated topology, your weakest subgraph is your gateway’s tail latency. Pin the things that touch consumer groups. Always.

Migrating off stitching

The migration is boring if you stage it. Two strict rules from our cutover.

One, ship federation behind an opt-in header before flipping defaults. The old stitched gateway and the new federated gateway ran side by side for two weeks. Clients on the new gateway sent x-graphql-mode: federated. We watched error rate and p99 per operation. When the numbers held for a week, we flipped the default and left the stitched gateway up as the fallback for another sprint.
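The opt-in routing itself is a small decision function. This is a minimal sketch, assuming the choice is made wherever traffic is split, such as an edge proxy or a thin router. pickGateway is a hypothetical name. Letting clients send an explicit stitched value to opt back out after the default flips is an assumption added for illustration.

```typescript
type GatewayMode = 'stitched' | 'federated';

// Picks the gateway for a request: an explicit, valid x-graphql-mode header
// wins; anything else falls back to the current default.
function pickGateway(
  headers: Record<string, string | undefined>,
  defaultMode: GatewayMode,
): GatewayMode {
  const requested = headers['x-graphql-mode']?.trim().toLowerCase();
  if (requested === 'federated' || requested === 'stitched') {
    return requested;
  }
  return defaultMode;
}
```

During the parallel run the default stays stitched and only clients sending the header reach the federated gateway. Flipping the default is then a one-value change, and the header remains available as an escape hatch.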

Two, freeze schema changes during the cutover. No new types, no new fields, just the wiring change. Mixing a schema migration with a topology migration is how you end up debugging two unrelated bugs at 2 a.m. and convincing yourself both are the same.

Takeaways

Federation beats stitching on any backend with more than a handful of services. The gateway gets out of the runtime path.
@ResolveReference must be a bulk loader. Request-scoped DataLoader inside NestJS, never a singleton.
Verify auth at the gateway, forward identity as headers to subgraphs, always include x-request-id.
Pin SHAs on anything that touches a Kafka consumer group. Never :latest.
Per-subgraph observability is mandatory from day one. Your slowest subgraph is your gateway’s tail.
Cut over behind a header. Run both gateways in parallel for a sprint. Freeze the schema during the topology change.

Thanks for reading. If you’ve got thoughts, send them my way.