Event Schema Registry

How I run Confluent Schema Registry with Avro and Protobuf across teams, with compatibility modes, evolution rules, and CI-time breaking-change detection.

Wednesday morning at the federation platform I CTO’d in London. Hundreds of microservices, Kafka as the backbone, a live tournament burning through match-events. A backend lead shipped a “tiny” tweak to MatchCompleted over the weekend. Renamed a field. Added an enum value. CI was green. Production was on fire by 09:40 because the standings-projector deserializer started throwing on every third message and the leaderboard froze mid-broadcast.

Yeah. That was the day I stopped treating event schemas as “just a JSON shape we agree on in Slack”. They are an API. No registry, no CI gate, and one squad’s tiny tweak becomes another squad’s outage.

This is how I run Confluent Schema Registry. Avro by default, Protobuf when the team is already there, the compatibility modes I care about, and a CI check that fails the PR before the bad schema reaches a broker.

Why a registry, not just shared types

I’ve watched teams try to solve cross-service contracts with shared TypeScript types in a monorepo. Works until your producer ships in Ruby, your projector in Node, your analytics consumer in Python, and someone’s Lambda is on a version from six months ago. Types in code don’t survive the network. Bytes do.

A registry gives you three things shared types can’t: a single source of truth keyed by subject, versioning so old consumers keep reading old bytes, and a compatibility check on every registration that refuses schemas which break downstream readers. The third one saves your weekend.

Avro vs Protobuf, briefly

I default to Avro for new event topics. Smallest mental model for “add fields with defaults, stay backward-compatible”, and the Confluent serializer wraps each payload with a 5-byte header (a magic byte plus the 4-byte schema ID) so consumers can fetch the writer’s schema once and cache it. Protobuf I reach for when the team is already speaking proto from gRPC and adopting a second IDL would just create drift. JSON Schema is also supported; I avoid it for event payloads.
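To make that header concrete, here is a minimal sketch of the framing, assuming the standard Confluent wire format: one zero magic byte, then the schema ID as a 4-byte big-endian integer, then the encoded payload. The names `frame` and `unframe` are mine for illustration, not a library API. In practice the registry client does this for you.

```typescript
// Confluent wire format: [magic byte 0x00][4-byte big-endian schema ID][encoded payload].
const MAGIC_BYTE = 0;
const HEADER_LENGTH = 5;

interface FramedMessage {
  schemaId: number;
  payload: Uint8Array;
}

// Prepend the 5-byte header to an already-encoded payload.
function frame(schemaId: number, payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(HEADER_LENGTH + payload.length);
  const view = new DataView(out.buffer);
  view.setUint8(0, MAGIC_BYTE);
  view.setInt32(1, schemaId, false); // false = big-endian
  out.set(payload, HEADER_LENGTH);
  return out;
}

// Split a Kafka message value into schema ID and payload bytes.
function unframe(message: Uint8Array): FramedMessage {
  if (message.length < HEADER_LENGTH) {
    throw new Error(`message too short for registry header: ${message.length} bytes`);
  }
  const view = new DataView(message.buffer, message.byteOffset, message.byteLength);
  if (view.getUint8(0) !== MAGIC_BYTE) {
    throw new Error('unknown magic byte, value was not written by a registry serializer');
  }
  return { schemaId: view.getInt32(1, false), payload: message.subarray(HEADER_LENGTH) };
}
```

A consumer that skips this step and decodes the raw bytes against its own schema is guessing at the writer’s shape.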

Here’s a representative Avro schema for the kind of event that bit me:

{
  "type": "record",
  "name": "MatchCompleted",
  "namespace": "com.federation.events.matches.v1",
  "doc": "Emitted when a match result is final and judges have signed off.",
  "fields": [
    { "name": "match_id", "type": "string" },
    { "name": "tournament_id", "type": "string" },
    { "name": "winner_athlete_id", "type": ["null", "string"], "default": null },
    { "name": "result_type", "type": {
        "type": "enum",
        "name": "ResultType",
        "symbols": ["KO", "TKO", "SUBMISSION", "DECISION", "DRAW", "NO_CONTEST"],
        "default": "DECISION"
      }
    },
    { "name": "ended_at_utc", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "round_number", "type": "int", "default": 0 }
  ]
}

Two things to notice. The namespace is versioned (.v1). When I genuinely need a breaking change, I cut a .v2 topic and run both in parallel; I don’t mutate .v1. And every nullable field has a default. Any field added later needs one too. That’s what makes adding fields backward-compatible without a flag day.

Compatibility modes, the only three I use

I only pick from three:

BACKWARD (my default for consumer-heavy topics): consumers on the new schema can still read data written with the previous one. New fields must have defaults; removing a field is fine. Consumers upgrade first, producers follow.
FORWARD: consumers still on the previous schema can read data written with the new one. Adding fields is fine; only fields that had defaults can be removed. Useful when producers deploy ahead of projectors.
FULL: both. Strictest. For cross-team contracts where neither side wants to coordinate deploy order.

I don’t use NONE. Set to NONE, the registry is a write-only logbook and you’ve given up the whole point.
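The three modes reduce to one question asked in two directions: can a reader with schema A decode data written with schema B? Here is a rough sketch of that rule for added and removed record fields only. It is a simplification I wrote for illustration: the `Field` shape and function names are hypothetical, and the real registry check also covers type changes, enum symbols, unions, and aliases.

```typescript
interface Field {
  name: string;
  hasDefault: boolean;
}

type Mode = 'BACKWARD' | 'FORWARD' | 'FULL';

// A reader can decode written data if every reader field was either written
// or has a default to fall back on. Extra written fields are simply ignored.
function canRead(reader: Field[], writer: Field[]): boolean {
  const written = new Set(writer.map((f) => f.name));
  return reader.every((f) => written.has(f.name) || f.hasDefault);
}

function isCompatible(mode: Mode, previous: Field[], next: Field[]): boolean {
  const backward = canRead(next, previous); // new readers, old data
  const forward = canRead(previous, next); // old readers, new data
  if (mode === 'BACKWARD') return backward;
  if (mode === 'FORWARD') return forward;
  return backward && forward;
}

// Example evolution: adding a field without a default.
const v1: Field[] = [
  { name: 'match_id', hasDefault: false },
  { name: 'round_number', hasDefault: true },
];
const v2: Field[] = [...v1, { name: 'judge_count', hasDefault: false }];

const backwardOk = isCompatible('BACKWARD', v1, v2);
const forwardOk = isCompatible('FORWARD', v1, v2);
```

In this example `backwardOk` is false, because a new reader has nothing to put in `judge_count` when it meets old data, while `forwardOk` is true, because old readers just skip the extra field.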

Set this on the subject, not globally, so each topic is governed by its contract owner:

curl -s -X PUT \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --user "$SR_USER:$SR_PASS" \
  --data '{"compatibility": "FULL"}' \
  "$SR_URL/config/match-events-value"

Producer and consumer, the boring code

The serializer talks to the registry, the consumer fetches and caches schemas by ID. A NestJS producer:

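import { Injectable, Logger, OnModuleInit } from '@nestjs/common';
import { Kafka, Producer } from 'kafkajs';
import { SchemaRegistry, SchemaType } from '@kafkajs/confluent-schema-registry';
import { readFileSync } from 'node:fs';

// Payload shape matching the MatchCompleted Avro schema above.
type ResultType = 'KO' | 'TKO' | 'SUBMISSION' | 'DECISION' | 'DRAW' | 'NO_CONTEST';
interface MatchCompleted {
  match_id: string;
  tournament_id: string;
  winner_athlete_id: string | null;
  result_type: ResultType;
  ended_at_utc: number; // epoch milliseconds
  round_number: number;
}

@Injectable()
export class MatchEventsProducer implements OnModuleInit {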
  private readonly log = new Logger(MatchEventsProducer.name);
  private producer: Producer;
  private registry: SchemaRegistry;
  private schemaId!: number;

  constructor() {
    const kafka = new Kafka({
      clientId: 'rankings-producer',
      brokers: process.env.KAFKA_BROKERS!.split(','),
      ssl: true,
    });
    this.producer = kafka.producer({ idempotent: true, maxInFlightRequests: 5 });
    this.registry = new SchemaRegistry({
      host: process.env.SCHEMA_REGISTRY_URL!,
      auth: { username: process.env.SR_USER!, password: process.env.SR_PASS! },
    });
  }

  async onModuleInit() {
    await this.producer.connect();
    // File is named after the topic so the CI check can derive the subject (<topic>-value).
    const schema = readFileSync('schemas/match-events.avsc', 'utf8');
    const { id } = await this.registry.register(
      { type: SchemaType.AVRO, schema },
      { subject: 'match-events-value' },
    );
    this.schemaId = id;
  }

  async emit(event: MatchCompleted) {
    const value = await this.registry.encode(this.schemaId, event);
    await this.producer.send({
      topic: 'match-events',
      messages: [{ key: event.match_id, value }],
    });
  }
}

Consumer side is almost a mirror. On decode, the client reads the 5-byte header, fetches the writer’s schema, and projects it onto your reader schema. As long as the compatibility mode held, that projection succeeds even when the producer is ahead.
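The projection step is worth seeing once. What follows is a name-level sketch of how a record decoded with the writer’s schema gets mapped onto the reader’s fields: matching names are copied, missing ones fall back to the reader’s default, unknown ones are dropped. It is an illustration with hypothetical names (`project`, `ReaderField`). Real Avro resolution happens during binary decoding and also handles type promotion, aliases, and unions.

```typescript
type Datum = Record<string, unknown>;

interface ReaderField {
  name: string;
  default?: unknown; // absent means "no default declared"
}

// Project a record decoded with the writer's schema onto the reader's fields.
function project(written: Datum, readerFields: ReaderField[]): Datum {
  const out: Datum = {};
  for (const field of readerFields) {
    if (field.name in written) {
      out[field.name] = written[field.name];
    } else if ('default' in field) {
      // Use `in` rather than a truthiness check so a null default still counts.
      out[field.name] = field.default;
    } else {
      throw new Error(`reader field "${field.name}" has no default and was not written`);
    }
  }
  return out; // writer-only fields never make it into the result
}

// A producer that is ahead writes judge_count; this reader does not know it yet.
const readerFields: ReaderField[] = [
  { name: 'match_id' },
  { name: 'round_number', default: 0 },
  { name: 'winner_athlete_id', default: null },
];
const projected = project({ match_id: 'm1', round_number: 3, judge_count: 3 }, readerFields);
```

Here `projected` keeps `match_id` and `round_number`, fills `winner_athlete_id` with `null`, and silently drops `judge_count`.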

Stopping breaking changes in CI, not in prod

The registry rejects incompatible schemas at registration time. Not good enough. By the time a deploy job registers, the PR has merged and the release is half-rolled out. So I run the compatibility check on the PR, against the live registry, before merge:

name: schema-compat
on:
  pull_request:
    paths:
      - "schemas/**"
      - ".github/workflows/schema-compat.yml"

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check Avro compatibility against registry
        env:
          SR_URL: ${{ secrets.SCHEMA_REGISTRY_URL }}
          SR_USER: ${{ secrets.SR_USER }}
          SR_PASS: ${{ secrets.SR_PASS }}
        run: |
          set -euo pipefail
          for f in schemas/*.avsc; do
            subject="$(basename "$f" .avsc)-value"
            body=$(jq -Rs '{schema: ., schemaType: "AVRO"}' < "$f")
            http_code=$(curl -s -o /tmp/resp.json -w "%{http_code}" \
              -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
              --user "$SR_USER:$SR_PASS" \
              --data "$body" \
              "$SR_URL/compatibility/subjects/$subject/versions/latest")
            if [ "$http_code" != "200" ]; then
              echo "::error file=$f::registry returned $http_code"
              cat /tmp/resp.json
              exit 1
            fi
            if ! jq -e '.is_compatible == true' /tmp/resp.json > /dev/null; then
              echo "::error file=$f::incompatible change for $subject"
              cat /tmp/resp.json
              exit 1
            fi
          done

That job fails the PR before the schema reaches a broker. The squad that owns the topic is auto-tagged via CODEOWNERS on schemas/, so cross-team changes get a human review, not just a green check.
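If you want the same gate as a local pre-push script, the shell logic ports over in a few lines. This is a sketch under assumptions: the helper names are hypothetical, schema files are assumed to be named after their topic, and the HTTP call itself is left out so the pieces stay testable. The endpoint, content type, and `is_compatible` field are the ones the workflow already uses.

```typescript
// Hypothetical helpers mirroring the workflow's shell logic; names are illustrative.
interface CompatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

interface CompatVerdict {
  ok: boolean;
  reason: string;
}

// schemas/match-events.avsc -> match-events-value, same as `basename "$f" .avsc` plus "-value".
function subjectFor(path: string): string {
  const file = path.split('/').pop() ?? path;
  return `${file.replace(/\.avsc$/, '')}-value`;
}

// Build the POST the workflow sends with curl.
function buildCompatRequest(baseUrl: string, path: string, schemaText: string): CompatRequest {
  return {
    url: `${baseUrl}/compatibility/subjects/${subjectFor(path)}/versions/latest`,
    headers: { 'Content-Type': 'application/vnd.schemaregistry.v1+json' },
    // The schema travels as a JSON string inside the JSON body, like `jq -Rs` produces.
    body: JSON.stringify({ schema: schemaText, schemaType: 'AVRO' }),
  };
}

// Apply the same two gates as the workflow: HTTP 200, then is_compatible === true.
function interpretCompatResponse(status: number, responseBody: string): CompatVerdict {
  if (status !== 200) {
    return { ok: false, reason: `registry returned ${status}` };
  }
  const parsed = JSON.parse(responseBody) as { is_compatible?: unknown };
  return parsed.is_compatible === true
    ? { ok: true, reason: 'compatible' }
    : { ok: false, reason: 'incompatible change' };
}
```

Hand `url`, `headers`, and `body` to whatever HTTP client you already use, add the Basic auth header, and feed the status and response text into `interpretCompatResponse`.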

War story: rankings projector

Back to the cold open. The “tiny tweak” was renaming winner_id to winner_athlete_id and adding NO_CONTEST. No registry yet, just typed events in a shared package and a tribal agreement to bump a minor version. Producer deployed Friday evening. Projector image was pinned at an older tag in a separate Helm chart no one had touched in weeks.

What went wrong: Saturday during the broadcast, the projector started throwing on winner_id is undefined, logged warnings, kept consuming, kept committing offsets. Standings stopped updating but the consumer group looked healthy on Datadog. Leaderboard froze at 14:32 for 12 minutes before we noticed.

First wrong fix: rolled the producer back. The projector didn’t recover, because the producer had already emitted a few hundred new-shape messages and the projector was stuck on those. We then advanced the consumer offset past the bad range and dropped real ranking updates the federation was waiting on.

Real fix: backfilled missed standings from PostgreSQL, shipped a projector patch that tolerated both field names, and the Monday after I made registry adoption non-negotiable. Every new topic shipped with a registered Avro schema, BACKWARD by default, and the CI gate above. Old topics migrated quarter by quarter.
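The projector patch was roughly this shape. This is a reconstruction for illustration, not the original code: the `RawMatchCompleted` type and `winnerOf` helper are hypothetical names. Throwing when neither field is present is my assumption about what “tolerant” should still refuse, so the handler fails loudly instead of logging a warning and committing the offset.

```typescript
// Either field name may be present, depending on which producer version wrote the event.
interface RawMatchCompleted {
  match_id: string;
  winner_id?: string | null; // old field name
  winner_athlete_id?: string | null; // new field name
}

// Read the winner from whichever shape arrived. A null winner is legitimate
// (a draw, for example); a message with neither field is not.
function winnerOf(event: RawMatchCompleted): string | null {
  const hasNew = 'winner_athlete_id' in event;
  const hasOld = 'winner_id' in event;
  if (!hasNew && !hasOld) {
    throw new Error(`match ${event.match_id}: no winner field in either shape`);
  }
  return event.winner_athlete_id ?? event.winner_id ?? null;
}
```

A tolerant reader like this is a bridge for the migration window, not a substitute for the registry. Once every producer is on the registered schema, the old branch can go.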

Cost: 12 minutes of stale standings during a publicly-visible competition window. The federation contact was understanding once. There is no second time.

War story: the silent default that wasn’t

A different incident at the creator economy platform I worked at. Same shape. A new field went on subscription_changed with BACKWARD mode and a default. Producer rolled out. Consumers stayed on the old reader schema, which, per Avro projection rules, was fine.

What went wrong: one analytics consumer decoded events with a custom serializer that didn’t consult the registry. It assumed writer and reader schemas matched byte-for-byte and dropped the new field entirely. Reporting dashboards showed wrong churn numbers for ~36 hours before a finance engineer noticed during a quarterly close.

First wrong fix: rebuilt the dashboards from a corrected query, thinking the issue was SQL. It wasn’t. The bytes hitting the warehouse loader were missing a column that should have been populated.

Real fix: ripped out the custom serializer, routed the consumer through the same registry client as everything else, added an integration test that registers a schema, evolves it, and asserts the consumer reads both old and new payloads. That test now runs on every PR that touches an event consumer.

Cost: a day and a half of wrong analytics and a runbook line that reads “if a consumer isn’t using the registry client, the consumer is wrong, not the producer”. I’m the reason that line is in there.

Takeaways

Treat event schemas as an API. A registry is the source of truth, not your shared types package.
Default to BACKWARD for most topics, FULL for cross-team contracts, never NONE.
Version the namespace, not the field. Cut a new topic for breaking changes and run them in parallel.
Run the compatibility check on the PR, against the live registry, before merge. CI is your gate, not a runtime surprise.
Trust the registry client. Custom serializers are how you get silently wrong data 36 hours later.

Thanks for reading. If you’ve got thoughts, send them my way.