How a creator-economy platform finally sharded a multi-terabyte Aurora cluster after two years of avoidance, including the shard key call, hot-tenant rebalancing, a mid-flight UUID migration, and pushing cross-shard reporting to the warehouse.
It was a Tuesday morning at the creator-economy platform I worked at, and the Aurora writer was sitting on a multi-terabyte mountain of community data. Replica lag had just hit 14 minutes. The Community feeds were the most visible casualty, p99 reads on /communities/:id/posts climbed from around 120 ms to over 8 seconds, and the Slack channel was lighting up. I wasn’t on call that week but I’d built most of the routing layer on top of Aurora so I got pinged within a minute or two.
That morning was the one that finally made us shard.
We’d been talking about sharding for two years. We’d avoided it for two years. Honestly, every quarter we found another reason to push it. Reader replicas. Partitioning. Vacuum tuning. Query rewrites. Each one bought us another six months and we treated that like progress. It wasn’t. It was the engineering equivalent of taking out a payday loan against the next migration.
The argument against sharding was always the same. It was expensive. The team didn’t have the bandwidth. We’d lose cross-tenant joins on Aurora and reporting would suffer. The schema would be harder to migrate. You know, all the real reasons. And they were real. But the cost of waiting was real too, we just weren’t pricing it.
Here’s what avoidance actually looked like, in practice. Every hot table grew. The largest one, community_posts, was approaching a row count where every ALTER TABLE was a planning meeting. Maintenance windows stopped existing because there was no time of day when the platform was quiet enough. The pg_stat_activity view became required reading before any deploy. We had a Slack channel called #db-vibes that was supposed to be a joke and ended up being a serious operational tool.
The replica lag incident I started with. By the time I joined the thread, lag was at 14 minutes and growing. The on-call’s first move was to bump reader instance class up two tiers from r6g.4xlarge to r6g.16xlarge. Reasonable instinct, wrong root cause. Readers weren’t CPU bound, they were starved of WAL. A long-running ANALYZE on community_posts, scheduled in a maintenance cron that didn’t respect peak-hours, was holding up the writer. Killed the analyze, lag drained in around 6 minutes.
But here’s the part that mattered for the sharding conversation. The reason a single ANALYZE could brown out reads for millions of customers is that community_posts was a single hot table on a single writer. That was the architectural fact, and no amount of tuning was going to change it. The Tuesday incident was the third one that quarter. The pattern was no longer a one-off, it was a property of the system.
I wrote a one-pager that afternoon and sent it to my staff lead. The next sprint we shaped the sharding work.
tenant_id was the obvious call and we made it pretty quickly. Every community-domain table already had a tenant_id column, every read path already filtered on it, and the data was already physically scoped per creator account. The alternatives didn’t survive 15 minutes of whiteboard time. Hash-of-user-id broke cross-tenant queries that didn’t exist anyway. Geographic sharding added legal complexity for very little operational gain. Feature-based sharding was an operational nightmare waiting to happen.
The router was the first thing we built. TypeScript, sat in front of the Rails app on the read path and in front of the database connection pool on the write path.
import { createHash } from 'crypto';
import type { Pool } from 'pg';
type ShardId = 'shard-0' | 'shard-1' | 'shard-2' | 'shard-3';
const SHARD_POOLS: Record<ShardId, Pool> = loadShardPools();
const PINNED_TENANTS: Map<string, ShardId> = loadPinnedTenantsFromConfig();
const SHARD_COUNT = 4;
export function resolveShard(tenantId: string): ShardId {
const pinned = PINNED_TENANTS.get(tenantId);
if (pinned) return pinned;
const hash = createHash('sha1').update(tenantId).digest();
const bucket = hash.readUInt32BE(0) % SHARD_COUNT;
return `shard-${bucket}` as ShardId;
}
export async function withTenantConnection<T>(
tenantId: string,
work: (pool: Pool) => Promise<T>,
): Promise<T> {
const shardId = resolveShard(tenantId);
const pool = SHARD_POOLS[shardId];
if (!pool) {
throw new Error(`no pool registered for shard ${shardId}`);
}
return work(pool);
}
The pinned-tenants map is the part that doesn’t show up in textbooks. We needed it on day one. More on that next.
Three large creators sat on roughly an eighth of the community data between them. Random hash distribution put two of them on the same shard, which would have meant one shard was permanently at twice the load of the others. Not a long-term plan.
We built a manual override layer. A tenant_shard_overrides table in a shared metadata database, polled into the router every 30 seconds. If a tenant was in the override table, the router used that shard; otherwise it fell back to the hash. The pinned tenants got their own shards with bigger instance classes.
CREATE TABLE tenant_shard_overrides (
tenant_id UUID PRIMARY KEY,
shard_id TEXT NOT NULL,
pinned_at TIMESTAMPTZ NOT NULL DEFAULT now(),
pinned_by TEXT NOT NULL,
reason TEXT NOT NULL
);
CREATE INDEX idx_tenant_shard_overrides_shard ON tenant_shard_overrides (shard_id);
I’d argue that if you’re sharding multi-tenant data, you need this table from day one even if it stays empty for a year. Hot tenants are a property of multi-tenant SaaS, not an edge case. Plan for them.
About two months into the sharding work we hit a thing I should have seen coming. Some legacy tables still used bigint primary keys with per-database sequences. On a single Aurora writer that was fine. Across four shards, two creators on different shards could end up generating the same primary key for different rows. The collision was theoretical for now and inevitable later.
We switched to UUID v7 for new rows. Timestamp-prefixed, so the b-tree stayed roughly insert-ordered and we didn’t tank write performance the way classic UUID v4 would. Kept bigints for existing rows and dual-wrote during the cutover.
CREATE OR REPLACE FUNCTION uuid_v7() RETURNS uuid AS $$
DECLARE
unix_ts_ms bigint;
uuid_bytes bytea;
BEGIN
unix_ts_ms := (extract(epoch from clock_timestamp()) * 1000)::bigint;
uuid_bytes := set_byte(gen_random_bytes(16), 6,
(b'0111' || (get_byte(gen_random_bytes(1), 0)::bit(4)))::bit(8)::int);
uuid_bytes := set_byte(uuid_bytes, 8,
(b'10' || (get_byte(gen_random_bytes(1), 0)::bit(6)))::bit(8)::int);
uuid_bytes := set_byte(uuid_bytes, 0, ((unix_ts_ms >> 40) & 255)::int);
uuid_bytes := set_byte(uuid_bytes, 1, ((unix_ts_ms >> 32) & 255)::int);
uuid_bytes := set_byte(uuid_bytes, 2, ((unix_ts_ms >> 24) & 255)::int);
uuid_bytes := set_byte(uuid_bytes, 3, ((unix_ts_ms >> 16) & 255)::int);
uuid_bytes := set_byte(uuid_bytes, 4, ((unix_ts_ms >> 8) & 255)::int);
uuid_bytes := set_byte(uuid_bytes, 5, ( unix_ts_ms & 255)::int);
RETURN encode(uuid_bytes, 'hex')::uuid;
END;
$$ LANGUAGE plpgsql VOLATILE;
The mid-flight UUID migration also produced one of the scariest moments of the quarter. A late-evening deploy added a new external_id column to one of the hot tables with null: false and a default, on a table with hundreds of millions of rows. The migration used the strong_migrations “safer” helper. It acquired an ACCESS EXCLUSIVE lock for around 87 seconds while applying the default backfill. Login error rate hit 100% for about 85 seconds. PagerDuty woke half the senior engineers in California.
First instinct was to roll back. Rails doesn’t have a clean rollback for a partially-applied add_column_with_default. Killing it mid-flight risked an inconsistent metadata state. We let it finish. Locks released, login recovered within 15 seconds of the lock release. Postmortem rewrote the migration as the three-step Aurora dance, add_column null: true, batched backfill, then change_column_null. The lesson is older than sharding but sharding made it sharper, on a hot table at scale, every schema change is a three-step migration. The “safe” gem helper is safer than raw ActiveRecord, not safe.
Reporting was the part of the migration nobody wanted to own. Operational reads were now per-shard. But the reporting team needed cross-tenant aggregates, top creators by engagement, weekly cohort comparisons, that kind of thing. Fan-out querying across four shards from the Rails app was a non-starter.
We pushed analytical reads off the shards entirely. CDC into ClickHouse via a Debezium-style pipeline, one topic per source table, one ClickHouse table per topic with a ReplacingMergeTree keyed on the primary key. Reporting queries hit ClickHouse, never Aurora.
CREATE TABLE community_posts_analytics
(
tenant_id UUID,
post_id UUID,
author_id UUID,
created_at DateTime64(3),
updated_at DateTime64(3),
engagement UInt32,
is_deleted UInt8,
_version UInt64
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, post_id);
This split also gave us something we hadn’t planned for, the reporting team stopped accidentally taking down the operational database with badly-written aggregation queries. Worth the warehouse cost on its own.
Two things, mostly.
I’d start sharding before the first replica-lag incident, not after. The signal that you need to shard is not the incident, it’s the trend line on the writer’s CPU graph six months earlier. If I could go back, I’d have shipped the shard router as an empty pass-through a full year before we needed it, just to lock in the shape of the API.
I’d also pick the shard key from day one, on a brand new product, even if there’s only one shard and the router is a no-op. Reverse-engineering the shard key after the fact is the most expensive piece of the migration. Every query that didn’t carry the key had to be rewritten. Every join across what used to be tenant boundaries had to be re-evaluated. Doing it cold, while production is shipping, is brutal.
Thanks for reading. If you’ve got thoughts, send them my way.