How decoupling deploy from release with feature flags turned a high-stress Rails monolith into boring daily ships, and the war stories that pushed me there.
It was a Wednesday evening, my migration had been live for about nine minutes, and login was returning 100% errors. I’d reviewed the migration that morning. I’d ack’d it. I’d shipped on a Wednesday on purpose, because at the creator economy platform I worked at we shipped every day and Wednesdays were as boring as deploys got.
Yeah. Boring deploys. That was the lie I believed for a long time.
Trunk-based, green CI, daily ships, fast pipelines. On the dashboard it looked great, and the team had earned all of it. We could get a hotfix from “merged” to “in production” in maybe twelve minutes including the migration step.
Here’s the thing nobody had said out loud yet. Deploy and release were the same event. Every merge into main went straight to customers. When the code shipped, the behavior shipped. The “boring” daily ship was actually a high-wire act with very fast feet.
That works right up until it doesn’t.
So that Wednesday. Rails monolith, Aurora PostgreSQL, late-evening deploy. We were adding a non-null column to users, a table with hundreds of millions of rows. The migration used the strong_migrations helper for add_column_with_default, the “safer than raw ActiveRecord” path. I’d read it. I’d approved it. It was supposed to be safe.
It wasn’t. The migration grabbed an ACCESS EXCLUSIVE lock on users while applying the default backfill. On Aurora at that row count, that meant about 90 seconds of blocked writes. Login, sign-up, password reset, every webhook tied to user creation. All blocked.
My first instinct was rollback. Of course it was. But Rails doesn’t have a clean rollback for a partially-applied add_column_with_default. By the time the lock would’ve released we’d already be a minute into the cascade. Killing it mid-flight risked an inconsistent metadata state.
The “real fix” was that there wasn’t one in the moment. We let the migration finish. It took 87 seconds. The lock released. Login recovered within 15 seconds because the dependent service had a tight retry loop. ~85 seconds of 100% login error rate at peak Pacific hours.
Postmortem the next day was clean. Split the migration into three steps. Add the column nullable, backfill in batches in a separate job, then flip nullability once 100% backfilled. Add a CI rule that blocks add_column with a non-null default against tables over 10M rows.
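The CI rule from that postmortem can be sketched in plain Ruby. This is a minimal illustration, not the check we actually ran: `risky_add_columns`, the `LARGE_TABLE_ROWS` constant, and the `row_counts` hash are hypothetical names. It assumes single-line `add_column` calls and that row counts come from somewhere outside the script.

```ruby
# Hypothetical CI check: flag add_column calls that combine NOT NULL with a
# default on tables above a row-count threshold.
LARGE_TABLE_ROWS = 10_000_000

# Matches single-line calls such as:
#   add_column :users, :tier, :string, null: false, default: "free"
ADD_COLUMN = /add_column\s*\(?\s*:(?<table>\w+)\s*,(?<rest>[^\n]*)/

# migration_source: the migration file contents as a String.
# row_counts: Hash of table name (String) => approximate row count,
#             supplied by the caller (assumed to exist; not computed here).
def risky_add_columns(migration_source, row_counts)
  migration_source.each_line.with_index(1).filter_map do |line, lineno|
    match = ADD_COLUMN.match(line)
    next unless match

    rest = match[:rest]
    not_null = rest.match?(/null:\s*false/)
    has_default = rest.match?(/default:/)
    rows = row_counts.fetch(match[:table], 0)

    next unless not_null && has_default && rows > LARGE_TABLE_ROWS

    { line: lineno, table: match[:table], rows: rows }
  end
end
```

A CI step would fail the build when the returned list is non-empty. A regex is crude (it will miss multi-line calls and will flag commented-out code), so treat it as a tripwire, not a proof of safety.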
But the question that stuck with me wasn’t “how do we write safer migrations.” It was: why was the customer-visible behavior change tied to the deploy of the code that caused it? Why couldn’t I have pushed the schema and then turned it on later, after I’d watched a few minutes go by?
That was the day I stopped trusting daily deploys as the goal.
A feature flag is not a config toggle. I’d been treating them like config toggles for years. A real flag is a contract: this merged code is in production, but its visible behavior is gated. You ship the code dark. You flip 1%. You watch. 10%. Watch. 100%. Then you leave the flag in for thirty days as your rollback button.
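The 1%, 10%, 100% progression only works if a given user stays on the same side of the flag as the percentage grows. Here is a minimal sketch of one way to get that, using a stable hash bucket. It is not how any particular vendor implements rollouts; `rollout_bucket` and `in_rollout?` are hypothetical helpers.

```ruby
require "digest"

# Map a (flag, user) pair to a stable bucket in 0...10_000.
# Including the flag key means different flags get independent cohorts.
def rollout_bucket(flag_key, user_key)
  Digest::SHA256.hexdigest("#{flag_key}:#{user_key}")[0, 8].to_i(16) % 10_000
end

# percent may be fractional, e.g. 0.5 for half a percent.
# A user whose bucket is below the cutoff sees the new behavior.
def in_rollout?(flag_key, user_key, percent)
  rollout_bucket(flag_key, user_key) < (percent * 100).round
end
```

Because the bucket never changes for a given user, raising the percentage only adds users. Anyone who saw the new behavior at 1% still sees it at 10%.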
We ended up on LaunchDarkly because the SDK could fail closed on its own when the network was sad. The technique works on a homegrown table-backed flag system too. The tool isn’t the magic.
The pattern I keep reaching for looks like this:
```typescript
import { init, LDClient } from "launchdarkly-node-server-sdk";
import { logger } from "@platform/observability";

type FlagContext = {
  kind: "user";
  key: string;
  customerId: string;
  region: "us" | "eu";
};

let client: LDClient | null = null;

export async function bootFlags(): Promise<void> {
  client = init(process.env.LD_SDK_KEY!, {
    streamUri: process.env.LD_STREAM_URI,
    offline: process.env.NODE_ENV === "test",
    timeout: 3,
  });
  await client.waitForInitialization();
}

export async function isOn(
  flag: string,
  ctx: FlagContext,
  fallback = false,
): Promise<boolean> {
  try {
    if (!client) return fallback;
    return await client.variation(flag, ctx, fallback);
  } catch (err) {
    logger.warn({ flag, err }, "flag eval failed, defaulting closed");
    return fallback;
  }
}
```
Two things matter. The fallback defaults to false. If anything goes wrong, the new behavior stays off and the old behavior keeps running. And the eval is wrapped in try/catch with an explicit log, because the day the flag service has a bad ten minutes I do not want it taking checkout down with it.
Default closed. Always. If you only remember one thing from this article, make it that.
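The same contract fits a homegrown flag system. Below is a minimal Ruby sketch, assuming a store object that responds to `fetch(key)` and may raise (a thin wrapper around a flags table, say). The module, its `store` and `logger` setters, and the log message are illustrative, not a real library.

```ruby
# Hypothetical homegrown flag lookup that fails closed.
module FeatureFlags
  class << self
    # store: anything responding to fetch(key); may raise on missing keys
    #        or on backend trouble.
    # logger: optional callable that receives a warning string.
    attr_writer :store, :logger

    def on?(key, default: false)
      value = @store.fetch(key)
      # Only a literal true turns the flag on; nil falls back to the default.
      value.nil? ? default : value == true
    rescue StandardError => e
      # Missing key, unset store, database down: all land on the default.
      @logger&.call("flag eval failed for #{key}: #{e.class}; using default #{default}")
      default
    end
  end
end
```

With a plain Hash as the store, an unknown key raises `KeyError` inside `fetch`, which the rescue turns into the default. A flag nobody has created yet is therefore off.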
The second war story I keep thinking about happened earlier, at the same platform. Our branded-mobile-app pipeline. Rails plus Python plus Fastlane plus GitHub Actions, automating native iOS and Android submissions for thousands of creator-owned apps. Hundreds of releases a week. The pipeline had been in production for about six months and felt boring in the good way.
Wednesday morning, the pending_apple_review Sidekiq queue started backing up. By lunch about 270 customer app builds were stuck in “Waiting for Review” on App Store Connect, but our pipeline thought they’d been submitted successfully. Support had 80+ tickets in by 2 p.m. Pacific. Root cause: Apple’s Connect API was silently throttling our submission endpoint, returning 200 OK with a body that looked normal. The submissions were being dropped on their side.
We already had auto-retry on 5xx. So someone, reasonably, extended it to retry on “stuck” too. That made everything worse. Apple started seeing what looked like duplicate submissions, and a chunk of customers ended up with two competing review records and conflicting metadata. The retry was treating 200 OK as truth.
The real fix took the morning. Pull the auto-retry. Add a circuit breaker that verified submission state via a separate GET against the App Store Connect resource, not via the POST response. Write a one-shot reconciliation job using an idempotency key from app_id + version + git_sha to dedupe pending reviews against Apple’s source of truth.
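A rough sketch of that shape follows: never trust the POST response, confirm state with a separate read, and stop submitting after repeated unverified attempts. Everything here is illustrative. `SubmissionBreaker`, the `submit` and `fetch_state` callables, and the state symbols are hypothetical stand-ins for the real App Store Connect calls, which are not shown.

```ruby
require "digest"

# Idempotency key derived from the three fields named above.
# Hashing only keeps the key at a fixed length.
def submission_key(app_id:, version:, git_sha:)
  Digest::SHA256.hexdigest([app_id, version, git_sha].join(":"))
end

# Opens after `threshold` consecutive unverified submissions and stays open
# until reset! is called by an operator or a recovery job.
class SubmissionBreaker
  def initialize(threshold: 3)
    @threshold = threshold
    @failures = 0
  end

  def open?
    @failures >= @threshold
  end

  def reset!
    @failures = 0
  end

  # submit:      callable performing the POST; its return value is ignored on purpose.
  # fetch_state: callable performing the separate GET; returns a state symbol.
  # accepted:    states that count as "the upstream really has it" (assumed names).
  def submit_and_verify(key, submit:, fetch_state:, accepted: [:waiting_for_review, :in_review])
    return :circuit_open if open?

    submit.call(key)
    if accepted.include?(fetch_state.call(key))
      @failures = 0
      :verified
    else
      @failures += 1
      :unverified
    end
  end
end
```

In practice the verifying read may lag the write, so a real version would probably poll with a delay before counting an attempt as unverified. The sketch leaves that out to keep the core idea visible.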
The lesson I dragged out of that morning: anything talking to a human-moderated upstream (App Review, Play Review, payment dispute, fraud review) gets a kill switch wrapping it. Defaulting closed. The shape ends up looking like this in the monolith:
```ruby
class Iap::AppleRenewalHandler
  class Disabled < StandardError; end

  KILL_FLAG = "iap.apple.renewal.enabled"

  def call(payload)
    raise Disabled unless FeatureFlags.on?(KILL_FLAG, default: false)

    key = idempotency_key_for(payload)
    return :duplicate if SubscriptionRenewal.exists?(idempotency_key: key)

    SubscriptionRenewal.create!(
      idempotency_key: key,
      apple_original_transaction_id: payload[:original_tx_id],
      notification_uuid: payload[:notification_uuid],
      raw_payload: payload,
    )
    EnqueueRenewalProcessing.perform_async(key)
    :enqueued
  rescue ActiveRecord::RecordNotUnique
    :duplicate
  end

  private

  def idempotency_key_for(payload)
    "#{payload[:original_tx_id]}:#{payload[:notification_uuid]}"
  end
end
```
The flag defaults closed. If on-call needs to stop the bleeding, they flip one switch, the handler raises, the endpoint returns 503 quickly, and Apple retries until the system is ready. We do not have to redeploy. We do not have to revert. The flag is the lever.
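The endpoint side of that can be sketched as a small function returning a Rack-style response triplet. `HandlerDisabled` stands in for the handler's own error class so the snippet runs on its own, and `webhook_response` is a hypothetical name. The `retry-after` header is my addition; whether the upstream honors it is not something the sketch assumes.

```ruby
# Stand-in for the handler's "disabled" error so this sketch is self-contained.
class HandlerDisabled < StandardError; end

# handler: any callable that takes the parsed payload and may raise
#          HandlerDisabled when its kill switch is off.
# Returns [status, headers, body] in the Rack convention.
def webhook_response(payload, handler:, retry_after_seconds: 60)
  result = handler.call(payload)
  [200, { "content-type" => "text/plain" }, [result.to_s]]
rescue HandlerDisabled
  # Fail fast with a retryable status instead of doing partial work.
  [503, { "content-type" => "text/plain", "retry-after" => retry_after_seconds.to_s }, ["disabled"]]
end
```

Only the disabled case maps to 503 here. Any other exception still propagates, so a genuine bug does not get disguised as a deliberate shutdown.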
That same kill-switch logic, in place two weeks before the branded-mobile-app pile-up, would have given us back the morning.
The technical work was the easier half. The cultural shift was where the real change happened, and it took longer than I expected.
Engineers who used to argue about merge windows stopped caring about merge windows. The customer-visible decision moved from “is this deploy safe” to “is this flag flip safe.” The latter is reversible in seconds. Once that became how the team thought, Friday afternoons went from “do not ship” to “go ahead, the flag is at 0% anyway.”
Wiring it into CI helped lock that in:
```yaml
name: progressive-rollout

on:
  workflow_dispatch:
    inputs:
      flag_key:
        required: true
      target_percent:
        required: true

jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run flag health probe
        run: |
          node scripts/flag-probe.js \
            --flag "${{ inputs.flag_key }}" \
            --window 10m \
            --error-budget 0.5
      - name: Promote rollout
        if: success()
        env:
          LD_API_KEY: ${{ secrets.LD_API_KEY }}
        run: |
          node scripts/ld-rollout.js \
            --flag "${{ inputs.flag_key }}" \
            --percent "${{ inputs.target_percent }}"
```
The probe reads our Datadog SLO for the affected service over the last ten minutes. If the error budget is healthy, the promote step runs. Otherwise the workflow fails and a human gets paged before the rollout escalates. Nothing fancy. The point is that the rollout isn’t a person typing into a UI at 2 a.m. It’s a workflow with a brake.
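The gate itself is a small decision. Here is a sketch of it in Ruby, under loud assumptions: the real probe script is not shown in this post, so `healthy?`, `next_step`, and `ROLLOUT_STEPS` are hypothetical, and reading the threshold as a maximum error rate in percent is my interpretation for illustration only.

```ruby
# Rollout ladder, matching the 1% -> 10% -> 100% progression.
ROLLOUT_STEPS = [1, 10, 100].freeze

# True when the observed error rate over the window is within the threshold.
# max_error_percent is an assumed interpretation of an error-budget setting.
def healthy?(errors:, total:, max_error_percent:)
  return false if total.zero? # no traffic is no evidence, so do not promote

  (errors.to_f / total) * 100 <= max_error_percent
end

# Returns the percentage the flag should move to: the next rung when healthy,
# otherwise the current value (the brake).
def next_step(current_percent, errors:, total:, max_error_percent: 0.5)
  unless healthy?(errors: errors, total: total, max_error_percent: max_error_percent)
    return current_percent
  end

  ROLLOUT_STEPS.find { |step| step > current_percent } || current_percent
end
```

The zero-traffic case is the one worth arguing about. Treating silence as unhealthy means a flag on a quiet endpoint never promotes itself, which seems like the safer failure.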
Flag debt is real. Stale flags accumulate fast on a team that ships every day. Every PR that adds a flag should also add the sunset date in the same PR description, and the cleanup ticket should go into the backlog at flag-creation time, not someday. Otherwise you end up with a flag graveyard nobody is brave enough to delete from, which is its own production risk.
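One way to keep the graveyard from forming is to make overdue flags visible. The sketch below assumes a flag registry exists as a list of hashes with `:key` and `:sunset_on` fields. That shape and the `overdue_flags` name are hypothetical, not something our tooling exposed.

```ruby
require "date"

# registry: Array of Hashes like { key: "checkout.v2", sunset_on: Date } (assumed shape).
# Returns flags past their sunset date, oldest first, with days overdue.
def overdue_flags(registry, today: Date.today)
  registry
    .select { |flag| flag[:sunset_on] < today }
    .sort_by { |flag| flag[:sunset_on] }
    .map { |flag| { key: flag[:key], days_overdue: (today - flag[:sunset_on]).to_i } }
end
```

Run on a schedule, the output could feed a weekly report or fail a build once anything is more than some agreed number of days overdue.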
Thanks for reading. If you’ve got thoughts, send them my way.