Connection Pool Exhaustion at 3 AM

A Postgres outage caused by a nightly batch, analyst dashboards, and Sidekiq workers all competing for the same connection pool, and what actually fixed it.

It was 3:07 in the morning and my phone was screaming about login failures. I was still half asleep when I joined the war room. Dashboards were red across the board. Logins were failing. Webhooks were timing out. The Rails app couldn’t talk to Postgres.

I opened the database dashboard and the connection count was pinned, flat, at 100. Not 99. Not 101. Exactly 100. The writer had hit max_connections and Postgres was rejecting everything else with FATAL: sorry, too many clients already.

This was at a SaaS platform I worked on a few years back. Rails monolith, Sidekiq for background work, Aurora Postgres on AWS. The kind of stack that runs quietly for months and then bites you on a Tuesday at 3 AM.

The page that woke me up

The on-call dashboard was unhelpful in the way only a generic dashboard can be. CPU was fine. IOPS was fine. Memory was fine. Replica lag was small. The only red number was connections, and it was very red.

I pulled pg_stat_activity from a psql session that, mercifully, got a connection because I’d been logged in five minutes earlier on another tab.

SELECT state, application_name, count(*) AS n
FROM pg_stat_activity
WHERE datname = 'app_production'
GROUP BY state, application_name
ORDER BY n DESC;

The result told the whole story in about ten seconds. Roughly half the pool was Sidekiq workers in active. A big chunk was the Rails web pods in idle, holding connections between requests. Nine sessions in idle in transaction from Metabase, sitting on transactions open for over forty minutes. Plus a nightly batch job, running its quarterly cohort export, holding eight more.

Four different consumers, each assuming the pool was theirs.

Three tenants, one connection pool

Here’s what was actually competing for those 100 slots that night.

The Rails web tier had about a dozen pods, each with a pool of 5. That alone reserved 60 connections during peak. Sidekiq had four workers, each configured for 25 threads, each thread potentially holding a connection. The nightly batch job opened its own ActiveRecord connection per worker and was running an unusually heavy partition rollup. And Metabase, which we used for analyst dashboards, was happily opening sessions per query and holding them.

Nobody had ever asked the question “how many connections does this actually need at peak.” max_connections was 100 because that was Aurora’s default for our instance class. The Rails pool was 5 because Rails default. Sidekiq concurrency was 25 because that’s what the gem README suggests. Each piece reasonable in isolation. Together, a polite traffic jam dressed up as a database capacity problem.

What I tried first that did not help

My first move was the obvious wrong one. I bumped max_connections to 300 in the parameter group and rebooted the writer (Aurora needs a reboot for that parameter, which is its own delight at 3 AM). Connections came back. Logins recovered. I was about to crawl back to bed.

Twenty minutes later the writer started swapping. Each Postgres backend allocates real memory, and tripling backends on the same instance class meant tripling per-connection RAM budget. The writer wasn’t sized for it. Latency on INSERT started climbing. Same alarm pattern, different shape.

Bumping max_connections doesn’t solve pool exhaustion. It trades a connection problem for a memory problem on the same box. The lever you actually want is a connection pooler that lets a small number of physical backends serve a large number of client sessions. That’s what PgBouncer is for.

Putting PgBouncer in front of Postgres

We deployed PgBouncer in front of the writer in transaction pooling mode. Same pattern many people end up at eventually, which is fine. There is no prize for inventing it yourself.

[databases]
app_production = host=writer.cluster-xxxx.eu-west-1.rds.amazonaws.com port=5432 dbname=app_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

pool_mode = transaction
max_client_conn = 2000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
server_lifetime = 1800
query_wait_timeout = 30

log_connections = 0
log_disconnections = 0
stats_period = 60

pool_mode = transaction is the important line. Each client transaction borrows a backend, returns it on commit. Rails, Sidekiq, Metabase, the batch job, they all open as many client connections as they want, up to max_client_conn. PgBouncer holds a small number of real Postgres backends and rotates them across transactions.

Two things to know if you copy this. Prepared statements need care under transaction pooling. Rails 7.1+ handles this with prepared_statements: false on the AR config when behind PgBouncer, or with the newer protocol-level support in PgBouncer 1.21+. And SET statements that depend on session state will not survive the transaction boundary. We hit one place where the app was relying on a session-level search_path. Moved it into the connection callback. Done.

Idle in transaction is the silent killer

The PgBouncer change fixed acute pool exhaustion. It did not fix the Metabase analysts.

A Metabase analyst opens a query, the dashboard renders, the analyst goes to lunch. Postgres sees a session that opened a transaction, ran a SELECT, and never sent a COMMIT or a ROLLBACK. It sits in idle in transaction. Under transaction-mode pooling that’s bad in a different way, the backend stays glued to that one client session and can’t be rotated.

The cure is a Postgres timeout, applied per-role.

# config/initializers/postgres_timeouts.rb

Rails.application.config.after_initialize do
  next unless ActiveRecord::Base.connected?

  ActiveRecord::Base.connection.execute(<<~SQL)
    ALTER ROLE app_user SET statement_timeout = '30s';
    ALTER ROLE app_user SET idle_in_transaction_session_timeout = '15s';
    ALTER ROLE app_user SET lock_timeout = '5s';

    ALTER ROLE sidekiq_user SET statement_timeout = '120s';
    ALTER ROLE sidekiq_user SET idle_in_transaction_session_timeout = '30s';
    ALTER ROLE sidekiq_user SET lock_timeout = '10s';

    ALTER ROLE batch_user SET statement_timeout = '15min';
    ALTER ROLE batch_user SET idle_in_transaction_session_timeout = '1min';
    ALTER ROLE batch_user SET lock_timeout = '30s';

    ALTER ROLE analytics_user SET statement_timeout = '5min';
    ALTER ROLE analytics_user SET idle_in_transaction_session_timeout = '30s';
    ALTER ROLE analytics_user SET lock_timeout = '2s';
  SQL
end

Per-role timeouts are the unsung hero here. The web app should never hold a transaction open for 30 seconds. The batch job legitimately might. A Metabase query that’s been idle in transaction for 30 seconds is an analyst who closed their laptop, not a workload. Kill it.

This one initializer probably saved us from the next outage we never had.

Separate the health check pool

One more thing bit us, which only showed up because we instrumented the recovery.

The Kubernetes liveness probe for the web pods was running SELECT 1 against the main app database, through the main connection pool. When the pool was saturated, the probe couldn’t get a connection. Kubernetes saw that as the pod being unhealthy. It killed and restarted the pod. The replacement pod tried to open its pool. Couldn’t. Got killed. Multiply by twelve.

The pool was exhausted because traffic was high. We then killed the application that was serving the traffic, which gave the analysts and the batch job even more of the pool, which made it worse.

The fix is small and almost embarrassing. We added a dedicated PgBouncer pool for system probes, mapped to a separate small Postgres role with a tiny default_pool_size. Liveness probes hit :6432 on the probe pool. Application traffic hits :6432 on the app pool. They cannot starve each other.

That’s the part I think about most. Not the connection ceiling, not the timeouts. The fact that we’d lovingly set up Kubernetes to murder our own pods every time the database breathed wrong.

What stuck after the incident

PgBouncer in transaction mode in front of every Postgres writer. Per-role timeouts declared in code so a junior engineer setting up a new role can’t forget. A separate pool for k8s probes. And a one-line addition to the runbook: “Before touching max_connections, pull pg_stat_activity grouped by state and application_name. The answer is usually there.”

Takeaways

max_connections is a ceiling, not a tuning knob. Don’t grow your way out.
PgBouncer in transaction pooling mode is the sane default, not a special-case fix.
idle_in_transaction_session_timeout is mandatory the moment a human can run SQL against your database.
Use per-role Postgres timeouts. Web, workers, batch, analytics should not have the same limits.
Health probes need their own pool. Otherwise you build a self-DOSing system.
Always pull pg_stat_activity before scaling anything. The answer is almost always in there.

Thanks for reading. If you’ve got thoughts, send them my way.