The Connection Pool That Starved at 3 PM Every Day

database debugging performance consulting reliability

A client's API started throwing 500s every weekday afternoon like clockwork. The database was fine. The queries were fast. The problem was a reporting job that quietly hogged every available connection during peak traffic.

"It fixes itself after about twenty minutes." That was how the tech lead described the problem. Every weekday around 3 PM, their API would start returning 500 errors. Latency would spike from 40ms to 30 seconds, then to outright timeouts. By 3:20, things calmed down. By 3:30, everything looked normal again.

The team had been living with it for two months. They'd even given it a name internally: "the afternoon dip." Someone had written a runbook entry that said, essentially, "wait it out."

I was four days into a consulting engagement focused on general reliability improvements. This seemed like a good place to start.

The obvious theory was wrong

The team's assumption was a traffic spike. Afternoon hours meant more users, more requests, more load. Reasonable on the surface. But when I pulled up their Cloudwatch request metrics, the traffic curve was smooth. No sudden spike at 3 PM. Requests per second did climb slightly between 2 and 4 PM, but it was a gentle ramp — maybe 15% above the morning baseline. Nothing that should cause outright failures.

The API was a Node.js service (Express, TypeScript) backed by PostgreSQL on RDS. I checked the database metrics next. CPU usage during the "dip" barely moved — it sat around 25%. No lock contention showing up in pg_stat_activity. Query latency on the database side stayed under 10ms for the standard queries. The database was bored.

So the API was failing, the database was fine, and traffic wasn't unusual. Something in between was broken.

Twenty connections and a crowded hallway

I added a quick diagnostic endpoint that dumped the connection pool stats from their pg library:

app.get('/debug/pool', (req, res) => {
  const pool = getPool();
  res.json({
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  });
});

I hit it every few seconds leading up to 3 PM. At 2:55, the numbers were healthy: 6 total connections, 3 idle, 0 waiting. At 3:01, total jumped to 20 (the configured max), idle dropped to 0, and the waiting count started climbing. By 3:05, there were 140 requests queued up waiting for a connection that wasn't coming.

The pool was full. Not because the database was slow, but because something was holding connections and not giving them back.

The report nobody mentioned

I asked what ran at 3 PM. Blank stares. Then someone from the product team said: "Oh, the daily summary report? That's just a cron job."

The cron job triggered an internal API endpoint that generated a daily sales summary for the finance team. It ran eight queries in parallel — aggregations across orders, payments, refunds, and inventory adjustments. Each query scanned a few million rows and took 30 to 45 seconds to complete. Not slow for analytical queries, honestly. The database handled them without breaking a sweat.

But each of those eight queries held a connection for its entire duration. The pool max was 20. So at 3 PM, eight connections got locked up for 30-45 seconds. That left twelve connections for all regular API traffic. During the busiest part of the day, twelve connections weren't enough. The waiting queue grew. The connectionTimeoutMillis was set to the default of zero — meaning it would wait forever. Requests stacked up behind the pool, Express kept accepting new ones, and eventually clients hit their own timeouts and got 500s from the gateway.

The report finished, the eight connections returned, the backlog drained, and the "afternoon dip" resolved itself. Every single day.

Why defaults almost killed them

Two configuration details made this worse than it needed to be. First, connectionTimeoutMillis: 0 — infinite wait. A request would sit in the pool queue indefinitely rather than failing fast. If this had been set to, say, 5 seconds, the API would have returned errors sooner but recovered faster. The backlog wouldn't have snowballed.

Second, and more fundamentally: one pool for everything. The fast transactional queries (get a user, fetch an order, update a cart) shared the same 20-connection pool as multi-second analytical queries. That's like having a single-lane road for both commuter traffic and freight trains. The freight doesn't move faster, but it blocks everything behind it.

// Before: one pool, all traffic
const pool = new Pool({
  max: 20,
  connectionTimeoutMillis: 0, // wait forever
});
 
// After: separate pools, separate concerns
const transactionalPool = new Pool({
  max: 20,
  connectionTimeoutMillis: 5000,
});
 
const analyticalPool = new Pool({
  max: 5,
  connectionTimeoutMillis: 30000,
});

The fix

We did three things. First, we created a separate connection pool for the reporting endpoint with its own max of five connections and a 30-second timeout. The report slowed down slightly because it could only run five parallel queries instead of eight, but it went from taking 45 seconds to about 70 seconds. The finance team didn't notice.

Second, we set connectionTimeoutMillis to 5 seconds on the main pool. If all connections are busy, fail fast and let the client retry. No more infinite queue.

Third — and this was the real fix — we moved the report queries to a read replica. The reporting endpoint got its own pool pointed at the replica, completely isolating it from the transactional workload. The main pool never saw the analytical queries again.

We also added pool metrics to their Grafana dashboard: active connections, idle connections, waiting count, and wait time P95. It was one of those things where the team said "we should have had this from the start" — which is what everyone says, and almost nobody does proactively.

Note

If you're using node-postgres, pool.totalCount, pool.idleCount, and pool.waitingCount are right there. Exposing them to Prometheus takes about ten lines of code. Do it before you need it.

Shared pools, shared problems

Connection pools are shared resources, and shared resources have the same fundamental problem everywhere: one bad tenant can ruin it for everyone. A pool doesn't know the difference between a 5ms lookup and a 45-second aggregation. It just hands out connections and waits for them to come back.

The pattern I keep seeing is teams that size their pool for average load and average query duration. That works until something non-average shows up. A migration. A new report. A background job that someone added six months ago and nobody remembers. The pool doesn't care about intent — it just runs out.

If your application mixes fast and slow queries through the same pool, it's not a question of whether you'll hit contention. It's when. And it'll probably happen at the worst possible time, because slow queries hurt most when traffic is high and connections are already scarce.

What's quietly holding connections in your system right now?