The Query Plan That Changed Its Mind at 3AM

databases debugging postgres consulting reliability

A routine ANALYZE flipped a Postgres query plan from an index scan to a sequential scan, and our API went from 12ms to 8 seconds. Here's what we learned about a failure mode most teams never think about.

The alert came in at 3:17 AM. API response times had jumped from 12ms p50 to over 8 seconds. Error rates were climbing as connection timeouts cascaded through the system. By the time I got to my laptop, the on-call engineer had already restarted the application servers twice. It didn't help.

The database was the same Postgres 15 instance that had been running fine for months. No schema changes. No new deployments. No traffic spike. The only thing that had changed was nothing — or so we thought.

Finding the smoking gun

The slow query log told us exactly which query was misbehaving. It was a lookup that powered the main dashboard, something that ran thousands of times per hour:

SELECT o.id, o.status, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.tenant_id = $1
  AND o.created_at > $2
ORDER BY o.created_at DESC
LIMIT 50;

Nothing exotic. The table had about 40 million rows, partitioned by tenant. There was a composite index on (tenant_id, created_at) that had been serving this query beautifully since day one. We ran EXPLAIN ANALYZE and saw the problem immediately.

Instead of the expected index scan on orders_tenant_created_idx, Postgres had switched to a sequential scan. On a 40-million-row table.

The query planner had changed its mind.

Why Postgres flipped

After some digging, we found the culprit: autovacuum had run its periodic ANALYZE on the orders table about twenty minutes before the incident started. This refreshed the table statistics that Postgres uses to decide how to execute queries. With the updated statistics, the planner concluded that the sequential scan would be cheaper.

Here's the thing — it wasn't wrong, exactly. The statistics now more accurately reflected the data distribution, and for certain tenant IDs with very large result sets, a sequential scan could theoretically be more efficient. But the planner doesn't know that 99% of queries hit tenants with modest data volumes. It saw that the most common tenant_id values matched a large fraction of the table and decided the index wasn't worth it.

This is a known failure mode, but it catches teams off guard because it's intermittent and invisible until it isn't. Your query runs fine for months. Then the planner gets slightly different statistics and makes a different choice. No code changed. No config changed. Just a background process doing its job.

Warning

Query plan instability is especially common on tables with skewed data distributions — where some partition keys match 10 rows and others match 10 million. The planner picks one plan for all parameter values.

The 4AM fix and the real fix

At 4AM, we did the expedient thing. We pinned the query plan by rewriting the query with an explicit index hint (well, Postgres doesn't have hints, so we restructured the query to make the index scan the obvious choice):

SELECT o.id, o.status, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.tenant_id = $1
  AND o.created_at > $2
  AND o.created_at <= NOW()
ORDER BY o.created_at DESC
LIMIT 50;

Adding the upper bound on created_at narrowed the estimated row count enough that the planner switched back to the index. Response times dropped to normal within seconds.

But that's a band-aid. The underlying problem was that we had a query sensitive to planner estimates, and those estimates could shift with any ANALYZE run.

Over the next week, we made three changes that actually addressed the root cause.

First, we adjusted the statistics target for the columns involved. Postgres defaults to collecting 100 histogram buckets per column. For our skewed tenant_id distribution, that wasn't enough granularity.

ALTER TABLE orders ALTER COLUMN tenant_id SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 1000;
ANALYZE orders;

More buckets meant the planner could distinguish between "tenant with 200 rows" and "tenant with 5 million rows" instead of lumping them together.

Second, we started using plan_cache_mode = force_custom_plan for this specific prepared statement. By default, Postgres will sometimes use a generic plan for prepared statements after five executions — a plan that doesn't consider the specific parameter values. Forcing custom plans meant each execution got a plan tailored to the actual tenant being queried.

Third — and this was the change I was most insistent about — we added query plan monitoring to our observability stack. We wrote a simple check that ran EXPLAIN on our critical queries every five minutes and compared the plan structure to a known-good baseline. If the plan changed, we got an alert before response times cratered.

def check_query_plan(conn, query, params, expected_node):
    result = conn.execute(f"EXPLAIN (FORMAT JSON) {query}", params)
    plan = result.fetchone()[0][0]["Plan"]
    actual_node = plan["Node Type"]
    if actual_node != expected_node:
        alert(f"Plan changed: expected {expected_node}, got {actual_node}")

It's crude, but it caught two more plan shifts over the next month — both during off-peak hours, both before they affected users.

The broader pattern

This incident reminded me how much we take the query planner for granted. We write a query, check that it uses the right index, and move on. We almost never ask: "What happens when the statistics change?" or "Is this plan stable across different parameter values?"

The Clerk outage earlier this year followed the same pattern — a routine ANALYZE caused a plan flip that brought down their authentication service for 90 minutes. If it can happen to a team running critical auth infrastructure, it can happen to anyone.

I've started treating query plan stability the same way I treat schema migrations: something that needs to be tested, monitored, and thought about explicitly. Not something you assume will just keep working.

If you're running Postgres with tables that have skewed distributions or multi-tenant data, it might be worth running EXPLAIN on your five most critical queries right now. Compare them against what you expect. You might be one ANALYZE away from a very interesting night.