The Load Test That Told Us Exactly What We Wanted to Hear

We ran load tests before a big product launch, got green across the board, and watched the system buckle under real traffic two days later. The tests weren't wrong — they just weren't testing reality.

Three months into an engagement last year, the client was preparing for a marketing push — their first national campaign. The CTO wanted assurance that their platform could handle 10x the normal traffic. Reasonable ask. The team had been running load tests for weeks and everything looked solid. P95 response times under 300ms at the target throughput. No errors. Clean graphs.

Two days into the campaign, the order service started returning 503s during the evening peak. The database connection pool was exhausted. API gateway timeouts cascaded into retry loops. Revenue stopped for 47 minutes.

The load tests had passed. Every single one.

What the Tests Actually Tested

The team was using k6 with a set of scripts that hit their core endpoints: product listing, product detail, add to cart, checkout. Each script ran independently at a fixed request rate, ramping up over 10 minutes to the target of 500 requests per second.
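
Reconstructed from memory, one of those per-endpoint scripts looked roughly like this; the host is a placeholder and the exact numbers varied by endpoint:

// Original style: one scenario, one endpoint, a steady ramp to the target rate
import http from 'k6/http';

export const options = {
  scenarios: {
    product_listing: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      timeUnit: '1s',
      preAllocatedVUs: 1000,
      stages: [
        { duration: '10m', target: 500 }, // ramp to the 500 rps target
        { duration: '20m', target: 500 }, // hold steady
      ],
    },
  },
};

export default function () {
  http.get('https://staging.example.com/api/products?page=1'); // placeholder URL
}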

On paper, this was thorough. In practice, it had three fundamental problems that none of us caught until the production incident forced us to look.

Problem one: uniform traffic distribution. The test scripts spread requests evenly across endpoints. In reality, their traffic was bursty and correlated. When the marketing email went out at 6 PM, they didn't get a gentle ramp — they got 4,000 users hitting the product listing page within 90 seconds, then 60% of those users clicking into the same three promoted products, then a surge of add-to-cart requests for the same SKUs.

The load test modeled a river. Production was a flash flood.

Problem two: a toy dataset. The test environment had 12,000 products. Production had 340,000. The product listing endpoint used pagination with cursor-based queries that performed fine against a small catalog but degraded past 200,000 rows because of how the compound index interacted with the sort order. Nobody had profiled the queries against a production-sized dataset.

-- This looks innocent enough
SELECT id, name, price, created_at
FROM products
WHERE category_id = $1
  AND active = true
ORDER BY featured DESC, created_at DESC
LIMIT 20;
 
-- But with 340K rows and the wrong index, PostgreSQL
-- abandoned the category index and scanned by the sort
-- columns instead. At 12K rows, both plans were fast
-- enough that nobody noticed the difference.

Problem three: no connection lifecycle pressure. The load test scripts opened connections, made a request, and moved on. Real users held sessions. They browsed, added items, hesitated, came back. At peak, the application had 1,800 concurrent WebSocket connections for real-time inventory updates — a feature that didn't exist in any load test scenario. Each WebSocket held a database connection from the pool for its subscription query.

The connection pool was sized at 100. The team had tested with HTTP requests that borrowed and returned connections in milliseconds. They'd never tested what happened when long-lived connections ate 60% of the pool before the HTTP traffic even started competing for the rest.

The Fix Wasn't More Load Testing

After the incident, the instinct was to write better load test scripts. And we did, eventually. But the first thing we fixed was the gap between the test environment and production.

We built a dataset generator that mirrored the production catalog's shape — same number of products, same category distribution, same skew in popular items. This alone surfaced the query plan issue described above. Adding a composite index on (category_id, active, featured DESC, created_at DESC) brought that listing query from 1.8 seconds back to 12ms at production data volumes.

CREATE INDEX idx_products_category_listing
ON products (category_id, active, featured DESC, created_at DESC)
WHERE active = true;
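
The generator itself was unremarkable. Here is a minimal sketch of the approach in Node with node-postgres; the column names track the query above, while the category shares and the featured/active ratios are illustrative stand-ins, not the client's real distribution:

const { Pool } = require('pg');
const pool = new Pool(); // assumes PG* env vars point at the load-test database

// A handful of categories with skewed shares; production had far more,
// but the skew is what matters.
const CATEGORIES = [
  { id: 1, share: 0.24 },
  { id: 2, share: 0.18 },
  { id: 3, share: 0.11 },
  { id: 4, share: 0.47 }, // stand-in for the long tail
];

function pickCategory() {
  let r = Math.random();
  for (const c of CATEGORIES) if ((r -= c.share) <= 0) return c.id;
  return CATEGORIES[CATEGORIES.length - 1].id;
}

async function seed(total = 340000) {
  // Row-by-row inserts are slow but fine for a one-off seed; batch if you care.
  for (let n = 0; n < total; n++) {
    await pool.query(
      `INSERT INTO products (name, category_id, price, featured, active, created_at)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      [
        `Product ${n}`,
        pickCategory(),
        (Math.random() * 200).toFixed(2),
        Math.random() < 0.02,  // ~2% featured
        Math.random() < 0.9,   // ~90% active
        new Date(Date.now() - Math.random() * 3 * 365 * 86400 * 1000), // ~3 years of history
      ]
    );
  }
  await pool.end();
}

seed().catch(console.error);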

Warning

If your load test environment has 3% of production's data, you're testing your indexes, not your queries. Query plans change with table statistics — what's fast at 10K rows can be catastrophic at 300K.

For the connection pool exhaustion, we moved the WebSocket subscriptions to a separate connection pool and set a hard ceiling on concurrent subscriptions per node. We also added connection pool metrics to our dashboards — something that should have been there from day one but wasn't, because the pool had never been a bottleneck during testing.
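
The shape of that change, sketched with node-postgres for illustration; the client's actual service code is more involved, and the sizes below are placeholders rather than the numbers we landed on:

const { Pool } = require('pg');

// Two pools: short-lived HTTP queries can no longer be starved by subscriptions.
const httpPool = new Pool({ max: 80, connectionTimeoutMillis: 2000 });
const subscriptionPool = new Pool({ max: 20 });

const MAX_SUBSCRIPTIONS_PER_NODE = 400; // hard ceiling on long-lived consumers
let activeSubscriptions = 0;

async function openInventorySubscription(sku) {
  if (activeSubscriptions >= MAX_SUBSCRIPTIONS_PER_NODE) {
    throw new Error('subscription capacity reached on this node');
  }
  activeSubscriptions += 1;
  const client = await subscriptionPool.connect(); // held for the life of the socket
  // ... run the subscription query / LISTEN for inventory changes here ...
  return function close() { // call when the WebSocket disconnects
    activeSubscriptions -= 1;
    client.release();
  };
}

// Pool gauges we now export to the dashboards; node-postgres exposes these counters.
function poolMetrics(pool) {
  return { total: pool.totalCount, idle: pool.idleCount, waiting: pool.waitingCount };
}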

Rewriting the Load Tests

The new load test suite looked fundamentally different from the original. Instead of independent endpoint scripts, we wrote user journey scenarios that modeled actual behavior:

// k6 scenario: "evening_campaign_surge"
export const options = {
  scenarios: {
    campaign_burst: {
      executor: 'ramping-arrival-rate',
      exec: 'campaignJourney',   // exported journey function (see below)
      startRate: 10,
      timeUnit: '1s',
      preAllocatedVUs: 2000,
      stages: [
        { duration: '30s', target: 10 },
        // Simulate email blast spike
        { duration: '90s', target: 800 },
        { duration: '10m', target: 400 },
        { duration: '5m', target: 100 },
      ],
    },
    websocket_sessions: {
      executor: 'constant-vus',
      exec: 'inventorySocket',   // long-lived WebSocket sessions (not shown)
      vus: 500,
      duration: '20m',
    },
  },
};

The critical difference wasn't the tool or the scripting — it was that we modeled correlation. Real users don't arrive uniformly. They cluster around events. They all want the same products that marketing just promoted. They hold connections open while they think.
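
The journey function behind campaign_burst looked roughly like this; the host, SKUs, and probabilities are illustrative placeholders rather than the client's real values, and the inventorySocket function for the WebSocket scenario isn't shown:

import http from 'k6/http';
import { sleep } from 'k6';

const BASE = 'https://load-test.example.internal'; // placeholder host
const PROMOTED = ['sku-1042', 'sku-2210', 'sku-3307']; // the three promoted products

export function campaignJourney() {
  // Everyone lands on the listing page the email linked to
  http.get(`${BASE}/api/products?category=campaign`);
  sleep(1 + Math.random() * 3); // reading, not clicking

  // Most users pile onto the same promoted SKUs; the rest wander the long tail
  const sku = Math.random() < 0.6
    ? PROMOTED[Math.floor(Math.random() * PROMOTED.length)]
    : `sku-${Math.floor(Math.random() * 340000)}`;
  http.get(`${BASE}/api/products/${sku}`);
  sleep(2 + Math.random() * 5); // hesitation before committing

  // A fraction convert to add-to-cart for the hot SKUs
  if (Math.random() < 0.3) {
    http.post(`${BASE}/api/cart`, JSON.stringify({ sku, qty: 1 }), {
      headers: { 'Content-Type': 'application/json' },
    });
  }
}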

We also added a "degradation test" that we ran separately: instead of asking "can the system handle the target load," we asked "what happens when we exceed the target by 50%?" The answer should be graceful degradation — slower responses, maybe a queue — not connection pool exhaustion and cascading failures. The first time we ran this test, the system fell over at 1.2x the target load. That gap between "passes the test" and "survives reality" was exactly where the production incident had lived.
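
A sketch of that degradation profile, with k6 thresholds standing in for our working definition of graceful; the numbers are illustrative:

export const options = {
  scenarios: {
    over_target: {
      executor: 'constant-arrival-rate',
      rate: 750, // 1.5x the 500 rps target
      timeUnit: '1s',
      duration: '15m',
      preAllocatedVUs: 3000,
    },
  },
  thresholds: {
    // Slower is acceptable past the target; a cliff is not.
    http_req_duration: ['p(95)<2000'],
    // Hard failures should stay rare even when overloaded.
    http_req_failed: ['rate<0.02'],
  },
};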

What Changed My Thinking

I used to treat load testing as a verification step. Run the numbers, check the box, ship with confidence. This engagement changed that.

Load tests don't prove your system works. They prove your model of traffic works against your model of the system. The value isn't in the green checkmark — it's in how closely those models match reality. And they never match perfectly, which means the real question isn't "did the test pass?" but "what did the test not cover?"

The team now runs their campaign-burst scenario against a production-mirror dataset as part of every release that touches the product or order paths. They also review their load test assumptions quarterly against actual traffic patterns from the past 90 days. Traffic shapes change. Products get more popular. Features get added. The model has to evolve with the system.

I still wonder, though, how many teams are running load tests that pass beautifully in a world that doesn't match where their code actually runs. Probably more than would admit it.