The E2E Test Suite That Cried Wolf
How a 45-minute end-to-end test suite trained an entire team to ignore test failures — and what we did about it.
The Slack message had become a ritual. Every morning around 9:15, someone would post in #engineering: "E2E tests are red again, looks like the usual timeout stuff. Merging anyway."
Nobody questioned it. The end-to-end suite had been failing intermittently for months. Sometimes it was a Selenium timeout. Sometimes a test database hadn't seeded properly. Sometimes nobody could figure out why. The team had 187 end-to-end tests covering a mid-sized B2B SaaS product, and on any given run, somewhere between 3 and 12 of them would fail for reasons unrelated to the code being shipped.
I was brought in to consult on "CI/CD reliability." It took about two days to realize the CI pipeline wasn't the problem. The test suite was.
The real cost wasn't compute time
Sure, the suite took 45 minutes to run. That's painful. But teams survive slow pipelines — they just context-switch and come back. The actual damage was subtler.
Because failures were routine, the team had stopped treating red builds as signals. They'd scan the failed test names, mentally classify them as "known flaky" or "maybe real," and merge if nothing looked suspicious. This was a judgment call made dozens of times a week, often by the most junior person on the team. The suite had become the boy who cried wolf, and the team had learned to stop listening.
During my second week, a real regression slipped through. A payment webhook handler had a race condition that only manifested when two events arrived within 50ms of each other. One of the E2E tests actually caught it — but the test had been on the team's mental "ignore list" because it failed roughly once a week due to unrelated timing issues. The bug made it to production. A customer noticed before the team did.
Counting what we actually had
I asked the team to categorize every E2E test into one of three buckets:
- Stable — green on the last 20 consecutive runs
- Flaky — failed at least once in the last 20 runs without a corresponding code change
- Broken — hadn't passed in over a week
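The three buckets are mechanical enough to compute from run history. Here's a sketch of the classification we applied; the `RunResult` shape and the `code_changed` triage flag are our illustrations, not part of any CI tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RunResult:
    passed: bool
    finished_at: datetime
    code_changed: bool  # triaged: did a relevant code change land with this run?

def classify(history: list[RunResult], now: datetime) -> str:
    """Bucket a test as stable, flaky, or broken from its run history."""
    recent = history[-20:]
    week_ago = now - timedelta(days=7)
    # Broken: no passing run in over a week.
    if not any(r.passed and r.finished_at >= week_ago for r in history):
        return "broken"
    # Flaky: failed in the last 20 runs with no code change to blame.
    if any(not r.passed and not r.code_changed for r in recent):
        return "flaky"
    # Stable: green on the last 20 consecutive runs.
    return "stable"
```

Running this over every test in the suite nightly gives you the same three-bucket report without relying on anyone's memory of which tests "usually fail."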
The numbers were grim. Out of 187 tests: 94 were stable, 71 were flaky, and 22 were outright broken. Nearly half the suite was unreliable.
We dug into the flaky ones. The causes broke down roughly like this:
- Timing dependencies (31 tests): waiting for animations, polling intervals, or async operations with hardcoded sleeps instead of proper waits
- Shared test state (18 tests): tests that passed in isolation but failed when another test ran first and left behind dirty data
- Environment coupling (14 tests): tests that depended on specific third-party sandbox states, seed data, or container startup order
- Genuine nondeterminism (8 tests): race conditions in the application code itself — these were actually valuable signals hiding in the noise
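The timing-dependency fixes mostly amounted to replacing fixed sleeps with condition-based polling. A minimal sketch of the pattern — the helper name and defaults are ours; Selenium users get the same behavior from `WebDriverWait`:

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or the timeout elapses.

    Unlike a fixed time.sleep(5), this returns as soon as the condition
    holds, and fails loudly (instead of silently racing) when it doesn't.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Instead of:  time.sleep(5); assert invoice.status == "sent"
# Write:       wait_until(lambda: invoice.status == "sent", timeout=5)
```

The hardcoded-sleep version fails whenever the operation takes longer than the guess and wastes time whenever it finishes sooner; the polling version does neither.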
What we cut, what we kept, what we moved
We didn't delete 80% of the tests and call it a day. That makes for a good headline but a bad strategy. Instead, we restructured.
The 22 broken tests got reviewed one by one. Eleven tested features that had been intentionally changed — the tests were simply never updated. We deleted those. The other eleven pointed to real gaps, so we rewrote them as integration tests that ran against the API layer without a browser.
For the 71 flaky tests, we applied a simple rule: if the test could be rewritten as an API-level integration test without losing meaningful coverage, we moved it down. Most UI-triggered workflows — create an invoice, apply a discount, send a notification — don't actually need a browser to verify that the backend logic works. The browser layer adds value only when you're testing actual UI behavior: form validation, navigation flows, accessibility.
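To make the move concrete: a "create an invoice" check needs no browser if you exercise the API layer directly. A sketch of the shape such a test took — the client and its methods are hypothetical stand-ins (here backed by an in-memory fake so the example is self-contained), not the product's real API:

```python
# Hypothetical in-memory stand-in for the product's API client, so the
# test shape is visible without a running server or a browser session.
class FakeApiClient:
    def __init__(self):
        self._invoices = {}
        self._next_id = 1

    def create_invoice(self, customer_id, amount_cents):
        invoice = {"id": self._next_id, "customer_id": customer_id,
                   "amount_cents": amount_cents, "status": "draft"}
        self._invoices[self._next_id] = invoice
        self._next_id += 1
        return invoice

    def get_invoice(self, invoice_id):
        return self._invoices[invoice_id]

def test_create_invoice_persists_draft():
    # The browser version of this test clicked through four pages and
    # waited on three animations. This verifies the same backend behavior.
    api = FakeApiClient()
    created = api.create_invoice(customer_id=42, amount_cents=9_900)
    fetched = api.get_invoice(created["id"])
    assert fetched["status"] == "draft"
    assert fetched["amount_cents"] == 9_900
```

The browser-level equivalent keeps earning its place only for what the API test can't see: rendering, form validation, navigation.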
We ended up with a final E2E suite of 63 tests. It ran in 8 minutes. And it was green.
The rule we adopted
The team agreed on a policy going forward: if an E2E test fails twice in a week without a code-related cause, it gets quarantined. Quarantined tests run in a separate, non-blocking pipeline job. They still execute — you can see the results — but they don't gate merges. Every two weeks, someone from the team spends a couple of hours either fixing quarantined tests or demoting them to integration tests.
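The "twice in a week without a code-related cause" trigger is simple enough to automate rather than litigate. A sketch of the decision — the `Failure` record is our assumption, and wiring it to your CI's test reports is left out:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Failure:
    test_name: str
    occurred_at: datetime
    code_related: bool  # triaged: was a code change the plausible cause?

def should_quarantine(failures, now):
    """True if a test failed at least twice in the past 7 days
    with no code-related cause — the quarantine trigger."""
    week_ago = now - timedelta(days=7)
    unexplained = [f for f in failures
                   if f.occurred_at >= week_ago and not f.code_related]
    return len(unexplained) >= 2
```

A nightly job applying this to the failure log can open a ticket and move the test to the non-blocking pipeline automatically, so quarantining doesn't depend on someone remembering the policy.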
This sounds bureaucratic, but it solved the core problem: the tests that did block merges were trustworthy. When the suite went red, people actually looked.
What I'd tell teams starting from scratch
Don't aim for a testing pyramid or a testing diamond or whatever shape is trending this month. Aim for this: every test in your blocking pipeline should be one you'd investigate if it failed. If you wouldn't drop what you're doing to look at a red test, it shouldn't be blocking your deploys.
That means fewer E2E tests than you think. For most web applications, somewhere between 20 and 60 end-to-end tests cover the critical user journeys. Everything else — the field validation, the edge cases, the business logic branches — belongs in faster, more deterministic layers.
The team I worked with went from a 45-minute suite with a ~30% daily failure rate to an 8-minute suite that failed maybe once a month. More importantly, when it did fail, someone actually looked. The webhook race condition I mentioned earlier? A test caught a similar issue three weeks after we restructured. This time, the developer saw the red build, investigated immediately, and fixed it before merging.
That's what a test suite is supposed to do. Not prove that your code works — give you a reason to pause when it might not.