The CI Pipeline Nobody Was Allowed to Touch

ci-cd developer-experience consulting devops

A 47-minute build pipeline had become sacred infrastructure. When we finally opened it up, we found cargo-culted steps, redundant checks, and a team afraid of their own tooling.

The first thing I noticed on the client project wasn't the code. It was how developers talked about merging pull requests. "Just throw it up and go get lunch." "Don't bother watching it, you'll only stress yourself out." One engineer told me she batches three or four PRs into a single merge at the end of the day because waiting for each one individually would burn her entire afternoon.

Their CI pipeline took 47 minutes. On a good day.

How a Pipeline Becomes Sacred

This was a mid-sized fintech team — about 30 developers working on a TypeScript monolith with a React frontend and a Node.js backend. The pipeline lived in a single GitHub Actions workflow file that had grown to 380 lines over two years. Every quarter, someone had added a step. Nobody had ever removed one.

The team had an informal rule: don't touch the pipeline unless something is actively on fire. The last person who'd tried to "clean it up" had broken it for three days straight, and during those three days, nothing got merged. That was eight months before I arrived. The incident had calcified into institutional fear.

There was also Marcus. Marcus was the senior engineer who'd originally set up the pipeline and was the only person who truly understood the YAML. When Marcus went on a two-week holiday the previous summer, the team shipped exactly zero PRs in the first week because a flaky step started failing and nobody knew how to investigate it.

What 47 Minutes Looked Like

I mapped out the pipeline and found twelve sequential steps. Some of them made sense. Most of them didn't — at least not in 2026.

Everything ran sequentially. The Docker image was built from scratch on every run because nobody had configured layer caching. The integration tests spun up a fresh Postgres container, ran migrations, seeded data, and then executed tests — even on PRs that only touched frontend components. The E2E suite ran the full Playwright battery against every PR regardless of what changed.

Then there were the ghosts. The security scan step — added eighteen months ago after an audit — was pointing at a deprecated API endpoint that returned 200 OK with an empty response body. It had been "passing" for a year and a half without scanning anything. The coverage report uploaded results to a Codecov dashboard that the team had stopped checking after the free tier expired. The step still ran. It still took 90 seconds. Nobody questioned it.
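A step like that can only fail on a bad HTTP status, so an empty 200 counts as success. One defensive pattern, sketched here under assumptions, is to make a scan step fail when it gets back nothing to inspect. The `SCAN_URL` secret is hypothetical. The sketch also assumes the old scanner returned its findings in the response body, because the post doesn't describe the endpoint beyond the above.

```yaml
- name: Security scan (fails on empty response)
  env:
    # Hypothetical: the scanner endpoint, stored as a repository secret
    SCAN_URL: ${{ secrets.SCAN_URL }}
  run: |
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses
    body="$(curl --fail --silent --show-error "$SCAN_URL")"
    # Treat a 200 OK with an empty body as a failure, not a pass
    if [ -z "$body" ]; then
      echo "Security scan returned an empty response; treating as failure" >&2
      exit 1
    fi
    echo "$body"
```

Checking for an empty body is a crude heuristic. If the scanner has a known report format, a stronger check would assert that a specific field is present.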

Untangling It

The fixes weren't clever. That's what made the situation so frustrating — this wasn't a hard problem, it was an avoided one.

Parallelization. Linting, type checking, and unit tests don't depend on each other. We split them into three parallel jobs. This alone cut about eight minutes off the pipeline.

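jobs:
  checks:
    runs-on: ubuntu-latest
    strategy:
      # Keep the other checks running if one fails, so authors see every failure at once
      fail-fast: false
      matrix:
        # Each value becomes its own job, and matrix jobs run in parallel
        check: [lint, typecheck, unit-test]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20 # assumed; match the project's .nvmrc or engines field
          cache: npm       # reuse the npm download cache between runs
      - run: npm ci
      # Assumes package.json defines "lint", "typecheck" and "unit-test" scripts
      - run: npm run ${{ matrix.check }}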

Path filtering. We added path-based triggers so that frontend-only changes skipped the integration tests and backend-only changes skipped Playwright. A one-line docs change no longer triggered 47 minutes of compute.
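The post doesn't show the path configuration. This is a minimal sketch of one way to do it: move the integration tests into their own workflow and put a `paths` filter on the `pull_request` trigger. The file name, the directory globs (`src/server/**`, `migrations/**`), and the `test:integration` npm script are assumptions about the repository.

```yaml
# .github/workflows/integration-tests.yml (hypothetical file name)
name: integration-tests
on:
  pull_request:
    # Run only when backend code, migrations, or dependencies change (assumed layout)
    paths:
      - 'src/server/**'
      - 'migrations/**'
      - 'package.json'
      - 'package-lock.json'
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      # Hypothetical script: starts Postgres, runs migrations, seeds data, runs tests
      - run: npm run test:integration
```

A Playwright workflow would mirror this, listing the frontend paths instead. GitHub's documentation notes one caveat. When path filtering skips a workflow, that workflow's checks stay "Pending", so a pull request that requires them can't merge. If these checks are required, filtering at the job level inside an always-running workflow avoids the problem. The third-party `dorny/paths-filter` action is one way to do that.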

Docker layer caching. We added docker/build-push-action with the GitHub Actions cache backend. Image builds went from four minutes to about 40 seconds on cache hits.
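The post doesn't include the build step itself. This is a minimal sketch of the commonly documented pattern, assuming a Dockerfile at the repository root. The action major versions and the `app:ci` tag are assumptions.

```yaml
jobs:
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Buildx provides a builder that supports the GitHub Actions cache backend
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          # Build only; pushing is out of scope for PR checks
          push: false
          # Hypothetical image tag
          tags: app:ci
          # Read layers from the GitHub Actions cache...
          cache-from: type=gha
          # ...and write them back; mode=max also caches intermediate build stages
          cache-to: type=gha,mode=max
```

With `push: false` and no `load: true`, the built image stays inside the Buildx builder. Set `load: true` if a later step in the same job needs to run it.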

Removed the dead steps. The broken security scan was replaced with a properly configured Trivy scan. The orphaned coverage upload was deleted. We also consolidated ESLint and Prettier into a single lint step. The team had already migrated to ESLint's flat config but had kept separate steps out of habit.
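The post doesn't say how the Trivy scan was configured. This is a minimal sketch that assumes a filesystem scan of the checked-out repository via the `aquasecurity/trivy-action` action, with an `actions/checkout` step earlier in the job. The version tag and severity threshold are also assumptions.

```yaml
- name: Trivy filesystem scan
  # Third-party action; pin to a release tag or commit SHA you have reviewed
  uses: aquasecurity/trivy-action@0.28.0
  with:
    # Scan the repository contents (dependencies, lockfiles, config files)
    scan-type: fs
    scan-ref: .
    # Only the severities the team has agreed to block on (assumed)
    severity: CRITICAL,HIGH
    # Non-zero exit code fails the job when findings are present
    exit-code: '1'
```

Given how the old step failed, the setting that matters most is `exit-code: '1'`. Findings at the listed severities fail the build instead of passing quietly.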

Tip

If you haven't audited your CI pipeline in the last six months, do it this week. Run each step in isolation and ask: does this still do what we think it does?
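One way to run a single check in isolation without touching the PR path is a manual `workflow_dispatch` trigger. This sketch assumes the same npm script names as the matrix example. The `single-check` job name is illustrative.

```yaml
on:
  pull_request:
  # Allows running the workflow by hand from the Actions tab or the gh CLI
  workflow_dispatch:
    inputs:
      check:
        description: Which check to run in isolation
        type: choice
        options:
          - lint
          - typecheck
          - unit-test
jobs:
  single-check:
    # Only runs for manual dispatches, never on pull requests
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run ${{ inputs.check }}
```

It can then be started from the Actions tab, or with the GitHub CLI, for example `gh workflow run ci.yml -f check=lint` (the `ci.yml` file name is an assumption). If the existing PR jobs shouldn't also run on a manual dispatch, give them `if: github.event_name == 'pull_request'`.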

The Result

Typical PR build time dropped from 47 minutes to 12 minutes. Frontend-only changes finished in about 7 minutes. The median time from "push" to "green check" went from "go get lunch" to "refill your coffee."

But the more interesting change was behavioral. Developers started watching their builds again. PRs got smaller because there was less incentive to batch them. Review cycles shortened because authors could respond to feedback and re-run the pipeline without losing an hour. The team's weekly PR throughput went up by about 40% in the first month — not because anyone wrote code faster, but because the dead time between writing code and landing it shrank.

We also wrote a PIPELINE.md that documented every step, why it existed, and who to talk to if it broke. Marcus was relieved. He'd been the reluctant pipeline guardian for two years and hated that the role had become part of his identity.

The Uncomfortable Part

None of this was technically difficult. A junior engineer could have made most of these changes. The problem was organizational: a combination of learned helplessness, single-point-of-failure knowledge, and the very human tendency to avoid touching something that works, even when "works" means "wastes 35 minutes per build, several builds per developer, every day."

I keep wondering how many teams are sitting on similar pipelines right now. Forty-minute builds that nobody questions because the pain has become normal. Zombie steps that exist because removing them feels riskier than leaving them. A YAML file that only one person understands.

How long has it been since you actually read your CI configuration end to end?