The Postmortems We Wrote But Never Acted On

We ran blameless postmortems after every incident. We wrote detailed action items with owners and deadlines. Six months later, 78% of those items were still open — and the same incidents kept happening.
Earlier this year I worked with a team that had, on paper, one of the best incident response processes I'd seen. Dedicated Slack channels per incident. A rotating incident commander role. Blameless postmortem within 48 hours. Detailed write-ups in Notion with timelines, root cause analysis, and action items assigned to specific engineers with due dates.

They also had the same database connection pool exhaustion incident three times in four months.

The Shelf of Good Intentions

When I audited their postmortem documents — 34 of them spanning the previous six months — a pattern jumped out immediately. The write-ups were thorough. The root cause analysis was usually correct. The action items were specific and reasonable. But when I cross-referenced those action items against what had actually been done, the numbers were grim.

Out of 126 total action items across those 34 postmortems, 98 were still open. That's a 22% completion rate. The median age of an open item was 47 days. Some had been sitting untouched for five months.

The team wasn't lazy. Their sprint velocity was solid. They shipped features on schedule. They just never got around to the postmortem work because it was always less urgent than whatever the product team needed next.

Why the System Failed

I spent a week digging into why these items rotted. Three causes kept surfacing.

Action items went into a different system than sprint work. Postmortem items lived in Notion. Sprint work lived in Jira. Engineers planned their weeks from Jira. Nobody opened the Notion doc again after the postmortem meeting ended. The action items existed in a parallel universe that had no intersection with how work actually got prioritized.

No escalation path. When an action item's due date passed, nothing happened. No notification, no standup mention, no manager follow-up. The due dates were aspirational, not operational. I asked one engineer about an item assigned to them eight weeks prior — they'd genuinely forgotten it existed.

The items were too big. "Implement circuit breakers across all external service calls" is a project, not a task. It sat in the backlog because nobody could justify pulling it into a sprint alongside feature work. Meanwhile, the narrower version — add a circuit breaker to the payment gateway call that actually caused the outage — would have taken half a day.

Warning

If a postmortem action item can't be finished in a single sprint, it's not an action item. It's a project proposal wearing an action item's clothes.

What We Changed

The fixes were boring. No new tools. No expensive incident management platform. Just process adjustments that made the existing work visible.

First, we moved action items into Jira. Every postmortem action item became a ticket, tagged with an incident-followup label, linked to the postmortem document. This meant they showed up in backlog grooming, sprint planning, and the team's velocity metrics. They were no longer invisible.
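As a sketch of what that looked like per item: each action item became one call to Jira's issue-creation endpoint (POST /rest/api/2/issue) with the label and postmortem link attached. The project key, issue type, and helper name below are illustrative, not the team's actual values; only the incident-followup label and the link back to the postmortem are the conventions described above.

```python
def followup_issue(project_key, summary, postmortem_url):
    """Build the payload for Jira's POST /rest/api/2/issue endpoint.

    The field names ("project", "summary", "labels", ...) are standard
    Jira REST fields; the project key and issue type passed in are
    whatever the team uses for sprint work, so the ticket lands in the
    same backlog as feature stories.
    """
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            # Link back to the postmortem so context is one click away.
            "description": f"Follow-up from postmortem: {postmortem_url}",
            "issuetype": {"name": "Task"},
            # The label is what makes these queryable as a group later.
            "labels": ["incident-followup"],
        }
    }
```

Because the label is machine-readable, every later step (the weekly digest, the sprint-scope check) becomes a one-line query instead of a manual audit.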

Second, we added a weekly digest. Every Monday, a simple script queried Jira for open incident-followup tickets and posted a summary to the team's Slack channel. Just the ticket title, assignee, and age in days. No shaming, no commentary. But suddenly everyone could see that Sarah's connection pool fix had been open for 23 days, and that created enough social pressure to get it prioritized.
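The digest really was that simple: one JQL query, a little string formatting, a Slack webhook post. A minimal sketch of the formatting half (function and field names are my illustration, not the team's script; the Jira fetch and Slack post are noted in comments):

```python
from datetime import datetime, timezone

# Run against Jira's search API to fetch the open follow-ups.
JQL = "labels = incident-followup AND statusCategory != Done"

def format_digest(tickets, now=None):
    """Render the Monday digest: title, assignee, and age in days.

    `tickets` is a list of dicts reduced from the Jira search results
    to the three fields the digest needs: summary, assignee, created.
    """
    now = now or datetime.now(timezone.utc)
    lines = ["Open incident follow-ups:"]
    for t in sorted(tickets, key=lambda t: t["created"]):
        age = (now - t["created"]).days
        lines.append(f"- {t['summary']} ({t['assignee']}, {age}d old)")
    # The resulting string gets POSTed to a Slack incoming webhook.
    return "\n".join(lines)
```

No shaming logic needed: listing the age in days does the work on its own.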

Third, we right-sized the items. During each postmortem, I pushed back on any action item that couldn't be completed within two weeks. "Implement circuit breakers everywhere" became "add a circuit breaker to the payment service client with a 5-second timeout and fallback to cached response." The big systemic improvements got written up as separate proposals that went through the normal planning process — not disguised as postmortem follow-ups.
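To show how small the right-sized version really is: a serviceable circuit breaker fits in a few dozen lines. This is a simplified sketch, not the team's actual implementation; the thresholds and the idea of falling back to a cached response mirror the reworded action item above.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker around one external call.

    After `max_failures` consecutive failures the circuit opens and
    calls go straight to the fallback (e.g. a cached response) for
    `reset_after` seconds; then one trial call is allowed through.
    """

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self.call = call              # the protected operation (with its own timeout)
        self.fallback = fallback      # e.g. serve the last cached response
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None         # timestamp when the circuit tripped

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit open: don't even attempt the call.
                return self.fallback(*args, **kwargs)
            self.opened_at = None     # half-open: allow one trial call
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the circuit
            return self.fallback(*args, **kwargs)
        self.failures = 0             # any success resets the count
        return result
```

Wrapping the payment gateway call means passing it in with its timeout already set (e.g. `requests.get(url, timeout=5)` inside the wrapped function), which is exactly the half-day scope the narrower action item described.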

Fourth, we gave the engineering manager veto power over sprint scope. If the team had more than three open incident follow-up items, the EM could pull a lower-priority feature story to make room. This was the hardest change to land because the product team pushed back, but one conversation made it click: I showed them the three identical connection pool incidents and asked how much engineering time the repeated pages, investigation, and customer communication had cost. The answer was roughly 40 engineer-hours. The fix was an 8-hour task that had been deprioritized three sprints in a row.

Three Months Later

The completion rate went from 22% to 81%. More importantly, the repeat incident rate dropped. Over the following quarter, they had zero incidents that traced back to a known, previously identified root cause. They still had new incidents — that's unavoidable — but they stopped having the same ones.

The median time from postmortem to action item completion went from "never" to 9 days.

I won't pretend this was revolutionary. None of these ideas are new. But that's kind of the point. The team didn't lack knowledge about incident management best practices. They had read the Google SRE book. They knew what blameless postmortems were supposed to look like. The failure wasn't in the postmortem itself — it was in the gap between writing down what should change and actually changing it.

The Uncomfortable Part

Here's what I keep thinking about after this engagement. Postmortems have become a ritual in our industry. We perform them because good engineering teams are supposed to. But a postmortem that generates action items nobody completes is worse than no postmortem at all — it creates an illusion of learning while the same failure modes sit dormant, waiting.

The real test of an incident response culture isn't whether you write postmortems. It's whether you can look at last quarter's action items and honestly say most of them are done. If you've never checked, that might be worth an uncomfortable afternoon with a spreadsheet.
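If you'd rather skip the spreadsheet, the same check is a few lines over a tracker export. A sketch, assuming a CSV (read with `csv.DictReader`) with `status` and `created` columns; the function name and column names are mine, not from any particular tool:

```python
from datetime import date
from statistics import median

def audit(rows, today):
    """Compute (completion_rate, median_open_age_days) for action items.

    `rows` are dicts with a 'status' string and a 'created' ISO date,
    e.g. produced by csv.DictReader over a tracker export.
    """
    if not rows:
        return 0.0, 0
    open_items = [r for r in rows if r["status"].lower() != "done"]
    rate = 1 - len(open_items) / len(rows)
    ages = [(today - date.fromisoformat(r["created"])).days
            for r in open_items]
    return rate, (median(ages) if ages else 0)
```

If the numbers that come back look anything like 22% and 47 days, you already know how the afternoon is going to feel.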