Six Dashboards, Zero Answers


A client had six monitoring tools and still couldn't diagnose a production incident in under an hour. The problem wasn't the tools — it was what happens when observability grows by accretion instead of design.

The Slack message came in at 10:14 AM on a Tuesday. "API errors spiking, customers reporting timeouts." By the time the on-call engineer started investigating, four different people had already pasted screenshots from four different dashboards into the incident channel. None of them told the same story.

This was week two of a consulting engagement focused on reliability improvements for a fintech client. Their stack was a fairly standard Node.js and Python microservices architecture running on AWS, serving about 12,000 requests per minute at peak. The system itself was fine. Their ability to understand the system when something went wrong was not.

The archaeology of monitoring

Nobody at the company had set out to build a six-tool monitoring stack. It just happened. Over three years, each tool arrived with a reasonable justification.

CloudWatch was there from the start because they ran on AWS. Then someone added Sentry for error tracking because CloudWatch's error reporting was too coarse. Datadog came in when the platform team needed APM traces. A previous contractor had set up a Grafana instance pointing at Prometheus for custom infrastructure metrics. PagerDuty handled alerting. And there was an aging Elasticsearch cluster that ingested application logs, queried through a Kibana dashboard that two people knew how to use.

Each tool had its own retention policy, its own tagging conventions, its own authentication. The Datadog traces used service names like payment-api, while Sentry called the same service payments-service. CloudWatch metrics were tagged by Auto Scaling group names that hadn't matched the actual service names since a migration eight months earlier.
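To make the naming mismatch concrete, here is a minimal sketch of an alias table that maps each tool's label for a service onto one canonical name. Only payment-api and payments-service come from the incident; the Auto Scaling group name and both helper functions are hypothetical, and this is an illustration rather than what the client actually ran.

```python
from collections import defaultdict

# Hypothetical alias table: each tool's label for a service -> one canonical name.
# "asg-payments-prod-v2" is a made-up Auto Scaling group name for illustration.
SERVICE_ALIASES = {
    "payment-api": "payment-api",           # Datadog
    "payments-service": "payment-api",      # Sentry
    "asg-payments-prod-v2": "payment-api",  # CloudWatch (hypothetical)
}


def canonical_service(name: str) -> str:
    """Return the canonical service name, or the cleaned input if unknown."""
    key = name.strip().lower()
    return SERVICE_ALIASES.get(key, key)


def group_by_service(events):
    """Group (tool, service_label, message) tuples under canonical service names."""
    grouped = defaultdict(list)
    for tool, service, message in events:
        grouped[canonical_service(service)].append((tool, message))
    return dict(grouped)
```

With a table like this, a Datadog alert and a Sentry exception for the same service land under one key instead of two.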

What the incident looked like

Back to that Tuesday morning. The actual problem turned out to be a downstream payment provider intermittently returning HTTP 503s. It took about 55 minutes to confirm, not because the issue was complex, but because diagnosing it meant manually stitching together data from three different tools.

The on-call engineer started in PagerDuty, which pointed to a Datadog alert on elevated 5xx rates. She opened Datadog and could see the error spike, but the traces for the failing requests showed the error originating from an outbound HTTP call — no further detail. The actual response bodies from the payment provider weren't in Datadog. They were in the application logs. Which were in Elasticsearch.

She didn't have a Kibana bookmark. Slack search turned up a link from six months ago. The Kibana dashboard loaded, but the default time range was 24 hours and the query syntax was something she'd never used before. Meanwhile, someone else was in Sentry looking at a spike in PaymentProviderError exceptions, which actually had the upstream response body in the stack context — but nobody knew to look there first.

Fifty-five minutes. For what was ultimately "a third-party is returning 503s."

The real cost isn't the subscription fees

When I sat down with the team lead afterward, his first instinct was to talk about consolidating tools to save money. They were spending around $4,200 per month across all the paid tools. That's real money, but it wasn't the real problem.

The real problem was cognitive overhead during incidents. Every tool switch is a context switch. Every context switch costs time and attention. And during an incident, attention is the scarcest resource you have.

I asked the four engineers who'd been in the incident channel to independently write down the steps they took during the investigation. The overlap was minimal. They were all looking at different tools, at different time ranges, with different filters. Nobody had a shared mental model of where to start or how to escalate.

Note

If your team can't agree on which dashboard to open first during an incident, you don't have an observability strategy — you have an observability collection.

What we actually did

We didn't rip everything out. That never works. Instead, we did three things over the next six weeks.

First, we picked a single starting point. Every investigation would begin in Datadog, which had the broadest coverage. We configured it to be the entry point for all alerts through PagerDuty, and built a single "incident triage" dashboard that showed error rates, latency percentiles, and throughput for all services on one screen.
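The three signals on that triage screen are simple to compute from raw request records. The sketch below shows one way to derive them per service, assuming a hypothetical Request record and a nearest-rank percentile. It illustrates what the dashboard summarizes, not how Datadog computes it.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    service: str
    status: int        # HTTP status code
    latency_ms: float


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def triage_summary(requests, window_seconds):
    """Per-service error rate, latency percentiles, and throughput for one window."""
    by_service = {}
    for r in requests:
        by_service.setdefault(r.service, []).append(r)

    summary = {}
    for service, reqs in by_service.items():
        latencies = sorted(r.latency_ms for r in reqs)
        errors = sum(1 for r in reqs if r.status >= 500)
        summary[service] = {
            "error_rate": errors / len(reqs),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
            "requests_per_min": len(reqs) * 60 / window_seconds,
        }
    return summary
```

Putting all services in one summary is the point: a 5xx spike in one service is easier to interpret when the error rates of its neighbors are visible next to it.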

Second, we closed the gaps between tools. The biggest pain point was jumping from a trace to the relevant logs. We added trace IDs to the structured log output and configured Datadog's log integration so that clicking a trace would show the associated log lines. This single change eliminated the Kibana detour for 80% of investigations.
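For readers who haven't wired this up before, here is a minimal sketch of attaching a trace ID to structured logs using only Python's standard library. The trace_id field name, the current_trace_id variable, and build_logger are assumptions for illustration. In a real service the tracing library supplies the ID, and the attribute name has to match whatever your log backend expects for correlation.

```python
import contextvars
import json
import logging

# Holds the trace ID of the request currently being handled.
# Assumption: in a real service the tracing library provides this value.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the backend can index the fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines carrying the trace ID."""
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would call current_trace_id.set(...) once at the start of the request. Every log line written while handling it then carries the same ID, which is what lets a trace view link to its log lines.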

Third, we decommissioned what we could. The Grafana/Prometheus setup was only used for three custom dashboards. We rebuilt those in Datadog in an afternoon. The Elasticsearch cluster stayed for now — migrating the full log pipeline was a bigger project — but it was no longer the first place anyone needed to go during an incident.

Sentry stayed too. Its error grouping and release tracking are genuinely good and cover ground Datadog doesn't handle well. But we made it a second-tier tool: something you go to for deep error analysis, not something you check during initial triage.

The after

Two months later, the same team handled a similar incident — a different upstream provider, same class of failure. Time to diagnosis: eight minutes. Not because they had better tools. Because they had one path through the tools, and everyone knew what it was.

The monthly tooling spend dropped from $4,200 to about $3,100 after decommissioning Grafana Cloud and downsizing the Elasticsearch cluster. Useful savings, but secondary to the fact that the median incident response time went from 47 minutes to 14 minutes over the following quarter.

The pattern I keep seeing

This wasn't a unique situation. I've seen variations of this at four different clients in the past two years. The pattern is always the same: tools get added to solve specific problems, nobody is responsible for the overall observability architecture, and over time the monitoring stack becomes a problem in itself.

The industry is catching up to this. There's a visible push toward consolidated observability platforms right now, and for good reason. But consolidation alone doesn't fix anything if you don't also establish a shared investigation workflow. A single tool with no agreed-upon starting dashboard is only marginally better than six tools with no agreed-upon starting dashboard.

If your team has more than three monitoring tools and no written runbook for where to start an investigation, it might be worth spending an afternoon mapping out the actual paths your engineers take during incidents. The gaps and redundancies tend to be surprising.
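If you want something more structured than a whiteboard for that exercise, here is a rough sketch. It assumes you've collected each engineer's ordered list of tools from a past incident; the function names and the choice of Jaccard overlap are mine, purely for illustration.

```python
from collections import Counter
from itertools import combinations


def first_tool_counts(paths):
    """Count which tool each engineer opened first."""
    return Counter(steps[0] for steps in paths.values() if steps)


def pairwise_overlap(paths):
    """Jaccard overlap of the tool sets used by each pair of engineers."""
    overlap = {}
    for (a, steps_a), (b, steps_b) in combinations(paths.items(), 2):
        set_a, set_b = set(steps_a), set(steps_b)
        union = set_a | set_b
        overlap[(a, b)] = len(set_a & set_b) / len(union) if union else 1.0
    return overlap


# Hypothetical paths, loosely modeled on the incident described above.
paths = {
    "engineer_a": ["PagerDuty", "Datadog", "Kibana"],
    "engineer_b": ["Sentry"],
    "engineer_c": ["CloudWatch", "Datadog"],
}
```

If first_tool_counts returns a different tool for nearly every engineer, or most pairwise overlaps sit near zero, that is the missing shared starting point showing up in data.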