Our P99 Latency Nearly Quadrupled in Three Months and Nobody Noticed

A client's API was getting measurably slower every week. The dashboards were green, the alerts were silent, and the database looked healthy. The problem was hiding in plain sight — on the container's local disk.


When a client's support team started hearing "it feels slow" from their users in March, the engineering team pulled up their Grafana dashboards and saw nothing wrong. Average response times were within acceptable bounds. Error rates were flat. The alerts were quiet.

"Users are probably just complaining," someone suggested in the standup. They moved on.

I joined the engagement six weeks later for unrelated infrastructure work. During my onboarding, the tech lead mentioned the slowness complaints almost as an aside. "We looked into it. Dashboards are clean. Might be client-side."

On a hunch, I asked if I could see their raw latency data. Not the averages — the percentiles.

The trend nobody was watching

Their monitoring setup was typical for a mid-size team: Prometheus scraping metrics, Grafana dashboards, PagerDuty alerts. The dashboards showed average response time, which hovered around 180ms. Well within their 300ms SLO. Green across the board.

But when I plotted the P99 latency over the past 90 days, the picture was different. In early January, P99 was sitting at around 220ms. By mid-February, it had crept to 400ms. When I checked the current numbers in late March, it was at 870ms.

The P99 had nearly quadrupled. And their alerting hadn't fired once, because every alert was configured as a static threshold on the average — and the average was still fine.
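To see how an average can hide this, here is a minimal Python sketch with made-up traffic: 1,000 requests where 98% take 170ms and 2% hit a slow path. The request mix and the `percentile` and `summarize` helpers are illustrative assumptions, not the client's data; only the 220ms and 870ms tail values echo the numbers above.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def summarize(samples):
    """Return (mean, P99) for a list of latencies in milliseconds."""
    mean = sum(samples) / len(samples)
    return round(mean, 1), percentile(samples, 99)

# Hypothetical traffic: the fast path never changes, only the slow 2% gets worse.
january = [170] * 980 + [220] * 20
march = [170] * 980 + [870] * 20

print(summarize(january))  # (171.0, 220)
print(summarize(march))    # (184.0, 870)
```

In this toy data the mean moves by 13ms while the P99 nearly quadruples, so a static threshold on the mean stays quiet the whole time.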

The usual suspects

I started where anyone would. The database. Query performance in pg_stat_statements was stable — the top 20 queries hadn't moved in execution time. Connection pool utilization was healthy at around 60%. No lock contention.

Network latency between services? Flat. External API call times? Unchanged. CPU usage on the application pods? A steady 35-40%.

Then I noticed the disk I/O metrics. The containers were showing elevated iowait — not dramatic, but a slow upward trend that matched the latency pattern almost exactly.

The logs that wouldn't leave

The application was a Node.js service running in Kubernetes. Like most services, it wrote structured JSON logs via pino. Good practice, nothing unusual.

The log shipping configuration had a gap, though. The team had set up Fluentd to forward logs to their centralized logging platform, and it was working. But the application also wrote to stdout, which the container runtime captured to the node's local disk via the default JSON log driver. Two copies of every log line: one shipped properly, one quietly accumulating on disk.

Here's where it got interesting. The containers had been running for over four months without recycling. The team was proud of their uptime — zero restarts, no OOM kills, no crashloops. But that meant four months of logs piling up on each node's ephemeral storage.

$ kubectl exec -it api-server-7b4f9 -- du -sh /var/log/
18G     /var/log/

Eighteen gigabytes of unrotated JSON logs on a container with 20GB of ephemeral storage. The disk was 90% full, and every log write was competing with the application for I/O bandwidth on the same volume.
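A rough back-of-envelope shows how little runway was left. This sketch assumes the logs grew at a constant rate from an empty disk over about 120 days, which is my approximation of "over four months"; the function name and the linear-growth assumption are mine, not something the team measured.

```python
def days_until_full(used_gb, capacity_gb, days_elapsed):
    """Linear extrapolation: assumes constant growth starting from an empty disk."""
    growth_per_day = used_gb / days_elapsed
    return (capacity_gb - used_gb) / growth_per_day

# 18GB used of 20GB after roughly 120 days (assumed).
print(round(days_until_full(18, 20, 120), 1))  # 13.3
```

Under those assumptions the volume was about two weeks away from filling completely, at which point the symptom would have stopped being latency and started being failed writes.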

The team had log rotation configured in their Dockerfile via logrotate. But logrotate is not a daemon: something has to invoke it on a schedule, usually cron, and the container didn't have cron running. The configuration existed. It just never executed. Nobody had tested it because deploys used to happen frequently enough that containers rarely lived longer than a few days.

The fix

The immediate fix took about 20 minutes. We configured log rotation in the container runtime itself, via Docker's daemon.json on each node, capping file size and count:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}

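Docker applies these log options only to containers created after the change, so we then did a rolling restart of the affected pods. With three files of 100MB each, every container is now capped at roughly 300MB of logs. P99 latency dropped from 870ms back to 230ms within the hour.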

The longer-term fix had three parts.

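We moved log rotation to the platform rather than relying on application-level logrotate that assumed a full Linux init system. On containerd-based nodes that job belongs to the kubelet, which rotates container logs according to its containerLogMaxSize and containerLogMaxFiles settings.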

More importantly, we added trend-based alerting. Instead of just "alert if average latency exceeds 300ms," we added a rule that triggered if P99 latency increased by more than 20% week-over-week for two consecutive weeks. This catches gradual degradation early, before users start complaining.
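I won't reproduce the actual alert rule here. As a sketch of the logic only, this is the same check in Python; the function name, the weekly sampling, and the sample series are assumptions for illustration.

```python
def sustained_regression(weekly_p99_ms, threshold=0.20, consecutive=2):
    """True if each of the last `consecutive` week-over-week increases exceeds `threshold`."""
    if len(weekly_p99_ms) < consecutive + 1:
        return False  # not enough history to judge
    recent = weekly_p99_ms[-(consecutive + 1):]
    for previous, current in zip(recent, recent[1:]):
        if previous <= 0 or (current - previous) / previous <= threshold:
            return False
    return True

# Hypothetical series with two sharp weeks in a row: fires.
print(sustained_regression([220, 230, 290, 360]))  # True

# Hypothetical perfectly smooth climb from 220ms to 870ms over 12 weeks,
# which works out to about 12% per week.
smooth = [round(220 * (870 / 220) ** (week / 12)) for week in range(13)]
print(sustained_regression(smooth))                  # False
print(sustained_regression(smooth, threshold=0.10))  # True
```

The second case is worth a look before copying the 20% figure: a perfectly smooth version of the climb in this post would have stayed under it. Real latency curves are lumpier than that, but the threshold is something to tune against your own history rather than take as given.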

We also set pod lifetime limits. Healthy containers running forever sounds good in theory, but it lets slow-burn issues accumulate. The team configured periodic voluntary restarts — not because the code needed it, but because the environment did.

Note

Most monitoring setups are designed to detect sudden failures, not gradual degradation. A system can get 2% slower every week and go months without crossing a static threshold.

The real lesson

The problem here wasn't really about logs or disk space. Those were symptoms.

The real gap was that the team's monitoring was optimized for detecting sudden failures — spikes, outages, error rate jumps — and completely blind to gradual degradation. Their dashboards showed the last hour or the last day. Nobody had a 90-day view of P99 latency. Nobody was tracking week-over-week trends.

This is more common than people think. Most alerting setups answer "is it broken right now?" and never ask "is it getting worse?" A system can degrade 2% per week and still look healthy on every dashboard for months. By the time it crosses a static threshold, users have been suffering silently — or worse, they've stopped bothering to report it.
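To put a number on that, here is a small sketch of the compounding. It assumes a metric that starts at the 180ms average from earlier and grows a steady 2% per week toward the 300ms threshold; steady compounding is a simplification, and the function name is mine.

```python
import math

def weeks_to_cross(start_ms, threshold_ms, weekly_growth):
    """Whole weeks of compounding growth before start_ms reaches threshold_ms."""
    if start_ms >= threshold_ms:
        return 0
    return math.ceil(math.log(threshold_ms / start_ms) / math.log(1 + weekly_growth))

print(weeks_to_cross(180, 300, 0.02))  # 26
```

That is half a year of steady degradation before the static alert says anything.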

I've started recommending that every team I work with adds at least one trend-based alert for their core latency metrics. Something that watches the direction, not just the current value. It won't catch everything. But it catches the class of problems that static thresholds miss entirely: the slow leaks, the creeping latencies, the resources that fill up one percent at a time.

What gradual degradation is hiding in your dashboards right now?