The Memory Limit We Copy-Pasted From Stack Overflow

A client's pods were getting OOMKilled during peak traffic, but the team spent days chasing application bugs. The real problem was resource limits that nobody had revisited since the initial cluster setup.

The first thing the tech lead told me was "it's definitely a memory leak in the Node.js service." Their order processing API had been crashing two or three times a day for the past week, always during the afternoon traffic peak. The team had already spent four days on it. They'd profiled heap usage locally, added memory tracking middleware, and even replaced their image processing library thinking it was the culprit.

None of it helped. The crashes kept happening.

The clue nobody checked

I asked to see the pod events. The engineer I was working with pulled up Grafana, showed me request latency charts, error rate dashboards. All useful, but not what I was after.

kubectl describe pod order-api-7d4f8b6c9-x2k1m -n production

About two-thirds of the way down the output, there it was:

Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137

The pods weren't crashing because of a bug. They were being killed: every time the container went over its memory limit, the kernel's OOM killer terminated it, and Kubernetes recorded the reason.
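If you'd rather script this check than scroll through describe output, the same information is available in the pod's JSON status. Here is a minimal sketch in Python, assuming you have saved the output of kubectl get pod with -o json. The sample document is trimmed and hypothetical (the container name order-api and the restart count are my assumptions, not values from the client's cluster); only the field names follow the Kubernetes pod status schema.

```python
import json

# Trimmed, hypothetical sample of `kubectl get pod <name> -o json` output.
# Only the fields this check reads are included; the container name and
# restart count are illustrative assumptions.
SAMPLE_POD_JSON = """
{
  "metadata": {"name": "order-api-7d4f8b6c9-x2k1m", "namespace": "production"},
  "status": {
    "containerStatuses": [
      {
        "name": "order-api",
        "restartCount": 3,
        "lastState": {
          "terminated": {"reason": "OOMKilled", "exitCode": 137}
        }
      }
    ]
  }
}
"""


def oom_killed_containers(pod: dict) -> list:
    """Return one entry per container whose last termination was an OOM kill."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            hits.append({
                "container": status["name"],
                "restarts": status.get("restartCount", 0),
                "exit_code": terminated.get("exitCode"),
            })
    return hits


if __name__ == "__main__":
    pod = json.loads(SAMPLE_POD_JSON)
    for hit in oom_killed_containers(pod):
        print(f"{hit['container']}: OOMKilled, exit code {hit['exit_code']}, "
              f"{hit['restarts']} restarts")
```

A container that has never been restarted has an empty lastState, so the function simply returns an empty list for healthy pods.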

256 megabytes, forever

The team had containerized the application about eighteen months earlier. During the initial setup, someone had set the resource limits in the deployment manifest:

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"

When I asked where the 256Mi number came from, the answer was refreshingly honest: "I think we got it from a tutorial." The application had been much smaller back then — fewer endpoints, no image processing, no PDF generation. The limit worked fine for months, so nobody thought about it again.

But the application had grown. New features meant new dependencies. The PDF generation library alone could spike memory usage by 80MB per concurrent request. During afternoon peaks, when they were processing 40-60 orders per minute, three or four PDF generations happening simultaneously would push the container past its 256Mi limit. The container would be OOM-killed, the kubelet would restart it in place, and for 10-15 seconds requests would fail or queue up.
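The arithmetic is worth spelling out. The sketch below is a rough back-of-the-envelope model, not a measurement: it assumes the 80MB figure is in decimal megabytes, that concurrent PDF spikes simply add up, and it reads the limit as 256Mi (mebibytes), which is what the manifest says. The function name is mine.

```python
MIB = 1024 * 1024   # Kubernetes "Mi" suffix: mebibytes
MB = 1000 * 1000    # decimal megabytes, assumed for the figures in the prose

LIMIT_BYTES = 256 * MIB      # limits.memory: "256Mi"
PDF_SPIKE_BYTES = 80 * MB    # per concurrent PDF generation (figure from the text)


def remaining_for_everything_else(concurrent_pdfs: int) -> float:
    """MB left under the limit for the rest of the process after PDF spikes.

    A negative result means the PDF spikes alone already exceed the limit.
    """
    return (LIMIT_BYTES - concurrent_pdfs * PDF_SPIKE_BYTES) / MB


if __name__ == "__main__":
    for n in range(1, 5):
        left = remaining_for_everything_else(n)
        print(f"{n} concurrent PDFs -> {left:.1f} MB left under the limit")
```

Under those assumptions, three concurrent PDF generations leave roughly 28 MB for the Node.js runtime and everything else, and a fourth puts the PDF work alone about 52 MB over the limit.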

Why the team didn't catch it

This is the part that's worth talking about. The team had monitoring. They had Grafana dashboards, Prometheus metrics, PagerDuty alerts. But their alerting was built around application-level signals — HTTP error rates, response times, database query duration. Nobody had set up alerts for container-level events like OOMKills.

The application logs showed nothing useful because there was nothing to log. When a process gets killed by the OOM killer, it doesn't get a chance to write a graceful error message. The process just stops. From the application's perspective, it's as if the power went out.
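The exit code in the describe output tells the same story. By the usual convention, an exit code above 128 means the process was terminated by a signal, and 137 is 128 + 9: SIGKILL, which a process cannot catch or handle. A small sketch of that decoding in Python (the function name is mine):

```python
import signal


def decode_exit_code(exit_code: int) -> dict:
    """Split an exit code into 'exited on its own' vs 'killed by signal N'.

    Convention used by shells and container runtimes: codes above 128 mean
    the process was terminated by signal number (exit_code - 128).
    """
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return {"killed_by_signal": True, "signal_number": signum, "signal_name": name}
    return {"killed_by_signal": False, "signal_number": None, "signal_name": None}


if __name__ == "__main__":
    print(decode_exit_code(137))
```

On Linux this reports signal 9, SIGKILL, for exit code 137. Because SIGKILL never reaches application code, no shutdown hook or logger in the service gets a chance to run.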

The team's mental model was "crashes = bugs in our code." That's a reasonable default, but it made them blind to infrastructure-level causes. They were debugging the application while the infrastructure was the one pulling the trigger.

Tip

If your pods are restarting and your logs show no errors, check kubectl describe pod for OOMKilled status. It's the first thing to rule out before chasing application-level memory leaks.

The fix was embarrassingly simple

We profiled actual memory usage under realistic load, running a load test that mimicked afternoon traffic patterns: 50 concurrent users and a mix of order creation and PDF generation. Peak memory usage hit 410MB.

We updated the limits:

resources:
  requests:
    memory: "512Mi"
    cpu: "200m"
  limits:
    memory: "768Mi"
    cpu: "1000m"
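
To sanity-check a limit against a measured peak, it helps to put both numbers in the same unit. The sketch below is one way to do that in Python. It handles only a subset of Kubernetes memory suffixes, the helper names are mine, and it assumes the 410MB load-test peak is in decimal megabytes.

```python
import re

# Multipliers for a subset of the memory suffixes Kubernetes accepts.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
    "": 1,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '768Mi' or '410M' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if not match or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(float(match.group(1)) * _SUFFIXES[match.group(2)])


def headroom_ratio(limit: str, observed_peak: str) -> float:
    """How many times the observed peak fits inside the limit."""
    return parse_memory(limit) / parse_memory(observed_peak)


if __name__ == "__main__":
    peak = "410M"  # load-test peak, read as decimal megabytes (assumption)
    for limit in ("256Mi", "768Mi"):
        print(f"{limit}: {headroom_ratio(limit, peak):.2f}x the measured peak")
```

Read this way, the old 256Mi limit covered only about 65% of the measured peak, while 768Mi is a little under twice the peak.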

The request sat above the 410MB peak we had measured, so the scheduler would reserve enough memory for the pod even on a busy afternoon. The limit gave enough headroom for larger spikes without letting a genuine leak consume the node. We also added a Prometheus alert for container restarts with an OOMKilled reason:

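- alert: PodOOMKilled
  # last_terminated_reason stays at 1 until the pod is replaced, so also require
  # a recent restart; the 10m window is a judgment call, tune it to taste.
  expr: |
    (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
    and on (namespace, pod, container)
    (increase(kube_pod_container_status_restarts_total[10m]) > 0)
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOMKilled"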

The afternoon crashes stopped immediately.

The actual lesson

Resource limits aren't a set-and-forget decision. They're a living part of your infrastructure config that should evolve as your application does. I've seen this pattern at four different clients now: limits get set during initial deployment based on a guess, a tutorial, or whatever the default Helm chart ships with. Then nobody revisits them.

Most teams treat Kubernetes manifests as boilerplate. They're YAML files that got copy-pasted during setup and haven't been meaningfully reviewed since. But those manifests control how your application behaves under pressure. They deserve the same attention as application code — periodic review, load testing, and monitoring.

If you're running workloads in Kubernetes, spend an hour checking whether your resource limits still make sense. Pull up actual memory and CPU usage from the last 30 days, compare it to what's in your manifests, and adjust. It's not glamorous work, but it's the kind of thing that prevents multi-day debugging detours that end with "oh, it was the YAML."
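
As a starting point for that review, here is a hedged sketch of pulling the 30-day memory peak out of Prometheus. It assumes your Prometheus scrapes the cAdvisor metric container_memory_working_set_bytes and retains at least 30 days of data. The Prometheus address and the container name order-api are placeholders of mine, not values from the client's setup. The code only builds the query and URL for Prometheus' instant-query endpoint; fetching it is left to whatever HTTP client you already use.

```python
from urllib.parse import urlencode

# Hypothetical address: replace with your own Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def peak_memory_query(namespace: str, container: str, window: str = "30d") -> str:
    """PromQL for the highest working-set memory per series over the window."""
    selector = f'namespace="{namespace}",container="{container}"'
    return ("max_over_time(container_memory_working_set_bytes{"
            + selector + "}[" + window + "])")


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for Prometheus' instant-query HTTP endpoint (/api/v1/query)."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def usage_fraction(peak_bytes: float, limit_bytes: int) -> float:
    """Share of the configured limit that the observed peak consumed."""
    return peak_bytes / limit_bytes


if __name__ == "__main__":
    query = peak_memory_query("production", "order-api")  # container name assumed
    print(query)
    print(instant_query_url(PROMETHEUS_URL, query))
```

Once you have the peak in bytes, usage_fraction gives a quick signal: anything close to or above 1.0 against the current limit is a workload to look at first.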