Your Healthcheck Endpoint Is Probably Lying

Most healthcheck endpoints return 200 OK as long as the process is running. That's not a healthcheck — it's a pulse check. Here's what happened when we confused the two, and what a real healthcheck should verify.

A client's payment processing service went down on a Friday afternoon. Or rather, it had been down for 47 minutes before anyone noticed. The load balancer's health dashboard showed all four instances as healthy. Kubernetes reported every pod as ready. Monitoring was green across the board.

The service was up. It just couldn't process payments.

What the healthcheck actually checked

Here's what their endpoint looked like:

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

That's it. The endpoint confirmed that the Node.js process was running and could respond to HTTP requests. It said nothing about whether the service could do its actual job.

The root cause was a dead connection to their payment gateway. After a brief network partition, the connection pool was left holding stale connections, and the service never re-established them. Incoming payment requests would hang for 30 seconds, then time out. But the healthcheck? Green.

The load balancer dutifully routed traffic to all four instances. Every single one was broken in the same way.

The pulse check problem

Most healthcheck endpoints I encounter in consulting are pulse checks, not healthchecks. They confirm the process is alive. That's the easiest thing to verify and the least useful.

A service can be alive and completely useless. The database connection might be dead. A required API key might have expired. The disk might be full. The connection pool might be exhausted. The service might have entered some degraded state where it accepts requests but can't fulfill them.

When your load balancer checks health, it's asking a specific question: "Should I send traffic to this instance?" A 200 response means yes. If the instance can't actually serve that traffic, you've lied to your infrastructure.

Liveness vs readiness

Kubernetes got this distinction right by separating the concepts into two probes, even if many teams configure them identically — which defeats the purpose entirely.

Liveness answers: "Is this process stuck?" If the liveness check fails, Kubernetes kills the pod and restarts it. This should be lightweight. A deadlocked process or an out-of-memory situation should fail liveness.

Readiness answers: "Can this instance handle requests right now?" If readiness fails, Kubernetes removes the pod from Service endpoints. Traffic stops flowing to it, but the pod stays alive. This is where you check dependencies.

The confusion happens when teams point both probes at a single /health endpoint, or define two endpoints that do exactly the same thing:

// Two probes, zero information
app.get('/healthz', (req, res) => res.sendStatus(200));
app.get('/readyz', (req, res) => res.sendStatus(200));
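For contrast, a liveness check can carry information without touching any dependency. Here is a minimal sketch, assuming the service has a background worker loop (a queue consumer, say) that is supposed to tick regularly. `Heartbeat`, `livenessStatus`, and the silence threshold are hypothetical names and values for illustration, not anything from the client's code.

```typescript
// Hypothetical heartbeat tracker: the worker loop calls beat() on every
// iteration, and the liveness handler asks whether the loop has gone quiet.
class Heartbeat {
  private lastBeatMs: number;
  private readonly maxSilenceMs: number;

  constructor(maxSilenceMs: number, nowMs: number = Date.now()) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastBeatMs = nowMs;
  }

  // Record that the worker loop made progress.
  beat(nowMs: number = Date.now()): void {
    this.lastBeatMs = nowMs;
  }

  // Live as long as the last beat is recent enough.
  isLive(nowMs: number = Date.now()): boolean {
    return nowMs - this.lastBeatMs <= this.maxSilenceMs;
  }
}

// Status code a /healthz handler could send based on the heartbeat.
function livenessStatus(hb: Heartbeat, nowMs: number = Date.now()): 200 | 503 {
  return hb.isLive(nowMs) ? 200 : 503;
}
```

The handler stays cheap: it reads one timestamp and never calls the database or any other dependency, so a dependency outage cannot trigger a restart loop.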

What a real readiness check looks like

A readiness check should verify the things that, if broken, mean this instance can't serve traffic:

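app.get('/readyz', async (req, res) => {
  // One boolean per dependency, so the response shows what failed.
  const checks: Record<string, boolean> = {};

  // Database: a trivial query proves the connection actually works.
  try {
    await db.query('SELECT 1');
    checks.database = true;
  } catch {
    checks.database = false;
  }

  // Cache: a PING round trip to Redis.
  try {
    await redis.ping();
    checks.cache = true;
  } catch {
    checks.cache = false;
  }

  // Pool: ready if a connection is idle or there is room to open another.
  const poolStats = db.pool.stats();
  checks.connectionPool = poolStats.idle > 0 ||
    poolStats.active < poolStats.max;

  // Any failed check turns the response into a 503.
  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({ healthy, checks });
});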

The check should be fast — you don't want the health probe itself becoming a performance problem. It should test real dependencies, not just return a canned response. And it should return structured data so you can tell what failed, not just that something failed.
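One way to keep the check fast is to run the dependency checks concurrently and give each one a hard deadline. This is a rough sketch, assuming each check is a function that returns a promise. `withTimeout`, `runChecks`, and the 500 ms default are illustrative choices, not part of any library.

```typescript
type Check = () => Promise<unknown>;

// Reject if the promise has not settled within `ms` milliseconds.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([promise, deadline]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}

// Run all checks concurrently. A check passes only if it resolves in time;
// a rejection, a thrown error, or a timeout all count as a failure.
async function runChecks(
  checks: Record<string, Check>,
  timeoutMs: number = 500,
): Promise<{ healthy: boolean; checks: Record<string, boolean> }> {
  const results: Record<string, boolean> = {};
  await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), timeoutMs);
        results[name] = true;
      } catch {
        results[name] = false;
      }
    }),
  );
  return { healthy: Object.values(results).every(Boolean), checks: results };
}
```

With this shape, the total probe latency is bounded by the slowest single check (at most `timeoutMs`), not by the sum of all checks. A handler could pass `() => db.query('SELECT 1')` and `() => redis.ping()` as checks and send 200 or 503 based on `healthy`.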

Warning

Be careful about checking external services in your readiness probe. If it hits a third-party API with a rate limit, your own healthchecks might get you throttled. Check connection pool state and local connectivity instead.
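One way to follow that advice is to run the expensive check on a slow background timer and have the probe read only the cached result. The sketch below assumes that pattern. `CachedCheck` and its method names are hypothetical, and the maximum age is a placeholder you would tune against your probe interval.

```typescript
// Caches the last result of an expensive check so that probes read memory
// instead of triggering a new outbound call every time they fire.
class CachedCheck {
  private lastOk = false;
  private lastCheckedMs: number | null = null;
  private readonly maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  // Called by a background timer (not by the probe) after each real check.
  record(ok: boolean, nowMs: number = Date.now()): void {
    this.lastOk = ok;
    this.lastCheckedMs = nowMs;
  }

  // Called by the readiness handler. A missing or stale result counts as
  // unhealthy, so a stalled background timer cannot leave the probe green.
  isHealthy(nowMs: number = Date.now()): boolean {
    if (this.lastCheckedMs === null) return false;
    return this.lastOk && nowMs - this.lastCheckedMs <= this.maxAgeMs;
  }
}
```

The outbound call rate is then fixed by the background timer, no matter how many load balancers or kubelets are probing the instance. The cost is that readiness lags reality by up to one timer interval.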

The cascade nobody expected

Back to the payment service. The immediate fix was straightforward: restart the pods, re-establish the gateway connections. But the real fix was building a readiness check that tested the payment gateway connection.

After we deployed the new healthcheck, we discovered something else. The gateway had brief connectivity blips roughly twice a week — lasting 10 to 20 seconds each. The old healthcheck never caught these, so every blip meant a few dozen failed payments that landed in a retry queue and sometimes got lost.

With the new readiness probe, Kubernetes would pull affected pods from the Service within seconds. By the time the connection recovered, no user traffic had reached the broken instance. Failed payments during gateway blips dropped from an average of 34 per incident to zero.

That was the number that got the team's attention. Not "better healthchecks" as an abstract best practice. Thirty-four fewer failed payments per incident, twice a week.

The tricky part: partial health

Not every dependency failure should make your service unready. If your API can still serve reads but can't write to the analytics pipeline, is it healthy? Depends on what you're optimizing for.

Some teams solve this with multiple endpoints — /readyz/critical and /readyz/full — where the load balancer checks the critical path and monitoring alerts on the full picture. Others use a scoring model where degraded dependencies reduce a health score rather than flipping a binary switch.
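Both approaches can share one set of check results. The sketch below is one possible shape, not a standard: each check is tagged as critical or not, readiness depends only on the critical ones, and an unweighted score (the fraction of checks passing) stands in for the scoring model. `CheckResult`, `HealthSummary`, and `summarize` are hypothetical names.

```typescript
interface CheckResult {
  name: string;
  ok: boolean;
  critical: boolean; // true if a failure means the instance cannot serve traffic
}

interface HealthSummary {
  ready: boolean;    // for the load balancer: no critical check failed
  degraded: boolean; // for alerting: at least one check failed
  score: number;     // fraction of checks passing, from 0 to 1
  failed: string[];  // names of the failed checks
}

function summarize(results: CheckResult[]): HealthSummary {
  const failed = results.filter((r) => !r.ok);
  // With no checks at all, treat the instance as ready with a perfect score.
  const score =
    results.length === 0 ? 1 : (results.length - failed.length) / results.length;
  return {
    ready: failed.every((r) => !r.critical),
    degraded: failed.length > 0,
    score,
    failed: failed.map((r) => r.name),
  };
}
```

A /readyz/critical handler would return 503 only when `ready` is false, while /readyz/full could return the whole summary for monitoring. A real scoring model would probably weight dependencies rather than count them equally.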

There's no universal right answer. But the wrong answer is always the same: returning 200 without checking anything.

What I look for now

After hitting this pattern at three different clients, I have a short checklist when reviewing any service's health endpoint.

Does the readiness probe test actual dependencies? Not just "is the process running" but "can this instance do its job right now."

Does it fail fast? A healthcheck with a 10-second timeout that matches the probe interval is worse than useless. It ties up probe capacity and delays detection, because a hanging check takes the full 10 seconds to register as a single failure.

Is liveness separate from readiness? A pod that can't reach the database shouldn't be killed and restarted in a loop. It should be pulled from rotation until the database comes back.

Does anyone treat healthcheck failures as a signal? Even before they cause user impact, readiness probe flickers are often an early warning that a dependency is degrading.
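For the "fail fast" question, a back-of-the-envelope estimate helps. The sketch below uses the Kubernetes probe fields `periodSeconds`, `timeoutSeconds`, and `failureThreshold`, and assumes the timeout is no longer than the period. The formula is a rough upper bound for reasoning about settings, not a description of the kubelet's exact scheduling, and `worstCaseDetectionSeconds` is a made-up helper name.

```typescript
interface ProbeTiming {
  periodSeconds: number;    // how often the probe runs
  timeoutSeconds: number;   // how long one probe may hang before it counts as failed
  failureThreshold: number; // consecutive failures before the pod is marked unready
}

// Rough upper bound on the time between a dependency breaking and the pod
// being marked unready: the break happens just after a passing probe, then
// `failureThreshold` probes must fail, and the last one may hang until timeout.
function worstCaseDetectionSeconds(p: ProbeTiming): number {
  return p.failureThreshold * p.periodSeconds + p.timeoutSeconds;
}

// A 10-second timeout matching a 10-second interval, three failures required.
const slow = worstCaseDetectionSeconds({
  periodSeconds: 10,
  timeoutSeconds: 10,
  failureThreshold: 3,
}); // 3 * 10 + 10 = 40

// A tighter configuration: short period, short timeout, two failures required.
const fast = worstCaseDetectionSeconds({
  periodSeconds: 2,
  timeoutSeconds: 1,
  failureThreshold: 2,
}); // 2 * 2 + 1 = 5
```

Under these assumptions the slow configuration can take up to 40 seconds to react, longer than a 10 to 20 second connectivity blip lasts, while the tighter one reacts in about 5.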

The healthcheck endpoint is the one part of your service that talks directly to your infrastructure. It's the contract between your code and everything that decides where traffic goes. A lazy implementation means your entire routing layer is operating on bad data — and you won't find out until someone asks why payments have been failing for 47 minutes.

What does your healthcheck actually verify?