The Cron Job That Ran Twice (And Charged Everyone Twice Too)
A consulting story about a nightly billing job that quietly started double-charging customers after a Kubernetes migration — and the boring lock that finally fixed it.
The first time I heard about it, the support inbox had 14 messages from the same morning. All variations of the same complaint: "Why was I billed twice for March?" The client was a B2B SaaS company that ran a nightly invoicing job for around 8,000 customers. It had been running, unchanged, for almost three years. It had never double-charged anyone.
Then they migrated from a single VM to Kubernetes, and the world got interesting.
What changed (and what I assumed)
When I joined the engagement, the migration was already done. The old setup was a single Ubuntu box running a Node.js process triggered by cron at 02:00. The new setup was a Kubernetes CronJob resource pointed at the same container image, the same database, the same Stripe API keys. From the team's perspective they had moved a crontab line into a YAML file. Nothing else.
My first assumption was that someone had broken the job's idempotency — that the code had always been duplicate-safe and the migration had introduced a regression. I spent half a day reading the billing service. It was not duplicate-safe. It never had been. The single VM had simply made it impossible for the job to run twice.
That's the thing about migrations. They don't break code. They expose the assumptions the code was built on.
How a CronJob fires twice
A Kubernetes CronJob is not a magic exactly-once scheduler. It schedules a Job, the Job creates a Pod, and the Pod does the work. There are at least three ways this can go sideways:
- The kube-controller-manager misses a scheduled time and tries to "catch up" later, depending on startingDeadlineSeconds.
- A node becomes NotReady while a Pod is running. The controller can decide the Pod is gone and start a new one. The original may still be alive, just unreachable.
- The default concurrencyPolicy is Allow. If a job runs long, the next scheduled invocation will start regardless of whether the previous one finished.
In our case it was a combination of the last two. The cluster had been having intermittent network issues with one of the worker nodes. On the night of the incident, the billing pod had been scheduled there. The node went NotReady for about four minutes during the run. The controller marked the pod as lost and scheduled a fresh one on a healthy node. The original pod was very much alive and most of the way through processing 8,000 invoices. Both finished. Stripe got charged twice. So did the customers.
Warning
Even with concurrencyPolicy: Forbid, network partitions and stuck pods can produce duplicate executions. The scheduler is best-effort, not exactly-once.
The first fix that wasn't enough
The obvious move was to set concurrencyPolicy: Forbid. We did that immediately. It would stop the next scheduled run from starting if the previous one was still active. But it does nothing about the case that actually bit us: the controller deciding a pod is dead when it isn't. So we kept looking.
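The change is a couple of lines in the manifest. A sketch, not the client's actual resource: the name, image, and schedule are illustrative, and the backoffLimit and restartPolicy settings are my own assumptions about how a billing job should behave; only the concurrencyPolicy and startingDeadlineSeconds lines are the point.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-billing            # illustrative name
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid        # don't start a run while one is still active
  startingDeadlineSeconds: 600     # skip a missed run rather than pile up catch-ups
  jobTemplate:
    spec:
      backoffLimit: 0              # a failed billing run needs a human, not a retry
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: billing
              image: registry.example.com/billing:latest  # illustrative
```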
The team's instinct was to add a database flag — set billing_run.status = 'running' at the start, check it before starting. I've seen this pattern fail enough times to push back. Two pods can read the row at the same instant, both see "not running," both update it. Without an atomic compare-and-set, you've just built a race condition with extra steps.
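For completeness, the flag pattern can be made safe if the check and the write happen in a single atomic statement. A sketch, assuming a hypothetical billing_run table with a status column; the names are illustrative:

```sql
-- Atomic compare-and-set: the WHERE condition and the UPDATE are one
-- statement, so only one session can flip the row to 'running'.
UPDATE billing_run
   SET status = 'running', started_at = now()
 WHERE run_date = current_date
   AND status = 'pending'
RETURNING id;
-- Zero rows returned means another runner already claimed it: exit.
```

Even done correctly, this still fails the expiry requirement below: a crashed pod leaves the flag stuck at 'running' until someone resets it by hand.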
What we actually needed was a lock that:
- Was acquired atomically.
- Had an owner identity, so a second runner could tell the difference between "someone else holds this" and "I hold this."
- Expired automatically if the holder died, so a crashed pod wouldn't lock us out forever.
- Could be released early on a clean shutdown.
The actual fix
We already had Postgres. We did not need Redis or etcd or ZooKeeper for a job that runs once a night. Postgres advisory locks gave us everything above with about twelve lines of code:
-- Try to grab a session-level advisory lock.
-- pg_try_advisory_lock returns true if we got it, false if not.
SELECT pg_try_advisory_lock(hashtext('billing-nightly'));

On the Node.js side, the job opens a dedicated connection, runs pg_try_advisory_lock, and bails out cleanly if it returns false. The lock is tied to the session, so if the pod dies, Postgres releases it the moment the connection closes. No TTLs to tune, no clock skew to reason about, no extra infrastructure to operate.
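Wired into the job, the Node.js side looks roughly like this. A sketch, not the client's code: it assumes a dedicated node-postgres Client is passed in, and the runExclusively name and return shape are invented for illustration.

```javascript
// Run `job` only if we win the advisory lock. `client` must be a
// dedicated connection (e.g. a node-postgres Client), because a
// session-level advisory lock lives and dies with that session.
async function runExclusively(client, lockName, job) {
  const res = await client.query(
    "SELECT pg_try_advisory_lock(hashtext($1)) AS acquired",
    [lockName]
  );
  if (!res.rows[0].acquired) {
    // Another pod holds the lock: bail out cleanly, don't retry.
    return { ran: false };
  }
  try {
    return { ran: true, result: await job() };
  } finally {
    // Release early on a clean shutdown; on a crash, Postgres releases
    // the lock anyway when the session's connection closes.
    await client.query("SELECT pg_advisory_unlock(hashtext($1))", [lockName]);
  }
}
```

The second runner doesn't block or retry; for a nightly job, "someone else is doing it" and "done" are the same outcome.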
We paired that with an idempotency table at the row level — a unique constraint on (customer_id, billing_period) — so even if two processes somehow slipped past the advisory lock, the database would refuse the second insert. Belt and suspenders, but billing is one of those areas where I am happy to wear both.
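The belt half of that, sketched in SQL — the table and column names are illustrative, not the client's schema:

```sql
-- One invoice per customer per billing period, enforced by the database.
ALTER TABLE invoices
  ADD CONSTRAINT invoices_customer_period_uniq
  UNIQUE (customer_id, billing_period);

-- The job inserts the invoice row *before* calling Stripe. A duplicate
-- runner gets zero rows back from RETURNING and knows to skip the charge.
INSERT INTO invoices (customer_id, billing_period, amount_cents)
VALUES ($1, $2, $3)
ON CONFLICT (customer_id, billing_period) DO NOTHING
RETURNING id;
```

The ordering matters: record first, charge second, so the constraint guards the Stripe call and not just the bookkeeping.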
What I'd do differently
I should have asked "what happens if this runs twice?" on the first day, not the third. The migration review focused on resource limits, image pull secrets, RBAC. Nobody asked the question that mattered: which of these jobs assume they are the only instance running? In a single-VM world that was a free assumption. In a Kubernetes world it is a claim that needs proof.
The other thing I should have done sooner is read the actual CronJob source code in kube-controller-manager. I spent hours debugging on the assumption that concurrencyPolicy: Forbid was enough. Twenty minutes of reading the controller would have told me it isn't.
We refunded every duplicated charge within two days. The client was gracious about it. But I keep thinking about the other jobs in their cluster — the ones that send emails, generate reports, sync data to a warehouse — and wondering how many of them are also quietly assuming they will only ever run once. How would you even know until they didn't?