The Read Replica That Lied to Us


A client moved their reads to database replicas for performance. The latency numbers looked great — until customers started getting charged twice and inventory counts drifted from reality.

The performance graphs looked fantastic. Average API response time dropped from 120ms to 45ms after the team split their database reads to a replica. The tech lead was thrilled. The product manager was thrilled. Everyone high-fived in Slack.

Three weeks later, a customer support ticket came in: "I was charged twice for my order." Then another. Then six more in the same afternoon.

I joined this engagement to help with a different problem — migrating their payment processing to a new provider. But the duplicate charge issue was on fire, so I got pulled in.

The setup that looked reasonable

The client ran a mid-size e-commerce platform on AWS. PostgreSQL on RDS, a Node.js API layer, the usual. Their order flow was straightforward: customer clicks "Place Order," the API creates an order record, charges the payment method, and updates the order status to "paid."

They'd been hitting read capacity limits on the primary database during peak hours. A senior engineer suggested adding a read replica and routing all SELECT queries there. The change was clean — they used a connection pool wrapper that directed writes to the primary and reads to the replica.

import { Pool } from 'pg';

// Writes go to the primary; reads go to the replica.
const db = {
  write: new Pool({ host: PRIMARY_HOST }),
  read: new Pool({ host: REPLICA_HOST }),
};

async function getOrder(orderId: string) {
  return db.read.query('SELECT * FROM orders WHERE id = $1', [orderId]);
}

async function createOrder(order: Order) {
  return db.write.query('INSERT INTO orders ...', [order]);
}

Simple. Tested in staging. Deployed with a feature flag. Gradually rolled out to 100% of traffic over a week. Metrics looked good the whole time.

The gap nobody measured

PostgreSQL streaming replication is fast. Under normal conditions, replica lag sits in the low milliseconds — often under 10ms. The team knew this. They'd checked pg_stat_replication during the rollout and saw lag values that were basically zero.

What they hadn't accounted for was what happens during a vacuum operation, a burst of writes, or when the replica is briefly catching up after a network hiccup. During those moments, lag could spike to 500ms or more. Sometimes a full second.
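
One way to catch those spikes is to sample lag continuously instead of glancing at it during a rollout. The following is a minimal sketch, assuming PostgreSQL 10 or later (where pg_stat_replication exposes a replay_lag column) and a replica that shows up in that view on the primary. The Queryable interface and the sampleReplayLagMs name are illustrative stand-ins, not the client's code.

```typescript
// Minimal shape of the pool method used here, so the sketch runs without a database.
// (Hypothetical interface; a node-postgres Pool satisfies it.)
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// replay_lag is an interval; convert it to milliseconds in SQL.
const LAG_SQL = `
  SELECT application_name,
         EXTRACT(EPOCH FROM replay_lag) * 1000 AS replay_lag_ms
  FROM pg_stat_replication`;

// Run against the PRIMARY: the view has one row per connected standby.
async function sampleReplayLagMs(primary: Queryable): Promise<number[]> {
  const result = await primary.query(LAG_SQL);
  return result.rows
    // replay_lag is NULL when there has been no recent activity to measure.
    .filter((row) => row.replay_lag_ms !== null)
    // Numeric columns can arrive as strings, so normalise to number.
    .map((row) => Number(row.replay_lag_ms));
}
```

Calling this every second or so and keeping the raw samples gives you the spikes themselves, not just a snapshot taken on a quiet afternoon.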

And here's where the order flow fell apart. The order endpoint had an idempotency check: before creating a new order, the API would query for any existing order with the same cart ID to prevent duplicates.

async function placeOrder(cartId: string, paymentDetails: PaymentDetails) {
  // Check for an existing order by cart ID: this hits the REPLICA
  const existing = await db.read.query(
    'SELECT * FROM orders WHERE cart_id = $1', [cartId]
  );
  if (existing.rows.length > 0) return existing.rows[0];

  // Create order: this hits the PRIMARY
  const order = await createOrder({ cartId, ...paymentDetails });
  await chargePayment(order.id, paymentDetails);
  return order;
}

If a customer double-clicked the "Place Order" button — which happens constantly — two requests would hit the API within 100ms of each other. The first request would write the order to the primary. The second request would check the replica for an existing order. But the replica was still 200ms behind. No existing order found. Second order created. Second charge processed.

The idempotency check was reading from the past.
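
The timing is easier to see in a toy model. This is a minimal in-memory sketch, not the client's code: it assumes the two requests are handled one after the other, models the replica as "the primary, but only rows older than the lag", and counts a charge for every order created. simulateDoubleClick and its parameters are made up for illustration.

```typescript
type OrderRow = { id: number; cartId: string; writtenAt: number };

// Returns how many charges one double-click produces for a single cart.
// lagMs: replica lag; gapMs: time between the two clicks.
function simulateDoubleClick(
  lagMs: number,
  gapMs: number,
  readFrom: 'replica' | 'primary'
): number {
  const primaryRows: OrderRow[] = [];
  let charges = 0;

  // The primary sees every row; the replica only sees rows at least lagMs old.
  const visible = (row: OrderRow, now: number) =>
    readFrom === 'primary' || now - row.writtenAt >= lagMs;

  const placeOrder = (cartId: string, now: number): OrderRow => {
    // Idempotency check against whichever node we read from.
    const existing = primaryRows.find(
      (r) => r.cartId === cartId && visible(r, now)
    );
    if (existing) return existing;

    // No order visible: write to the primary and charge.
    const row = { id: primaryRows.length + 1, cartId, writtenAt: now };
    primaryRows.push(row);
    charges += 1;
    return row;
  };

  placeOrder('cart-1', 0);      // first click
  placeOrder('cart-1', gapMs);  // second click
  return charges;
}
```

With 200ms of lag and clicks 100ms apart, simulateDoubleClick(200, 100, 'replica') returns 2, while the same call with 'primary' returns 1. Drop the lag to 5ms and the replica path also returns 1, which is why the bug hid so well on quiet days.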

Why it took three weeks to surface

Most of the time, replication lag was small enough that it didn't matter. The race window was tight. You needed a double-click during a moment of elevated lag. During off-peak hours with low write volume, the replica was nearly real-time, and the duplicate check worked fine.

But during peak traffic — exactly when write volume spiked and vacuum operations kicked in — lag increased, and so did the probability of the race condition. The team saw duplicate orders trickling in but initially attributed them to a frontend bug. They added a loading spinner to the button. It helped, but didn't eliminate the problem.

What finally made it obvious was a traffic spike during a flash sale. Replication lag hit 1.2 seconds. Forty-seven duplicate charges in one hour.

The fix wasn't "just read from primary"

The obvious answer, routing every read back to the primary, would have worked. But it would also have put them right back where they started: hammering the primary with reads during peak traffic, which was the whole reason they'd added the replica.

We ended up with a more targeted approach. Reads that were part of a write path — where the result directly influenced whether a write should happen — went to the primary. Everything else stayed on the replica.

async function placeOrder(cartId: string, paymentDetails: PaymentDetails) {
  // Critical read: must hit primary to avoid stale data
  const existing = await db.write.query(
    'SELECT * FROM orders WHERE cart_id = $1', [cartId]
  );
  if (existing.rows.length > 0) return existing.rows[0];
 
  const order = await createOrder({ cartId, ...paymentDetails });
  await chargePayment(order.id, paymentDetails);
  return order;
}
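
One way to keep that rule from eroding over time is to make every read state its consistency requirement. The sketch below is an assumption about how such a wrapper could look, not what the client shipped; makeRouter, Consistency, and Queryable are hypothetical names.

```typescript
// 'strong': the result decides whether a write happens, so read the primary.
// 'eventual': slightly stale data is acceptable, so read the replica.
type Consistency = 'strong' | 'eventual';

// Minimal shape of the pool method used here (a node-postgres Pool satisfies it).
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

function makeRouter(primary: Queryable, replica: Queryable) {
  return {
    // consistency is a required argument, so every call site has to choose.
    read(text: string, params: unknown[], consistency: Consistency) {
      const target = consistency === 'strong' ? primary : replica;
      return target.query(text, params);
    },
    // Writes always go to the primary.
    write(text: string, params: unknown[]) {
      return primary.query(text, params);
    },
  };
}
```

The point of the required argument is that a stale read becomes a visible decision in code review instead of a default nobody noticed.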

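We also added a UNIQUE constraint on cart_id in the orders table as a database-level safety net, something that should have been there from the start. Reading from the primary narrows the race but does not close it: two requests can still both pass the check before either insert commits. With the constraint in place, the second insert fails with a unique-violation error instead of creating a duplicate.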
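
With the constraint in place, the insert itself can act as the idempotency check. Here is a minimal sketch using PostgreSQL's INSERT ... ON CONFLICT DO NOTHING; the status column, the createOrderOnce name, and the Queryable interface are assumptions for illustration, and the real orders table will have more columns.

```typescript
// Minimal shape of the pool method used here (a node-postgres Pool satisfies it).
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// Returns the order for this cart and whether THIS call created it.
// Only the caller that gets created === true should charge the customer.
async function createOrderOnce(
  primary: Queryable,
  cartId: string
): Promise<{ order: any; created: boolean }> {
  // Relies on the UNIQUE constraint on cart_id: a conflicting insert returns no rows.
  const inserted = await primary.query(
    `INSERT INTO orders (cart_id, status)
     VALUES ($1, 'pending')
     ON CONFLICT (cart_id) DO NOTHING
     RETURNING *`,
    [cartId]
  );
  if (inserted.rows.length > 0) {
    return { order: inserted.rows[0], created: true };
  }

  // Another request won the race; fetch its row from the primary.
  const existing = await primary.query(
    'SELECT * FROM orders WHERE cart_id = $1',
    [cartId]
  );
  return { order: existing.rows[0], created: false };
}
```

The design choice here is that the database decides who wins, so there is no gap between "check" and "write" for a second request to slip through.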

Warning

If your idempotency check reads from a replica, it's not an idempotency check. It's a suggestion.

We audited the rest of the codebase for similar patterns. Found two more: an inventory reservation flow that could oversell during lag spikes, and a user registration endpoint that occasionally allowed duplicate email addresses. Same root cause, different symptoms.

The metrics that would have caught it

The team had been monitoring replication lag as a single number — average lag over 5-minute windows. That number was always under 50ms, which looked fine. But averages hide spikes. We switched to tracking p99 replication lag and set an alert at 200ms.
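
To see how an average can hide this, here is a small sketch with made-up numbers: a 5-minute window sampled once per second, where lag sits at 5ms except for six seconds at 1200ms. The sampling interval and the sample values are illustrative assumptions, and percentile uses the simple nearest-rank definition.

```typescript
function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// 300 one-second samples: 294 quiet seconds and a 6-second lag spike.
const windowSamples: number[] = [
  ...Array(294).fill(5),
  ...Array(6).fill(1200),
];

const averageLagMs = mean(windowSamples);
const p99LagMs = percentile(windowSamples, 99);
```

For this window the mean comes out near 29ms, comfortably under a 50ms line, while the p99 is 1200ms and would trip a 200ms alert immediately.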

We also added a simple consistency check: after any critical write, log the primary's LSN (Log Sequence Number) and compare it to the replica's replay position. When the gap exceeded a threshold during a read-after-write, we'd log a warning. Within the first day, we had a clear picture of exactly which code paths were vulnerable.
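
A minimal sketch of that check, assuming PostgreSQL 10 or later, where pg_current_wal_lsn(), pg_last_wal_replay_lsn(), and pg_wal_lsn_diff() are available. The function names, the Queryable interface, and the zero-byte default threshold are illustrative choices, not the values used on the engagement.

```typescript
// Minimal shape of the pool method used here (a node-postgres Pool satisfies it).
interface Queryable {
  query(text: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

// On the PRIMARY, right after a critical write: the current WAL write position.
async function currentPrimaryLsn(primary: Queryable): Promise<string> {
  const result = await primary.query('SELECT pg_current_wal_lsn()::text AS lsn');
  return result.rows[0].lsn;
}

// On the REPLICA: bytes of WAL up to primaryLsn that have not been replayed yet.
async function replayGapBytes(replica: Queryable, primaryLsn: string): Promise<number> {
  const result = await replica.query(
    'SELECT pg_wal_lsn_diff($1::pg_lsn, pg_last_wal_replay_lsn()) AS gap_bytes',
    [primaryLsn]
  );
  // The diff is numeric (often returned as a string); negative means already replayed.
  return Math.max(0, Number(result.rows[0].gap_bytes));
}

// Logs a warning when a read-after-write on this code path could see stale data.
async function warnIfStale(
  primary: Queryable,
  replica: Queryable,
  label: string,
  thresholdBytes = 0 // hypothetical default; tune to your write volume
): Promise<number> {
  const lsn = await currentPrimaryLsn(primary);
  const gap = await replayGapBytes(replica, lsn);
  if (gap > thresholdBytes) {
    console.warn(`[stale-read risk] ${label}: replica is ${gap} bytes of WAL behind`);
  }
  return gap;
}
```

Tagging each warning with the code path (the label argument) is what turns the log into a map of where read-after-write happens.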

The broader lesson

Read replicas are a legitimate scaling tool. I'm not arguing against them. But they change the consistency model of your application in ways that are easy to underestimate. "Eventually consistent" sounds fine until "eventually" is the 800ms window where your payment processor charges someone twice.

The questions worth asking before splitting reads to a replica: Which reads are part of a write decision? What happens if those reads are 500ms stale? Do you have database-level constraints backing up your application-level checks?

Most teams I've worked with can't answer all three off the top of their head. That's not a criticism — it's just not something you think about until it bites you.