The "Harmless" API Change That Broke Four Services

A consulting story about a minor field rename in an internal API that cascaded into a production incident, and what we put in place to stop it from happening again.


A small rename

The ticket looked innocent enough. A developer on the orders team renamed a JSON field from totalPrice to total_price in their service's response payload. The PR got two approvals. The change aligned with the team's new snake_case convention. It shipped on a Tuesday afternoon.

By Wednesday morning, four downstream services were silently dropping data.

I was three weeks into an engagement with this client — a mid-size fintech company running about 30 microservices. My brief was to help them untangle some performance issues in their checkout flow. Instead, I spent most of that week helping them understand why their order confirmation emails had stopped including totals, why the analytics pipeline was logging nulls, and why the mobile app was showing "$0.00" on the order summary screen.

How one field became four fires

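The orders service was the source of truth for purchase data. Four other services consumed its API:

  • The email service, which sends order confirmation emails
  • The analytics pipeline
  • The mobile BFF (backend-for-frontend), which feeds the app's order summary screen
  • The billing service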

None of these consumers had been consulted about the rename. None of them broke loudly — that was the real problem. The email service used optional chaining and fell back to an empty string. The analytics pipeline treated missing fields as null and kept writing rows. The mobile BFF had a default of 0 for missing numeric fields. The billing service, mercifully, had its own price calculation and only used this field for a consistency check that was behind a feature flag nobody had turned on.
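To make the failure mode concrete, here is a minimal sketch of the three lenient patterns described above, applied to a payload after the rename. The function names and the payload type are illustrative assumptions, not the client's actual code.

```typescript
// Hypothetical payload shape: consumers still read totalPrice,
// but after the rename only total_price is present.
type OrderPayload = {
  orderId: string;
  currency: string;
  totalPrice?: number;
  total_price?: number;
};

const renamedPayload: OrderPayload = {
  orderId: "123",
  currency: "USD",
  total_price: 49.99,
};

// Email service style: optional chaining with an empty-string fallback.
function emailTotalLine(order: OrderPayload): string {
  return order.totalPrice?.toFixed(2) ?? "";
}

// Analytics pipeline style: a missing field is recorded as null.
function analyticsRow(order: OrderPayload): { orderId: string; total: number | null } {
  return { orderId: order.orderId, total: order.totalPrice ?? null };
}

// Mobile BFF style: missing numeric fields default to 0.
function summaryTotal(order: OrderPayload): string {
  const total = order.totalPrice ?? 0;
  return `$${total.toFixed(2)}`;
}
```

Called with renamedPayload, none of these throws. Each one returns a plausible-looking value, which is exactly why no error-rate alert fired.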

So nothing crashed. No alerts fired. The incident was discovered because a customer service rep noticed that order confirmation emails looked wrong.

The uncomfortable part

The developer who made the change wasn't careless. They'd checked the service's own test suite, which passed. They'd updated the API documentation. They did what most of us would do in a codebase where "just rename it and update the docs" feels like the responsible thing.

The real problem was structural. This team had:

  • No contract tests between services
  • No schema registry for internal APIs
  • No automated way to know who consumed what
  • A Confluence page titled "Service Dependencies" that was last updated eight months ago

They had monitoring, and good monitoring at that: Datadog dashboards, plus alerts on error rates and latency. But nothing checked the shape of the data flowing between services. You can have 100% uptime and zero errors while silently producing garbage output.

Warning

Silent failures are worse than loud ones. If your downstream services swallow missing fields without complaint, you won't know something is broken until a human notices the symptoms.
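One way to give a silent failure a signal is to track how often an expected field is missing, alongside error rates. The sketch below is an assumption about what such a check could look like, not something the client was running; the 5% threshold and the sample rows are made up for illustration.

```typescript
// Fraction of records in which `field` is missing or null.
function nullRate(records: Array<Record<string, unknown>>, field: string): number {
  if (records.length === 0) return 0;
  const missing = records.filter(
    (r) => r[field] === undefined || r[field] === null
  ).length;
  return missing / records.length;
}

// Hypothetical threshold: alert when more than 5% of recent rows lack the field.
function shouldAlert(rate: number, threshold = 0.05): boolean {
  return rate > threshold;
}

// Made-up sample: one row from before the rename, two from after.
const recentRows: Array<Record<string, unknown>> = [
  { orderId: "121", totalPrice: 20.0 },
  { orderId: "122", total_price: 35.5 },
  { orderId: "123", total_price: 49.99 },
];
```

Here nullRate(recentRows, "totalPrice") comes out at two thirds, far above the threshold, even though every request in the sample succeeded.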

What we put in place

We didn't try to boil the ocean. The team had real feature work to deliver, so we picked targeted changes that would prevent this class of problem.

Consumer-driven contract tests. We introduced Pact for the three highest-traffic internal APIs, starting with the orders service. Each consumer declares the fields it expects, and those expectations are verified as part of the provider's CI pipeline. If the orders team renames totalPrice while a consumer still depends on it, the provider build fails before the change can merge.

# Simplified view of the email-service contract with orders-service
# (Pact generates the real contract file as JSON from the consumer's tests)
interaction:
  description: "a request for order details"
  request:
    method: GET
    path: /orders/123
  response:
    status: 200
    body:
      orderId: "123"
      totalPrice: 49.99  # consumer expects this field
      currency: "USD"
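The idea behind the contract can be sketched without any tooling. The code below is not Pact's API; it is a hand-rolled illustration, with hypothetical names, of what "the consumer states its expectations and the provider is checked against them" means.

```typescript
// A consumer's expectation: field name -> expected JavaScript type.
type FieldType = "string" | "number" | "boolean";
type Contract = { consumer: string; fields: Record<string, FieldType> };

// Hypothetical contract published by the email service.
const emailServiceContract: Contract = {
  consumer: "email-service",
  fields: { orderId: "string", totalPrice: "number", currency: "string" },
};

// Returns one readable violation per field the provider response fails to satisfy.
function verifyContract(
  contract: Contract,
  providerResponse: Record<string, unknown>
): string[] {
  const violations: string[] = [];
  for (const field of Object.keys(contract.fields)) {
    const expected = contract.fields[field];
    const actual = typeof providerResponse[field];
    if (actual !== expected) {
      violations.push(
        `${contract.consumer} expects ${field}: ${expected}, got ${actual}`
      );
    }
  }
  return violations;
}
```

Run against a response that carries total_price instead of totalPrice, this returns a single violation naming the consumer and the field, which is the information the orders team was missing on that Tuesday.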

A lightweight schema registry. Nothing fancy — a shared Git repo with JSON Schema files for each service's public API. Services reference a specific schema version in their CI, and a GitHub Action validates that new deployments don't break the published schema. It took an afternoon to set up.
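As a rough illustration of what the validation step checks, here is a minimal sketch that compares a published schema with a candidate and reports changes that would break existing consumers. It handles only flat object schemas; the type and function names are assumptions for this example, and a real check would lean on a JSON Schema library to cover nesting and references.

```typescript
// Minimal subset of JSON Schema used by this sketch: flat object schemas only.
type ObjectSchema = {
  type: "object";
  properties: Record<string, { type: string }>;
};

const publishedSchema: ObjectSchema = {
  type: "object",
  properties: {
    orderId: { type: "string" },
    totalPrice: { type: "number" },
    currency: { type: "string" },
  },
};

const renamedSchema: ObjectSchema = {
  type: "object",
  properties: {
    orderId: { type: "string" },
    total_price: { type: "number" },
    currency: { type: "string" },
  },
};

// Reports properties that existing consumers could lose: removed or retyped.
// Added properties are treated as non-breaking.
function findBreakingChanges(published: ObjectSchema, candidate: ObjectSchema): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(published.properties)) {
    const beforeType = published.properties[name]?.type;
    const afterType = candidate.properties[name]?.type;
    if (afterType === undefined) {
      problems.push(`removed property: ${name}`);
    } else if (afterType !== beforeType) {
      problems.push(`changed type of ${name}: ${beforeType} -> ${afterType}`);
    }
  }
  return problems;
}
```

A CI step could fail the deployment whenever findBreakingChanges returns a non-empty list. Comparing publishedSchema with renamedSchema reports totalPrice as removed.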

Explicit deprecation windows. We added a simple rule: no field removal or rename without a two-week deprecation period. During that window, the API returns both the old and new field names. Consumers get a warning header (Deprecation: true) so teams can grep their logs and find what needs updating.

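// Assumed shape of the internal Order model (illustrative)
interface Order {
  id: string;
  totalPrice: number;
  currency: string;
}

// Deprecation bridge: serve both field names during the transition
function formatOrderResponse(order: Order) {
  return {
    orderId: order.id,
    total_price: order.totalPrice, // new snake_case name
    totalPrice: order.totalPrice, // deprecated, remove after 2026-04-15
    currency: order.currency,
  };
}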
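On the consumer side, the warning header is only useful if something writes it down. A minimal sketch, assuming headers arrive as a plain record and using a made-up log tag (DEPRECATED_API) that teams could grep for:

```typescript
// Headers as a plain record; real HTTP clients expose their own header types.
type HeaderMap = Record<string, string>;

// Returns a greppable log line when the response is marked deprecated, else null.
function deprecationWarning(
  service: string,
  path: string,
  headers: HeaderMap
): string | null {
  // HTTP header names are case-insensitive, so normalize before comparing.
  const flagged = Object.keys(headers).some(
    (name) => name.toLowerCase() === "deprecation"
  );
  return flagged ? `DEPRECATED_API service=${service} path=${path}` : null;
}
```

A shared HTTP client wrapper could call this on every internal response and pass any non-null result to the logger, so each consumer leaves a trail during the two-week window.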

What I'd push harder on next time

If I could rewind, I'd have started with a service dependency graph on day one. Not the Confluence page, but an actual runtime dependency map generated from traffic data. Tools like Kiali (if you're running Istio on Kubernetes) or even just parsing access logs can show you who talks to whom. You can't protect contracts you don't know exist.
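The access-log route needs very little code. The sketch below assumes a log format I invented for the example ("timestamp caller -> callee method path"), and the sample lines are made up; a real setup would adapt the pattern to whatever the gateway or sidecar actually emits.

```typescript
// Made-up sample lines in the assumed format:
// "<timestamp> <caller> -> <callee> <method> <path>"
const sampleLogs: string[] = [
  "2026-03-03T14:02:11Z email-service -> orders-service GET /orders/123",
  "2026-03-03T14:02:12Z mobile-bff -> orders-service GET /orders/123",
  "2026-03-03T14:02:15Z analytics-pipeline -> orders-service GET /orders/124",
  "2026-03-03T14:02:19Z email-service -> orders-service GET /orders/125",
];

// Builds a map of callee -> sorted list of distinct callers.
function buildConsumerMap(lines: string[]): Record<string, string[]> {
  const consumers: Record<string, string[]> = {};
  for (const line of lines) {
    const match = /^\S+ (\S+) -> (\S+) /.exec(line);
    if (match === null) continue; // skip lines that do not fit the assumed format
    const caller = match[1] ?? "";
    const callee = match[2] ?? "";
    const known = consumers[callee] ?? [];
    if (known.indexOf(caller) === -1) known.push(caller);
    consumers[callee] = known.sort();
  }
  return consumers;
}
```

For the sample lines, the entry for orders-service lists analytics-pipeline, email-service, and mobile-bff. Anything that never shows up in traffic, like a consumer behind a disabled feature flag, will be missing, so the map is a starting point and not a complete inventory.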

I'd also push for stricter deserialization on the consumer side. The reason this incident was silent is that every consumer was lenient about missing fields. That's a reasonable default for external APIs where you don't control the producer, but for internal services? If a field you depend on disappears, that should be an error, not a fallback to zero.

// Don't do this for internal APIs
const total = response.totalPrice ?? 0;

// Do this instead (== null catches both undefined and null)
if (response.totalPrice == null) {
  throw new Error("Missing required field: totalPrice from orders-service");
}

The pattern underneath

This story isn't really about a field rename. It's about the gap between how teams think about service boundaries and how those boundaries actually behave at runtime. Every microservices team I've worked with has some version of this gap. Services that are "independent" in the architecture diagram are deeply coupled through the data shapes they pass around.

Contract testing closes that gap, but only for the contracts you know about. The deeper fix is cultural: treating internal API changes with the same care you'd give to a public API. Your consumers might all be on the same Slack workspace, but their code still breaks the same way an external customer's would.

What's your team's approach to internal API changes — do you have any guardrails, or is it still mostly trust and Confluence?