We Added Tracing and the Architecture Diagram Was Wrong

A client was confident about how their services talked to each other. Then we instrumented the system with OpenTelemetry and found out what was actually happening.


The architecture diagram was on the wall in the team's meeting room. Clean boxes, labeled arrows, a nice left-to-right flow. API gateway to order service, order service to inventory and payment, payment to notification. The kind of diagram you draw during the first week and never update again.

I was brought in to help with intermittent latency spikes — requests that occasionally took 8 to 12 seconds instead of the expected 200ms. The team had metrics. They had logs. What they didn't have was a way to follow a single request through the system end to end.

So we added OpenTelemetry. And the first waterfall trace I pulled up looked nothing like the diagram on the wall.

Setting it up was the easy part

The system was seven Node.js services running on ECS, talking to each other over HTTP and SQS. We started with the auto-instrumentation package, which took maybe an afternoon to roll out across all services.

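// Load this file before any application code (for example with node --require
// or --import) so the instrumentations can patch modules as they are loaded.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  // Export spans over OTLP/HTTP to the collector or tracing backend.
  traceExporter: new OTLPTraceExporter({
    // Full traces URL, typically ending in /v1/traces.
    url: process.env.OTEL_EXPORTER_ENDPOINT,
  }),
  // Auto-instrument the common libraries the service already uses.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();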

We piped everything to Grafana Tempo and had traces flowing within a day. The setup was genuinely painless. What came next was not.
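
For context on how one request gets stitched together across HTTP hops: by default the Node SDK propagates the W3C Trace Context `traceparent` header, and each service continues the trace ID it receives. The sketch below only illustrates the shape of that header. It is not how the SDK implements propagation, and `parseTraceparent` and `childTraceparent` are hypothetical helper names.

```typescript
// Illustration only: the W3C `traceparent` header.
// Format: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
interface TraceContext {
  version: string;
  traceId: string; // shared by every span in the request
  parentId: string; // span ID of the caller
  sampled: boolean;
}

const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: parse an incoming header, or return null if it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const version = match[1];
  const traceId = match[2];
  const parentId = match[3];
  const flags = match[4];
  // All-zero trace or parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical helper: the header for an outgoing call made by span `spanId`.
// The trace ID is kept; only the parent ID changes.
function childTraceparent(ctx: TraceContext, spanId: string): string {
  const flags = ctx.sampled ? '01' : '00';
  return `${ctx.version}-${ctx.traceId}-${spanId}-${flags}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
if (incoming !== null) {
  console.log(childTraceparent(incoming, 'b7ad6b7169203331'));
}
```

Because every hop keeps the same trace ID, the backend can assemble the spans from all services into one waterfall.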

The hidden round trip

The first surprise was in the order service. According to the diagram, it called inventory once per request to check stock levels. The traces showed it calling inventory twice — once at the start of the flow and once right before persisting the order. Someone had added a second validation check "just to be safe" about a year ago. Nobody remembered doing it. The git blame pointed to a PR with the description "minor safety check."

That second call added 60 to 80ms to every single order request. Not catastrophic on its own, but it was completely unnecessary — the inventory state hadn't changed between the two calls, which were 40ms apart.
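
A minimal sketch of the shape of that fix, assuming the second check can simply reuse the first result. `InventoryClient`, `checkStock`, `placeOrderBefore`, and `placeOrderAfter` are hypothetical names for illustration, not the client's actual code.

```typescript
// Hypothetical types standing in for the real order and inventory code.
interface StockResult {
  sku: string;
  available: boolean;
}

interface InventoryClient {
  checkStock(sku: string): Promise<StockResult>;
}

// Before: stock is checked at the start and again right before persisting.
async function placeOrderBefore(inventory: InventoryClient, sku: string): Promise<boolean> {
  const first = await inventory.checkStock(sku);
  if (!first.available) return false;
  // ... pricing and validation would happen here ...
  const second = await inventory.checkStock(sku); // the "minor safety check"
  if (!second.available) return false;
  return true; // the order would be persisted here
}

// After: one check, and its result is carried through the rest of the flow.
async function placeOrderAfter(inventory: InventoryClient, sku: string): Promise<boolean> {
  const stock = await inventory.checkStock(sku);
  if (!stock.available) return false;
  // ... pricing and validation would happen here ...
  return true; // the order would be persisted here
}

// A counting fake makes the difference visible without a real service.
function countingInventory(): InventoryClient & { calls: number } {
  const client = {
    calls: 0,
    async checkStock(sku: string): Promise<StockResult> {
      client.calls += 1;
      return { sku, available: true };
    },
  };
  return client;
}
```

If a check at write time were genuinely needed, a conditional write in the persistence layer would usually be a better fit than a second read, since a read a few milliseconds earlier guarantees nothing either.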

The service that called itself

The notification service had a webhook handler that, under certain conditions, triggered a new notification. That notification then hit the same webhook endpoint. It wasn't infinite — there was a type check that prevented recursion beyond one level — but the traces showed this self-call happening on about 30% of notification events. Each round trip added 150ms and a second database write that was immediately overwritten by the subsequent update.

The team had no idea this was happening. The self-call wasn't in the architecture diagram. It wasn't in anyone's mental model. It was just there, burning CPU and database connections, invisible until we could see the actual call graph.
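
One way to surface an edge like this is to aggregate spans into a service-to-service call graph and look for edges where caller and callee are the same service. The sketch below assumes a simplified span record (`SpanRecord` is a hypothetical shape, not the OTLP schema) and treats a server span whose parent is a client span as one call between services.

```typescript
// Hypothetical, simplified span shape; real exported spans carry far more fields.
interface SpanRecord {
  spanId: string;
  parentSpanId: string | null;
  service: string;
  kind: 'server' | 'client' | 'internal';
}

interface Edge {
  from: string;
  to: string;
  count: number;
}

// Count caller -> callee edges: a server span in the callee whose parent is a
// client span in the caller represents one call across a service boundary.
function buildCallGraph(spans: SpanRecord[]): Edge[] {
  const byId = new Map<string, SpanRecord>();
  for (const span of spans) byId.set(span.spanId, span);

  const edges = new Map<string, Edge>();
  for (const span of spans) {
    if (span.kind !== 'server' || span.parentSpanId === null) continue;
    const parent = byId.get(span.parentSpanId);
    if (parent === undefined || parent.kind !== 'client') continue;
    const key = `${parent.service}->${span.service}`;
    const existing = edges.get(key);
    if (existing === undefined) {
      edges.set(key, { from: parent.service, to: span.service, count: 1 });
    } else {
      existing.count += 1;
    }
  }
  return Array.from(edges.values());
}

// Edges where a service calls itself over the network.
function selfCalls(edges: Edge[]): Edge[] {
  return edges.filter((edge) => edge.from === edge.to);
}
```

Comparing the resulting edge list against the arrows on the diagram is a quick way to find calls nobody drew.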

The real source of the latency spikes

But the intermittent 8-second requests — the reason I was there in the first place — turned out to be something else entirely.

The payment service called a third-party fraud detection API on every transaction. Most of the time, it responded in under 100ms. But roughly 2% of requests hit what appeared to be a cold-start penalty on the vendor's side — response times ballooned to 7 or 8 seconds. The payment service had a 10-second timeout, so the requests technically succeeded. They were just painfully slow.

Without traces, this was nearly impossible to spot. The payment service's average latency looked fine. So did its P95, because only about 2% of calls were slow. The P99 did land in the slow tail, but a percentile only tells you that something is slow, not which hop is responsible. You had to look at individual request waterfalls to find the trace where one span, the fraud check, sat there for 7,400ms while everything else completed in under 200ms.
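
A quick arithmetic sketch of why summary statistics hide a tail this small. The numbers are illustrative, not the client's data: 98 fraud checks at 100ms and 2 at 7,400ms per 100 requests. The percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
// Nearest-rank percentile over a list of latencies in milliseconds.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no values');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Illustrative distribution: 98 fast calls and 2 slow ones.
const fast: number[] = Array.from({ length: 98 }, () => 100);
const slow: number[] = Array.from({ length: 2 }, () => 7400);
const latencies: number[] = fast.concat(slow);

console.log(mean(latencies)); // 246
console.log(percentile(latencies, 50)); // 100
console.log(percentile(latencies, 95)); // 100
console.log(percentile(latencies, 99)); // 7400
```

The median and P95 do not move at all, and the mean shifts by an amount that is easy to read as noise. With a 2% slow tail, only the nearest-rank P99 and above fall inside the slow group, and even then the number says nothing about which span is responsible.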

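// Before: no dedicated timeout on the fraud check and no fallback
// (only the payment service's overall 10-second timeout applied)
// const fraudResult = await fraudClient.check(transaction);

// Helper: a promise that resolves after `ms` milliseconds
const timeout = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// After: 2-second timeout with an allow-and-flag fallback
const fraudResult = await Promise.race([
  fraudClient.check(transaction),
  timeout(2000).then(() => ({
    decision: 'allow',
    reason: 'timeout_bypass',
    requiresManualReview: true,
  })),
]);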

We added a 2-second timeout on the fraud check with a fallback that allowed the transaction through but flagged it for manual review. The client's fraud team reviewed the numbers and agreed — the 2% of slow checks weren't catching more fraud than the fast ones. They were just slow.
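
One caveat with a bare `Promise.race`: the losing promise keeps running, so the slow vendor call stays in flight and the timer stays scheduled even after a result is returned. A hedged variant, assuming the fraud client can accept an `AbortSignal`, cancels the call and clears the timer. The `FraudClient` interface and `checkWithFallback` name are hypothetical, not the vendor's API.

```typescript
interface FraudDecision {
  decision: 'allow' | 'deny';
  reason: string;
  requiresManualReview: boolean;
}

// Hypothetical client shape; assumes the real client can take an AbortSignal.
interface FraudClient {
  check(transaction: unknown, signal: AbortSignal): Promise<FraudDecision>;
}

async function checkWithFallback(
  client: FraudClient,
  transaction: unknown,
  timeoutMs: number,
): Promise<FraudDecision> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const fallback = new Promise<FraudDecision>((resolve) => {
    timer = setTimeout(() => {
      // Resolve first so the fallback wins the race, then cancel the vendor call.
      resolve({ decision: 'allow', reason: 'timeout_bypass', requiresManualReview: true });
      controller.abort();
    }, timeoutMs);
  });

  try {
    // Vendor errors other than the timeout still reject and reach the caller.
    return await Promise.race([client.check(transaction, controller.signal), fallback]);
  } finally {
    // Do not leave the timer pending once either side has settled.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A call such as `checkWithFallback(fraudClient, transaction, 2000)` would then behave like the race above while releasing the connection and the timer.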

What the diagram should have shown

After a week of tracing, we redrew the architecture diagram. It had more arrows. Some of them were surprising. The notification service's self-call. A health check endpoint on the API gateway that, due to a misconfigured route, was proxying through the order service on every probe — adding 50 requests per minute of unnecessary internal traffic. A logging middleware that made a synchronous HTTP call to an internal audit service on every single request, adding 15ms of baseline latency to everything.
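
One common way to take a call like the audit one off the request path, offered as a sketch rather than a description of what the client did, is to buffer events in memory and send them in batches. `BufferedAuditLog`, `AuditEvent`, and `SendBatch` are hypothetical names, and the sender is injected so the example stays free of network code.

```typescript
interface AuditEvent {
  path: string;
  at: number;
}

// Hypothetical sender: in a real service this would POST a batch to the audit service.
type SendBatch = (events: AuditEvent[]) => Promise<void>;

class BufferedAuditLog {
  private buffer: AuditEvent[] = [];

  constructor(
    private readonly send: SendBatch,
    private readonly maxBatch: number,
  ) {}

  // Called from the request path: returns immediately without waiting on the network.
  record(event: AuditEvent): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxBatch) {
      void this.flush();
    }
  }

  // Called when the buffer fills, or from a timer; sends everything buffered so far.
  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await this.send(batch);
    } catch {
      // Put the events back in order so a later flush can retry them.
      this.buffer = batch.concat(this.buffer);
    }
  }
}
```

The trade-off is durability: buffered events are lost if the process dies before a flush, so this only fits when the audit requirements can tolerate that.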

None of these showed up in any existing monitoring. They weren't errors. They weren't in any alert threshold. They were just... friction. The kind of friction that accumulates until someone asks "why does this feel slow?" and nobody has a good answer.

The uncomfortable takeaway

Most teams I work with have an architecture diagram. Almost none of them are accurate. Not because anyone lied — because systems drift. Someone adds a call, someone else adds a retry, a middleware gets inserted, a health check gets misconfigured. Each change is small. Over two years, the actual system and the mental model of the system diverge significantly.

Distributed tracing isn't just a debugging tool. It's a way to find out what your system actually does, as opposed to what you think it does. The setup cost with OpenTelemetry in 2026 is genuinely low: auto-instrumentation covers most of the common libraries, and backends like Grafana Tempo (self-hosted or through Grafana Cloud) or Honeycomb handle the storage. The harder part is being willing to look at what the traces reveal and act on it.

That fraud check timeout fix took us about thirty minutes to implement. The unnecessary double inventory call was a one-line delete. The notification self-call was a small refactor. Combined, those changes dropped the P99 from 8.2 seconds to 380ms and reduced internal traffic by about 15%. All from problems that were invisible before we could follow a request from end to end.

I keep thinking about that diagram on the meeting room wall. It's probably still there.