We Instrumented Our LLM Calls and Found the Budget Leak

observability ai opentelemetry consulting debugging

A client's AI features were burning through their OpenAI budget 3x faster than projected. Adding OpenTelemetry's GenAI semantic conventions revealed the problem wasn't what anyone expected.

The Slack message came in at 9 AM on a Tuesday: "Our OpenAI bill last month was $47k. We budgeted $15k. Can you look at this?"

The client was a Series B SaaS company that had shipped three AI-powered features over the previous quarter: a document summarizer, a customer support chatbot, and an internal search tool that used RAG over their knowledge base. All three hit GPT-4o through a shared wrapper service. None of them had any observability beyond basic request counting in Datadog.

They knew how many requests went out. They had no idea what was inside them.

The black box problem

Here's what their existing monitoring told them: the AI service handled roughly 200,000 requests per day. Average latency was 2.1 seconds. Error rate was under 0.5%. Everything looked healthy from the outside.

But "200,000 requests" tells you almost nothing when each request can vary from 500 tokens to 128,000 tokens. It's like monitoring a shipping company by counting trucks without knowing what's in them.

I'd been reading about OpenTelemetry's GenAI semantic conventions — they'd been stabilizing through early 2026 and now that OTel graduated from the CNCF last month, the tooling around them has gotten genuinely usable. The conventions standardize how you record LLM interactions: model name, token counts (input and output), tool calls, and optionally the full prompt and completion content.

We decided to instrument properly rather than guess.

Adding the instrumentation

Their AI wrapper was a Python FastAPI service. The OpenTelemetry Python SDK already had auto-instrumentation for the OpenAI client library. Setup took about an hour:

from opentelemetry.instrumentation.openai import OpenAIInstrumentor
 
OpenAIInstrumentor().instrument(
    capture_content=False,  # Don't log prompts in prod initially
)

We configured it to ship spans to their existing Grafana Tempo instance. Each LLM call now produced a span with attributes like gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, and gen_ai.operation.name.
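To give a sense of what those attributes make possible, here is a minimal sketch of rolling spans up into per-feature token totals. It assumes the spans have been exported as plain dictionaries and that the wrapper tags each one with a feature name under a hypothetical app.feature attribute; neither detail comes from the client's setup, and the sample values are illustrative.

```python
from collections import defaultdict

# Hypothetical exported spans. The gen_ai.* keys follow the GenAI semantic
# conventions; "app.feature" is an assumed custom attribute, not a standard one.
spans = [
    {"app.feature": "chatbot", "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200},
    {"app.feature": "chatbot", "gen_ai.usage.input_tokens": 750, "gen_ai.usage.output_tokens": 180},
    {"app.feature": "summarizer", "gen_ai.usage.input_tokens": 94000, "gen_ai.usage.output_tokens": 600},
]


def tokens_by_feature(spans):
    """Sum request counts and input/output tokens per feature."""
    totals = defaultdict(lambda: {"requests": 0, "input": 0, "output": 0})
    for span in spans:
        bucket = totals[span.get("app.feature", "unknown")]
        bucket["requests"] += 1
        bucket["input"] += span.get("gen_ai.usage.input_tokens", 0)
        bucket["output"] += span.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


totals = tokens_by_feature(spans)
```

In practice this grouping would more likely live in a tracing backend query than in application code, but the shape of the aggregation is the same.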

Within a day we had enough data to see the problem clearly.

The culprit wasn't the chatbot

Everyone assumed the customer support chatbot was the cost driver. It handled the most requests — about 140,000 per day — and it was customer-facing, so it felt like the obvious suspect.

It wasn't. The chatbot averaged 800 input tokens and 200 output tokens per request. Reasonable. Well-optimized prompts, short conversation histories, sensible context windows.

The document summarizer was the problem. It only handled 12,000 requests per day, but the p95 input token count was 94,000. Not because the documents were that long — most were 3-5 pages — but because of a bug in how context was assembled.

The summarizer had a "related documents" feature that was supposed to include the two most relevant sibling documents for context. A query logic error in the retrieval step meant it was pulling all sibling documents in the same workspace. For one enterprise customer with 800+ documents in a shared workspace, every single summarization request was stuffing the entire workspace into the prompt.

from sqlalchemy import select

# What the call site was supposed to achieve: at most two siblings
siblings = get_related_documents(doc_id, limit=2)

# What actually ran (the limit wasn't being passed through)
def get_related_documents(doc_id, limit=None):
    # Assumed lookup: the original snippet used `doc` without defining it
    doc = session.get(Document, doc_id)
    query = select(Document).where(
        Document.workspace_id == doc.workspace_id,
        Document.id != doc_id,
    )
    # limit was always None by the time it got here, so this branch never ran
    if limit is not None:
        query = query.limit(limit)
    return session.execute(query).scalars().all()

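The limit never reached the query. The caller was not passing it as the limit keyword argument, and because limit defaults to None, Python raised no error when the value went missing: the function simply ran the unbounded query. One caveat on the simplified snippet: with the signature exactly as shown, a plain positional call such as get_related_documents(doc_id, 2) would still bind limit=2, so a default like this only hides the bug when the value is dropped before it arrives, for example by an intermediate layer that does not forward it.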

Warning

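This is why keyword-only arguments exist. With *, limit=None in the signature, a positional call raises a TypeError immediately, so every call site has to name the argument. Dropping the default as well (*, limit) goes further: a caller that omits the limit fails loudly instead of silently getting the unbounded query.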
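A minimal sketch of that behavior, using a hypothetical in-memory stand-in (get_related_ids) rather than the client's database-backed function:

```python
def get_related_ids(doc_id, *, limit=None):
    """Hypothetical stand-in: return sibling ids, capped at `limit` if given."""
    siblings = [i for i in range(1, 11) if i != doc_id]
    return siblings if limit is None else siblings[:limit]


# Naming the argument works as intended.
capped = get_related_ids(1, limit=2)

# Passing it positionally fails loudly instead of being misread or dropped.
try:
    get_related_ids(1, 2)
    positional_error = None
except TypeError as exc:
    positional_error = exc
```

The bare * in the signature is what forces limit to be passed by name; everything after it is keyword-only.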

The numbers told the story

Once we had per-feature token breakdowns, the math was obvious:

Chatbot: 140k requests/day × ~1,000 tokens avg = 140M tokens/day
Search: 48k requests/day × ~2,000 tokens avg = 96M tokens/day
Summarizer: 12k requests/day × ~60,000 tokens avg = 720M tokens/day

The summarizer was responsible for 75% of their token consumption while handling only 6% of their request volume. The request count metric had been actively misleading them.
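The percentages follow directly from the three lines of arithmetic above. As a quick check, this sketch recomputes each feature's share of requests and of tokens from the same figures:

```python
# (requests per day, average tokens per request), from the breakdown above
features = {
    "chatbot": (140_000, 1_000),
    "search": (48_000, 2_000),
    "summarizer": (12_000, 60_000),
}

daily_tokens = {name: reqs * avg for name, (reqs, avg) in features.items()}
total_tokens = sum(daily_tokens.values())
total_requests = sum(reqs for reqs, _ in features.values())

for name, (reqs, _) in features.items():
    request_share = reqs / total_requests
    token_share = daily_tokens[name] / total_tokens
    print(f"{name}: {request_share:.0%} of requests, {token_share:.0%} of tokens")
```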

What we changed beyond the fix

The fix itself was small: we refactored get_related_documents to take limit as a keyword-only argument and made the call site pass limit=2 explicitly. That brought the bill projection down to $14k/month immediately. But the more valuable outcome was the observability layer itself.

We set up three things that stuck:

Token budget alerts per feature. Each AI feature got a daily token budget based on expected usage patterns. If the summarizer crosses 200M tokens in a day, someone gets paged. Not because it's necessarily broken, but because it warrants a look.

Cost attribution dashboards. Grouped by feature, by customer tier, by model. The product team could finally answer "how much does the AI cost us per enterprise customer?" Turns out it varied from $0.80/month to $340/month depending on workspace size — information that matters for pricing decisions.

Prompt size percentile tracking. p50, p95, p99 for input tokens per feature. A slow drift upward in p95 is often the first sign that context assembly logic is degrading — maybe a new document type isn't being chunked properly, or a conversation history isn't being truncated.
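The budget check and the percentile tracking are both simple enough to sketch in a few lines. The version below is an illustration under stated assumptions, not the client's alerting code: the function names are hypothetical, the percentile uses the nearest-rank method (a tracing backend may interpolate differently), and the sample token counts are made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def over_budget(daily_tokens, budgets):
    """Return the features whose token usage exceeds their daily budget."""
    return sorted(
        name
        for name, used in daily_tokens.items()
        if used > budgets.get(name, float("inf"))  # no budget set means no alert
    )


# Illustrative per-request input token counts for one feature.
input_tokens = [700, 800, 850, 900, 1_200, 3_000, 4_500, 5_000, 90_000, 94_000]
p50 = percentile(input_tokens, 50)
p95 = percentile(input_tokens, 95)

budgets = {"summarizer": 200_000_000}  # the 200M daily threshold mentioned above
alerts = over_budget({"summarizer": 720_000_000, "chatbot": 140_000_000}, budgets)
```

The gap between p50 and p95 in the sample data is the kind of signal the percentile tracking is meant to surface: the median looks ordinary while the tail carries the cost.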

The broader lesson

We've spent years building observability muscle for traditional services. We know to track latency percentiles, error rates, queue depths. But many teams are shipping AI features with 2015-era monitoring — counting requests and calling it done.

LLM calls are not like database queries. A single request can cost anywhere from $0.001 to $2.00 depending on what's in it. The variance is enormous, and the cost scales with something (token count, with input and output tokens priced separately) that most teams don't track at all.
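To make that variance concrete, here is a rough per-call cost estimate from token counts. The prices are placeholders chosen for illustration, not quoted rates, and the function name is hypothetical; substitute your provider's current per-token pricing.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate one call's cost from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Placeholder prices in USD per million tokens (assumed, not quoted rates).
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

# A chatbot-sized call versus a summarizer call near the p95 input size.
small_call = estimate_cost_usd(800, 200, INPUT_PRICE, OUTPUT_PRICE)
large_call = estimate_cost_usd(94_000, 500, INPUT_PRICE, OUTPUT_PRICE)
```

Under these assumed prices the two calls differ in cost by a factor of 60, yet a request counter records each of them as one request.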

OpenTelemetry's GenAI conventions aren't perfect yet — the semantic conventions for streaming responses and multi-turn agents are still evolving. But the basics are there, they work, and they would have caught this client's $32k/month budget leak on day one if they'd been in place from the start.

If you're shipping LLM features without per-call token tracking, you're flying blind in a way that directly costs money. How long until someone notices?