Our AWS Bill Went Up 40% and Nobody Noticed for Three Months

A consulting engagement where we finally opened the cloud bill and found forgotten dev environments, runaway log storage, and a data pipeline reprocessing everything from scratch every night.


The finance team flagged it. Not engineering. Not ops. The person who noticed that AWS spend had climbed from $38K/month to $53K/month was an accountant who saw the line item on a quarterly review and asked, "Is this right?"

It was right. And nobody on the engineering side had any idea.

Nobody owned the bill

This was a 30-person engineering org building a B2B analytics platform. They had a solid product, decent architecture, and a deploy pipeline that worked. What they didn't have was anyone whose job it was to look at the cloud bill. The CTO had access to the AWS console. He'd glance at it "every few months." The infrastructure lead had set up cost alerts at $45K, but those emails were going to a shared inbox that nobody checked after the original infra engineer left six months earlier.

I've seen this pattern at maybe half the mid-size companies I've worked with. The bill is everyone's problem in theory, which means it's nobody's problem in practice.

Where the money was actually going

We spent two days tagging resources and mapping costs. The breakdown was almost comical in hindsight.

Forgotten dev environments: ~$4,200/month. Three full staging-style environments running 24/7 for projects that had been cancelled or shipped months ago. One was a proof-of-concept from the previous quarter that still had an RDS instance, two ECS services, and an ElastiCache cluster running. The engineer who'd spun it up had left the company.

Log storage: ~$3,800/month. CloudWatch log groups with no retention policy. Every log group was set to "never expire," which is the default. Eighteen months of debug-level logs from every service, just sitting in CloudWatch, quietly accumulating storage charges. The irony: nobody ever queried logs older than a week.
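
For the curious, spotting these groups takes only a few lines of boto3. This is a sketch of the kind of audit we ran, assuming default credentials and region; a group that omits the retentionInDays key is one set to "never expire":

# Sketch: list CloudWatch log groups with no retention policy and what they store.
import boto3

logs = boto3.client("logs")
total_bytes = 0
for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:  # "never expire" groups omit this key
            size = group.get("storedBytes", 0)
            total_bytes += size
            print(f"{group['logGroupName']}: {size / 1e9:.1f} GB, no retention")
print(f"Total never-expiring log storage: {total_bytes / 1e9:.1f} GB")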

The nightly data pipeline: ~$5,100/month. This was the big one. The analytics ingestion pipeline ran every night at 2 AM. At some point — nobody could pinpoint exactly when — someone had changed it from incremental processing to a full reprocessing of the entire dataset. Every night it spun up a fleet of spot instances, pulled the complete history from S3, transformed it, and loaded it into Redshift. The original incremental version processed maybe 50K records per run. The full reprocess was doing 14 million.

The commit message that introduced this change said "fix: ensure data consistency." Classic.

Oversized RDS instances: ~$2,400/month. The production database was running on a db.r6g.2xlarge. Average CPU utilization: 8%. Average memory usage: 12%. It had been sized for a traffic projection from two years ago that never materialized. Nobody had revisited it because the database "worked fine," and downsizing felt like unnecessary risk.
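
The utilization numbers came straight from CloudWatch. Here's a rough sketch of the CPU side, with a placeholder instance identifier; memory takes the FreeableMemory metric plus the instance's total RAM, so it's left out:

# Sketch: average CPU utilization for an RDS instance over the last 14 days.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-analytics-db"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Average"],
)
points = resp["Datapoints"]
avg = sum(p["Average"] for p in points) / max(len(points), 1)
print(f"Average CPU over the last 14 days: {avg:.1f}%")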

Note

The total waste was roughly $15,500/month — about 29% of the entire bill. None of it was caused by traffic growth or new features. It was all drift.

The fixes were boring

Setting CloudWatch log retention to 30 days: one CLI command per log group. We wrote a script that hit all 94 log groups in about two minutes. Immediate savings.
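
The script was nothing more than a loop over the log groups. A boto3 version looks roughly like this; 30 days was our choice for this team, not a universal answer:

# Sketch: set 30-day retention on every log group that has none.
# Equivalent to running `aws logs put-retention-policy` once per group.
import boto3

logs = boto3.client("logs")
for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            logs.put_retention_policy(
                logGroupName=group["logGroupName"], retentionInDays=30
            )
            print(f"Set 30-day retention on {group['logGroupName']}")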

Tearing down the abandoned environments took half a day, mostly because we had to verify with three different people that yes, these really were dead projects, and no, nobody needed that data.

Right-sizing the RDS instance required a maintenance window and some careful testing, but we dropped to a db.r6g.large and the database didn't even notice.
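
The change itself is a single API call scheduled for the maintenance window; roughly this, with a placeholder identifier:

# Sketch: change the instance class at the next maintenance window.
# ApplyImmediately=False defers the change so it doesn't restart the DB mid-day.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="prod-analytics-db",  # placeholder identifier
    DBInstanceClass="db.r6g.large",
    ApplyImmediately=False,
)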

The pipeline fix was the most involved. We reverted to incremental processing, but also added a checksum-based validation step so the "data consistency" concern that prompted the full reprocess was actually addressed. Took about a day and a half of engineering time.
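
Their pipeline code isn't mine to share, but the shape of the fix was: process only records newer than a watermark, then compare a cheap checksum of what was extracted against what was loaded before advancing the watermark, so a consistency bug fails loudly instead of silently. A stripped-down sketch, with stand-in functions for the S3 pull and the Redshift load:

# Conceptual sketch, not the team's actual pipeline code.
import hashlib

def batch_checksum(record_ids):
    """Order-independent checksum over a batch of record IDs."""
    digest = hashlib.sha256()
    for rid in sorted(record_ids):
        digest.update(str(rid).encode())
    return f"{len(record_ids)}:{digest.hexdigest()}"

def run_incremental(extract_since, load, last_watermark):
    """extract_since and load are stand-ins for the real S3 pull and Redshift load."""
    records = extract_since(last_watermark)        # only records newer than the watermark
    loaded_ids = load(records)                     # IDs the destination actually wrote
    if batch_checksum([r["id"] for r in records]) != batch_checksum(loaded_ids):
        raise RuntimeError("Consistency check failed; watermark not advanced")
    return max(r["updated_at"] for r in records) if records else last_watermark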

What we put in place

Two things, neither of them fancy.

First, we set up AWS Cost Explorer with daily email summaries to the infrastructure lead and CTO. Not a third-party tool, not a dashboard nobody would check — just a daily email with a cost graph and the top five services by spend. The kind of thing you glance at over coffee.
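
However the email gets delivered, the numbers behind it come from the Cost Explorer API. A sketch of the daily top-five query; delivery (SES, Slack, whatever you already have) is a separate, equally small piece:

# Sketch: yesterday's spend by service, top five, via the Cost Explorer API.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
yesterday = date.today() - timedelta(days=1)
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": yesterday.isoformat(), "End": date.today().isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)
groups = resp["ResultsByTime"][0]["Groups"]

def cost(g):
    return float(g["Metrics"]["UnblendedCost"]["Amount"])

for g in sorted(groups, key=cost, reverse=True)[:5]:
    print(f"{g['Keys'][0]}: ${cost(g):.2f}")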

Second, we added a tagging policy. Every resource needs a team, project, and environment tag. Untagged resources get flagged in a weekly Slack report. It's not enforced at the infrastructure level yet — that felt like too much friction to introduce all at once — but the visibility alone changed behavior. Engineers started cleaning up after themselves when they could see their name next to the cost.
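
The weekly report is a similarly small script: the Resource Groups Tagging API hands back every taggable resource with its tags, and the rest is filtering. A sketch with our three required keys; the Slack posting is left out:

# Sketch: flag resources missing any required tag.
import boto3

REQUIRED = {"team", "project", "environment"}  # our policy's required keys
tagging = boto3.client("resourcegroupstaggingapi")
for page in tagging.get_paginator("get_resources").paginate():
    for resource in page["ResourceTagMappingList"]:
        keys = {t["Key"] for t in resource.get("Tags", [])}
        missing = REQUIRED - keys
        if missing:
            print(f"{resource['ResourceARN']} missing: {', '.join(sorted(missing))}")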

The uncomfortable part

None of this was technically hard. Every fix was something a mid-level engineer could do in an afternoon. The problem wasn't capability. It was that cloud costs are invisible until someone makes them visible, and most engineering teams are incentivized to ship features, not to look at bills.

The $15K/month we saved was nice. But the real value was the team internalizing that infrastructure has a price tag, and that "it's running fine" and "it's running efficiently" are two very different statements.

I keep thinking about how long those forgotten environments would have kept running if the accountant hadn't asked. Probably until someone decommissioned the AWS account entirely.