ferkakta.dev

The $233 Day, Part 2: The Inference Iceberg

I posted the part 1 findings to the team thread — model switch, cache invalidation, 20× call volume, $173 training run. Case closed. The numbers were clean, the explanation was satisfying, and the model got reverted within the hour.

Except $173 was wrong. Not wrong in the analysis — the training run did cost that much. Wrong in scope. I’d found the visible part of the spend and stopped looking.

The CloudWatch daily aggregate for that Opus day showed 6,999 invocations. I’d attributed all of them to the training pipeline because the training pipeline was the thing I was investigating. But when I broke the same metric into hourly buckets, the distribution told a different story.

The training run happened between 03:58 and 05:31 UTC. During that roughly ninety-minute window, CloudWatch recorded about 545 Opus invocations: substantial, expensive, and entirely consistent with the cache-miss behavior I’d diagnosed in part 1. But 545 is not 6,999. The remaining roughly 6,400 calls arrived in a fat cluster between 10:00 and 13:00 UTC, four and a half to seven and a half hours after training finished and right in the middle of India business hours. The training pipeline had been idle for hours by then.
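Re-bucketing is one CloudWatch call: the same metric, queried with `Period=3600` instead of the daily rollup. The attribution step after that is small enough to sketch. The datapoints below are illustrative, shaped like the day in question rather than the real metric:

```python
from datetime import datetime, timezone

def split_by_window(hourly, start, end):
    """Split hourly (timestamp, count) datapoints into calls inside a
    window versus everything else. `hourly` is the shape you get back
    from re-querying the daily metric with Period=3600, Statistics=["Sum"]."""
    inside = sum(n for t, n in hourly if start <= t < end)
    return inside, sum(n for _, n in hourly) - inside

utc = timezone.utc
def hour(h):  # hypothetical date, real shape
    return datetime(2025, 1, 15, h, tzinfo=utc)

hourly = [(hour(4), 380), (hour(5), 165),                        # training window
          (hour(10), 2100), (hour(11), 2500), (hour(12), 1854)]  # business hours

training, elsewhere = split_by_window(hourly, hour(3), hour(6))
print(training, elsewhere)  # 545 6454
```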

I pulled pod logs across every service on the cluster for that window. One service had activity: the inference API, with 1,281 log lines during the heavy period. Every other service was silent. The logs showed entity matching inference running — the same entity matching system that trained with Opus overnight, but now serving live requests through the same model.

This is where the shared config did its damage. The model parameter in the orchestrator project configuration is a single field. Training reads it. Inference reads it. They’re different code paths in the same application, pulling from the same config source. When someone switched the model to Opus for training, they switched it for inference too. There was no separate knob, no override, no indication that the field served double duty. The engineer who changed the training model had no reason to think they were also changing the production inference model, because the config didn’t surface that relationship.
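The hazard reduces to a few lines. This is a toy reconstruction, not the orchestrator’s actual config schema; field and model names are mine:

```python
# Toy reconstruction of the shared-field hazard; names are illustrative.
CONFIG = {"model": "sonnet-4.6"}  # one field, two consumers

def training_model(cfg):
    return cfg["model"]  # the path the engineer meant to change

def inference_model(cfg):
    return cfg["model"]  # the path that changed with it

CONFIG["model"] = "opus-4.6"  # "just the training model"... except it isn't
print(inference_model(CONFIG))  # opus-4.6: production inference moved too
```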

I’ve made this exact kind of mistake. A config value that looks like it governs one thing but quietly governs three — you change it for the reason you can see and break things for the reasons you can’t. The problem isn’t carelessness. The problem is that a shared field with two consumers looks identical to a dedicated field with one consumer until the moment it doesn’t.

The cost split, once I had the hourly data and the per-call pricing, was brutal. Training accounted for about $14 of the day’s Bedrock bill. Health checks and miscellaneous overhead added another $9. Inference — the 6,400 calls I hadn’t been looking at — cost $210. I’d spent an hour forensically dissecting the $14 problem while the $210 problem ran in the background, generating invoices in real time.

The training run from part 1, the one I’d written up as a $173 disaster, was 6% of the actual spend. The inference iceberg underneath it was the other 94%.
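The split, as arithmetic. The 94% underneath is inference plus the health-check overhead:

```python
# Dollar figures from the hourly breakdown; shares rounded to whole percent.
costs = {"training": 14, "overhead": 9, "inference": 210}
total = sum(costs.values())
shares = {k: round(100 * v / total) for k, v in costs.items()}
print(total, shares)  # 233 {'training': 6, 'overhead': 4, 'inference': 90}
```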

Once I understood the inference path, I followed the model downstream. The orchestrator service doesn’t just run inference directly — it feeds results through a set of entity matching lambdas. The ramparts dev account has 32 of these lambdas. The prod account has another 18. Every one of them has the LiteLLM bearer token baked into its environment variables at deploy time. One token, 50 lambdas, two accounts.

LiteLLM sits between every service and Bedrock as a shared proxy. It handles routing, caching, and rate limiting. It does not handle per-caller attribution. There is one API key. Every service — training pipelines, inference apps, lambdas, health checks — authenticates with the same bearer token. When the CloudWatch bill arrives, you can see the total, and you can see the per-model breakdown, but you cannot see which service made which calls without cross-referencing timestamps against pod logs and lambda invocation records. I was doing that cross-referencing manually because no tooling existed to do it automatically.
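The manual cross-reference is automatable in principle. A sketch of what that tooling could look like; timestamps and service names are invented, and first-match-wins is a toy tie-break, not production-grade attribution:

```python
from bisect import bisect_left

def attribute_calls(call_times, service_logs, tolerance=60):
    """Attribute Bedrock call timestamps (epoch seconds) to whichever
    service logged activity within `tolerance` seconds of the call.
    service_logs maps service name -> sorted log timestamps."""
    attributed = {name: 0 for name in service_logs}
    attributed["unknown"] = 0
    for t in call_times:
        hit = "unknown"
        for name, times in service_logs.items():
            i = bisect_left(times, t)
            neighbors = [times[j] for j in (i - 1, i) if 0 <= j < len(times)]
            if any(abs(t - x) <= tolerance for x in neighbors):
                hit = name  # first matching service wins (toy tie-break)
                break
        attributed[hit] += 1
    return attributed

calls = [100, 5000, 9000]  # invented call timestamps
logs = {"inference-api": [95, 4990], "trainer": [9010]}
print(attribute_calls(calls, logs))  # {'inference-api': 2, 'trainer': 1, 'unknown': 0}
```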

Swapping that shared token — say, rotating it after a leak or segmenting it for attribution — means redeploying all 50 lambdas across both accounts. The token is not fetched at runtime from a secret store. It’s stamped into the lambda configuration at deploy time and stays there until the next deploy. A shared secret with a 50-service blast radius and no rotation path shorter than a full redeployment sweep.

Every “shared” in this stack was a blast radius multiplier. The shared model parameter meant a training change propagated to inference. The shared LiteLLM token meant every service looked identical in the billing data. The shared config field meant there was no way to give training Opus and inference Sonnet simultaneously without restructuring the project config. One commit touched one line in one config file, and the cost impact propagated through four surfaces: training config, inference config, the orchestrator’s model registry, and 50 lambda environment variables.

The fix I proposed isn’t monitoring or alerts — those are after-the-fact. The fix is zero-trust model budgets. Each service gets its own LiteLLM API key with an explicit model allocation. If your key is authorized for Sonnet 4.6, that’s what you can call. Opus 4.6 returns a 403 unless your key has an explicit Opus budget. The default allocation for any unassigned model is $0. Switching a model from Sonnet to Opus isn’t a config change anymore — it’s a budget reallocation that forces a conversation about cost before the first call lands.
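LiteLLM’s proxy can get close to this with virtual keys that carry per-key model lists and budgets; the decision rule itself, independent of any proxy, is small. Key names, model names, and dollar figures here are illustrative:

```python
# Toy enforcement of per-key model budgets; not LiteLLM's real schema.
BUDGETS = {
    "key-training":  {"opus-4.6": 50.0},     # explicit Opus allocation
    "key-inference": {"sonnet-4.6": 200.0},  # Sonnet only
}

class ModelNotAuthorized(Exception):
    pass

def authorize(api_key: str, model: str, est_cost: float) -> None:
    """Any model without an explicit allocation defaults to a $0 budget,
    which is the 403 in the proposal above."""
    allocations = BUDGETS.setdefault(api_key, {})
    budget = allocations.get(model, 0.0)
    if est_cost > budget:
        raise ModelNotAuthorized(f"{model} budget for this key is ${budget:.2f}")
    allocations[model] = budget - est_cost  # draw down on success

authorize("key-inference", "sonnet-4.6", 0.50)    # allowed, draws down
try:
    authorize("key-inference", "opus-4.6", 0.50)  # no Opus allocation
except ModelNotAuthorized as err:
    print(err)  # opus-4.6 budget for this key is $0.00
```

The useful property is the default: a model you never discussed is a model you cannot call, so the expensive path has to be opted into rather than opted out of.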

There’s a second piece worth building: a cache grace period. When a model switch invalidates the cache, the proxy could serve stale cached responses from the old model for a configurable window — say, an hour — while the new model’s cache warms up. You’d take a quality delta during the grace period, but you wouldn’t take a 20× volume spike. For inference workloads where yesterday’s Sonnet answer is better than no cache at all, this turns a cliff into a ramp.
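A sketch of that grace-period cache, with an injectable clock so the window is observable without waiting an hour. This is hypothetical design, not an existing proxy feature:

```python
import time

class GraceCache:
    """Cache that survives a model switch: entries written under the old
    model stay servable for a grace window while the new model warms up."""
    def __init__(self, grace=3600, clock=time.time):
        self.grace, self.clock = grace, clock
        self.entries = {}        # prompt -> (model, response)
        self.model = None        # currently active model
        self.switched_at = None  # when the active model last changed

    def set_model(self, model):
        if self.model is not None and model != self.model:
            self.switched_at = self.clock()  # open the grace window
        self.model = model

    def get(self, prompt):
        hit = self.entries.get(prompt)
        if hit is None:
            return None
        model, response = hit
        if model == self.model:
            return response  # normal hit
        in_grace = (self.switched_at is not None
                    and self.clock() - self.switched_at < self.grace)
        return response if in_grace else None  # stale-but-servable, or miss

    def put(self, prompt, response):
        self.entries[prompt] = (self.model, response)

# Fake clock to demonstrate the window.
now = [0.0]
cache = GraceCache(grace=3600, clock=lambda: now[0])
cache.set_model("sonnet-4.6")
cache.put("q1", "sonnet answer")
cache.set_model("opus-4.6")  # switch opens the grace window
print(cache.get("q1"))       # sonnet answer (stale, inside the window)
now[0] = 4000.0
print(cache.get("q1"))       # None (window closed, genuine miss)
```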

Neither of these existed when I started the investigation. The forensic sequence from part 1 — pull the data, measure before concluding — almost failed in part 2 because I found a satisfying answer and stopped. The training run was real, expensive, and fully explained. It was also a distraction from the larger problem that was only visible in the hourly breakdown I almost didn’t pull.

The $173 training run from part 1 was actually $14. The $233 day was $210 in inference, $14 in training, and $9 in health checks. The problem I’d spent the hour dissecting was one-fifteenth the size of the one I’d missed, because I investigated the anomaly I could see and never looked for the steady-state cost running underneath it.

Measure twice. Then measure something else. The cost of an LLM-integrated platform is not the cost of the thing that set off the alert — it’s the cost of every service that shares the same config, the same key, and the same model, running at steady state where nobody is watching.

The wrong answer cost $233 in one day. The right answer took pulling one more CloudWatch query.

#aws #bedrock #litellm #mlops #finops #platformengineering