The $173 Training Run
The Slack message landed at 3pm on a Wednesday: “model training successful, previously 20min, now 1h30m.” I had finished an EKS 1.32-to-1.33 upgrade on the ramparts cluster that morning. My upgrade, my timeline, my problem.
The first theory wrote itself. New cluster version, fresh nodes, cold image caches. I’d fixed a broken cluster autoscaler earlier that day — the old autoscaler deployment was pinned to a node selector that no longer matched after the upgrade, so pods were stacking up in Pending until I caught it. First-run penalties after a major version bump are real. Everyone on the call nodded. I almost typed up that explanation and moved on.
But I pulled the Kubeflow pipeline run details instead. Two runs side by side — last week’s healthy run and today’s slow one. The task breakdown told a different story:
Task                          Before     After
────────────────────────────  ─────────  ─────────
data-preparation              1m 12s     1m 15s
feature-engineering           2m 04s     2m 08s
train-and-create-artifact     9m 24s     83m 36s
evaluate-model                0m 48s     0m 51s
register-model                0m 33s     0m 35s
Five pipeline steps, four of them within seconds of their previous times. The infrastructure — the nodes, the scheduler, the network, the storage — was identical across all five. If the EKS upgrade had degraded something systemic, every step would have drifted. Instead, one step ballooned from 9 minutes to 83 minutes and everything else held steady. Infrastructure was exonerated in five rows.
So I diffed the pipeline parameters between the two runs. One line jumped out:
- llm_model_name: bedrock/claude-3-7-sonnet-v1
+ llm_model_name: us.anthropic.claude-opus-4-6-v1
Someone had switched the training pipeline from Sonnet 3.7 to Opus 4.6 between runs. No announcement, no PR comment, no Slack thread. The parameter just changed in the next commit to the pipeline config, and the scheduled training run picked it up.
I had my culprit — or so I thought. The second theory was even more intuitive than the first: Opus is a larger, more capable model. Larger models generate slower responses. A 9x slowdown from upgrading the model seemed perfectly reasonable. I could feel myself reaching for the keyboard to type “the model switch explains the latency, this is expected behavior” and close the loop.
I ran the benchmark instead. Ten calls each to Sonnet 3.7, Sonnet 4.6, and Opus 4.6, through the same LiteLLM proxy the training pipeline uses, with the same prompt format and token budget. The results:
Sonnet 3.7: avg 0.24s/call (σ 0.03)
Sonnet 4.6: avg 0.22s/call (σ 0.04)
Opus 4.6: avg 0.28s/call (σ 0.05)
All three models respond in under 300 milliseconds for the call pattern this pipeline uses. Opus is not meaningfully slower per-call than Sonnet. The second obvious explanation was wrong too.
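The benchmark itself was nothing fancy. Here is a sketch of the timing harness; the proxy URL and client wiring in the comment are illustrative assumptions, not the real config:

```python
import statistics
import time

def benchmark(call, n=10):
    """Invoke `call` n times and return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# In the real run, `call` posted the pipeline's prompt through the LiteLLM
# proxy via an OpenAI-compatible client (endpoint and key are hypothetical):
#   client = openai.OpenAI(base_url="http://litellm.internal:4000/v1", api_key="sk-...")
#   call = lambda: client.chat.completions.create(
#       model="us.anthropic.claude-opus-4-6-v1", max_tokens=256,
#       messages=[{"role": "user", "content": prompt}])
```

Routing all three models through the same proxy keeps network and proxy overhead inside the comparison, which is the latency the pipeline actually sees.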
Two theories down. The timing table said the problem was in train-and-create-artifact. The parameter diff said the problem correlated with the model switch. The benchmark said per-call latency was not the mechanism. Something else was multiplying the wall-clock time, and it was not the speed of individual responses.
I went to CloudWatch. Bedrock publishes per-model invocation metrics there, and you can filter by model ID and time range:
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name Invocations \
  --dimensions Name=ModelId,Value=anthropic.claude-3-7-sonnet-20250219-v1:0 \
  --start-time 2026-03-17T00:00:00Z \
  --end-time 2026-03-17T23:59:59Z \
  --period 86400 \
  --statistics Sum

aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name Invocations \
  --dimensions Name=ModelId,Value=anthropic.claude-opus-4-6-20250918-v1:0 \
  --start-time 2026-03-18T00:00:00Z \
  --end-time 2026-03-18T23:59:59Z \
  --period 86400 \
  --statistics Sum
Two queries, one for Tuesday’s Sonnet run and one for Wednesday’s Opus run. The numbers came back:
- Sonnet run: 338 invocations
- Opus run: 6,999 invocations
Not a latency problem. A volume problem. The training pipeline was making twenty times more API calls when Opus was the backend model. Same pipeline code, same training data, same task — but the LiteLLM proxy caches responses with a 24-hour TTL, and the cache keys include the model name. Every prompt that was a cache hit under Sonnet became a cache miss under Opus. The 338 Sonnet invocations were the uncached fraction of a much larger request volume. The 6,999 Opus invocations were the same volume with zero cache hits because the cache was warm for a model that was no longer being called.
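The mechanism is easiest to see in the shape of the cache key itself. This is a sketch, not LiteLLM’s actual implementation — the exact fields hashed are an assumption — but it shows the one property that matters: the model name is part of the key, so the same prompt under a new model can never hit the old entries.

```python
import hashlib
import json

def cache_key(model, messages, params=None):
    # Sketch of a proxy-style cache key: a stable hash over the model name
    # plus the request payload. Field names here are illustrative.
    payload = {"model": model, "messages": messages, "params": params or {}}
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

prompt = [{"role": "user", "content": "label this record"}]
k_sonnet = cache_key("bedrock/claude-3-7-sonnet-v1", prompt)
k_opus = cache_key("us.anthropic.claude-opus-4-6-v1", prompt)
assert k_sonnet != k_opus  # same prompt, different model -> guaranteed cache miss
```

With a 24-hour TTL, every entry written under the Sonnet key was still sitting in the cache, warm and useless, while every Opus request went straight to Bedrock.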
The cost math followed immediately. Opus input tokens price at roughly 3x Sonnet, and Opus output tokens price at roughly 5x Sonnet. Combine the per-token price increase with a 20x cache-miss volume increase:
- Sonnet run: 338 calls, ~$5 per training run
- Opus run: 6,999 calls, ~$173 per training run
The momcorp tenant’s training pipeline runs daily. At $5 per run with Sonnet, that’s $150 per month in Bedrock costs — a rounding error on the AWS bill. At $173 per run with Opus, that’s $5,200 per month. A one-line parameter change, committed without discussion, moved the Bedrock spend by 34x.
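The monthly figures fall straight out of the run counts. A back-of-envelope check, using the observed per-run costs rather than deriving them from token prices:

```python
# Observed cost per training run (USD), from the invocation tally above
sonnet_run, opus_run = 5, 173
runs_per_month = 30  # the pipeline runs daily

sonnet_month = sonnet_run * runs_per_month  # 150
opus_month = opus_run * runs_per_month      # 5,190 -- call it $5,200
multiplier = opus_month / sonnet_month      # ~34.6x
```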
I went back to the Slack thread and posted the actual findings: the EKS upgrade was not the cause, the slowdown was isolated to a single pipeline step, and the root cause was a model parameter change that invalidated the LiteLLM cache and sent every request to Bedrock uncached. I included the CloudWatch numbers and the cost projection. The model switch got reverted within the hour.
Three wrong theories in sequence, each one more intuitive than the last. The cluster did it — except four out of five pipeline steps were unaffected. The bigger model is slower — except per-call latency was identical. The model is more expensive per token — true, but that was 3-5x, not 34x. The real multiplier was cache invalidation, which nobody would have guessed without pulling the CloudWatch metrics.
The forensic sequence that got me there took about 45 minutes: pull the task breakdown, diff the parameters, benchmark the models, count the invocations. Each step eliminated one theory and pointed at the next. If I’d stopped at the first plausible explanation — my own EKS upgrade — I’d have spent days investigating infrastructure that was working fine, and the $173 training runs would have kept racking up in the background.
When an LLM-integrated pipeline slows down, don’t start with infrastructure. Start with diff. And when you find the model changed, don’t assume the latency changed — count the calls. The cost of an LLM pipeline is calls times tokens times price, and any of those three can move independently. But the one you’ll miss is the cache — a model switch doesn’t just change the price per call, it resets the cache and changes the number of calls.
The right answer took 45 minutes. The wrong answer would have cost $5,050 per month.
#aws #bedrock #kubeflow #litellm #mlops #finops #platformengineering