ferkakta.dev

An orderly EKS and Kubeflow upgrade path

When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.

The worst time to discover platform ambiguity is when finance and timelines are both tightening.

Our first impulse was to ask, “how quickly can we upgrade?”

The better question was, “what order of operations prevents us from compounding hidden drift during upgrade churn?”

Why one-shot upgrades fail in controller-heavy stacks

On paper, “upgrade EKS then bump Kubeflow” sounds linear.

In reality, controller-heavy platforms fail in cross-layer ways:

You can pass upgrade mechanics and still fail platform behavior.

That’s how teams burn days while every individual subsystem claims success.

The pressure is real, but so is the blast radius

Cost pressure from extended support is legitimate. Waiting indefinitely is also a risk.

But urgency has a predictable trap: treating the upgrade as a calendar event instead of a behavior-preservation exercise.

If you compress timelines without adding gates, you don’t remove risk. You defer it into production restart churn, where every hidden assumption gets re-tested at once.

That is the expensive path even when the ticket closes quickly.

We used a gate-first sequence

The sequence that made sense for us was not novel. It was disciplined.

  1. Freeze known-good baseline.
  2. Rehearse full path in dev.
  3. Promote the same gates to prod.

The key was what counts as “known-good.”

It was not only cluster version and pod health. It included reconcile inputs, RBAC assumptions, and controller restart behavior.

Step 1: freeze a reliability baseline

Before touching EKS versions, we captured two classes of state: what was actually running, and what the reconcilers believed should be running.

Practically, that meant snapshotting controller image digests, source-of-truth manifests, RBAC bindings, and reconcile inputs.

This converts “works on my cluster” into auditable pre-upgrade evidence.
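As a sketch, that evidence can be reduced to a deterministic baseline record. Everything here is illustrative: the input mappings would be collected from the cluster beforehand (for example, with kubectl), and the field names are not a real API.

```python
import hashlib
import json

def freeze_baseline(controller_images, reconcile_inputs):
    """Build an auditable pre-upgrade baseline record.

    controller_images: controller name -> running image digest.
    reconcile_inputs: source-of-truth manifest path -> content hash.
    Collecting these from the cluster is out of scope for this sketch.
    """
    record = {
        "controller_images": dict(sorted(controller_images.items())),
        "reconcile_inputs": dict(sorted(reconcile_inputs.items())),
    }
    # Canonical JSON so identical state always produces the same evidence ID.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["evidence_id"] = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return record

baseline = freeze_baseline(
    {"kfp-controller": "sha256:aaa", "kserve-controller": "sha256:bbb"},
    {"manifests/kubeflow.yaml": "sha256:ccc"},
)
print(baseline["evidence_id"])
```

The point of the hash is not security. It is that two people comparing pre- and post-upgrade state are comparing the same frozen thing.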

Step 2: prove restart safety in dev

Dev rehearsal is not just “does the apply complete.”

It should answer: if critical controllers restart repeatedly during upgrade churn, do they still come back cleanly?

Required gates included: every critical controller returns to Ready after a forced restart, RBAC checks still pass, and one real pipeline completes end to end.

If any gate fails in dev, it is not rehearsal noise. It is production prevention data.
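A restart-safety gate can be sketched as a pure decision over rehearsal observations. The observation shape and field names are assumptions for illustration; the actual data would come from forcing controller restarts in dev and watching readiness.

```python
def restart_gate(observations, max_ready_seconds=120):
    """Evaluate restart-safety evidence from a dev rehearsal.

    observations: one entry per forced controller restart, e.g.
      {"controller": "kserve-controller", "came_ready": True, "ready_seconds": 45}
    Returns (passed, failure_reasons).
    """
    failures = []
    for obs in observations:
        if not obs["came_ready"]:
            failures.append(f'{obs["controller"]}: never became Ready')
        elif obs["ready_seconds"] > max_ready_seconds:
            failures.append(
                f'{obs["controller"]}: Ready after {obs["ready_seconds"]}s '
                f'(budget {max_ready_seconds}s)'
            )
    return (len(failures) == 0, failures)
```

Keeping the gate as a pure function matters: the same check runs unchanged in dev rehearsal and at every prod boundary.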

Step 3: eliminate known drift classes before prod

We had drift classes that had already proved they could bite under churn: duplicate controller placements, namespace ambiguity, and manifests that had diverged from their source of truth.

Those had to be normalized before prod promotion.

The principle was straightforward:

do not carry known drift into an upgrade that will maximize restart and reconcile activity.
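Both drift classes above can be flagged mechanically before promotion. This is a sketch with illustrative data shapes; in practice the live tuples would be read from the cluster and the expected mapping from the source-of-truth manifests.

```python
from collections import defaultdict

def find_drift(live_deployments, source_of_truth):
    """Flag drift before prod promotion.

    live_deployments: list of (namespace, name, image) tuples observed live.
    source_of_truth: name -> (expected_namespace, expected_image).
    """
    drift = []
    placements = defaultdict(set)
    for ns, name, image in live_deployments:
        placements[name].add(ns)
        expected = source_of_truth.get(name)
        if expected and (ns, image) != expected:
            drift.append(f"{name}: live ({ns}, {image}) != source {expected}")
    # A controller deployed in more than one namespace is ambiguity waiting
    # for restart churn to expose it.
    for name, namespaces in placements.items():
        if len(namespaces) > 1:
            drift.append(f"{name}: duplicate placement in {sorted(namespaces)}")
    return drift
```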

Step 4: define explicit rollback posture before rollout

Most teams define rollback after they need it.

We treat rollback as part of forward planning: stop conditions are written down per phase, and the rollback path for each phase is defined before that phase starts.

This is not bureaucracy. It is latency control during incidents.

Without explicit rollback posture, every failure becomes a bespoke decision under pressure.
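One way to make that posture explicit is a per-phase table, decided before rollout, that maps symptoms to predefined actions. The phase names, symptoms, and actions below are hypothetical placeholders, not our actual runbook.

```python
# Hypothetical rollback posture, written down before rollout begins.
ROLLBACK_POSTURE = {
    "control-plane": {"stop_on": ["api-server unavailable"], "action": "halt-and-page"},
    "addons": {"stop_on": ["coredns not ready"], "action": "revert-addon-version"},
    "node-rotation": {"stop_on": ["pods pending > 10m"], "action": "pause-rotation"},
}

def rollback_action(phase, symptom):
    """Return the predefined action for a failure, or None if the symptom
    is not a stop condition for this phase (i.e. keep going, keep watching)."""
    posture = ROLLBACK_POSTURE.get(phase, {})
    if symptom in posture.get("stop_on", []):
        return posture["action"]
    return None
```

During an incident, the lookup is the whole point: the decision was already made when nobody was under pressure.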

Step 5: run prod as staged promotion, not migration theater

For production, we treat each phase as a gate boundary:

  1. control plane upgrade
  2. managed add-ons alignment
  3. node group rotation
  4. platform-level verification

At each boundary, we re-run the same evidence path from dev: controller restart health, RBAC checks, and one real pipeline.

No pass, no next phase.

That sounds conservative. It is faster than rollback archaeology.
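The staged-promotion loop itself is small. This sketch assumes the operator supplies `run_phase` and `run_gates` callables (both hypothetical names); the only logic that matters is that a failed gate halts everything.

```python
def staged_promotion(phases, run_phase, run_gates):
    """Run upgrade phases as gate boundaries: no pass, no next phase.

    phases: ordered phase names.
    run_phase: callable that executes one phase.
    run_gates: callable returning (passed, evidence) for one phase.
    """
    completed = []
    for phase in phases:
        run_phase(phase)
        passed, evidence = run_gates(phase)
        if not passed:
            # Halt here; later phases never run on top of a failed gate.
            return {"halted_at": phase, "completed": completed, "evidence": evidence}
        completed.append(phase)
    return {"halted_at": None, "completed": completed, "evidence": None}
```

Run against the four phases above, a gate failure at node group rotation leaves platform-level verification untouched, which is exactly the behavior that avoids rollback archaeology.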

What changed in our planning posture

We stopped discussing upgrades as version transitions and started discussing them as behavior-preservation projects.

Version movement is necessary.

Behavior preservation is the goal.

In this framing, the most valuable artifacts are not upgrade scripts. They are repeatable assertions about what must still be true after restart and reconcile.

The 10-part orderly upgrade path

  1. Freeze a known-good baseline (runtime images + reconciler inputs).
  2. Define a target compatibility matrix (EKS, Kubeflow, KServe, KFP).
  3. Rehearse the full sequence in dev, end to end.
  4. Standardize source-of-truth manifests before touching prod.
  5. Upgrade control plane first, then add-ons, then node groups.
  6. Run hard gates after every phase (controllers, RBAC checks, one real pipeline).
  7. Remove duplicate controller placements and namespace ambiguity.
  8. Predefine rollback posture and stop conditions before rollout.
  9. Promote to prod in staged gates, never one-shot migration theater.
  10. Re-snapshot and drift-audit after completion to reset baseline.

The practical checklist

If I had to compress this into one checklist for teams under support-cost pressure: baseline frozen, compatibility matrix defined, dev rehearsal passed end to end, known drift eliminated, rollback posture written down, prod promoted through staged gates.

Cost pressure should accelerate discipline, not bypass it.

The line I keep

An orderly upgrade is not about being cautious.

It is about making restart behavior and reconciler truth explicit before the platform makes them explicit for you.

#eks #kubernetes #kubeflow #reliability #platform-engineering