An orderly EKS and Kubeflow upgrade path
When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.
The worst time to discover platform ambiguity is when finance and timelines are both tightening.
Our first impulse was to ask, “how quickly can we upgrade?”
The better question was, “what order of operations prevents us from compounding hidden drift during upgrade churn?”
Why one-shot upgrades fail in controller-heavy stacks
On paper, “upgrade EKS then bump Kubeflow” sounds linear.
In reality, controller-heavy platforms fail in cross-layer ways:
- control plane and node lifecycle churn cause restarts
- restarts expose latent RBAC drift
- reconcilers re-render from stale defaults
- partial version alignment creates rollback confusion
You can pass upgrade mechanics and still fail platform behavior.
That’s how teams burn days while every individual subsystem claims success.
The pressure is real, but so is the blast radius
Cost pressure from extended support is legitimate. Waiting indefinitely is also a risk.
But urgency has a predictable trap: treating the upgrade as a calendar event instead of a behavior-preservation exercise.
If you compress timelines without adding gates, you don’t remove risk. You defer it into production restart churn, where every hidden assumption gets re-tested at once.
That is the expensive path even when the ticket closes quickly.
We used a gate-first sequence
The sequence that made sense for us was not novel. It was disciplined.
- Freeze known-good baseline.
- Rehearse full path in dev.
- Promote the same gates to prod.
The key was what counts as “known-good.”
It was not only cluster version and pod health. It included reconcile inputs, RBAC assumptions, and controller restart behavior.
Step 1: freeze a reliability baseline
Before touching EKS versions, we captured two classes of state:
- version/reconciler surface
- controller permission surface
Practically, that meant snapshotting:
- core component image tags
- KFP component tags
- reconcile-driving config values (appVersion, controller env overrides)
- critical service-account can-i checks
This converts “works on my cluster” into auditable pre-upgrade evidence.
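As a minimal sketch of that snapshot, a few kubectl one-liners cover both surfaces. Namespace, service-account name, verbs, and file layout here are illustrative assumptions, not a prescribed scheme:

```shell
#!/bin/sh
# Sketch: freeze a pre-upgrade baseline into plain files.
# Namespace "kubeflow", SA "pipeline-runner", and file names are assumptions.
set -eu
mkdir -p baseline

# Runtime/reconciler surface: one line per deployment with its image tags.
kubectl get deploy -n kubeflow \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}' \
  > baseline/kubeflow-images.txt

# Controller permission surface: record can-i answers for a critical SA.
# can-i exits non-zero on "no", so keep the answer and continue either way.
kubectl auth can-i create workflows.argoproj.io -n kubeflow \
  --as=system:serviceaccount:kubeflow:pipeline-runner \
  > baseline/rbac-can-i.txt || true
```

Both files are diffable later, which is what makes the baseline evidence rather than folklore.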
Step 2: prove restart safety in dev
Dev rehearsal is not just “did the apply complete.”
It should answer: if critical controllers restart repeatedly during upgrade churn, do they still come back cleanly?
Required gate examples:
- workflow-controller healthy after restart
- kserve-controller-manager healthy after restart
- ml-pipeline-ui and ml-pipeline-ui-artifact converge to intended images
- sample pipeline execution completes
- RBAC smoke checks all pass
If any gate fails in dev, it is not rehearsal noise. It is production prevention data.
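A restart-safety gate can be as blunt as forcing a restart and requiring convergence. A sketch against a dev cluster, assuming these controllers are Deployments in the kubeflow namespace (adjust names, kinds, and timeout for your install):

```shell
#!/bin/sh
# Sketch: dev restart-safety gate. Deployment names and the 180s timeout
# are assumptions from a typical Kubeflow install.
set -eu
NS=kubeflow
for deploy in workflow-controller kserve-controller-manager ml-pipeline-ui; do
  # Force the restart churn an upgrade will cause anyway.
  kubectl rollout restart deploy/"$deploy" -n "$NS"
  # Gate: the controller must converge cleanly, or the script fails here.
  kubectl rollout status deploy/"$deploy" -n "$NS" --timeout=180s
done
echo "restart gates passed"
```

A non-zero exit from `rollout status` is exactly the prevention data the step above describes: the failure happened in dev, on purpose, instead of in prod by surprise.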
Step 3: eliminate known drift classes before prod
We had drift classes that had already proven they could bite under churn:
- version identity split across runtime and reconcile defaults
- RBAC subject mismatch across namespace placements
- controller-managed children edited directly instead of source inputs
Those had to be normalized before prod promotion.
The principle was straightforward:
do not carry known drift into an upgrade that will maximize restart and reconcile activity.
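The RBAC-subject-mismatch class can be made mechanically checkable with a plain diff over exported subject lists. A sketch, assuming you have dumped binding subjects from dev and prod into one-line-per-subject files (the file format is an assumption of this sketch):

```shell
# check_subject_drift: compare two captured RBAC subject lists (e.g. exported
# with kubectl from dev and prod) and fail loudly if they diverge.
# Assumed line format: kind<TAB>namespace<TAB>name, one subject per line.
check_subject_drift() {
  dev=$(sort "$1")
  prod=$(sort "$2")
  if [ "$dev" = "$prod" ]; then
    echo "OK: binding subjects aligned"
  else
    echo "DRIFT: $1 and $2 disagree -- normalize before prod promotion"
    return 1
  fi
}
```

Running this before promotion turns “we think RBAC matches” into a pass/fail answer.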
Step 4: define explicit rollback posture before rollout
Most teams define rollback after they need it.
We treat rollback as part of forward planning:
- keep previous node group capacity available until controller gates pass
- preserve prior manifest/config snapshots for quick re-apply
- define stop conditions per phase (what failure blocks progression)
This is not bureaucracy. It is latency control during incidents.
Without explicit rollback posture, every failure becomes a bespoke decision under pressure.
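Stop conditions are cheap to encode up front. A minimal sketch of a gate wrapper that prints the predefined posture instead of leaving the on-call to improvise (the message wording is illustrative):

```shell
# gate: run one named check command; on failure, print the predefined stop
# condition rather than improvising a decision mid-incident. Sketch only --
# the posture message is illustrative.
gate() {
  name=$1; shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "STOP: $name -- hold this phase; previous node group capacity stays up"
    return 1
  fi
}
```

Usage would look like `gate "workflow-controller healthy" kubectl -n kubeflow rollout status deploy/workflow-controller --timeout=120s` (names assumed as above): the stop condition is decided when the gate is written, not when it fires.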
Step 5: run prod as staged promotion, not migration theater
For production, we treat each phase as a gate boundary:
- control plane upgrade
- managed add-ons alignment
- node group rotation
- platform-level verification
At each boundary, we re-run the same evidence path from dev:
- version snapshot
- RBAC smoke checks
- critical controller health
- one real workflow/pipeline validation
No pass, no next phase.
That sounds conservative. It is faster than rollback archaeology.
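The “no pass, no next phase” rule fits in a few lines of shell. A sketch of a phase runner, where the check commands passed in are assumed wrappers around the snapshot, RBAC, health, and pipeline validations above:

```shell
# run_phase: execute gate commands in order for one promotion phase and stop
# the whole promotion on the first failure ("no pass, no next phase").
# Checks are passed as simple command names; multi-word commands would need
# a wrapper script, which is an assumption of this sketch.
run_phase() {
  phase=$1; shift
  echo "== phase: $phase =="
  for check in "$@"; do
    if ! $check; then
      echo "HALT at $phase: gate '$check' failed"
      return 1
    fi
  done
  echo "gate passed: $phase"
}
```

Calling `run_phase control-plane snapshot-check rbac-smoke controller-health` and only then `run_phase add-ons ...` makes the gate boundaries structural instead of aspirational.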
What changed in our planning posture
We stopped discussing upgrades as version transitions and started discussing them as behavior-preservation projects.
Version movement is necessary.
Behavior preservation is the goal.
In this framing, the most valuable artifacts are not upgrade scripts. They are repeatable assertions about what must still be true after restart and reconcile.
The 10-part orderly upgrade path
- Freeze a known-good baseline (runtime images + reconciler inputs).
- Define a target compatibility matrix (EKS, Kubeflow, KServe, KFP).
- Rehearse the full sequence in dev, end to end.
- Standardize source-of-truth manifests before touching prod.
- Upgrade control plane first, then add-ons, then node groups.
- Run hard gates after every phase (controllers, RBAC checks, one real pipeline).
- Remove duplicate controller placements and namespace ambiguity.
- Predefine rollback posture and stop conditions before rollout.
- Promote to prod in staged gates, never one-shot migration theater.
- Re-snapshot and drift-audit after completion to reset baseline.
The practical checklist
If I had to compress this into one checklist for teams under support-cost pressure:
- capture runtime + reconciler version matrix before touching EKS
- codify RBAC smoke checks for controller service accounts
- validate restart safety of critical controllers in dev
- align prod/dev binding subjects and reconcile defaults before promotion
- predefine rollback posture and stop conditions
- promote with explicit gates, not calendar urgency
Cost pressure should accelerate discipline, not bypass it.
The line I keep
An orderly upgrade is not about being cautious.
It is about making restart behavior and reconciler truth explicit before the platform makes them explicit for you.
#eks #kubernetes #kubeflow #reliability #platform-engineering