An orderly EKS and Kubeflow upgrade path
When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.
The worst time to discover platform ambiguity is when finance and timelines are both tightening.
Our first impulse was to ask, “how quickly can we upgrade?”
The better question was, “what order of operations prevents us from compounding hidden drift during upgrade churn?”
Why one-shot upgrades fail in controller-heavy stacks
On paper, “upgrade EKS then bump Kubeflow” sounds linear.
In reality, controller-heavy platforms fail in cross-layer ways:
- control plane and node lifecycle churn cause restarts
- restarts expose latent RBAC drift
- reconcilers re-render from stale defaults
- partial version alignment creates rollback confusion
You can pass upgrade mechanics and still fail platform behavior.
That’s how teams burn days while every individual subsystem claims success.
The pressure is real, but so is the blast radius
Cost pressure from extended support is legitimate. Waiting indefinitely is also a risk.
But urgency has a predictable trap: treating the upgrade as a calendar event instead of a behavior-preservation exercise.
If you compress timelines without adding gates, you don’t remove risk. You defer it into production restart churn, where every hidden assumption gets re-tested at once.
That is the expensive path even when the ticket closes quickly.
We used a gate-first sequence
The sequence that made sense for us was not novel. It was disciplined.
- Freeze known-good baseline.
- Rehearse full path in dev.
- Promote the same gates to prod.
The key was what counts as “known-good.”
It was not only cluster version and pod health. It included reconcile inputs, RBAC assumptions, and controller restart behavior.
Step 1: freeze a reliability baseline
Before touching EKS versions, we captured two classes of state:
- version/reconciler surface
- controller permission surface
Practically, that meant snapshotting:
- core component image tags
- KFP component tags
- reconcile-driving config values (appVersion, controller env overrides)
- critical service-account can-i checks
This converts “works on my cluster” into auditable pre-upgrade evidence.
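As a minimal sketch of that snapshot, a few kubectl one-liners cover both surfaces. Namespace, service-account name, verbs, and file layout here are illustrative assumptions, not a prescribed scheme:

```shell
#!/bin/sh
# Sketch: freeze a pre-upgrade baseline into plain files.
# Namespace "kubeflow", SA "pipeline-runner", and file names are assumptions.
set -eu
mkdir -p baseline

# Runtime/reconciler surface: one line per deployment with its image tags.
kubectl get deploy -n kubeflow \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}' \
  > baseline/kubeflow-images.txt

# Controller permission surface: record can-i answers for a critical SA.
# can-i exits non-zero on "no", so keep the answer and continue either way.
kubectl auth can-i create workflows.argoproj.io -n kubeflow \
  --as=system:serviceaccount:kubeflow:pipeline-runner \
  > baseline/rbac-can-i.txt || true
```

Both files are diffable later, which is what makes the baseline evidence rather than folklore.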
Step 2: prove restart safety in dev
Dev rehearsal is not just “did the apply complete.”
It should answer: if critical controllers restart repeatedly during upgrade churn, do they still come back cleanly?
Required gate examples:
- workflow-controller healthy after restart
- kserve-controller-manager healthy after restart
- ml-pipeline-ui and ml-pipeline-ui-artifact converge to intended images
- sample pipeline execution completes
- RBAC smoke checks all pass
If any gate fails in dev, it is not rehearsal noise. It is production prevention data.
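A restart-safety gate can be as blunt as forcing a restart and requiring convergence. A sketch against a dev cluster, assuming these controllers are Deployments in the kubeflow namespace (adjust names, kinds, and timeout for your install):

```shell
#!/bin/sh
# Sketch: dev restart-safety gate. Deployment names and the 180s timeout
# are assumptions from a typical Kubeflow install.
set -eu
NS=kubeflow
for deploy in workflow-controller kserve-controller-manager ml-pipeline-ui; do
  # Force the restart churn an upgrade will cause anyway.
  kubectl rollout restart deploy/"$deploy" -n "$NS"
  # Gate: the controller must converge cleanly, or the script fails here.
  kubectl rollout status deploy/"$deploy" -n "$NS" --timeout=180s
done
echo "restart gates passed"
```

A non-zero exit from `rollout status` is exactly the prevention data the step above describes: the failure happened in dev, on purpose, instead of in prod by surprise.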
Step 3: eliminate known drift classes before prod
We had drift classes that had already proven they could bite under churn:
- version identity split across runtime and reconcile defaults
- RBAC subject mismatch across namespace placements
- controller-managed children edited directly instead of source inputs
Those had to be normalized before prod promotion.
The principle was straightforward:
do not carry known drift into an upgrade that will maximize restart and reconcile activity.
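The RBAC-subject-mismatch class can be made mechanically checkable with a plain diff over exported subject lists. A sketch, assuming you have dumped binding subjects from dev and prod into one-line-per-subject files (the file format is an assumption of this sketch):

```shell
# check_subject_drift: compare two captured RBAC subject lists (e.g. exported
# with kubectl from dev and prod) and fail loudly if they diverge.
# Assumed line format: kind<TAB>namespace<TAB>name, one subject per line.
check_subject_drift() {
  dev=$(sort "$1")
  prod=$(sort "$2")
  if [ "$dev" = "$prod" ]; then
    echo "OK: binding subjects aligned"
  else
    echo "DRIFT: $1 and $2 disagree -- normalize before prod promotion"
    return 1
  fi
}
```

Running this before promotion turns “we think RBAC matches” into a pass/fail answer.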
Step 4: define explicit rollback posture before rollout
Most teams define rollback after they need it.
We treat rollback as part of forward planning:
- keep previous node group capacity available until controller gates pass
- preserve prior manifest/config snapshots for quick re-apply
- define stop conditions per phase (what failure blocks progression)
This is not bureaucracy. It is latency control during incidents.
Without explicit rollback posture, every failure becomes a bespoke decision under pressure.
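Stop conditions are cheap to encode up front. A minimal sketch of a gate wrapper that prints the predefined posture instead of leaving the on-call to improvise (the message wording is illustrative):

```shell
# gate: run one named check command; on failure, print the predefined stop
# condition rather than improvising a decision mid-incident. Sketch only --
# the posture message is illustrative.
gate() {
  name=$1; shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "STOP: $name -- hold this phase; previous node group capacity stays up"
    return 1
  fi
}
```

Usage would look like `gate "workflow-controller healthy" kubectl -n kubeflow rollout status deploy/workflow-controller --timeout=120s` (names assumed as above): the stop condition is decided when the gate is written, not when it fires.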
Step 5: run prod as staged promotion, not migration theater
For production, we treat each phase as a gate boundary:
- control plane upgrade
- managed add-ons alignment
- node group rotation
- platform-level verification
At each boundary, we re-run the same evidence path from dev:
- version snapshot
- RBAC smoke checks
- critical controller health
- one real workflow/pipeline validation
No pass, no next phase.
That sounds conservative. It is faster than rollback archaeology.
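The “no pass, no next phase” rule fits in a few lines of shell. A sketch of a phase runner, where the check commands passed in are assumed wrappers around the snapshot, RBAC, health, and pipeline validations above:

```shell
# run_phase: execute gate commands in order for one promotion phase and stop
# the whole promotion on the first failure ("no pass, no next phase").
# Checks are passed as simple command names; multi-word commands would need
# a wrapper script, which is an assumption of this sketch.
run_phase() {
  phase=$1; shift
  echo "== phase: $phase =="
  for check in "$@"; do
    if ! $check; then
      echo "HALT at $phase: gate '$check' failed"
      return 1
    fi
  done
  echo "gate passed: $phase"
}
```

Calling `run_phase control-plane snapshot-check rbac-smoke controller-health` and only then `run_phase add-ons ...` makes the gate boundaries structural instead of aspirational.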
What changed in our planning posture
We stopped discussing upgrades as version transitions and started discussing them as behavior-preservation projects.
Version movement is necessary.
Behavior preservation is the goal.
In this framing, the most valuable artifacts are not upgrade scripts. They are repeatable assertions about what must still be true after restart and reconcile.
The 10-part orderly upgrade path
- Freeze a known-good baseline (runtime images + reconciler inputs).
- Define a target compatibility matrix (EKS, Kubeflow, KServe, KFP).
- Rehearse the full sequence in dev, end to end.
- Standardize source-of-truth manifests before touching prod.
- Upgrade control plane first, then add-ons, then node groups.
- Run hard gates after every phase (controllers, RBAC checks, one real pipeline).
- Remove duplicate controller placements and namespace ambiguity.
- Predefine rollback posture and stop conditions before rollout.
- Promote to prod in staged gates, never one-shot migration theater.
- Re-snapshot and drift-audit after completion to reset baseline.
The practical checklist
If I had to compress this into one checklist for teams under support-cost pressure:
- capture runtime + reconciler version matrix before touching EKS
- codify RBAC smoke checks for controller service accounts
- validate restart safety of critical controllers in dev
- align prod/dev binding subjects and reconcile defaults before promotion
- predefine rollback posture and stop conditions
- promote with explicit gates, not calendar urgency
Cost pressure should accelerate discipline, not bypass it.
The line I keep
An orderly upgrade is not about being cautious.
It is about making restart behavior and reconciler truth explicit before the platform makes them explicit for you.
#eks #kubernetes #kubeflow #reliability #platform-engineering