Kubeflow is a version matrix, not a version
“What version of Kubeflow are we on?”
That looks like a simple platform inventory question.
In practice, it was one of the most misleading questions in our incident.
We had already fixed one visible symptom — image reconciliation behavior that kept reverting a frontend component — when we started asking version questions to prevent recurrence.
The expected answer was one number.
The real answer was a matrix.
The false confidence moment
The dangerous moment was not when something failed. It was when everything looked green enough to stop looking.
ml-pipeline-ui was running ghcr.io/kubeflow/kfp-frontend:2.5.0. The pod was healthy. The page loaded. At a glance, “upgrade complete” sounded defensible.
Then we checked controller inputs and found pipeline-install-config.appVersion still at 2.0.5.
So we had two truths at once:
- runtime leaf looked upgraded
- reconciler defaults still pointed to older version policy
That is the exact shape of false confidence in controller-heavy systems. You don’t fail immediately. You wait until a reconcile trigger asks the system which truth should win.
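That comparison can be reduced to a small drift probe. A minimal sketch, assuming the namespace and resource names from this incident (`kubeflow`, `ml-pipeline-ui`, `pipeline-install-config`); the commented kubectl lines show how the inputs would come from a live cluster:

```shell
# check_drift: flag disagreement between the running image tag and the
# reconciler's configured default. Pure string comparison, no cluster needed.
check_drift() {
  runtime="$1"
  config="$2"
  if [ "$runtime" != "$config" ]; then
    echo "DRIFT: runtime=$runtime reconciler-default=$config"
  else
    echo "CONVERGED: $runtime"
  fi
}

# On a live cluster the inputs would come from kubectl (names assumed
# from this incident; verify them in your environment):
#   runtime=$(kubectl -n kubeflow get deploy ml-pipeline-ui \
#     -o jsonpath='{.spec.template.spec.containers[0].image}' | awk -F: '{print $NF}')
#   config=$(kubectl -n kubeflow get configmap pipeline-install-config \
#     -o jsonpath='{.data.appVersion}')

check_drift "2.5.0" "2.0.5"
# → DRIFT: runtime=2.5.0 reconciler-default=2.0.5
```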
We got multiple answers, and they were all true
Core control-plane components were on v1.8.0 tags.
Pipelines configuration still carried 2.0.5 in pipeline-install-config.appVersion.
A live frontend deployment was running ghcr.io/kubeflow/kfp-frontend:2.5.0 because we had explicitly overridden controller inputs.
Other pipeline-adjacent components still referenced 2.0.5 images.
If I answered “Kubeflow 1.8,” I would be directionally correct and operationally incomplete.
If I answered “KFP 2.5.0,” I would be correct for one visible path and wrong for controller defaults that still mattered.
If I answered “2.0.5,” I would be accurate for a key config source but not for current runtime behavior in all child resources.
The contradiction is only apparent if you assume this system is single-versioned.
It is not.
The mistake was semantic, not technical
I treated “version” as identity.
In this stack, version behaves more like policy distributed across layers:
- control-plane image tags
- controller env defaults
- configmap values consumed at reconcile time
- child resources rendered from parent reconciliation
- manually introduced overrides
You don’t have one truth. You have interacting truths.
During calm periods, this feels like complexity.
During incidents, it becomes risk surface.
Why this matters operationally
Mixed version state does not just confuse dashboards.
It changes what gets reconciled next.
In our case, pipeline-install-config.appVersion was not decorative metadata. It flowed into KFP_VERSION for the profile-controller path.
Without explicit image overrides, reconciliation could pull child specs back toward older defaults.
That is how teams end up saying “we upgraded this” and “it reverted” in the same hour while both statements are locally correct.
The system is doing what it was told. It was told inconsistent things.
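The precedence can be modeled as a toy function. This is a hedged sketch of the behavior we observed, not the profile-controller's actual code: an explicit override on the controller wins, and absent one, the configmap's appVersion becomes the tag the next reconcile writes back.

```shell
# resolve_frontend_tag: toy model of the precedence that bit us.
# An explicit controller override (e.g. FRONTEND_TAG) wins; otherwise
# pipeline-install-config.appVersion becomes the default on reconcile.
resolve_frontend_tag() {
  override="$1"
  app_version="$2"
  if [ -n "$override" ]; then
    echo "$override"
  else
    echo "$app_version"
  fi
}

resolve_frontend_tag ""      "2.0.5"   # no override: sync pulls back to 2.0.5
resolve_frontend_tag "2.5.0" "2.0.5"   # pinned override: 2.5.0 survives the sync
```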
What worked was making the matrix explicit
The fix pattern was not glamorous.
We stopped asking for one canonical number and started inventorying the version surface explicitly:
- core Kubeflow component tags
- KFP component tags
- controller override configmaps
- pipeline-install-config.appVersion
- active controller locations and service-account bindings
Then we aligned intent where drift was dangerous.
Concretely:
- pinned explicit frontend image/tag in controller env
- updated pipeline-install-config.appVersion to match runtime intent
- kept visualization image/tag explicit while transitioning
- verified reconciled child specs after parent-triggered sync
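The frontend pin itself was small. A sketch of the controller env, not the exact manifest: the variable names FRONTEND_IMAGE and FRONTEND_TAG match what our snapshot reports, but the surrounding Deployment structure here is assumed.

```yaml
# Hedged sketch of the explicit pin in the profile-controller's env.
# With these set, reconciliation stops falling back to appVersion defaults.
env:
  - name: FRONTEND_IMAGE
    value: ghcr.io/kubeflow/kfp-frontend
  - name: FRONTEND_TAG
    value: "2.5.0"
```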
The question changed from “what version are we on?” to “which layer will win on next reconcile?”
That is a much more useful incident question.
The version matrix scoreboard
We now use a simple scoreboard before calling a platform “upgraded”:
- Runtime surface: current deployment image tags for critical components.
- Reconciler defaults: controller env and configmaps that generate child specs.
- Reconcile outcome: post-sync child specs in managed namespaces.
- Permission path: RBAC for service accounts that execute reconcile loops.
If any lane disagrees, the upgrade is not done. It is partially converged.
That wording matters because “partially converged” invites deliberate follow-up, while “done” invites drift to mature quietly.
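The scoreboard reduces to a single predicate. A minimal sketch with illustrative values, not live cluster reads; the lane inputs are whatever evidence your snapshot gathers for each surface:

```shell
# scoreboard: "done" only when every lane agrees; anything else is
# "partially converged". Inputs are illustrative strings, not live reads.
scoreboard() {
  runtime="$1"; defaults="$2"; outcome="$3"; rbac="$4"
  if [ "$runtime" = "$defaults" ] && [ "$defaults" = "$outcome" ] && [ "$rbac" = "ok" ]; then
    echo "done"
  else
    echo "partially converged"
  fi
}

scoreboard "2.5.0" "2.0.5" "2.0.5" "ok"   # mixed lanes
scoreboard "2.5.0" "2.5.0" "2.5.0" "ok"   # aligned lanes
```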
Version matrix is normal in controller-heavy platforms
Kubeflow is not unique in having a distributed version surface.
Any platform built from many operators/controllers with different release cadences will behave similarly. The more reconciliation boundaries you have, the less meaningful a single semantic version becomes for runtime truth.
This is why I now separate:
- marketing/version identity (useful for procurement and broad planning)
- reconciliation/runtime identity (useful for reliability)
Both are valid. Only one helps you survive a restart under pressure.
Practical baseline going forward
We added a simple version snapshot script to print what matters in one pass:
kubeflow-version-snapshot.sh: https://gist.github.com/fizz/307ce198f24c78b55a721f80971e491e
- core image tags
- KFP component images
- config values that actually steer reconciliation
- inferred release line with explicit caveats
Sample output from mlinfra-prod:
Kubeflow Version Snapshot
Context: mlinfra-prod
Namespace: kubeflow
== Core Kubeflow Components ==
centraldashboard ...:v1.8.0
profiles-deployment ...:v1.8.0
volumes-web-app-deployment ...:v1.8.0
== KFP Components ==
ml-pipeline-ui ghcr.io/kubeflow/kfp-frontend:2.5.0
ml-pipeline gcr.io/ml-pipeline/api-server:2.0.5
metadata-writer gcr.io/ml-pipeline/metadata-writer:2.0.5
...
== KFP Install Config ==
pipeline-install-config.appVersion: 2.5.0
profile-controller FRONTEND_IMAGE: ghcr.io/kubeflow/kfp-frontend
profile-controller FRONTEND_TAG: 2.5.0
== Inferred Kubeflow Release Line (best effort) ==
core tag: v1.8.0
inferred kubeflow line: v1.8.x
It is intentionally boring.
That is the point.
Reliability work improves when the same question gets the same evidence every time.
The line I keep from this
If your platform has multiple reconcilers, one version number is a story, not a state model.
Kubeflow is a version matrix, not a version.
#kubeflow #kubernetes #platform-engineering #reliability #versioning