ferkakta.dev

When a namespace owns your deployment

I spent a Friday morning trying to update one image tag.

Old image: gcr.io/ml-pipeline/frontend:2.0.5. New image: ghcr.io/kubeflow/kfp-frontend:2.5.0.

The deployment accepted the edit. Then it snapped back. I edited again. It snapped back again.

At first, I treated this as a normal ownership chain problem: Deployment -> ReplicaSet -> Pod. If my edit is getting reverted, some higher-level controller must be writing the deployment. Fair enough. Find the controller, patch the source, move on.

The part that broke my mental model was the owner reference.

The deployment’s owner was Namespace/admin.
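The surprise is visible right in the object's metadata. Here is a minimal sketch of pulling the controlling owner out of a resource's `ownerReferences` (the metadata shape is the real Kubernetes API; the specific values are reconstructed from this incident for illustration):

```python
def controller_owner(metadata):
    """Return the ownerReference marked controller: true, if any."""
    return next(
        (ref for ref in metadata.get("ownerReferences", [])
         if ref.get("controller")),
        None,
    )

# Shaped like the deployment in this incident (values illustrative):
meta = {
    "name": "ml-pipeline-ui",
    "ownerReferences": [
        {"apiVersion": "v1", "kind": "Namespace",
         "name": "admin", "controller": True},
    ],
}

owner = controller_owner(meta)
print(owner["kind"] + "/" + owner["name"])  # Namespace/admin
```

A ReplicaSet's controlling owner is a Deployment; seeing `Namespace` in that slot was the first hard evidence that something non-default was managing this object.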

I had always treated namespace as a container. Not an active parent object. Not something that owns a deployment in the reconciliation graph. A namespace groups resources. It doesn’t normally author them.

In plain Kubernetes workflows, that assumption is mostly safe.

In this Kubeflow setup, it was wrong.

Namespace was the parent object

What actually existed was a metacontroller path where the parent object is the namespace, and child resources are synthesized from controller logic. Deployments, services, and related objects are rendered from that parent state.
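A hedged sketch of what that contract looks like. Metacontroller-style sync hooks receive the parent object and return the desired children; the request/response field names below mirror that protocol, but the rendering logic, names, and default tag are illustrative, not the real hook:

```python
# What the controller believes, e.g. seeded from KFP_VERSION (illustrative).
DEFAULT_TAG = "2.0.5"

def sync(parent, controller_config):
    """Render desired children from the parent namespace's state."""
    ns = parent["metadata"]["name"]
    tag = controller_config.get("frontendTag", DEFAULT_TAG)
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "ml-pipeline-ui", "namespace": ns},
        "spec": {"template": {"spec": {"containers": [
            {"name": "frontend",
             "image": f"gcr.io/ml-pipeline/frontend:{tag}"},
        ]}}},
    }
    # Note what is absent: the current cluster state. The hook returns
    # desired state, and the controller reconciles children toward it,
    # which is exactly why a hand-edited image tag cannot survive.
    return {"status": {}, "children": [deployment]}

desired = sync({"metadata": {"name": "admin"}}, {})
image = desired["children"][0]["spec"]["template"]["spec"]["containers"][0]["image"]
print(image)  # gcr.io/ml-pipeline/frontend:2.0.5
```

Every reconcile re-renders the children from the parent plus controller config, so the only durable edit is an edit to one of those two inputs.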

That means two things operationally:

First, direct edits to child objects are temporary by design. If the controller believes desired state is different, your manual patch is just an intermediate state until the next reconcile loop.

Second, a namespace metadata change can trigger a reconcile. Not because namespaces are magical, but because this controller has intentionally chosen namespace as its parent type.

I wasn’t fighting Kubernetes defaults. I was fighting a custom reconciliation contract I didn’t know was there.

The system behaved exactly as designed. My model was stale.

The image wasn’t the whole story

The top-level symptom was image reversion, but the root cause was a split source of truth.

ml-pipeline-ui was on 2.5.0 in one place, while controller inputs still pointed at 2.0.5 defaults in another place. We had runtime overrides, historical manifests, and config maps carrying version intent at different layers.

In this stack, pipeline-install-config.appVersion is not decorative metadata. It feeds KFP_VERSION into kubeflow-pipelines-profile-controller. If explicit image env vars are unset, that value can steer reconciled image tags.
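The precedence logic reduces to a fallback chain. This is a sketch under assumptions: `KFP_VERSION` is the env var named above, but `FRONTEND_IMAGE` and the resolution order are hypothetical stand-ins for however the real controller wires its overrides:

```python
def resolve_frontend_image(env):
    """Explicit image env var wins; otherwise KFP_VERSION steers the tag.

    FRONTEND_IMAGE is a hypothetical name for illustration; the real
    profile controller reads its own env var set.
    """
    explicit = env.get("FRONTEND_IMAGE")
    if explicit:
        return explicit
    tag = env.get("KFP_VERSION", "2.0.5")
    return f"gcr.io/ml-pipeline/frontend:{tag}"

# No explicit override: the config map's appVersion value steers the tag.
print(resolve_frontend_image({"KFP_VERSION": "2.0.5"}))
# gcr.io/ml-pipeline/frontend:2.0.5

# Pinned image env var: wins regardless of appVersion.
print(resolve_frontend_image({
    "FRONTEND_IMAGE": "ghcr.io/kubeflow/kfp-frontend:2.5.0",
    "KFP_VERSION": "2.0.5",
}))
# ghcr.io/kubeflow/kfp-frontend:2.5.0
```

The second call is why pinning the env vars explicitly (see below) closed the loophole: it removed the path where a stale `appVersion` could quietly win.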

So we had a cluster that could look upgraded at the leaf, while still carrying downgrade pressure in controller inputs.

That’s why this class of incident keeps feeling haunted. You patch the symptom and it holds until something reconciles from upstream config.

Crash loops were a drift detector

During cleanup, two controllers fell into CrashLoopBackOff: workflow-controller and kserve-controller-manager.

This looked unrelated at first. It wasn’t.

Logs showed plain RBAC "forbidden" errors on list and watch calls. No exotic bug. No race in business logic. Just permissions that no longer matched where the service accounts were running.

One cluster role binding pointed KServe manager permissions at kserve/kserve-controller-manager while an active controller path used kubeflow/kserve-controller-manager. Another path had no cluster binding for the kubeflow/argo service account needed by workflow-controller’s watches.
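This class of mismatch is mechanically checkable. A minimal sketch over binding data shaped like the KServe case, with the check reduced to comparing subject namespaces (the binding name is illustrative; the subject values mirror the incident):

```python
def misplaced_subjects(binding, expected_ns):
    """Return ServiceAccount subjects whose namespace differs from where
    the controller's pods actually run."""
    return [
        s for s in binding.get("subjects", [])
        if s.get("kind") == "ServiceAccount"
        and s.get("namespace") != expected_ns
    ]

# Shaped like the KServe binding in this incident (name illustrative):
binding = {
    "metadata": {"name": "kserve-manager-binding"},
    "subjects": [
        {"kind": "ServiceAccount",
         "name": "kserve-controller-manager",
         "namespace": "kserve"},
    ],
}

# The controller actually ran in kubeflow/, so this subject is stranded:
print(misplaced_subjects(binding, expected_ns="kubeflow"))
```

The same shape of check, run over every binding that names a controller service account, would also have flagged the missing grant for `kubeflow/argo` before the restarts did.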

Pods restarted and immediately ran into that drift.

That was the pattern across this incident: restart churn didn’t cause the problem. It surfaced latent configuration debt at full speed.

The useful model now

I no longer ask "who owns this deployment" as if ownership were only the native workload hierarchy.

I ask:

Who is the parent object here, and is it the kind I expect?
Which controller inputs carry desired state, and where does version intent actually live?
What triggers a reconcile, and what will the next reconcile overwrite?
Do the RBAC bindings match the namespaces where the service accounts actually run?

Those four questions collapse a lot of Kubernetes mystery theater.

The strongest practical heuristic from this incident is simple: edit source, not children.

If an object is controller-managed, direct kubectl edit deployment is incident response duct tape. Durable change is always at controller input — config map, CR, chart values, or generated manifest source.

What we changed

We pinned profile-controller env vars explicitly for frontend image/tag, aligned pipeline-install-config.appVersion with runtime intent, fixed cluster RBAC for workflow and KServe controllers, and standardized RBAC subject sets across dev and prod.

Then we added two scripts and published them as standalone gists.

The scripts are boring by design. They replace intuition with evidence.

The boundary that mattered

The incident wasn’t “Kubeflow is bad” and it wasn’t “operator magic is bad.” The system had a clear boundary; we just didn’t keep it visible.

Namespace was not a passive bucket in this architecture. It was the parent object of a controller workflow.

Once we accepted that boundary, the fixes were straightforward.

Systems behave correctly. The costly part is recognizing which system you’re actually operating.

#kubernetes #kubeflow #metacontroller #incident #platform-engineering