Drift is an availability bug
I used to think of drift as a config hygiene issue.
Annoying, expensive, embarrassing — but fundamentally administrative.
Then I watched two control-plane components fall into CrashLoopBackOff inside a production incident and realized the framing was wrong.
Drift is not a paperwork problem. Drift is an availability bug.
The incident looked like random failure
We were already deep in one fire: a Kubeflow Pipelines frontend image that kept reverting to an old tag.
Manual edits “worked” and then disappeared. Reconciliation behavior was non-obvious. Ownership did not look like the standard Kubernetes story.
While we were stabilizing that, unrelated-looking failures appeared:
- workflow-controller crash looping
- kserve-controller-manager crash looping
At first glance, this looked like cascading platform chaos. Too many moving parts. Hard to isolate.
The logs were less dramatic than the symptoms.
They were full of “forbidden”.
No novel crash signature. No impossible race. No kernel-level weirdness. Just service accounts attempting list/watch calls they were no longer authorized to perform.
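Those denied calls are easy to replay by hand. A minimal sketch, assuming a Kubeflow-style install — the namespace, service account, and resources below are placeholders, not the exact ones from the incident:

```shell
#!/bin/sh
# Replay the kind of list/watch authorization checks that were failing.
# Namespace, service account, and resources are assumptions; substitute
# whatever your controller actually touches at startup.
KUBECTL="${KUBECTL:-echo kubectl}"  # dry-run by default; set KUBECTL=kubectl against a real cluster
SA="system:serviceaccount:kubeflow:workflow-controller"
results=""
for verb in list watch; do
  for resource in workflows.argoproj.io configmaps; do
    # $KUBECTL is intentionally unquoted so the dry-run "echo kubectl" splits into words
    results="$results$($KUBECTL auth can-i "$verb" "$resource" --as="$SA" --all-namespaces)
"
  done
done
printf '%s' "$results"
```

In dry-run mode this prints the commands it would run; against a real cluster, any “no” is the same answer the controller gets at startup, just without the crash loop.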
The timeline was short and deterministic
The sequence in production was tight enough that it felt like everything was breaking at once.
- Frontend image issue surfaced first through repeated reconciliation reverts.
- Controller restarts followed after rollout and reconcile activity.
- Startup permission checks failed immediately for affected service accounts.
- Kubernetes restarted failing pods on loop.
The important detail is what did not happen.
No component degraded slowly. No hidden memory pressure. No intermittent network split.
Each crash happened exactly where startup logic touched API permissions that no longer matched reality.
That is why I now treat this class of failure as deterministic debt execution, not probabilistic bad luck.
Restart churn executed old debt
The critical shift for me was causality.
The restarts did not create the defect. The restarts executed state that was already wrong.
In one path, a cluster role binding pointed to a service account namespace that did not match where the active controller was running.
In another path, workflow-controller lacked expected cluster-scope permissions and kept failing as soon as it tried to initialize watches.
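The first path is mechanical to detect: compare the namespace a ClusterRoleBinding subject points at with the namespace the active controller runs in. A sketch, with the binding name, pod label, and canned values invented for illustration:

```shell
#!/bin/sh
# Detect subject-namespace drift in a ClusterRoleBinding.
# Binding name, label selector, and canned values are assumptions.
KUBECTL="${KUBECTL:-}"  # leave unset to exercise the comparison with canned values
field() {  # field <canned-value> <kubectl args...>
  canned="$1"; shift
  if [ -n "$KUBECTL" ]; then "$KUBECTL" "$@"; else echo "$canned"; fi
}
# Namespace the binding's subject points at.
bound_ns=$(field kubeflow-dev get clusterrolebinding workflow-controller-binding \
  -o 'jsonpath={.subjects[0].namespace}')
# Namespace the controller actually runs in.
running_ns=$(field kubeflow get pods -A -l app=workflow-controller \
  -o 'jsonpath={.items[0].metadata.namespace}')
if [ "$bound_ns" != "$running_ns" ]; then verdict=DRIFT; else verdict=OK; fi
echo "binding subject ns=$bound_ns, controller ns=$running_ns -> $verdict"
```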
Everything behaved deterministically.
Pods came up. Controllers initialized. RBAC checks failed. Process exited. Kubernetes restarted it. Repeat.
From the outside, it looked chaotic.
From the inside, it was a loop around stale control-plane assumptions.
Why this class of outage hides until pressure
Drift in controllers and RBAC is often latent during calm periods.
If a pod stays up for weeks, nobody notices that the next restart will fail a permission check at startup.
If reconciler input and rendered child state are misaligned, nobody notices until a reconcile trigger or rollout lands at the wrong time.
Scale-to-zero patterns, spot interruptions, node recycling, and routine restarts don’t “cause” these incidents.
They are drift detectors.
They force systems to re-prove assumptions you stopped testing.
That framing changed how I think about reliability in clusters with controller-heavy platforms like Kubeflow.
I no longer ask, “is this stable today?”
I ask, “if every critical controller restarts in the next ten minutes, do we still have a platform?”
The debugging path got simpler when we treated it as reliability work
Once we stopped treating this as “Kubeflow weirdness” and treated it as control-plane reliability, the response sharpened.
We asked:
- which service account is actually running this controller?
- what exact list/watch calls does startup require?
- does kubectl auth can-i return yes for those calls right now?
- what binding subject drift exists between dev and prod?
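The checklist above maps directly onto commands. A sketch for one controller — the deployment, namespace, and resource names are assumptions:

```shell
#!/bin/sh
# Answer the checklist with evidence, not intuition.
# Deployment, namespace, and resource names are assumptions.
KUBECTL="${KUBECTL:-echo kubectl}"  # dry-run by default
NS=kubeflow
DEPLOY=kserve-controller-manager
# Which service account is actually running this controller?
sa=$($KUBECTL -n "$NS" get deploy "$DEPLOY" \
  -o 'jsonpath={.spec.template.spec.serviceAccountName}')
echo "service account: $sa"
# Do the startup list/watch calls return yes right now?
checks=""
for verb in list watch; do
  checks="$checks$($KUBECTL auth can-i "$verb" inferenceservices.serving.kserve.io \
    --as="system:serviceaccount:$NS:$sa" --all-namespaces)
"
done
printf '%s' "$checks"
```

The fourth question — subject drift between dev and prod — is the same script run against both contexts, with the outputs diffed.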
That moved us from intuition to evidence quickly.
We fixed bindings, re-ran permission checks, restarted, and got clean rollouts.
No heroics. No superstition.
Just boundary validation.
Process changed after this incident
The operational changes were not glamorous, but they were specific.
First, we stopped accepting “probably fine” controller state.
We now run RBAC smoke checks as first-class post-change gates for critical service accounts. If can-i fails, the rollout is blocked before we discover it through crash loops.
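A gate of that shape can be a few lines of shell. This is a sketch, not our exact gate; the service accounts and resources are assumptions to wire to your own critical controllers:

```shell
#!/bin/sh
# RBAC smoke gate: block the rollout when any required check
# does not answer "yes". SAs and resources are assumptions.
KUBECTL="${KUBECTL:-}"  # leave unset to exercise the gate logic with canned answers
can_i() {  # can_i <verb> <resource> <service-account>
  if [ -n "$KUBECTL" ]; then
    "$KUBECTL" auth can-i "$1" "$2" --as="$3" --all-namespaces
  else
    echo yes  # canned answer for the dry run
  fi
}
status=0
for sa in system:serviceaccount:kubeflow:workflow-controller \
          system:serviceaccount:kubeflow:kserve-controller-manager; do
  for verb in list watch; do
    if [ "$(can_i "$verb" configmaps "$sa")" != "yes" ]; then
      echo "BLOCK: $sa cannot $verb configmaps"
      status=1
    fi
  done
done
if [ "$status" -eq 0 ]; then gate=PASS; else gate=BLOCK; fi
echo "RBAC gate: $gate"
# In CI, end with: exit "$status"
```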
Second, we aligned prod and dev subject bindings where drift had become normalized. “It works in this namespace here” is not an acceptable steady state if it encodes hidden assumptions that break under restart.
Third, we made reconciler inputs explicit where we had been relying on implicit defaults.
In this incident, pipeline-install-config.appVersion looked informational but influenced runtime behavior through controller env wiring. Treating that value as metadata instead of control input was part of the drift surface.
Fourth, we codified visibility.
Version snapshots and RBAC checks moved from ad hoc troubleshooting into repeatable scripts. The scripts are simple by design. The value is not cleverness. The value is that we can ask the same questions every time.
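In that spirit, a snapshot script can be as plain as this sketch — resource names assume a Kubeflow-style install, and the value is only that it asks identical questions on every run:

```shell
#!/bin/sh
# Repeatable version/RBAC snapshot: same questions, every time.
# Resource names are assumptions based on a Kubeflow-style install.
KUBECTL="${KUBECTL:-echo kubectl}"  # dry-run by default
snapshot=""
record() { snapshot="${snapshot}${1}: ${2}
"; }
record images "$($KUBECTL -n kubeflow get deploy \
  -o 'jsonpath={range .items[*]}{.metadata.name}={.spec.template.spec.containers[0].image} {end}')"
record appVersion "$($KUBECTL -n kubeflow get configmap pipeline-install-config \
  -o 'jsonpath={.data.appVersion}')"
record rbac "$($KUBECTL auth can-i list workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:workflow-controller --all-namespaces)"
printf '%s' "$snapshot"
```

Diffing two snapshots taken before and after a change is usually enough to spot the drift surface without heroics.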
Availability posture changed
I used to file drift under cleanliness.
Now I file drift under uptime.
If a controller restart can take down platform behavior, RBAC and reconciliation drift are in your critical path.
That makes them availability concerns, full stop.
Drift is an availability bug.
#kubernetes #kubeflow #rbac #reliability #platform-engineering