<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kubernetes on ferkakta.dev</title><link>https://ferkakta.dev/tags/kubernetes/</link><description>Recent content in Kubernetes on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://ferkakta.dev/tags/kubernetes/index.xml" rel="self" type="application/rss+xml"/><item><title>One module block per service per tenant</title><link>https://ferkakta.dev/one-module-block-per-service-per-tenant/</link><pubDate>Fri, 27 Mar 2026 00:00:00 -0500</pubDate><guid>https://ferkakta.dev/one-module-block-per-service-per-tenant/</guid><description>&lt;p&gt;Every tenant on my platform gets three services: an API server, an auth service, and a frontend. Each one is a single module block in Terraform that creates a Kubernetes deployment, a ClusterIP service, an ALB ingress, IRSA (IAM Roles for Service Accounts) for AWS access, ESO-synced secrets from SSM, and a feature flag discovery mechanism. The module is the same for all three services. The variables are different.&lt;/p&gt;
&lt;p&gt;I extracted it into an open source module because I kept explaining the design decisions to people who asked &amp;ldquo;how do you deploy services to EKS?&amp;rdquo; and the answer was always &amp;ldquo;let me show you the module.&amp;rdquo; The module is the answer.&lt;/p&gt;</description></item><item><title>from feature_flags import *</title><link>https://ferkakta.dev/from-feature-flags-import-star/</link><pubDate>Wed, 25 Mar 2026 21:00:00 -0500</pubDate><guid>https://ferkakta.dev/from-feature-flags-import-star/</guid><description>&lt;p&gt;A colleague needed a feature flag enabled on one tenant. &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; — one environment variable, one pod. I added it to the K8s secret manually, restarted the pod, and he was unblocked in two minutes.&lt;/p&gt;
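&lt;p&gt;For illustration only (this is not the platform&amp;rsquo;s actual loader; names here are hypothetical), the convention is easy to sketch: the service treats every &lt;code&gt;FEATURE_FLAG_*&lt;/code&gt; environment variable as a boolean flag.&lt;/p&gt;

```python
import os

def load_feature_flags(environ=None):
    """Collect FEATURE_FLAG_* environment variables as booleans.

    Illustrative sketch only; the real service's loader may differ.
    """
    environ = os.environ if environ is None else environ
    prefix = "FEATURE_FLAG_"
    flags = {}
    for key, value in environ.items():
        if key.startswith(prefix):
            name = key[len(prefix):].lower()
            # Accept the usual truthy spellings; everything else is off.
            flags[name] = value.strip().lower() in ("true", "1", "yes", "on")
    return flags
```

&lt;p&gt;Under that convention, setting &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; on one pod flips exactly one flag for one tenant.&lt;/p&gt;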
&lt;p&gt;Then I realized: the next &lt;code&gt;terraform apply&lt;/code&gt; would overwrite that secret without the flag. The ExternalSecret syncs from SSM, and the flag wasn&amp;rsquo;t in SSM through any path Terraform knew about. My manual fix had a shelf life of one deploy.&lt;/p&gt;</description></item><item><title>An orderly EKS and Kubeflow upgrade path</title><link>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</guid><description>&lt;p&gt;When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.&lt;/p&gt;
&lt;p&gt;The worst time to discover platform ambiguity is when finance and timelines are both tightening.&lt;/p&gt;
&lt;p&gt;Our first impulse was to ask, &amp;ldquo;how quickly can we upgrade?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The better question was, &amp;ldquo;what order of operations prevents us from compounding hidden drift during upgrade churn?&amp;rdquo;&lt;/p&gt;
&lt;h2 id="why-one-shot-upgrades-fail-in-controller-heavy-stacks"&gt;Why one-shot upgrades fail in controller-heavy stacks&lt;/h2&gt;
&lt;p&gt;On paper, &amp;ldquo;upgrade EKS then bump Kubeflow&amp;rdquo; sounds linear.&lt;/p&gt;</description></item><item><title>Drift is an availability bug</title><link>https://ferkakta.dev/drift-is-an-availability-bug/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/drift-is-an-availability-bug/</guid><description>&lt;p&gt;I used to think of drift as a config hygiene issue.&lt;/p&gt;
&lt;p&gt;Annoying, expensive, embarrassing — but fundamentally administrative.&lt;/p&gt;
&lt;p&gt;Then I watched two control-plane components fall into &lt;code&gt;CrashLoopBackOff&lt;/code&gt; inside a production incident and realized the framing was wrong.&lt;/p&gt;
&lt;p&gt;Drift is not a paperwork problem. Drift is an availability bug.&lt;/p&gt;
&lt;h2 id="the-incident-looked-like-random-failure"&gt;The incident looked like random failure&lt;/h2&gt;
&lt;p&gt;We were already deep in one fire: a Kubeflow Pipelines frontend image that kept reverting to an old tag.&lt;/p&gt;</description></item><item><title>Kubeflow is a version matrix, not a version</title><link>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</guid><description>&lt;p&gt;&amp;ldquo;What version of Kubeflow are we on?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That looks like a simple platform inventory question.&lt;/p&gt;
&lt;p&gt;In practice, it was one of the most misleading questions in our incident.&lt;/p&gt;
&lt;p&gt;We had already fixed one visible symptom — image reconciliation behavior that kept reverting a frontend component — when we started asking version questions to prevent recurrence.&lt;/p&gt;
&lt;p&gt;The expected answer was one number.&lt;/p&gt;
&lt;p&gt;The real answer was a matrix.&lt;/p&gt;
&lt;h2 id="the-false-confidence-moment"&gt;The false confidence moment&lt;/h2&gt;
&lt;p&gt;The dangerous moment was not when something failed. It was when everything looked green enough to stop looking.&lt;/p&gt;</description></item><item><title>When a namespace owns your deployment</title><link>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</guid><description>&lt;p&gt;I spent a Friday morning trying to update one image tag.&lt;/p&gt;
&lt;p&gt;Old image: &lt;code&gt;gcr.io/ml-pipeline/frontend:2.0.5&lt;/code&gt;.
New image: &lt;code&gt;ghcr.io/kubeflow/kfp-frontend:2.5.0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The deployment accepted the edit. Then it snapped back. I edited again. It snapped back again.&lt;/p&gt;
&lt;p&gt;At first, I treated this as a normal ownership chain problem: &lt;code&gt;Deployment -&amp;gt; ReplicaSet -&amp;gt; Pod&lt;/code&gt;. If my edit is getting reverted, some higher-level controller must be writing the deployment. Fair enough. Find the controller, patch the source, move on.&lt;/p&gt;</description></item><item><title>Making a Kopf operator idempotent: three-layer existence checks and the redisReady race</title><link>https://ferkakta.dev/kopf-operator-idempotency-three-layer-check/</link><pubDate>Fri, 20 Feb 2026 12:00:00 -0500</pubDate><guid>https://ferkakta.dev/kopf-operator-idempotency-three-layer-check/</guid><description>&lt;p&gt;Our tenant operator provisions databases, cache users, and credentials for each tenant in a multi-tenant SaaS platform. PostgreSQL roles on shared RDS, ElastiCache RBAC users, SSM parameters with generated passwords. It worked exactly once per tenant. The second time it ran, it regenerated every password and overwrote every SSM parameter. Running services holding the old credentials immediately lost their database and cache connections.&lt;/p&gt;
&lt;p&gt;This was the blocker for auto-deploy.&lt;/p&gt;
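&lt;p&gt;The shape of the fix is easy to state and harder to retrofit: guard every side effect with an existence check, so a re-fired create handler becomes a no-op. A sketch with in-memory stand-ins (helper and parameter names here are hypothetical, not the operator&amp;rsquo;s real API):&lt;/p&gt;

```python
import secrets

def ensure_tenant_credentials(tenant, ssm, db_roles, cache_users):
    """Idempotent provisioning sketch: only create what does not exist.

    ssm, db_roles, and cache_users are dict stand-ins for SSM parameters,
    PostgreSQL roles, and ElastiCache users; the real operator queries
    each system instead.
    """
    param_name = "/tenants/%s/db-password" % tenant

    # Layer 1: the stored credential. Reuse it if present; never rotate
    # on a re-applied CRD.
    password = ssm.get(param_name)
    if password is None:
        password = secrets.token_urlsafe(24)
        ssm[param_name] = password

    # Layer 2: the database role. Create only when absent.
    if tenant not in db_roles:
        db_roles[tenant] = password

    # Layer 3: the cache user. Same rule.
    if tenant not in cache_users:
        cache_users[tenant] = password

    return password
```

&lt;p&gt;Run it twice and the second pass returns the same password without touching any of the three layers.&lt;/p&gt;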
&lt;h2 id="every-deploy-was-a-coordinated-outage"&gt;Every deploy was a coordinated outage&lt;/h2&gt;
&lt;p&gt;The orchestrator runs &lt;code&gt;terraform apply&lt;/code&gt; for each tenant on every deploy. Terraform reconciles the Tenant CRD, which fires Kopf&amp;rsquo;s &lt;code&gt;on_tenant_create&lt;/code&gt; handler. The handler doesn&amp;rsquo;t distinguish between &amp;ldquo;new tenant&amp;rdquo; and &amp;ldquo;existing tenant whose CRD was re-applied.&amp;rdquo; It generates fresh passwords, creates new PostgreSQL roles (which fail because the role exists, or worse, succeed and orphan the old one), and overwrites SSM parameters with credentials that no running pod knows about.&lt;/p&gt;</description></item></channel></rss>