<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kubernetes on ferkakta.dev</title><link>https://ferkakta.dev/tags/kubernetes/</link><description>Recent content in Kubernetes on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://ferkakta.dev/tags/kubernetes/index.xml" rel="self" type="application/rss+xml"/><item><title>One module block per service per tenant</title><link>https://ferkakta.dev/one-module-block-per-service-per-tenant/</link><pubDate>Fri, 27 Mar 2026 00:00:00 -0500</pubDate><guid>https://ferkakta.dev/one-module-block-per-service-per-tenant/</guid><description>&lt;p&gt;Every tenant on my platform gets three services: an API server, an auth service, and a frontend. Each one is a single module block in Terraform that creates a Kubernetes deployment, a ClusterIP service, an ALB ingress, IRSA (IAM Roles for Service Accounts) for AWS access, ESO-synced secrets from SSM, and a feature flag discovery mechanism. The module is the same for all three services. The variables are different.&lt;/p&gt;
&lt;p&gt;I extracted it into an open source module because I kept explaining the design decisions to people who asked &amp;ldquo;how do you deploy services to EKS?&amp;rdquo; and the answer was always &amp;ldquo;let me show you the module.&amp;rdquo; The module is the answer.&lt;/p&gt;</description></item><item><title>from feature_flags import *</title><link>https://ferkakta.dev/from-feature-flags-import-star/</link><pubDate>Wed, 25 Mar 2026 21:00:00 -0500</pubDate><guid>https://ferkakta.dev/from-feature-flags-import-star/</guid><description>&lt;p&gt;A colleague needed a feature flag enabled on one tenant. &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; — one environment variable, one pod. I added it to the K8s secret manually, restarted the pod, and he was unblocked in two minutes.&lt;/p&gt;
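&lt;p&gt;For illustration only (this is not the platform&amp;rsquo;s actual loader; names here are hypothetical), the convention is easy to sketch: the service treats every &lt;code&gt;FEATURE_FLAG_*&lt;/code&gt; environment variable as a boolean flag.&lt;/p&gt;

```python
import os

def load_feature_flags(environ=None):
    """Collect FEATURE_FLAG_* environment variables as booleans.

    Illustrative sketch only; the real service's loader may differ.
    """
    environ = os.environ if environ is None else environ
    prefix = "FEATURE_FLAG_"
    flags = {}
    for key, value in environ.items():
        if key.startswith(prefix):
            name = key[len(prefix):].lower()
            # Accept the usual truthy spellings; everything else is off.
            flags[name] = value.strip().lower() in ("true", "1", "yes", "on")
    return flags
```

&lt;p&gt;Under that convention, setting &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; on one pod flips exactly one flag for one tenant.&lt;/p&gt;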
&lt;p&gt;Then I realized: the next &lt;code&gt;terraform apply&lt;/code&gt; would overwrite that secret without the flag. The ExternalSecret syncs from SSM, and the flag wasn&amp;rsquo;t in SSM through any path Terraform knew about. My manual fix had a shelf life of one deploy.&lt;/p&gt;</description></item><item><title>An orderly EKS and Kubeflow upgrade path</title><link>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</guid><description>&lt;p&gt;When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.&lt;/p&gt;
&lt;p&gt;The worst time to discover platform ambiguity is when finance and timelines are both tightening.&lt;/p&gt;
&lt;p&gt;Our first impulse was to ask, &amp;ldquo;how quickly can we upgrade?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The better question was, &amp;ldquo;what order of operations prevents us from compounding hidden drift during upgrade churn?&amp;rdquo;&lt;/p&gt;
&lt;h2 id="why-one-shot-upgrades-fail-in-controller-heavy-stacks"&gt;Why one-shot upgrades fail in controller-heavy stacks&lt;/h2&gt;
&lt;p&gt;On paper, &amp;ldquo;upgrade EKS then bump Kubeflow&amp;rdquo; sounds linear.&lt;/p&gt;</description></item><item><title>Drift is an availability bug</title><link>https://ferkakta.dev/drift-is-an-availability-bug/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/drift-is-an-availability-bug/</guid><description>&lt;p&gt;I used to think of drift as a config hygiene issue.&lt;/p&gt;
&lt;p&gt;Annoying, expensive, embarrassing — but fundamentally administrative.&lt;/p&gt;
&lt;p&gt;Then I watched two control-plane components fall into &lt;code&gt;CrashLoopBackOff&lt;/code&gt; inside a production incident and realized the framing was wrong.&lt;/p&gt;
&lt;p&gt;Drift is not a paperwork problem. Drift is an availability bug.&lt;/p&gt;
&lt;h2 id="the-incident-looked-like-random-failure"&gt;The incident looked like random failure&lt;/h2&gt;
&lt;p&gt;We were already deep in one fire: a Kubeflow Pipelines frontend image that kept reverting to an old tag.&lt;/p&gt;</description></item><item><title>Kubeflow is a version matrix, not a version</title><link>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</guid><description>&lt;p&gt;&amp;ldquo;What version of Kubeflow are we on?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That looks like a simple platform inventory question.&lt;/p&gt;
&lt;p&gt;In practice, it was one of the most misleading questions in our incident.&lt;/p&gt;
&lt;p&gt;We had already fixed one visible symptom — image reconciliation behavior that kept reverting a frontend component — when we started asking version questions to prevent recurrence.&lt;/p&gt;
&lt;p&gt;The expected answer was one number.&lt;/p&gt;
&lt;p&gt;The real answer was a matrix.&lt;/p&gt;
&lt;h2 id="the-false-confidence-moment"&gt;The false confidence moment&lt;/h2&gt;
&lt;p&gt;The dangerous moment was not when something failed. It was when everything looked green enough to stop looking.&lt;/p&gt;</description></item><item><title>When a namespace owns your deployment</title><link>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</guid><description>&lt;p&gt;I spent a Friday morning trying to update one image tag.&lt;/p&gt;
&lt;p&gt;Old image: &lt;code&gt;gcr.io/ml-pipeline/frontend:2.0.5&lt;/code&gt;.
New image: &lt;code&gt;ghcr.io/kubeflow/kfp-frontend:2.5.0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The deployment accepted the edit. Then it snapped back. I edited again. It snapped back again.&lt;/p&gt;
&lt;p&gt;At first, I treated this as a normal ownership chain problem: &lt;code&gt;Deployment -&amp;gt; ReplicaSet -&amp;gt; Pod&lt;/code&gt;. If my edit is getting reverted, some higher-level controller must be writing the deployment. Fair enough. Find the controller, patch the source, move on.&lt;/p&gt;</description></item><item><title>Making a Kopf operator idempotent: three-layer existence checks and the redisReady race</title><link>https://ferkakta.dev/kopf-operator-idempotency-three-layer-check/</link><pubDate>Fri, 20 Feb 2026 12:00:00 -0500</pubDate><guid>https://ferkakta.dev/kopf-operator-idempotency-three-layer-check/</guid><description>&lt;p&gt;Our tenant operator provisions databases, cache users, and credentials for each tenant in a multi-tenant SaaS platform. PostgreSQL roles on shared RDS, ElastiCache RBAC users, SSM parameters with generated passwords. It worked exactly once per tenant. The second time it ran, it regenerated every password and overwrote every SSM parameter. Running services holding the old credentials immediately lost their database and cache connections.&lt;/p&gt;
&lt;p&gt;This was the blocker for auto-deploy.&lt;/p&gt;
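&lt;p&gt;The shape of the fix is easy to state and harder to retrofit: guard every side effect with an existence check, so a re-fired create handler becomes a no-op. A sketch with in-memory stand-ins (helper and parameter names here are hypothetical, not the operator&amp;rsquo;s real API):&lt;/p&gt;

```python
import secrets

def ensure_tenant_credentials(tenant, ssm, db_roles, cache_users):
    """Idempotent provisioning sketch: only create what does not exist.

    ssm, db_roles, and cache_users are dict stand-ins for SSM parameters,
    PostgreSQL roles, and ElastiCache users; the real operator queries
    each system instead.
    """
    param_name = "/tenants/%s/db-password" % tenant

    # Layer 1: the stored credential. Reuse it if present; never rotate
    # on a re-applied CRD.
    password = ssm.get(param_name)
    if password is None:
        password = secrets.token_urlsafe(24)
        ssm[param_name] = password

    # Layer 2: the database role. Create only when absent.
    if tenant not in db_roles:
        db_roles[tenant] = password

    # Layer 3: the cache user. Same rule.
    if tenant not in cache_users:
        cache_users[tenant] = password

    return password
```

&lt;p&gt;Run it twice and the second pass returns the same password without touching any of the three layers.&lt;/p&gt;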
&lt;h2 id="every-deploy-was-a-coordinated-outage"&gt;Every deploy was a coordinated outage&lt;/h2&gt;
&lt;p&gt;The orchestrator runs &lt;code&gt;terraform apply&lt;/code&gt; for each tenant on every deploy. Terraform reconciles the Tenant CRD, which fires Kopf&amp;rsquo;s &lt;code&gt;on_tenant_create&lt;/code&gt; handler. The handler doesn&amp;rsquo;t distinguish between &amp;ldquo;new tenant&amp;rdquo; and &amp;ldquo;existing tenant whose CRD was re-applied.&amp;rdquo; It generates fresh passwords, creates new PostgreSQL roles (which fail because the role exists, or worse, succeed and orphan the old one), and overwrites SSM parameters with credentials that no running pod knows about.&lt;/p&gt;</description></item></channel></rss>