<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Platform-Engineering on ferkakta.dev</title><link>https://ferkakta.dev/tags/platform-engineering/</link><description>Recent content in Platform-Engineering on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Tue, 24 Mar 2026 21:00:00 -0600</lastBuildDate><atom:link href="https://ferkakta.dev/tags/platform-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>The Allow SCP that worked until it didn't</title><link>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</link><pubDate>Tue, 24 Mar 2026 21:00:00 -0600</pubDate><guid>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</guid><description>&lt;p&gt;I run a multi-tenant SaaS platform on AWS with Control Tower managing the organization. Control Tower deploys a region deny guardrail — an SCP that blocks API calls outside your home region. The mechanism is a &lt;code&gt;NotAction&lt;/code&gt; deny: it lists services that are allowed to operate globally (IAM, CloudFront, Route 53, a few dozen others), and denies everything else when &lt;code&gt;aws:RequestedRegion&lt;/code&gt; doesn&amp;rsquo;t match your approved list.&lt;/p&gt;
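&lt;p&gt;A stripped-down sketch of that policy shape, not Control Tower&amp;rsquo;s exact guardrail: the real one exempts a few dozen global services, and &lt;code&gt;us-east-1&lt;/code&gt; here stands in for the approved-region list:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RegionDenySketch",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "cloudfront:*",
        "route53:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1"]
        }
      }
    }
  ]
}
```

&lt;p&gt;Because the deny matches everything outside &lt;code&gt;NotAction&lt;/code&gt;, attaching a broader allow somewhere else changes nothing: in IAM policy evaluation, an explicit deny always wins.&lt;/p&gt;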
&lt;p&gt;This guardrail is one of the first things you hit when you try to do anything interesting. And the documentation says you can&amp;rsquo;t override a deny with an allow.&lt;/p&gt;</description></item><item><title>I assumed GovCloud was AWS with a different region code. It took two weeks to prove me wrong.</title><link>https://ferkakta.dev/govcloud-surprises/</link><pubDate>Wed, 11 Mar 2026 23:00:00 -0400</pubDate><guid>https://ferkakta.dev/govcloud-surprises/</guid><description>&lt;p&gt;I needed a GovCloud account for a multi-tenant NIST compliance platform. I&amp;rsquo;d been running commercial AWS infrastructure for months — EKS, Terraform, tenant provisioning, the whole stack. GovCloud would be the same thing in a different region. That was the assumption. It lasted about four hours.&lt;/p&gt;
&lt;h2 id="the-account-that-doesnt-exist-yet"&gt;The account that doesn&amp;rsquo;t exist yet&lt;/h2&gt;
&lt;p&gt;My management account couldn&amp;rsquo;t call &lt;code&gt;CreateGovCloudAccount&lt;/code&gt;. The API returned &lt;code&gt;ConstraintViolationException&lt;/code&gt; with a message about not being &amp;ldquo;enabled for access to GovCloud&amp;rdquo; and no guidance on what that meant. I filed a support case. AWS enabled the permission two days later, and as a side effect created a standalone GovCloud account that had no relationship to my Organizations structure — an orphan floating in the partition with disconnected root credentials. I still had to find it and deal with it.&lt;/p&gt;</description></item><item><title>I debugged a Lambda timeout for 6 hours. The fix was 4 CLI commands.</title><link>https://ferkakta.dev/lambda-timeout-forensic-arc/</link><pubDate>Wed, 11 Mar 2026 16:00:00 -0400</pubDate><guid>https://ferkakta.dev/lambda-timeout-forensic-arc/</guid><description>&lt;p&gt;The ticket said the Lambda tracer was timing out. The Slack thread said &lt;code&gt;ConnectTimeoutError&lt;/code&gt; to an internal tracing endpoint. Four Lambda functions had been moved into a VPC the day before so they could reach &lt;code&gt;tracer.internal.ferkakta.net&lt;/code&gt; — an internal ALB at &lt;code&gt;10.x.x.x&lt;/code&gt;, only reachable from inside the VPC. The migration was verified, the API returned success, the ticket should not have existed.&lt;/p&gt;
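&lt;p&gt;The usual suspects for a &lt;code&gt;ConnectTimeoutError&lt;/code&gt; from an in-VPC Lambda are security-group and routing gaps. As an illustration (not the actual fix from this incident), the egress check can be done mechanically against the rules &lt;code&gt;describe_security_groups&lt;/code&gt; returns:&lt;/p&gt;

```python
import ipaddress

def rule_allows(rule: dict, port: int, dest_ip: str) -> bool:
    """Evaluate one egress rule, shaped like the entries in
    describe_security_groups' IpPermissionsEgress, against a destination."""
    proto = rule.get("IpProtocol")
    if proto not in ("-1", "tcp"):  # "-1" means all protocols
        return False
    if proto == "tcp":
        allowed = range(rule.get("FromPort", 0), rule.get("ToPort", 65535) + 1)
        if port not in allowed:
            return False
    ip = ipaddress.ip_address(dest_ip)
    return any(ip in ipaddress.ip_network(r["CidrIp"])
               for r in rule.get("IpRanges", []))

def egress_allows(rules: list, port: int, dest_ip: str) -> bool:
    """True if any rule in the group's egress set reaches dest_ip:port."""
    return any(rule_allows(r, port, dest_ip) for r in rules)
```

&lt;p&gt;If this returns false for the ALB&amp;rsquo;s IP and port, the security group is your timeout; if it returns true, move on to the subnet route tables and the ALB&amp;rsquo;s own ingress rules.&lt;/p&gt;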
&lt;p&gt;The people who built this system had moved on to other projects. The people using it were in a different timezone. There was no architecture doc, no runbook, no one to pair with. I had CloudWatch, a kubectl context, and AWS credentials.&lt;/p&gt;</description></item><item><title>Your onboarding flow is your architecture's report card</title><link>https://ferkakta.dev/onboarding-flow-architecture-report-card/</link><pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/onboarding-flow-architecture-report-card/</guid><description>&lt;p&gt;I ran a colleague&amp;rsquo;s manual tenant onboarding flow for a multi-tenant SaaS platform. Five steps, two attempts, and a list of errors that mapped precisely to every automation gap in the system. The onboarding flow wasn&amp;rsquo;t broken. It was a diagnostic.&lt;/p&gt;
&lt;h2 id="the-five-steps"&gt;The five steps&lt;/h2&gt;
&lt;p&gt;The flow to bring a new tenant from nothing to working:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run a Python registration script that calls the auth-handler API, creates an org in the identity provider, and sends a confirmation email to the devops team.&lt;/li&gt;
&lt;li&gt;Read the devops email. Manually extract two values: a tenant hash and an org code.&lt;/li&gt;
&lt;li&gt;Run populate scripts that seed 38 SSM parameters — 24 for &lt;code&gt;apiserver&lt;/code&gt;, 14 for &lt;code&gt;tenant-auth-service&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Trigger a GitHub Actions workflow. Terraform creates the namespace, deployments, ExternalSecrets, DNS records, HTTPS.&lt;/li&gt;
&lt;li&gt;Manually apply Kubernetes jobs for ETL seed data and first-user creation.&lt;/li&gt;
&lt;/ol&gt;
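&lt;p&gt;Step 3 is the most mechanical of the manual steps. A sketch of what it does, with hypothetical parameter paths (the real scripts seed 24 &lt;code&gt;apiserver&lt;/code&gt; and 14 &lt;code&gt;tenant-auth-service&lt;/code&gt; values):&lt;/p&gt;

```python
def tenant_parameter_map(tenant_hash: str, org_code: str) -> dict:
    """Render per-tenant SSM parameters from the two values extracted in
    step 2. Paths and keys here are illustrative, not the real layout."""
    prefix = f"/tenants/{tenant_hash}"
    return {
        f"{prefix}/apiserver/org-code": org_code,
        f"{prefix}/tenant-auth-service/org-code": org_code,
        # ...the remaining parameters are elided
    }

def seed_parameters(params: dict) -> None:
    """Write the rendered map to SSM Parameter Store."""
    import boto3  # deferred so the pure renderer above needs no AWS deps
    ssm = boto3.client("ssm")
    for name, value in params.items():
        ssm.put_parameter(Name=name, Value=value,
                          Type="SecureString", Overwrite=True)
```

&lt;p&gt;Notice that nothing here needs a human: the only reason steps 2 and 3 are manual is that the tenant hash and org code arrive in an email instead of as an API response.&lt;/p&gt;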
&lt;p&gt;Step 4 is automated. Steps 1, 2, 3, and 5 are manual. The manual steps are where the architecture&amp;rsquo;s seams show.&lt;/p&gt;</description></item><item><title>An orderly EKS and Kubeflow upgrade path</title><link>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</guid><description>&lt;p&gt;When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.&lt;/p&gt;
&lt;p&gt;The worst time to discover platform ambiguity is when finance and timelines are both tightening.&lt;/p&gt;
&lt;p&gt;Our first impulse was to ask, &amp;ldquo;how quickly can we upgrade?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The better question was, &amp;ldquo;what order of operations prevents us from compounding hidden drift during upgrade churn?&amp;rdquo;&lt;/p&gt;
&lt;h2 id="why-one-shot-upgrades-fail-in-controller-heavy-stacks"&gt;Why one-shot upgrades fail in controller-heavy stacks&lt;/h2&gt;
&lt;p&gt;On paper, &amp;ldquo;upgrade EKS then bump Kubeflow&amp;rdquo; sounds linear.&lt;/p&gt;</description></item><item><title>Drift is an availability bug</title><link>https://ferkakta.dev/drift-is-an-availability-bug/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/drift-is-an-availability-bug/</guid><description>&lt;p&gt;I used to think of drift as a config hygiene issue.&lt;/p&gt;
&lt;p&gt;Annoying, expensive, embarrassing — but fundamentally administrative.&lt;/p&gt;
&lt;p&gt;Then I watched two control-plane components fall into &lt;code&gt;CrashLoopBackOff&lt;/code&gt; inside a production incident and realized the framing was wrong.&lt;/p&gt;
&lt;p&gt;Drift is not a paperwork problem. Drift is an availability bug.&lt;/p&gt;
&lt;h2 id="the-incident-looked-like-random-failure"&gt;The incident looked like random failure&lt;/h2&gt;
&lt;p&gt;We were already deep in one fire: a Kubeflow Pipelines frontend image that kept reverting to an old tag.&lt;/p&gt;</description></item><item><title>Kubeflow is a version matrix, not a version</title><link>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</guid><description>&lt;p&gt;&amp;ldquo;What version of Kubeflow are we on?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That looks like a simple platform inventory question.&lt;/p&gt;
&lt;p&gt;In practice, it was one of the most misleading questions in our incident.&lt;/p&gt;
&lt;p&gt;We had already fixed one visible symptom — image reconciliation behavior that kept reverting a frontend component — when we started asking version questions to prevent recurrence.&lt;/p&gt;
&lt;p&gt;The expected answer was one number.&lt;/p&gt;
&lt;p&gt;The real answer was a matrix.&lt;/p&gt;
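&lt;p&gt;One way to make the matrix concrete: dump every component&amp;rsquo;s image reference (for example with &lt;code&gt;kubectl get deploy -A -o jsonpath&lt;/code&gt;) and split each into registry, repository, and tag. A small parser sketch:&lt;/p&gt;

```python
def parse_image(ref: str) -> tuple:
    """Split an image reference into (registry, repository, tag)."""
    first = ref.split("/")[0]
    if "/" in ref and ("." in first or ":" in first):
        registry, _, rest = ref.partition("/")
    else:
        registry, rest = "docker.io", ref  # bare names default to Docker Hub
    repo, _, tag = rest.rpartition(":")
    if not repo:  # reference carried no explicit tag
        repo, tag = rest, "latest"
    return registry, repo, tag

def version_matrix(images: dict) -> dict:
    """Map each component name to its parsed (registry, repository, tag)."""
    return {name: parse_image(ref) for name, ref in images.items()}
```

&lt;p&gt;The point of the exercise is the spread: a healthy install shows one coherent tag family per release, while a drifted one shows a scatter of registries and versions.&lt;/p&gt;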
&lt;h2 id="the-false-confidence-moment"&gt;The false confidence moment&lt;/h2&gt;
&lt;p&gt;The dangerous moment was not when something failed. It was when everything looked green enough to stop looking.&lt;/p&gt;</description></item><item><title>When a namespace owns your deployment</title><link>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</guid><description>&lt;p&gt;I spent a Friday morning trying to update one image tag.&lt;/p&gt;
&lt;p&gt;Old image: &lt;code&gt;gcr.io/ml-pipeline/frontend:2.0.5&lt;/code&gt;.
New image: &lt;code&gt;ghcr.io/kubeflow/kfp-frontend:2.5.0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The deployment accepted the edit. Then it snapped back. I edited again. It snapped back again.&lt;/p&gt;
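&lt;p&gt;There is a shortcut for finding the other writer: every API write is recorded in &lt;code&gt;metadata.managedFields&lt;/code&gt;, so you can list which clients have claimed the container image field. A crude fragment match over &lt;code&gt;fieldsV1&lt;/code&gt;, for illustration:&lt;/p&gt;

```python
import json

def field_managers(meta: dict, fragment: str) -> list:
    """Return the API clients whose managedFields entries mention the
    given field fragment (for the image field, use "f:image")."""
    hits = set()
    for entry in meta.get("managedFields", []):
        if fragment in json.dumps(entry.get("fieldsV1", {})):
            hits.add(entry.get("manager", "unknown"))
    return sorted(hits)
```

&lt;p&gt;Feed it the deployment&amp;rsquo;s metadata (fetched with &lt;code&gt;kubectl get -o json --show-managed-fields&lt;/code&gt;), and anything listed besides your own &lt;code&gt;kubectl-edit&lt;/code&gt; is a candidate for whatever keeps snapping the image back.&lt;/p&gt;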
&lt;p&gt;At first, I treated this as a normal ownership chain problem: &lt;code&gt;Deployment -&amp;gt; ReplicaSet -&amp;gt; Pod&lt;/code&gt;. If my edit is getting reverted, some higher-level controller must be writing the deployment. Fair enough. Find the controller, patch the source, move on.&lt;/p&gt;</description></item></channel></rss>