<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Platform-Engineering on ferkakta.dev</title><link>https://ferkakta.dev/tags/platform-engineering/</link><description>Recent content in Platform-Engineering on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Tue, 24 Mar 2026 21:00:00 -0600</lastBuildDate><atom:link href="https://ferkakta.dev/tags/platform-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>The Allow SCP that worked until it didn't</title><link>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</link><pubDate>Tue, 24 Mar 2026 21:00:00 -0600</pubDate><guid>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</guid><description>&lt;p&gt;I run a multi-tenant SaaS platform on AWS with Control Tower managing the organization. Control Tower deploys a region deny guardrail — an SCP that blocks API calls outside your home region. The mechanism is a &lt;code&gt;NotAction&lt;/code&gt; deny: it lists services that are allowed to operate globally (IAM, CloudFront, Route 53, a few dozen others), and denies everything else when &lt;code&gt;aws:RequestedRegion&lt;/code&gt; doesn&amp;rsquo;t match your approved list.&lt;/p&gt;
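&lt;p&gt;A stripped-down sketch of that policy shape, not Control Tower&amp;rsquo;s exact guardrail: the real one exempts a few dozen global services, and &lt;code&gt;us-east-1&lt;/code&gt; here stands in for the approved-region list:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RegionDenySketch",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "cloudfront:*",
        "route53:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1"]
        }
      }
    }
  ]
}
```

&lt;p&gt;Because the deny matches everything outside &lt;code&gt;NotAction&lt;/code&gt;, attaching a broader allow somewhere else changes nothing: in IAM policy evaluation, an explicit deny always wins.&lt;/p&gt;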
&lt;p&gt;This guardrail is one of the first things you hit when you try to do anything interesting. And the documentation says you can&amp;rsquo;t override a deny with an allow.&lt;/p&gt;</description></item><item><title>I assumed GovCloud was AWS with a different region code. It took two weeks to prove me wrong.</title><link>https://ferkakta.dev/govcloud-surprises/</link><pubDate>Wed, 11 Mar 2026 23:00:00 -0400</pubDate><guid>https://ferkakta.dev/govcloud-surprises/</guid><description>&lt;p&gt;I needed a GovCloud account for a multi-tenant NIST compliance platform. I&amp;rsquo;d been running commercial AWS infrastructure for months — EKS, Terraform, tenant provisioning, the whole stack. GovCloud would be the same thing in a different region. That was the assumption. It lasted about four hours.&lt;/p&gt;
&lt;h2 id="the-account-that-doesnt-exist-yet"&gt;The account that doesn&amp;rsquo;t exist yet&lt;/h2&gt;
&lt;p&gt;My management account couldn&amp;rsquo;t call &lt;code&gt;CreateGovCloudAccount&lt;/code&gt;. The API returned &lt;code&gt;ConstraintViolationException&lt;/code&gt; with a message about not being &amp;ldquo;enabled for access to GovCloud&amp;rdquo; and no guidance on what that meant. I filed a support case. AWS enabled the permission two days later, and as a side effect created a standalone GovCloud account that had no relationship to my Organizations structure — an orphan floating in the partition with disconnected root credentials. I still had to find it and deal with it.&lt;/p&gt;</description></item><item><title>I debugged a Lambda timeout for 6 hours. The fix was 4 CLI commands.</title><link>https://ferkakta.dev/lambda-timeout-forensic-arc/</link><pubDate>Wed, 11 Mar 2026 16:00:00 -0400</pubDate><guid>https://ferkakta.dev/lambda-timeout-forensic-arc/</guid><description>&lt;p&gt;The ticket said the Lambda tracer was timing out. The Slack thread said &lt;code&gt;ConnectTimeoutError&lt;/code&gt; to an internal tracing endpoint. Four Lambda functions had been moved into a VPC the day before so they could reach &lt;code&gt;tracer.internal.ferkakta.net&lt;/code&gt; — an internal ALB at &lt;code&gt;10.x.x.x&lt;/code&gt;, only reachable from inside the VPC. The migration was verified, the API returned success, the ticket should not have existed.&lt;/p&gt;
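&lt;p&gt;The usual suspects for a &lt;code&gt;ConnectTimeoutError&lt;/code&gt; from an in-VPC Lambda are security-group and routing gaps. As an illustration (not the actual fix from this incident), the egress check can be done mechanically against the rules &lt;code&gt;describe_security_groups&lt;/code&gt; returns:&lt;/p&gt;

```python
import ipaddress

def rule_allows(rule: dict, port: int, dest_ip: str) -> bool:
    """Evaluate one egress rule, shaped like the entries in
    describe_security_groups' IpPermissionsEgress, against a destination."""
    proto = rule.get("IpProtocol")
    if proto not in ("-1", "tcp"):  # "-1" means all protocols
        return False
    if proto == "tcp":
        allowed = range(rule.get("FromPort", 0), rule.get("ToPort", 65535) + 1)
        if port not in allowed:
            return False
    ip = ipaddress.ip_address(dest_ip)
    return any(ip in ipaddress.ip_network(r["CidrIp"])
               for r in rule.get("IpRanges", []))

def egress_allows(rules: list, port: int, dest_ip: str) -> bool:
    """True if any rule in the group's egress set reaches dest_ip:port."""
    return any(rule_allows(r, port, dest_ip) for r in rules)
```

&lt;p&gt;If this returns false for the ALB&amp;rsquo;s IP and port, the security group is your timeout; if it returns true, move on to the subnet route tables and the ALB&amp;rsquo;s own ingress rules.&lt;/p&gt;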
&lt;p&gt;The people who built this system had moved on to other projects. The people using it were in a different timezone. There was no architecture doc, no runbook, no one to pair with. I had CloudWatch, a kubectl context, and AWS credentials.&lt;/p&gt;</description></item><item><title>Your onboarding flow is your architecture's report card</title><link>https://ferkakta.dev/onboarding-flow-architecture-report-card/</link><pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/onboarding-flow-architecture-report-card/</guid><description>&lt;p&gt;I ran a colleague&amp;rsquo;s manual tenant onboarding flow for a multi-tenant SaaS platform. Five steps, two attempts, and a list of errors that mapped precisely to every automation gap in the system. The onboarding flow wasn&amp;rsquo;t broken. It was a diagnostic.&lt;/p&gt;
&lt;h2 id="the-five-steps"&gt;The five steps&lt;/h2&gt;
&lt;p&gt;The flow to bring a new tenant from nothing to working:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run a Python registration script that calls the auth-handler API, creates an org in the identity provider, and sends a confirmation email to the devops team.&lt;/li&gt;
&lt;li&gt;Read the devops email. Manually extract two values: a tenant hash and an org code.&lt;/li&gt;
&lt;li&gt;Run populate scripts that seed 38 SSM parameters — 24 for &lt;code&gt;apiserver&lt;/code&gt;, 14 for &lt;code&gt;tenant-auth-service&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Trigger a GitHub Actions workflow. Terraform creates the namespace, deployments, ExternalSecrets, DNS records, HTTPS.&lt;/li&gt;
&lt;li&gt;Manually apply Kubernetes jobs for ETL seed data and first-user creation.&lt;/li&gt;
&lt;/ol&gt;
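&lt;p&gt;Step 3 is the most mechanical of the manual steps. A sketch of what it does, with hypothetical parameter paths (the real scripts seed 24 &lt;code&gt;apiserver&lt;/code&gt; and 14 &lt;code&gt;tenant-auth-service&lt;/code&gt; values):&lt;/p&gt;

```python
def tenant_parameter_map(tenant_hash: str, org_code: str) -> dict:
    """Render per-tenant SSM parameters from the two values extracted in
    step 2. Paths and keys here are illustrative, not the real layout."""
    prefix = f"/tenants/{tenant_hash}"
    return {
        f"{prefix}/apiserver/org-code": org_code,
        f"{prefix}/tenant-auth-service/org-code": org_code,
        # ...the remaining parameters are elided
    }

def seed_parameters(params: dict) -> None:
    """Write the rendered map to SSM Parameter Store."""
    import boto3  # deferred so the pure renderer above needs no AWS deps
    ssm = boto3.client("ssm")
    for name, value in params.items():
        ssm.put_parameter(Name=name, Value=value,
                          Type="SecureString", Overwrite=True)
```

&lt;p&gt;Notice that nothing here needs a human: the only reason steps 2 and 3 are manual is that the tenant hash and org code arrive in an email instead of as an API response.&lt;/p&gt;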
&lt;p&gt;Step 4 is automated. Steps 1, 2, 3, and 5 are manual. The manual steps are where the architecture&amp;rsquo;s seams show.&lt;/p&gt;</description></item><item><title>An orderly EKS and Kubeflow upgrade path</title><link>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</guid><description>&lt;p&gt;When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.&lt;/p&gt;
&lt;p&gt;The worst time to discover platform ambiguity is when finance and timelines are both tightening.&lt;/p&gt;
&lt;p&gt;Our first impulse was to ask, &amp;ldquo;how quickly can we upgrade?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The better question was, &amp;ldquo;what order of operations prevents us from compounding hidden drift during upgrade churn?&amp;rdquo;&lt;/p&gt;
&lt;h2 id="why-one-shot-upgrades-fail-in-controller-heavy-stacks"&gt;Why one-shot upgrades fail in controller-heavy stacks&lt;/h2&gt;
&lt;p&gt;On paper, &amp;ldquo;upgrade EKS then bump Kubeflow&amp;rdquo; sounds linear.&lt;/p&gt;</description></item><item><title>Drift is an availability bug</title><link>https://ferkakta.dev/drift-is-an-availability-bug/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/drift-is-an-availability-bug/</guid><description>&lt;p&gt;I used to think of drift as a config hygiene issue.&lt;/p&gt;
&lt;p&gt;Annoying, expensive, embarrassing — but fundamentally administrative.&lt;/p&gt;
&lt;p&gt;Then I watched two control-plane components fall into &lt;code&gt;CrashLoopBackOff&lt;/code&gt; inside a production incident and realized the framing was wrong.&lt;/p&gt;
&lt;p&gt;Drift is not a paperwork problem. Drift is an availability bug.&lt;/p&gt;
&lt;h2 id="the-incident-looked-like-random-failure"&gt;The incident looked like random failure&lt;/h2&gt;
&lt;p&gt;We were already deep in one fire: a Kubeflow Pipelines frontend image that kept reverting to an old tag.&lt;/p&gt;</description></item><item><title>Kubeflow is a version matrix, not a version</title><link>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/</guid><description>&lt;p&gt;&amp;ldquo;What version of Kubeflow are we on?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That looks like a simple platform inventory question.&lt;/p&gt;
&lt;p&gt;In practice, it was one of the most misleading questions in our incident.&lt;/p&gt;
&lt;p&gt;We had already fixed one visible symptom — image reconciliation behavior that kept reverting a frontend component — when we started asking version questions to prevent recurrence.&lt;/p&gt;
&lt;p&gt;The expected answer was one number.&lt;/p&gt;
&lt;p&gt;The real answer was a matrix.&lt;/p&gt;
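&lt;p&gt;One way to make the matrix concrete: dump every component&amp;rsquo;s image reference (for example with &lt;code&gt;kubectl get deploy -A -o jsonpath&lt;/code&gt;) and split each into registry, repository, and tag. A small parser sketch:&lt;/p&gt;

```python
def parse_image(ref: str) -> tuple:
    """Split an image reference into (registry, repository, tag)."""
    first = ref.split("/")[0]
    if "/" in ref and ("." in first or ":" in first):
        registry, _, rest = ref.partition("/")
    else:
        registry, rest = "docker.io", ref  # bare names default to Docker Hub
    repo, _, tag = rest.rpartition(":")
    if not repo:  # reference carried no explicit tag
        repo, tag = rest, "latest"
    return registry, repo, tag

def version_matrix(images: dict) -> dict:
    """Map each component name to its parsed (registry, repository, tag)."""
    return {name: parse_image(ref) for name, ref in images.items()}
```

&lt;p&gt;The point of the exercise is the spread: a healthy install shows one coherent tag family per release, while a drifted one shows a scatter of registries and versions.&lt;/p&gt;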
&lt;h2 id="the-false-confidence-moment"&gt;The false confidence moment&lt;/h2&gt;
&lt;p&gt;The dangerous moment was not when something failed. It was when everything looked green enough to stop looking.&lt;/p&gt;</description></item><item><title>When a namespace owns your deployment</title><link>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/when-a-namespace-owns-your-deployment/</guid><description>&lt;p&gt;I spent a Friday morning trying to update one image tag.&lt;/p&gt;
&lt;p&gt;Old image: &lt;code&gt;gcr.io/ml-pipeline/frontend:2.0.5&lt;/code&gt;.
New image: &lt;code&gt;ghcr.io/kubeflow/kfp-frontend:2.5.0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The deployment accepted the edit. Then it snapped back. I edited again. It snapped back again.&lt;/p&gt;
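&lt;p&gt;There is a shortcut for finding the other writer: every API write is recorded in &lt;code&gt;metadata.managedFields&lt;/code&gt;, so you can list which clients have claimed the container image field. A crude fragment match over &lt;code&gt;fieldsV1&lt;/code&gt;, for illustration:&lt;/p&gt;

```python
import json

def field_managers(meta: dict, fragment: str) -> list:
    """Return the API clients whose managedFields entries mention the
    given field fragment (for the image field, use "f:image")."""
    hits = set()
    for entry in meta.get("managedFields", []):
        if fragment in json.dumps(entry.get("fieldsV1", {})):
            hits.add(entry.get("manager", "unknown"))
    return sorted(hits)
```

&lt;p&gt;Feed it the deployment&amp;rsquo;s metadata (fetched with &lt;code&gt;kubectl get -o json --show-managed-fields&lt;/code&gt;), and anything listed besides your own &lt;code&gt;kubectl-edit&lt;/code&gt; is a candidate for whatever keeps snapping the image back.&lt;/p&gt;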
&lt;p&gt;At first, I treated this as a normal ownership chain problem: &lt;code&gt;Deployment -&amp;gt; ReplicaSet -&amp;gt; Pod&lt;/code&gt;. If my edit is getting reverted, some higher-level controller must be writing the deployment. Fair enough. Find the controller, patch the source, move on.&lt;/p&gt;</description></item></channel></rss>