<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Eks on ferkakta.dev</title><link>https://ferkakta.dev/tags/eks/</link><description>Recent content in Eks on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Mon, 02 Mar 2026 09:00:00 -0600</lastBuildDate><atom:link href="https://ferkakta.dev/tags/eks/index.xml" rel="self" type="application/rss+xml"/><item><title>Zero-touch multi-tenant deploys: removing myself from the critical path</title><link>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</link><pubDate>Mon, 02 Mar 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</guid><description>&lt;p&gt;I had provisioned two tenants when I realized the deploy process didn&amp;rsquo;t scale to three. Each tenant on &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt; runs three services &amp;ndash; &lt;code&gt;api-server&lt;/code&gt;, &lt;code&gt;web-client&lt;/code&gt; (the React frontend), &lt;code&gt;tenant-auth&lt;/code&gt; &amp;ndash; each with its own Docker image in ECR. Deploying a release meant running &lt;code&gt;gh workflow run deploy-tenant.yml -f tenant_name=acme -f action=apply -f update_images=true&lt;/code&gt;, then doing it again for the next tenant. With 3 services resolving per run and N tenants, I was the bottleneck. Not Terraform, not GitHub Actions, not ECR. 
Me, remembering which tenants existed and typing their names correctly.&lt;/p&gt;</description></item><item><title>Per-Tenant CloudWatch Log Isolation on EKS, or: Why I Stopped Using aws-for-fluent-bit</title><link>https://ferkakta.dev/per-tenant-cloudwatch-log-isolation-eks/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/per-tenant-cloudwatch-log-isolation-eks/</guid><description>&lt;h2 id="the-starting-assumption"&gt;The starting assumption&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m building &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt;, a multi-tenant compliance platform running on EKS. Each tenant gets a Kubernetes namespace &amp;ndash; &lt;code&gt;tenant-acme&lt;/code&gt;, &lt;code&gt;tenant-globex&lt;/code&gt;, whatever &amp;ndash; and the compliance controls require that their application logs land in isolated storage with 365-day retention. CMMC maps this to AU-2 (audit events), AU-3 (audit content), AU-11 (retention), and AC-4 (information flow isolation). A tenant cannot read another tenant&amp;rsquo;s container output.&lt;/p&gt;
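&lt;p&gt;A minimal sketch of the retention half of that requirement, assuming a one-log-group-per-namespace naming scheme (the group names are illustrative) and printing the AWS CLI calls as a dry run rather than executing them:&lt;/p&gt;

```shell
#!/bin/sh
# One CloudWatch log group per tenant namespace, 365-day retention (AU-11).
# The /ramparts/... naming scheme is a hypothetical stand-in; the loop
# prints the CLI calls instead of running them.
for ns in tenant-acme tenant-globex; do
  group="/ramparts/${ns}/app"
  echo "aws logs create-log-group --log-group-name ${group}"
  echo "aws logs put-retention-policy --log-group-name ${group} --retention-in-days 365"
done
```

&lt;p&gt;Isolation (AC-4) is the other half: each tenant&amp;rsquo;s log shipper and IAM policy must be scoped to its own group, so one tenant&amp;rsquo;s credentials can never read another&amp;rsquo;s.&lt;/p&gt;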
&lt;p&gt;The obvious first move was &lt;code&gt;aws-for-fluent-bit&lt;/code&gt;, AWS&amp;rsquo;s own Helm chart and container image for shipping logs to CloudWatch. AWS service, AWS chart, AWS logging destination. The blessed path.&lt;/p&gt;</description></item><item><title>Why we removed aws-for-fluent-bit from EKS</title><link>https://ferkakta.dev/why-we-removed-aws-for-fluent-bit-from-eks/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/why-we-removed-aws-for-fluent-bit-from-eks/</guid><description>&lt;p&gt;We deployed &lt;code&gt;aws-for-fluent-bit&lt;/code&gt; because AWS recommends it.&lt;/p&gt;
&lt;p&gt;If you follow the EKS logging documentation, that&amp;rsquo;s the default path: the docs assume AWS&amp;rsquo;s distribution of Fluent Bit rather than the upstream Helm chart.&lt;/p&gt;
&lt;p&gt;We did.&lt;/p&gt;
&lt;p&gt;Two days later, we ripped it out.&lt;/p&gt;
&lt;p&gt;The AWS chart and the upstream chart are not the same thing. The differences aren&amp;rsquo;t cosmetic. They affect how quickly you receive security patches, how transparently your configuration maps to the underlying plugin, and how many boundaries sit between your logs and the CloudWatch API.&lt;/p&gt;</description></item><item><title>An orderly EKS and Kubeflow upgrade path</title><link>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/orderly-eks-kubeflow-upgrade-path/</guid><description>&lt;p&gt;When EKS extended-support pricing is on the horizon, upgrade planning gets emotional fast.&lt;/p&gt;
&lt;p&gt;The worst time to discover platform ambiguity is when finance and timelines are both tightening.&lt;/p&gt;
&lt;p&gt;Our first impulse was to ask, &amp;ldquo;how quickly can we upgrade?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The better question was, &amp;ldquo;what order of operations prevents us from compounding hidden drift during upgrade churn?&amp;rdquo;&lt;/p&gt;
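&lt;p&gt;Ordering has a hard constraint underneath it: kubelets may trail the API server by at most three minor versions (two on older Kubernetes releases). A toy skew check, with hardcoded version numbers standing in for live &lt;code&gt;kubectl version&lt;/code&gt; and &lt;code&gt;kubectl get nodes&lt;/code&gt; output:&lt;/p&gt;

```shell
#!/bin/sh
# Pre-upgrade sanity check: how far do the oldest kubelets trail the
# control plane? Versions here are hypothetical stand-ins for live output.
control_plane_minor=29   # e.g. EKS 1.29
oldest_node_minor=27     # e.g. oldest node group still on 1.27

gap=$((control_plane_minor - oldest_node_minor))
if [ "$gap" -gt 3 ]; then
  echo "skew too large: bring node groups forward before the next control-plane bump"
else
  echo "skew ok: ${gap} minor version(s) behind"
fi
```

&lt;p&gt;The point is not the arithmetic; it is that the gap, not the calendar, dictates which component moves first.&lt;/p&gt;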
&lt;h2 id="why-one-shot-upgrades-fail-in-controller-heavy-stacks"&gt;Why one-shot upgrades fail in controller-heavy stacks&lt;/h2&gt;
&lt;p&gt;On paper, &amp;ldquo;upgrade EKS then bump Kubeflow&amp;rdquo; sounds linear.&lt;/p&gt;</description></item><item><title>Your terraform apply is silently rolling back your container images</title><link>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</link><pubDate>Tue, 17 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</guid><description>&lt;p&gt;Every &amp;ldquo;deploy to EKS with GitHub Actions&amp;rdquo; tutorial solves the same problem: build an image, push to ECR, deploy it. The tutorial ends at &amp;ldquo;your pod is running.&amp;rdquo; Nobody talks about day two.&lt;/p&gt;
&lt;h2 id="the-silent-rollback"&gt;The silent rollback&lt;/h2&gt;
&lt;p&gt;Day two: you have a running EKS cluster with three services per tenant. You need to change an IAM policy. You open a PR, touch one line of Terraform, run &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Your IAM policy updates. Your container images also update — to whatever was hardcoded in &lt;code&gt;variables.tf&lt;/code&gt; as the default. That default was correct three months ago. Your services just rolled back to a three-month-old image and nobody noticed because the deployment succeeded.&lt;/p&gt;</description></item><item><title>What building infrastructure for a startup actually looks like</title><link>https://ferkakta.dev/startup-infra-unglamorous-work/</link><pubDate>Wed, 11 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/startup-infra-unglamorous-work/</guid><description>&lt;p&gt;I spent a day doing the unglamorous infrastructure work that keeps a startup alive. Here&amp;rsquo;s everything that happened.&lt;/p&gt;
&lt;h2 id="morning-security-audit"&gt;Morning: security audit&lt;/h2&gt;
&lt;p&gt;Audited two EKS clusters for a K8s privilege escalation vulnerability. Found 9 service accounts with &lt;code&gt;cluster-admin&lt;/code&gt; that didn&amp;rsquo;t need it. Deleted two dead deployments — ArgoCD and Velero, both mine, both abandoned months ago. The rest are Kubeflow components we can&amp;rsquo;t touch until 1.36 ships the fix in April.&lt;/p&gt;</description></item><item><title>90 AWS resources in 5 minutes — automating multi-tenant SaaS tenant lifecycle</title><link>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</link><pubDate>Tue, 10 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</guid><description>&lt;p&gt;I recorded our entire tenant lifecycle — create, test, destroy — with no edits. Here&amp;rsquo;s what 5 minutes of infrastructure automation looks like when there are no tickets, no handoffs, and no &amp;ldquo;can someone set up the database.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="what-happens-on-tenant-create"&gt;What happens on &lt;code&gt;tenant create&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;One GitHub Actions workflow backed by Terraform + a Kubernetes operator:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Validates the tenant name, resolves container images from the latest release branch&lt;/li&gt;
&lt;li&gt;Provisions ACM wildcard cert + Route53 DNS records&lt;/li&gt;
&lt;li&gt;Creates the &lt;code&gt;Tenant&lt;/code&gt; CRD → operator provisions PostgreSQL databases on shared RDS, seeds credentials to SSM&lt;/li&gt;
&lt;li&gt;Terraform deploys ExternalSecrets, Deployments, Ingress — 3 services per tenant&lt;/li&gt;
&lt;li&gt;SSM parameters auto-seeded: Redis credentials, auth URLs, signing keys — ~40 config values per tenant&lt;/li&gt;
&lt;li&gt;Zero static credentials anywhere — IRSA for everything, secrets injected at runtime from SSM via External Secrets Operator&lt;/li&gt;
&lt;/ol&gt;
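&lt;p&gt;The steps above, reduced to a dry-run sketch; command names and paths are illustrative, not the workflow internals:&lt;/p&gt;

```shell
#!/bin/sh
# Dry-run sketch of tenant create. The file paths and variable names
# are hypothetical stand-ins for the real workflow.
tenant="acme"

# 1. Validate: the name becomes a namespace, so DNS-label rules apply.
echo "$tenant" | grep -Eq '^[a-z][a-z0-9-]{0,30}$' || { echo "invalid tenant name"; exit 1; }

# 2-6. Print the infrastructure steps instead of executing them.
echo "kubectl apply -f tenants/${tenant}.yaml   # Tenant CRD; operator seeds RDS and SSM"
echo "terraform apply -var tenant_name=${tenant}   # cert, DNS, secrets, 3 services"
```

&lt;p&gt;Everything after validation is declarative, which is what makes the five-minute wall-clock time repeatable.&lt;/p&gt;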
&lt;p&gt;About 5 minutes from nothing to 90 AWS resources and running pods.&lt;/p&gt;</description></item></channel></rss>