<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Terraform on ferkakta.dev</title><link>https://ferkakta.dev/tags/terraform/</link><description>Recent content in Terraform on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://ferkakta.dev/tags/terraform/index.xml" rel="self" type="application/rss+xml"/><item><title>One module block per service per tenant</title><link>https://ferkakta.dev/one-module-block-per-service-per-tenant/</link><pubDate>Fri, 27 Mar 2026 00:00:00 -0500</pubDate><guid>https://ferkakta.dev/one-module-block-per-service-per-tenant/</guid><description>&lt;p&gt;Every tenant on my platform gets three services: an API server, an auth service, and a frontend. Each one is a single module block in Terraform that creates a Kubernetes deployment, a ClusterIP service, an ALB ingress, IRSA for AWS access, ESO-synced secrets from SSM, and a feature flag discovery mechanism. The module is the same for all three services. The variables are different.&lt;/p&gt;
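&lt;p&gt;A call site looks roughly like this (a sketch; the argument names are illustrative, not the module&amp;rsquo;s published interface):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-hcl"&gt;module "acme_api_server" {
  source = "./modules/tenant-service" # illustrative path

  tenant           = "acme"
  service_name     = "api-server"
  image            = var.api_server_image
  ingress_host     = "api.acme.example.com" # ALB ingress
  irsa_policy_arns = [aws_iam_policy.api_server.arn]
  ssm_prefix       = "/acme/api-server" # ESO syncs secrets from here
}
&lt;/code&gt;&lt;/pre&gt;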
&lt;p&gt;I extracted it into an open-source module because I kept explaining the design decisions to people who asked &amp;ldquo;how do you deploy services to EKS?&amp;rdquo; and the answer was always &amp;ldquo;let me show you the module.&amp;rdquo; The module is the answer.&lt;/p&gt;</description></item><item><title>Every tool I've ever used is a CloudFormation frontend</title><link>https://ferkakta.dev/cloudformation-frontends/</link><pubDate>Thu, 26 Mar 2026 18:00:00 -0500</pubDate><guid>https://ferkakta.dev/cloudformation-frontends/</guid><description>&lt;p&gt;I was reading a job description that wanted CloudFormation experience, and I had the thought that derails the actual task: I&amp;rsquo;ve spent my entire career using tools that compile down to CloudFormation and don&amp;rsquo;t mention it until something breaks. I&amp;rsquo;ve just never framed it that way.&lt;/p&gt;
&lt;p&gt;My career is a parade of progressively nicer frontends for the same underlying control plane — but one at a time.&lt;/p&gt;
&lt;p&gt;The first one was the AWS console. Click, wait, refresh, click. Then CloudFormation itself, which was an improvement in the way that a paper map is an improvement over asking for directions — technically correct, nearly unusable in practice. Then Serverless Framework, which promised to abstract the whole stack into a YAML file and a deploy command. Then Terraform, which promised cloud-agnostic infrastructure as code with a state model that actually worked.&lt;/p&gt;</description></item><item><title>from feature_flags import *</title><link>https://ferkakta.dev/from-feature-flags-import-star/</link><pubDate>Wed, 25 Mar 2026 21:00:00 -0500</pubDate><guid>https://ferkakta.dev/from-feature-flags-import-star/</guid><description>&lt;p&gt;A colleague needed a feature flag enabled on one tenant. &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; — one environment variable, one pod. I added it to the K8s secret manually, restarted the pod, and he was unblocked in two minutes.&lt;/p&gt;
&lt;p&gt;Then I realized: the next terraform apply would overwrite that secret without the flag. The ExternalSecret syncs from SSM, and the flag wasn&amp;rsquo;t in SSM through any path terraform knew about. My manual fix had a shelf life of one deploy.&lt;/p&gt;</description></item><item><title>Zero-touch multi-tenant deploys: removing myself from the critical path</title><link>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</link><pubDate>Mon, 02 Mar 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</guid><description>&lt;p&gt;I had provisioned two tenants when I realized the deploy process didn&amp;rsquo;t scale to three. Each tenant on &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt; runs three services &amp;ndash; &lt;code&gt;api-server&lt;/code&gt;, &lt;code&gt;web-client&lt;/code&gt; (the React frontend), &lt;code&gt;tenant-auth&lt;/code&gt; &amp;ndash; each with its own Docker image in ECR. Deploying a release meant running &lt;code&gt;gh workflow run deploy-tenant.yml -f tenant_name=acme -f action=apply -f update_images=true&lt;/code&gt;, then doing it again for the next tenant. With 3 services resolving per run and N tenants, I was the bottleneck. Not Terraform, not GitHub Actions, not ECR. Me, remembering which tenants existed and typing their names correctly.&lt;/p&gt;</description></item><item><title>IAM trust policies silently accept wildcards in principals — and silently deny everything</title><link>https://ferkakta.dev/iam-trust-policy-wildcards/</link><pubDate>Thu, 26 Feb 2026 10:00:00 -0600</pubDate><guid>https://ferkakta.dev/iam-trust-policy-wildcards/</guid><description>&lt;p&gt;I needed a cross-account IAM role in a management account that workloads in a separate devops account could assume to send email via SES. 
Two types of callers: one shared service with a stable role name, and N dynamically created per-tenant roles following a naming convention like &lt;code&gt;myapp-apiserver-*&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The shared service was straightforward — exact ARN in the trust policy principal. For the per-tenant roles, I wrote what looked correct:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;Principal&amp;#34;&lt;/span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;:&lt;/span&gt; { &lt;span style="color:#f92672"&gt;&amp;#34;AWS&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;arn:aws:iam::111111111111:role/myapp-apiserver-*&amp;#34;&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;terraform apply&lt;/code&gt; succeeded. The role was created. Every assume-role call was denied.&lt;/p&gt;</description></item><item><title>The Over-Mighty Subject: why your site repos have too much power</title><link>https://ferkakta.dev/over-mighty-subjects-terraform-credential-scope/</link><pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/over-mighty-subjects-terraform-credential-scope/</guid><description>&lt;p&gt;Josh Marshall &lt;a href="https://talkingpointsmemo.com/edblog/elon-musk-and-the-the-threat-of-the-over-mighty-subject-part-i/sharetoken/21fb0dac-112d-4a9d-bcc0-f5bf844b16bb"&gt;borrows a phrase from medieval history&lt;/a&gt; to describe a modern political problem: the Over-Mighty Subject. A feudal lord whose personal wealth, private army, and territorial control grew so large that he rivaled the crown itself. Not a rebel — still nominally a subject — but operating with enough independent power that the sovereign&amp;rsquo;s authority became theoretical.&lt;/p&gt;
&lt;p&gt;I had three of them in my infrastructure. They were Terraform roots for static sites.&lt;/p&gt;</description></item><item><title>I replaced $489/mo in AWS Client VPN with a $3 t4g.nano running Headscale</title><link>https://ferkakta.dev/headscale-aws-open-source-terraform-module/</link><pubDate>Sat, 21 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/headscale-aws-open-source-terraform-module/</guid><description>&lt;p&gt;A finops sprint surfaced $489/mo in AWS Client VPN charges. Three endpoints across two accounts, plus connection-hour fees. For a VPN that four people used. I had provisioned two of them.&lt;/p&gt;
&lt;p&gt;At the time, they felt indispensable — secure customer access, familiar tooling, predictable behavior.
In reality, they were architectural inertia.&lt;/p&gt;
&lt;p&gt;I replaced all three with a single t4g.nano running &lt;a href="https://github.com/juanfont/headscale"&gt;Headscale&lt;/a&gt; — the open-source Tailscale coordination server. Total cost: ~$3/mo.&lt;/p&gt;
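&lt;p&gt;Using it looks something like this (a sketch; the source address and variable names are illustrative, so check the module&amp;rsquo;s README for the real interface):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-hcl"&gt;module "headscale" {
  source = "github.com/ferkakta/terraform-aws-headscale" # illustrative source

  instance_type = "t4g.nano" # ARM, roughly $3/mo
  vpc_id        = module.vpc.vpc_id
  subnet_id     = module.vpc.public_subnets[0]
  dns_zone_id   = aws_route53_zone.main.zone_id
}
&lt;/code&gt;&lt;/pre&gt;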
&lt;p&gt;I genericized the Terraform and open-sourced the module.&lt;/p&gt;</description></item><item><title>Self-healing race conditions: when your CI/CD fails on purpose</title><link>https://ferkakta.dev/self-healing-race-conditions-github-actions-concurrency/</link><pubDate>Fri, 20 Feb 2026 11:00:00 -0500</pubDate><guid>https://ferkakta.dev/self-healing-race-conditions-github-actions-concurrency/</guid><description>&lt;p&gt;Three app repos build Docker images and push them to ECR. On merge, each fires a &lt;code&gt;repository_dispatch&lt;/code&gt; to an infra repo&amp;rsquo;s orchestrator workflow. The orchestrator resolves ALL service images — not just the one that triggered it — and deploys every tenant via Terraform.&lt;/p&gt;
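&lt;p&gt;The app-repo side of that wiring is one step after the image push (a sketch; the org, repo, and event names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;- name: Notify the infra orchestrator
  run: |
    gh api repos/my-org/infra/dispatches \
      -f event_type=image-pushed \
      -f 'client_payload[service]=tenant-auth'
  env:
    GH_TOKEN: ${{ secrets.INFRA_DISPATCH_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;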
&lt;p&gt;What happens when two repos merge at the same time?&lt;/p&gt;
&lt;h2 id="the-sequence"&gt;The sequence&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;T=0:&lt;/strong&gt; &lt;code&gt;web-client&lt;/code&gt; and &lt;code&gt;tenant-auth&lt;/code&gt; both merge to &lt;code&gt;releases/0.0.2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T=2m:&lt;/strong&gt; &lt;code&gt;tenant-auth&lt;/code&gt; build finishes first, fires dispatch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T=2.5m:&lt;/strong&gt; Orchestrator Run A starts. Tries to resolve all 3 service images.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T=2.5m:&lt;/strong&gt; &lt;code&gt;web-client&lt;/code&gt; image doesn&amp;rsquo;t exist yet — still building. Run A fails at image resolution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T=4m:&lt;/strong&gt; &lt;code&gt;web-client&lt;/code&gt; build finishes, fires its own dispatch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T=4m:&lt;/strong&gt; Orchestrator Run B starts. Run A already finished (failed), so the concurrency group is free.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T=4m:&lt;/strong&gt; Run B resolves all 3 images. Both new ones exist now. Deploy succeeds.&lt;/li&gt;
&lt;/ul&gt;
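&lt;p&gt;The serialization is nothing exotic, just a concurrency block on the orchestrator workflow (the group name here is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;concurrency:
  group: tenant-deploys     # one orchestrator run at a time
  cancel-in-progress: false # queue later runs instead of killing the active one
&lt;/code&gt;&lt;/pre&gt;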
&lt;p&gt;The end state is correct. Both changes deployed. One workflow run failed. Nobody had to do anything.&lt;/p&gt;</description></item><item><title>Cross-repo auto-deploy with GitHub Actions: the orchestrator pattern</title><link>https://ferkakta.dev/cross-repo-auto-deploy-orchestration-github-actions/</link><pubDate>Fri, 20 Feb 2026 10:00:00 -0500</pubDate><guid>https://ferkakta.dev/cross-repo-auto-deploy-orchestration-github-actions/</guid><description>&lt;p&gt;Two repos merged within seconds of each other. The first orchestrator run failed — &lt;code&gt;web-client&lt;/code&gt;&amp;rsquo;s ECR image didn&amp;rsquo;t exist yet because the build was still running. The GitHub Actions log showed a red X, an error annotation, and a Slack notification I didn&amp;rsquo;t need to read.&lt;/p&gt;
&lt;p&gt;Four minutes later, the second run deployed both changes. No retry logic. No manual intervention. Nobody touched anything.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d spent my day building a cross-repo deploy pipeline for a multi-tenant platform — three app repos pushing Docker images to ECR, one infra repo deploying the new tenant service images to EKS. The race condition was the first real test. It failed exactly the way I wanted it to.&lt;/p&gt;</description></item><item><title>Your terraform apply is silently rolling back your container images</title><link>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</link><pubDate>Tue, 17 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</guid><description>&lt;p&gt;Every &amp;ldquo;deploy to EKS with GitHub Actions&amp;rdquo; tutorial solves the same problem: build an image, push to ECR, deploy it. The tutorial ends at &amp;ldquo;your pod is running.&amp;rdquo; Nobody talks about day two.&lt;/p&gt;
&lt;h2 id="the-silent-rollback"&gt;The silent rollback&lt;/h2&gt;
&lt;p&gt;Day two: you have a running EKS cluster with three services per tenant. You need to change an IAM policy. You open a PR, touch one line of Terraform, run &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;
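&lt;p&gt;The trap is a default that was true once (a sketch; the variable name, account, and tag are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-hcl"&gt;variable "api_server_image" {
  type = string
  # set when the tenant launched, never updated since
  default = "111111111111.dkr.ecr.us-east-1.amazonaws.com/api-server:v0.4.7"
}
&lt;/code&gt;&lt;/pre&gt;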
&lt;p&gt;Your IAM policy updates. Your container images also update — to whatever was hardcoded in &lt;code&gt;variables.tf&lt;/code&gt; as the default. That default was correct three months ago. Your services just rolled back to a three-month-old image and nobody noticed because the deployment succeeded.&lt;/p&gt;</description></item><item><title>Terraform module for multi-provider DNS: define once, deploy to Route53 + Cloudflare</title><link>https://ferkakta.dev/terraform-module-for-multi-provider-dns-define-once-deploy-to-route53--cloudflare/</link><pubDate>Mon, 16 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/terraform-module-for-multi-provider-dns-define-once-deploy-to-route53--cloudflare/</guid><description>&lt;p&gt;I manage 10 domains across Route53 and Cloudflare. When I set up &lt;a href="https://fizz.today/til-cloudflare-registrar-locks-your-nameservers-and-how-to-escape-with-multi-provider-dns/"&gt;multi-provider DNS&lt;/a&gt; on my first domain, every record had to be defined twice — once for each provider. The APIs are different enough that you can&amp;rsquo;t just copy-paste.&lt;/p&gt;
&lt;p&gt;The duplication got old fast. So I wrote a module.&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The problem&lt;/h2&gt;
&lt;p&gt;Route53 and Cloudflare represent the same DNS data differently:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MX records&lt;/strong&gt;: Route53 bundles priority into the value string (&lt;code&gt;&amp;quot;10 mx1.example.com&amp;quot;&lt;/code&gt;). Cloudflare splits it into a separate &lt;code&gt;priority&lt;/code&gt; field.&lt;/p&gt;</description></item><item><title>ElastiCache auth-token to RBAC migration has a Terraform provider bug</title><link>https://ferkakta.dev/elasticache-auth-token-to-rbac-migration/</link><pubDate>Fri, 13 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/elasticache-auth-token-to-rbac-migration/</guid><description>&lt;p&gt;Needed to migrate a shared ElastiCache Redis cluster from a single auth token to per-user RBAC. Breaking change — every service on the cluster goes dark if you get the sequencing wrong.&lt;/p&gt;
&lt;h2 id="the-terraform-provider-bug"&gt;The Terraform provider bug&lt;/h2&gt;
&lt;p&gt;Step one: don&amp;rsquo;t touch the real cluster. Built a throwaway copy and ran the migration there first.&lt;/p&gt;
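&lt;p&gt;The target state is the standard RBAC pair (a sketch; IDs, names, and the access string are illustrative), attached to the replication group via &lt;code&gt;user_group_ids&lt;/code&gt; as &lt;code&gt;auth_token&lt;/code&gt; comes out:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-hcl"&gt;resource "aws_elasticache_user" "api" {
  user_id       = "api-server"
  user_name     = "api-server"
  engine        = "REDIS"
  access_string = "on ~* +@all"
  passwords     = [var.api_redis_password]
}

resource "aws_elasticache_user_group" "main" {
  user_group_id = "main"
  engine        = "REDIS"
  # a user group must include the built-in "default" user
  user_ids      = ["default", aws_elasticache_user.api.user_id]
}
&lt;/code&gt;&lt;/pre&gt;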
&lt;p&gt;Good thing — the Terraform AWS provider has a bug in the auth-token removal step. It tells you the auth token was removed. Updates its state file. The plan shows no changes. But the underlying API call silently fails. The token is still active on the cluster.&lt;/p&gt;</description></item><item><title>SimpleAD is Samba 4 — you can create users with ldapadd instead of ClickOps</title><link>https://ferkakta.dev/simplead-ldap-user-creation-terraform/</link><pubDate>Thu, 12 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/simplead-ldap-user-creation-terraform/</guid><description>&lt;p&gt;If you&amp;rsquo;ve tried to fully automate Amazon WorkSpaces provisioning with Terraform, you&amp;rsquo;ve hit the wall: SimpleAD has no AWS API for creating directory users.&lt;/p&gt;
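&lt;p&gt;The title gives away the workaround: SimpleAD is Samba 4 under the hood, so ordinary LDAP tooling can create the user. Something along these lines (a sketch; host, DNs, and attributes are illustrative, and enabling the account plus setting a password takes extra steps over LDAPS):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;ldapadd -x -H ldap://ad.corp.example.com \
  -D "CN=Administrator,CN=Users,DC=corp,DC=example,DC=com" -W &lt;&lt;'EOF'
dn: CN=Jane Doe,CN=Users,DC=corp,DC=example,DC=com
objectClass: user
sAMAccountName: jdoe
userPrincipalName: jdoe@corp.example.com
displayName: Jane Doe
EOF
&lt;/code&gt;&lt;/pre&gt;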
&lt;h2 id="what-every-guide-tells-you"&gt;What every guide tells you&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Enable WorkDocs in the console, then use the WorkDocs API to create users&lt;/li&gt;
&lt;li&gt;Launch a domain-joined EC2 instance with RSAT tools and create users manually&lt;/li&gt;
&lt;li&gt;RDP into a Windows management machine and use the AD admin console&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these break the Terraform workflow. Everything is automated except the one step that creates the user your WorkSpace actually needs.&lt;/p&gt;</description></item><item><title>90 AWS resources in 5 minutes — automating multi-tenant SaaS tenant lifecycle</title><link>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</link><pubDate>Tue, 10 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</guid><description>&lt;p&gt;I recorded our entire tenant lifecycle — create, test, destroy — with no edits. Here&amp;rsquo;s what 5 minutes of infrastructure automation looks like when there are no tickets, no handoffs, and no &amp;ldquo;can someone set up the database.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="what-happens-on-tenant-create"&gt;What happens on &lt;code&gt;tenant create&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;One GitHub Actions workflow backed by Terraform + a Kubernetes operator:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Validates the tenant name, resolves container images from the latest release branch&lt;/li&gt;
&lt;li&gt;Provisions ACM wildcard cert + Route53 DNS records&lt;/li&gt;
&lt;li&gt;Creates the &lt;code&gt;Tenant&lt;/code&gt; CRD → operator provisions PostgreSQL databases on shared RDS, seeds credentials to SSM&lt;/li&gt;
&lt;li&gt;Terraform deploys ExternalSecrets, Deployments, Ingress — 3 services per tenant&lt;/li&gt;
&lt;li&gt;SSM parameters auto-seeded: Redis credentials, auth URLs, signing keys — ~40 config values per tenant&lt;/li&gt;
&lt;li&gt;Zero static credentials anywhere — IRSA for everything, secrets injected at runtime from SSM via External Secrets Operator&lt;/li&gt;
&lt;/ol&gt;
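&lt;p&gt;The handoff in step 3 is just a custom resource (a sketch; the group, version, and field names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;apiVersion: platform.example.com/v1alpha1
kind: Tenant
metadata:
  name: acme
spec:
  databases: [api, auth]   # operator creates these on the shared RDS
  ssmPrefix: /tenants/acme # credentials seeded here for ESO to sync
&lt;/code&gt;&lt;/pre&gt;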
&lt;p&gt;About 5 minutes from nothing to 90 AWS resources and running pods.&lt;/p&gt;</description></item></channel></rss>