<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AWS on ferkakta.dev</title><link>https://ferkakta.dev/tags/aws/</link><description>Recent content in AWS on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Fri, 03 Apr 2026 23:00:00 -0500</lastBuildDate><atom:link href="https://ferkakta.dev/tags/aws/index.xml" rel="self" type="application/rss+xml"/><item><title>I answered 114 AWS Well-Architected Review questions from my terminal</title><link>https://ferkakta.dev/well-architected-review-from-terminal/</link><pubDate>Fri, 03 Apr 2026 23:00:00 -0500</pubDate><guid>https://ferkakta.dev/well-architected-review-from-terminal/</guid><description>&lt;p&gt;I was fourteen questions into the AWS Well-Architected Review when my wrists told me to stop. Each question is a page: read the description, check the boxes, type notes into a 2084-character text field, click Next. The Container Build Lens alone has 28 questions. I had two more lenses queued — the main Well-Architected Framework (57 questions) and the Generative AI Lens (29). That&amp;rsquo;s 114 questions total, and the console wants me to click through every one.&lt;/p&gt;</description></item><item><title>I replaced the AWS CLI completer with a datalake</title><link>https://ferkakta.dev/aws-completer-datalake-replacement/</link><pubDate>Thu, 02 Apr 2026 08:00:00 -0500</pubDate><guid>https://ferkakta.dev/aws-completer-datalake-replacement/</guid><description>&lt;p&gt;I needed to tell someone in Italy my availability in their timezone, typed &lt;code&gt;TZ=&lt;/code&gt; and hit tab, and &lt;a href="https://ferkakta.dev/blog/zsh-completions-vocabulary-construction-kit/"&gt;discovered a completer that&amp;rsquo;s apparently been sitting in zsh since the Pleistocene&lt;/a&gt;. 
That made me finally look at how completion actually works: &lt;code&gt;#compdef&lt;/code&gt;, the dispatch table, &lt;code&gt;_files&lt;/code&gt;, the whole vocabulary kit I&amp;rsquo;d been leaning on for years without really seeing. And in the middle of that I remembered the thing that had made me write off tab completion in the first place: &lt;code&gt;aws_completer&lt;/code&gt;, the Python-spawning hog that claims every argument position and still makes a mockery of my left pinky finger when it innocently asks for a filename, interrupting to say: &lt;em&gt;but wait, are you sure you don&amp;rsquo;t want to marry one of my 428 eligible daughters first?&lt;/em&gt;&lt;/p&gt;</description></item><item><title>FinOps portfolio: 71 tickets over 5 years</title><link>https://ferkakta.dev/finops-portfolio/</link><pubDate>Wed, 01 Apr 2026 15:30:00 -0500</pubDate><guid>https://ferkakta.dev/finops-portfolio/</guid><description>&lt;p&gt;My first finops ticket was called &amp;ldquo;Optimize the AWS infrastcuture.&amp;rdquo; The typo is still there. That was 2021 — a one-person infrastructure team at a startup that didn&amp;rsquo;t have the word finops in its vocabulary and didn&amp;rsquo;t know it needed one.&lt;/p&gt;
&lt;p&gt;Five years later I went looking for every cost-related ticket I&amp;rsquo;d ever created. I expected maybe thirty. I found 71, spread across 8 Jira projects, touching every layer of the stack from EBS volumes to LLM inference spend. Nobody asked me to create a finops practice. I just kept looking at the bill and refusing to pay for things that didn&amp;rsquo;t earn their keep.&lt;/p&gt;</description></item><item><title>Three holes in the partition wall</title><link>https://ferkakta.dev/three-holes-in-the-partition-wall/</link><pubDate>Tue, 31 Mar 2026 20:00:00 -0500</pubDate><guid>https://ferkakta.dev/three-holes-in-the-partition-wall/</guid><description>&lt;p&gt;I assumed GovCloud was AWS with a different region code. I wrote a whole post about how wrong that was. The partition wall between commercial AWS and GovCloud is real — no shared IAM, no cross-partition role assumption, no federated identity, no common STS endpoints. An &lt;code&gt;arn:aws:&lt;/code&gt; principal cannot see an &lt;code&gt;arn:aws-us-gov:&lt;/code&gt; resource. They are separate universes connected by a billing relationship and nothing else.&lt;/p&gt;
&lt;p&gt;Except that&amp;rsquo;s not quite true either. There are three holes in the wall, and I found them one at a time over the course of a month.&lt;/p&gt;</description></item><item><title>One module block per service per tenant</title><link>https://ferkakta.dev/one-module-block-per-service-per-tenant/</link><pubDate>Fri, 27 Mar 2026 00:00:00 -0500</pubDate><guid>https://ferkakta.dev/one-module-block-per-service-per-tenant/</guid><description>&lt;p&gt;Every tenant on my platform gets three services: an API server, an auth service, and a frontend. Each one is a single module block in Terraform that creates a Kubernetes deployment, a ClusterIP service, an ALB ingress, IRSA for AWS access, ESO-synced secrets from SSM, and a feature flag discovery mechanism. The module is the same for all three services. The variables are different.&lt;/p&gt;
&lt;p&gt;I extracted it into an open source module because I kept explaining the design decisions to people who asked &amp;ldquo;how do you deploy services to EKS?&amp;rdquo; and the answer was always &amp;ldquo;let me show you the module.&amp;rdquo; The module is the answer.&lt;/p&gt;</description></item><item><title>Every tool I've ever used is a CloudFormation frontend</title><link>https://ferkakta.dev/cloudformation-frontends/</link><pubDate>Thu, 26 Mar 2026 18:00:00 -0500</pubDate><guid>https://ferkakta.dev/cloudformation-frontends/</guid><description>&lt;p&gt;I was reading a job description that wanted CloudFormation experience, and I had the thought that derails the actual task: I&amp;rsquo;ve spent my entire career using tools that compile down to CloudFormation and don&amp;rsquo;t mention it until something breaks. I&amp;rsquo;ve just never framed it that way.&lt;/p&gt;
&lt;p&gt;My career is a parade of progressively nicer frontends for the same underlying control plane — encountered one at a time.&lt;/p&gt;
&lt;p&gt;The first one was the AWS console. Click, wait, refresh, click. Then CloudFormation itself, which was an improvement in the way that a paper map is an improvement over asking for directions — technically correct, nearly unusable in practice. Then Serverless Framework, which promised to abstract the whole stack into a YAML file and a deploy command. Then Terraform, which promised cloud-agnostic infrastructure as code with a state model that actually worked.&lt;/p&gt;</description></item><item><title>from feature_flags import *</title><link>https://ferkakta.dev/from-feature-flags-import-star/</link><pubDate>Wed, 25 Mar 2026 21:00:00 -0500</pubDate><guid>https://ferkakta.dev/from-feature-flags-import-star/</guid><description>&lt;p&gt;A colleague needed a feature flag enabled on one tenant. &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; — one environment variable, one pod. I added it to the K8s secret manually, restarted the pod, and he was unblocked in two minutes.&lt;/p&gt;
&lt;p&gt;Then I realized: the next terraform apply would overwrite that secret without the flag. The ExternalSecret syncs from SSM, and the flag wasn&amp;rsquo;t in SSM through any path terraform knew about. My manual fix had a shelf life of one deploy.&lt;/p&gt;</description></item><item><title>The Allow SCP that worked until it didn't</title><link>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</link><pubDate>Tue, 24 Mar 2026 21:00:00 -0600</pubDate><guid>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</guid><description>&lt;p&gt;I run a multi-tenant SaaS platform on AWS with Control Tower managing the organization. Control Tower deploys a region deny guardrail — an SCP that blocks API calls outside your home region. The mechanism is a &lt;code&gt;NotAction&lt;/code&gt; deny: it lists services that are allowed to operate globally (IAM, CloudFront, Route 53, a few dozen others), and denies everything else when &lt;code&gt;aws:RequestedRegion&lt;/code&gt; doesn&amp;rsquo;t match your approved list.&lt;/p&gt;
&lt;p&gt;This guardrail is one of the first things you hit when you try to do anything interesting. And the documentation says you can&amp;rsquo;t override a deny with an allow.&lt;/p&gt;</description></item><item><title>The $233 Day, Part 2: The Inference Iceberg</title><link>https://ferkakta.dev/233-dollar-day-part-2/</link><pubDate>Fri, 20 Mar 2026 17:00:00 -0500</pubDate><guid>https://ferkakta.dev/233-dollar-day-part-2/</guid><description>&lt;p&gt;I posted the part 1 findings to the team thread — model switch, cache invalidation, 20× call volume, $173 training run. Case closed. The numbers were clean, the explanation was satisfying, and the model got reverted within the hour.&lt;/p&gt;
&lt;p&gt;Except $173 was wrong. Not wrong in the analysis — the training run did cost that much. Wrong in scope. I&amp;rsquo;d found the visible part of the spend and stopped looking.&lt;/p&gt;</description></item><item><title>The $173 Training Run</title><link>https://ferkakta.dev/173-dollar-training-run/</link><pubDate>Fri, 20 Mar 2026 15:00:00 -0500</pubDate><guid>https://ferkakta.dev/173-dollar-training-run/</guid><description>&lt;p&gt;The Slack message landed at 3pm on a Wednesday: &amp;ldquo;model training successful, previously 20min, now 1h30m.&amp;rdquo; I had finished an EKS 1.32-to-1.33 upgrade on the ramparts cluster that morning. My upgrade, my timeline, my problem.&lt;/p&gt;
&lt;p&gt;The first theory wrote itself. New cluster version, fresh nodes, cold image caches. I&amp;rsquo;d fixed a broken cluster autoscaler earlier that day — the old autoscaler deployment was pinned to a node selector that no longer matched after the upgrade, so pods were stacking up in Pending until I caught it. First-run penalties after a major version bump are real. Everyone on the call nodded. I almost typed up that explanation and moved on.&lt;/p&gt;</description></item><item><title>Your employees are tenants and you should bill them like it</title><link>https://ferkakta.dev/employees-as-tenants/</link><pubDate>Mon, 16 Mar 2026 14:00:00 -0600</pubDate><guid>https://ferkakta.dev/employees-as-tenants/</guid><description>&lt;p&gt;I built a Lambda that enriches every Bedrock invocation with cost data and routes it to per-tenant CloudWatch log groups. Model ID, input tokens, output tokens, estimated cost in USD, all written to &lt;code&gt;/bedrock/tenants/{tenant}&lt;/code&gt; so each customer&amp;rsquo;s AI spend is visible in near-real-time.&lt;/p&gt;
&lt;p&gt;Then a developer on the team needed Bedrock access for local development, and I had a problem I hadn&amp;rsquo;t anticipated.&lt;/p&gt;
&lt;h2 id="the-invisible-burn"&gt;The invisible burn&lt;/h2&gt;
&lt;p&gt;The developer&amp;rsquo;s use case was reasonable. He was building features against the Bedrock API and needed to iterate against real models, not mocks. I created an SSO permission set with &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; and handed him the profile name.&lt;/p&gt;</description></item><item><title>I assumed GovCloud was AWS with a different region code. It took two weeks to prove me wrong.</title><link>https://ferkakta.dev/govcloud-surprises/</link><pubDate>Wed, 11 Mar 2026 23:00:00 -0400</pubDate><guid>https://ferkakta.dev/govcloud-surprises/</guid><description>&lt;p&gt;I needed a GovCloud account for a multi-tenant NIST compliance platform. I&amp;rsquo;d been running commercial AWS infrastructure for months — EKS, Terraform, tenant provisioning, the whole stack. GovCloud would be the same thing in a different region. That was the assumption. It lasted about four hours.&lt;/p&gt;
&lt;h2 id="the-account-that-doesnt-exist-yet"&gt;The account that doesn&amp;rsquo;t exist yet&lt;/h2&gt;
&lt;p&gt;My management account couldn&amp;rsquo;t call &lt;code&gt;CreateGovCloudAccount&lt;/code&gt;. The API returned &lt;code&gt;ConstraintViolationException&lt;/code&gt; with a message about not being &amp;ldquo;enabled for access to GovCloud&amp;rdquo; and no guidance on what that meant. I filed a support case. AWS enabled the permission two days later, and as a side effect created a standalone GovCloud account that had no relationship to my Organizations structure — an orphan floating in the partition with disconnected root credentials. I still had to find it and deal with it.&lt;/p&gt;</description></item><item><title>I debugged a Lambda timeout for 6 hours. The fix was 4 CLI commands.</title><link>https://ferkakta.dev/lambda-timeout-forensic-arc/</link><pubDate>Wed, 11 Mar 2026 16:00:00 -0400</pubDate><guid>https://ferkakta.dev/lambda-timeout-forensic-arc/</guid><description>&lt;p&gt;The ticket said the Lambda tracer was timing out. The Slack thread said &lt;code&gt;ConnectTimeoutError&lt;/code&gt; to an internal tracing endpoint. Four Lambda functions had been moved into a VPC the day before so they could reach &lt;code&gt;tracer.internal.ferkakta.net&lt;/code&gt; — an internal ALB at &lt;code&gt;10.x.x.x&lt;/code&gt;, only reachable from inside the VPC. The migration was verified, the API returned success, the ticket should not have existed.&lt;/p&gt;
&lt;p&gt;The people who built this system had moved on to other projects. The people using it were in a different timezone. There was no architecture doc, no runbook, no one to pair with. I had CloudWatch, a kubectl context, and AWS credentials.&lt;/p&gt;</description></item><item><title>Zero-touch multi-tenant deploys: removing myself from the critical path</title><link>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</link><pubDate>Mon, 02 Mar 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</guid><description>&lt;p&gt;I had provisioned two tenants when I realized the deploy process didn&amp;rsquo;t scale to three. Each tenant on &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt; runs three services &amp;ndash; &lt;code&gt;api-server&lt;/code&gt;, &lt;code&gt;web-client&lt;/code&gt; (the React frontend), &lt;code&gt;tenant-auth&lt;/code&gt; &amp;ndash; each with its own Docker image in ECR. Deploying a release meant running &lt;code&gt;gh workflow run deploy-tenant.yml -f tenant_name=acme -f action=apply -f update_images=true&lt;/code&gt;, then doing it again for the next tenant. With 3 services resolving per run and N tenants, I was the bottleneck. Not Terraform, not GitHub Actions, not ECR. Me, remembering which tenants existed and typing their names correctly.&lt;/p&gt;</description></item><item><title>Per-Tenant CloudWatch Log Isolation on EKS, or: Why I Stopped Using aws-for-fluent-bit</title><link>https://ferkakta.dev/per-tenant-cloudwatch-log-isolation-eks/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/per-tenant-cloudwatch-log-isolation-eks/</guid><description>&lt;h2 id="the-starting-assumption"&gt;The starting assumption&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m building &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt;, a multi-tenant compliance platform running on EKS. Each tenant gets a Kubernetes namespace &amp;ndash; &lt;code&gt;tenant-acme&lt;/code&gt;, &lt;code&gt;tenant-globex&lt;/code&gt;, whatever &amp;ndash; and the compliance controls require that their application logs land in isolated storage with 365-day retention. CMMC maps this to AU-2 (audit events), AU-3 (audit content), AU-11 (retention), and AC-4 (information flow enforcement). A tenant cannot read another tenant&amp;rsquo;s container output.&lt;/p&gt;
&lt;p&gt;The obvious first move was &lt;code&gt;aws-for-fluent-bit&lt;/code&gt;, AWS&amp;rsquo;s own Helm chart and container image for shipping logs to CloudWatch. AWS service, AWS chart, AWS logging destination. The blessed path.&lt;/p&gt;</description></item><item><title>Why we removed aws-for-fluent-bit from EKS</title><link>https://ferkakta.dev/why-we-removed-aws-for-fluent-bit-from-eks/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/why-we-removed-aws-for-fluent-bit-from-eks/</guid><description>&lt;p&gt;We deployed &lt;code&gt;aws-for-fluent-bit&lt;/code&gt; because AWS recommends it.&lt;/p&gt;
&lt;p&gt;If you follow the EKS logging documentation, that&amp;rsquo;s the default path. It assumes you use AWS&amp;rsquo;s distribution of Fluent Bit rather than the upstream Helm chart.&lt;/p&gt;
&lt;p&gt;We did.&lt;/p&gt;
&lt;p&gt;Two days later, we ripped it out.&lt;/p&gt;
&lt;p&gt;The AWS chart and the upstream chart are not the same thing. The differences aren&amp;rsquo;t cosmetic. They affect how quickly you receive security patches, how transparently your configuration maps to the underlying plugin, and how many boundaries sit between your logs and the CloudWatch API.&lt;/p&gt;</description></item><item><title>Stop copying AWS managed policies — deny what you don't want instead</title><link>https://ferkakta.dev/iam-deny-overlay-managed-policies/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/iam-deny-overlay-managed-policies/</guid><description>&lt;p&gt;I needed to give a developer full CloudWatch read access — metrics, alarms, dashboards, log groups — but deny access to three categories of log groups containing security-sensitive data: WorkSpaces OS event logs, VPC flow logs, and WAF request logs.&lt;/p&gt;
&lt;p&gt;The reflex is to copy &lt;code&gt;CloudWatchReadOnlyAccess&lt;/code&gt; into a custom policy and delete the parts you don&amp;rsquo;t want. I&amp;rsquo;ve seen this in every organization I&amp;rsquo;ve worked in. It produces a policy with 50+ actions that you now own. Every time AWS ships a new CloudWatch feature, your policy is stale. You won&amp;rsquo;t update it. It&amp;rsquo;ll rot.&lt;/p&gt;</description></item><item><title>The IAM policy controls access — the document controls how people feel about it</title><link>https://ferkakta.dev/access-control-docs-as-respect/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/access-control-docs-as-respect/</guid><description>&lt;p&gt;I tightened a teammate&amp;rsquo;s AWS permissions last night. Added an inline deny policy to block three categories of CloudWatch log groups — WorkSpaces OS logs, VPC flow logs, WAF request data. Five minutes of IAM work. Then I spent twenty minutes writing a document explaining every boundary, what&amp;rsquo;s accessible, what&amp;rsquo;s denied, what&amp;rsquo;s coming next, and what I haven&amp;rsquo;t designed yet.&lt;/p&gt;
&lt;p&gt;The document mattered more than the policy.&lt;/p&gt;
&lt;h2 id="the-default-is-silence"&gt;The default is silence&lt;/h2&gt;
&lt;p&gt;Most companies handle access control the same way. Someone asks for access. An admin creates a policy. The requester gets a login link. Nobody explains what they can and can&amp;rsquo;t do, or why.&lt;/p&gt;</description></item><item><title>IAM trust policies silently accept wildcards in principals — and silently deny everything</title><link>https://ferkakta.dev/iam-trust-policy-wildcards/</link><pubDate>Thu, 26 Feb 2026 10:00:00 -0600</pubDate><guid>https://ferkakta.dev/iam-trust-policy-wildcards/</guid><description>&lt;p&gt;I needed a cross-account IAM role in a management account that workloads in a separate devops account could assume to send email via SES. Two types of callers: one shared service with a stable role name, and N dynamically-created per-tenant roles following a naming convention like &lt;code&gt;myapp-apiserver-*&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The shared service was straightforward — exact ARN in the trust policy principal. For the per-tenant roles, I wrote what looked correct:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;Principal&amp;#34;&lt;/span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;:&lt;/span&gt; { &lt;span style="color:#f92672"&gt;&amp;#34;AWS&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;arn:aws:iam::111111111111:role/myapp-apiserver-*&amp;#34;&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;terraform apply&lt;/code&gt; succeeded. The role was created. Every assume-role call was denied.&lt;/p&gt;</description></item><item><title>IAM eventual consistency is 4 seconds — if your policy still doesn't work, you have a bug</title><link>https://ferkakta.dev/iam-eventual-consistency-is-four-seconds/</link><pubDate>Thu, 26 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/iam-eventual-consistency-is-four-seconds/</guid><description>&lt;p&gt;I changed an IAM inline policy on a role — added an &lt;code&gt;sts:AssumeRole&lt;/code&gt; statement so a pod could assume a cross-account SES role. Ran &lt;code&gt;terraform apply&lt;/code&gt;. Checked the policy with &lt;code&gt;get-role-policy&lt;/code&gt;. The old policy came back. No new statement.&lt;/p&gt;
&lt;p&gt;I said &amp;ldquo;propagation delay&amp;rdquo; and moved on to other work.&lt;/p&gt;
&lt;p&gt;Twenty minutes later I checked again. Same old policy. That&amp;rsquo;s not propagation.&lt;/p&gt;
&lt;h2 id="what-eventual-consistency-actually-means"&gt;What eventual consistency actually means&lt;/h2&gt;
&lt;p&gt;AWS IAM uses a distributed computing model. Changes to policies, roles, and credentials take time to replicate across endpoints. AWS documents this explicitly and recommends not including IAM changes in critical code paths.&lt;/p&gt;</description></item><item><title>The Over-Mighty Subject: why your site repos have too much power</title><link>https://ferkakta.dev/over-mighty-subjects-terraform-credential-scope/</link><pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/over-mighty-subjects-terraform-credential-scope/</guid><description>&lt;p&gt;Josh Marshall &lt;a href="https://talkingpointsmemo.com/edblog/elon-musk-and-the-the-threat-of-the-over-mighty-subject-part-i/sharetoken/21fb0dac-112d-4a9d-bcc0-f5bf844b16bb"&gt;borrows a phrase from medieval history&lt;/a&gt; to describe a modern political problem: the Over-Mighty Subject. A feudal lord whose personal wealth, private army, and territorial control grew so large that he rivaled the crown itself. Not a rebel — still nominally a subject — but operating with enough independent power that the sovereign&amp;rsquo;s authority became theoretical.&lt;/p&gt;
&lt;p&gt;I had three of them in my infrastructure. They were Terraform roots for static sites.&lt;/p&gt;</description></item><item><title>I replaced $489/mo in AWS Client VPN with a $3 t4g.nano running Headscale</title><link>https://ferkakta.dev/headscale-aws-open-source-terraform-module/</link><pubDate>Sat, 21 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/headscale-aws-open-source-terraform-module/</guid><description>&lt;p&gt;A finops sprint surfaced $489/mo in AWS Client VPN charges. Three endpoints across two accounts, plus connection-hour fees. For a VPN that four people used. I had provisioned two of them.&lt;/p&gt;
&lt;p&gt;At the time, they felt indispensable — secure customer access, familiar tooling, predictable behavior.
In reality, they were architectural inertia.&lt;/p&gt;
&lt;p&gt;I replaced all three with a single t4g.nano running &lt;a href="https://github.com/juanfont/headscale"&gt;Headscale&lt;/a&gt; — the open-source Tailscale coordination server. Total cost: ~$3/mo.&lt;/p&gt;
&lt;p&gt;I genericized the Terraform and open-sourced the module.&lt;/p&gt;</description></item><item><title>Cross-repo auto-deploy with GitHub Actions: the orchestrator pattern</title><link>https://ferkakta.dev/cross-repo-auto-deploy-orchestration-github-actions/</link><pubDate>Fri, 20 Feb 2026 10:00:00 -0500</pubDate><guid>https://ferkakta.dev/cross-repo-auto-deploy-orchestration-github-actions/</guid><description>&lt;p&gt;Two repos merged within seconds of each other. The first orchestrator run failed — &lt;code&gt;web-client&lt;/code&gt;&amp;rsquo;s ECR image didn&amp;rsquo;t exist yet because the build was still running. The GitHub Actions log showed a red X, an error annotation, and a Slack notification I didn&amp;rsquo;t need to read.&lt;/p&gt;
&lt;p&gt;Four minutes later, the second run deployed both changes. No retry logic. No manual intervention. Nobody touched anything.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d spent my day building a cross-repo deploy pipeline for a multi-tenant platform — three app repos pushing Docker images to ECR, one infra repo deploying the new tenant service images to EKS. The race condition was the first real test. It failed exactly the way I wanted it to.&lt;/p&gt;</description></item><item><title>Your CI/CD dispatch token can rewrite your infrastructure code</title><link>https://ferkakta.dev/github-actions-repository-dispatch-contents-write-permission/</link><pubDate>Fri, 20 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/github-actions-repository-dispatch-contents-write-permission/</guid><description>&lt;p&gt;I built a cross-repo auto-deploy pipeline this week. Three app repos push Docker images to ECR, then dispatch a deploy event to the infra repo&amp;rsquo;s orchestrator workflow via &lt;code&gt;repository_dispatch&lt;/code&gt;. Standard pattern.&lt;/p&gt;
&lt;p&gt;The gotcha: fine-grained PATs need &lt;code&gt;contents:write&lt;/code&gt; to call the &lt;code&gt;repository_dispatch&lt;/code&gt; API. Not &lt;code&gt;actions:write&lt;/code&gt; — &lt;code&gt;contents:write&lt;/code&gt;. The permission that also lets you push code, create branches, and delete files.&lt;/p&gt;
&lt;p&gt;My service token that should only be able to say &amp;ldquo;hey, deploy this&amp;rdquo; can also rewrite the deployment workflow it&amp;rsquo;s triggering. That&amp;rsquo;s not least privilege. That&amp;rsquo;s a door that&amp;rsquo;s three sizes too wide.&lt;/p&gt;</description></item><item><title>Your terraform apply is silently rolling back your container images</title><link>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</link><pubDate>Tue, 17 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</guid><description>&lt;p&gt;Every &amp;ldquo;deploy to EKS with GitHub Actions&amp;rdquo; tutorial solves the same problem: build an image, push to ECR, deploy it. The tutorial ends at &amp;ldquo;your pod is running.&amp;rdquo; Nobody talks about day two.&lt;/p&gt;
&lt;h2 id="the-silent-rollback"&gt;The silent rollback&lt;/h2&gt;
&lt;p&gt;Day two: you have a running EKS cluster with three services per tenant. You need to change an IAM policy. You open a PR, touch one line of Terraform, run &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Your IAM policy updates. Your container images also update — to whatever was hardcoded in &lt;code&gt;variables.tf&lt;/code&gt; as the default. That default was correct three months ago. Your services just rolled back to a three-month-old image and nobody noticed because the deployment succeeded.&lt;/p&gt;</description></item><item><title>Terraform module for multi-provider DNS: define once, deploy to Route53 + Cloudflare</title><link>https://ferkakta.dev/terraform-module-for-multi-provider-dns-define-once-deploy-to-route53--cloudflare/</link><pubDate>Mon, 16 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/terraform-module-for-multi-provider-dns-define-once-deploy-to-route53--cloudflare/</guid><description>&lt;p&gt;I manage 10 domains across Route53 and Cloudflare. When I set up &lt;a href="https://fizz.today/til-cloudflare-registrar-locks-your-nameservers-and-how-to-escape-with-multi-provider-dns/"&gt;multi-provider DNS&lt;/a&gt; on my first domain, every record had to be defined twice — once for each provider. The APIs are different enough that you can&amp;rsquo;t just copy-paste.&lt;/p&gt;
&lt;p&gt;The duplication got old fast. So I wrote a module.&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The problem&lt;/h2&gt;
&lt;p&gt;Route53 and Cloudflare represent the same DNS data differently:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MX records&lt;/strong&gt;: Route53 bundles priority into the value string (&lt;code&gt;&amp;quot;10 mx1.example.com&amp;quot;&lt;/code&gt;). Cloudflare splits it into a separate &lt;code&gt;priority&lt;/code&gt; field.&lt;/p&gt;</description></item><item><title>ElastiCache auth-token to RBAC migration has a Terraform provider bug</title><link>https://ferkakta.dev/elasticache-auth-token-to-rbac-migration/</link><pubDate>Fri, 13 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/elasticache-auth-token-to-rbac-migration/</guid><description>&lt;p&gt;Needed to migrate a shared ElastiCache Redis cluster from a single auth token to per-user RBAC. Breaking change — every service on the cluster goes dark if you get the sequencing wrong.&lt;/p&gt;
&lt;h2 id="the-terraform-provider-bug"&gt;The Terraform provider bug&lt;/h2&gt;
&lt;p&gt;Step one: don&amp;rsquo;t touch the real cluster. Built a throwaway copy and ran the migration there first.&lt;/p&gt;
&lt;p&gt;Good thing — the Terraform AWS provider has a bug in the auth-token removal step. It tells you the auth token was removed. Updates its state file. The plan shows no changes. But the underlying API call silently fails. The token is still active on the cluster.&lt;/p&gt;</description></item><item><title>Amazon WorkSpaces are invisible to SSM and CloudWatch (and how to fix it)</title><link>https://ferkakta.dev/workspaces-ssm-cloudwatch-bootstrap/</link><pubDate>Thu, 12 Feb 2026 10:00:00 -0600</pubDate><guid>https://ferkakta.dev/workspaces-ssm-cloudwatch-bootstrap/</guid><description>&lt;p&gt;I spent an afternoon arguing with Windows about whether I was allowed to be root on a machine I created. Six hours and six layers of undocumented workarounds later, I got CMMC-compliant audit logging on a desktop that doesn&amp;rsquo;t know it exists.&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The problem&lt;/h2&gt;
&lt;p&gt;WorkSpaces don&amp;rsquo;t show up in AWS Systems Manager. They&amp;rsquo;re not EC2 instances — no instance profile, no metadata endpoint, no identity. SSM Agent is pre-installed but thinks it&amp;rsquo;s nobody. CloudWatch Agent has no credentials and doesn&amp;rsquo;t know what region it&amp;rsquo;s in.&lt;/p&gt;</description></item><item><title>SimpleAD is Samba 4 — you can create users with ldapadd instead of ClickOps</title><link>https://ferkakta.dev/simplead-ldap-user-creation-terraform/</link><pubDate>Thu, 12 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/simplead-ldap-user-creation-terraform/</guid><description>&lt;p&gt;If you&amp;rsquo;ve tried to fully automate Amazon WorkSpaces provisioning with Terraform, you&amp;rsquo;ve hit the wall: SimpleAD has no AWS API for creating directory users.&lt;/p&gt;
&lt;h2 id="what-every-guide-tells-you"&gt;What every guide tells you&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Enable WorkDocs in the console, then use the WorkDocs API to create users&lt;/li&gt;
&lt;li&gt;Launch a domain-joined EC2 instance with RSAT tools and create users manually&lt;/li&gt;
&lt;li&gt;RDP into a Windows management machine and use the AD admin console&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these break the Terraform workflow. Everything is automated except the one step that creates the user your WorkSpace actually needs.&lt;/p&gt;</description></item><item><title>What building infrastructure for a startup actually looks like</title><link>https://ferkakta.dev/startup-infra-unglamorous-work/</link><pubDate>Wed, 11 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/startup-infra-unglamorous-work/</guid><description>&lt;p&gt;I spent a day doing the unglamorous infrastructure work that keeps a startup alive. Here&amp;rsquo;s everything that happened.&lt;/p&gt;
&lt;h2 id="morning-security-audit"&gt;Morning: security audit&lt;/h2&gt;
&lt;p&gt;Audited two EKS clusters for a K8s privilege escalation vulnerability. Found 9 service accounts with &lt;code&gt;cluster-admin&lt;/code&gt; that didn&amp;rsquo;t need it. Deleted two dead deployments — ArgoCD and Velero, both mine, both abandoned months ago. The rest are kubeflow components we can&amp;rsquo;t touch until 1.36 ships the fix in April.&lt;/p&gt;</description></item><item><title>90 AWS resources in 5 minutes — automating multi-tenant SaaS tenant lifecycle</title><link>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</link><pubDate>Tue, 10 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</guid><description>&lt;p&gt;I recorded our entire tenant lifecycle — create, test, destroy — with no edits. Here&amp;rsquo;s what 5 minutes of infrastructure automation looks like when there are no tickets, no handoffs, and no &amp;ldquo;can someone set up the database.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="what-happens-on-tenant-create"&gt;What happens on &lt;code&gt;tenant create&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;One GitHub Actions workflow backed by Terraform + a Kubernetes operator:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Validates the tenant name, resolves container images from the latest release branch&lt;/li&gt;
&lt;li&gt;Provisions ACM wildcard cert + Route53 DNS records&lt;/li&gt;
&lt;li&gt;Creates the &lt;code&gt;Tenant&lt;/code&gt; CRD → operator provisions PostgreSQL databases on shared RDS, seeds credentials to SSM&lt;/li&gt;
&lt;li&gt;Terraform deploys ExternalSecrets, Deployments, Ingress — 3 services per tenant&lt;/li&gt;
&lt;li&gt;SSM parameters auto-seeded: Redis credentials, auth URLs, signing keys — ~40 config values per tenant&lt;/li&gt;
&lt;li&gt;Zero static credentials anywhere — IRSA for everything, secrets injected at runtime from SSM via External Secrets Operator&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;About 5 minutes from nothing to 90 AWS resources and running pods.&lt;/p&gt;</description></item><item><title>Your ACM certificate request is a beacon — scanners are watching Certificate Transparency logs</title><link>https://ferkakta.dev/acm-certificate-transparency-scanners/</link><pubDate>Mon, 09 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/acm-certificate-transparency-scanners/</guid><description>&lt;p&gt;I accidentally exposed production secrets on a public endpoint. Here&amp;rsquo;s what happened and what I learned about Certificate Transparency.&lt;/p&gt;
&lt;h2 id="the-setup"&gt;The setup&lt;/h2&gt;
&lt;p&gt;We&amp;rsquo;re building a multi-tenant SaaS platform on EKS. During development, our Terraform module defaulted to &lt;code&gt;ealen/echo-server&lt;/code&gt; for three microservices — a lightweight HTTP server that echoes back request info. Seemed harmless.&lt;/p&gt;
&lt;p&gt;What I missed: echo-server echoes EVERYTHING. Every environment variable in the container, including ones injected from AWS SSM via External Secrets Operator. Database connection strings. Redis auth tokens. OAuth client secrets. Signing keys. A single unauthenticated &lt;code&gt;GET /&lt;/code&gt; returns it all as JSON.&lt;/p&gt;</description></item></channel></rss>