<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AWS on ferkakta.dev</title><link>https://ferkakta.dev/tags/aws/</link><description>Recent content in AWS on ferkakta.dev</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright fizz.</copyright><lastBuildDate>Fri, 03 Apr 2026 23:00:00 -0500</lastBuildDate><atom:link href="https://ferkakta.dev/tags/aws/index.xml" rel="self" type="application/rss+xml"/><item><title>I answered 114 AWS Well-Architected Review questions from my terminal</title><link>https://ferkakta.dev/well-architected-review-from-terminal/</link><pubDate>Fri, 03 Apr 2026 23:00:00 -0500</pubDate><guid>https://ferkakta.dev/well-architected-review-from-terminal/</guid><description>&lt;p&gt;I was fourteen questions into the AWS Well-Architected Review when my wrists told me to stop. Each question is a page: read the description, check the boxes, type notes into a 2084-character text field, click Next. The Container Build Lens alone has 28 questions. I had two more lenses queued — the main Well-Architected Framework (57 questions) and the Generative AI Lens (29). That&amp;rsquo;s 114 questions total, and the console wants me to click through every one.&lt;/p&gt;</description></item><item><title>I replaced the AWS CLI completer with a datalake</title><link>https://ferkakta.dev/aws-completer-datalake-replacement/</link><pubDate>Thu, 02 Apr 2026 08:00:00 -0500</pubDate><guid>https://ferkakta.dev/aws-completer-datalake-replacement/</guid><description>&lt;p&gt;I needed to tell someone in Italy my availability in their timezone, typed &lt;code&gt;TZ=&lt;/code&gt; and hit tab, and &lt;a href="https://ferkakta.dev/blog/zsh-completions-vocabulary-construction-kit/"&gt;discovered a completer that&amp;rsquo;s apparently been sitting in zsh since the Pleistocene&lt;/a&gt;. 
That made me finally look at how completion actually works: &lt;code&gt;#compdef&lt;/code&gt;, the dispatch table, &lt;code&gt;_files&lt;/code&gt;, the whole vocabulary kit I&amp;rsquo;d been leaning on for years without really seeing. And in the middle of that I remembered the thing that had made me write off tab completion in the first place: &lt;code&gt;aws_completer&lt;/code&gt;, the Python-spawning hog that claims every argument position and still makes a mockery of my left pinky finger when it innocently asks for a filename, interrupting to say: &lt;em&gt;but wait, are you sure you don&amp;rsquo;t want to marry one of my 428 eligible daughters first?&lt;/em&gt;&lt;/p&gt;</description></item><item><title>FinOps portfolio: 71 tickets over 5 years</title><link>https://ferkakta.dev/finops-portfolio/</link><pubDate>Wed, 01 Apr 2026 15:30:00 -0500</pubDate><guid>https://ferkakta.dev/finops-portfolio/</guid><description>&lt;p&gt;My first finops ticket was called &amp;ldquo;Optimize the AWS infrastcuture.&amp;rdquo; The typo is still there. That was 2021 — a one-person infrastructure team at a startup that didn&amp;rsquo;t have the word finops in its vocabulary and didn&amp;rsquo;t know it needed one.&lt;/p&gt;
&lt;p&gt;Five years later I went looking for every cost-related ticket I&amp;rsquo;d ever created. I expected maybe thirty. I found 71, spread across 8 Jira projects, touching every layer of the stack from EBS volumes to LLM inference spend. Nobody asked me to create a finops practice. I just kept looking at the bill and refusing to pay for things that didn&amp;rsquo;t earn their keep.&lt;/p&gt;</description></item><item><title>Three holes in the partition wall</title><link>https://ferkakta.dev/three-holes-in-the-partition-wall/</link><pubDate>Tue, 31 Mar 2026 20:00:00 -0500</pubDate><guid>https://ferkakta.dev/three-holes-in-the-partition-wall/</guid><description>&lt;p&gt;I assumed GovCloud was AWS with a different region code. I wrote a whole post about how wrong that was. The partition wall between commercial AWS and GovCloud is real — no shared IAM, no cross-partition role assumption, no federated identity, no common STS endpoints. An &lt;code&gt;arn:aws:&lt;/code&gt; principal cannot see an &lt;code&gt;arn:aws-us-gov:&lt;/code&gt; resource. They are separate universes connected by a billing relationship and nothing else.&lt;/p&gt;
&lt;p&gt;Except that&amp;rsquo;s not quite true either. There are three holes in the wall, and I found them one at a time over the course of a month.&lt;/p&gt;</description></item><item><title>One module block per service per tenant</title><link>https://ferkakta.dev/one-module-block-per-service-per-tenant/</link><pubDate>Fri, 27 Mar 2026 00:00:00 -0500</pubDate><guid>https://ferkakta.dev/one-module-block-per-service-per-tenant/</guid><description>&lt;p&gt;Every tenant on my platform gets three services: an API server, an auth service, and a frontend. Each one is a single module block in Terraform that creates a Kubernetes deployment, a ClusterIP service, an ALB ingress, IRSA for AWS access, ESO-synced secrets from SSM, and a feature flag discovery mechanism. The module is the same for all three services. The variables are different.&lt;/p&gt;
&lt;p&gt;I extracted it into an open source module because I kept explaining the design decisions to people who asked &amp;ldquo;how do you deploy services to EKS?&amp;rdquo; and the answer was always &amp;ldquo;let me show you the module.&amp;rdquo; The module is the answer.&lt;/p&gt;</description></item><item><title>Every tool I've ever used is a CloudFormation frontend</title><link>https://ferkakta.dev/cloudformation-frontends/</link><pubDate>Thu, 26 Mar 2026 18:00:00 -0500</pubDate><guid>https://ferkakta.dev/cloudformation-frontends/</guid><description>&lt;p&gt;I was reading a job description that wanted CloudFormation experience, and I had the thought that derails the actual task: I&amp;rsquo;ve spent my entire career using tools that compile down to CloudFormation and don&amp;rsquo;t mention it until something breaks. I&amp;rsquo;ve just never framed it that way.&lt;/p&gt;
&lt;p&gt;My career is a parade of progressively nicer frontends for the same underlying control plane — encountered one at a time.&lt;/p&gt;
&lt;p&gt;The first one was the AWS console. Click, wait, refresh, click. Then CloudFormation itself, which was an improvement in the way that a paper map is an improvement over asking for directions — technically correct, nearly unusable in practice. Then Serverless Framework, which promised to abstract the whole stack into a YAML file and a deploy command. Then Terraform, which promised cloud-agnostic infrastructure as code with a state model that actually worked.&lt;/p&gt;</description></item><item><title>from feature_flags import *</title><link>https://ferkakta.dev/from-feature-flags-import-star/</link><pubDate>Wed, 25 Mar 2026 21:00:00 -0500</pubDate><guid>https://ferkakta.dev/from-feature-flags-import-star/</guid><description>&lt;p&gt;A colleague needed a feature flag enabled on one tenant. &lt;code&gt;FEATURE_FLAG_ENABLE_AGENTS=True&lt;/code&gt; — one environment variable, one pod. I added it to the K8s secret manually, restarted the pod, and he was unblocked in two minutes.&lt;/p&gt;
&lt;p&gt;Then I realized: the next terraform apply would overwrite that secret without the flag. The ExternalSecret syncs from SSM, and the flag wasn&amp;rsquo;t in SSM through any path terraform knew about. My manual fix had a shelf life of one deploy.&lt;/p&gt;</description></item><item><title>The Allow SCP that worked until it didn't</title><link>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</link><pubDate>Tue, 24 Mar 2026 21:00:00 -0600</pubDate><guid>https://ferkakta.dev/scp-allow-overrides-notaction-deny/</guid><description>&lt;p&gt;I run a multi-tenant SaaS platform on AWS with Control Tower managing the organization. Control Tower deploys a region deny guardrail — an SCP that blocks API calls outside your home region. The mechanism is a &lt;code&gt;NotAction&lt;/code&gt; deny: it lists services that are allowed to operate globally (IAM, CloudFront, Route 53, a few dozen others), and denies everything else when &lt;code&gt;aws:RequestedRegion&lt;/code&gt; doesn&amp;rsquo;t match your approved list.&lt;/p&gt;
&lt;p&gt;This guardrail is one of the first things you hit when you try to do anything interesting. And the documentation says you can&amp;rsquo;t override a deny with an allow.&lt;/p&gt;</description></item><item><title>The $233 Day, Part 2: The Inference Iceberg</title><link>https://ferkakta.dev/233-dollar-day-part-2/</link><pubDate>Fri, 20 Mar 2026 17:00:00 -0500</pubDate><guid>https://ferkakta.dev/233-dollar-day-part-2/</guid><description>&lt;p&gt;I posted the part 1 findings to the team thread — model switch, cache invalidation, 20× call volume, $173 training run. Case closed. The numbers were clean, the explanation was satisfying, and the model got reverted within the hour.&lt;/p&gt;
&lt;p&gt;Except $173 was wrong. Not wrong in the analysis — the training run did cost that much. Wrong in scope. I&amp;rsquo;d found the visible part of the spend and stopped looking.&lt;/p&gt;</description></item><item><title>The $173 Training Run</title><link>https://ferkakta.dev/173-dollar-training-run/</link><pubDate>Fri, 20 Mar 2026 15:00:00 -0500</pubDate><guid>https://ferkakta.dev/173-dollar-training-run/</guid><description>&lt;p&gt;The Slack message landed at 3pm on a Wednesday: &amp;ldquo;model training successful, previously 20min, now 1h30m.&amp;rdquo; I had finished an EKS 1.32-to-1.33 upgrade on the ramparts cluster that morning. My upgrade, my timeline, my problem.&lt;/p&gt;
&lt;p&gt;The first theory wrote itself. New cluster version, fresh nodes, cold image caches. I&amp;rsquo;d fixed a broken cluster autoscaler earlier that day — the old autoscaler deployment was pinned to a node selector that no longer matched after the upgrade, so pods were stacking up in Pending until I caught it. First-run penalties after a major version bump are real. Everyone on the call nodded. I almost typed up that explanation and moved on.&lt;/p&gt;</description></item><item><title>Your employees are tenants and you should bill them like it</title><link>https://ferkakta.dev/employees-as-tenants/</link><pubDate>Mon, 16 Mar 2026 14:00:00 -0600</pubDate><guid>https://ferkakta.dev/employees-as-tenants/</guid><description>&lt;p&gt;I built a Lambda that enriches every Bedrock invocation with cost data and routes it to per-tenant CloudWatch log groups. Model ID, input tokens, output tokens, estimated cost in USD, all written to &lt;code&gt;/bedrock/tenants/{tenant}&lt;/code&gt; so each customer&amp;rsquo;s AI spend is visible in near-real-time.&lt;/p&gt;
&lt;p&gt;Then a developer on the team needed Bedrock access for local development, and I had a problem I hadn&amp;rsquo;t anticipated.&lt;/p&gt;
&lt;h2 id="the-invisible-burn"&gt;The invisible burn&lt;/h2&gt;
&lt;p&gt;The developer&amp;rsquo;s use case was reasonable. He was building features against the Bedrock API and needed to iterate against real models, not mocks. I created an SSO permission set with &lt;code&gt;bedrock:InvokeModel&lt;/code&gt; and handed him the profile name.&lt;/p&gt;</description></item><item><title>I assumed GovCloud was AWS with a different region code. It took two weeks to prove me wrong.</title><link>https://ferkakta.dev/govcloud-surprises/</link><pubDate>Wed, 11 Mar 2026 23:00:00 -0400</pubDate><guid>https://ferkakta.dev/govcloud-surprises/</guid><description>&lt;p&gt;I needed a GovCloud account for a multi-tenant NIST compliance platform. I&amp;rsquo;d been running commercial AWS infrastructure for months — EKS, Terraform, tenant provisioning, the whole stack. GovCloud would be the same thing in a different region. That was the assumption. It lasted about four hours.&lt;/p&gt;
&lt;h2 id="the-account-that-doesnt-exist-yet"&gt;The account that doesn&amp;rsquo;t exist yet&lt;/h2&gt;
&lt;p&gt;My management account couldn&amp;rsquo;t call &lt;code&gt;CreateGovCloudAccount&lt;/code&gt;. The API returned &lt;code&gt;ConstraintViolationException&lt;/code&gt; with a message about not being &amp;ldquo;enabled for access to GovCloud&amp;rdquo; and no guidance on what that meant. I filed a support case. AWS enabled the permission two days later, and as a side effect created a standalone GovCloud account that had no relationship to my Organizations structure — an orphan floating in the partition with disconnected root credentials. I still had to find it and deal with it.&lt;/p&gt;</description></item><item><title>I debugged a Lambda timeout for 6 hours. The fix was 4 CLI commands.</title><link>https://ferkakta.dev/lambda-timeout-forensic-arc/</link><pubDate>Wed, 11 Mar 2026 16:00:00 -0400</pubDate><guid>https://ferkakta.dev/lambda-timeout-forensic-arc/</guid><description>&lt;p&gt;The ticket said the Lambda tracer was timing out. The Slack thread said &lt;code&gt;ConnectTimeoutError&lt;/code&gt; to an internal tracing endpoint. Four Lambda functions had been moved into a VPC the day before so they could reach &lt;code&gt;tracer.internal.ferkakta.net&lt;/code&gt; — an internal ALB at &lt;code&gt;10.x.x.x&lt;/code&gt;, only reachable from inside the VPC. The migration was verified, the API returned success, the ticket should not have existed.&lt;/p&gt;
&lt;p&gt;The people who built this system had moved on to other projects. The people using it were in a different timezone. There was no architecture doc, no runbook, no one to pair with. I had CloudWatch, a kubectl context, and AWS credentials.&lt;/p&gt;</description></item><item><title>Zero-touch multi-tenant deploys: removing myself from the critical path</title><link>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</link><pubDate>Mon, 02 Mar 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/zero-touch-multi-tenant-deploys-eks-terraform/</guid><description>&lt;p&gt;I had provisioned two tenants when I realized the deploy process didn&amp;rsquo;t scale to three. Each tenant on &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt; runs three services &amp;ndash; &lt;code&gt;api-server&lt;/code&gt;, &lt;code&gt;web-client&lt;/code&gt; (the React frontend), &lt;code&gt;tenant-auth&lt;/code&gt; &amp;ndash; each with its own Docker image in ECR. Deploying a release meant running &lt;code&gt;gh workflow run deploy-tenant.yml -f tenant_name=acme -f action=apply -f update_images=true&lt;/code&gt;, then doing it again for the next tenant. With 3 services resolving per run and N tenants, I was the bottleneck. Not Terraform, not GitHub Actions, not ECR. Me, remembering which tenants existed and typing their names correctly.&lt;/p&gt;</description></item><item><title>Per-Tenant CloudWatch Log Isolation on EKS, or: Why I Stopped Using aws-for-fluent-bit</title><link>https://ferkakta.dev/per-tenant-cloudwatch-log-isolation-eks/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/per-tenant-cloudwatch-log-isolation-eks/</guid><description>&lt;h2 id="the-starting-assumption"&gt;The starting assumption&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m building &lt;a href="https://ramparts.dev"&gt;ramparts&lt;/a&gt;, a multi-tenant compliance platform running on EKS. Each tenant gets a Kubernetes namespace &amp;ndash; &lt;code&gt;tenant-acme&lt;/code&gt;, &lt;code&gt;tenant-globex&lt;/code&gt;, whatever &amp;ndash; and the compliance controls require that their application logs land in isolated storage with 365-day retention. CMMC maps this to AU-2 (audit events), AU-3 (audit content), AU-11 (retention), and AC-4 (information flow enforcement). A tenant cannot read another tenant&amp;rsquo;s container output.&lt;/p&gt;
&lt;p&gt;The obvious first move was &lt;code&gt;aws-for-fluent-bit&lt;/code&gt;, AWS&amp;rsquo;s own Helm chart and container image for shipping logs to CloudWatch. AWS service, AWS chart, AWS logging destination. The blessed path.&lt;/p&gt;</description></item><item><title>Why we removed aws-for-fluent-bit from EKS</title><link>https://ferkakta.dev/why-we-removed-aws-for-fluent-bit-from-eks/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/why-we-removed-aws-for-fluent-bit-from-eks/</guid><description>&lt;p&gt;We deployed &lt;code&gt;aws-for-fluent-bit&lt;/code&gt; because AWS recommends it.&lt;/p&gt;
&lt;p&gt;If you follow the EKS logging documentation, that&amp;rsquo;s the default path. It assumes you use AWS&amp;rsquo;s distribution of Fluent Bit rather than the upstream Helm chart.&lt;/p&gt;
&lt;p&gt;We did.&lt;/p&gt;
&lt;p&gt;Two days later, we ripped it out.&lt;/p&gt;
&lt;p&gt;The AWS chart and the upstream chart are not the same thing. The differences aren&amp;rsquo;t cosmetic. They affect how quickly you receive security patches, how transparently your configuration maps to the underlying plugin, and how many boundaries sit between your logs and the CloudWatch API.&lt;/p&gt;</description></item><item><title>Stop copying AWS managed policies — deny what you don't want instead</title><link>https://ferkakta.dev/iam-deny-overlay-managed-policies/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/iam-deny-overlay-managed-policies/</guid><description>&lt;p&gt;I needed to give a developer full CloudWatch read access — metrics, alarms, dashboards, log groups — but deny access to three categories of log groups containing security-sensitive data: WorkSpaces OS event logs, VPC flow logs, and WAF request logs.&lt;/p&gt;
&lt;p&gt;The reflex is to copy &lt;code&gt;CloudWatchReadOnlyAccess&lt;/code&gt; into a custom policy and delete the parts you don&amp;rsquo;t want. I&amp;rsquo;ve seen this in every organization I&amp;rsquo;ve worked in. It produces a policy with 50+ actions that you now own. Every time AWS ships a new CloudWatch feature, your policy is stale. You won&amp;rsquo;t update it. It&amp;rsquo;ll rot.&lt;/p&gt;</description></item><item><title>The IAM policy controls access — the document controls how people feel about it</title><link>https://ferkakta.dev/access-control-docs-as-respect/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/access-control-docs-as-respect/</guid><description>&lt;p&gt;I tightened a teammate&amp;rsquo;s AWS permissions last night. Added an inline deny policy to block three categories of CloudWatch log groups — WorkSpaces OS logs, VPC flow logs, WAF request data. Five minutes of IAM work. Then I spent twenty minutes writing a document explaining every boundary, what&amp;rsquo;s accessible, what&amp;rsquo;s denied, what&amp;rsquo;s coming next, and what I haven&amp;rsquo;t designed yet.&lt;/p&gt;
&lt;p&gt;The document mattered more than the policy.&lt;/p&gt;
&lt;h2 id="the-default-is-silence"&gt;The default is silence&lt;/h2&gt;
&lt;p&gt;Most companies handle access control the same way. Someone asks for access. An admin creates a policy. The requester gets a login link. Nobody explains what they can and can&amp;rsquo;t do, or why.&lt;/p&gt;</description></item><item><title>IAM trust policies silently accept wildcards in principals — and silently deny everything</title><link>https://ferkakta.dev/iam-trust-policy-wildcards/</link><pubDate>Thu, 26 Feb 2026 10:00:00 -0600</pubDate><guid>https://ferkakta.dev/iam-trust-policy-wildcards/</guid><description>&lt;p&gt;I needed a cross-account IAM role in a management account that workloads in a separate devops account could assume to send email via SES. Two types of callers: one shared service with a stable role name, and N dynamically-created per-tenant roles following a naming convention like &lt;code&gt;myapp-apiserver-*&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The shared service was straightforward — exact ARN in the trust policy principal. For the per-tenant roles, I wrote what looked correct:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;Principal&amp;#34;&lt;/span&gt;&lt;span style="color:#960050;background-color:#1e0010"&gt;:&lt;/span&gt; { &lt;span style="color:#f92672"&gt;&amp;#34;AWS&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;arn:aws:iam::111111111111:role/myapp-apiserver-*&amp;#34;&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;terraform apply&lt;/code&gt; succeeded. The role was created. Every assume-role call was denied.&lt;/p&gt;</description></item><item><title>IAM eventual consistency is 4 seconds — if your policy still doesn't work, you have a bug</title><link>https://ferkakta.dev/iam-eventual-consistency-is-four-seconds/</link><pubDate>Thu, 26 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/iam-eventual-consistency-is-four-seconds/</guid><description>&lt;p&gt;I changed an IAM inline policy on a role — added an &lt;code&gt;sts:AssumeRole&lt;/code&gt; statement so a pod could assume a cross-account SES role. Ran &lt;code&gt;terraform apply&lt;/code&gt;. Checked the policy with &lt;code&gt;get-role-policy&lt;/code&gt;. The old policy came back. No new statement.&lt;/p&gt;
&lt;p&gt;I said &amp;ldquo;propagation delay&amp;rdquo; and moved on to other work.&lt;/p&gt;
&lt;p&gt;Twenty minutes later I checked again. Same old policy. That&amp;rsquo;s not propagation.&lt;/p&gt;
&lt;h2 id="what-eventual-consistency-actually-means"&gt;What eventual consistency actually means&lt;/h2&gt;
&lt;p&gt;AWS IAM uses a distributed computing model. Changes to policies, roles, and credentials take time to replicate across endpoints. AWS documents this explicitly and recommends not including IAM changes in critical code paths.&lt;/p&gt;</description></item><item><title>The Over-Mighty Subject: why your site repos have too much power</title><link>https://ferkakta.dev/over-mighty-subjects-terraform-credential-scope/</link><pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate><guid>https://ferkakta.dev/over-mighty-subjects-terraform-credential-scope/</guid><description>&lt;p&gt;Josh Marshall &lt;a href="https://talkingpointsmemo.com/edblog/elon-musk-and-the-the-threat-of-the-over-mighty-subject-part-i/sharetoken/21fb0dac-112d-4a9d-bcc0-f5bf844b16bb"&gt;borrows a phrase from medieval history&lt;/a&gt; to describe a modern political problem: the Over-Mighty Subject. A feudal lord whose personal wealth, private army, and territorial control grew so large that he rivaled the crown itself. Not a rebel — still nominally a subject — but operating with enough independent power that the sovereign&amp;rsquo;s authority became theoretical.&lt;/p&gt;
&lt;p&gt;I had three of them in my infrastructure. They were Terraform roots for static sites.&lt;/p&gt;</description></item><item><title>I replaced $489/mo in AWS Client VPN with a $3 t4g.nano running Headscale</title><link>https://ferkakta.dev/headscale-aws-open-source-terraform-module/</link><pubDate>Sat, 21 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/headscale-aws-open-source-terraform-module/</guid><description>&lt;p&gt;A finops sprint surfaced $489/mo in AWS Client VPN charges. Three endpoints across two accounts, plus connection-hour fees. For a VPN that four people used. I had provisioned two of them.&lt;/p&gt;
&lt;p&gt;At the time, they felt indispensable — secure customer access, familiar tooling, predictable behavior.
In reality, they were architectural inertia.&lt;/p&gt;
&lt;p&gt;I replaced all three with a single t4g.nano running &lt;a href="https://github.com/juanfont/headscale"&gt;Headscale&lt;/a&gt; — the open-source Tailscale coordination server. Total cost: ~$3/mo.&lt;/p&gt;
&lt;p&gt;I genericized the Terraform and open-sourced the module.&lt;/p&gt;</description></item><item><title>Cross-repo auto-deploy with GitHub Actions: the orchestrator pattern</title><link>https://ferkakta.dev/cross-repo-auto-deploy-orchestration-github-actions/</link><pubDate>Fri, 20 Feb 2026 10:00:00 -0500</pubDate><guid>https://ferkakta.dev/cross-repo-auto-deploy-orchestration-github-actions/</guid><description>&lt;p&gt;Two repos merged within seconds of each other. The first orchestrator run failed — &lt;code&gt;web-client&lt;/code&gt;&amp;rsquo;s ECR image didn&amp;rsquo;t exist yet because the build was still running. The GitHub Actions log showed a red X, an error annotation, and a Slack notification I didn&amp;rsquo;t need to read.&lt;/p&gt;
&lt;p&gt;Four minutes later, the second run deployed both changes. No retry logic. No manual intervention. Nobody touched anything.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d spent my day building a cross-repo deploy pipeline for a multi-tenant platform — three app repos pushing Docker images to ECR, one infra repo deploying the new tenant service images to EKS. The race condition was the first real test. It failed exactly the way I wanted it to.&lt;/p&gt;</description></item><item><title>Your CI/CD dispatch token can rewrite your infrastructure code</title><link>https://ferkakta.dev/github-actions-repository-dispatch-contents-write-permission/</link><pubDate>Fri, 20 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/github-actions-repository-dispatch-contents-write-permission/</guid><description>&lt;p&gt;I built a cross-repo auto-deploy pipeline this week. Three app repos push Docker images to ECR, then dispatch a deploy event to the infra repo&amp;rsquo;s orchestrator workflow via &lt;code&gt;repository_dispatch&lt;/code&gt;. Standard pattern.&lt;/p&gt;
&lt;p&gt;The gotcha: fine-grained PATs need &lt;code&gt;contents:write&lt;/code&gt; to call the &lt;code&gt;repository_dispatch&lt;/code&gt; API. Not &lt;code&gt;actions:write&lt;/code&gt; — &lt;code&gt;contents:write&lt;/code&gt;. The permission that also lets you push code, create branches, and delete files.&lt;/p&gt;
&lt;p&gt;My service token that should only be able to say &amp;ldquo;hey, deploy this&amp;rdquo; can also rewrite the deployment workflow it&amp;rsquo;s triggering. That&amp;rsquo;s not least privilege. That&amp;rsquo;s a door that&amp;rsquo;s three sizes too wide.&lt;/p&gt;</description></item><item><title>Your terraform apply is silently rolling back your container images</title><link>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</link><pubDate>Tue, 17 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/state-aware-ecr-image-resolution-github-actions/</guid><description>&lt;p&gt;Every &amp;ldquo;deploy to EKS with GitHub Actions&amp;rdquo; tutorial solves the same problem: build an image, push to ECR, deploy it. The tutorial ends at &amp;ldquo;your pod is running.&amp;rdquo; Nobody talks about day two.&lt;/p&gt;
&lt;h2 id="the-silent-rollback"&gt;The silent rollback&lt;/h2&gt;
&lt;p&gt;Day two: you have a running EKS cluster with three services per tenant. You need to change an IAM policy. You open a PR, touch one line of Terraform, run &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Your IAM policy updates. Your container images also update — to whatever was hardcoded in &lt;code&gt;variables.tf&lt;/code&gt; as the default. That default was correct three months ago. Your services just rolled back to a three-month-old image and nobody noticed because the deployment succeeded.&lt;/p&gt;</description></item><item><title>Terraform module for multi-provider DNS: define once, deploy to Route53 + Cloudflare</title><link>https://ferkakta.dev/terraform-module-for-multi-provider-dns-define-once-deploy-to-route53--cloudflare/</link><pubDate>Mon, 16 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/terraform-module-for-multi-provider-dns-define-once-deploy-to-route53--cloudflare/</guid><description>&lt;p&gt;I manage 10 domains across Route53 and Cloudflare. When I set up &lt;a href="https://fizz.today/til-cloudflare-registrar-locks-your-nameservers-and-how-to-escape-with-multi-provider-dns/"&gt;multi-provider DNS&lt;/a&gt; on my first domain, every record had to be defined twice — once for each provider. The APIs are different enough that you can&amp;rsquo;t just copy-paste.&lt;/p&gt;
&lt;p&gt;The duplication got old fast. So I wrote a module.&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The problem&lt;/h2&gt;
&lt;p&gt;Route53 and Cloudflare represent the same DNS data differently:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MX records&lt;/strong&gt;: Route53 bundles priority into the value string (&lt;code&gt;&amp;quot;10 mx1.example.com&amp;quot;&lt;/code&gt;). Cloudflare splits it into a separate &lt;code&gt;priority&lt;/code&gt; field.&lt;/p&gt;</description></item><item><title>ElastiCache auth-token to RBAC migration has a Terraform provider bug</title><link>https://ferkakta.dev/elasticache-auth-token-to-rbac-migration/</link><pubDate>Fri, 13 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/elasticache-auth-token-to-rbac-migration/</guid><description>&lt;p&gt;Needed to migrate a shared ElastiCache Redis cluster from a single auth token to per-user RBAC. Breaking change — every service on the cluster goes dark if you get the sequencing wrong.&lt;/p&gt;
&lt;h2 id="the-terraform-provider-bug"&gt;The Terraform provider bug&lt;/h2&gt;
&lt;p&gt;Step one: don&amp;rsquo;t touch the real cluster. Built a throwaway copy and ran the migration there first.&lt;/p&gt;
&lt;p&gt;Good thing — the Terraform AWS provider has a bug in the auth-token removal step. It tells you the auth token was removed. Updates its state file. The plan shows no changes. But the underlying API call silently fails. The token is still active on the cluster.&lt;/p&gt;</description></item><item><title>Amazon WorkSpaces are invisible to SSM and CloudWatch (and how to fix it)</title><link>https://ferkakta.dev/workspaces-ssm-cloudwatch-bootstrap/</link><pubDate>Thu, 12 Feb 2026 10:00:00 -0600</pubDate><guid>https://ferkakta.dev/workspaces-ssm-cloudwatch-bootstrap/</guid><description>&lt;p&gt;I spent an afternoon arguing with Windows about whether I was allowed to be root on a machine I created. Six hours and six layers of undocumented workarounds later, I got CMMC-compliant audit logging on a desktop that doesn&amp;rsquo;t know it exists.&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The problem&lt;/h2&gt;
&lt;p&gt;WorkSpaces don&amp;rsquo;t show up in AWS Systems Manager. They&amp;rsquo;re not EC2 instances — no instance profile, no metadata endpoint, no identity. SSM Agent is pre-installed but thinks it&amp;rsquo;s nobody. CloudWatch Agent has no credentials and doesn&amp;rsquo;t know what region it&amp;rsquo;s in.&lt;/p&gt;</description></item><item><title>SimpleAD is Samba 4 — you can create users with ldapadd instead of ClickOps</title><link>https://ferkakta.dev/simplead-ldap-user-creation-terraform/</link><pubDate>Thu, 12 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/simplead-ldap-user-creation-terraform/</guid><description>&lt;p&gt;If you&amp;rsquo;ve tried to fully automate Amazon WorkSpaces provisioning with Terraform, you&amp;rsquo;ve hit the wall: SimpleAD has no AWS API for creating directory users.&lt;/p&gt;
&lt;h2 id="what-every-guide-tells-you"&gt;What every guide tells you&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Enable WorkDocs in the console, then use the WorkDocs API to create users&lt;/li&gt;
&lt;li&gt;Launch a domain-joined EC2 instance with RSAT tools and create users manually&lt;/li&gt;
&lt;li&gt;RDP into a Windows management machine and use the AD admin console&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these break the Terraform workflow. Everything is automated except the one step that creates the user your WorkSpace actually needs.&lt;/p&gt;</description></item><item><title>What building infrastructure for a startup actually looks like</title><link>https://ferkakta.dev/startup-infra-unglamorous-work/</link><pubDate>Wed, 11 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/startup-infra-unglamorous-work/</guid><description>&lt;p&gt;I spent a day doing the unglamorous infrastructure work that keeps a startup alive. Here&amp;rsquo;s everything that happened.&lt;/p&gt;
&lt;h2 id="morning-security-audit"&gt;Morning: security audit&lt;/h2&gt;
&lt;p&gt;Audited two EKS clusters for a K8s privilege escalation vulnerability. Found 9 service accounts with &lt;code&gt;cluster-admin&lt;/code&gt; that didn&amp;rsquo;t need it. Deleted two dead deployments — ArgoCD and Velero, both mine, both abandoned months ago. The rest are kubeflow components we can&amp;rsquo;t touch until 1.36 ships the fix in April.&lt;/p&gt;</description></item><item><title>90 AWS resources in 5 minutes — automating multi-tenant SaaS tenant lifecycle</title><link>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</link><pubDate>Tue, 10 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/multi-tenant-saas-tenant-lifecycle/</guid><description>&lt;p&gt;I recorded our entire tenant lifecycle — create, test, destroy — with no edits. Here&amp;rsquo;s what 5 minutes of infrastructure automation looks like when there are no tickets, no handoffs, and no &amp;ldquo;can someone set up the database.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="what-happens-on-tenant-create"&gt;What happens on &lt;code&gt;tenant create&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;One GitHub Actions workflow backed by Terraform + a Kubernetes operator:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Validates the tenant name, resolves container images from the latest release branch&lt;/li&gt;
&lt;li&gt;Provisions ACM wildcard cert + Route53 DNS records&lt;/li&gt;
&lt;li&gt;Creates the &lt;code&gt;Tenant&lt;/code&gt; CRD → operator provisions PostgreSQL databases on shared RDS, seeds credentials to SSM&lt;/li&gt;
&lt;li&gt;Terraform deploys ExternalSecrets, Deployments, Ingress — 3 services per tenant&lt;/li&gt;
&lt;li&gt;SSM parameters auto-seeded: Redis credentials, auth URLs, signing keys — ~40 config values per tenant&lt;/li&gt;
&lt;li&gt;Zero static credentials anywhere — IRSA for everything, secrets injected at runtime from SSM via External Secrets Operator&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;About 5 minutes from nothing to 90 AWS resources and running pods.&lt;/p&gt;</description></item><item><title>Your ACM certificate request is a beacon — scanners are watching Certificate Transparency logs</title><link>https://ferkakta.dev/acm-certificate-transparency-scanners/</link><pubDate>Mon, 09 Feb 2026 09:00:00 -0600</pubDate><guid>https://ferkakta.dev/acm-certificate-transparency-scanners/</guid><description>&lt;p&gt;I accidentally exposed production secrets on a public endpoint. Here&amp;rsquo;s what happened and what I learned about Certificate Transparency.&lt;/p&gt;
&lt;h2 id="the-setup"&gt;The setup&lt;/h2&gt;
&lt;p&gt;We&amp;rsquo;re building a multi-tenant SaaS platform on EKS. During development, our Terraform module defaulted to &lt;code&gt;ealen/echo-server&lt;/code&gt; for three microservices — a lightweight HTTP server that echoes back request info. Seemed harmless.&lt;/p&gt;
&lt;p&gt;What I missed: echo-server echoes EVERYTHING. Every environment variable in the container, including ones injected from AWS SSM via External Secrets Operator. Database connection strings. Redis auth tokens. OAuth client secrets. Signing keys. A single unauthenticated &lt;code&gt;GET /&lt;/code&gt; returns it all as JSON.&lt;/p&gt;</description></item></channel></rss>