ferkakta.dev

FinOps portfolio: 71 tickets over 5 years

My first finops ticket was called “Optimize the AWS infrastcuture.” The typo is still there. That was 2021 — a one-person infrastructure team at a startup that didn’t have the word finops in its vocabulary and didn’t know it needed one.

Five years later I went looking for every cost-related ticket I’d ever created. I expected maybe thirty. I found 71, spread across 8 Jira projects, touching every layer of the stack from EBS volumes to LLM inference spend. Nobody asked me to create a finops practice. I just kept looking at the bill and refusing to pay for things that didn’t earn their keep.

When I pulled the wider Jira archive, this portfolio stopped looking like an isolated sprint and started looking like the explicit cost-and-governance slice of a much longer substrate line. Before there was a FinOps spike, there were DNS cutovers, VPN setups, RDS recreations, kubecost installs, spot-instance experiments, external DNS work, and tenant plumbing. The project names changed. The instinct did not.

The arc

The tickets tell a story if you read them in order. Not a strategy — a maturation. The 71-ticket set is the part where the company had enough shared vocabulary for me to call the work FinOps. The lineage is older: infrastructure hygiene became cost visibility, then right-sizing, then deletion, then governance, and finally architecture that makes future waste harder to create.

First came audits. “Audit our AWS costs.” “Quantify the cost of our current setup.” “Create test plan for spot instances.” I was trying to see the bill clearly, because nobody else was looking.

Then visibility. I installed kubecost in 2021 to get per-namespace cost attribution on EKS. Later I enabled the AWS Cost Optimization Hub and started exporting its recommendations as FOCUS-format Parquet to S3 — which I query with DuckDB instead of clicking through the console, because ClickOps is schlepping.

Then right-sizing. gp2 to gp3. io1 to gp3 ($198/mo on one RDS instance). Magnetic EBS volumes nobody remembered creating. 40 orphaned EBS volumes that AWS recommended I snapshot and delete. 25 orphaned target groups. A t2.micro running for years with no name and no owner ($8/mo). I converted 17 volumes to gp3 in one evening and knocked $39/mo off the bill before dinner.
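The gp2-to-gp3 math is simple enough to sanity-check in a few lines. A sketch assuming us-east-1 list prices of $0.10/GB-month for gp2 and $0.08/GB-month for gp3 (rates vary by region, so check yours):

```python
# Back-of-envelope gp2 -> gp3 savings. Prices below are assumed us-east-1
# list rates, not authoritative; substitute your region's pricing.
GP2_PER_GB = 0.10
GP3_PER_GB = 0.08

def monthly_savings(size_gib: int) -> float:
    """Monthly dollars saved by converting one gp2 volume of size_gib to gp3."""
    return round(size_gib * (GP2_PER_GB - GP3_PER_GB), 2)

# A 100 GiB volume saves $2/month; the conversion itself is in-place.
print(monthly_savings(100))  # -> 2.0
```

At those rates the savings are 20% per volume, which is why batch-converting an evening's worth of volumes adds up.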

Then elimination. Three AWS Client VPN endpoints that had accumulated over four years of incremental security decisions — $489/mo replaced by a $3 t4g.nano running Headscale. A Transit Gateway that never carried traffic. A SAML VPN, a cert-based VPN, a NAT gateway, a Simple AD directory — all deleted, all ticketed, all with dollar amounts. A Jenkins instance and its ALB ($55/mo). A second ALB nobody was using ($22/mo). An EKS cluster that was part of an abandoned upgrade path ($72/mo for the control plane alone). Ancient stopped instances from 2016 that were still paying for EBS volumes. A mystery AWS Data Pipeline that had been billing $1/mo from a service AWS itself had canceled — the console was gone, but the CLI still worked.

Then replacement architecture. The VPN replacement wasn’t just a deletion — it was a design decision. I replaced a managed service with open-source infrastructure, open-sourced the Terraform module, and published the post. The Bedrock log router went from per-tenant subscription filters to a shared Lambda — consolidation as architecture, not just as cost reduction.

The current estate looks the same to me. A Well-Architected Review, support-plan downgrades, GovCloud auth-handler parity, tenant onboarding cleanup, and redundant rebuild elimination are not separate chores. They are cost, governance, tenancy, and operability showing up as one system again.

The monitoring stack

I don’t check the bill once a quarter. I have three systems watching it for me.

MiserBot sends a daily Slack report showing spend changes. It installs as a CloudFormation stack — one IAM role with read-only access plus CUR write permissions, and Concurrency Labs does the rest. When MiserBot flagged a new line item in March 2026, I traced it to RDS entering extended support at $330/mo. I stopped everything and upgraded PostgreSQL 13 to 16 that week.

AWS daily spend budget alarms fire when the projected monthly run rate (daily spend × 30.3) crosses a threshold. I add a new, lower threshold every time I drive costs down. The alarm history is the progress narrative: $6,000/mo was the original. $4,000/mo was the first achievement — I wanted to know if it ever crept back. Then $3,000. Then $2,000. Each alarm is a ratchet that locks in the gains and makes regression visible.
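The threshold math can be sketched in a few lines. The ratchet sequence matches the one above; the $70/day figure is illustrative:

```python
# Sketch of the daily-budget ratchet: compare (daily spend x 30.3) against
# each monthly threshold. Thresholds mirror the ratchet sequence described
# above; the daily figures are made up.
DAYS_PER_MONTH = 30.3

def monthly_run_rate(daily_spend: float) -> float:
    """Project a single day's spend to a monthly run rate."""
    return daily_spend * DAYS_PER_MONTH

def breached(daily_spend: float, thresholds=(6000, 4000, 3000, 2000)) -> list:
    """Return every ratchet threshold the current run rate exceeds."""
    run_rate = monthly_run_rate(daily_spend)
    return [t for t in thresholds if run_rate > t]

# $70/day projects to ~$2,121/mo: only the $2,000 alarm fires.
print(breached(70.0))  # -> [2000]
```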

AWS anomaly detection emails handle the rest. Combined with bitter experience — I’ve been burned by extended support charges more than once — they drive preemptive upgrades on RDS and EKS before AWS starts billing penalty rates.

The feedback loop: MiserBot flags a change → I open DuckDB → query the Cost Optimization Hub Parquet in S3 → investigate → act. During the January sprint, I ran the same query every morning and watched $716 in available savings drop to $490 as I worked through the recommendations. $226/mo captured in two weeks.

The savings plan that made itself unnecessary

I’d been renewing a 1-year Compute Savings Plan annually. Coverage reports showed 50% of our compute was covered in November, December, January — not good enough. I bought a 3-year plan to push coverage to ~82% and created a daily budget alarm: if coverage drops below 80%, I get an email.
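The coverage check is the same kind of arithmetic. A hedged sketch (AWS reports coverage directly in the Savings Plans console; this just mirrors the alarm logic, and the numbers are illustrative):

```python
# Sketch of the savings-plan coverage alarm: coverage is the share of
# eligible compute spend covered by the plan. Figures are illustrative.
def coverage_pct(covered_spend: float, total_spend: float) -> float:
    """Coverage as a percentage of eligible compute spend."""
    return round(100 * covered_spend / total_spend, 1)

def coverage_alarm(covered_spend: float, total_spend: float,
                   floor: float = 80.0) -> bool:
    """True when the daily coverage budget should send an email."""
    return coverage_pct(covered_spend, total_spend) < floor

print(coverage_pct(50, 100))    # -> 50.0
print(coverage_alarm(82, 100))  # -> False (above the 80% floor)
```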

The plan was to wait for the 1-year to expire, then stagger monthly purchases of 3-year Compute Savings Plans — replacing the outgoing annual capacity with cheaper three-year commitments, spreading the risk across months instead of one annual cliff.

Phase 1 of the finops spike eliminated so much obsolete compute that when the 1-year plan expired, there was no gap left to fill. I let it lapse instead of renewing. The optimization had made the discount instrument unnecessary. That’s the best possible outcome — you don’t need the coupon because you stopped buying the thing.

The archaeology

In March I went back through an older product’s infrastructure — a platform from the company’s earlier life. g2.2xlarge GPU instances from an era when that was a reasonable instance type. CloudFormation stacks for deployment configurations nobody could name. CodeDeploy applications for services that hadn’t run in years. An RDS instance that AWS kept restarting for maintenance every week because it was “stopped temporarily” — I had to race them to delete it.

I deleted 71 EBS volumes from an abandoned cluster in a single session. I inventoried 4.7 TB of RDS snapshots in dev, some dating back to 2016. I stopped five running instances that predated Kubernetes. Every one of these was an ancestor of the current infrastructure — still breathing, still billing, invisible to anyone who didn’t go looking.
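The orphan hunt boils down to filtering on volume state. A sketch over a hard-coded describe-volumes-style sample — in practice the input comes from `aws ec2 describe-volumes` or boto3, and the IDs and sizes below are made up:

```python
# Hedged sketch: find unattached ("available") EBS volumes in a
# describe-volumes-shaped list. Sample data is fabricated for illustration.
volumes = [
    {"VolumeId": "vol-0aaa", "State": "available", "Size": 500},
    {"VolumeId": "vol-0bbb", "State": "in-use",    "Size": 100},
    {"VolumeId": "vol-0ccc", "State": "available", "Size": 8},
]

# A volume in "available" state is attached to nothing and billing anyway.
orphans = [v for v in volumes if v["State"] == "available"]

print([v["VolumeId"] for v in orphans])  # -> ['vol-0aaa', 'vol-0ccc']
print(sum(v["Size"] for v in orphans), "GiB reclaimable")
```

Snapshot before deleting anything you cannot positively identify; that is what the "snapshot and delete" recommendations are for.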

The discipline

Every optimization gets a ticket. Every ticket has a dollar amount where possible. Every threshold gets an alarm. I created a private Slack channel mid-sprint as an operational diary — daily logs of what I deleted, converted, or proposed, with volume IDs, DuckDB queries, and CLI output. I wish I’d created it years earlier.

The portfolio now spans 71 tickets across 8 Jira projects, 5 years, and a spend reduction from $6,000/mo to under $2,000/mo. Most of those tickets took less than an hour. Some took five minutes. A few — the VPN replacement, the EKS cluster consolidation proposal, the savings plan strategy — took real architectural thinking.

None of them required a finops team. They required an engineer who looked at the bill every day and kept asking: does this earn its keep?

The receipts

Everything above is the narrative. Below is the evidence — the daily diary entries from the sprint and the full ticket portfolio. Scroll if you want the story. Stay if you want the proof.

The January sprint — daily Slack diary

I kept a private channel during the sprint. Here’s what a month of daily finops work looks like when one person is doing it between other responsibilities:

Date | What I did
02/10 | Deleted the AD directory ($32/mo)
02/11 | Created daily savings plan coverage budget (80% threshold). Bought 3-year savings plan. Coverage: 50% → ~82%
02/12 | Deleted mystery AWS Data Pipeline “Test” ($1/mo for years, from a canceled service). Set overall agenda: intelligent tiering, Graviton migrations, right-size everything, budget for everything
02/13 | Decommissioned two unused RDS instances
02/14 | Replaced x86 RDS instance with Graviton for 20% off
02/17 | Backed up AMI for GPU workstation, terminated stopped instance + 500GB volume ($50/mo)
02/18 | Removed EKS observability addon ($8/day anomaly alert) — will re-add scoped to specific namespaces
02/22 | Deleted 71 EBS volumes from abandoned EKS cluster in prod
02/26 | Cost Optimization Hub: $716 available savings → $490 ($226 captured). Proposed EBS lightning round: 18 volumes to gp3, 40 volumes to snapshot+delete ($138/mo). Proposed deleting unused EKS cluster ($72/mo). Proposed deleting unused AWS accounts. Killed CloudWatch logging (cost anomaly, not working). Converted 17 volumes to gp3 ($39/mo)
02/27 | Converted and deleted second tranche of volumes
02/28 | Applied $25 AWS customer council credit. Planned Textract cost optimization strategy
03/18 | Legacy platform archaeology: stopped 5 ancient running instances (g2.2xlarge, c4.xlarge, t2s). Inventoried 11 CloudFormation stacks, 20 CodeDeploy applications. Caught RDS instance before AWS restarted it for maintenance
03/22 | Inventoried 4.7 TB of RDS snapshots in dev (some from 2016)

The full ticket portfolio — 71 tickets, 8 projects, 2021–2026

Every ticket below is real. The project prefixes are sanitized, the dollar amounts are not. Read the “Done” column as a progress bar — most of the value has already been captured. The backlog items are either blocked by other teams, deliberately on hold, or waiting for the right moment.

Epics

Ticket | Summary | Status
RM-434 | FinOps Spike: Reduce Ferkakta.net AWS from $4k to <$2k | In Progress
RM-465 | FinOps Phase 2: Cloud Governance — Give Every Dollar a Job | Backlog
RM-176 | Cost reduction for infrastructure | Done

Under Epic RM-434 (28 tickets)

Ticket | Summary | Status | $ Impact
RM-435 | Switch RDS app-db from io1 to gp3 | Done | $198/mo
RM-436 | ECR cleanup: lifecycle policies + delete orphaned repos | Backlog | $200+/mo
RM-437 | Kill unused internal ALB | Done | $22/mo
RM-438 | Kill Jenkins-new instance + ALB | Done | $55/mo
RM-439 | Release 3 idle EIPs | Done | $15/mo
RM-440 | Kill unnamed t2.micro | Done | $8/mo
RM-441 | Delete orphaned CloudWatch log groups | Done | $5/mo
RM-442 | Apply S3 lifecycle policies to all buckets | Done | $30/mo
RM-443 | Replace Textract with Claude Vision for COI extraction | Backlog | $400/mo
RM-444 | Disable GuardDuty EKS Runtime Monitoring | Done | $126/mo
RM-445 | Investigate Simple AD in us-west-2 | Done | $37/mo
RM-446 | Migrate gp2 to gp3 EBS volumes | Done | $16/mo
RM-447 | Investigate Magnetic EBS volumes | Done | $28/mo
RM-448 | TGW to VPC Peering migration | Backlog | $146/mo
RM-449 | Consolidate 2 EKS clusters to 1 | Backlog | $148/mo
RM-450 | Delete 25 orphaned target groups | Done | free
RM-451 | Deploy Falco for K8s runtime security (replaces GuardDuty) | Backlog |
RM-452 | Release 3 idle EIPs in ap-south-1 | Done | $11/mo
RM-453 | Delete cert-based VPN | Done |
RM-454 | Delete TGW (never functional - 0 traffic) | Done |
RM-455 | Deploy Headscale VPN (replace dev VPNs) | Done |
RM-456 | Delete SAML VPN (after Tailscale validated) | Done |
RM-457 | Delete orphaned EBS volumes (CloudWatch verified) | Done |
RM-458 | Delete unused NAT Gateway | Done |
RM-459 | Terminate ancient stopped instances (2016) | Done |
RM-460 | Delete orphaned Simple AD | Done |
RM-461 | RDS: Migrate remaining gp2 to gp3 | Done |
RM-462 | Terminated abandoned spot instance | Done |
RM-463 | Delete 2 orphaned EBS volumes | Done |
RM-464 | RDS Storage Right-sizing | In Progress | $30/mo

VPN replacement arc (~$489/mo → $3/mo)

Ticket | Summary | Status
RM-173 | Secure VPC resources behind VPN (original setup) | Done
RM-394 | AWS VPN endpoints — disable endpoints not needed | Done
RM-453 | Delete cert-based VPN | Done
RM-455 | Deploy Headscale VPN (replace dev VPNs) | Done
RM-456 | Delete SAML VPN (after Tailscale validated) | Done

Cost Optimization Hub cluster (INF)

Ticket | Summary | Status
RM-345 | Set Up Cost Optimization Hub in AWS | Done
RM-346 | Follow Up on Cost Optimization Hub Findings | Pending Review
RM-347 | Clean Up Unused Airsim Instances | Done
RM-348 | Optimize unused PostgreSQL RDS instance | Done
RM-349 | Review and cleanup unused MySQL RDS instances | Done
RM-371 | Convert EBS Volumes to gp3 | Done
RM-372 | Snapshot and Delete Underutilized EBS Volumes | Done

Infrastructure cleanup

Ticket | Summary | Status
RM-342 | Upgrade RDS PostgreSQL 12 before end of standard support | Done
RM-343 | Upgrade RDS MySQL before deprecation | Done
RM-369 | Delete Sandbox Account in AWS | Pending Review
RM-370 | Delete unused EKS cluster | Done
RM-465 | Upgrade PostgreSQL 13→16 ($330/mo extended support) | Done

Scale-down / auto-scaling

Ticket | Summary | Status
RM-77 | Create pipeline for scaling down deployments (60min timeout) | Done
RM-78 | Create API for scale-down pipeline | Done
RM-83 | Create resource for scaling up/down APIs | Done
RM-377 | Fix scale-down-delay and scale-to-zero annotations | Done
RM-401 | Add Windows idle disconnect policy to WorkSpaces bootstrap | Done

SaaS platform cost optimization

Ticket | Summary | Status
RM-256 | Replace per-service log filters with shared router Lambda | Done
RM-261 | EKS cluster scale-to-zero for cost optimization | Done
RM-262 | Validate full cluster teardown and rebuild | Obsolete
RM-442 | Fix EKS park/unpark Terraform min_size drift | Done
RM-452 | Add pre-destroy step to delete ALB bootstrap ingress | To Do

Right-sizing

Ticket | Summary | Status
RM-14 | Optimize the AWS infrastcuture | Done
RM-37 | Install kubecost | Done
RM-42 | Create test plan for spot instances | Done
RM-43 | Downsize EKS nodegroup from 5 to 3 | Done
RM-61 | Implement EBS cleanup strategy | Done
RM-83 | Audit our AWS costs | Done
RM-201 | Reduce EBS volume cost | Done
RM-415 | Explore cost reduction options with archera platform | Backlog
RM-39 | Update EBS cleanup Jenkins jobs to delete PVCs directly | Backlog
RM-184 | Quantify the cost of our current setup | Done
RM-313 | Change default instance type to cheapest available | Done
RM-436 | Self-hosted GHA runner on EKS Graviton for ARM64 builds | To Do

Visibility & governance

Ticket | Summary | Status
RM-189 | Enable tenant cost allocation tag | Obsolete
RM-204 | Enable cost allocation tags in management account | Done
RM-269 | Add finops tags to all Terraform root modules | To Do
RM-270 | FinOps automation: expiry enforcement, cost sweeps | To Do

Spend controls (LLM)

Ticket | Summary | Status
RM-472 | Lock down expensive LLM models | Backlog
RM-473 | Per-model daily spend limits on LiteLLM proxy | Backlog

Direct cost reduction

Ticket | Summary | Status
RM-501 | Downgrade AWS support plans | To Do

#finops #aws #platformengineering #duckdb