ferkakta.dev

The zombie Java app serving internet scrapers for five years

I found it because of the load balancers.

Six of them on the dev account, all legacy. Four Classic ELBs from 2015–2016, two ALBs from 2017. I was auditing them as part of a finops sweep — checking whether anything still needed them. The answer was no. Five had zero requests in all of 2026. The sixth, an ALB called staging-passthru-ssl, had 150.
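
For the record, "zero requests" is a CloudWatch query, not a guess. Roughly this per ALB, where the LoadBalancer dimension is the ARN suffix (the id below is a placeholder, the window matches the quarter):

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB --metric-name RequestCount \
    --dimensions Name=LoadBalancer,Value=app/staging-passthru-ssl/1234567890abcdef \
    --start-time 2026-01-01T00:00:00Z --end-time 2026-04-01T00:00:00Z \
    --period 86400 --statistics Sum

The Classic ELBs take the same query with the AWS/ELB namespace and a LoadBalancerName dimension.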

150 requests in three months is background noise. But I wanted to know what was making it. The ALB had four target groups: staging-web, oracle, oraclev2, oraclev3. Three were empty — no registered targets at all. oracle had one instance: i-0abc123def456ghi7, a t2.medium named oracle-api. It was unhealthy. The ALB was health-checking it and the health checks were failing.
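
Pulling that apart is two calls. A sketch, with the ARNs elided:

$ aws elbv2 describe-target-groups \
    --load-balancer-arn <alb-arn> \
    --query 'TargetGroups[].TargetGroupName'
$ aws elbv2 describe-target-health \
    --target-group-arn <oracle-tg-arn> \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]'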

I looked up the instance. Running. Launched August 2020. I checked CloudWatch for signs of life — CPU, network, disk.

CPU was at 27% average. Spiking to 57%. Network was pushing 500 MB a month in both directions. This wasn’t idle infrastructure. Something was running.
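
Same CloudWatch pattern as the request counts, different namespace; swap in NetworkIn and NetworkOut with --statistics Sum for the traffic figure:

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0abc123def456ghi7 \
    --start-time 2026-03-01T00:00:00Z --end-time 2026-04-01T00:00:00Z \
    --period 3600 --statistics Average Maximum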

I SSHed in. The AMI was so old it had been deregistered from the catalog. The SSH key needed PubkeyAcceptedAlgorithms=+ssh-rsa, because modern OpenSSH disables RSA/SHA-1 signatures by default and the key predated that change. The default user was ubuntu.
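
The incantation, more or less; the key filename is a stand-in, and a client this much newer than the server may need HostKeyAlgorithms loosened the same way:

$ ssh -i oracle-2020.pem \
    -o PubkeyAcceptedAlgorithms=+ssh-rsa \
    ubuntu@<public-ip>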

$ uptime
05:43:10 up 2073 days, 23:46, 0 users, load average: 1.10, 1.05, 1.02

2,073 days. Five years and eight months. Never rebooted.

ps aux told the rest of the story. One Java process, PID 1517, started in 2021, consuming 87% CPU and 34% of memory. It had accumulated 2,276,909 minutes of CPU time — 4.3 years of continuous burn. The command line was a Play Framework application with TensorFlow client libraries, running on JRE 1.8.0_25. That JRE was released in October 2014.
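
The per-process view and the arithmetic, roughly (lstart prints the full start timestamp, TIME is cumulative CPU):

$ ps -o pid,user,pcpu,pmem,time,lstart,args -p 1517
$ echo $((2276909 / 60 / 24))    # CPU-minutes to CPU-days
1581

1,581 CPU-days is where the 4.3 years comes from.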

The app was listening on port 9000. I checked the security groups: 18 ports open to 0.0.0.0/0, including SSH, Elasticsearch (9200/9300), VNC (5901), and a grab bag of custom application ports. The instance had a public IP.
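
One describe call puts numbers on that; the JMESPath keeps only the rules open to the world, and the group ID is a placeholder:

$ aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
    --query "SecurityGroups[].IpPermissions[?IpRanges[?CidrIp=='0.0.0.0/0']].[IpProtocol,FromPort,ToPort]" \
    --output text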

I ran ss to see who was talking to it.
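
Something like this, tallying established peers on port 9000 per client IP (field positions shift between ss versions, and the naive split mangles IPv6 peers):

$ ss -tn state established '( sport = :9000 )' \
    | awk 'NR>1 {split($4, a, ":"); print a[1]}' \
    | sort | uniq -c | sort -rn | head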

Hundreds of established connections on port 9000. The top client, a single IP, held 138 concurrent connections. Dozens more came from datacenter and VPS ranges across the internet. Scrapers, bots, scanners: all hitting the TensorFlow inference endpoint directly via the public IP, bypassing the load balancer entirely.

The two internal connections? Both ENIs of the staging-passthru-ssl ALB. Its own health checks, failing.
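
Tying those internal IPs back to the ALB is straightforward, because ALB ENIs carry a recognizable description. A sketch:

$ aws ec2 describe-network-interfaces \
    --filters "Name=description,Values=ELB app/staging-passthru-ssl/*" \
    --query 'NetworkInterfaces[].PrivateIpAddress'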

There was also a Ruby 2.0 process running: the AWS CodeDeploy agent, making outbound HTTPS calls to AWS, polling every few seconds for deployment commands. It had been waiting for a deployment since at least 2021. None would ever come.
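
If you want to watch it wait, a stock install answers for itself (these are the default agent service name and log path):

$ sudo service codedeploy-agent status
$ tail /var/log/aws/codedeploy-agent/codedeploy-agent.log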

There was a sibling, too: the same inference pipeline also ran on Lambda, packaged as a container image 2.3 GB compressed across 29 layers. A 225 MB Debian build toolchain (gcc, g++, imagemagick) was left in the runtime image because nobody wrote a multi-stage Dockerfile. 641 MB of Python dependencies. A 1.2 GB pre-baked model.
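
Layer-level blame is one command; the image name here is a stand-in:

$ docker history --format '{{.Size}}\t{{.CreatedBy}}' oracle-inference:latest | head

In a multi-stage build, the toolchain stays in the build stage and only the compiled artifacts get COPYed into the runtime image, so the 225 MB never ships.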

The bill: $34/month for the instance. $73 for four Classic ELBs. $33 for two ALBs. $2 for the EBS volume. $142 a month, $1,700 a year, to serve inference results to internet scrapers on a Java runtime from the Obama administration.

Nobody had logged in. Nobody had deployed to it. Nobody had checked on it. The monitoring that existed — CloudWatch metrics, ALB health checks — was screaming that the target was unhealthy, but nobody was listening. The instance just kept running, burning CPU, answering requests from strangers, and waiting for a deployment that would never arrive.

The tools optimize what runs. They don’t tell you what doesn’t. Cost Explorer shows you the line items but not the connections between them. A load balancer with zero requests is free signal. Follow it to the instance. Follow the instance to the process. Follow the process to the connections. The archaeology isn’t in the dashboard. It’s in ss and ps aux and the 2,073-day uptime counter that nobody ever looked at.

#aws #finops #archaeology #platformengineering