I debugged a Lambda timeout for 6 hours. The fix was 4 CLI commands.
The ticket said the Lambda tracer was timing out. The Slack thread said ConnectTimeoutError to an internal tracing endpoint. Four Lambda functions had been moved into a VPC the day before so they could reach tracer.internal.ferkakta.net — an internal ALB at 10.x.x.x, only reachable from inside the VPC. The migration was verified, the API returned success, the ticket should not have existed.
The people who built this system had moved on to other projects. The people using it were in a different timezone. There was no architecture doc, no runbook, no one to pair with. I had CloudWatch, a kubectl context, and AWS credentials.
I started where anyone would: the network. I built a test Lambda in the same VPC, same subnets, same security group. DNS resolved to two private IPs. TCP connected in 9ms. TLS handshake in 38ms. HTTPS POST to /v1/traces returned in 141ms. The network path was clean.
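The probe's logic fits in a handful of lines. A sketch of that layered timing check, with the HTTP request layer omitted and a `use_tls` switch added here so the TCP path can be exercised without a TLS endpoint (host and port are whatever you're diagnosing):

```python
import socket
import ssl
import time

def probe(host: str, port: int = 443, use_tls: bool = True) -> dict:
    """Time each network layer separately: DNS, TCP connect, TLS handshake."""
    timings = {}

    # DNS resolution
    t0 = time.monotonic()
    socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    timings["dns_ms"] = round((time.monotonic() - t0) * 1000, 1)

    # TCP connect
    t0 = time.monotonic()
    sock = socket.create_connection((host, port), timeout=5)
    timings["tcp_ms"] = round((time.monotonic() - t0) * 1000, 1)

    # TLS handshake
    if use_tls:
        t0 = time.monotonic()
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            timings["tls_ms"] = round((time.monotonic() - t0) * 1000, 1)
    else:
        sock.close()
    return timings
```

Timing each layer separately matters: a DNS failure, a TCP timeout, and a TLS error all surface as the same ConnectTimeoutError in application logs.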
So I went to the logs. CloudWatch START lines include the Lambda version, and I hadn’t been paying attention to that field:
START RequestId: abc-123 Version: 1
Not $LATEST. Version 1. I checked the function configuration with --qualifier 1 and the VPC config was empty. Version 1 was published three days before the migration — a frozen snapshot of the pre-VPC configuration. update-function-configuration had updated $LATEST, and $LATEST had the correct VPC config. But the production caller was invoking :1, and :1 didn’t know about the VPC at all.
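The check is two calls to the same command, one with a qualifier and one without (function name here is a placeholder):

```shell
# Without --qualifier this describes $LATEST.
aws lambda get-function-configuration --function-name my-func \
  --query 'VpcConfig.SubnetIds'
# $LATEST showed the subnet list: the migration did land there.

# A published version is a frozen snapshot; same query, different answer.
aws lambda get-function-configuration --function-name my-func \
  --qualifier 1 --query 'VpcConfig.SubnetIds'
# Version 1 came back with no VPC attachment.
```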
This explained why Monday’s verification passed. aws lambda invoke without a qualifier hits $LATEST. So does the console. I tested the right function with the wrong version and called it done. The batch inference job the next morning hit the right version — the one I didn’t test — and 69 traces failed.
I published Version 2 from $LATEST. But publishing a new version doesn’t change what callers invoke. Something still needed to point at :2 instead of :1. The question was where the version reference lived, and nobody was around to tell me.
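Publishing is one command (name hypothetical); it snapshots the current $LATEST and prints the new version number, and nothing else changes:

```shell
# Freezes the current $LATEST configuration (code, env vars, VPC config)
# as an immutable numbered version. Callers pinned to :1 are unaffected.
aws lambda publish-version --function-name my-func \
  --description "post-VPC-migration config" --query 'Version'
```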
The Lambdas aren’t invoked by name. The inference pipeline stores API Gateway URLs in a Postgres deployment table. I didn’t have a port-forward to the database, and there was no wiki page explaining the schema. But the pods and the RDS instance share a VPC, so I spun up an ephemeral psql container and queried the table directly:
kubectl run psql-probe --image=postgres:15-alpine \
  --restart=Never --rm -i -- \
  sh -c 'PGPASSWORD=*** psql -h rds-host -U postgres -d kerneldb \
    -c "SELECT id, service_details->'"'"'url'"'"' FROM deployment"'
The database stores API Gateway URLs. Not Lambda ARNs, not version numbers. Clean indirection — the version isn’t here.
I inspected the API Gateway integration:
aws apigateway get-integration \
  --rest-api-id abc123 --resource-id xyz789 \
  --http-method ANY --query 'uri'
The URI ended in :1/invocations. The deployment pipeline had baked the version number into the API Gateway integration at creation time. The API Gateway name confirmed it — dev-lambda-my-func:1-API — and so did its tags. The version was written on the resource in three places and visible in none of the usual diagnostic paths.
There was a second invocation path I hadn’t considered. I didn’t have access to the source repo — it lived somewhere in a private Git org, under a name I couldn’t guess. But the pods were running. I kubectl exec’d into the inference container and read the code off disk. Somewhere in main.py I found url2arn():
from typing import Optional
from urllib.parse import urlparse

import boto3

def url2arn(url: str) -> Optional[str]:
    client = boto3.client("apigateway")
    rest_api_id = urlparse(url).netloc.split(".")[0]
    for resource in client.get_resources(restApiId=rest_api_id)["items"]:
        if resource["path"] == "/{proxy+}":
            integration = client.get_integration(
                restApiId=rest_api_id,
                resourceId=resource["id"],
                httpMethod="ANY",
            )
            arn = integration["uri"].split("/")[-2]
            return arn
This function reads the API Gateway integration at runtime and extracts the Lambda ARN — including the :1 qualifier. So the stale version propagated through both the HTTP path (client → API Gateway → Lambda:1) and the direct-invoke path (client → url2arn() → boto3 → Lambda:1). Two invocation paths, one root cause, and I’d been looking at the wrong layer for most of the afternoon.
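The same parsing trick can be turned around and used defensively: given an integration URI, pull out the qualifier and flag the pin. A sketch, where the URI is a made-up example in the standard integration format, not the real one from the incident:

```python
from typing import Optional

def qualifier_from_integration_uri(uri: str) -> Optional[str]:
    """Extract the Lambda ARN embedded in an API Gateway integration
    URI and return its version/alias qualifier, if one is pinned."""
    # Integration URIs look like:
    # arn:aws:apigateway:{region}:lambda:path/2015-03-31/functions/{lambda-arn}/invocations
    lambda_arn = uri.split("/")[-2]  # same extraction url2arn() does
    parts = lambda_arn.split(":")
    # An unqualified function ARN has 7 colon-separated fields;
    # an 8th field is a version number or alias.
    return parts[7] if len(parts) > 7 else None

uri = ("arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/"
       "arn:aws:lambda:us-east-1:123456789012:function:my-func:1/invocations")
print(qualifier_from_integration_uri(uri))  # -> 1
```

A loop over `get-integration` output with a check like this would have surfaced the stale pin in minutes.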
The actual fix was small. For each of the two dev Lambdas: update the API Gateway integration URI from :1 to :2, grant lambda:InvokeFunction permission on Version 2 (published versions don’t inherit resource policies — another silent gap), and redeploy the stage. Three commands per Lambda. CloudWatch immediately showed Version: 2, zero ConnectTimeoutError, entity matching responding in 0.3 seconds.
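For the record, a sketch of those three commands with placeholder IDs and names (the real API IDs, resource IDs, and function names differ):

```shell
# 1. Repoint the integration URI from :1 to :2.
aws apigateway update-integration --rest-api-id abc123 --resource-id xyz789 \
  --http-method ANY --patch-operations \
  op=replace,path=/uri,value='arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/arn:aws:lambda:us-east-1:123456789012:function:my-func:2/invocations'

# 2. Versions don't inherit the resource policy; grant API Gateway invoke on :2.
aws lambda add-permission --function-name my-func --qualifier 2 \
  --statement-id apigw-invoke-v2 --action lambda:InvokeFunction \
  --principal apigateway.amazonaws.com

# 3. Integration changes don't take effect until the stage is redeployed.
aws apigateway create-deployment --rest-api-id abc123 --stage-name dev
```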
What I keep coming back to is the verification. I tested the migration by invoking $LATEST, which is what every default invocation path does. The Lambda console, the CLI, the SDK — they all default to $LATEST. The production system was the only caller that specified a version qualifier, and it was the one path I didn’t test. The migration “worked” in every environment except the one that mattered.
The deeper problem is that version numbers get baked into API Gateway integrations at creation time and never get touched again. The deployment pipeline creates the Lambda, publishes a version, builds the API Gateway integration with that version in the URI, and moves on. Every subsequent update-function-configuration modifies $LATEST and leaves the integration pointing at the original frozen version. The version pin is invisible until you call get-integration, which isn’t part of any standard debugging workflow I’ve seen.
While I was reading code off that running container, I found a scale_lambda() function 50 lines away from the integration code. It creates aliases for provisioned concurrency. If the deployment path used an alias instead of a version number — my-func:live instead of my-func:1 — the API Gateway integration would survive configuration changes, version publishes, and VPC migrations without anyone touching it. update-alias is one command. It would have prevented the bug entirely, and it would have prevented the 6 hours I spent finding it.
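The alias version of that wiring would look something like this (names hypothetical):

```shell
# One-time setup: create an alias and build the API Gateway integration
# against it, so the URI ends in :live/invocations instead of :1/invocations.
aws lambda create-alias --function-name my-func \
  --name live --function-version 1

# Every deploy after that is a single retargeting command; the integration,
# the resource policy on the alias, and url2arn() all keep working unchanged.
aws lambda update-alias --function-name my-func \
  --name live --function-version 2
```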
The pattern was already in the codebase. It just outlived the people who would have wired it in.
#aws #lambda #api-gateway #debugging #vpc #platform-engineering