ferkakta.dev

Your terraform apply is silently rolling back your container images

Every “deploy to EKS with GitHub Actions” tutorial solves the same problem: build an image, push to ECR, deploy it. The tutorial ends at “your pod is running.” Nobody talks about day two.

The silent rollback

Day two: you have a running EKS cluster with three services per tenant. You need to change an IAM policy. You open a PR, touch one line of Terraform, run terraform apply.

Your IAM policy updates. Your container images also update — to whatever was hardcoded in variables.tf as the default. That default was correct three months ago. Your services just rolled back to a three-month-old image and nobody noticed because the deployment succeeded.

This happens because every Terraform input variable has to resolve to a value. If your image_tag variable has a default, every apply uses that default unless you override it, for example with -var. If it has no default, every apply demands image coordinates, even when you’re changing something that has nothing to do with images.

Both options are wrong.

The priority chain

The fix is a resolution order that asks: what does the operator actually intend?

Priority 1: Explicit image. The operator passed an image override. They know exactly what they want. Use it.

Priority 2: Terraform state. The operator didn’t pass an image. They’re making an infra change — IAM policy, env var, resource quota. Read the current image from state and pass it back to Terraform so nothing changes.

Priority 3: Release branch. No explicit image, no state (new deployment). Resolve from the latest release branch and deploy that.

explicit_image? ──yes──> parse and use
       │
       no
       │
       ▼
state has image? ──yes──> extract and use
       │
       no
       │
       ▼
resolve from release branch
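
The "parse and use" step for an explicit image can be sketched in Python as below. This is a minimal illustration, not the action's actual parser: it assumes references shaped like registry/repo:tag@sha256:..., and the names ImageRef and parse_image are hypothetical.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    """Hypothetical container for the parts of an image reference."""
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image(ref: str) -> ImageRef:
    """Split 'registry/repo:tag@sha256:...' into repository, tag and digest."""
    # Everything after '@' is the digest, if one is present.
    name, _, digest = ref.partition("@")
    # The tag separator is the first ':' after the last '/', so a registry
    # port such as localhost:5000 is not mistaken for a tag.
    last_slash = name.rfind("/")
    colon = name.find(":", last_slash + 1)
    if colon == -1:
        repository, tag = name, ""
    else:
        repository, tag = name[:colon], name[colon + 1:]
    return ImageRef(repository, tag or None, digest or None)
```

Returning None for a missing tag or digest lets the caller decide whether an unpinned reference is acceptable instead of silently passing an empty string to Terraform.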

The safety gate

The priority chain alone isn’t enough. You need a way to say “I actually do want to deploy new images” — otherwise you’re stuck on whatever image is in state forever.

update_images (boolean, default false). When false, priority 2 wins: state is the source of truth. When true, priority 2 is skipped and the image is resolved from the latest release branch (priority 3).

Two operations that used to be the same dangerous terraform apply:

| Operation | update_images | Image source | Result |
|---|---|---|---|
| IAM policy change | false (default) | state | IAM updates, images untouched |
| Deploy new release | true | release branch | Latest images deployed |
| First deployment | false (default) | release branch (no state) | New service gets latest images |

The default is safe. The intentional override is explicit.
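
The priority chain and the update_images gate together fit in a few lines. The sketch below is an illustration in Python under assumed names (resolve_image and its parameters are hypothetical, and the real action may structure this differently); the returned source strings match the action's documented source output.

```python
from typing import Callable, Optional, Tuple


def resolve_image(
    explicit_image: Optional[str],
    state_image: Optional[str],
    update_images: bool,
    resolve_from_release_branch: Callable[[], str],
) -> Tuple[str, str]:
    """Return (image, source) following the priority chain."""
    # Priority 1: the operator named an image, so use it as given.
    if explicit_image:
        return explicit_image, "explicit"
    # Priority 2: keep what is already deployed, unless the operator
    # asked for new images with update_images=true.
    if state_image and not update_images:
        return state_image, "state"
    # Priority 3: new deployment or intentional update.
    return resolve_from_release_branch(), "release_branch"
```

The release branch lookup is passed in as a callable so that it only runs when priorities 1 and 2 do not apply; an infra-only apply never touches the GitHub API.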

Implementation: GitHub Actions composite action

I built this as a composite action (fizz/resolve-container-image). One invocation resolves one image. Call it N times for N services.

- name: Resolve API image
  id: api-image
  uses: fizz/resolve-container-image@v1
  with:
    registry_prefix: '123456789012.dkr.ecr.us-east-1.amazonaws.com'
    image_repo: vanguard-api
    github_repo: vanguard-api
    github_org: vanguard
    update_images: ${{ inputs.update_images }}
    terraform_resource_address: module.api.kubernetes_deployment_v1.service
    terraform_working_dir: terraform/services/api

- name: Terraform Apply
  run: |
    terraform apply -auto-approve \
      -var="api_image=${{ steps.api-image.outputs.repository }}" \
      -var="api_tag=${{ steps.api-image.outputs.tag }}" \
      -var="api_digest=${{ steps.api-image.outputs.digest }}"

Outputs: repository, tag, digest, full_image, source (one of explicit, state, release_branch). The source output goes into $GITHUB_STEP_SUMMARY so you can see where every image came from in the workflow run.
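
Recording the source is a one-line append to the file GitHub Actions exposes through the GITHUB_STEP_SUMMARY environment variable. A minimal sketch, assuming a Markdown table row per service; the function name append_summary_row and the row layout are hypothetical, not the action's actual output format.

```python
import os
from typing import Optional


def append_summary_row(
    service: str,
    full_image: str,
    source: str,
    summary_path: Optional[str] = None,
) -> bool:
    """Append one Markdown table row to the job summary.

    Returns False when no summary file is available, for example when
    the script runs outside GitHub Actions.
    """
    path = summary_path or os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    # Open in append mode: every step in the job shares the same file.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"| {service} | `{full_image}` | {source} |\n")
    return True
```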

Reading images from state

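The state read is the interesting part. terraform show -json dumps the full state as JSON. A recursive jq query then finds the image of the named container in the deployment spec. The simplified query below matches by container name across the whole state; narrow it to a single resource address if container names repeat across services:

# The Terraform Kubernetes provider stores container blocks under the
# singular key "container". --arg passes the name in without shell quoting tricks.
IMAGE=$(terraform show -json | jq -r --arg name "$SERVICE" '
  [ .. | objects | .container? | arrays | .[] | objects |
    select(.name == $name) | .image // empty ] |
  first // empty
' 2>/dev/null)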

This works with any Kubernetes provider resource that has a container spec — deployments, stateful sets, daemon sets.
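
The same lookup can be scoped to one resource address. The Python sketch below is an illustration rather than the action's code: it assumes the documented terraform show -json layout (values.root_module with nested child_modules), and the function names are hypothetical. It accepts both the provider's singular "container" key and the raw manifest's "containers" key.

```python
from typing import Any, Iterator, Optional


def iter_resources(module: dict) -> Iterator[dict]:
    """Yield every resource in a module and in its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def iter_containers(node: Any) -> Iterator[dict]:
    """Recursively yield container objects found anywhere under node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("container", "containers") and isinstance(value, list):
                yield from (c for c in value if isinstance(c, dict))
            else:
                yield from iter_containers(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_containers(item)


def image_from_state(state: dict, address: str, container_name: str) -> Optional[str]:
    """Return the image of container_name in the resource at address, or None."""
    root = (state.get("values") or {}).get("root_module") or {}
    for resource in iter_resources(root):
        if resource.get("address") != address:
            continue
        for container in iter_containers(resource.get("values") or {}):
            if container.get("name") == container_name:
                return container.get("image") or None
    return None
```

Feed it json.loads of the terraform show -json output. A None result means "no state for this service", which is the signal to fall through to release branch resolution.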

Always pin the digest

Tags are mutable. Someone can push a different image under the tag main-42 and your “immutable” deployment changes without any code diff. Always pin the digest:

123456789012.dkr.ecr.us-east-1.amazonaws.com/vanguard-api:main-42@sha256:abc123...

The composite action resolves the digest from ECR via a Python script with aws-error-utils for specific exception handling. If the tag exists in ECR, it has a digest. If it doesn’t exist, that’s not a “missing digest” — that’s a missing image, and the workflow fails with a clear error message.
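
A minimal sketch of that lookup, assuming a boto3 ECR client is passed in. It uses the client's own exception classes rather than aws-error-utils, and the names resolve_digest and MissingImageError are hypothetical, so treat it as an illustration of the idea and not as the action's script.

```python
class MissingImageError(RuntimeError):
    """Raised when the requested tag does not exist in the repository."""


def resolve_digest(ecr_client, repository: str, tag: str) -> str:
    """Return the sha256 digest that `tag` currently points at in ECR.

    ecr_client is expected to behave like boto3.client("ecr").
    """
    try:
        response = ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except ecr_client.exceptions.ImageNotFoundException as exc:
        # A tag with no image is a missing image, not a missing digest:
        # fail loudly so the workflow stops before terraform apply.
        raise MissingImageError(
            f"image {repository}:{tag} not found in ECR"
        ) from exc
    return response["imageDetails"][0]["imageDigest"]
```

Calling it with boto3.client("ecr"), "vanguard-api" and "main-42" would return the digest to append after the tag. Taking the client as a parameter keeps the function easy to test with a stub.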

Release branch resolution

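When there’s no state and no explicit image (a new deployment), the action discovers the latest releases/* branch via the GitHub API, takes the first seven characters of that branch’s HEAD commit SHA, and looks for the resulting tag in ECR:

# Pick the highest releases/* branch. Sort numerically per component so that
# releases/1.10 ranks above releases/1.9 (a plain string sort gets this wrong).
# per_page=100 is the API maximum; paginate if the repo has more branches.
BRANCH=$(gh api "repos/$ORG/$REPO/branches?per_page=100" --jq '
  [.[] | select(.name | startswith("releases/"))] |
  sort_by(.name | ltrimstr("releases/") | split(".") | map(tonumber? // .)) |
  last | .name
')
# Short SHA of the branch head, then <branch suffix>-<short sha> as the image tag.
SHA=$(gh api "repos/$ORG/$REPO/git/ref/heads/$BRANCH" --jq '.object.sha[:7]')
TAG="${BRANCH##*/}-${SHA}"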
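
Choosing the "latest" branch deserves care: compared as plain strings, releases/1.10 sorts before releases/1.9. The Python sketch below shows one version-aware way to pick the branch and build the tag. It assumes branch names of the form releases/<dotted version>, and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple, Union

KeyPart = Tuple[int, Union[int, str]]


def version_key(branch: str, prefix: str = "releases/") -> List[KeyPart]:
    """Sort key that compares numeric components as numbers."""
    suffix = branch[len(prefix):]
    # Tag each part so integers are only ever compared with integers
    # and strings with strings; numeric parts rank below text parts.
    return [
        (0, int(part)) if part.isdigit() else (1, part)
        for part in re.split(r"[.\-]", suffix)
    ]


def latest_release_branch(
    branches: Iterable[str], prefix: str = "releases/"
) -> Optional[str]:
    """Return the highest-versioned release branch, or None if there is none."""
    candidates = [b for b in branches if b.startswith(prefix)]
    return max(candidates, key=lambda b: version_key(b, prefix), default=None)


def image_tag(branch: str, sha: str) -> str:
    """Build '<branch suffix>-<7-char sha>', mirroring ${BRANCH##*/}-${SHA}."""
    return f"{branch.rsplit('/', 1)[-1]}-{sha[:7]}"
```

Returning None when no release branch exists lets the caller fail with a clear message instead of querying ECR for a tag built from an empty name.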

The CLI wrapper

Nobody wants to type -f update_images=true -f release_branch=releases/1.2 into a gh workflow run command. A shell wrapper makes the intent clear:

tenant update acme                         # latest release images
tenant update acme --branch releases/1.2   # specific release branch
tenant plan acme                           # infra diff only, images from state

tenant update dispatches with update_images=true. tenant plan dispatches with update_images=false. Same workflow, different intent, zero ambiguity.
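
The mapping from subcommand to workflow inputs is small enough to sketch. The Python below builds the gh workflow run argument list; the workflow file name tenant-deploy.yml and the tenant input are assumptions made for illustration, while update_images and release_branch are the inputs named above.

```python
from typing import List, Optional

WORKFLOW_FILE = "tenant-deploy.yml"  # hypothetical workflow file name


def dispatch_args(command: str, tenant: str, branch: Optional[str] = None) -> List[str]:
    """Translate 'tenant update|plan <name>' into a gh workflow run argv."""
    if command not in ("update", "plan"):
        raise ValueError(f"unknown command: {command}")
    update = command == "update"
    if branch and not update:
        # plan reads images from state, so a branch makes no sense there.
        raise ValueError("--branch is only valid with 'update'")
    args = [
        "gh", "workflow", "run", WORKFLOW_FILE,
        "-f", f"tenant={tenant}",  # 'tenant' input name is an assumption
        "-f", f"update_images={'true' if update else 'false'}",
    ]
    if branch:
        args += ["-f", f"release_branch={branch}"]
    return args
```

Passing the result to subprocess.run(args, check=True) would dispatch the workflow. Building a list instead of a shell string avoids quoting problems with tenant or branch names.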

Why nobody’s written about this

I searched. There are hundreds of “push to ECR” tutorials and actions. Dozens of composite action examples. Zero results for state-aware image resolution or separating infra ops from image deployments in the same Terraform workspace.

The gap exists because most tutorials stop at day one. They assume you’re always deploying. In practice, most terraform apply runs are infra changes — IAM policies, env vars, resource quotas, DNS records. Image deployments are a small fraction of applies, and they should be intentional.

The pattern isn’t complicated. It’s a priority chain with a safety gate. But nobody’s named it, and the default behavior of every Terraform EKS tutorial is wrong.

#aws #github-actions #terraform #ecr #eks #ci-cd