# Your `terraform apply` is silently rolling back your container images
Every “deploy to EKS with GitHub Actions” tutorial solves the same problem: build an image, push to ECR, deploy it. The tutorial ends at “your pod is running.” Nobody talks about day two.
## The silent rollback
Day two: you have a running EKS cluster with three services per tenant. You need to change an IAM policy. You open a PR, touch one line of Terraform, run `terraform apply`.
Your IAM policy updates. Your container images also update — to whatever was hardcoded in `variables.tf` as the default. That default was correct three months ago. Your services just rolled back to a three-month-old image and nobody noticed because the deployment succeeded.
This happens because Terraform variables need a value. If your `image_tag` variable has a default, every apply uses that default unless you explicitly pass `-var`. If it doesn’t have a default, every apply demands image coordinates — even when you’re changing something that has nothing to do with images.
Both options are wrong.
## The priority chain
The fix is a resolution order that asks: what does the operator actually intend?
Priority 1: Explicit image. The operator passed an image override. They know exactly what they want. Use it.
Priority 2: Terraform state. The operator didn’t pass an image. They’re making an infra change — IAM policy, env var, resource quota. Read the current image from state and pass it back to Terraform so nothing changes.
Priority 3: Release branch. No explicit image, no state (new deployment). Resolve from the latest release branch and deploy that.
```
explicit_image? ──yes──> parse and use
      │
      no
      │
      ▼
state has image? ──yes──> extract and use
      │
      no
      │
      ▼
resolve from release branch
```
## The safety gate
The priority chain alone isn’t enough. You need a way to say “I actually do want to deploy new images” — otherwise you’re stuck on whatever image is in state forever.
`update_images` (boolean, default `false`). When `false`, priority 2 wins — state is truth. When `true`, skip priority 2, go to priority 3 — resolve from the latest release branch.
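The chain and the gate together fit in a few lines of shell. A minimal sketch, assuming hypothetical `resolve_from_state` and `resolve_from_release` helpers that stand in for the state read and the release-branch lookup:

```shell
# Sketch of the priority chain plus the update_images gate.
# resolve_from_state and resolve_from_release are hypothetical helpers,
# not part of any real tool.
resolve_image() {
  local explicit_image="$1" update_images="$2"

  # Priority 1: an explicit image always wins.
  if [[ -n "$explicit_image" ]]; then
    echo "explicit:$explicit_image"
    return
  fi

  # Priority 2: with the gate closed, whatever state holds is the truth.
  if [[ "$update_images" != "true" ]]; then
    local state_image
    state_image=$(resolve_from_state)
    if [[ -n "$state_image" ]]; then
      echo "state:$state_image"
      return
    fi
  fi

  # Priority 3: gate open, or no state yet (new deployment).
  echo "release_branch:$(resolve_from_release)"
}
```

The prefix on the output is the `source` value: it records which branch of the chain fired, which is what makes the resolution auditable later.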
Two operations that used to be the same dangerous terraform apply:
| Operation | update_images | Image source | Result |
|---|---|---|---|
| IAM policy change | false (default) | state | IAM updates, images untouched |
| Deploy new release | true | release branch | Latest images deployed |
| First deployment | false (default) | release branch (no state) | New service gets latest images |
The default is safe. The intentional override is explicit.
## Implementation: GitHub Actions composite action
I built this as a composite action (`fizz/resolve-container-image`). One invocation resolves one image. Call it N times for N services.
```yaml
- name: Resolve API image
  id: api-image
  uses: fizz/resolve-container-image@v1
  with:
    registry_prefix: '123456789012.dkr.ecr.us-east-1.amazonaws.com'
    image_repo: vanguard-api
    github_repo: vanguard-api
    github_org: vanguard
    update_images: ${{ inputs.update_images }}
    terraform_resource_address: module.api.kubernetes_deployment_v1.service
    terraform_working_dir: terraform/services/api

- name: Terraform Apply
  run: |
    terraform apply -auto-approve \
      -var="api_image=${{ steps.api-image.outputs.repository }}" \
      -var="api_tag=${{ steps.api-image.outputs.tag }}" \
      -var="api_digest=${{ steps.api-image.outputs.digest }}"
```
Outputs: `repository`, `tag`, `digest`, `full_image`, `source` (one of `explicit`, `state`, `release_branch`). The `source` output goes into `$GITHUB_STEP_SUMMARY` so you can see where every image came from in the workflow run.
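That provenance line is just Markdown appended to the summary file. A sketch, where `write_summary` is an illustrative helper of mine, not part of the action:

```shell
# Append one provenance row per service to the job summary.
# write_summary is an illustrative helper, not the action's API.
write_summary() {
  local service="$1" source="$2" image="$3"
  echo "| $service | $source | \`$image\` |" >> "$GITHUB_STEP_SUMMARY"
}
```

GitHub renders the file as Markdown on the run page, so a resolved image that came from `state` instead of `release_branch` is visible at a glance.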
## Reading images from state
The state read is the interesting part. `terraform show -json` dumps the full state as JSON. A recursive `jq` query finds the container image in the deployment spec at a given resource address:
```shell
IMAGE=$(terraform show -json | jq -r '
  .. | objects | select(.containers) |
  .containers[] | select(.name == "'"$SERVICE"'") |
  .image // empty
' 2>/dev/null | head -1)
```
This works with any Kubernetes provider resource that has a container spec — Deployments, StatefulSets, DaemonSets.
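The recursion is easy to sanity-check against a fabricated fragment of `terraform show -json` output (the JSON below is a minimal made-up stand-in, not real state):

```shell
# Minimal fabricated stand-in for `terraform show -json` output,
# nested just deep enough to exercise the recursive query.
STATE='{"values":{"root_module":{"resources":[{"values":{"spec":[{"template":[{"spec":[{"containers":[{"name":"api","image":"123456789012.dkr.ecr.us-east-1.amazonaws.com/vanguard-api:main-42"}]}]}]}]}}]}}}'

SERVICE=api
IMAGE=$(echo "$STATE" | jq -r '
  .. | objects | select(.containers) |
  .containers[] | select(.name == "'"$SERVICE"'") |
  .image // empty
' | head -1)
echo "$IMAGE"
```

Because `..` walks every value in the document, the query does not care how deeply the provider nests `spec` blocks; it only needs an object with a `containers` key somewhere.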
## Always pin the digest
Tags are mutable. Someone can push a new image to `main-42` and your “immutable” deployment changes without any code diff. Always pin the digest:
```
123456789012.dkr.ecr.us-east-1.amazonaws.com/vanguard-api:main-42@sha256:abc123...
```
The composite action resolves the digest from ECR via a Python script with `aws-error-utils` for specific exception handling. If the tag exists in ECR, it has a digest. If it doesn’t exist, that’s not a “missing digest” — that’s a missing image, and the workflow fails with a clear error message.
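That failure-mode distinction is easy to encode. A sketch using the AWS CLI rather than the post's Python script; `pin_image` is an illustrative helper of mine, and the `describe-images` call is shown as a comment since it needs real credentials:

```shell
# Compose a digest-pinned reference, treating a missing digest as a
# missing image. pin_image is an illustrative helper, not the action's API.
pin_image() {
  local registry="$1" repo="$2" tag="$3" digest="$4"
  if [[ -z "$digest" || "$digest" == "None" ]]; then
    echo "error: image ${repo}:${tag} not found in ECR" >&2
    return 1
  fi
  echo "${registry}/${repo}:${tag}@${digest}"
}

# Digest lookup (requires AWS credentials; shown for context only):
#   aws ecr describe-images --repository-name "$repo" \
#     --image-ids imageTag="$tag" \
#     --query 'imageDetails[0].imageDigest' --output text
```

Failing loudly here is the point: a silently empty digest would reintroduce exactly the class of quiet drift the whole pattern exists to prevent.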
## Release branch resolution
When there’s no state and no explicit image (a new deployment), the action discovers the latest `releases/*` branch via the GitHub API, takes the short SHA of its HEAD, and looks for that tag in ECR:
```shell
# Sort the version component numerically so releases/1.10 beats releases/1.9
# (a plain sort_by(.name) is lexicographic). Assumes releases/X.Y naming.
BRANCH=$(gh api "repos/$ORG/$REPO/branches" --jq '
  [.[] | select(.name | startswith("releases/"))] |
  sort_by(.name | ltrimstr("releases/") | split(".") | map(tonumber)) |
  last | .name
')
SHA=$(gh api "repos/$ORG/$REPO/git/ref/heads/$BRANCH" --jq '.object.sha[:7]')
TAG="${BRANCH##*/}-${SHA}"
```
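Branch ordering deserves a check of its own: a plain lexicographic `sort_by(.name)` would put `releases/1.10` before `releases/1.9`, so the sort key needs to be numeric. The numeric sort can be verified in isolation against fabricated branch names:

```shell
# Fabricated branch names; numeric sort keys make 1.10 > 1.9.
LATEST=$(jq -rn '
  ["releases/1.2","releases/1.10","releases/1.9"] |
  sort_by(ltrimstr("releases/") | split(".") | map(tonumber)) |
  last
')
echo "$LATEST"
```

This assumes a strict `releases/X.Y` naming scheme; a non-numeric component would make `tonumber` fail, which is a reasonable guardrail on its own.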
## The CLI wrapper
Nobody wants to type `-f update_images=true -f release_branch=releases/1.2` into a `gh workflow run` command. A shell wrapper makes the intent clear:
```shell
tenant update acme                         # latest release images
tenant update acme --branch releases/1.2   # specific release branch
tenant plan acme                           # infra diff only, images from state
```

`tenant update` dispatches with `update_images=true`. `tenant plan` dispatches with `update_images=false`. Same workflow, different intent, zero ambiguity.
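A minimal version of that wrapper might look like the following. This is a sketch: the workflow file name `deploy.yml`, the input names, and the `DRY_RUN` switch are all assumptions, not the real tool:

```shell
# Sketch of the tenant wrapper. deploy.yml and the input names are
# assumptions; DRY_RUN=1 prints the dispatch instead of running it.
tenant() {
  local cmd="$1" name="$2"; shift 2
  local update_images branch=""
  case "$cmd" in
    update) update_images=true ;;
    plan)   update_images=false ;;
    *) echo "usage: tenant {update|plan} <name> [--branch releases/X.Y]" >&2
       return 2 ;;
  esac
  if [[ "${1:-}" == "--branch" ]]; then
    branch="$2"
  fi

  local args=(workflow run deploy.yml -f "tenant=$name" \
              -f "update_images=$update_images")
  if [[ -n "$branch" ]]; then
    args+=(-f "release_branch=$branch")
  fi

  if [[ "${DRY_RUN:-}" == "1" ]]; then
    echo "gh ${args[*]}"
  else
    gh "${args[@]}"
  fi
}
```

The wrapper owns the mapping from verb to `update_images`, so the dangerous flag never has to be typed by hand.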
## Why nobody’s written about this
I searched. There are hundreds of “push to ECR” tutorials and actions. Dozens of composite action examples. Zero results for state-aware image resolution or separating infra ops from image deployments in the same Terraform workspace.
The gap exists because most tutorials stop at day one. They assume you’re always deploying. In practice, most terraform apply runs are infra changes — IAM policies, env vars, resource quotas, DNS records. Image deployments are a small fraction of applies, and they should be intentional.
The pattern isn’t complicated. It’s a priority chain with a safety gate. But nobody’s named it, and the default behavior of every Terraform EKS tutorial is wrong.