ferkakta.dev

Cross-repo auto-deploy with GitHub Actions: the orchestrator pattern

Two repos merged within seconds of each other. The first orchestrator run failed — web-client’s ECR image didn’t exist yet because the build was still running. The GitHub Actions log showed a red X, an error annotation, and a Slack notification I didn’t need to read.

Four minutes later, the second run deployed both changes. No retry logic. No manual intervention. Nobody touched anything.

I’d spent my day building a cross-repo deploy pipeline for a multi-tenant platform — three app repos pushing Docker images to ECR, one infra repo deploying the new tenant service images to EKS. The race condition was the first real test. It failed exactly the way I wanted it to.

The architecture

App repo (api-server)         App repo (web-client)         App repo (tenant-auth)
       |                              |                              |
  merge to releases/*            merge to releases/*            merge to releases/*
       |                              |                              |
  build -> trivy -> ECR push     build -> trivy -> ECR push     build -> trivy -> ECR push
       |                              |                              |
       +------------- repository_dispatch("release-image-built") ---+
                                      |
                              Infra repo orchestrator
                                      |
                           discover tenants from SSM
                                      |
                        resolve ALL 3 images (latest releases/*)
                                      |
                    for each tenant: terraform init -> apply
                                      |
                         stop on first failure, queue next run

Any app repo merge triggers the orchestrator. The orchestrator resolves all three service images, not just the one that triggered it. This is the key design choice — more on why below.

The dispatch

Each app repo’s build workflow gets a trigger-deploy job at the end:

trigger-deploy:
  needs: build
  if: contains(github.ref, 'releases/')
  runs-on: ubuntu-latest
  steps:
    - name: Dispatch deploy to all tenants
      env:
        GH_TOKEN: ${{ secrets.INFRA_DISPATCH_TOKEN }}
      run: |
        gh api repos/ramparts-io/ramparts-infra/dispatches \
          -f event_type=release-image-built \
          -f "client_payload[source_repo]=${{ github.repository }}" \
          -f "client_payload[sha]=${{ github.sha }}" \
          -f "client_payload[short_sha]=$(echo ${{ github.sha }} | cut -c1-7)" \
          -f "client_payload[branch]=${{ github.ref_name }}"

The GH_TOKEN goes in an env: block because it’s a secret — you don’t want it in the run: string where a crafty PR author could exfiltrate it via a modified workflow. The github.repository and github.sha expressions are safe to inline because they come from GitHub’s own context, not from user input.

The client_payload fields are informational. The orchestrator logs which repo triggered it, but it doesn’t use the payload to decide what to deploy. It always resolves everything from scratch.
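On the infra side, the orchestrator only needs to subscribe to that event type. A minimal receiving trigger might look like this (a sketch: only the event type and the payload field names come from the dispatch above; the job and step names are illustrative):

```yaml
# Hypothetical trigger block in the ramparts-infra orchestrator workflow.
on:
  repository_dispatch:
    types: [release-image-built]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Log trigger source
        run: |
          # Payload is informational only; nothing here decides what to deploy
          echo "Triggered by ${{ github.event.client_payload.source_repo }}" \
               "at ${{ github.event.client_payload.short_sha }}"
```

Note that `repository_dispatch` only fires for workflows on the default branch, which suits an infra repo where the orchestrator lives on main.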

The concurrency group

concurrency:
  group: deploy-release-images
  cancel-in-progress: false

One setting doing heavy lifting: cancel-in-progress: false. If two app repos merge at the same time, the second dispatch queues behind the first. It does not cancel the running deployment.

This matters because the orchestrator resolves all images, not just the triggering one. If api-server and web-client merge within seconds of each other, the first run deploys the new api-server image with the old web-client image. The second run — queued, not cancelled — picks up the new web-client image too. Both runs are correct for the state of the world at the time they ran.

Tenant discovery

SSM Parameter Store doubles as a tenant registry. Every active tenant has a parameter at /ramparts/tenants/{name}, created during initial provisioning and deleted on destroy. The orchestrator discovers the full list at runtime:

PARAMS=$(aws ssm get-parameters-by-path \
  --path /ramparts/tenants \
  --query 'Parameters[*].Name' \
  --output json)
TENANT_LIST=$(echo "$PARAMS" | jq -r '.[]' | xargs -I{} basename {} | sort)
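The extraction step can be exercised in isolation against a canned response (the tenant names below are made up for illustration):

```shell
# Canned stand-in for the SSM get-parameters-by-path response
PARAMS='["/ramparts/tenants/zeta","/ramparts/tenants/acme","/ramparts/tenants/beta"]'

# Same pipeline as above: strip the path prefix, sort the tenant names
TENANT_LIST=$(echo "$PARAMS" | jq -r '.[]' | xargs -I{} basename {} | sort)
echo "$TENANT_LIST"   # acme, beta, zeta (one per line)
```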

No hardcoded tenant list. No matrix file to update. Add a tenant, and the next deploy includes it automatically.

Image resolution

For each of the three services, the orchestrator calls a composite action* that: finds the latest releases/* branch in the service’s GitHub repo, gets the HEAD SHA, constructs the ECR tag (releases/1.2-abc1234), and retrieves the image digest from ECR.

- name: Resolve apiserver image
  uses: ./.github/actions/resolve-ecr-image
  with:
    ecr_repo: ramparts-apiserver
    github_repo: api-server
    update_images: 'true'

This runs once, before the tenant loop. The resolved images are formatted into -var flags and reused for every tenant’s terraform apply. If a service hasn’t changed, the digest is the same as what’s in state, and Terraform produces a no-op for that resource. No unnecessary pod restarts.

I wrote about the image resolution priority chain separately — it’s the same composite action, just invoked with update_images: true forced on since the orchestrator’s entire purpose is to deploy new images.
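The action's interface, inferred from the invocation above, is a directory with an action.yml along these lines (a skeleton only; the resolution steps are placeholders, not the real implementation):

```yaml
# Hypothetical .github/actions/resolve-ecr-image/action.yml skeleton.
name: resolve-ecr-image
inputs:
  ecr_repo:
    description: ECR repository to resolve against
    required: true
  github_repo:
    description: Service repo whose releases/* branches are scanned
    required: true
  update_images:
    description: Whether to resolve a fresh image rather than reuse state
    default: 'false'
outputs:
  image:
    description: ECR image reference pinned to a digest
    value: ${{ steps.resolve.outputs.image }}
runs:
  using: composite
  steps:
    - id: resolve
      shell: bash
      run: |
        # 1. find the latest releases/* branch in ${{ inputs.github_repo }}
        # 2. read its HEAD SHA and build the releases/<ver>-<short_sha> tag
        # 3. look up that tag's digest in ${{ inputs.ecr_repo }}
        echo "image=${RESOLVED_IMAGE}" >> "$GITHUB_OUTPUT"  # placeholder
```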

State preservation

Each tenant gets its own Terraform state file. The orchestrator re-initializes Terraform per tenant with a different backend key:

terraform init -reconfigure \
  -backend-config="key=devops/shared-tenants/${TENANT}/terraform.tfstate"

Before applying, it reads the existing state for per-tenant settings that shouldn’t be overwritten:

VFD=$(terraform show -json 2>/dev/null \
  | jq -r '.. | objects
      | select(.address? == "module.bedrock_tenant.aws_s3_bucket.vectors")
      | .values.force_destroy' 2>/dev/null) || true

vectors_force_destroy is a tenant-level setting that controls whether S3 buckets can be destroyed with data in them. Production tenants have it set to false. Test tenants have it true. The orchestrator preserves whatever’s in state rather than imposing a default.
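Put together, the per-tenant loop inside the orchestrator can be sketched as a single workflow step (the backend key and jq query are from above; the -var names and the TENANT_LIST and *_IMAGE environment variables are assumptions, presumably exported by earlier steps):

```yaml
# Hypothetical orchestrator step; assumes TENANT_LIST and the resolved
# image references were exported by the discovery and resolution steps.
- name: Deploy each tenant
  run: |
    for TENANT in $TENANT_LIST; do
      terraform init -reconfigure \
        -backend-config="key=devops/shared-tenants/${TENANT}/terraform.tfstate"

      # Preserve the tenant's existing vectors_force_destroy, if any
      VFD=$(terraform show -json 2>/dev/null \
        | jq -r '.. | objects
            | select(.address? == "module.bedrock_tenant.aws_s3_bucket.vectors")
            | .values.force_destroy' 2>/dev/null) || true
      VFD_FLAG=""
      if [ -n "$VFD" ] && [ "$VFD" != "null" ]; then
        VFD_FLAG="-var=vectors_force_destroy=${VFD}"
      fi

      terraform apply -auto-approve $VFD_FLAG \
        -var "apiserver_image=${APISERVER_IMAGE}" \
        -var "client_image=${CLIENT_IMAGE}" \
        -var "auth_image=${AUTH_IMAGE}" \
        || exit 1   # stop on first failure; a queued run picks up from here
    done
```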

What goes wrong (and why it’s fine)

The obvious race condition: api-server merges to releases/1.2, web-client merges to releases/1.0 simultaneously. The first orchestrator run tries to resolve web-client from releases/1.0, but the ECR image for the new commit hasn’t been pushed yet because the client’s build is still running. Resolution fails, the run fails.

This is expected. The second orchestrator run is already queued. By the time it starts, the client build has finished. It resolves all three images successfully and deploys them.

The self-healing depends on three properties:

  1. Concurrency group with cancel-in-progress: false — the second run queues, it doesn’t disappear
  2. The orchestrator resolves all images — the second run doesn’t just deploy its triggering repo’s image, it picks up everything
  3. Idempotent downstream operations — Terraform with digest-pinned images produces no-ops for unchanged services; the Kubernetes operator handles re-applies without side effects

You could add retry logic or a delay. I didn’t. The concurrency group already provides the retry for free, and adding complexity to handle a race condition that self-resolves in minutes is not worth the maintenance cost.

The pattern

Strip away the AWS and Terraform specifics and the pattern is:

  1. Artifact producers announce "a new artifact exists" via an event that carries no deployment instructions
  2. A single orchestrator, serialized by a non-cancelling concurrency group, resolves the full desired state from scratch on every run
  3. Every downstream operation is idempotent, so a run that re-applies already-deployed state is a no-op

This works for any microservices architecture with separate app and infra repos. The specific tools — repository_dispatch, Terraform, ECR — are interchangeable. The design constraints are not.


* A composite action is a reusable GitHub Actions building block — a directory with an action.yml that defines inputs, outputs, and a sequence of steps. Unlike a reusable workflow, it runs inline in the calling job’s runner, sharing the same filesystem and environment.

#github-actions #ci-cd #terraform #aws #multi-tenant