Cross-repo auto-deploy with GitHub Actions: the orchestrator pattern
Two repos merged within seconds of each other. The first orchestrator run failed — web-client’s ECR image didn’t exist yet because the build was still running. The GitHub Actions log showed a red X, an error annotation, and a Slack notification I didn’t need to read.
Four minutes later, the second run deployed both changes. No retry logic. No manual intervention. Nobody touched anything.
I’d spent my day building a cross-repo deploy pipeline for a multi-tenant platform — three app repos pushing Docker images to ECR, one infra repo deploying the new tenant service images to EKS. The race condition was the first real test. It failed exactly the way I wanted it to.
The architecture
  App repo (api-server)          App repo (web-client)          App repo (tenant-auth)
            |                              |                              |
   merge to releases/*            merge to releases/*            merge to releases/*
            |                              |                              |
build -> trivy -> ECR push     build -> trivy -> ECR push     build -> trivy -> ECR push
            |                              |                              |
            +--------- repository_dispatch("release-image-built") --------+
                                           |
                                Infra repo orchestrator
                                           |
                               discover tenants from SSM
                                           |
                        resolve ALL 3 images (latest releases/*)
                                           |
                        for each tenant: terraform init -> apply
                                           |
                         stop on first failure, queue next run
Any app repo merge triggers the orchestrator. The orchestrator resolves all three service images, not just the one that triggered it. This is the key design choice — more on why below.
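On the receiving side, the orchestrator's trigger is a repository_dispatch listener. A minimal sketch of that trigger block (the manual workflow_dispatch escape hatch is my assumption, not something the setup above confirms):

```yaml
# Sketch of the orchestrator workflow's trigger in the infra repo.
# The event type matches what the app repos send; workflow_dispatch
# is an assumed manual escape hatch.
on:
  repository_dispatch:
    types: [release-image-built]
  workflow_dispatch: {}
```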
The dispatch
Each app repo’s build workflow gets a trigger-deploy job at the end:
trigger-deploy:
  needs: build
  if: contains(github.ref, 'releases/')
  runs-on: ubuntu-latest
  steps:
    - name: Dispatch deploy to all tenants
      env:
        GH_TOKEN: ${{ secrets.INFRA_DISPATCH_TOKEN }}
      run: |
        gh api repos/ramparts-io/ramparts-infra/dispatches \
          -f event_type=release-image-built \
          -f "client_payload[source_repo]=${{ github.repository }}" \
          -f "client_payload[sha]=${{ github.sha }}" \
          -f "client_payload[short_sha]=$(echo ${{ github.sha }} | cut -c1-7)" \
          -f "client_payload[branch]=${{ github.ref_name }}"
The GH_TOKEN goes in an env: block because it’s a secret — you don’t want it in the run: string where a crafty PR author could exfiltrate it via a modified workflow. The github.repository and github.sha expressions are safe to inline because they come from GitHub’s own context, not from user input.
The client_payload fields are informational. The orchestrator logs which repo triggered it, but it doesn’t use the payload to decide what to deploy. It always resolves everything from scratch.
The concurrency group
concurrency:
  group: deploy-release-images
  cancel-in-progress: false
Two words doing heavy lifting: cancel-in-progress: false. If two app repos merge at the same time, the second dispatch queues behind the first. It does not cancel the running deployment.
This matters because the orchestrator resolves all images, not just the triggering one. If api-server and web-client merge within seconds of each other, the first run deploys the new apiserver image with the old client image. The second run — queued, not cancelled — picks up the new client image too. Both runs are correct for the state of the world at the time they ran.
Tenant discovery
SSM Parameter Store doubles as a tenant registry. Every active tenant has a parameter at /ramparts/tenants/{name}, created during initial provisioning and deleted on destroy. The orchestrator discovers the full list at runtime:
PARAMS=$(aws ssm get-parameters-by-path \
  --path /ramparts/tenants \
  --query 'Parameters[*].Name' \
  --output json)
TENANT_LIST=$(echo "$PARAMS" | jq -r '.[]' | xargs -I{} basename {} | sort)
No hardcoded tenant list. No matrix file to update. Add a tenant, and the next deploy includes it automatically.
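The basename transform is the whole registry contract: parameter path in, tenant name out. A runnable sketch with made-up tenant names standing in for the SSM response:

```shell
# Made-up parameter names standing in for the SSM response; the basename
# transform is the same one the orchestrator runs.
PARAM_NAMES="/ramparts/tenants/acme /ramparts/tenants/globex"
TENANT_LIST=$(for P in $PARAM_NAMES; do basename "$P"; done | sort)
echo "$TENANT_LIST"    # prints acme, then globex
```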
Image resolution
For each of the three services, the orchestrator calls a composite action* that: finds the latest releases/* branch in the service’s GitHub repo, gets the HEAD SHA, constructs the ECR tag (releases/1.2-abc1234), and retrieves the image digest from ECR.
- name: Resolve apiserver image
  uses: ./.github/actions/resolve-ecr-image
  with:
    ecr_repo: ramparts-apiserver
    github_repo: api-server
    update_images: 'true'
This runs once, before the tenant loop. The resolved images are formatted into -var flags and reused for every tenant’s terraform apply. If a service hasn’t changed, the digest is the same as what’s in state, and Terraform produces a no-op for that resource. No unnecessary pod restarts.
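A sketch of what "formatted into -var flags" might look like. The variable names and digests here are hypothetical; the real values come from the composite action's outputs:

```shell
# Hypothetical variable names and fake digests; the real values come from
# the resolve-ecr-image outputs. Built once, reused for every tenant's apply.
APISERVER_IMAGE="ramparts-apiserver@sha256:aaa111"
CLIENT_IMAGE="ramparts-web-client@sha256:bbb222"
TENANT_AUTH_IMAGE="ramparts-tenant-auth@sha256:ccc333"
TF_IMAGE_VARS="-var apiserver_image=${APISERVER_IMAGE} -var client_image=${CLIENT_IMAGE} -var tenant_auth_image=${TENANT_AUTH_IMAGE}"
echo "$TF_IMAGE_VARS"
```

Because the images are pinned by digest rather than mutable tag, an unchanged service resolves to the identical string on every run, which is what lets Terraform no-op it.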
I wrote about the image resolution priority chain separately — it’s the same composite action, just invoked with update_images: true forced on since the orchestrator’s entire purpose is to deploy new images.
State preservation
Each tenant gets its own Terraform state file. The orchestrator re-initializes Terraform per tenant with a different backend key:
terraform init -reconfigure \
  -backend-config="key=devops/shared-tenants/${TENANT}/terraform.tfstate"
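Putting discovery and init together, the per-tenant loop amounts to something like the following, shown here as a dry run that only prints each tenant's backend key instead of running Terraform (tenant names made up):

```shell
# Dry-run sketch of the per-tenant loop: print the backend key each
# tenant's terraform init -reconfigure would receive.
BACKEND_KEYS=$(for TENANT in acme globex; do
  echo "key=devops/shared-tenants/${TENANT}/terraform.tfstate"
done)
echo "$BACKEND_KEYS"
```

In the real loop, each iteration runs terraform init -reconfigure with that key and then terraform apply with the shared image vars.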
Before applying, it reads the existing state for per-tenant settings that shouldn’t be overwritten:
VFD=$(terraform show -json 2>/dev/null \
  | jq -r '.. | select(.address? == "module.bedrock_tenant.aws_s3_bucket.vectors")
          | .values.force_destroy' 2>/dev/null) || true
vectors_force_destroy is a tenant-level setting that controls whether S3 buckets can be destroyed with data in them. Production tenants have it set to false. Test tenants have it true. The orchestrator preserves whatever’s in state rather than imposing a default.
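Feeding the preserved value back into the apply might look like this sketch (flag and variable names are assumptions): pass the -var only when state actually had a value.

```shell
# Sketch with assumed names: only pass the preserved setting back to
# terraform when state actually had one.
VFD="false"   # stands in for the value read out of state; empty on first apply
EXTRA_VARS=""
if [ -n "$VFD" ] && [ "$VFD" != "null" ]; then
  EXTRA_VARS="-var vectors_force_destroy=${VFD}"
fi
echo "$EXTRA_VARS"
```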
What goes wrong (and why it’s fine)
The obvious race condition: api-server merges to releases/1.2, web-client merges to releases/1.0 simultaneously. The first orchestrator run tries to resolve web-client from releases/1.0, but the ECR image for the new commit hasn’t been pushed yet because the client’s build is still running. Resolution fails, the run fails.
This is expected. The second orchestrator run is already queued. By the time it starts, the client build has finished. It resolves all three images successfully and deploys them.
The self-healing depends on three properties:
- Concurrency group with cancel-in-progress: false — the second run queues, it doesn’t disappear
- The orchestrator resolves all images — the second run doesn’t just deploy its triggering repo’s image, it picks up everything
- Idempotent downstream operations — Terraform with digest-pinned images produces no-ops for unchanged services; the Kubernetes operator handles re-applies without side effects
You could add retry logic or a delay. I didn’t. The concurrency group already provides the retry for free, and adding complexity to handle a race condition that self-resolves in minutes is not worth the maintenance cost.
The pattern
Strip away the AWS and Terraform specifics and the pattern is:
- Event-driven triggers, not polling. App repos push events when they have something to deploy. The infra repo doesn’t poll ECR for new tags.
- Single orchestrator that resolves everything. Not per-service deploys. One workflow that understands the full set of services and deploys them as a unit. This eliminates partial-state problems where tenant A has the new API but the old client.
- Concurrency groups for serialization. Simultaneous triggers queue, not race. Each run is complete and correct for the state of the world when it executes.
- Idempotent downstream operations. If nothing changed, nothing happens. If a run fails, the next run fixes it. No manual intervention required.
This works for any microservices architecture with separate app and infra repos. The specific tools — repository_dispatch, Terraform, ECR — are interchangeable. The design constraints are not.
* A composite action is a reusable GitHub Actions building block — a directory with an action.yml that defines inputs, outputs, and a sequence of steps. Unlike a reusable workflow, it runs inline in the calling job’s runner, sharing the same filesystem and environment.
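For reference, the skeleton of such an action. This is an illustrative sketch of the composite-action shape, not the actual resolve-ecr-image implementation:

```yaml
# .github/actions/resolve-ecr-image/action.yml — illustrative skeleton only
name: resolve-ecr-image
inputs:
  ecr_repo:
    required: true
  github_repo:
    required: true
  update_images:
    default: 'false'
outputs:
  image:
    value: ${{ steps.resolve.outputs.image }}
runs:
  using: composite
  steps:
    - id: resolve
      shell: bash
      run: |
        # find the latest releases/* branch, look up the ECR digest, emit it
        echo "image=<repo>@<digest>" >> "$GITHUB_OUTPUT"
```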