ferkakta.dev

Zero-touch multi-tenant deploys: removing myself from the critical path

I had provisioned two tenants when I realized the deploy process didn’t scale to three. Each tenant on ramparts runs three services – api-server, web-client (the React frontend), tenant-auth – each with its own Docker image in ECR. Deploying a release meant running gh workflow run deploy-tenant.yml -f tenant_name=acme -f action=apply -f update_images=true, then doing it again for the next tenant. With three image digests to resolve per run and N tenant names to type by hand, I was the bottleneck. Not Terraform, not GitHub Actions, not ECR. Me, remembering which tenants existed and typing their names correctly.

I wrote about the cross-repo orchestrator pattern – how repository_dispatch ties app repos to a single infra repo workflow. This post is about the other half: what it takes to fan that single trigger out to every tenant without hardcoding who they are.

Tenant discovery is the whole trick

The obvious approach is a list. A YAML file, a GitHub Actions matrix, a Terraform variable – something that enumerates tenants. Every approach means updating that list when you create or destroy a tenant. Every approach means the deploy workflow drifts when someone forgets.

The platform already had a tenant registry in SSM Parameter Store. Every deploy-tenant.yml apply writes a parameter at /ramparts/tenants/{name} with the tenant’s cert ARN and domain. Destroy deletes it. The registry existed for operational queries – “which tenants are active?” – but it turns out it’s also the deployment manifest.
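The write side isn’t shown in this post’s snippets; here’s a hedged sketch of what the registration step might look like (the payload fields come from the description above, but the exact JSON shape, region, and tenant name are my assumptions). The discovery query below is then just the read side of the same path.

```shell
# Hypothetical registry write performed on apply. Fields (cert_arn, domain)
# are from the post; the JSON layout and values are illustrative.
TENANT="acme"
PAYLOAD=$(jq -n \
  --arg cert_arn "arn:aws:acm:eu-west-1:111111111111:certificate/abc123" \
  --arg domain "acme.example.com" \
  '{cert_arn: $cert_arn, domain: $domain}')

# The real write needs AWS credentials, so it is shown commented out:
# aws ssm put-parameter \
#   --name "/ramparts/tenants/${TENANT}" \
#   --type String \
#   --value "$PAYLOAD" \
#   --overwrite

echo "$PAYLOAD"
```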

PARAMS=$(aws ssm get-parameters-by-path \
  --path /ramparts/tenants \
  --query 'Parameters[*].Name' \
  --output json)

TENANT_LIST=$(echo "$PARAMS" | jq -r '.[]' | xargs -I{} basename {} | sort)

That’s five lines of shell. The orchestrator runs this once at the start of every deploy. No matrix file. No hardcoded tenant list. Add a tenant via the provisioning workflow, and the next deploy includes it. Destroy a tenant, and it drops out. The source of truth is the same SSM path that Terraform already manages.
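You can exercise the pipeline without touching AWS by feeding it a canned response (tenant names here are illustrative):

```shell
# Simulated get-parameters-by-path output for a registry with two tenants.
PARAMS='["/ramparts/tenants/acme","/ramparts/tenants/globex"]'

# Same pipeline as the orchestrator: strip the path prefix, sort the names.
TENANT_LIST=$(echo "$PARAMS" | jq -r '.[]' | xargs -I{} basename {} | sort)
echo "$TENANT_LIST"
# prints:
#   acme
#   globex
```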

Sequential applies, not parallel

The deploy loop is a while read over the tenant list:

while IFS= read -r TENANT; do
  [[ -z "$TENANT" ]] && continue

  terraform init -reconfigure \
    -backend-config="key=devops/shared-tenants/${TENANT}/terraform.tfstate" \
    -input=false

  terraform apply \
    -var="tenant_name=${TENANT}" \
    $IMAGE_FLAGS \
    -input=false \
    -auto-approve
done <<< "$TENANT_LIST"

Each tenant has its own Terraform state file at devops/shared-tenants/{tenant}/terraform.tfstate. The -reconfigure flag switches state backends between iterations without prompting. $IMAGE_FLAGS is a set of -var image_digest_*=sha256:... flags, resolved once at the top of the workflow by querying ECR for the latest digest on each service’s release branch. The same flags apply to every tenant – only the state file changes between iterations.
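The digest resolution itself isn’t shown above. A sketch of how it might work – the helper name, repository names, and tag convention are all assumptions, not the post’s actual code:

```shell
# Hypothetical helper: ask ECR which digest each service's release tag
# currently points at, and build the -var flags the loop passes to apply.
resolve_image_flags() {
  local release_tag=$1; shift
  local flags="" svc digest
  for svc in "$@"; do
    digest=$(aws ecr describe-images \
      --repository-name "$svc" \
      --image-ids imageTag="$release_tag" \
      --query 'imageDetails[0].imageDigest' \
      --output text)
    # api-server -> image_digest_api_server, etc.
    flags+=" -var=image_digest_${svc//-/_}=${digest}"
  done
  echo "$flags"
}

# Usage (requires AWS credentials):
# IMAGE_FLAGS=$(resolve_image_flags "releases-0.0.2" api-server web-client tenant-auth)
```

Resolving once, outside the loop, is what guarantees every tenant gets the same digests even if a new image lands mid-run.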

Sequential is deliberate. A matrix strategy would run tenants in parallel, which sounds faster until you consider that each apply touches shared infrastructure – the ALB, the Route53 hosted zone, the EKS cluster’s API server. Parallel Terraform applies against shared state invite lock contention and rate limiting. At current scale, the full loop across all active tenants takes about 45 seconds. Parallelism would save seconds and cost complexity.

No-ops are the common case

Most deploy runs change one service out of three. A merge to releases/0.0.2 in api-server rebuilds the api-server image. The web-client and tenant-auth images haven’t changed – the orchestrator resolves them to the same digest that’s already in each tenant’s Terraform state.

Terraform compares digests, not tags. If the digest matches, the kubernetes_deployment_v1 resource is unchanged. No pod restart. No rolling update. For a three-service tenant where one service changed, the apply output reads something like:

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

The one changed resource is the deployment with the new image digest. Everything else is a no-op. When I verified this end-to-end, three consecutive runs on the same images produced 0 added, 0 changed, 0 destroyed every time. Idempotent deploys are not aspirational here – they’re the mechanism that makes the whole pattern work. Without them, every queued run would churn resources unnecessarily.
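If you’d rather assert the no-op property than eyeball apply output, terraform plan -detailed-exitcode makes it scriptable – this check is my addition, not part of the orchestrator:

```shell
# Exit codes: 0 = no changes, 2 = changes pending, 1 = error.
terraform plan -detailed-exitcode -input=false \
  -var="tenant_name=${TENANT}" $IMAGE_FLAGS
case $? in
  0) echo "idempotent: no changes for ${TENANT}" ;;
  2) echo "::error::unexpected drift for ${TENANT}"; exit 1 ;;
  *) exit 1 ;;
esac
```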

Stop on first failure

The loop breaks on the first failed apply:

if terraform apply ... ; then
  SUCCEEDED=$((SUCCEEDED + 1))
else
  FAILED_TENANT="$TENANT"
  echo "::error::Terraform apply failed for tenant: $TENANT"
  break
fi

No partial rollback. No “skip this tenant and continue.” If tenant 3 of 5 fails, tenants 4 and 5 don’t run. The workflow exits with a failure status.

This sounds aggressive, but it works because of the concurrency group. The workflow uses GitHub Actions’ concurrency key to ensure only one deploy runs at a time – queued runs wait until the current one finishes. The next queued run – triggered by another app repo merge, or a manual re-run – starts the loop from scratch. Tenants 1 and 2 are no-ops (already deployed). Tenant 3 gets retried. Tenants 4 and 5 get deployed. The system converges.
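The concurrency setting is a few lines at the top of the workflow; a sketch, with the group name assumed:

```yaml
# One deploy at a time. Later triggers queue behind the in-flight run
# instead of running concurrently or cancelling it mid-apply.
concurrency:
  group: deploy-all-tenants
  cancel-in-progress: false
```

Note that cancel-in-progress: false is the important half – cancelling a run mid-apply is exactly the partial state the design is trying to avoid.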

The alternative – catch the error, log it, continue – means a single run can leave the fleet in a state where some tenants have the new image and some don’t. I’d rather have a clean failure that the next run fixes than a “partial success” that requires investigation.

State preservation across the loop

Each tenant can have per-tenant settings that the orchestrator shouldn’t overwrite. The vectors_force_destroy variable controls whether a tenant’s S3 bucket can be destroyed with data in it – false for production, true for test tenants. The orchestrator reads it from existing state before each apply:

VFD=$(terraform show -json 2>/dev/null \
  | jq -r '.values.root_module | recurse(.child_modules[]?) | .resources[]?
    | select(.address == "module.tenant.aws_s3_bucket.vectors")
    | .values.force_destroy' 2>/dev/null) || true
VFD="${VFD:-false}"

This is the kind of detail that’s invisible until it bites you. An orchestrator that blindly applies a single default would overwrite per-tenant overrides – resetting a staging tenant’s force_destroy=true back to false, or a production tenant’s carefully set retention policy to a generic default.
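The recovered value then has to flow back into that tenant’s apply; something like this sketch, reusing the loop’s existing flags:

```shell
terraform apply \
  -var="tenant_name=${TENANT}" \
  -var="vectors_force_destroy=${VFD}" \
  $IMAGE_FLAGS \
  -input=false \
  -auto-approve
```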

The security detail that almost wasn’t

The orchestrator is triggered by repository_dispatch. The client_payload in a dispatch event is external input – the calling repo controls its contents. If you interpolate ${{ github.event.client_payload.source_repo }} directly in a run: block, you’ve handed the caller shell injection. The orchestrator routes every payload value through an env: block:

- name: Parse trigger context
  env:
    CP_SOURCE_REPO: ${{ github.event.client_payload.source_repo }}
    CP_SHORT_SHA: ${{ github.event.client_payload.short_sha }}
    CP_BRANCH: ${{ github.event.client_payload.branch }}
  run: |
    echo "source_repo=$CP_SOURCE_REPO" >> "$GITHUB_OUTPUT"

The ${{ }} expression still evaluates at YAML parse time, but it writes into an environment variable, not into the shell script source. Bash sees $CP_SOURCE_REPO as a variable reference – data, not code. I wrote about this in detail in the expression injection post.
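A two-line demo shows why the env route is safe: the payload lands in a variable, bash expands it as data, and command substitution inside it never runs. The hostile value here is contrived:

```shell
# A hostile client_payload value, as it would arrive via the env: block.
CP_SOURCE_REPO='$(touch /tmp/pwned); echo injected'

# Expanded inside double quotes: printed literally, never executed.
echo "source_repo=$CP_SOURCE_REPO"
# prints: source_repo=$(touch /tmp/pwned); echo injected
```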

What the infra repo actually knows

The design principle that makes this work: each layer knows nothing about the layers it doesn’t own. App repos know how to build images and push them to ECR. They don’t know tenants exist. Tenants know their own name and service configuration. They don’t know which repo triggered a deploy. The infra repo’s orchestrator is the single coordination point – it discovers tenants dynamically, resolves images dynamically, and the only static agreement across the system is the releases/* branch naming convention.

A merge to releases/0.0.2 in any of the three app repos triggers a full deployment to every active tenant. Zero human intervention. The first time I watched it run end-to-end – dispatch received, 3 images resolved, tenants discovered from SSM, terraform apply producing a clean 1-changed result – I realized I’d automated myself out of the deployment loop entirely. That was the point.

#github-actions #ci-cd #terraform #aws #multi-tenant #eks