Self-healing race conditions: when your CI/CD fails on purpose
Three app repos build Docker images and push them to ECR. On merge, each fires a repository_dispatch to an infra repo’s orchestrator workflow. The orchestrator resolves ALL service images — not just the one that triggered it — and deploys every tenant via Terraform.
What happens when two repos merge at the same time?
The sequence
- T=0: `web-client` and `tenant-auth` both merge to `releases/0.0.2`.
- T=2m: `tenant-auth` build finishes first, fires dispatch.
- T=2.5m: Orchestrator Run A starts. Tries to resolve all 3 service images.
- T=2.5m: `web-client` image doesn’t exist yet — still building. Run A fails at image resolution.
- T=4m: `web-client` build finishes, fires its own dispatch.
- T=4m: Orchestrator Run B starts. Run A already finished (failed), so the concurrency group is free.
- T=4m: Run B resolves all 3 images. Both new ones exist now. Deploy succeeds.
The end state is correct. Both changes deployed. One workflow run failed. Nobody had to do anything.
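The whole race fits in a toy simulation — a plain dict stands in for ECR, the digests are made up, and the "deploy" is elided — but the fail-then-converge shape is the same:

```python
# Toy model of the race: a dict stands in for the image registry,
# and each orchestrator run tries to resolve ALL service images up front.
SERVICES = ["web-client", "tenant-auth", "apiserver"]

def orchestrator_run(registry: dict) -> bool:
    """Resolve every service image; fail fast if any is missing."""
    try:
        images = {svc: registry[svc] for svc in SERVICES}
    except KeyError:
        return False  # Run A's fate: one image is still building
    # ... deploy `images` to every tenant here ...
    return True

registry = {"apiserver": "sha256:aaa"}  # unchanged service, image already there
registry["tenant-auth"] = "sha256:bbb"  # T=2m: tenant-auth build done
run_a = orchestrator_run(registry)      # T=2.5m: fails, web-client missing
registry["web-client"] = "sha256:ccc"   # T=4m: web-client build done
run_b = orchestrator_run(registry)      # T=4m: succeeds
print(run_a, run_b)  # False True
```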
Why the failure is harmless
Three properties make this work.
Concurrency groups with queuing
```yaml
concurrency:
  group: deploy-tenants
  cancel-in-progress: false
```
`cancel-in-progress: false` means Run B doesn’t cancel Run A — it queues. If Run A were still running when Run B arrived, Run B would wait. The concurrency group serializes all deploys. No two orchestrator runs execute simultaneously.
Resolve all, not just the trigger
The orchestrator resolves every service image on every run. Run B doesn’t just deploy the web-client image that triggered it — it re-resolves tenant-auth and apiserver too. This means Run B picks up both the tenant-auth change (from T=0) and the web-client change (from T=0), even though only web-client fired the dispatch.
This is the key design decision. If the orchestrator only deployed the triggering service, you’d need coordination between repos. By resolving everything every time, each run is a full reconciliation.
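The resolve step can be sketched with an injected lookup, where `resolve_image` is a hypothetical stand-in for whatever registry query the orchestrator uses (e.g. an ECR tag-to-digest lookup):

```python
def resolve_all(services, resolve_image, triggered_by):
    """Resolve every service image, ignoring which service fired the dispatch.

    `triggered_by` is deliberately unused: each run is a full
    reconciliation, not a deploy of just the triggering service.
    """
    return {svc: resolve_image(svc) for svc in services}

digests = resolve_all(
    ["web-client", "tenant-auth", "apiserver"],
    resolve_image=lambda svc: f"sha256:{svc[:3]}",  # stub registry lookup
    triggered_by="web-client",
)
print(sorted(digests))  # all three services resolved, not just the trigger
```

Because the trigger identity never feeds into what gets resolved, a run started by any one repo picks up every pending change.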
Idempotent downstream operations
Terraform is idempotent. Deploying the same image digest twice is a no-op. The Kopf operator is idempotent — re-applying a Tenant CRD doesn’t rotate credentials or recreate databases.
So even if Run A partially succeeded — say it deployed tenant 1 of 3 before failing on image resolution for tenant 2 — Run B re-applies tenant 1 (no-op) and continues through tenants 2 and 3. No rollback logic. No “pick up where we left off” bookkeeping. Just run the whole thing again.
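The re-run-safety argument reduces to a deploy step keyed on image digest. This sketch uses a dict as a stand-in for Terraform state; the tenant and digest names are hypothetical:

```python
def deploy_tenant(state: dict, tenant: str, digest: str) -> str:
    """Idempotent deploy: re-applying the same digest is a no-op."""
    if state.get(tenant) == digest:
        return "no-op"
    state[tenant] = digest
    return "applied"

state = {"tenant-1": "sha256:ccc"}  # Run A got this far before failing
# Run B re-runs the whole loop from scratch:
results = [deploy_tenant(state, t, "sha256:ccc")
           for t in ["tenant-1", "tenant-2", "tenant-3"]]
print(results)  # ['no-op', 'applied', 'applied']
```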
What you don’t need
You could add complexity to prevent the failure entirely:
- Poll for all images before starting. Now you need a timeout, a retry loop, and a decision about how long to wait before giving up.
- Coordinate between app repos. Now you need a shared queue, deduplication, and a way to know when all pending builds have finished.
- Add a dead letter queue for failed runs. Now you need monitoring for the monitor.
Or you could accept that the first run fails and design the system so the next run fixes it automatically.
“Stop on first failure; next queued run retries from scratch” is the entire retry strategy. No exponential backoff. No distributed locking. The concurrency group serializes, full re-resolution makes every run a complete reconciliation, and idempotency makes re-runs safe. Eventual consistency emerges from these three properties alone.
The only real failure mode
The self-healing breaks if ALL dispatches fail — every app repo’s build fails, or the dispatch mechanism itself is broken. But that indicates a real infrastructure problem, not a race condition. A single successful dispatch is enough to converge the entire system to the correct state, because every run resolves every image.
The failed run in your GitHub Actions log looks alarming. Red X, error annotations, a Slack notification if you’ve wired one up. But it’s a feature. The system converges to the correct state within one additional dispatch cycle, and it does so without any component being aware that a race condition occurred.