Making a Kopf operator idempotent: three-layer existence checks and the redisReady race
Our tenant operator provisions databases, cache users, and credentials for each tenant in a multi-tenant SaaS platform. PostgreSQL roles on shared RDS, ElastiCache RBAC users, SSM parameters with generated passwords. It worked exactly once per tenant. The second time it ran, it regenerated every password and overwrote every SSM parameter. Running services holding the old credentials immediately lost their database and cache connections.
This was the blocker for auto-deploy.
Every deploy was a coordinated outage
The orchestrator runs terraform apply for each tenant on every deploy. Terraform reconciles the Tenant CRD, which fires Kopf’s on_tenant_create handler. The handler doesn’t distinguish between “new tenant” and “existing tenant whose CRD was re-applied.” It generates fresh passwords, creates new PostgreSQL roles (which fail because the role exists, or worse, succeed and orphan the old one), and overwrites SSM parameters with credentials that no running pod knows about.
One deploy, every tenant’s services go dark. That is not a deploy pipeline — it is a coordinated outage generator.
Three layers of state to check
The obvious answer is to check whether resources exist before creating them. The non-obvious part is that “exists” spans three independent systems — PostgreSQL, ElastiCache, and SSM — and all three have to agree before the handler can safely skip provisioning.
async def tenant_user_exists(tenant_name: str, app_name: str) -> bool:
    username = _make_username(tenant_name=tenant_name, app_name=app_name)
    conn = get_master_connection()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1 FROM pg_roles WHERE rolname = %s", [username])
            return cur.fetchone() is not None
    finally:
        conn.close()
Same pattern for ElastiCache (describe_users, catch UserNotFoundFault) and SSM (get_parameter, catch ParameterNotFound). If all three layers exist for all apps, the handler sets status and returns without touching anything:
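The other two layers follow the same shape. A minimal sketch, with the boto3 clients passed in as arguments for testability (the real helpers presumably construct their own); the function names here are mine, not the operator's:

```python
def redis_user_exists(ec_client, user_id: str) -> bool:
    """ElastiCache layer: does the RBAC user already exist?"""
    try:
        ec_client.describe_users(UserId=user_id)
        return True
    except ec_client.exceptions.UserNotFoundFault:
        return False


def ssm_parameter_exists(ssm_client, name: str) -> bool:
    """SSM layer: does the credential parameter already exist?"""
    try:
        ssm_client.get_parameter(Name=name, WithDecryption=True)
        return True
    except ssm_client.exceptions.ParameterNotFound:
        return False
```

Both checks treat the "not found" exception as a normal answer rather than an error, which is what makes them safe to run on every reconcile.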
if await _tenant_fully_provisioned(name, TENANT_APPS):
    log.info("tenant_already_provisioned", tenant=name)
    patch.status["phase"] = "Ready"
    patch.status["dbReady"] = True
    patch.status["redisReady"] = True
    return
The cost is 5 API calls per re-apply. The cost of not doing it is credential rotation on every deploy. I shipped this, tested it, confirmed that re-applies were no-ops. Done.
Then redisReady disappeared.
The race I didn’t see
kubectl get tn acme -o jsonpath='{.status.redisReady}' returned empty. The operator logs showed Redis provisioning succeeded. The status field was missing from the CRD.
Kopf’s @kopf.on.field handler for spec.plan fires during initial creation — not because anyone changed the plan, but because spec.plan transitions from None to "standard" when the CRD is first applied. This plan-change handler runs concurrently with on_tenant_create. Both handlers produce status patches. Kopf reconciles them, and the plan-change handler’s patch — which does not set redisReady — overwrites the create handler’s patch that does.
Two handlers, two patches, one winner. The wrong one.
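Stripped to its essence, the overwrite the post describes looks like this. The patch contents are illustrative, not the operator's actual fields beyond the ones named above:

```python
# What the create handler wrote to .status:
create_patch = {"phase": "Ready", "dbReady": True, "redisReady": True}

# What the spurious plan-change handler wrote. It never sets redisReady,
# because changing a plan has nothing to do with Redis provisioning.
plan_patch = {"phase": "Ready", "dbReady": True}

# Last write wins: the plan handler's patch lands second and replaces
# the status the create handler just recorded.
status = create_patch
status = plan_patch

assert "redisReady" not in status  # the field has silently vanished
```

If the two patches were merged field-by-field instead of applied last-write-wins, redisReady would have survived; the guard clause in the next section avoids the question entirely by making the second patch never happen.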
I’d been thinking about idempotency as a data problem — do the external resources exist? But the status race was a concurrency problem inside the operator itself. The existence checks were correct. The handler was correct. The patch that recorded the result got silently overwritten by a handler that shouldn’t have fired at all.
The fix is a guard clause:
@kopf.on.update("tenants.ramparts.com", field="spec.plan")
async def on_plan_change(old, new, **kwargs):
    if old is None:
        return  # initial creation, not a real plan change
    # ... handle actual plan changes
On initial creation, old is None because spec.plan didn’t previously exist. The handler bails out, produces no patch, and the create handler’s redisReady: true survives.
What I assumed twice
The first assumption was that on_tenant_create only fires on creation. It doesn't: it can fire again on CRD re-applies, spec updates, operator restarts, and informer resyncs. If your handler creates external resources, every one of those paths needs an existence check.
The second assumption was that fixing idempotency meant fixing the data layer. It didn’t — the data was fine. The operator’s own internal state management was racing against itself, and the symptom looked identical to the problem I’d just solved. I checked the databases. I checked SSM. I checked ElastiCache. Everything existed. The bug was in the patch, not the provisioning.
Two layers of “I thought I understood this.” Five API calls and one guard clause.