ferkakta.dev

ElastiCache auth-token to RBAC migration has a Terraform provider bug

Needed to migrate a shared ElastiCache Redis cluster from a single auth token to per-user RBAC. Breaking change — every service on the cluster goes dark if you get the sequencing wrong.

The Terraform provider bug

Step one: don’t touch the real cluster. Built a throwaway copy and ran the migration there first.

Good thing — the Terraform AWS provider has a bug in the auth-token removal step. It tells you the auth token was removed. Updates its state file. The plan shows no changes. But the underlying API call silently fails. The token is still active on the cluster.

This means the next step, associating the RBAC user group, blows up because ElastiCache won’t let you enable RBAC while an auth token is still set. Terraform thinks the token is gone. ElastiCache knows it isn’t.
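One way to catch the mismatch is to ask ElastiCache directly instead of trusting the plan output. A minimal sketch, assuming a configured AWS CLI; my-cluster is a placeholder and token_still_set is a hypothetical helper name. It reads the AuthTokenEnabled field that describe-replication-groups returns for the group:

```shell
#!/usr/bin/env bash
# Sketch: ask the ElastiCache API whether the auth token is really gone.
# token_still_set is a hypothetical helper; the group ID is a placeholder.

token_still_set() {
  local group_id="$1" enabled
  enabled="$(aws elasticache describe-replication-groups \
    --replication-group-id "$group_id" \
    --query 'ReplicationGroups[0].AuthTokenEnabled' \
    --output json)" || return 2   # the API call itself failed
  # With JSON output this field is the literal true or false.
  [ "$enabled" = "true" ]
}

# Usage:
#   if token_still_set my-cluster; then
#     echo "token still active, whatever Terraform says"
#   fi
```

Running a check like this after the Terraform apply, before any RBAC step, makes the silent failure visible.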

The fix

You can’t get there through Terraform. You need a single AWS CLI call that does removal and enablement atomically:

aws elasticache modify-replication-group \
  --replication-group-id my-cluster \
  --auth-token-update-strategy DELETE

This removes the auth token and makes the cluster ready for RBAC in one API call. No intermediate state where the cluster is wide open, no race condition between removal and enablement.
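The command above shows only the DELETE strategy. For the same call to also attach the RBAC user group, modify-replication-group accepts --user-group-ids-to-add. A sketch under assumptions: my-cluster and my-user-group are placeholders, cut_over is a hypothetical wrapper, and --apply-immediately is my addition so the change does not wait for the maintenance window. It defaults to a dry run that only prints the command:

```shell
#!/usr/bin/env bash
# Sketch only: cut_over is a hypothetical helper, all IDs are placeholders.
# DRY_RUN defaults to 1, which prints the command instead of running it.

cut_over() {
  local group_id="$1" user_group_id="$2"
  # Rebuild the positional parameters as the full CLI invocation.
  set -- aws elasticache modify-replication-group \
    --replication-group-id "$group_id" \
    --auth-token-update-strategy DELETE \
    --user-group-ids-to-add "$user_group_id" \
    --apply-immediately
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage:
#   cut_over my-cluster my-user-group             # prints the command
#   DRY_RUN=0 cut_over my-cluster my-user-group   # actually runs it
```

Check the flags against your own CLI version and rehearse on the throwaway cluster first; this is the one call you do not want to debug in production.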

The 4-phase plan

The entire migration exists to make one terrifying moment boring. Here’s the sequencing:

Phase 0 — Build a throwaway test cluster. Run the full migration. Discover the Terraform bug here, not in production. Validate and destroy.

Phase 1 — Infrastructure prep. Add RBAC users and user groups to Terraform. Create per-tenant Redis users with restrictive ACL patterns (~tenant-prefix-*). Apply to prod cluster alongside the existing auth token. Non-breaking — the auth token still works, RBAC users exist but aren’t enforced yet.

Phase 2 — App changes. Each service switches from the shared auth token to its own RBAC username/password (stored in SSM, injected via External Secrets). All services across all tenants need to deploy with the new auth mode before Phase 3.

Phase 3 — The point of no return. One CLI call: auth-token-update-strategy DELETE. Auth token gone, RBAC enforced. Only works if Phase 2 is complete. Everything before this is reversible. Everything after is cleanup.

Phase 4 — Merge stacked PRs back to main. Terraform state reflects reality.
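To make the Phase 1 ACL pattern concrete, here is a rough CLI equivalent of the per-tenant user creation. The real work went through Terraform, so treat this as illustration only. Assumptions: the helper names and the "-user" suffix are hypothetical, and "+@all" (every command allowed) is a stand-in for whatever command set each service actually needs:

```shell
#!/usr/bin/env bash
# Sketch of Phase 1 user creation via the CLI. All names are placeholders.

# Build an access string that confines a user to one tenant's key prefix.
# "+@all" is an assumption; narrow it to the commands the service uses.
tenant_access_string() {
  local tenant="$1"
  printf 'on ~%s-* +@all' "$tenant"
}

# Create the RBAC user. Passing the password as an argument keeps the sketch
# short; in practice it would come from a secret store, not the shell.
create_tenant_user() {
  local tenant="$1" password="$2"
  aws elasticache create-user \
    --user-id "${tenant}-user" \
    --user-name "${tenant}-user" \
    --engine redis \
    --access-string "$(tenant_access_string "$tenant")" \
    --passwords "$password"
}

# Usage:
#   tenant_access_string tenant-a
#   create_tenant_user tenant-a "$TENANT_A_PASSWORD"
```

Creating users and groups this way does not change how the cluster authenticates until the group is attached to the replication group, which is why Phase 1 is non-breaking.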

Cross-timezone coordination

My teammate who owns the app layer is 10+ hours ahead. By the time he starts work, I’m asleep. He has three services across multiple repos to update with new auth modes, new images, new credentials through SSM. He’s the conductor of the app side. I’m the conductor of the infra side. We meet in the middle at Phase 3.

Instead of “call me when you get to step 3,” I built:

The entire plan exists so that the person executing Phase 3 doesn’t need to understand all four phases. They need to understand one CLI command and two preconditions.

The takeaway

The best infrastructure work looks like nothing happened. Five hours of planning, zero minutes of downtime.

#aws #elasticache #terraform #redis