ElastiCache auth-token to RBAC migration has a Terraform provider bug
Needed to migrate a shared ElastiCache Redis cluster from a single auth token to per-user RBAC. Breaking change — every service on the cluster goes dark if you get the sequencing wrong.
The Terraform provider bug
Step one: don’t touch the real cluster. Built a throwaway copy and ran the migration there first.
Good thing — the Terraform AWS provider has a bug in the auth-token removal step. It tells you the auth token was removed. Updates its state file. The plan shows no changes. But the underlying API call silently fails. The token is still active on the cluster.
This means the next step, associating the RBAC user group, blows up because ElastiCache won’t let you enable RBAC while an auth token is still set. Terraform thinks the token is gone. ElastiCache knows it isn’t.
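One way to catch this kind of drift is to ask ElastiCache directly instead of trusting the plan. A minimal sketch, assuming the AWS CLI is already configured; the helper name token_still_enabled is made up for illustration, and my-cluster is a placeholder:

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if ElastiCache itself still reports an auth
# token on the replication group, regardless of what Terraform state says.
# Exits 1 if the token is gone, 2 if the API call failed.
token_still_enabled() {
  group_id="$1"
  # AuthTokenEnabled is a boolean in the DescribeReplicationGroups response;
  # with --output text the CLI prints it as True or False.
  enabled="$(aws elasticache describe-replication-groups \
    --replication-group-id "$group_id" \
    --query 'ReplicationGroups[0].AuthTokenEnabled' \
    --output text)" || return 2
  [ "$(printf '%s' "$enabled" | tr '[:upper:]' '[:lower:]')" = "true" ]
}

# Usage (placeholder cluster name):
#   token_still_enabled my-cluster && echo "token is still active"
```

Running this right after the Terraform apply is what would expose the mismatch: a clean plan alongside an exit code of 0 here.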
The fix
You can’t get there through Terraform. You need a single AWS CLI call that removes the token and attaches the RBAC user group atomically (my-cluster and my-user-group are placeholders):
aws elasticache modify-replication-group \
  --replication-group-id my-cluster \
  --auth-token-update-strategy DELETE \
  --user-group-ids-to-add my-user-group
This removes the auth token and makes the cluster ready for RBAC in one API call. No intermediate state where the cluster is wide open, no race condition between removal and enablement.
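Given the Terraform experience above, it seems worth confirming the result from ElastiCache's side rather than assuming the call worked. A rough sketch, assuming the RBAC user group is expected to be attached once the modification completes; verify_rbac_switchover is a hypothetical helper name and the IDs are placeholders:

```shell
#!/bin/sh
# Hypothetical helper: confirm the switchover from ElastiCache's point of view.
# Exits 0 only if the token is gone AND the expected user group is attached.
# Exits 1 if either check fails, 2 if an API call failed.
verify_rbac_switchover() {
  group_id="$1"
  user_group_id="$2"

  # Block until the modification has finished applying.
  aws elasticache wait replication-group-available \
    --replication-group-id "$group_id" || return 2

  token="$(aws elasticache describe-replication-groups \
    --replication-group-id "$group_id" \
    --query 'ReplicationGroups[0].AuthTokenEnabled' \
    --output text)" || return 2

  # Text output puts the attached user group IDs on one tab-separated line.
  attached="$(aws elasticache describe-replication-groups \
    --replication-group-id "$group_id" \
    --query 'ReplicationGroups[0].UserGroupIds' \
    --output text)" || return 2

  case "$token" in
    True|true) echo "auth token still enabled" >&2; return 1 ;;
  esac

  # Unquoted on purpose: split the ID list on whitespace.
  for g in $attached; do
    [ "$g" = "$user_group_id" ] && return 0
  done
  echo "user group $user_group_id is not attached" >&2
  return 1
}

# Usage (placeholder names):
#   verify_rbac_switchover my-cluster my-user-group
```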
The 4-phase plan
The entire migration exists to make one terrifying moment boring. Here’s the sequencing:
Phase 0 — Build a throwaway test cluster. Run the full migration. Discover the Terraform bug here, not in production. Validate and destroy.
Phase 1 — Infrastructure prep. Add RBAC users and user groups to Terraform. Create per-tenant Redis users with restrictive ACL patterns (~tenant-prefix-*). Apply to prod cluster alongside the existing auth token. Non-breaking — the auth token still works, RBAC users exist but aren’t enforced yet.
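The real work here lived in Terraform, but the shape of a per-tenant user is easier to see as a raw CLI call. A sketch only: the helper names, the "-svc" suffix, the TENANT_REDIS_PASSWORD variable, and the command set (+@all -@dangerous) are all assumptions for illustration; only the ~tenant-prefix-* key pattern comes from the plan above.

```shell
#!/bin/sh
# Hypothetical helper: build an ElastiCache access string that confines a
# user to keys under its own tenant prefix.
tenant_access_string() {
  tenant_prefix="$1"
  # "on" enables the user and "~<prefix>-*" limits the key pattern.
  # The command set (+@all -@dangerous) is an assumption; tighten per service.
  printf 'on ~%s-* +@all -@dangerous' "$tenant_prefix"
}

# Hypothetical wrapper around the create-user call. The password is read
# from the environment (TENANT_REDIS_PASSWORD, a made-up variable name) so
# it is not typed into shell history.
create_tenant_user() {
  tenant_prefix="$1"
  aws elasticache create-user \
    --user-id "${tenant_prefix}-svc" \
    --user-name "${tenant_prefix}-svc" \
    --engine redis \
    --access-string "$(tenant_access_string "$tenant_prefix")" \
    --passwords "$TENANT_REDIS_PASSWORD"
}

# Usage (placeholder tenant):
#   TENANT_REDIS_PASSWORD=... create_tenant_user tenant-a
```

Creating users and groups this way is non-breaking for the same reason as the Terraform version: nothing is enforced until the group is attached to the cluster.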
Phase 2 — App changes. Each service switches from the shared auth token to its own RBAC username/password (stored in SSM, injected via External Secrets). All services across all tenants need to deploy with the new auth mode before Phase 3.
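A per-service smoke test can make "Phase 2 is complete" something you check rather than something you hope. The sketch below assumes the password is stored as a SecureString parameter, that redis-cli is version 6 or later and built with TLS support, and that the user's ACL permits PING; the function name, parameter path, and host are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prove that a service's RBAC credentials work before
# the auth token is removed. Exits 0 on PONG, 1 on any other reply, 2 if a
# command failed outright.
check_rbac_login() {
  ssm_param="$1"    # e.g. /tenant-a/redis/password (hypothetical path)
  redis_user="$2"
  redis_host="$3"

  password="$(aws ssm get-parameter \
    --name "$ssm_param" \
    --with-decryption \
    --query 'Parameter.Value' \
    --output text)" || return 2

  # REDISCLI_AUTH keeps the password off the command line.
  reply="$(REDISCLI_AUTH="$password" redis-cli \
    -h "$redis_host" -p 6379 --tls \
    --user "$redis_user" PING)" || return 2

  [ "$reply" = "PONG" ]
}

# Usage (placeholder values):
#   check_rbac_login /tenant-a/redis/password tenant-a-svc my-cluster.example.internal
```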
Phase 3 — The point of no return. One CLI call: auth-token-update-strategy DELETE. Auth token gone, RBAC enforced. Only works if Phase 2 is complete. Everything before this is reversible. Everything after is cleanup.
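Because this step can't be undone, a small preflight in front of it is cheap insurance. The sketch below guesses at two preconditions (the user group from Phase 1 is active, and someone has confirmed every service is on RBAC credentials); phase3_preflight and PHASE2_CONFIRMED are made-up names, not part of any tool.

```shell
#!/bin/sh
# Hypothetical preflight for Phase 3. Exits 0 only if both assumed
# preconditions hold; exits 1 if one is unmet, 2 if the API call failed.
phase3_preflight() {
  user_group_id="$1"

  # Assumed precondition 1: the RBAC user group exists and is usable.
  status="$(aws elasticache describe-user-groups \
    --user-group-id "$user_group_id" \
    --query 'UserGroups[0].Status' \
    --output text)" || return 2
  if [ "$status" != "active" ]; then
    echo "user group $user_group_id is '$status', expected 'active'" >&2
    return 1
  fi

  # Assumed precondition 2: a human has confirmed the Phase 2 deploys.
  # PHASE2_CONFIRMED is a made-up variable; set it only after checking.
  if [ "${PHASE2_CONFIRMED:-no}" != "yes" ]; then
    echo "Phase 2 not confirmed; refusing to continue" >&2
    return 1
  fi
  return 0
}

# Usage (placeholder name):
#   PHASE2_CONFIRMED=yes phase3_preflight my-user-group && echo "safe to run Phase 3"
```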
Phase 4 — Merge stacked PRs back to main. Terraform state reflects reality.
Cross-timezone coordination
My teammate who owns the app layer is 10+ hours ahead. By the time he starts work, I’m asleep. He has three services across multiple repos to update with new auth modes, new images, new credentials through SSM. He’s the conductor of the app side. I’m the conductor of the infra side. We meet in the middle at Phase 3.
Instead of “call me when you get to step 3,” I built:
- A 4-phase JIRA epic with explicit sequencing
- Two PRs on stacked branches (the Phase 4 PR merges into the Phase 1 branch after the switchover)
- CI commands in every ticket — no terraform from laptops, everything traced
- One Slack message with every path, link, and verification command inline
The entire plan exists so that the person executing Phase 3 doesn’t need to understand all four phases. They need to understand one CLI command and two preconditions.
The takeaway
The best infrastructure work looks like nothing happened. Five hours of planning, zero minutes of downtime.