ferkakta.dev

From eight manual steps to one command

I provisioned two tenants by hand before I decided that nobody should ever provision a tenant by hand.

The provisioning flow for our multi-tenant SaaS platform was 8 steps across 4 tools — a Python CLI, a shell script with 5 flags per invocation, a GitHub Actions workflow, and two Kubernetes job manifests requiring injected DB connection strings. Each step had different inputs, different env files, and subtly different flag names for the same concept. The two populate runs used --appname apiserver and --appname tenant_auth_service — note the underscores in one and not the other. That naming inconsistency is a guaranteed typo on a Friday afternoon, and each flag is a chance to silently write 24 SSM parameters to the wrong path.

I ran the full flow twice. Both times I fat-fingered something. Both times the failure was silent — the wrong SSM path got populated, the right one stayed empty, and the pod crash 10 minutes later gave no indication which of the 38 parameters was missing.
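A cheap guard against that silent failure mode is to count the parameters under the expected path right after each populate run. A sketch, assuming a /tenant/&lt;name&gt;/&lt;app&gt;/ path layout (an illustration, not necessarily what populate_ssm.sh actually writes):

```shell
# Verify that the expected number of parameters landed under a path.
# The /tenant/<name>/<app>/ layout is an assumption for illustration;
# substitute whatever prefix populate_ssm.sh actually writes to.
verify_ssm_count() {
  local tenant=$1 app=$2 expected=$3
  local path="/tenant/${tenant}/${app}/"
  local actual
  actual=$(aws ssm get-parameters-by-path \
    --path "$path" --recursive \
    --query 'length(Parameters)' --output text)
  if [[ "$actual" -ne "$expected" ]]; then
    echo "ERROR: $path has $actual parameters, expected $expected" >&2
    return 1
  fi
  echo "OK: $path has $expected parameters"
}
```

Run verify_ssm_count momcorp apiserver 24 after the first populate and verify_ssm_count momcorp tenant_auth_service 14 after the second, and the wrong-path failure stops being silent.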

So I wrote a wrapper script. Not Terraform, not a CRD, not a proper automation pipeline. A shell script.

The flow before the wrapper

Here is what provisioning a single tenant looked like before:

# 1. Register in auth system + identity provider
python tenant_ops_client.py --command add_new_tenant --config ridgeback.yml

# 2. Check email, copy tenant hash and org code (human step)

# 3. Populate apiserver SSM params (24 parameters)
./populate_ssm.sh --envfile env_api_server --tenant momcorp \
  --orgcode org_abc123 --region us-east-1 --appname apiserver

# 4. Populate tenant-auth-service SSM params (14 parameters)
./populate_ssm.sh --envfile env_tenant_auth --tenant momcorp \
  --orgcode org_abc123 --region us-east-1 --appname tenant_auth_service

# 5. Trigger Terraform workflow
gh workflow run terraform-shared-tenant.yml \
  -f tenant_name=momcorp -f action=apply

# 6. Wait for workflow, force ESO sync, wait for pods
gh run watch --exit-status
kubectl rollout restart -n tenant-momcorp deployment/...

# 7. Apply ETL job, wait, delete
kubectl apply -f etl-job.yaml
kubectl wait --for=condition=complete job/etl -n tenant-momcorp
kubectl delete job etl -n tenant-momcorp

# 8. Apply first-user job, wait, delete
kubectl apply -f first-user-job.yaml
kubectl wait --for=condition=complete job/first-user -n tenant-momcorp
kubectl delete job first-user -n tenant-momcorp

That is 8 commands with up to 5 flags each, 2 different env files, 2 different appnames, a YAML config, a GHA workflow dispatch, and 2 K8s job manifests that need DB connection strings and Docker image refs injected at apply time. The tenant name appears in every single command. The org code appears in two of them. The region appears in two of them and never changes. The env file paths are static but different per step. Every one of these is a variable that should be a constant, and every constant that’s typed instead of derived is an error waiting to happen.

The wrapper

provision-tenant --config command_ridgeback_create.yml

Or fully interactive:

provision-tenant
# Prompts: tenant name, website, admin email, admin name, address...
# Runs step 1, parses response for tenant hash + UUID
# Pauses once: "Org code (from devops email):"
# Then runs steps 3-8 unattended

One command. One human pause. The rest is automated sequencing with hardcoded constants.
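The shape of the wrapper is unremarkable: strict mode, one function per step, and a linear main. A sketch with the step bodies reduced to placeholders (the function names here are illustrative, not the script's real ones):

```shell
#!/usr/bin/env bash
set -euo pipefail   # any failing step aborts the whole run

# Placeholder step bodies; the real ones wrap the eight commands above.
step_register()        { echo "registering tenant ${TENANT}"; }
step_prompt_org_code() { read -rp "Org code (from devops email): " ORG_CODE; }
step_populate_ssm()    { echo "populating SSM for $1"; }
step_terraform()       { echo "dispatching terraform workflow"; }
step_job()             { echo "running $1 job"; }

main() {
  TENANT=$1
  step_register                          # 1. auth system + IdP
  step_prompt_org_code                   # 2. the one human pause
  step_populate_ssm apiserver            # 3.
  step_populate_ssm tenant_auth_service  # 4.
  step_terraform                         # 5-6.
  step_job etl                           # 7.
  step_job first-user                    # 8.
}

# main momcorp
```

The sequencing is the whole point: every value flows downward, and nothing gets typed twice.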

What the wrapper actually does

Step 1 is included — it calls the Python client directly and parses the JSON response for tenant_hash and uuid. No need to run it separately, no need to copy values from terminal output.
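The parsing looks roughly like this, assuming the client prints JSON with tenant_hash and uuid fields (the field names are an assumption about the response shape) and that jq is available:

```shell
# Extract the two values every later step needs from the registration
# response. The field names are assumed; adjust to the real payload.
parse_registration() {
  local response=$1
  TENANT_HASH=$(jq -r '.tenant_hash' <<<"$response")
  TENANT_UUID=$(jq -r '.uuid' <<<"$response")
  if [[ "$TENANT_HASH" == "null" || "$TENANT_UUID" == "null" ]]; then
    echo "ERROR: response missing tenant_hash or uuid" >&2
    return 1
  fi
}

# response=$(python tenant_ops_client.py --command add_new_tenant --config ridgeback.yml)
# parse_registration "$response"
```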

Step 2 is a read -rp prompt. The org code comes via email from our IdP, not from the API response. I cannot eliminate this human step until we wire the IdP management API to return it programmatically. So the wrapper pauses once, collects the org code, and moves on. One prompt is better than eight commands.
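The pause itself is worth a format check, so a stray space or a half-pasted email line does not become the org code. A sketch, where the org_ prefix pattern is an assumption generalized from org_abc123:

```shell
# Prompt until the input looks like an org code. The ^org_[a-z0-9]+$
# pattern is an assumption based on the example org_abc123.
read_org_code() {
  local code
  while true; do
    read -rp "Org code (from devops email): " code || return 1
    if [[ "$code" =~ ^org_[a-z0-9]+$ ]]; then
      printf '%s\n' "$code"
      return 0
    fi
    echo "That does not look like an org code (expected org_...); try again." >&2
  done
}
```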

Steps 3 and 4 hardcode --envfile paths, --region, and --appname. These never change between tenants. The two values that do change — tenant_hash and org_code — come from steps 1 and 2. The wrapper eliminates the class of error where you type apiserver when you mean tenant_auth_service, or point at the wrong env file, or forget --region and get a default you didn’t expect.
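In script form that is just constants at the top and one function taking the two real variables (populate_all_ssm is an illustrative name; the envfile paths and flags are the ones from the manual flow):

```shell
# Everything that never changes between tenants is a readonly constant.
readonly REGION="us-east-1"
readonly API_ENVFILE="env_api_server"
readonly AUTH_ENVFILE="env_tenant_auth"

# The only per-tenant inputs are the values produced by steps 1 and 2.
populate_all_ssm() {
  local tenant=$1 orgcode=$2
  ./populate_ssm.sh --envfile "$API_ENVFILE" --tenant "$tenant" \
    --orgcode "$orgcode" --region "$REGION" --appname apiserver
  ./populate_ssm.sh --envfile "$AUTH_ENVFILE" --tenant "$tenant" \
    --orgcode "$orgcode" --region "$REGION" --appname tenant_auth_service
}
```

The apiserver/tenant_auth_service pairing now exists in exactly one place, reviewed once, instead of being retyped per tenant.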

Steps 5 through 8 use gh run watch --exit-status to block until the Terraform workflow completes, then kubectl wait --for=condition=ready for pod readiness, then render and apply the K8s job manifests with the correct DB URL and image tag extracted from the running deployment. Cleanup is automatic — kubectl delete job and kubectl delete secret after each job completes.
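The two job steps share one pattern: render, apply, wait, delete. Something like this (run_job is an illustrative name, and the envsubst rendering is an assumption about how the DB URL and image tag get injected):

```shell
# Apply a job manifest, block until it completes, then clean it up.
# Variables referenced in the manifest (DB URL, image tag) are expected
# to be exported in the environment; envsubst fills them at apply time.
run_job() {
  local name=$1 manifest=$2 ns=$3
  envsubst < "$manifest" | kubectl apply -n "$ns" -f -
  kubectl wait --for=condition=complete "job/${name}" -n "$ns" --timeout=10m
  kubectl delete "job/${name}" -n "$ns"
}

# run_job etl etl-job.yaml tenant-momcorp
# run_job first-user first-user-job.yaml tenant-momcorp
```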

What the wrapper does not do

No error recovery. If step 4 fails, you get an error message and a half-provisioned tenant. That is fine. The real automation — Terraform state plus operator CRDs — will handle rollback and partial state. This wrapper is a bridge, not a destination. Adding retry logic or rollback to a shell script is how shell scripts become unmaintainable systems.
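What it does instead is fail fast and loudly. Strict mode plus an ERR trap is enough to stop at the first failure and say where it happened (a sketch; the message wording is mine):

```shell
#!/usr/bin/env bash
# -E makes the ERR trap fire inside functions and subshells too.
set -Eeuo pipefail
trap 'echo "provision-tenant: failed at line $LINENO; tenant may be half-provisioned" >&2' ERR
```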

No parallel execution. Steps run sequentially because they have real dependencies — step 3 writes SSM parameters that step 5’s Terraform reads, step 5 creates the namespace that steps 7 and 8 deploy jobs into. Running them sequentially is honest about the dependency chain.

No configuration file format. Just CLI flags and interactive prompts. If the provisioning flow needs a config file with schema validation and default merging, that is a sign you should be building the real automation, not extending the wrapper.

The automation spectrum

Most teams I have worked with treat automation as binary — either you run it by hand with a wiki page, or you build the full pipeline. The wiki page lasts longer than anyone intends because the full pipeline is always a week away and there are always higher priorities. Meanwhile the wiki page accumulates errata in italics, and the person who wrote it leaves, and the new person follows the steps but misses the italicized warning about the underscores in tenant_auth_service.

The wrapper script is the middle ground that nobody talks about because it is not architecturally interesting. It does not solve the automation problem. It makes the automation problem survivable. It captures the operational knowledge that currently lives in one person’s head — which flags never change, which values come from which prior step, which commands block and which fire-and-forget — and encodes it in something executable.

The three tiers this buys time for

The wrapper is tier 0. It does not change the architecture. It chains the existing manual steps and eliminates the most common errors. Here is what comes next:

Tier 1 (hours of work): uncomment the Terraform SSM parameter block that already exists in the tenant module. This eliminates steps 3 and 4 entirely — Terraform writes the SSM parameters directly, no populate script needed.

Tier 2 (days): add ETL and first-user as steps in the GitHub Actions workflow, triggered after terraform apply succeeds. This eliminates the manual K8s job application in steps 7 and 8.

Tier 3 (weeks): a TenantRootUser CRD in the Kopf operator. The operator watches for the CRD, calls the auth API, retrieves the org code programmatically, and provisions the tenant end-to-end. This eliminates steps 1 and 2 — the entire flow becomes kubectl apply -f tenant.yaml.

Each tier makes the previous one obsolete. Tier 1 kills the wrapper’s SSM steps. Tier 2 kills its K8s job steps. Tier 3 kills the wrapper entirely. That is the plan. The wrapper is what makes the plan survivable while you execute it, because tenants need provisioning now, not after tier 3 ships.

The bridge pattern

The best automation you can ship today is a shell script that chains the manual steps, hardcodes the constants, and prompts for the variables. It is not elegant. It is not the final answer. It will never appear in an architecture diagram or a conference talk. But it captures what you know, prevents the errors you have already hit, and gives you a running start on the real thing. The wrapper is not technical debt — it is operational knowledge with an expiration date. Write it, use it, and replace it when the real automation arrives.

#automation #shell #multi-tenant #devops