Your onboarding flow is your architecture’s report card
I ran a colleague’s manual tenant onboarding flow for a multi-tenant SaaS platform. Five steps, two attempts, and a list of errors that mapped precisely to every automation gap in the system. The onboarding flow wasn’t broken. It was a diagnostic.
The five steps
The flow to bring a new tenant from nothing to working:
- Run a Python registration script that calls the auth-handler API, creates an org in the identity provider, and sends a confirmation email to the devops team.
- Read the devops email. Manually extract two values: a tenant hash and an org code.
- Run populate scripts that seed 38 SSM parameters — 24 for apiserver, 14 for tenant-auth-service.
- Trigger a GitHub Actions workflow. Terraform creates the namespace, deployments, ExternalSecrets, DNS records, HTTPS.
- Manually apply Kubernetes jobs for ETL seed data and first-user creation.
Step 4 is automated. Steps 1, 2, 3, and 5 are manual. The manual steps are where the architecture’s seams show.
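Step 3's populate scripts reduce to a loop of aws ssm put-parameter calls under a per-tenant path prefix. A minimal sketch; the /tenants/<tenant>/<app>/<key> path convention and the parameter names are illustrative assumptions, not the platform's real ones:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Seed SSM parameters for one app of one tenant.
# Path convention /tenants/<tenant>/<app>/<key> is hypothetical.
seed_tenant_params() {
  local tenant="$1" app="$2"
  shift 2
  local kv key value
  for kv in "$@"; do              # each remaining arg is key=value
    key="${kv%%=*}"
    value="${kv#*=}"
    aws ssm put-parameter \
      --name "/tenants/${tenant}/${app}/${key}" \
      --type SecureString \
      --value "${value}" \
      --overwrite
  done
}

# Example:
# seed_tenant_params tenant-50 apiserver db-host=10.0.0.5 org-code=abc123org
```

The --overwrite flag makes re-runs idempotent, which is what lets a wrapper safely retry a partially completed seeding.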
What each manual step reveals
Step 2 exists because the registration API doesn’t return the org code. The identity provider sends it via email. The system has a side-channel dependency on email for machine-readable data — a human reading an inbox is load-bearing infrastructure between “tenant registered” and “tenant provisioned.”
Step 3 exists because the Terraform SSM block is commented out. It was written. It works. It has lifecycle { ignore_changes = [value] } so Terraform won’t clobber values after initial seeding. But the developer was still iterating on which parameters to include, so the block was disabled “temporarily.” “Temporarily” has lasted months. The shell scripts that replaced it accept --tenant, --appname, and --orgcode flags with no validation on any of them.
Step 5 exists because the ETL and first-user jobs depend on the database schema existing (migrations run on pod startup) and on seed data being present. The GitHub Actions workflow doesn’t know when the app is ready, so it can’t sequence the jobs. A human watches for healthy pods and then applies the jobs manually. The gap is a missing readiness contract between infrastructure provisioning and application initialization.
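The missing readiness contract has a direct expression in kubectl: gate each job on the deployment rollout, then gate cleanup on job completion. A sketch, assuming a per-tenant namespace and a deployment named apiserver; the namespace, deployment, and manifest names are all assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Apply a post-deploy job only once the app is actually ready,
# then clean it up. Names here are illustrative.
run_post_deploy_job() {
  local ns="$1" manifest="$2" job="$3"
  # Block until the app's pods pass their readiness probes.
  kubectl -n "$ns" rollout status "deployment/apiserver" --timeout=10m
  kubectl -n "$ns" apply -f "$manifest"
  # Block until the job itself reports success.
  kubectl -n "$ns" wait --for=condition=complete "job/$job" --timeout=15m
  kubectl -n "$ns" delete -f "$manifest"
}

# Example:
# run_post_deploy_job tenant-50 etl-seed-job.yaml etl-seed
```

The human who watches for healthy pods is replaced by the first kubectl line; the sequencing the workflow couldn't express is the order of the four calls.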
Each manual step is a dependency that the automation doesn’t model. Email as a data bus. Commented-out infrastructure. Missing sequencing between infra and app readiness.
My errors
I made four distinct errors across two runs.
First: I swapped the --tenant and --orgcode flags on the populate script. The script exited 0. It wrote 24 parameters to the wrong SSM path. No validation that the tenant name corresponded to a real tenant. No validation that the org code matched the expected format. Silent success with wrong data.
# What I typed
./populate_apiserver_params.sh --tenant abc123org --orgcode tenant-50
# What I should have typed
./populate_apiserver_params.sh --tenant tenant-50 --orgcode abc123org
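Catching the swap takes a few lines of validation before anything is written. A sketch, assuming tenant names follow a tenant-<number> pattern and org codes are lowercase alphanumeric hashes; both formats are guesses, the real ones live with the identity provider:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail fast on malformed inputs instead of writing 38 parameters
# to a wrong SSM path. Both regexes are guessed formats.
validate_tenant() {
  [[ "$1" =~ ^tenant-[0-9]+$ ]] || { echo "bad tenant name: $1" >&2; return 1; }
}

validate_orgcode() {
  [[ "$1" =~ ^[a-z0-9_]{6,}$ ]] || { echo "bad org code: $1" >&2; return 1; }
}
```

With these guards in front of the populate scripts, the swapped invocation above fails immediately with a nonzero exit instead of succeeding silently.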
Second: I used the wrong --appname value. Same result — exit 0, parameters written to a path that nothing will ever read.
Third: I ran the shell wrappers from the wrong directory. They used relative paths to read config files. From the wrong directory, the relative paths resolved to nothing. The scripts read no config, wrote no parameters, and exited 0.
Fourth: I forgot to run the ETL and first-user Kubernetes jobs after step 4 completed. The app rendered. The login page loaded. It looked working. But there was no data and no users. A hollow deployment that would pass a health check and fail every functional test.
Why the author never hit these bugs
The colleague who wrote these scripts never encountered any of these errors. He wrote them, he knows the invocation by heart, he runs them from the right directory every time. That’s not a defense. That’s the failure mode.
“Works for the person who built it” is the definition of tribal knowledge, not automation. A process that requires the author’s muscle memory to execute correctly has a bus factor of one. The scripts encode implicit contracts — correct flag ordering, correct working directory, correct sequencing of post-deploy jobs — that exist only in the author’s head.
What I built
I wrote a single wrapper script called provision-tenant that chains all five steps. It hardcodes the things that never change: config file paths, app names, AWS region, the SSM path prefix convention. It prompts exactly once for the two values from the email — tenant hash and org code — because that side-channel dependency on email still exists and I can’t fix it from the CLI.
The script validates inputs before writing anything. It waits for pod readiness before applying the ETL and first-user jobs. It cleans up the jobs after completion. One command, one pause for human input, everything else unattended.
provision-tenant
# Interactive:
# Enter tenant name: tenant-51
# Enter org code from devops email: org_abc123
#
# [validates tenant name format]
# [populates 24 apiserver SSM params]
# [populates 14 tenant-auth-service SSM params]
# [triggers GHA workflow, waits for completion]
# [waits for pod readiness]
# [applies ETL job, waits, cleans up]
# [applies first-user job, waits, cleans up]
# Done.
The interesting part is what disappeared. No --appname flag because there are exactly two app names and they never change. No relative paths because everything is resolved from the script’s own location. No chance of forgetting the post-deploy jobs because they’re part of the same execution. The only manual step left is the one that requires reading an email, which is an architecture problem the script can’t solve.
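The relative-path failure from the third error disappears with the standard trick of resolving paths from the script’s own location instead of the caller’s working directory; the config file names here are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Resolve the directory this script lives in, so config paths no
# longer depend on where the caller invoked it from.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"

# Config files are found relative to the script, not the caller.
# File names are illustrative.
APISERVER_CONFIG="${SCRIPT_DIR}/config/apiserver-params.env"
AUTH_CONFIG="${SCRIPT_DIR}/config/tenant-auth-params.env"
```

Run from any directory, the same absolute paths resolve, so the exit-0-with-no-work failure mode is gone.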
The conversation
The colleague’s first reaction was pushback. “I never had these errors.” True, and irrelevant. He also said, “Even validation is over-engineering for this script.” I didn’t argue. I let him run it and watch it catch a typo in the org code before it wrote 38 parameters to the wrong path.
His conclusion, unprompted: “We should just automate this thing. It is error prone.”
Getting someone to reach the conclusion themselves works better than presenting it as criticism. The wrapper script wasn’t an argument. It was evidence.
The diagnostic
Your onboarding flow is a diagnostic tool. Every manual step maps to a missing contract between systems. Every silent failure reveals absent validation. Every “I just know to do it this way” is a bus factor of one.
Run someone else’s manual flow. Not to prove them wrong — to discover what the automation doesn’t model. The errors you make aren’t your incompetence. They’re your architecture’s report card.