ferkakta.dev

The TRUNCATE pattern we trusted in Redshift was broken

I was working in the data acquisition nerve center of a multistrategy hedge fund — a large Airflow estate running against a Redshift cluster. The architecture had a materialization layer — Spectrum tables backed by S3 were cheap to store but expensive to query, so we maintained _latest snapshot tables in Redshift — the hot layer that portfolio managers, quants, and sector analysts actually queried before market open. Every time a pipeline ingested new data, it refreshed its snapshot with the same pattern:

BEGIN TRANSACTION;
TRUNCATE TABLE schema.prices_latest;
INSERT INTO schema.prices_latest SELECT * FROM spectrum.prices_raw;
COMMIT TRANSACTION;

Wipe the snapshot, repopulate it from Spectrum, and do it atomically so nobody sees an empty table. Every data engineer I worked with wrote it this way. It is the obvious pattern. It looks correct.

It was not correct. One pipeline failure left a snapshot table empty with no way to roll back, and that sent me back to the documentation.

The transaction was not doing what we thought

TRUNCATE in Redshift is DDL, not DML. According to the Redshift documentation, DDL implicitly commits the current transaction. That means when our pattern ran:

  1. BEGIN — opens a transaction
  2. TRUNCATE — implicitly commits the open transaction, then executes as its own auto-committed operation
  3. INSERT — runs in a new, separate transaction
  4. COMMIT — commits only the INSERT; the TRUNCATE is already permanent

The snapshot table is permanently empty after step 2. If the INSERT fails — bad source data, a Spectrum timeout, a cluster restart — the table stays empty. There is no rollback. The BEGIN/COMMIT wrapper is decoration. The atomicity we were counting on does not exist.
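The failure mode is easy to reproduce in miniature. Here is a toy model of the session semantics — not a Redshift client, just an illustration of where the commit boundaries actually fall; all names are hypothetical:

```python
# Toy model of Redshift statement semantics: DDL (TRUNCATE) commits the
# open transaction immediately, while DML (INSERT) can still roll back.
class ToySession:
    def __init__(self, rows):
        self.committed = list(rows)   # durable table contents
        self.pending = None           # uncommitted transaction state, if any

    def begin(self):
        self.pending = list(self.committed)

    def truncate(self):
        # DDL: implicitly commit whatever is open, then auto-commit the wipe.
        self.pending = None
        self.committed = []

    def insert(self, rows, fail=False):
        if self.pending is None:
            self.begin()              # statement opens a fresh transaction
        if fail:
            self.rollback()
            raise RuntimeError("INSERT failed")
        self.pending.extend(rows)

    def commit(self):
        if self.pending is not None:
            self.committed = self.pending
            self.pending = None

    def rollback(self):
        self.pending = None

session = ToySession(rows=["old snapshot row"])
session.begin()
session.truncate()                    # already permanent at this point
try:
    session.insert(["new row"], fail=True)
except RuntimeError:
    pass
print(session.committed)              # → [] — the snapshot is gone for good
```

Swap `fail=True` for `fail=False` and everything looks fine, which is exactly why the bug survived so long: the pattern only betrays you when the INSERT dies.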

The deadlock monitor was telling us the whole time

The operational symptom was worse than the occasional empty table. We had a deadlock monitor DAG that queried admin.v_get_blocking_locks every fifteen minutes and paged oncall via PagerDuty if any session was blocked for more than thirty minutes. It was firing regularly. Analysts complained about slow queries on the snapshot tables — the same tables that existed specifically to be fast. Someone would check the Redshift console, see lock waits, and blame cluster load.
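The paging decision in that DAG was roughly this shape — a sketch, not our actual code; the view name admin.v_get_blocking_locks is real to our environment, but the row layout and function names here are illustrative:

```python
from datetime import datetime, timedelta

# Page when a session has been blocked longer than this.
BLOCK_THRESHOLD = timedelta(minutes=30)

def sessions_to_page(blocking_rows, now):
    """Given rows from a blocking-locks view (pid, blocked_since, table),
    return the sessions blocked for longer than the paging threshold."""
    return [
        row for row in blocking_rows
        if now - row["blocked_since"] > BLOCK_THRESHOLD
    ]

now = datetime(2023, 5, 1, 9, 0)
rows = [
    {"pid": 101, "blocked_since": now - timedelta(minutes=45), "table": "prices_latest"},
    {"pid": 102, "blocked_since": now - timedelta(minutes=5),  "table": "fx_latest"},
]
print([r["pid"] for r in sessions_to_page(rows, now)])  # → [101]
```

The monitor was doing its job. The mistake was interpreting its output as "cluster load" rather than "a specific lock being held by a specific statement."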

Oncall is not just firefighting. It’s a systems architecture dojo for people who can extrapolate patterns and harvest them. I traced the repeated blocking episodes back to these refreshes. TRUNCATE acquires an ACCESS EXCLUSIVE lock, and in our environment the lock contention repeatedly lined up with the duration of the INSERT INTO ... SELECT from Spectrum, which reads from S3-backed storage rather than Redshift’s own optimized columnar engine. These were the slowest queries in the pipeline, and while they ran, the most popular tables in the warehouse were locked for reads.

I identified the bug class, explained the failure mode and the blocking behavior, and walked my boss through it. He saw through to the operational impact — lock contention hammering the trading floor — and assigned a contractor to replace the pattern across a batch of affected DAGs.

Moving the slow work off the live table

The replacement pattern keeps the expensive Spectrum query away from the live table entirely:

DROP TABLE IF EXISTS schema.prices_latest_new;
CREATE TABLE schema.prices_latest_new (LIKE schema.prices_latest);

INSERT INTO schema.prices_latest_new
SELECT * FROM spectrum.prices_raw;

BEGIN TRANSACTION;
ALTER TABLE schema.prices_latest RENAME TO prices_latest_old;
ALTER TABLE schema.prices_latest_new RENAME TO prices_latest;
COMMIT TRANSACTION;

DROP TABLE IF EXISTS schema.prices_latest_old;

The INSERT INTO ... SELECT from Spectrum — the slow part, reading from S3 instead of Redshift — writes into prices_latest_new, which nobody is reading. No locks on the live table. The transaction only wraps two ALTER TABLE RENAME operations. The live snapshot table is unavailable for the duration of two metadata pointer flips, not the duration of a cross-service scan from S3.

This time the atomicity is real — if the insert fails, the live table is untouched. If a rename fails, the transaction rolls back and the live table keeps its original name. This pattern was right for our snapshot tables. Rename-swap has caveats in other contexts, especially where downstream behavior depends on object identity rather than just the table name. For our _latest materialization layer, those tradeoffs were acceptable.

Five months of the same refactor

We baked the rename-swap pattern into our Airflow plugins so new DAGs got it by default. The old ones were the problem — retrofitting the pattern across a large pipeline estate proceeded fitfully, with the team occasionally getting the transaction boundary subtly wrong in ways that reintroduced the lock. It took months.
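The plugin default was essentially a statement generator along these lines — a sketch, with illustrative function and argument names, not our actual plugin code:

```python
def rename_swap_statements(schema, table, select_sql):
    """Build the rename-swap refresh for schema.table.

    The expensive SELECT lands in a staging table nobody reads; only the
    two metadata-level RENAMEs run inside the transaction."""
    live = f"{schema}.{table}"
    return [
        f"DROP TABLE IF EXISTS {live}_new;",
        f"CREATE TABLE {live}_new (LIKE {live});",
        f"INSERT INTO {live}_new {select_sql};",
        "BEGIN TRANSACTION;",
        f"ALTER TABLE {live} RENAME TO {table}_old;",
        f"ALTER TABLE {live}_new RENAME TO {table};",
        "COMMIT TRANSACTION;",
        f"DROP TABLE IF EXISTS {live}_old;",
    ]

stmts = rename_swap_statements(
    "schema", "prices_latest", "SELECT * FROM spectrum.prices_raw"
)
```

Centralizing the statement list mattered more than it sounds: the recurring retrofit mistake was widening the transaction to include the INSERT, which quietly reintroduced the long-held lock. A generator makes the transaction boundary a reviewed, tested artifact instead of something each DAG author re-derives.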

TRUNCATE is not DELETE

TRUNCATE feels like DELETE FROM with no WHERE clause, and that mental model is why the pattern persists. In Redshift, DELETE is DML — it permits concurrent reads. TRUNCATE is DDL — it implicitly commits the current transaction and acquires a lock that blocks everything. The resulting BEGIN; TRUNCATE; INSERT; COMMIT pattern looks natural enough that it shows up in documentation and Stack Overflow answers without much scrutiny. And the materialization layer made the consequences worse — the _latest tables were the most queried tables in the warehouse, and every refresh blocked readers for the duration of a Spectrum scan. The architecture that made queries fast was also the one that made refresh mistakes expensive.

The diagnosis was mine

I found a bug class, named it clearly enough that the organization could act on it, and helped turn it into a broader remediation effort. Then I moved on while other engineers implemented. The commits are theirs. The diagnosis was mine. That’s the part I kept.

#redshift #data-engineering #platformengineering