The Redshift Bug I Mistook for Postgres

2026-03-29

The page looks simple enough that I think I understand it immediately. A snapshot table is empty after a failed refresh.

I am working in the data acquisition side of a place with an observatory culture I still miss. Data management, data acquisition, quant researchers, and portfolio managers are in real partnership there. The snapshot tables matter because they sit between raw ingestion and people trying to understand the world a little sooner than everybody else.

I am on call, staring at SQL that is dressed like it should be safe. There is a BEGIN. There is a COMMIT. There is a failure in the middle. If those words mean what I think they mean, the table should still be full. It is not.

The shape that keeps pulling my eyes back is the same four-statement ritual:

BEGIN;
TRUNCATE TABLE some_schema.some_table;
INSERT INTO some_schema.some_table
SELECT ...
COMMIT;

If you learn this pattern in Postgres-adjacent systems, it reads as careful. It says somebody thought about atomicity. It tells me I am looking at an honest attempt at safety.

That is why it takes me longer than it should.

In Redshift, TRUNCATE is not participating in the transaction the way my brain expects. It commits. The BEGIN around it is not protecting the statement I care about.

Once I see that boundary correctly, the failure stops being mysterious. The table is not empty because rollback is broken. The table is empty because rollback never owned the TRUNCATE in the first place.

The boundary that matters

I remember this one so clearly because the system stops feeling weird the second the mental model snaps into place. Redshift is behaving exactly as designed. I am the one importing the wrong semantics.

That is the part I still like about the story. I do not outsmart a broken database. I realize I have been trusting the wrong promise.

The safer shape is to separate the destructive load-table reset from the live-table transactional work. In practice it looks more like this:

TRUNCATE TABLE some_schema.load_table;

COPY some_schema.load_table
FROM 's3://bucket/path'
IAM_ROLE 'arn:aws:iam::123456789012:role/example'
FORMAT CSV;

BEGIN;
DELETE FROM some_schema.live_table
USING (
  SELECT DISTINCT business_key
  FROM some_schema.load_table
) AS load
WHERE some_schema.live_table.business_key = load.business_key;

INSERT INTO some_schema.live_table
SELECT *
FROM some_schema.load_table;
COMMIT;

That still leaves a caveat around the load table itself. If the COPY fails after the TRUNCATE, the load table is empty. What it does not do is pretend the live-table refresh is atomic when it is not.

That distinction matters a lot when the table is part of a hot snapshot layer and you are the person getting paged when it disappears out from under downstream jobs.

Once I can explain the mechanics cleanly, the whole conversation changes shape. I am not waving at one ugly page anymore. I am naming a bug class. I can say where the transaction boundary actually is, why the familiar pattern is false comfort in Redshift, and what kind of rewrite avoids the lie.

My manager sees it quickly. That is the part I remember more than any rollout detail. The issue goes from on-call annoyance to accepted engineering problem. After that, remediation gets assigned across a batch of DAGs, and the fixes spread the way they usually do in a big Airflow estate: unevenly, with a few earlier narrow fixes, then a broader sweep, then a long tail of follow-up work.

Part of why this stayed with me is that on-call never felt like a tax to me. A lot of people around me, many of them engineers I admired, experienced it as cleanup duty, fixing somebody else’s DAG while trying to protect the work they were already carrying. I never quite had that reaction. I came into this kind of engineering by the long way around, through Linux administration, shared hosting resentment, and the steady realization that if I cared about a system, I wanted root on it. I wanted to know where the boundary actually was. I wanted the real shared vocabulary of Unix, not the managed, abstracted version of it. For me, on-call was a systems architecture dojo. When a page forced me down into the machinery, it energized me. It was not that they were wrong and I was right. It was that the same incident offered me something different. I saw oversized gains hiding inside an ugly incident.

What I am actually keeping

Years later, the interesting thing to me is not the employer or the repo. It is the way one word brings the whole memory graph back. For weeks I know I am trying to remember a Redshift thing I figured out while on call, something real enough that I had convinced my manager and set cleanup work in motion, but I cannot get the shape of it back into focus. Then I see another data engineer mention TRUNCATE in a LinkedIn post, and the whole story comes back at once.

TRUNCATE is the retrieval key.

I see that word, and I remember the page, the shape of the SQL, the exact feeling of realizing that BEGIN does not mean what I think it means in that engine, and the relief of being able to explain the problem in one sentence to somebody who can do something about it. Years later I rerun the exact four-statement ritual I remembered, this time on Redshift Serverless. Same outcome. The INSERT fails, the table stays empty, and the wrapper around it is decorative.

I also remember what this kind of work used to do to my memory. Before I started writing in public, I had almost no durable record of my own technical life. My understanding lived inside systems I did not own, tickets I would lose access to, repos that would dissolve into hearsay the second I left. If a keyword did not bring the story back, it was gone.

That is a bad bargain for an engineer. The most important artifact is not the proprietary code. It is the corrected mental model.

This bug gives me a clean example. I do not need a forensic reconstruction of every later change to know what mattered. I find that BEGIN / TRUNCATE / INSERT / COMMIT is not atomic in Redshift. I connect it to real operational pain. I explain it well enough that management acts.

What I keep from it is small, and durable. Systems behave correctly more often than I do. When a database surprises me now, I ask earlier whether I am looking at engine misbehavior, or at a mental model I imported from somewhere else and let harden into instinct.

#redshift #postgres #oncall #data-engineering #airflow