ferkakta.dev

I assumed model conversion worked like compilation. It doesn’t.

I sit at an M3 Max MacBook Pro that runs x86 Docker images through Rosetta. I’ve spent my career watching architecture boundaries dissolve.

So when the ML engineer told me the TensorRT model conversion had to run on the Jetson — the actual ARM edge device, not the x86 GPU server sitting next to it — I assumed it was cargo cult.

It wasn’t.

The pipeline

We were shipping an RT-DETR object detection model to NVIDIA Jetson Orin devices on drones.

The conversion path looked simple:

PyTorch → ONNX → TensorRT engine.

PyTorch to ONNX is portable. ONNX is a serialization format.

TensorRT is where portability ends.
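The last hop is a single trtexec invocation, and it has to run on the deployment GPU. A minimal sketch of a wrapper that assembles the command (the `--onnx`, `--saveEngine`, and `--fp16` flags are real trtexec options; the wrapper itself and the file names are placeholders, not our actual tooling):

```python
def trtexec_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list[str]:
    """Build the trtexec argv that turns an ONNX file into a TensorRT engine.
    The resulting engine is tuned to the GPU this command runs on."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",        # portable input
        f"--saveEngine={engine_path}" # hardware-married output
    ]
    if fp16:
        cmd.append("--fp16")  # allow half-precision kernels during profiling
    return cmd

print(trtexec_cmd("rtdetr.onnx", "rtdetr.engine"))
```

Run it on an A100 and you get an A100 engine. Run the identical command on the Orin and you get a different engine.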

Before I got involved, the deployment process was a sequence of SSH sessions and SCP commands. One engineer’s laptop was effectively the CI system.

I replaced it with a two-stage Ansible pipeline. Stage one runs on the x86 GPU server and produces the portable ONNX model. Stage two runs on the Jetson and builds the TensorRT engine on the device itself.

S3 is the handoff between the stages. Jenkins calls one playbook. The playbook spans two hosts and two architectures.
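A stripped-down sketch of the shape, assuming a hypothetical inventory and bucket (group names, paths, and the S3 URL are placeholders, not our real config):

```yaml
# Stage 1: x86 GPU server exports ONNX and hands off via S3
- hosts: x86_build
  tasks:
    - name: Export PyTorch model to ONNX
      command: python export_onnx.py --out rtdetr.onnx
    - name: Upload ONNX to the S3 handoff bucket
      command: aws s3 cp rtdetr.onnx s3://models/rtdetr.onnx

# Stage 2: Jetson pulls ONNX and builds the engine on-device
- hosts: jetson
  tasks:
    - name: Download ONNX from S3
      command: aws s3 cp s3://models/rtdetr.onnx /opt/models/rtdetr.onnx
    - name: Build the TensorRT engine on the target GPU
      command: trtexec --onnx=/opt/models/rtdetr.onnx --saveEngine=/opt/models/rtdetr.engine
```

Nothing in stage one ever touches TensorRT. Nothing in stage two ever touches PyTorch.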

Why the boundary exists

My mental model was wrong.

I was thinking of TensorRT as a compiler.

Take model in. Emit optimized binary out. Target whatever architecture you want.

That’s how GCC works. That’s how Rosetta works. Translate instructions and ship.

TensorRT doesn’t translate.

It profiles.

When trtexec converts an ONNX model to a TensorRT engine, it runs the layers on the actual GPU. For each layer it benchmarks hundreds of candidate kernel implementations: different algorithms, memory layouts, fusion strategies, and precision modes. Then it selects the fastest combination for that specific hardware.
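The profiling step is easier to believe once you sketch it. Here is a toy version in Python, nothing to do with TensorRT's real internals: time each candidate implementation of the same operation on a real input and keep the winner.

```python
import time

def pick_fastest(candidates, arg, trials=50):
    """Toy autotuner: benchmark each candidate implementation
    of the same operation and return the fastest one, the way
    TensorRT profiles kernel choices per layer."""
    best, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        for _ in range(trials):
            fn(arg)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = fn, elapsed
    return best

# Two interchangeable "kernels" computing the same result
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

winner = pick_fastest([sum_loop, sum_builtin], list(range(10_000)))
```

Which candidate wins depends on the machine running the benchmark. That dependence is the whole point: the selection is only valid on the hardware where the timing happened.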

The output isn’t:

“This model in TensorRT format.”

It’s:

“This model restructured around the tensor cores, memory hierarchy, and parallelism of this exact GPU.”

An engine built on an A100 encodes A100 assumptions: bandwidth, tensor core count, kernel latency. Deploy it on a Jetson Orin and those assumptions are wrong.

NVIDIA’s position is blunt:

TensorRT engine is not portable so it’s required to be compiled on the exact target.

The ONNX file is portable.

The TensorRT engine is hardware-married.

That’s why the pipeline has two stages.

The Rosetta analogy is wrong

Rosetta preserves semantics and changes encoding.

TensorRT preserves intent and changes structure.

Two GPUs with different architectures get different operation graphs from the same ONNX input.

It’s not translation.

It’s specialization.

Asking “why can’t I build this on x86 for ARM?” is like asking “why can’t I benchmark on one machine and deploy on another?”

You can.

The numbers will just be wrong.

What I actually built

The Ansible pipeline was my first Ansible project.

Deployment went from thirty minutes of SSH sessions and SCP commands — and only one person knew the steps — to under ten minutes with a single command that anyone on the team could run.

Beyond the playbooks, the inventory defines two host groups: the x86 GPU server that produces the ONNX model, and the Jetson devices that build their own engines.

Same SSH configuration. Different architectures. Explicit boundary.
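In inventory terms, something like this (a hedged sketch; group names, host names, and the SSH user are placeholders, not our real fleet):

```ini
# x86_64 GPU server: PyTorch -> ONNX
[x86_build]
gpu-server-1

# aarch64 Jetson Orin devices: ONNX -> TensorRT engine
[jetson]
drone-jetson-1

# Same SSH user and keys for both groups
[all:vars]
ansible_user=deploy
```

The group split is the architecture boundary, written down where everyone can see it.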

The fleet of drones never materialized.

The pipeline paid for itself anyway.

The lesson

The system behaved correctly all along.

My mental model was wrong.

TensorRT’s non-portability isn’t a tooling gap. It’s the mechanism that makes it fast.

The hardware-specific optimization is the product.

I assumed the architecture boundary was artificial.

It’s physical.

#ansible #ml-ops #tensorrt #jetson #nvidia #edge