I was guessing build times from my laptop. A spot instance that cost six-tenths of a cent proved me wrong.

I needed to know how fast a 16GB Docker image builds on Graviton. The image is a Docling container — document intelligence, heavy Python dependencies, big model downloads. I was sizing a self-hosted runner for CI and needed real numbers: build duration, peak memory, CPU profile. The numbers would determine whether I could get away with an r6g.large or needed to step up to an xlarge.

I did what everyone does first. I ran the build on my laptop.

The laptop lied

My M3 Max reported 5Gi peak memory usage during the build. Docker Desktop surfaces this through its VM — the number comes from the hypervisor, not from the build process itself. I took it at face value and started planning around 5Gi as the memory floor.

That number was wrong by 60%.

The M3 Max runs Docker inside a Linux VM. The VM has its own page cache, its own buffer management, its own memory accounting overhead. When Docker Desktop reports “5Gi used,” that includes the VM’s operating system footprint, the container runtime, the image layer cache in memory, and the actual build process. The build itself needs under 2Gi. The other 3Gi is the cost of running Linux-on-macOS-on-Apple-Silicon to get a Linux container.

I was about to size infrastructure based on a measurement that included 60% overhead from a hypervisor that wouldn’t exist in production. This is the kind of mistake that gets baked into Karpenter node pool configs and Kubernetes resource requests, where it quietly wastes money for months before anyone notices.

I asked the real hardware

I launched a spot instance. Graviton, Amazon Linux 2023, same architecture as the production runner. The user-data script installs Docker, starts the build, collects metrics, pushes results to S3, and terminates the instance. Total wall-clock time from launch to termination: under 15 minutes. Total cost: six-tenths of a penny.

The core of the harness is a bash script embedded in EC2 user-data. It does four things: installs instrumentation, runs the build, collects results, and cleans up after itself.

#!/bin/bash
set -ex

COLLECTION_INTERVAL=2
# IMDSv2: grab a session token first, then use it for every metadata call
TOKEN=$(curl -sf -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
REGION=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/region)
INSTANCE_TYPE=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-type)
INSTANCE_ID=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Install everything
dnf install -y --allowerasing sysstat docker curl
rpm -Uvh https://amazoncloudwatch-agent.s3.amazonaws.com/amazon_linux/arm64/latest/amazon-cloudwatch-agent.rpm || true
systemctl start docker

That’s the preamble. IMDSv2 for metadata, dnf because it’s AL2023, and the CloudWatch Agent RPM pulled directly from Amazon’s bucket. The || true on the RPM install is because the agent is occasionally pre-installed on newer AMIs and rpm -Uvh returns non-zero if the package already exists.

Watching live, querying after

I run two collection systems simultaneously: the CloudWatch Agent and sysstat. They measure the same system from different angles, and the marginal cost of running both is effectively zero.

The CloudWatch Agent streams metrics to a custom namespace in real time. Memory, CPU, disk I/O, network, swap — all at 2-second resolution. The moment the build starts, I can open a CloudWatch dashboard and watch memory climb. This is live observability. If something goes sideways — an OOM kill, a disk filling up, a network timeout — I see it while the instance is still running.

{
  "metrics": {
    "namespace": "BuildBench",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_used",
                        "mem_available", "mem_cached"],
        "metrics_collection_interval": 2
      },
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user",
                        "cpu_usage_system", "cpu_usage_iowait"],
        "metrics_collection_interval": 2,
        "totalcpu": true
      }
    },
    "append_dimensions": {
      "InstanceType": "${aws:InstanceType}",
      "InstanceId": "${aws:InstanceId}"
    }
  }
}
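
Feeding that config to the agent is one control-script call. A minimal sketch, assuming the standard paths from the RPM install (the config filename is my choice):

# Write the config where the agent can find it, then fetch-and-start
cat > /opt/aws/amazon-cloudwatch-agent/etc/buildbench.json <<'EOF'
{ ...the JSON config above... }
EOF
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/buildbench.json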

sysstat runs in parallel. The sa1 collector writes to a binary sa file every 2 seconds — the same binary format that sar reads. After the build finishes, I query it:

# Start collection in background
/usr/lib64/sa/sa1 ${COLLECTION_INTERVAL} 32767 &
SAR_PID=$!

# ... build runs, wrapped in `date +%s` timestamps to compute $DURATION ...

# After the build: extract everything
mkdir -p /tmp/results
sar -r ALL > /tmp/results/memory.txt
sar -u ALL > /tmp/results/cpu.txt
sar -d     > /tmp/results/disk.txt
sar -n DEV > /tmp/results/network.txt
sar -q     > /tmp/results/load.txt
sar -w     > /tmp/results/context-switches.txt

This is the Brendan Gregg methodology. The kernel instruments everything. You don’t write sampling loops. You don’t parse /proc/meminfo in a while true — you turn on the collection daemon that already knows how to do this, and you query it after the fact. sar -r for memory, sar -u for CPU, sar -d for disk. The data is already there, stored in a compact binary format designed for exactly this kind of forensic analysis.
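
The binary file also supports replay with time filters, which is handy when you only care about the build window. A quick example (the timestamps are illustrative):

# Replay memory samples between two wall-clock times from today's sa file
sar -r -f /var/log/sa/sa$(date +%d) -s 14:02:00 -e 14:14:00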

CloudWatch is for live dashboards. sysstat is for post-mortem deep dives. Running both costs nothing and gives me two independent views of the same system.

When it’s done, it deletes itself

The script computes a summary, pushes everything to S3, and kills itself:

# kbmemused is column 4 of `sar -r`; strip headers, sort numerically, keep the max
PEAK_USED_KB=$(sar -r | grep -v '^$\|Linux\|Average\|RESTART\|kbmem' \
  | awk '{print $4}' | sort -n | tail -1)
PEAK_USED_MB=$((PEAK_USED_KB / 1024))

cat > /tmp/results/summary.json <<EOF
{
  "instance_type": "$INSTANCE_TYPE",
  "build_duration_seconds": $DURATION,
  "peak_memory_used_mb": $PEAK_USED_MB,
  "cloudwatch_namespace": "BuildBench"
}
EOF

# RESULTS_BUCKET and RESULTS_PREFIX are templated into the user-data at launch
aws s3 cp /tmp/results/ "s3://${RESULTS_BUCKET}/${RESULTS_PREFIX}/" \
  --recursive --region "$REGION"
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --region "$REGION"

The instance needs an IAM instance profile with s3:PutObject, ec2:TerminateInstances, and cloudwatch:PutMetricData. That’s it. No SSH key, no security group ingress, no elastic IP. The instance launches, does its job, reports its findings, and disappears.
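
For reference, a minimal policy sketch. The bucket name is a placeholder, and in a real account you would likely scope ec2:TerminateInstances to tagged instances rather than "*":

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-results-bucket/*" },
    { "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*",
      "Condition": { "StringEquals": { "cloudwatch:namespace": "BuildBench" } } },
    { "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*" }
  ]
}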

Three machines, three answers

Instance         vCPUs  RAM    Build Time  Peak Memory  Spot Cost
r6g.large        2      16Gi   12m 00s     962 MB       $0.006
r6g.xlarge       4      32Gi   7m 52s      1,478 MB     $0.008
M3 Max (local)   16     128Gi  4m 39s      5,002 MB*    $0.00

* Docker Desktop VM overhead. The build itself uses around 1.6Gi.

The r6g.large — two vCPUs, 16Gi RAM — built the image in 12 minutes and peaked at 962MB. The r6g.xlarge cut the time to under 8 minutes with more memory headroom. Both cost less than a penny.

The laptop was faster in wall-clock time because it has 16 cores and NVMe storage. But the memory number it reported was fiction. If I’d set a Kubernetes resource request of 5Gi based on the Docker Desktop measurement, every build pod would have reserved 3.5Gi of memory it never used. Across a fleet of concurrent builds, that’s the difference between fitting on an r6g.xlarge and needing an r6g.2xlarge.
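
For concreteness, here is what a right-sized request might look like, based on the measured peak plus headroom. This is a sketch in JSON (which kubectl accepts as readily as YAML); the pod name and image are placeholders, not the real runner:

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": { "name": "image-build" },
  "spec": {
    "restartPolicy": "Never",
    "containers": [{
      "name": "builder",
      "image": "my-build-image:latest",
      "resources": {
        "requests": { "memory": "2Gi", "cpu": "2" },
        "limits": { "memory": "2Gi" }
      }
    }]
  }
}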

Four things bit me

User-data encoding. The AWS CLI run-instances command accepts --user-data as a string, but if you’re passing it from a file, you need the file:// prefix. If you’re using the console or CloudFormation, it needs base64 encoding. I wasted 10 minutes on a launch that ran but executed nothing because the script wasn’t actually delivered.
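
The working invocation from a file looks roughly like this; the AMI variable and profile name are mine:

# The CLI reads the file and base64-encodes it for you
aws ec2 run-instances \
  --image-id "$AL2023_ARM_AMI" \
  --instance-type r6g.large \
  --instance-market-options MarketType=spot \
  --iam-instance-profile Name=buildbench \
  --user-data file://userdata.sh

# CloudFormation and the raw API expect it pre-encoded instead
base64 -w0 userdata.sh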

Private repos. The Dockerfile and its build context live in a private repository. On a spot instance with no git credentials, you can’t git clone and you can’t curl a private raw URL. I embedded the Dockerfile directly in the user-data script as a heredoc. Ugly, but it works, and there’s no credential management to deal with.
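
The shape of that workaround, sketched with a stand-in Dockerfile (the real one is much bigger):

# Quoting the heredoc delimiter stops the shell from expanding $ inside it
mkdir -p /tmp/build && cd /tmp/build
cat > Dockerfile <<'DOCKERFILE'
FROM python:3.11-slim
RUN pip install docling
DOCKERFILE
docker build -t bench .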

Default security groups. A VPC’s default security group allows all outbound traffic — unless someone has modified it. I launched into a VPC where the default SG had been tightened, and the instance couldn’t reach Docker Hub or the CloudWatch Agent endpoint. The fix was specifying a security group with explicit egress rules. The symptom was a build that hung forever on dnf install with no error message.
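
A 30-second sanity check before launching, assuming you know the group ID:

# An empty list here means no egress rules -- dnf and Docker Hub will hang
aws ec2 describe-security-groups --group-ids "$SG_ID" \
  --query 'SecurityGroups[0].IpPermissionsEgress'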

Spot capacity. Larger Graviton instances in popular AZs run out of spot capacity. I requested an r6g.2xlarge in us-east-1a and got an InsufficientInstanceCapacity error. Dropping to r6g.xlarge or trying a different AZ worked immediately. For benchmarking, this doesn’t matter — you’re not running a fleet, you just need one instance for a few minutes.

It’s not really about Docker

This is not specific to Docker builds. The harness answers one question: “How fast does X run on Y, and how much memory does it need?” Swap the Dockerfile for a machine learning training script, a compiler benchmark, a database import, a video transcode. Swap the instance type for any hardware question — Graviton vs. Intel, GPU vs. CPU, io2 vs. gp3.

The instrumentation is the same regardless of workload. CloudWatch Agent streams live metrics. sysstat collects forensic data. The script pushes results and self-terminates. The cost is always fractions of a penny because spot instances bill by the second and the instance doesn’t survive past the measurement.

The tool is BuildBench. It takes an instance type, a build script, a collection interval, and an S3 bucket. It launches the spot instance, instruments it, runs the build, self-terminates, and writes a summary with duration, peak memory, and cost to S3. The harness handles failure cases too: an EXIT trap uploads cloud-init logs and a diagnostic file to a diagnostics/ prefix regardless of whether the build reached completion, so failed runs leave evidence rather than silence. The goal is to make “I don’t know, let me measure” faster than “I think it’s about 5Gi based on my laptop.”
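
The failure path is a few lines of bash. A sketch, assuming the variables from earlier (the exact filenames in the real script may differ):

# On any exit -- success or failure -- ship cloud-init's log as evidence
upload_diagnostics() {
  aws s3 cp /var/log/cloud-init-output.log \
    "s3://${RESULTS_BUCKET}/diagnostics/${INSTANCE_ID}.log" \
    --region "$REGION" || true
}
trap upload_diagnostics EXIT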

Stop guessing. Measure on the target. It costs less than a penny.

#aws #docker #benchmarking #graviton