ferkakta.dev

Per-Tenant CloudWatch Log Isolation on EKS, or: Why I Stopped Using aws-for-fluent-bit

The starting assumption

I’m building ramparts, a multi-tenant compliance platform running on EKS. Each tenant gets a Kubernetes namespace – tenant-acme, tenant-globex, whatever – and the compliance controls require that their application logs land in isolated storage with 365-day retention. CMMC maps this to AU-2 (audit events), AU-3 (audit content), AU-11 (retention), and AC-4 (information flow isolation). A tenant cannot read another tenant’s container output.

The obvious first move was aws-for-fluent-bit, AWS’s own Helm chart and container image for shipping logs to CloudWatch. AWS service, AWS chart, AWS logging destination. The blessed path.

I assumed that “blessed” meant “safe.” That assumption cost me a day.

What I found inside the blessed chart

Five CVEs dropped against Fluent Bit in November 2025 — tag spoofing, path traversal, input validation bypass, authentication bypass, and stack buffer overflow (CVE-2025-12972, CVE-2025-12978, CVE-2025-12977, CVE-2025-12969, CVE-2025-12970). The upstream project patched them quickly. AWS’s distro lagged weeks behind. For a platform targeting CMMC, that’s a direct SI-2 (flaw remediation) problem: you can’t claim timely patching when your log shipper trails upstream by weeks on security fixes.

But the CVE lag was the clean, obvious problem. The mechanical problems were worse.

The aws-for-fluent-bit chart wires its CloudWatch output to the Go-based cloudwatch plugin – a translation layer that marshals Fluent Bit records through Go’s runtime before hitting the AWS API. The upstream Fluent Bit chart uses cloudwatch_logs, the C-native plugin that makes direct API calls. Same destination, different codepath, different failure modes. Users on the AWS chart were reporting CrashLoopBackOff on upgrades. The chart also had a known bug where the auto_create_group parameter was silently ignored in certain configurations.
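The divergence is visible in a single line of the output stanza – the plugin name selects the whole codepath. A side-by-side sketch (the region is illustrative, and neither stanza is copied from either chart’s shipped defaults):

[OUTPUT]
    # aws-for-fluent-bit: Go plugin, records pass through the Go runtime
    Name    cloudwatch
    Match   kube.*
    region  us-east-1

[OUTPUT]
    # upstream chart: C-native plugin, direct CloudWatch API calls
    Name    cloudwatch_logs
    Match   kube.*
    region  us-east-1

The bug surface and the crash behavior follow from which of those two names you write.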

The compliance gap nobody talks about

That last one matters for compliance. AU-9 (audit integrity) says you protect the audit system from unauthorized modification. If your log shipper can silently create log groups that bypass your Terraform-managed retention policies, you have an integrity gap. I need Terraform to own the log group lifecycle, and I need auto_create_group=false to actually mean false.

I switched to the upstream chart from fluent.github.io/helm-charts. More config to write, but every parameter does what it says.
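The upstream chart takes the raw pipeline sections as strings in its Helm values, so the extra config is pasted straight into values.yaml. A minimal sketch of the shape, assuming the fluent/fluent-bit chart’s config block – the kubernetes filter is what populates the $kubernetes metadata the routing templates depend on, and the output body here is abbreviated:

config:
  filters: |
    [FILTER]
        Name       kubernetes
        Match      kube.*
        Merge_Log  On
  outputs: |
    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-east-1
        log_group_name    /apps/tenants/fallback
        auto_create_group false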

The config that nobody documents

Here’s where the real time went. The upstream cloudwatch_logs plugin supports template variables for dynamic log group and stream names – essential for multi-tenant routing. But the interaction between template parameters and their required fallback counterparts is not documented in any coherent way. I learned these rules from deploying, watching it crash, and reading other people’s GitHub issues.

Rule one: log_group_name is required even when you use log_group_template. It’s not a default that gets overridden – it’s a fallback the plugin crashes without.

Rule two: same for streams. log_stream_name or log_stream_prefix must be set even when log_stream_template is present.

Rule three – this one was the most frustrating: record accessor template variables can only be separated by . or ,. Not /. If you write $kubernetes['namespace_name']/$kubernetes['pod_name'], the parser throws “bad input character” and the pod enters CrashLoopBackOff. The / looks natural in a log path context, but the record accessor grammar doesn’t allow it between variables. The / belongs in the static parts of the template string.

The working output config:

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              us-east-1
    log_group_name      /apps/tenants/fallback
    log_group_template  /apps/tenants/$kubernetes['namespace_name']
    log_stream_name     unknown
    log_stream_template $kubernetes['pod_name'].$kubernetes['container_name']
    auto_create_group   false

log_group_name is /apps/tenants/fallback – a group that doesn’t exist and will never be created because auto_create_group is false. If a log record somehow bypasses the template (missing Kubernetes metadata, for instance), it fails loudly instead of silently landing in a catch-all group. That’s the behavior I want.

Routing only tenant traffic

Not every namespace is a tenant. Fluent Bit runs as a DaemonSet and sees everything on the node – kube-system, amazon-cloudwatch, the operator namespace, all of it. A grep filter restricts the output to tenant namespaces:

[FILTER]
    Name    grep
    Match   kube.*
    Regex   $kubernetes['namespace_name'] ^tenant-

This drops all log records where the namespace doesn’t match ^tenant-. Clean, no ambiguity. The IRSA policy enforces the same boundary from the AWS side – the ramparts-fluent-bit role can only write to /apps/tenants/*:

{
  "Sid": "CloudWatchLogsWrite",
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:DescribeLogStreams"
  ],
  "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/apps/tenants/*:*"
}

Even if someone tampers with the Fluent Bit config to remove the grep filter, the IAM boundary prevents writes outside the tenant log path. Defense in depth, applied to a log pipeline.
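Wiring that role up is ordinary IRSA Terraform. A sketch, assuming the cluster’s IAM OIDC provider already exists; the variable names and the service-account namespace are hypothetical, not from my actual module:

# Hypothetical variables; assumes an existing IAM OIDC provider for the cluster.
data "aws_iam_policy_document" "fluent_bit_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Bind the role to exactly one Kubernetes service account.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider}:sub"
      values   = ["system:serviceaccount:logging:fluent-bit"]
    }
  }
}

resource "aws_iam_role" "fluent_bit" {
  name               = "ramparts-fluent-bit"
  assume_role_policy = data.aws_iam_policy_document.fluent_bit_trust.json
}

The CloudWatchLogsWrite statement above then attaches to this role as an inline policy, and to nothing else.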

Terraform owns the log groups

Each tenant’s log group is created by the shared-tenants Terraform module, not by Fluent-bit:

resource "aws_cloudwatch_log_group" "app" {
  name              = "/apps/tenants/tenant-${var.tenant_name}"
  retention_in_days = 365

  tags = {
    Name    = "app-tenant-${var.tenant_name}"
    Purpose = "application-container-logs"
  }
}

365-day retention, set at creation time, enforced by Terraform state. Tenants get read-only access to their own log group through an IRSA policy attached to the bedrock service account – logs:DescribeLogStreams, logs:GetLogEvents, logs:FilterLogEvents, scoped to exactly one ARN. No logs:CreateLogGroup, no logs:PutLogEvents, no wildcard.
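The tenant-facing statement is the same shape as the write policy, scoped down to one group. A sketch for tenant-acme, with the same illustrative account ID:

{
  "Sid": "TenantLogsReadOnly",
  "Effect": "Allow",
  "Action": [
    "logs:DescribeLogStreams",
    "logs:GetLogEvents",
    "logs:FilterLogEvents"
  ],
  "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/apps/tenants/tenant-acme:*"
}

No Create, no Put, no wildcard past the tenant’s own group – a tenant can read its logs and nothing else.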

Why this actually matters outside compliance checkbox theater

Before per-tenant log isolation, debugging a tenant issue required kubectl exec and cluster admin credentials – a two-person dependency for what should be self-service. Now anyone with CloudWatch access to /apps/tenants/tenant-acme can tail live container logs from the console or CLI. No kubectl, no cluster credentials, no waiting for the one person who has them. The compliance requirement and the operational requirement turned out to be the same thing: isolated, accessible, durable logs.

What I’d tell someone starting this today

Use the upstream fluent-bit chart. Write the config yourself. The aws-for-fluent-bit chart optimizes for a quick start at the cost of control, and in multi-tenant compliance work, control is the whole point. Expect the template parameter documentation to be incomplete – set every fallback field, avoid / in record accessors, and test with auto_create_group false from the start so you find the crashes immediately instead of after you’ve shipped a log group naming scheme you can’t change.

#fluent-bit #eks #cloudwatch #multi-tenant #aws #cmmc #logging #helm