ferkakta.dev

Why we removed aws-for-fluent-bit from EKS

We deployed aws-for-fluent-bit because AWS recommends it.

If you follow the EKS logging documentation, that’s the default path. It assumes you use AWS’s distribution of Fluent Bit rather than the upstream Helm chart.

We did.

Two days later, we ripped it out.

The AWS chart and the upstream chart are not the same thing. The differences aren’t cosmetic. They affect how quickly you receive security patches, how transparently your configuration maps to the underlying plugin, and how many boundaries sit between your logs and the CloudWatch API.

That boundary turned out to matter.

The plugin path isn’t the same

The AWS chart routes logs through a Go-based cloudwatch output plugin – effectively a wrapper around the CloudWatch API.

The upstream chart uses the C-native cloudwatch_logs plugin directly.

Both send logs to CloudWatch. But one inserts an additional translation layer between your configuration and the API call.

When logs don’t arrive in the log group you expect, you want your config to map as directly as possible to the API. A wrapper means your configuration is rewritten before it reaches the code that actually makes the request.

That’s an extra hop.

The upstream chart is one hop from the API. The AWS chart is several.

When you’re debugging production logging, that difference isn’t abstract.
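To make the mapping concrete, here is a minimal upstream output using the cloudwatch_logs plugin. The parameter names are the plugin’s own; the comments relating each key to the CloudWatch Logs API are our annotation, not from the docs:

```ini
# Minimal cloudwatch_logs output; each key corresponds to a piece of the
# CloudWatch Logs API surface (annotations are ours):
[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            us-east-1       # which regional endpoint gets the call
    log_group_name    /apps/fallback  # logGroupName in PutLogEvents
    log_stream_name   fallback        # logStreamName in PutLogEvents
    auto_create_group false           # never issue CreateLogGroup
```

One config block, one plugin, one API client. There is no intermediate vocabulary to translate.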

Release cycles become your problem

In November 2025, five CVEs landed against Fluent Bit – tag spoofing, path traversal, input validation bypass, authentication bypass, and stack buffer overflow (CVE-2025-12972, CVE-2025-12978, CVE-2025-12977, CVE-2025-12969, CVE-2025-12970).

The upstream project patched within days.

AWS’s distribution lagged weeks behind.

When you run a vendor wrapper, you inherit the vendor’s release cycle.

In a multi-tenant platform, where one tenant’s log stream should never interfere with another’s, weeks is not a rounding error.

That’s another boundary – time.

Renaming isn’t neutral

Here’s what the AWS chart exposes:

cloudWatch:
  enabled: true
  region: us-east-1
  logGroupTemplate: "/apps/tenants/$kubernetes['namespace_name']"
  logStreamTemplate: "$kubernetes['pod_name']/$kubernetes['container_name']"
  autoCreateGroup: false

And here’s the upstream chart with raw Fluent Bit config:

config:
  outputs: |
    [OUTPUT]
        Name                cloudwatch_logs
        Match               kube.*
        region              us-east-1
        log_group_name      /apps/tenants/fallback
        log_group_template  /apps/tenants/$kubernetes['namespace_name']
        log_stream_name     unknown
        log_stream_template $kubernetes['pod_name'].$kubernetes['container_name']
        auto_create_group   false

The AWS chart renames every parameter.

auto_create_group becomes autoCreateGroup. log_group_template becomes logGroupTemplate.

Helm values conventionally use camelCase. The upstream plugin uses snake_case. So the AWS chart rewrites the names – and then translates them back before handing them to the plugin that only understands the original snake_case.

They repainted the horse to match their barn, then built a machine to unpaint it before the horse goes outside.

You now have to understand two vocabularies for the same concept. The only thing the second vocabulary buys you is consistent casing in a values file.

That’s cognitive latency for aesthetic symmetry.

The autoCreateGroup bug

In practice, the abstraction isn’t just cosmetic.

The AWS chart’s autoCreateGroup parameter had a known bug where the value was silently ignored. Fluent Bit would create log groups even when configured not to.

If your log groups are Terraform-managed – ours are, with retention policies, KMS encryption, and resource tags – a daemonset quietly creating unmanaged log groups is a drift factory.

You don’t notice immediately. You notice later, when your logging layer has started mutating infrastructure you believed was declarative.

That’s not a logging issue. That’s a boundary violation.
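For contrast, this is roughly what a Terraform-managed log group looks like (a sketch; the names, tenant, and KMS key reference are placeholders):

```hcl
# Hypothetical Terraform-managed log group. Retention, encryption, and
# tags are declared here -- any group the daemonset creates on its own
# will silently lack all three.
resource "aws_cloudwatch_log_group" "tenant_logs" {
  name              = "/apps/tenants/example-tenant"
  retention_in_days = 30
  kms_key_id        = aws_kms_key.logs.arn

  tags = {
    ManagedBy = "terraform"
    Tenant    = "example-tenant"
  }
}
```

When auto-creation fires despite `autoCreateGroup: false`, the resulting groups exist only in AWS, not in state, and every `terraform plan` thereafter is lying to you by omission.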

Upgrade stability matters more than features

Multiple GitHub issues document CrashLoopBackOff after upgrading the AWS chart.

When your logging daemonset is in CrashLoopBackOff, you’re not just missing logs – you’re blind during exactly the kind of incident where you need them.

The upstream chart’s upgrade path is simpler. Fewer layers. Fewer translations. Fewer places for behavior to diverge from expectation.

Again: fewer boundaries.

The upstream chart isn’t perfect

The upstream configuration is more direct, but it isn’t frictionless.

Two things will bite you:

The cloudwatch_logs plugin requires log_group_name and log_stream_name as mandatory fallbacks even when you use templates. Omit them and the plugin refuses to start.

Template variable separators can only be . or , – not /. Using / in log_stream_template crashes the record accessor parser. The error message does not tell you this. You discover it by reading the Fluent Bit source.
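Concretely, the difference is one character (a sketch based on the constraint above):

```ini
# Crashes the record accessor parser: '/' is not a valid separator
# log_stream_template $kubernetes['pod_name']/$kubernetes['container_name']

# Parses: '.' (or ',') between template variables
log_stream_template   $kubernetes['pod_name'].$kubernetes['container_name']
```

If you need `/` in the stream name for readability, you don’t get it from the template. You get `.` and you live with it.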

But these are direct constraints from the actual plugin.

There’s no translation layer rewriting your config before the plugin sees it.

When something fails, it fails at the real boundary.

The principle

When an upstream project ships the plugin and the vendor ships a wrapper around it, the wrapper adds distance.

Distance in release cycles. Distance in naming. Distance in debugging. Distance between your config and the API call.

The upstream chart is one hop from CloudWatch. The AWS chart is several.

Every hop is latency.

We chose the recommended default because it was recommended.

Two days later, we removed the wrapper.

Not because AWS is wrong.

Because fewer boundaries mean fewer surprises.

#fluent-bit #eks #aws #cloudwatch #logging #helm