
Kubernetes Monitoring Helm chart v4: Biggest update ever!
The Kubernetes Monitoring Helm chart is the easiest way to send metrics, logs, traces, and profiles from your Kubernetes clusters to Grafana Cloud (or a self-hosted Grafana stack). And version 4.0 is the biggest update the chart has ever received.
Representing nearly six months of planning and development, it's designed to solve real pain points that users have hit as their monitoring setups have grown. The result is a chart that's more predictable, more flexible, and much easier to maintain, whether you manage one cluster or a hundred.
If terms like "Helm values file" or "DaemonSet" sound intimidating, don't worry. This post walks through every major change in plain language, explains the problem it solves, and shows what the fix looks like in practice.
Destinations
Destinations are where your telemetry data is sent, like a Prometheus server for metrics or a Loki instance for logs.
The issue
In v3, destinations were defined as a list:
```yaml
# v3
destinations:
  - name: prometheus
    type: prometheus
    url: https://prometheus.example.com/api/v1/write
  - name: loki
    type: loki
    url: https://loki.example.com/loki/api/v1/push
```
This caused real headaches for teams that:
- Manage multiple clusters, where you have one common config file that all clusters share and then smaller files that tweak settings for each individual cluster (like different credentials or cluster names). With lists, those smaller files can't change a single property of one destination; they have to redefine the entire list, wiping out everything from the shared file.
- Deploy with GitOps tools like Argo CD, Terraform, or Flux. Overriding a single field (say, a password) requires referencing items by their position in the list: --set destinations[0].auth.password=secret. Here, [0] just means "the first item in the list," not "the Prometheus destination." If someone later reorders the list so Loki comes first, that override silently applies to the wrong destination.
- Split configuration across multiple files. When Helm combines two files that both define the same list, it doesn't merge them together. Instead it throws away the first list and keeps the second. So if your shared file defined two destinations and your per-cluster file only wanted to change a password on one of them, you'd have to redefine both destinations in full or lose the other one entirely.
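To make the last point concrete, here's a sketch of the failure mode (file names and the password value are hypothetical). Because Helm replaces lists wholesale rather than merging them, the override file below wipes out the Loki destination:

```yaml
# common.yaml -- shared across all clusters
destinations:
  - name: prometheus
    type: prometheus
    url: https://prometheus.example.com/api/v1/write
  - name: loki
    type: loki
    url: https://loki.example.com/loki/api/v1/push

# cluster-a.yaml -- only meant to change one password, but...
destinations:
  - name: prometheus
    auth:
      password: cluster-a-secret
# ...the second list replaces the first entirely: the merged result has
# one incomplete Prometheus destination and no Loki destination at all.
```

Installing with `-f common.yaml -f cluster-a.yaml` leaves you with only the partial Prometheus entry, which is exactly the silent data loss described above.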
The solution
In v4, destinations are defined as a map instead of a list. In a list, items are identified by their position (first, second, third). In a map, each item has a unique name that you choose, and that name never changes regardless of what order things appear in.
Here's what the same two destinations look like in v4:
```yaml
# v4
destinations:
  prometheus:
    type: prometheus
    url: https://prometheus.example.com/api/v1/write
  loki:
    type: loki
    url: https://loki.example.com/loki/api/v1/push
```
Notice that prometheus and loki are now names at the top level, not items in a numbered list. If you need to override a single property, you refer to the destination by name:
--set destinations.prometheus.auth.password=secret
That path reads clearly ("set the password on the Prometheus destination"), and it won't break if you add, remove, or reorder destinations later. When you split configuration across multiple files, Helm can merge maps together intelligently, combining properties from each file rather than throwing one away.
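The multi-file scenario now works the way you'd expect. A sketch of the same shared-plus-override split (file names hypothetical):

```yaml
# common.yaml -- shared across all clusters
destinations:
  prometheus:
    type: prometheus
    url: https://prometheus.example.com/api/v1/write
  loki:
    type: loki
    url: https://loki.example.com/loki/api/v1/push

# cluster-a.yaml -- overrides just one property
destinations:
  prometheus:
    auth:
      password: cluster-a-secret
# Helm deep-merges maps, so the result keeps both destinations and
# only the Prometheus password changes.
```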
Collectors
Each collector is an instance of Alloy running in your cluster. You might have one collector scraping metrics, another reading log files, and another handling one-off jobs such as cluster events.
The issue
In v3, the chart came with hard-coded collector names like alloy-metrics, alloy-logs, and alloy-singleton. Each name was tied to a specific deployment type (StatefulSet, DaemonSet, single-replica Deployment). This caused friction when:
- You were confined by the chart's predefined layout even if you wanted a different topology (like one DaemonSet that handles both logs and metrics).
- You wanted to add a collector for a new purpose but couldn't because the set of names was fixed.
- The names didn't tell you how a collector was deployed, so you had to dig into the docs to understand what you were getting. For example, alloy-metrics: Is this a StatefulSet? Clustered? How many replicas?
- When you enabled a feature such as clusterMetrics, the chart decided on its own which collector to put it on (for example, routing clusterMetrics to alloy-metrics). You never saw this in your configuration file. That routing was buried in the chart's internal code, so understanding which feature ran where meant reading source code rather than your own config.
The solution
In v4, collectors are a map you define. You pick the names, and you apply one or more presets that describe the deployment shape:
```yaml
# v4
collectors:
  metrics-collector:
    presets: [clustered, statefulset]
  logs-collector:
    presets: [filesystem-log-reader, daemonset]
  events-collector:
    presets: [singleton]
```
The chart includes seven presets:
| Preset | What it does |
|---|---|
| `clustered` | Enables Alloy clustering so replicas share scrape targets |
| `statefulset` | Deploys as a StatefulSet |
| `daemonset` | Deploys one instance per node |
| `deployment` | Deploys as a standard Deployment |
| `singleton` | Ensures only a single replica runs |
| `filesystem-log-reader` | Mounts the node's `/var/log` directory for reading container log files |
| `privileged` | Grants elevated permissions (needed for some profilers) |
You can combine multiple presets on a single collector, and their effects stack. For example, [clustered, statefulset] gives you a StatefulSet with Alloy clustering turned on. That's exactly what the old alloy-metrics was configured as internally, but now you can see it clearly in your config instead of needing to look it up.
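Written out, reproducing the old v3 default is a few lines, with a name of your choosing:

```yaml
collectors:
  alloy-metrics:               # any name you like
    presets: [clustered, statefulset]
```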
Features are explicitly assigned to a collector, so there's no hidden magic:
```yaml
clusterMetrics:
  enabled: true
  collector: metrics-collector

podLogsViaLoki:
  enabled: true
  collector: logs-collector
```
If your setup only has a single collector, you don't need to add collector: to every feature. The chart sees there's only one option and uses it for everything automatically. But if you define two or more collectors, the chart won’t guess which one you want for each feature. If you forget to specify, it will give you a message telling you which feature still needs to be assigned to a collector rather than silently picking one for you.
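A minimal single-collector setup might look like this (the collector name is hypothetical):

```yaml
collectors:
  main:
    presets: [clustered, statefulset]

clusterMetrics:
  enabled: true   # no collector: needed; "main" is the only candidate
```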
Telemetry services
Telemetry services are additional applications that run in your cluster to generate monitoring data. For example, Node Exporter collects hardware and OS metrics from your Linux nodes, kube-state-metrics tracks the state of Kubernetes objects like pods and deployments, and OpenCost calculates the cost of running your workloads. These services produce the raw data that your collectors then pick up and send to your destinations.
The issue
In v3, when you enabled a feature like clusterMetrics, this would silently deploy backing services such as Node Exporter, kube-state-metrics, and OpenCost behind the scenes. That was convenient if you were starting from scratch, but caused problems if:
- Your cluster already had Node Exporter running, and you'd get a duplicate deployment with no warning.
- You wanted the backing service without the Alloy configuration (or vice versa), but they were inseparable.
- You wanted fine-grained control over which services were deployed, but the chart bundled them all together under one feature flag.
The solution
In v4, deploying those services is a separate, explicit step under the telemetryServices key:
```yaml
# v4
telemetryServices:
  kube-state-metrics:
    deploy: true
  node-exporter:
    deploy: true
  opencost:
    deploy: true
```
If you already have one of these services running in your cluster, you skip the deploy and just point the chart to the existing instance:
```yaml
telemetryServices:
  node-exporter:
    deploy: false   # don't deploy a new one

hostMetrics:
  enabled: true
  linuxHosts:
    enabled: true
    namespace: monitoring
    labelMatchers:
      app.kubernetes.io/name: prometheus-node-exporter
```
No more surprise deployments. You can mix and match: deploy some services via the chart and reuse others that already exist. The chart's built-in validation will tell you if a feature needs a service you haven't enabled or pointed at yet, so you're never left guessing.
Cluster metrics
Cluster metrics are the numbers that tell you how your Kubernetes cluster is doing: how much CPU and memory your containers are using, how many pods are running or failing, whether nodes are healthy, and so on. These come from sources like the Kubelet (the agent on each node), cAdvisor (which tracks container resource usage), and kube-state-metrics (which reports on the state of Kubernetes objects like Deployments, Jobs, and Pods).
The issue
The v3 clusterMetrics feature was overloaded. When enabled, it configured collection for Kubernetes cluster metrics, Linux host metrics (Node Exporter), Windows host metrics (Windows Exporter), energy metrics (Kepler), and cost metrics (OpenCost), all at once. This caused issues when:
- You only cared about Kubernetes metrics but didn't want host or cost metrics. You had to hunt through a large values file to disable what you didn't need.
- You wanted to manage host metrics independently from cluster metrics, but they were tangled together in the same feature.
- The values file was cluttered with options for all five concerns, making it hard to find the setting you actually needed.
The solution
V4 splits this into three focused features, each with its own values file:
| v4 feature | What it covers |
|---|---|
| `clusterMetrics` | Kubelet, cAdvisor, kube-state-metrics, control plane |
| `hostMetrics` | Linux hosts (Node Exporter), Windows hosts (Windows Exporter), energy (Kepler) |
| `costMetrics` | OpenCost (and, in the future, cloud cost exporters) |
Each feature's configuration only shows the options relevant to that concern, nothing else. The separation also opens the door for future improvements. For example, a future release could let you choose between Node Exporter and Alloy's built-in host metric collection with a single toggle, potentially eliminating a deployment entirely.
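A sketch of enabling the three features independently (clusterMetrics and hostMetrics appear elsewhere in the chart's values; costMetrics is an assumed name for the cost feature):

```yaml
clusterMetrics:
  enabled: true

hostMetrics:
  enabled: true
  linuxHosts:
    enabled: true

costMetrics:
  enabled: true
```

Each block stands alone, so disabling cost metrics is one line instead of a hunt through an overloaded clusterMetrics section.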
Pod logs
Pod logs are the text output that your applications write while they're running, like "user logged in," "request failed with error 500," or "database connection timed out." Every container in your cluster produces these logs, and collecting them in a central place is one of the most common reasons to set up monitoring.
The issue
The v3 podLogs feature had a gatherMethod flag that completely changed how it behaved. Depending on the value, you were either collecting logs through Loki's pipeline or through OpenTelemetry's filelog receiver. This was confusing because:
- The values file mixed configuration for both methods together. Options were littered with notes like "this only applies if gatherMethod is X," making it hard to know which settings were relevant to you.
- The OpenTelemetry path was needlessly wasteful. It collected logs in OpenTelemetry format, translated them into Loki format, and then translated them back into OpenTelemetry format for delivery. Attributes were lost along the way.
- Switching between methods meant changing a single flag and hoping the rest of your configuration still made sense for the new method.
The solution
V4 replaces this with two separate features: podLogsViaLoki, which collects logs through Loki's pipeline, and a second feature that collects them through OpenTelemetry's filelog receiver. Each maps to one of the old gatherMethod modes of the v3 podLogs feature.
Each feature's values file only shows options that apply to that method. The OpenTelemetry path now delivers logs natively in OTLP format with no round-trip translation, so all original attributes are preserved.
Pod log labels
Every pod in Kubernetes can have labels (like app: frontend or team: payments) and annotations attached to it. When collecting logs, it's useful to carry some of those labels along with each log line so you can later filter or search by them. In v3, labelsToKeep was a configuration option that controlled which of those labels survived the collection process.
The issue
In v3, the pod logs pipeline would take every Kubernetes pod label and annotation and turn them into log labels, then use a labelsToKeep list to filter down to just the ones you wanted. This caused three distinct problems:
- Customization was painful. If you needed to keep one extra label, you had to redefine the entire default list (about 12 items) and add yours at the end, the same list-override problem that plagued destinations.
- Memory usage spiked. Alloy allocated labels for every annotation on every pod, potentially hundreds, only to throw most of them away moments later. Users reported their log-collecting Alloy instances struggling with memory, traced directly to this bulk-label behavior.
- The config was unintuitive. You didn't opt in to labels you wanted; you started with everything and had to figure out which ones to keep.
The solution
In v4, labelsToKeep is removed entirely. Pod annotations and labels are no longer bulk-applied. Instead, you explicitly declare which pod labels and annotations you want promoted to log labels:
```yaml
podLogsViaLoki:
  enabled: true
  labels:
    - app
    - team
  annotations:
    - release-version
```
Only the labels you ask for are created. Memory usage drops because Alloy never allocates the ones you don't need. Adding a label is a one-line change, no list redefinition required.
Profiling
Profiling is a way to look inside your running applications to see exactly where they're spending their time and resources. While metrics tell you that something is slow or using too much memory, profiling tells you why, down to the specific function or line of code. The chart supports three types:
- eBPF profiling, which uses a Linux kernel feature to observe applications without modifying them
- Java profiling, which hooks into the Java runtime
- pprof, which collects profiles from applications that expose a standard profiling endpoint (common in Go applications)
The issue
The v3 profiling feature bundled three profiler types together: eBPF, Java, and pprof. Enabling profiling deployed all of them, even if you only needed one. This was wasteful because:
- eBPF and Java profilers require elevated permissions and dedicated resources, but you'd get them even if you only wanted lightweight pprof collection.
- There was no way to run just one profiler without also deploying the others.
- The profiling collector was a separate, dedicated deployment. You couldn't share it with other features.
The solution
In v4, you enable the profiling feature and then separately enable just the profilers you need:
```yaml
profiling:
  enabled: true
  ebpf:
    enabled: true
  pprof:
    enabled: true
  java:
    enabled: false
```
This approach is better in three ways:
- Only the profilers you enable are deployed, so lightweight pprof collection no longer drags the others along with it.
- Elevated permissions and dedicated resources are only required when you actually enable the eBPF or Java profilers.
- Profiling is assigned to a collector like any other feature, so it no longer demands its own dedicated deployment.
Migrating from previous versions
A migration tool is available at grafana.github.io/k8s-monitoring-helm-migrator/. Drop in your current values file and get a v4-compatible values file back. The tool handles the structural changes: converting lists to maps, splitting features, and mapping the old named collectors to the new preset-based system.
All of the chart's examples on the Grafana documentation site and in the repository's examples directory have been updated to reflect the v4 format.
Summary
Every change in v4 follows the same pattern: identify a real pain point (fragile list overrides, hidden deployments, overloaded features, wasted memory) and restructure the chart to eliminate it.
Ready to get started? Check out the migration tool, browse the updated examples, or dive into the chart documentation.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!