Reduce Kubernetes metrics usage
This guide describes some specific methods you can use to control your usage when shipping Prometheus metrics from a Kubernetes cluster.
Default deployments of preconfigured Prometheus-Grafana-Alertmanager stacks like kube-prometheus scrape and store tens of thousands of active series when launched into a K8s cluster.
A vanilla deployment of kube-prometheus in an unloaded 3-node cluster, configured to remote_write to Grafana Cloud, will count towards roughly 50,000 active series of metrics usage.
Using the methods in this guide, you can reduce this significantly, either by allowlisting the metrics you ship to Grafana Cloud, or by denylisting unneeded high-cardinality metrics.
If you followed the steps in Installing Grafana Agent on Kubernetes or installed the Kubernetes integration, your metrics usage should already be relatively low, as these are only configured to scrape the cadvisor and kubelet endpoints of your cluster nodes.
Enabling additional scrape jobs and shipping more metrics will increase active series usage.
If you’ve installed the kube-prometheus stack using Helm, please see Migrating a Kube-Prometheus Helm stack for a metrics allowlist specific to that stack.
Prerequisites
This guide assumes some familiarity with Kubernetes concepts and that you have a Prometheus deployment running inside your cluster, configured to remote_write to Grafana Cloud. To learn how to configure remote_write to ship Prometheus metrics to Grafana Cloud, please see Prometheus metrics.
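For reference, a minimal remote_write block in a plain prometheus.yml looks roughly like the following sketch. The endpoint URL, username, and password are placeholders; replace them with the values from your Grafana Cloud account.

# A minimal sketch, not a complete prometheus.yml.
# Replace the placeholder values with your Grafana Cloud Metrics endpoint and credentials.
remote_write:
- url: "<Your Metrics instance remote_write endpoint>"
  basic_auth:
    username: "<Your Metrics instance ID>"
    password: "<Your Grafana Cloud API key>"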
Steps to modify Prometheus’s configuration vary depending on how you deployed Prometheus into your cluster. This guide will use a default kube-prometheus installation with Prometheus Operator to demonstrate the metrics reduction methods. The steps in this guide can be modified to work with Helm installations of Prometheus, vanilla Prometheus Operator deployments, and other custom Prometheus deployments.
Deduplicating metrics data sent from high-availability Prometheus pairs
Note: Depending on the architecture of your metrics and logs collectors, you may not need to deduplicate metrics data. Be sure to confirm that you are shipping multiple copies of the same metrics before enabling deduplication.
This section shows you how to deduplicate samples sent from high-availability Prometheus deployments.
By default, kube-prometheus deploys 2 replicas of Prometheus for high-availability, shipping duplicates of scraped metrics to remote storage. Grafana Cloud can deduplicate metrics, reducing your metrics usage and active series by 50% with a small configuration change. This section implements this configuration change with the kube-prometheus stack. Steps are similar for any Prometheus Operator-based deployment.
Begin by navigating into the manifests directory of the kube-prometheus code repository. Locate the manifest file for the Prometheus Custom Resource, prometheus-prometheus.yaml.
Prometheus Custom Resources are created and defined by Prometheus Operator, a sub-component of the kube-prometheus stack.
To learn more about Prometheus Operator, please see the prometheus-operator GitHub repository.
Scroll to the bottom of prometheus-prometheus.yaml and append the following three lines:
replicaExternalLabelName: "__replica__"
externalLabels:
  cluster: "your_cluster_identifier"
The replicaExternalLabelName parameter changes the default prometheus_replica external label name to __replica__. Grafana Cloud uses the __replica__ and cluster external labels to identify replicated series to deduplicate. The value for __replica__ corresponds to a unique Pod name for the Prometheus replica.
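For context, these fields live under spec in the Prometheus custom resource. A trimmed sketch follows; the metadata values assume the kube-prometheus defaults and your file will contain many more fields.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s             # kube-prometheus default; yours may differ
  namespace: monitoring # kube-prometheus default; yours may differ
spec:
  replicas: 2
  replicaExternalLabelName: "__replica__"
  externalLabels:
    cluster: "your_cluster_identifier"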
To learn more about external labels and deduplication, please see Sending data from multiple high-availability Prometheus instances. To learn more about these parameters and the Prometheus Operator API, consult API Docs from the Prometheus Operator GitHub repository.
For a Prometheus HA deployment without Prometheus Operator, it's sufficient to create a unique __replica__ label for each HA Prometheus instance, and a cluster label shared across both HA instances in your Prometheus configuration.
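For example, a minimal sketch of the global section of prometheus.yml for one replica; give each HA instance its own __replica__ value (the value below is a hypothetical Pod name) and keep the cluster value identical across the pair.

global:
  external_labels:
    # Same on every HA instance in this cluster
    cluster: your_cluster_identifier
    # Unique per instance, for example the Pod or host name
    __replica__: prometheus-0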
After saving and rolling out these changes, you should see your active series usage decrease by roughly 50%. It may take some time for data to propagate into your Billing and Usage Grafana dashboards, but you should see results fairly quickly in the Ingestion Rate (DPM) panel.
You can also drastically reduce metrics usage by keeping a limited set of metrics to ship to Grafana Cloud, instead of all metrics scraped by kube-prometheus in its default configuration.
Filtering and keeping kubernetes-mixin metrics (allowlisting)
This section shows you how to keep a limited set of core metrics to ship to Grafana Cloud, storing the rest locally.
The Prometheus Monitoring Mixin for Kubernetes contains a curated set of Grafana dashboards and Prometheus alerts to gain visibility into and alert on your cluster’s operations. The Mixin dashboards and alerts are designed by DevOps practitioners who’ve distilled their experience and knowledge managing Kubernetes clusters into a set of reusable core dashboards and alerts.
By default, kube-prometheus deploys Grafana into your cluster, and populates it with a core set of kubernetes-mixin dashboards. It also sets up the alerts and recording rules defined in the Kubernetes Mixin. To reduce your Grafana Cloud metric usage, you can selectively ship metrics essential for populating kubernetes-mixin dashboards to Grafana Cloud. These metrics will then be available for long-term storage and analysis, with all other metrics stored locally in your cluster Prometheus instances.
In this guide, we’ve extracted metrics found in kubernetes-mixin dashboards. You may want to include other metrics, such as those found in the mixin alerts.
To begin allowlisting metrics, navigate into the manifests directory of the kube-prometheus code repository. Locate the manifest file for the Prometheus Custom Resource, prometheus-prometheus.yaml.
Prometheus Custom Resources are created and defined by Prometheus Operator, a sub-component of the kube-prometheus stack.
To learn more about Prometheus Operator, please see the prometheus-operator GitHub repository.
Scroll to the bottom of prometheus-prometheus.yaml and append the following to your existing remoteWrite configuration:
remoteWrite:
- url: "<Your Metrics instance remote_write endpoint>"
  basicAuth:
    username:
      name: your_grafanacloud_secret
      key: your_grafanacloud_secret_username_key
    password:
      name: your_grafanacloud_secret
      key: your_grafanacloud_secret_password_key
  writeRelabelConfigs:
  - sourceLabels:
    - "__name__"
    regex: "apiserver_request_total|kubelet_node_config_error|kubelet_runtime_operations_errors_total|kubeproxy_network_programming_duration_seconds_bucket|container_cpu_usage_seconds_total|kube_statefulset_status_replicas|kube_statefulset_status_replicas_ready|node_namespace_pod_container:container_memory_swap|kubelet_runtime_operations_total|kube_statefulset_metadata_generation|node_cpu_seconds_total|kube_pod_container_resource_limits_cpu_cores|node_namespace_pod_container:container_memory_cache|kubelet_pleg_relist_duration_seconds_bucket|scheduler_binding_duration_seconds_bucket|container_network_transmit_bytes_total|kube_pod_container_resource_requests_memory_bytes|namespace_workload_pod:kube_pod_owner:relabel|kube_statefulset_status_observed_generation|process_resident_memory_bytes|container_network_receive_packets_dropped_total|kubelet_running_containers|kubelet_pod_worker_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|workqueue_queue_duration_seconds_bucket|container_network_transmit_packets_total|rest_client_request_duration_seconds_bucket|node_namespace_pod_container:container_memory_rss|container_cpu_cfs_throttled_periods_total|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes_used|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|kube_node_status_allocatable_memory_bytes|container_memory_cache|go_goroutines|kubelet_runtime_operations_duration_seconds_bucket|kube_statefulset_replicas|kube_pod_owner|rest_client_requests_total|container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|storage_operation_errors_total|scheduler_e2e_scheduling_duration_seconds_bucket|container_network_transmit_packets_dropped_total|kube_pod_container_resource_limits_memory_bytes|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate|storage_operation_duration_seconds_count|node_netstat_TcpExt_TCPSynRetrans|node_netstat_Tcp_OutSegs|container_cpu_cfs_periods_total|kubelet_pod_start_duration_seconds_count|kubeproxy_network_programming_duration_seconds_count|container_network_receive_bytes_total|node_netstat_Tcp_RetransSegs|up|storage_operation_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_volume_stats_available_bytes|scheduler_scheduling_algorithm_duration_seconds_bucket|kube_statefulset_status_replicas_current|code_resource:apiserver_request_total:rate5m|kube_statefulset_status_replicas_updated|process_cpu_seconds_total|kube_pod_container_resource_requests_cpu_cores|kubelet_pod_worker_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|container_memory_usage_bytes|workqueue_adds_total|container_network_receive_packets_total|container_memory_working_set_bytes|kube_resourcequota|kubelet_running_pods|kubelet_volume_stats_inodes|kubeproxy_sync_proxy_rules_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_count|apiserver_request:availability30d|container_memory_rss|kubelet_pleg_relist_interval_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_count|workqueue_depth|:node_memory_MemAvailable_bytes:sum|volume_manager_total_volumes|kube_node_status_allocatable_cpu_cores"
    action: "keep"
The first chunk of this configuration defines remote_write parameters like authentication and the Cloud Metrics Prometheus endpoint URL to which Prometheus ships scraped metrics. To learn more about remote_write, please see the Prometheus docs.
To learn about the API implemented by Prometheus Operator, please see the API Docs from the Prometheus Operator GitHub repository.
The writeRelabelConfigs section instructs Prometheus to check the __name__ meta-label (the metric name) of a scraped time series, and match it against the regex defined by the regex parameter. This regex contains a list of all metrics found in the kubernetes-mixin dashboards.
Note: This guide is updated infrequently and this allowlist may grow stale as the mixin evolves. Also note that this allowlist was generated from the kubernetes-mixin dashboards only and does not include metrics referenced in alerting or recording rules.
The keep action instructs Prometheus to “keep” these metrics for shipping to Grafana Cloud, and drop all others.
Note that this configuration applies only to the remote_write section of your Prometheus configuration, so Prometheus will continue to store all scraped metrics locally.
If you have additional metrics you'd like to keep, you can append them to the regex parameter, or add them in an additional relabel_config section.
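For example, a hypothetical metric named my_app_requests_total could be kept by appending it to the existing regex. The allowlist below is truncated for readability; in your file, keep the full list and add your metric at the end.

writeRelabelConfigs:
- sourceLabels:
  - "__name__"
  # Allowlist truncated for readability; keep the full list and append your metric at the end.
  regex: "apiserver_request_total|kubelet_node_config_error|...|my_app_requests_total"
  action: "keep"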
When you're done modifying prometheus-prometheus.yaml, save and close the file. Deploy the changes in your cluster using kubectl apply -f or your preferred Kubernetes management tool.
You may need to restart or bring up new Prometheus instances to pick up the modified configuration.
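For example, assuming you are still in the manifests directory of the kube-prometheus repository:

kubectl apply -f prometheus-prometheus.yaml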
After saving and rolling out these changes, you should be pushing far fewer active series. It may take some time for data to propagate into your Billing and Usage Grafana dashboards, but you should see results fairly quickly in the Ingestion Rate (DPM) panel. Any kubernetes-mixin dashboards imported into Grafana Cloud should continue to function correctly.
To test this, you can import a kubernetes-mixin dashboard into Grafana Cloud manually.
Importing a kubernetes-mixin dashboard into Grafana Cloud
Run the following command to get access to the Grafana instance running in your cluster:
kubectl --namespace monitoring port-forward svc/grafana 3000
In your web browser, navigate to http://localhost:3000 and locate the API Server dashboard, which contains panels to help you understand the behavior of the Kubernetes API server.
Click on Share Dashboard.
Next, click on Export, then View JSON. Copy the Dashboard JSON to your clipboard.
On Grafana Cloud, log in to Grafana, then navigate to Manage Dashboards. Click on Import, and in the Import via panel JSON field, paste in the dashboard JSON you just copied. Then, click Load. Optionally name and organize your dashboard, then click Import to finish importing it.
You should see your allowlisted metrics populating the dashboard panels. These metrics and this dashboard will be available in Grafana Cloud for long-term storage and efficient querying across all of your Kubernetes clusters.
You can also reduce metric usage by explicitly dropping high-cardinality metrics in your relabel_config.
Filtering and dropping high-cardinality metrics (denylisting)
You can also selectively drop high-cardinality metrics and labels that you don’t anticipate needing to warehouse in Grafana Cloud.
To analyze your metrics usage and learn how to identify potential high-cardinality metrics and labels to drop, please see Analyzing Prometheus metric usage.
The following sample write_relabel_configs drops a metric called alertmanager_build_info. This is not a high-cardinality metric, and is only used here for demonstration purposes. Using similar syntax, you can drop high-cardinality labels that you don't need.
write_relabel_configs:
- source_labels: [__name__]
  regex: "alertmanager_build_info"
  action: drop
This config looks at the __name__ series meta-label, corresponding to a metric's name, and checks that it matches the regex set in the regex field. If it does, all matched series are dropped.
Note that if you add this snippet to the remote_write section of your Prometheus configuration, you will continue to store the metric locally, but prevent it from being shipped to Grafana Cloud.
You can expand this snippet to capture other high-cardinality metrics or labels that you do not wish to ship to Grafana Cloud for long-term storage. Note that this example does not use the Kubernetes Prometheus Operator API and is standard Prometheus configuration.
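Similarly, to drop an individual high-cardinality label rather than a whole metric, you can use the labeldrop action. A sketch follows; the label name request_id is a hypothetical example.

write_relabel_configs:
# Removes the hypothetical request_id label from every series before it is shipped;
# the series themselves are still sent, just without that label.
- regex: "request_id"
  action: labeldrop

Keep in mind that if the dropped label was the only thing distinguishing two series, the resulting series will collide, so only drop labels you are certain you don't need.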
To learn more about write_relabel_configs, please see relabel_config from the Prometheus docs.
Conclusion
This guide describes three methods for reducing Grafana Cloud metrics usage when shipping metrics from Kubernetes clusters:
- Deduplicating metrics sent from HA Prometheus deployments
- Keeping “important” metrics
- Dropping high-cardinality “unimportant” metrics
This guide has purposefully avoided making statements about which metrics are “important” or “unimportant”; this will depend on your use case and production monitoring needs. To learn more about some metrics you may wish to visualize and alert on, please see the Kubernetes Mixin, created by experienced DevOps practitioners and contributors to the Prometheus and Grafana ecosystem.