Reduce your Prometheus active series usage
In this guide you’ll configure Prometheus to drop any metrics not referenced in the Kube-Prometheus stack’s dashboards.
By default, Kube-Prometheus scrapes almost every available endpoint in your cluster, shipping tens of thousands (possibly hundreds of thousands) of active series to Grafana Cloud. In this guide we'll configure Prometheus to ship only the metrics referenced in the dashboards we've just uploaded. You will lose long-term retention for the dropped series; however, they will still be available locally for Prometheus's default configured retention period.
Enable metric allowlisting with write_relabel_configs
To enable metric allowlisting, we'll take advantage of remote_write's write_relabel_configs parameter. This parameter allows us to allowlist metrics using regular expressions. To learn more about this feature, see Controlling remote write behavior using write_relabel_configs.
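For context, a minimal standalone prometheus.yml equivalent of this allowlisting approach looks roughly like the following sketch; the endpoint and metric names below are placeholders for illustration only, not part of the Kube-Prometheus configuration:

remote_write:
  - url: "https://<your-cloud-prometheus-endpoint>/api/prom/push"
    write_relabel_configs:
      # Keep only series whose metric name matches the regex; drop everything else.
      - source_labels: ["__name__"]
        regex: "up|node_cpu_seconds_total|container_memory_working_set_bytes"
        action: "keep"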
In this guide, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as dashboards and rules are updated, so you should regenerate the allowlist using mimirtool.
Note: mixin-metrics works on dashboard JSON and rules YAML files, while mimirtool works against a Grafana instance and a Cloud Prometheus instance. mimirtool does not support rules; this feature is under active development.
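As a rough sketch of how you might regenerate the allowlist, the following mimirtool commands analyze your hosted Grafana dashboards and then compare them against what your Cloud Prometheus instance is ingesting. The addresses, API keys, and tenant ID are placeholders, and exact flag names can differ between mimirtool versions, so check mimirtool analyze --help before running them:

# Extract the metric names referenced by your hosted Grafana dashboards
# (writes metrics-in-grafana.json to the current directory).
mimirtool analyze grafana --address=https://<your-grafana-instance> --key=<grafana-api-key>

# Compare those metrics against your Cloud Prometheus instance; this reads
# metrics-in-grafana.json and writes prometheus-metrics.json, listing in-use
# and unused metric names you can turn into a keep regex.
mimirtool analyze prometheus --address=<your-cloud-prometheus-endpoint> --id=<your-tenant-id> --key=<api-key>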
In an unloaded 3-node Kubernetes cluster, Kube-Prometheus will ship roughly 40k active series by default. The following allowlist configuration should reduce this volume to roughly 8k active series. These figures may vary depending on your cluster and workloads.
Open the values.yaml file you used to configure remote_write in Step 2 of Migrating a Kube-Prometheus Helm stack to Grafana Cloud, and add the following writeRelabelConfigs block:
prometheus:
  prometheusSpec:
    remoteWrite:
    - url: "<Your Cloud Prometheus instance remote_write endpoint>"
      basicAuth:
        username:
          name: kubepromsecret
          key: username
        password:
          name: kubepromsecret
          key: password
      writeRelabelConfigs:
      - sourceLabels:
        - "__name__"
        regex: ":node_memory_MemAvailable_bytes:sum|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_request:availability30d|apiserver_request_total|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code_resource:apiserver_request_total:rate5m|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:node_disk_io_time_weighted_seconds:rate5m|kube_node_status_allocatable|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_info|kube_pod_owner|kube_resourcequota|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_manager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
        action: "keep"
    replicaExternalLabelName: "__replica__"
    externalLabels: {cluster: "test"}
This block adds a writeRelabelConfigs section to Prometheus Operator's RemoteWriteSpec. To learn more about Operator's API spec, see RemoteWriteSpec from Prometheus Operator's API docs. To learn more about the underlying remote_write configuration specification, see remote_write from the Prometheus docs.
In this case we’ve extracted the set of metrics found in the Kube-Prometheus stack’s dashboards, and configured Prometheus to drop any metric not referenced in a dashboard.
When you’re done editing the file, save and close it. Roll out the changes using Helm:
helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack
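If you want to preview what an upgrade like this will change before applying it, the optional helm-diff plugin can help; this sketch assumes the plugin is installed:

# Show what the upgrade would change without applying it.
helm diff upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack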
Use port-forward to access the Prometheus UI and confirm the changes under Status -> Configuration:
kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090
Scroll down to remote_write and verify that Prometheus has picked up your configuration changes. This may take a minute or two.
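If you prefer the command line, you can also pull the running configuration from Prometheus's status API while the port-forward above is active; this sketch assumes jq is installed:

# Fetch the active configuration and show the relabeling section.
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 4 'write_relabel_configs'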
Navigate to your hosted Grafana instance and check the Ingestion Rate (DPM) dashboard. After a minute or two, you should see a sharp drop. Because a series is considered "active" if it has received data points in the past 15 to 30 minutes, it will take some time for your active series usage to drop.
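You can also watch the outgoing sample rate from your local Prometheus UI. For example, a query along these lines (assuming a Prometheus version that exposes prometheus_remote_storage_samples_total; older releases name their remote-storage metrics differently) should show the per-second rate of samples sent over remote_write falling off:

# Per-second rate of samples sent to each remote_write endpoint.
rate(prometheus_remote_storage_samples_total[5m])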
Summary
At this point you’ve configured Prometheus to only ship metrics referenced in the dashboards you’ve imported to Grafana Cloud. To further reduce the set of metrics you ship and decrease the cardinality of some of your metrics, see the following resources:
- Control Prometheus metrics usage from the Grafana Cloud docs
- Finding unused metrics with mimirtool, which allows you to identify unused metrics you're shipping to Grafana Cloud
- The mixin-metrics utility, which allows you to extract metrics from dashboard JSON and rules YAML
- Promlabs’s Relabeler utility, which allows you to test and understand Prometheus relabeling configuration. In particular, the
labeldrop
parameter can help you quickly reduce metric cardinality.
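As a minimal sketch of that labeldrop approach, the following writeRelabelConfigs entry strips a hypothetical high-cardinality label from every shipped series; the label name is only an example. Keep in mind that dropping a label merges any series that differed only by that label, so make sure it is safe to remove:

writeRelabelConfigs:
- regex: "pod_template_hash"   # hypothetical high-cardinality label to drop
  action: "labeldrop"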
Your Prometheus instance continues to evaluate recording and alerting rules locally; you can optionally migrate these to Grafana Cloud to further reduce the workload on your local Prometheus. Migrating recording and alerting rules also lets you take advantage of Cloud Prometheus's increased scalability and reliability, and lets your rules reference metrics and perform aggregations across multiple Prometheus instances.