Reduce your Prometheus active series usage

You can configure Prometheus to drop any metrics not used by Kubernetes Monitoring. By default, Kube-Prometheus scrapes almost every available endpoint in your cluster, which sends tens of thousands (possibly hundreds of thousands) of active series to Grafana Cloud. The following steps configure Prometheus to send only the metrics referenced in the dashboards you just uploaded. You lose long-term retention for the dropped series; however, they remain available locally for Prometheus’ configured retention period.

Enable metric allowlisting with `write_relabel_configs`

To enable metric allowlisting, use the write_relabel_configs parameter of remote_write. This parameter lets you allowlist metrics using regular expressions. For more information, refer to Reduce metrics costs by filtering collected and forwarded metrics.
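As a rough illustration of what an allowlist rule does, the following Python sketch mimics a relabel rule with `action: keep` applied to the `__name__` label. Prometheus anchors relabel regexes at both ends, so the sketch uses `re.fullmatch`. The three-entry allowlist and the sample metric names are illustrative assumptions, and Python's `re` module is only a stand-in for the RE2 engine Prometheus uses.

```python
import re

# Illustrative subset of an allowlist; the real regex used later is much longer.
ALLOWLIST_REGEX = "up|kube_pod_info|node_load1"

def kept_by_allowlist(metric_names, regex=ALLOWLIST_REGEX):
    """Return the names that a `keep` rule on __name__ would forward."""
    pattern = re.compile(regex)
    # fullmatch imitates the implicit anchoring of relabel regexes.
    return [name for name in metric_names if pattern.fullmatch(name)]

scraped = ["up", "node_load1", "node_load15", "kube_pod_info", "go_gc_duration_seconds"]
forwarded = kept_by_allowlist(scraped)
print(forwarded)
```

Because the match is anchored, `node_load15` is dropped even though `node_load1` is on the list; every metric you want forwarded has to appear in the regex by its full name.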

For these instructions, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as dashboards and rules are updated. Regenerate the metrics allowlist using Grafana Mimirtool.

Note
mixin-metrics works on dashboard JSON and rules YAML files. mimirtool works against a Grafana instance and a Cloud Prometheus instance.

In an unloaded 3-node Kubernetes cluster, Kube-Prometheus sends approximately 40k active series by default. The following allowlist configuration should reduce this volume to approximately 8k active series. These figures may vary depending on your cluster and workloads.

To enable metric allowlisting:

Open the values.yaml file you used to configure remote_write in the Migrate a Kube-Prometheus Helm stack to Grafana Cloud steps, and add the following writeRelabelConfigs block:

prometheus:
  prometheusSpec:
    remoteWrite:
    - url: "<Your Cloud Prometheus instance remote_write endpoint>"
      basicAuth:
        username:
          name: kubepromsecret
          key: username
        password:
          name: kubepromsecret
          key: password
      writeRelabelConfigs:
      - sourceLabels:
        - "__name__"
        regex: ":node_memory_MemAvailable_bytes:sum|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_request:availability30d|apiserver_request_total|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code_resource:apiserver_request_total:rate5m|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:node_disk_io_time_weighted_seconds:rate5m|kube_node_status_allocatable|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_info|kube_pod_owner|kube_resourcequota|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_manager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
        action: "keep"
    replicaExternalLabelName: "__replica__"
    externalLabels: {cluster: "test"}

This block adds a writeRelabelConfigs section to Prometheus Operator’s RemoteWriteSpec. To learn more about Operator’s API spec, refer to RemoteWriteSpec from Prometheus Operator’s API docs. To learn more about the underlying remote_write configuration specification, refer to remote_write from the Prometheus documentation.

In this case, you extracted the set of metrics found in the Kube-Prometheus stack’s dashboards, and configured Prometheus to drop any metric not referenced in a dashboard.
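Maintaining a regex of this length by hand is error-prone, so it can help to generate it from a plain list of metric names. The following Python sketch shows one way to do that, assuming you already have the names (for example, from a tool's output). The function name `build_keep_regex` and the sample names are hypothetical, and the validation uses the classic Prometheus metric name pattern.

```python
import re

# Classic metric name pattern: letters, digits, underscores, and colons.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def build_keep_regex(metric_names):
    """Join unique, validated metric names into one alternation for a keep rule."""
    unique = sorted({name.strip() for name in metric_names if name.strip()})
    for name in unique:
        if not METRIC_NAME.fullmatch(name):
            raise ValueError(f"not a valid metric name: {name!r}")
    # Names matching the pattern contain no regex metacharacters, so no escaping is needed.
    return "|".join(unique)

names = ["up", "kube_pod_info", "up", "instance:node_num_cpu:sum", " node_load1 "]
keep_regex = build_keep_regex(names)
print(keep_regex)
```

The resulting string can be pasted as the `regex` value on a single line. Sorting and de-duplicating the names also makes changes to the allowlist easier to review.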

Save and close the file.

Apply the changes using Helm:

helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack

Use port-forward to reach the Prometheus interface, and confirm the changes in Status -> Configuration. In the following command, foo stands in for your Helm release name:
```
kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090
```
Scroll down to remote_write, and verify that Prometheus has picked up your configuration changes. This may take a minute or two.

Navigate to your hosted Grafana instance, and check the Ingestion Rate (DPM) dashboard. After a minute or two, you should see a sharp drop in the ingestion rate. Because a series counts as “active” if you’ve sent data points for it in the past 15-30 minutes, your active series usage takes a bit longer to drop.
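If you prefer to check the loaded configuration from a script rather than the UI, the following Python sketch reads it from Prometheus' `/api/v1/status/config` HTTP endpoint while the port-forward is running. The `http://localhost:9090` base URL and the helper names are assumptions, and the substring check is deliberately crude: it only confirms that a `keep` action appears somewhere after `write_relabel_configs` in the rendered YAML.

```python
import json
import urllib.request

def extract_config_yaml(payload):
    """Pull the YAML text out of a /api/v1/status/config response body."""
    if payload.get("status") != "success":
        raise RuntimeError(f"unexpected response status: {payload.get('status')!r}")
    return payload["data"]["yaml"]

def has_keep_rule(config_yaml):
    """Crude check: is there a keep action at or after write_relabel_configs?"""
    start = config_yaml.find("write_relabel_configs")
    return start != -1 and "action: keep" in config_yaml[start:]

def fetch_config_yaml(base_url="http://localhost:9090"):
    """Fetch the currently loaded configuration (requires the port-forward)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/config", timeout=10) as resp:
        return extract_config_yaml(json.load(resp))

# Offline demonstration with a hand-written response body.
sample = {
    "status": "success",
    "data": {"yaml": "remote_write:\n- url: https://example.invalid/push\n"
                     "  write_relabel_configs:\n  - source_labels: [__name__]\n"
                     "    regex: up\n    action: keep\n"},
}
print(has_keep_rule(extract_config_yaml(sample)))
# Against a live instance: print(has_keep_rule(fetch_config_yaml()))
```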

Next steps

At this point, you have configured Prometheus to send only the metrics referenced in the dashboards that you imported to Grafana Cloud. To further reduce the set of metrics you send and decrease the cardinality of some of your metrics, refer to the following resources:

Analyze Prometheus metrics costs
The mixin-metrics utility, which allows you to extract metrics from dashboard JSON and rules YAML
PromLabs’ Relabeler utility, which allows you to test and understand Prometheus relabeling configuration. In particular, the labeldrop action can help you quickly reduce metric cardinality.
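To make the cardinality effect of labeldrop concrete, the following Python sketch simulates it on a handful of made-up samples. The metric name `demo_metric`, the `container_id` label, and the helper functions are all hypothetical; the point is only that dropping a label whose value churns over time reduces the number of distinct series, provided the remaining labels still identify each series uniquely.

```python
import re

def labeldrop(labels, regex):
    """Remove every label whose name fully matches regex, as labeldrop does."""
    pattern = re.compile(regex)
    return {k: v for k, v in labels.items() if not pattern.fullmatch(k)}

def count_series(samples):
    """Count distinct label sets seen across all samples."""
    return len({frozenset(labels.items()) for _, labels in samples})

def still_unique(samples):
    """Check that no two samples at the same timestamp share a label set."""
    seen = set()
    for ts, labels in samples:
        key = (ts, frozenset(labels.items()))
        if key in seen:
            return False
        seen.add(key)
    return True

# Hypothetical samples: containers restart, so container_id changes between scrapes.
samples = [
    (0, {"__name__": "demo_metric", "pod": "web-0", "container_id": "a1"}),
    (0, {"__name__": "demo_metric", "pod": "web-1", "container_id": "b1"}),
    (60, {"__name__": "demo_metric", "pod": "web-0", "container_id": "a2"}),
    (60, {"__name__": "demo_metric", "pod": "web-1", "container_id": "b2"}),
]
dropped = [(ts, labeldrop(labels, "container_id")) for ts, labels in samples]
print(count_series(samples), "->", count_series(dropped))
print("unique after drop:", still_unique(dropped))
```

The uniqueness check matters: if the dropped label were the only thing distinguishing two series at the same timestamp, the results would collide, so it is worth testing a labeldrop rule before applying it.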

Your Prometheus instance continues to evaluate recording rules and alerting rules locally, which you can optionally migrate to Grafana Cloud to further reduce the workload on your local Prometheus. When you migrate recording and alerting rules, you are able to take advantage of Cloud Prometheus’ increased scalability and reliability. You can also reference metrics and perform aggregations across multiple Prometheus instances in your recording and alerting rules. Refer to Import recording and alerting rules for detailed steps.