
Importing recording and alerting rules

Introduction

In this guide you’ll import Kube-Prometheus’s recording and alerting rules into Grafana Cloud. Recording rules precompute the results of expensive queries at a configurable interval, reducing load on your Prometheus instances and improving query performance. To learn more, please see Recording rules in the Prometheus docs. Alerting rules allow you to define alert conditions based on PromQL queries against your Prometheus metrics.

You should avoid evaluating the same recording rules in both your local Prometheus instance and on Grafana Cloud, as this will create additional data points for the same time series. You may wish to split up recording and alerting rule evaluation across your local Prometheus instance and Grafana Cloud Prometheus. This allows you to keep alerting and recording rule evaluation local to your cluster, and use Grafana Cloud for rules that require global, multi-cluster aggregations.
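For example, an aggregation across clusters like the following is a natural fit for evaluation on Grafana Cloud Prometheus, since no single local Prometheus instance sees every cluster’s series. This is an illustrative sketch, not a rule shipped with the stack; it assumes each cluster attaches a distinct `cluster` external label (as configured later in this guide):

```yaml
groups:
  - name: multi_cluster.rules
    rules:
      # Sums CPU usage per cluster across every cluster shipping
      # metrics to this Grafana Cloud stack. The record name is
      # hypothetical -- name yours to match your conventions.
      - record: multi_cluster:node_cpu_seconds_total:rate5m
        expr: sum by (cluster) (rate(node_cpu_seconds_total[5m]))
```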

Warning: To prevent abuse, your stack has been limited by default to 20 rules per rule group and 35 rule groups. The Kube-Prometheus stack currently contains roughly 28 rule groups, and one group, kube-apiserver.rules, contains 21 rules. To import the full set of recording rules to Grafana Cloud, you will need to increase your default rules-per-group limit from 20, which you can do by opening a ticket with Support from the Cloud Portal. You can also break up this rule group into two smaller rule groups, and import these smaller groups, which fit within the 20 rules-per-group default limit. To learn more, please see Defining recording rules.
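A split rule group file might look like the following sketch. The group names and the division point are illustrative; copy the actual rule definitions unchanged from the original kube-apiserver.rules group:

```yaml
groups:
  - name: kube-apiserver.rules.1
    rules:
      # first batch (at most 20) of the original group's rules,
      # copied unchanged from kube-apiserver.rules
      - record: apiserver_request:burnrate1d
        expr: sum(...)  # original expression, unchanged
  - name: kube-apiserver.rules.2
    rules:
      # remaining rules from the original group, copied unchanged
```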

Note: To learn how to enable multi-cluster support for Kube-Prometheus rules, please see Enabling multi-cluster support (optional).

Step 1: Extract Kube-Prometheus rules from Prometheus Pod

In this step we’ll extract the Kube-Prometheus recording and alerting rules from the Prometheus Pod’s filesystem.

Since we’re using the Kube-Prometheus Helm chart in this quickstart, there are two ways to extract the rules YAML files:

  • kubectl exec into the Prometheus container and copy out the rules files, which Helm generated from the templated source
  • Use the helm template command to locally generate the K8s manifests, and extract the rules files from this output

In this guide, we’ll use the first method. If you’re not using Helm, you may wish to generate or extract the rules files directly from the Jsonnet source, or import the Jsonnet source directly using a tool like Grizzly. These methods are beyond the scope of this guide. You can also find a reduced set of dashboards and rules in the kubernetes-mixin project’s GitHub repo; the Kubernetes mixin is a subset of the Kube-Prometheus stack.
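If you prefer the second method, one way to pull the rule manifests out of the rendered output is sketched below. The two-document rendered.yaml created here is a stand-in for real `helm template` output so the extraction step is runnable as-is, and the multi-character record separator requires GNU awk or mawk:

```shell
# In practice you would generate rendered.yaml with something like:
#   helm template your_release_name prometheus-community/kube-prometheus-stack > rendered.yaml
# A small fake stands in here so the extraction step itself can run anywhere.
cat > rendered.yaml <<'EOF'
---
kind: Service
metadata:
  name: demo-service
---
kind: PrometheusRule
metadata:
  name: demo-rules
EOF

# Split on YAML document separators and keep only PrometheusRule documents
awk -v RS='---\n' '/kind: PrometheusRule/ { print "---"; printf "%s", $0 }' \
  rendered.yaml > rules-only.yaml
cat rules-only.yaml
```

From rules-only.yaml you would then extract each manifest’s spec.groups section into standalone rules files.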

To begin, download the latest release of cortex-tools for your platform from the project’s Releases page, and confirm that you can run the cortextool binary locally.

Once you’ve confirmed that you can run cortextool locally, copy the rules files from your Prometheus instance’s container using kubectl exec:

kubectl exec -n "default" "prometheus-foo-kube-prometheus-stack-prometheus-0" -- tar cf - "/etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0" | tar xf -

In this command, replace:

  • prometheus-foo-kube-prometheus-stack-prometheus-0 with the name of your Prometheus Pod
  • /etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0 with the path to Prometheus’s rules directory. You can find this by port-forwarding to your Prometheus Pod using kubectl port-forward, accessing Prometheus’s configuration from the UI (Status -> Configuration), and searching for the rule_files parameter.

This will create a directory called etc in your current directory. Navigate through the nested hierarchy to locate a set of rules YAML files. Note that these files are symlinks to the actual rules definitions, which are in a hidden directory. Copy out these rule definitions to a more convenient location.
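The copy-out step can be scripted with cp -L, which dereferences symlinks as it copies. The directory and file names below are illustrative stand-ins for the real extracted hierarchy:

```shell
# Recreate, in miniature, the layout the tar extraction produces:
# rule files are symlinks into a hidden ..data directory.
mkdir -p etc/prometheus/rules/demo-rulefiles-0/..data
printf 'groups: []\n' > etc/prometheus/rules/demo-rulefiles-0/..data/demo.rules.yaml
ln -sf ..data/demo.rules.yaml etc/prometheus/rules/demo-rulefiles-0/demo.rules.yaml

# -L follows the symlinks, so the copies are regular files you can edit and load
mkdir -p rules
cp -L etc/prometheus/rules/demo-rulefiles-0/*.yaml rules/
ls -l rules/
```

With your real extracted directory, point the cp -L source glob at the rulefiles directory instead of the demo path.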

With the Kube Prometheus rules files available locally, we can upload them to Cloud Prometheus using cortex-tools.

Step 2: Load rules into Grafana Cloud Prometheus

In this step you’ll use cortex-tools to load the Kube-Prometheus stack recording and alerting rules into your Cloud Prometheus endpoint.

We’ll use the rules load command to load the defined rule groups to Grafana Cloud using the HTTP API.

Warning: You may need to increase your stack’s default ruler limits or break up rule groups as described in the introductory note before running this command.
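Before loading, you can sanity-check group sizes locally. This sketch counts record and alert entries per group with awk; the demo rules file stands in for your real extracted files:

```shell
# Minimal demo rules file -- substitute your extracted rule files here
cat > demo.rules.yaml <<'EOF'
groups:
- name: small.group
  rules:
  - record: job:up:sum
    expr: sum by (job) (up)
  - alert: TargetDown
    expr: up == 0
EOF

# Count record/alert entries under each "- name:" group heading
awk '
  /^- name:/          { if (g) print g, n; g = $3; n = 0 }
  /- (record|alert):/ { n++ }
  END                 { if (g) print g, n }
' demo.rules.yaml > group-sizes.txt
cat group-sizes.txt   # any group over 20 needs splitting or a limit increase
```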

Active Series Warning: Note that your active series usage will increase with this step.

cortextool rules load --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_api_key> *.yaml

This command will load the rules files into Cortex’s rule evaluation engine. If you encounter any errors, you can use the --log.level=debug flag to increase the tool’s verbosity.

Replace the parameters in the above command with the appropriate values, which you can find in your Grafana Cloud Portal, and be sure to omit the /api/prom path from the endpoint URL. To learn how to create an API key, see Create a Grafana Cloud API key.

After loading the rules, navigate to your hosted Grafana instance, then Grafana Cloud Alerting in the left-hand navigation menu, and finally Rules. From here, select the appropriate datasource from the dropdown (ending in -prom). You should see a list of alerting and recording rules.

In the previous step, we limited metrics shipped to Grafana Cloud from the local Prometheus instance to only those referenced in dashboards. We now need to expand this set of metrics to include those referenced in the recording and alerting rules we just imported.

Step 3: Expand allowlist to capture rules metrics

In this step you’ll expand the allowlist of shipped metrics to include those referenced in Kube-Prometheus’s recording and alerting rules.

Active Series Warning: Note that your active series usage will increase with this step.

Open the values.yaml file you used to configure remote_write in Step 2, and modify it as follows:

prometheus:
  prometheusSpec:
    remoteWrite:
    - url: "<Your Cloud Prometheus instance remote_write endpoint>"
      basicAuth:
          username:
            name: kubepromsecret
            key: username
          password:
            name: kubepromsecret
            key: password
      writeRelabelConfigs:
      - sourceLabels:
        - "__name__"
        regex: ":node_memory_MemAvailable_bytes:sum|aggregator_unavailable_apiservice|aggregator_unavailable_apiservice_total|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_cluster_members|alertmanager_config_hash|alertmanager_config_last_reload_successful|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request:availability30d|apiserver_request:burnrate1d|apiserver_request:burnrate1h|apiserver_request:burnrate2h|apiserver_request:burnrate30m|apiserver_request:burnrate3d|apiserver_request:burnrate5m|apiserver_request:burnrate6h|apiserver_request_duration_seconds_bucket|apiserver_request_duration_seconds_count|apiserver_request_terminations_total|apiserver_request_total|cluster:node_cpu_seconds_total:rate5m|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code:apiserver_request_total:increase30d|code_resource:apiserver_request_total:rate5m|code_verb:apiserver_request_total:increase1h|code_verb:apiserver_request_total:increase30d|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:node_disk_io_time_weighted_seconds:rate5m|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_spec_completions|kube_job_status_succeeded|kube_node_spec_taint|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_status_phase|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_replicaset_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_shard_ordinal|kube_state_metrics_total_shards|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_certificate_manager_client_ttl_seconds|kubelet_certificate_manager_server_ttl_seconds|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|kubernetes_build_info|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_memory_Slab_bytes|node_namespace_pod:kube_pod_info:|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_packets_total|node_network_transmit_bytes_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_packets_total|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|node_textfile_scrape_error|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_vmstat_pgmajfault|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_config_last_reload_successful|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_notifications_alertmanagers_discovered|prometheus_notifications_errors_total|prometheus_notifications_queue_capacity|prometheus_notifications_queue_length|prometheus_notifications_sent_total|prometheus_operator_list_operations_failed_total|prometheus_operator_list_operations_total|prometheus_operator_managed_resources|prometheus_operator_node_address_lookup_errors_total|prometheus_operator_ready|prometheus_operator_reconcile_errors_total|prometheus_operator_reconcile_operations_total|prometheus_operator_syncs|prometheus_operator_watch_operations_failed_total|prometheus_operator_watch_operations_total|prometheus_remote_storage_failed_samples_total|prometheus_remote_storage_highest_timestamp_in_seconds|prometheus_remote_storage_queue_highest_sent_timestamp_seconds|prometheus_remote_storage_samples_failed_total|prometheus_remote_storage_samples_total|prometheus_remote_storage_shards_desired|prometheus_remote_storage_shards_max|prometheus_remote_storage_succeeded_samples_total|prometheus_rule_evaluation_failures_total|prometheus_rule_group_iterations_missed_total|prometheus_rule_group_rules|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_metadata_cache_entries|prometheus_target_scrape_pool_exceeded_label_limits_total|prometheus_target_scrape_pool_exceeded_target_limit_total|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_compactions_failed_total|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|prometheus_tsdb_reloads_failures_total|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_manager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
        action: "keep"
    replicaExternalLabelName: "__replica__"
    externalLabels: {cluster: "test"}

Note: In this guide, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as dashboards and rules are updated, and you should regenerate the metrics allowlist using cortextool.
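Once you have a regenerated list of metric names, joining them into the `|`-separated regex used by the writeRelabelConfigs allowlist is a one-liner. The metric names below are illustrative:

```shell
# One metric name per line, as you'd assemble from the analysis output
printf '%s\n' up node_load1 node_cpu_seconds_total > metrics.txt

# Join the lines with "|" to produce the value for the regex field
paste -sd'|' metrics.txt > allowlist.txt
cat allowlist.txt
```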

Here, we expand the allowlist of metrics to include those referenced in recording and alerting rules. As an additional step, we’ll turn off recording and alerting rule evaluation in our local Prometheus cluster.

Step 4: Disable local Prometheus rules evaluation

Now that you’ve imported the Kube Prometheus recording and alerting rules to Grafana Cloud, you may wish to turn off local Prometheus rule evaluation, as this will create additional data points and alerts for the same time series.

To disable local rule evaluation, add the following to your values.yaml Helm configuration file:

. . .
defaultRules:
  create: false
  rules:
    alertmanager: false
    etcd: false
    general: false
    k8s: false
    kubeApiserver: false
    kubeApiserverAvailability: false
    kubeApiserverError: false
    kubeApiserverSlos: false
    kubelet: false
    kubePrometheusGeneral: false
    kubePrometheusNodeAlerting: false
    kubePrometheusNodeRecording: false
    kubernetesAbsent: false
    kubernetesApps: false
    kubernetesResources: false
    kubernetesStorage: false
    kubernetesSystem: false
    kubeScheduler: false
    kubeStateMetrics: false
    network: false
    node: false
    prometheus: false
    prometheusOperator: false
    time: false

Roll out the changes using helm upgrade:

helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack

Once the changes have been rolled out, use port-forward to forward a local port to your Prometheus Service:

kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090

Navigate to http://localhost:9090 in your browser, then go to Status -> Rules. Verify that Prometheus is no longer evaluating recording and alerting rules.

To learn more, please see the Helm chart’s values.yaml file.

Conclusion

In this guide you learned how to migrate the Kube-Prometheus stack’s recording and alerting rules to Grafana Cloud Alerting.

To learn how to enable multi-cluster support for Kube-Prometheus rules, please see Enabling multi-cluster support (optional).