
Import recording and alerting rules

In this guide, you’ll import Kube-Prometheus’s recording and alerting rules to Grafana Cloud. Recording rules let you cache expensive queries at customizable intervals to reduce load on your Prometheus instances and improve performance. To learn more, see Recording rules from the Prometheus docs. Alerting rules allow you to define alert conditions based on PromQL queries and your Prometheus metrics.

You should avoid evaluating the same recording rules in both your local Prometheus instance and on Grafana Cloud, as this creates additional data points for the same time series. You may wish to split up recording and alerting rule evaluation across your local Prometheus instance and Grafana Cloud Prometheus. This lets you keep alerting and recording rule evaluation local to your cluster, and use Grafana Cloud for rules that require global, multi-cluster aggregations.
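
For example, a rule group evaluated only in Grafana Cloud could compute a per-cluster CPU aggregation across every cluster shipping metrics to the same stack. This is a minimal sketch; the group name, record name, and expression below are illustrative and not part of Kube-Prometheus:

    groups:
      - name: multi-cluster-aggregations
        rules:
          # Aggregate container CPU usage by the cluster external label set in remote_write
          - record: cluster:container_cpu_usage_seconds_total:rate5m:sum
            expr: sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))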

Warning: To prevent abuse, your stack has been limited by default to 20 rules per rule group and 35 rule groups. The Kube-Prometheus stack currently contains roughly 28 rule groups, and one group, kube-apiserver.rules, contains 21 rules. To import the full set of recording rules to Grafana Cloud, you will need to increase your default rules-per-group limit from 20, which you can do by opening a ticket with Support from the Cloud Portal. You can also break up this rule group into two smaller rule groups, and import these smaller groups, which fit within the 20 rules-per-group default limit. To learn more, see Defining recording rules.
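
For example, splitting the kube-apiserver.rules group could look like the sketch below, where the original rules are copied unchanged and only the group boundaries differ (the group names here are arbitrary):

    groups:
      - name: kube-apiserver.rules.1
        rules:
          # ...the first 11 rules from the original group, unchanged...
      - name: kube-apiserver.rules.2
        rules:
          # ...the remaining 10 rules, unchanged...
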
Note: To learn how to enable multi-cluster support for Kube-Prometheus rules, see Enable multi-cluster support.

Extract Kube-Prometheus rules from Prometheus Pod

In this section, you’ll extract the Kube-Prometheus recording and alerting rules from the Prometheus Pod’s file system.

Since you’re using the Kube-Prometheus Helm chart in this guide, there are two ways to extract the rules YAML files:

  • kubectl exec into the Prometheus container and copy out the rules files, which Helm generated from the templated source.
  • Use the helm template command to generate the K8s manifests locally, and extract the rules files from this output.

In this guide, you’ll use the first method. If you’re not using Helm, you might want to generate or extract the rules files directly from the Jsonnet source, or import the Jsonnet source directly using a tool like Grizzly. These methods are beyond the scope of this guide. You can also find a reduced set of dashboards and rules in the kubernetes-mixin project’s GitHub repo. The Kubernetes mixin is a subset of the Kube-Prometheus stack.
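
If you do want to use the second method instead, a minimal sketch (assuming the chart and release name foo used elsewhere in this guide) looks like this:

    # Render the chart's manifests locally without installing anything
    helm template foo prometheus-community/kube-prometheus-stack > rendered.yaml
    # The recording and alerting rules live in the generated PrometheusRule manifests
    grep -c "kind: PrometheusRule" rendered.yaml

You would then copy the rule groups out of those PrometheusRule manifests into plain rules files.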

To extract the Kube-Prometheus rules:

  1. Fetch the latest release of mimirtool from the Releases page.
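
    For example, after downloading the release binary for your platform, you might make it executable and confirm that it runs. The exact asset filename varies by operating system and architecture; this sketch assumes you saved it as mimirtool:

    chmod +x mimirtool
    sudo mv mimirtool /usr/local/bin/
    mimirtool --help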

  2. Once you’ve confirmed that you can run mimirtool locally, copy the rules files from your Prometheus instance’s container using kubectl exec:

    kubectl exec -n "default" "prometheus-foo-kube-prometheus-stack-prometheus-0" -- tar cf - "/etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0" | tar xf -
    

    In this command, replace:

    • prometheus-foo-kube-prometheus-stack-prometheus-0 with the name of your Prometheus Pod.
    • /etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0 with the path to Prometheus’s rules directory. You can find this by port-forwarding to your Prometheus Pod using kubectl port-forward, accessing Prometheus’s configuration from the UI (Status -> Configuration), and searching for the rule_files parameter.

    This will create a directory called etc in your current directory. Navigate through the nested hierarchy to locate a set of rules YAML files. Note that these files are symlinks to the actual rules definitions, which are in a hidden directory. Copy out these rule definitions to a more convenient location.
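
    For example, using the Pod name and rules path from the command above, you could list the extracted files and copy the resolved rule definitions into a working directory (cp -L follows the symlinks):

    # List the extracted rules files (typically symlinks into a hidden ..data directory)
    find etc -name '*.yaml'
    # Copy the resolved rule definitions into a local rules/ directory
    mkdir -p rules
    cp -L etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0/*.yaml rules/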

With the Kube-Prometheus rules files available locally, you can upload them to Cloud Prometheus using mimirtool.

Load rules into Grafana Cloud Prometheus

In this section, you’ll use mimirtool to load the Kube-Prometheus stack recording and alerting rules into your Cloud Prometheus endpoint.

  1. Use the rules load command to load the defined rule groups to Grafana Cloud using the HTTP API.

    Warning: You may need to increase your stack’s default rule limits or break up rule groups, as described in the warning at the beginning of this guide, before running this command.

    Active Series Warning: Note that your active series usage will increase with this step.

    mimirtool rules load --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_api_key> *.yaml
    

    Replace the parameters in the above command with the appropriate values. You can find these in your Grafana Cloud Portal. Be sure to omit the /api/prom path from the endpoint URL. You can learn how to create an API key by following Create a Grafana Cloud API key.

    This command loads the rules files into Grafana Cloud’s rule evaluation engine. If you encounter errors, use the --log.level=debug flag to increase the tool’s verbosity.
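
    Optionally, verify the upload from the command line with the rules list command, which prints the rule groups currently stored for your instance:

    mimirtool rules list --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_api_key>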

  2. After loading the rules, navigate to your hosted Grafana instance, then Grafana Cloud Alerting in the left-hand navigation menu, and finally Rules. From here, select the appropriate data source from the dropdown (ending in -prom). You should see a list of alerting and recording rules.

In the Reduce your active series usage guide, you limited metrics shipped to Grafana Cloud from the local Prometheus instance to only those referenced in dashboards. Now you need to expand this set of metrics to include those referenced in the recording and alerting rules you just imported.

Expand the allowlist to capture rules metrics

In this section, you’ll expand the allowlist of shipped metrics to include those referenced in Kube-Prometheus’s recording and alerting rules.

Active Series Warning: Note that your active series usage will increase with this step.

  1. Open the values.yaml file you used to configure remote_write in the Migrate a Kube-Prometheus Helm stack to Grafana Cloud guide.

  2. Modify the values.yaml file as follows:

    prometheus:
      prometheusSpec:
        remoteWrite:
        - url: "<Your Cloud Prometheus instance remote_write endpoint>"
          basicAuth:
              username:
                name: kubepromsecret
                key: username
              password:
                name: kubepromsecret
                key: password
          writeRelabelConfigs:
          - sourceLabels:
            - "__name__"
            regex: ":node_memory_MemAvailable_bytes:sum|aggregator_unavailable_apiservice|aggregator_unavailable_apiservice_total|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_cluster_members|alertmanager_config_hash|alertmanager_config_last_reload_successful|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request:availability30d|apiserver_request:burnrate1d|apiserver_request:burnrate1h|apiserver_request:burnrate2h|apiserver_request:burnrate30m|apiserver_request:burnrate3d|apiserver_request:burnrate5m|apiserver_request:burnrate6h|apiserver_request_duration_seconds_bucket|apiserver_request_duration_seconds_count|apiserver_request_terminations_total|apiserver_request_total|cluster:node_cpu_seconds_total:rate5m|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code:apiserver_request_total:increase30d|code_resource:apiserver_request_total:rate5m|code_verb:apiserver_request_total:increase1h|code_verb:apiserver_request_total:increase30d|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:n
ode_disk_io_time_weighted_seconds:rate5m|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_spec_completions|kube_job_status_succeeded|kube_node_spec_taint|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_status_phase|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_replicaset_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_shard_ordinal|kube_state_metrics_total_shards|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_certificate_manager_client_ttl_seconds|kubelet_certificate_manager_server_ttl_seconds|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|kubernetes_build_info|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node
_memory_Slab_bytes|node_namespace_pod:kube_pod_info:|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_packets_total|node_network_transmit_bytes_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_packets_total|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|node_textfile_scrape_error|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_vmstat_pgmajfault|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_config_last_reload_successful|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_notifications_alertmanagers_discovered|prometheus_notifications_errors_total|prometheus_notifications_queue_capacity|prometheus_notifications_queue_length|prometheus_notifications_sent_total|prometheus_operator_list_operations_failed_total|prometheus_operator_list_operations_total|prometheus_operator_managed_resources|prometheus_operator_node_address_lookup_errors_total|prometheus_operator_ready|prometheus_operator_reconcile_errors_total|prometheus_operator_reconcile_operations_total|prometheus_operator_syncs|prometheus_operator_watch_operations_failed_total|prometheus_operator_watch_operations_total|prometheus_remote_storage_failed_samples_total|prometheus_remote_storage_highest_timestamp_in_seconds|prometheus_remote_storage_queue_highest_sent_timestamp_seconds|prometheus_remote_storage_samples_failed_total|prometheus_remote_storage_samples_total|prometheus_remote_storage_shards_desired|prometheus_remote_storage_shards_max|prometheus_remote_storage_succeeded_samples_total|prometheus_rule_evaluation_failures_total|prometheus_rule_group_iterations_missed_total|prometheus_rule_group_rules|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_metadata_cache_entries|prometheus_target_scrape_pool_exceeded_label_limits_total|prometheus_target_scrape_pool_exceeded_target_limit_total|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_compactions_failed_total|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|prometheus_tsdb_reloads_failures_total|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_ma
nager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
            action: "keep"
        replicaExternalLabelName: "__replica__"
        externalLabels: {cluster: "test"}
    
Note: In this guide, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as dashboards and rules are updated, and you should regenerate the metrics allowlist using mimirtool.
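
For example, a sketch of regenerating the list from the rules files you extracted earlier, assuming mimirtool’s analyze rule-file command and the jq utility are available:

    # Extract the metric names referenced in the rules files (writes metrics-in-ruler.json by default)
    mimirtool analyze rule-file *.yaml
    # Join the metric names into a regex for the keep relabel rule above
    # (assumes a metricsUsed field in the output; adjust if the format differs)
    jq -r '.metricsUsed | join("|")' metrics-in-ruler.json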

Here, you expanded the allowlist of metrics to include those referenced in recording and alerting rules. As an additional step, you’ll turn off recording and alerting rule evaluation in your local Prometheus cluster.

Disable local Prometheus rules evaluation

Now that you’ve imported the Kube-Prometheus recording and alerting rules to Grafana Cloud, you might want to turn off local Prometheus rule evaluation, as evaluating the same rules in both places creates additional data points and alerts for the same time series.

To disable local rule evaluation:

  1. Add the following to your values.yaml Helm configuration file:

    . . .
    defaultRules:
      create: false
      rules:
        alertmanager: false
        etcd: false
        general: false
        k8s: false
        kubeApiserver: false
        kubeApiserverAvailability: false
        kubeApiserverError: false
        kubeApiserverSlos: false
        kubelet: false
        kubePrometheusGeneral: false
        kubePrometheusNodeAlerting: false
        kubePrometheusNodeRecording: false
        kubernetesAbsent: false
        kubernetesApps: false
        kubernetesResources: false
        kubernetesStorage: false
        kubernetesSystem: false
        kubeScheduler: false
        kubeStateMetrics: false
        network: false
        node: false
        prometheus: false
        prometheusOperator: false
        time: false
    
  2. Roll out the changes using helm upgrade:

    helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack
    
  3. Once the changes have been rolled out, use port-forward to forward a local port to your Prometheus Service:

    kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090
    
  4. Navigate to http://localhost:9090 in your browser, then go to Status -> Rules, and verify that Prometheus is no longer evaluating recording and alerting rules.
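
    As an additional check, you can query Prometheus’s rules API directly; with the default rules disabled, the returned groups list should be empty:

    curl -s http://localhost:9090/api/v1/rules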

To learn more, see the Helm chart’s values.yaml file.

Summary

In this guide, you learned how to migrate the Kube-Prometheus stack’s recording and alerting rules to Grafana Cloud Alerting.

To learn how to enable multi-cluster support for Kube-Prometheus rules, see Enable multi-cluster support.