
Importing recording and alerting rules

Introduction

In this guide you’ll import Kube-Prometheus’s recording and alerting rules into Grafana Cloud. Recording rules precompute the results of expensive queries at a configurable interval, reducing load on your Prometheus instances and improving query performance. To learn more, please see Recording rules in the Prometheus docs. Alerting rules allow you to define alert conditions based on PromQL queries against your Prometheus metrics.

You should avoid evaluating the same recording rules in both your local Prometheus instance and on Grafana Cloud, as this will create additional data points for the same time series. You may wish to split up recording and alerting rule evaluation across your local Prometheus instance and Grafana Cloud Prometheus. This allows you to keep alerting and recording rule evaluation local to your cluster, and use Grafana Cloud for rules that require global, multi-cluster aggregations.
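For example, an aggregation across clusters like the following is a natural fit for evaluation on Grafana Cloud Prometheus, since no single local Prometheus instance sees every cluster’s series. This is an illustrative sketch, not a rule shipped with the stack; it assumes each cluster attaches a distinct `cluster` external label (as configured later in this guide):

```yaml
groups:
  - name: multi_cluster.rules
    rules:
      # Sums CPU usage per cluster across every cluster shipping
      # metrics to this Grafana Cloud stack. The record name is
      # hypothetical -- name yours to match your conventions.
      - record: multi_cluster:node_cpu_seconds_total:rate5m
        expr: sum by (cluster) (rate(node_cpu_seconds_total[5m]))
```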

Warning: To prevent abuse, your stack has been limited by default to 20 rules per rule group and 35 rule groups. The Kube-Prometheus stack currently contains roughly 28 rule groups, and one group, kube-apiserver.rules, contains 21 rules. To import the full set of recording rules to Grafana Cloud, you will need to increase your default rules-per-group limit from 20, which you can do by opening a ticket with Support from the Cloud Portal. You can also break up this rule group into two smaller rule groups, and import these smaller groups, which fit within the 20 rules-per-group default limit. To learn more, please see Defining recording rules.
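A split rule group file might look like the following sketch. The group names and the division point are illustrative; copy the actual rule definitions unchanged from the original kube-apiserver.rules group:

```yaml
groups:
  - name: kube-apiserver.rules.1
    rules:
      # first batch (at most 20) of the original group's rules,
      # copied unchanged from kube-apiserver.rules
      - record: apiserver_request:burnrate1d
        expr: sum(...)  # original expression, unchanged
  - name: kube-apiserver.rules.2
    rules:
      # remaining rules from the original group, copied unchanged
```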

Note: To learn how to enable multi-cluster support for Kube-Prometheus rules, please see Enabling multi-cluster support (optional).

Step 1: Extract Kube-Prometheus rules from Prometheus Pod

In this step we’ll extract the Kube-Prometheus recording and alerting rules from the Prometheus Pod’s filesystem.

Since we’re using the Kube-Prometheus Helm chart in this quickstart, there are two ways to extract the rules YAML files:

  • kubectl exec into the Prometheus container and copy out the rules files, which Helm generated from the templated source
  • Use the helm template command to locally generate the K8s manifests, and extract the rules files from this output

In this guide, we’ll use the first method. If you’re not using Helm, you may wish to generate or extract the rules files directly from the Jsonnet source, or import the Jsonnet source directly using a tool like Grizzly. These methods are beyond the scope of this guide. You can also find a reduced set of dashboards and rules in the kubernetes-mixin project’s GitHub repo; the Kubernetes mixin is a subset of the Kube-Prometheus stack.
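If you prefer the second method, one way to pull the rule manifests out of the rendered output is sketched below. The two-document rendered.yaml created here is a stand-in for real `helm template` output so the extraction step is runnable as-is, and the multi-character record separator requires GNU awk or mawk:

```shell
# In practice you would generate rendered.yaml with something like:
#   helm template your_release_name prometheus-community/kube-prometheus-stack > rendered.yaml
# A small fake stands in here so the extraction step itself can run anywhere.
cat > rendered.yaml <<'EOF'
---
kind: Service
metadata:
  name: demo-service
---
kind: PrometheusRule
metadata:
  name: demo-rules
EOF

# Split on YAML document separators and keep only PrometheusRule documents
awk -v RS='---\n' '/kind: PrometheusRule/ { print "---"; printf "%s", $0 }' \
  rendered.yaml > rules-only.yaml
cat rules-only.yaml
```

From rules-only.yaml you would then extract each manifest’s spec.groups section into standalone rules files.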

To begin, download the latest release of cortex-tools for your platform from the project’s Releases page, and confirm that you can run the cortextool binary locally.

Once you’ve confirmed that you can run cortextool locally, copy the rules files from your Prometheus instance’s container using kubectl exec:

kubectl exec -n "default" "prometheus-foo-kube-prometheus-stack-prometheus-0" -- tar cf - "/etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0" | tar xf -

In this command, replace:

  • prometheus-foo-kube-prometheus-stack-prometheus-0 with the name of your Prometheus Pod
  • /etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0 with the path to Prometheus’s rules directory. You can find this by port-forwarding to your Prometheus Pod using kubectl port-forward, accessing Prometheus’s configuration from the UI (Status -> Configuration), and searching for the rule_files parameter.

This will create a directory called etc in your current directory. Navigate through the nested hierarchy to locate a set of rules YAML files. Note that these files are symlinks to the actual rules definitions, which are in a hidden directory. Copy out these rule definitions to a more convenient location.
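The copy-out step can be scripted with cp -L, which dereferences symlinks as it copies. The directory and file names below are illustrative stand-ins for the real extracted hierarchy:

```shell
# Recreate, in miniature, the layout the tar extraction produces:
# rule files are symlinks into a hidden ..data directory.
mkdir -p etc/prometheus/rules/demo-rulefiles-0/..data
printf 'groups: []\n' > etc/prometheus/rules/demo-rulefiles-0/..data/demo.rules.yaml
ln -sf ..data/demo.rules.yaml etc/prometheus/rules/demo-rulefiles-0/demo.rules.yaml

# -L follows the symlinks, so the copies are regular files you can edit and load
mkdir -p rules
cp -L etc/prometheus/rules/demo-rulefiles-0/*.yaml rules/
ls -l rules/
```

With your real extracted directory, point the cp -L source glob at the rulefiles directory instead of the demo path.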

With the Kube Prometheus rules files available locally, we can upload them to Cloud Prometheus using cortex-tools.

Step 2: Load rules into Grafana Cloud Prometheus

In this step you’ll use cortex-tools to load the Kube-Prometheus stack recording and alerting rules into your Cloud Prometheus endpoint.

We’ll use the rules load command to load the defined rule groups to Grafana Cloud using the HTTP API.

Warning: You may need to increase your stack’s default ruler limits or break up rule groups as described in the introductory note before running this command.
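Before loading, you can sanity-check group sizes locally. This sketch counts record and alert entries per group with awk; the demo rules file stands in for your real extracted files:

```shell
# Minimal demo rules file -- substitute your extracted rule files here
cat > demo.rules.yaml <<'EOF'
groups:
- name: small.group
  rules:
  - record: job:up:sum
    expr: sum by (job) (up)
  - alert: TargetDown
    expr: up == 0
EOF

# Count record/alert entries under each "- name:" group heading
awk '
  /^- name:/          { if (g) print g, n; g = $3; n = 0 }
  /- (record|alert):/ { n++ }
  END                 { if (g) print g, n }
' demo.rules.yaml > group-sizes.txt
cat group-sizes.txt   # any group over 20 needs splitting or a limit increase
```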

Active Series Warning: Note that your active series usage will increase with this step.

cortextool rules load --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_api_key> *.yaml

This command will load the rules files into Cortex’s rule evaluation engine. If you encounter any errors, you can use the --log.level=debug flag to increase the tool’s verbosity.

Replace the parameters in the above command with the appropriate values, which you can find in your Grafana Cloud Portal, and be sure to omit the /api/prom path from the endpoint URL. To learn how to create an API key, see Create a Grafana Cloud API key.

After loading the rules, navigate to your hosted Grafana instance, then Grafana Cloud Alerting in the left-hand navigation menu, and finally Rules. From here, select the appropriate datasource from the dropdown (ending in -prom). You should see a list of alerting and recording rules.

In the previous step, we limited metrics shipped to Grafana Cloud from the local Prometheus instance to only those referenced in dashboards. We now need to expand this set of metrics to include those referenced in the recording and alerting rules we just imported.

Step 3: Expand allowlist to capture rules metrics

In this step you’ll expand the allowlist of shipped metrics to include those referenced in Kube-Prometheus’s recording and alerting rules.

Active Series Warning: Note that your active series usage will increase with this step.

Open the values.yaml file you used to configure remote_write in Step 2, and modify it as follows:

prometheus:
  prometheusSpec:
    remoteWrite:
    - url: "<Your Cloud Prometheus instance remote_write endpoint>"
      basicAuth:
          username:
            name: kubepromsecret
            key: username
          password:
            name: kubepromsecret
            key: password
      writeRelabelConfigs:
      - sourceLabels:
        - "__name__"
        regex: ":node_memory_MemAvailable_bytes:sum|aggregator_unavailable_apiservice|aggregator_unavailable_apiservice_total|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_cluster_members|alertmanager_config_hash|alertmanager_config_last_reload_successful|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request:availability30d|apiserver_request:burnrate1d|apiserver_request:burnrate1h|apiserver_request:burnrate2h|apiserver_request:burnrate30m|apiserver_request:burnrate3d|apiserver_request:burnrate5m|apiserver_request:burnrate6h|apiserver_request_duration_seconds_bucket|apiserver_request_duration_seconds_count|apiserver_request_terminations_total|apiserver_request_total|cluster:node_cpu_seconds_total:rate5m|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code:apiserver_request_total:increase30d|code_resource:apiserver_request_total:rate5m|code_verb:apiserver_request_total:increase1h|code_verb:apiserver_request_total:increase30d|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:node_disk_io_time_weighted_seconds:rate5m|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_spec_completions|kube_job_status_succeeded|kube_node_spec_taint|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_status_phase|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_replicaset_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_shard_ordinal|kube_state_metrics_total_shards|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_certificate_manager_client_ttl_seconds|kubelet_certificate_manager_server_ttl_seconds|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|kubernetes_build_info|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_memory_Slab_bytes|node_namespace_pod:kube_pod_info:|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_packets_total|node_network_transmit_bytes_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_packets_total|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|node_textfile_scrape_error|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_vmstat_pgmajfault|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_config_last_reload_successful|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_notifications_alertmanagers_discovered|prometheus_notifications_errors_total|prometheus_notifications_queue_capacity|prometheus_notifications_queue_length|prometheus_notifications_sent_total|prometheus_operator_list_operations_failed_total|prometheus_operator_list_operations_total|prometheus_operator_managed_resources|prometheus_operator_node_address_lookup_errors_total|prometheus_operator_ready|prometheus_operator_reconcile_errors_total|prometheus_operator_reconcile_operations_total|prometheus_operator_syncs|prometheus_operator_watch_operations_failed_total|prometheus_operator_watch_operations_total|prometheus_remote_storage_failed_samples_total|prometheus_remote_storage_highest_timestamp_in_seconds|prometheus_remote_storage_queue_highest_sent_timestamp_seconds|prometheus_remote_storage_samples_failed_total|prometheus_remote_storage_samples_total|prometheus_remote_storage_shards_desired|prometheus_remote_storage_shards_max|prometheus_remote_storage_succeeded_samples_total|prometheus_rule_evaluation_failures_total|prometheus_rule_group_iterations_missed_total|prometheus_rule_group_rules|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_metadata_cache_entries|prometheus_target_scrape_pool_exceeded_label_limits_total|prometheus_target_scrape_pool_exceeded_target_limit_total|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_compactions_failed_total|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|prometheus_tsdb_reloads_failures_total|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_manager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
        action: "keep"
    replicaExternalLabelName: "__replica__"
    externalLabels: {cluster: "test"}

Note: In this guide, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as dashboards and rules are updated, and you should regenerate the metrics allowlist using cortextool.
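Once you have a regenerated list of metric names, joining them into the `|`-separated regex used by the writeRelabelConfigs allowlist is a one-liner. The metric names below are illustrative:

```shell
# One metric name per line, as you'd assemble from the analysis output
printf '%s\n' up node_load1 node_cpu_seconds_total > metrics.txt

# Join the lines with "|" to produce the value for the regex field
paste -sd'|' metrics.txt > allowlist.txt
cat allowlist.txt
```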

Here, we expand the allowlist of metrics to include those referenced in recording and alerting rules. As an additional step, we’ll turn off recording and alerting rule evaluation in our local Prometheus cluster.

Step 4: Disable local Prometheus rules evaluation

Now that you’ve imported the Kube Prometheus recording and alerting rules to Grafana Cloud, you may wish to turn off local Prometheus rule evaluation, as this will create additional data points and alerts for the same time series.

To disable local rule evaluation, add the following to your values.yaml Helm configuration file:

. . .
defaultRules:
  create: false
  rules:
    alertmanager: false
    etcd: false
    general: false
    k8s: false
    kubeApiserver: false
    kubeApiserverAvailability: false
    kubeApiserverError: false
    kubeApiserverSlos: false
    kubelet: false
    kubePrometheusGeneral: false
    kubePrometheusNodeAlerting: false
    kubePrometheusNodeRecording: false
    kubernetesAbsent: false
    kubernetesApps: false
    kubernetesResources: false
    kubernetesStorage: false
    kubernetesSystem: false
    kubeScheduler: false
    kubeStateMetrics: false
    network: false
    node: false
    prometheus: false
    prometheusOperator: false
    time: false

Roll out the changes using helm upgrade:

helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack

Once the changes have been rolled out, use port-forward to forward a local port to your Prometheus Service:

kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090

Navigate to http://localhost:9090 in your browser, then go to Status -> Rules. Verify that Prometheus is no longer evaluating recording and alerting rules.

To learn more, please see the Helm chart’s values.yaml file.

Conclusion

In this guide you learned how to migrate the Kube-Prometheus stack’s recording and alerting rules to Grafana Cloud Alerting.

To learn how to enable multi-cluster support for Kube-Prometheus rules, please see Enabling multi-cluster support (optional).