
Configure Kubernetes Monitoring

When you ship your Kubernetes cluster metrics and logs to Grafana Cloud, you can monitor and alert on resource usage and operations using our packaged dashboards and alerting rules.

By following the Kubernetes Monitoring quickstart instructions, you gain access to the following:

  • Preconfigured manifests for deploying Grafana Agent and kube-state-metrics to your clusters.
  • 9 Grafana dashboards to drill into resource usage and cluster operations, from the multi-cluster level down to individual containers and Pods.
  • A set of recording rules to speed up dashboard queries.
  • A set of alerting rules to alert on conditions such as Pods crash looping or Pods getting stuck in “not ready” status.
  • A pre-configured (optional) allowlist of metrics referenced in the above dashboards, recording rules, and alerting rules to reduce your active series usage while still giving you visibility into core cluster metrics.
  • Kubernetes Events in Grafana Cloud logs (beta). To learn more about this feature, please see Kubernetes Events.

We are heavily indebted to the open source kubernetes-mixin project, from which the dashboards, recording rules, and alerting rules have been derived. We will continue to contribute bug fixes and new features upstream.


Install dashboards and alerts

When you install the packaged set of dashboards and alerts, you must also deploy Grafana Agent by following the quickstart instructions.

NOTE: You must have the Admin role to install dashboards and alerts, and to view the Grafana Agent configuration instructions.

  1. Navigate to your Grafana Cloud instance.

  2. Click the Kubernetes Monitoring icon (ship wheel).

  3. Click Install dashboards and rules and follow the instructions. If you’ve installed the dashboards and alerts, but haven’t deployed and configured Agent to scrape metrics and collect logs, the instructions will show you how to deploy the following:

    • A single-replica Grafana Agent StatefulSet that collects Prometheus metrics and Kubernetes events from objects in your cluster.
    • The kube-state-metrics Helm chart, which deploys a KSM Deployment and Service, along with some other access control objects.
    • A Grafana Agent DaemonSet that collects logs from Pods in your cluster.

Reinstall or upgrade

Grafana Agent, dashboards, alerting rules, recording rules, kube-state-metrics, and Kubernetes manifests are updated regularly. You must update these components manually to take advantage of any updates. To learn how to do this, see Updating Kubernetes Monitoring components.

Scraping application Pod metrics

By default, Grafana Kubernetes Monitoring only scrapes cAdvisor (1 per node), kubelet (1 per node), and kube-state-metrics (1 replica by default) endpoints. You can also configure Grafana Agent to scrape application Prometheus metrics, like those available at the standard /metrics endpoint on Pods.

Take the following steps to add a scrape job targeting all /metrics endpoints on your cluster Pods:

  1. Add the following to the bottom of your Agent scrape config:

    . . .
    - job_name: "kubernetes-pods"
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Example relabel to scrape only pods that have
        # "example.io/should_be_scraped = true" annotation.
        #  - source_labels: [__meta_kubernetes_pod_annotation_example_io_should_be_scraped]
        #    action: keep
        #    regex: true
        # Example relabel to customize metric path based on pod
        # "example.io/metric_path = <metric path>" annotation.
        #  - source_labels: [__meta_kubernetes_pod_annotation_example_io_metric_path]
        #    action: replace
        #    target_label: __metrics_path__
        #    regex: (.+)
        # Example relabel to scrape only single, desired port for the pod
        # based on pod "example.io/scrape_port = <port>" annotation.
        #  - source_labels: [__address__, __meta_kubernetes_pod_annotation_example_io_scrape_port]
        #    action: replace
        #    regex: ([^:]+)(?::\d+)?;(\d+)
        #    replacement: $1:$2
        #    target_label: __address__
        # Expose Pod labels as metric labels
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        # Expose Pod namespace as metric namespace label
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: namespace
        # Expose Pod name as metric pod label
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: pod

    This config adds every defined Pod container port to Agent’s scrape targets, discovered using Agent’s Kubernetes service discovery mechanism. You can optionally uncomment the relevant sections to customize the metrics path (the default is /metrics), specify a scrape port, or use Pod annotations to declaratively specify which targets Agent should scrape in your Pod manifests. To learn more, please see the examples in the official Prometheus project repo.

    To learn more about configuring the Agent, please see Configure Grafana Agent from the Agent docs. To learn more about available kubernetes_sd_configs roles (we used the pod role here) and labels, please see kubernetes_sd_config from the Prometheus docs.

  2. Deploy the updated config into your cluster using kubectl apply -f:

    kubectl apply -f <YOUR-CONFIGMAP>.yaml
  3. Restart the Agent to pick up the config changes. Because the metrics-collecting Agent runs as a single-replica StatefulSet, restart it with:

    kubectl rollout restart statefulset/grafana-agent
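
The commented relabel rules in the scrape config above key off example.io/* Pod annotations (the names are illustrative; they match the __meta_kubernetes_pod_annotation_example_io_* source labels shown in the comments). A Pod opting in to scraping under that scheme might look like the following sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    # Matches the commented "keep" relabel rule: only annotated Pods are scraped
    example.io/should_be_scraped: "true"
    # Matches the commented __metrics_path__ relabel rule
    example.io/metric_path: "/metrics"
    # Matches the commented __address__ relabel rule
    example.io/scrape_port: "8080"
spec:
  containers:
    - name: my-app
      image: my-app:latest
      ports:
        - containerPort: 8080
```

With all three commented sections uncommented in the scrape config, this Pod would be scraped at port 8080 on the /metrics path, while unannotated Pods would be dropped.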

Configured scrape targets

By default, Agent scrapes the following targets:

  • cAdvisor, which is present on each node in your cluster and emits container resource usage metrics like CPU usage, memory usage, and disk usage
  • kubelet, which is present on each node and emits metrics specific to the kubelet process like kubelet_running_pods and kubelet_running_container_count
  • kube-state-metrics, which runs as a Deployment and Service in your cluster and emits Prometheus metrics that track the state of objects in your cluster, like Pods, Deployments, DaemonSets, and more

The default ConfigMap configures an allowlist to drop all metrics not referenced in the Kubernetes Monitoring dashboards, alerts, and recording rules. You can optionally modify this allowlist, replace it with a denylist (by using the drop directive), omit it entirely, or move it to the remote_write level so that it applies globally to all configured scrape jobs. To learn more, please see Reducing Prometheus metrics usage with relabeling.
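
As an illustration, an allowlist of this kind is a keep-style metric_relabel_configs block matching on the metric name. A minimal sketch (the metric names in the regex are illustrative, not the full default list):

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    # Keep only metrics whose names match this regex; all others are dropped
    # before remote_write. Swapping "keep" for "drop" turns this allowlist
    # into a denylist.
    regex: kubelet_running_pods|container_cpu_usage_seconds_total|kube_pod_info
```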


Dashboards

Kubernetes Monitoring includes 9 dashboards out of the box to help you get started with observing and monitoring your Kubernetes clusters and their workloads. This set includes the following:

  • (Home) Kubernetes Overview, the principal dashboard that displays high-level cluster resource usage and configuration status.

  • Kubernetes / Compute Resources (7 dashboards), a set of dashboards to drill down into resource usage by the following levels:

    • Multi-cluster
    • Cluster
    • Namespace (by Pods)
    • Namespace (by workloads, like Deployments or DaemonSets)
    • Node
    • Pods and containers
    • Workloads (Deployments, DaemonSets, StatefulSets, etc.)

    These dashboards contain links to sub-objects, so you can jump from cluster, to Namespace, to Pod, etc.

  • Kubernetes / Kubelet, a dashboard that helps you understand Kubelet performance on your Nodes, and provides useful summary metrics like the number of running Pods, Containers, and Volumes on a given Node.

  • Kubernetes / Persistent Volumes, a dashboard that helps you understand usage of your configured PersistentVolumes.

Alerting Rules

The following alerting rules are pre-configured to help you get up and running with Grafana Cloud alerts and get notified when issues arise with your clusters and their workloads:

Kubelet alerts:

  • KubeNodeNotReady
  • KubeNodeUnreachable
  • KubeletTooManyPods
  • KubeNodeReadinessFlapping
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletClientCertificateExpiration (at warning and critical severities)
  • KubeletServerCertificateExpiration (at warning and critical severities)
  • KubeletClientCertificateRenewalErrors
  • KubeletServerCertificateRenewalErrors
  • KubeletDown

Kubernetes system alerts:

  • KubeVersionMismatch
  • KubeClientErrors

Kubernetes resource usage alerts:

  • KubeCPUOvercommit
  • KubeMemoryOvercommit
  • KubeCPUQuotaOvercommit
  • KubeMemoryQuotaOvercommit
  • KubeQuotaAlmostFull
  • KubeQuotaFullyUsed
  • KubeQuotaExceeded
  • CPUThrottlingHigh

Kubernetes app alerts:

  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeDaemonSetRolloutStuck
  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetMisScheduled
  • KubeJobCompletion
  • KubeJobFailed
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut

To learn more, see the upstream kubernetes-mixin’s Kubernetes Alert Runbooks page. You can programmatically update the runbook links in these pre-configured alerting rules to point to your own runbooks, using a tool like cortex-tools or grizzly. To learn more, see Prometheus and Loki rules with mimirtool and Alerts.

Recording Rules

Kubernetes Monitoring includes the following recording rules to speed up dashboard queries and alerting rule evaluation:

  • node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  • node_namespace_pod_container:container_memory_working_set_bytes
  • node_namespace_pod_container:container_memory_rss
  • node_namespace_pod_container:container_memory_cache
  • node_namespace_pod_container:container_memory_swap
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
  • namespace_memory:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • namespace_cpu:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
  • namespace_memory:kube_pod_container_resource_limits:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • namespace_cpu:kube_pod_container_resource_limits:sum
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel

Note that recording rules may emit time series with the same metric name, but different labels.

To learn how to modify these programmatically, please see Prometheus and Loki rules with mimirtool.
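
To illustrate the naming convention, a recording rule pairs a rule name with the PromQL expression it precomputes. The following is a simplified sketch of the first rule above, not the exact upstream kubernetes-mixin definition (which also joins in the node label via kube_pod_info):

```yaml
groups:
  - name: k8s.rules
    rules:
      - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
        # Precompute per-container CPU usage so dashboards and alerts can
        # query the recorded series instead of re-evaluating irate() over
        # raw cAdvisor data on every refresh.
        expr: |
          sum by (namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{image!=""}[5m])
          )
```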

Metrics and usage

By default, Kubernetes Monitoring configures allowlists using Prometheus relabel_config blocks. To learn more about relabel_configs, metric_relabel_configs and write_relabel_configs, see Reducing Prometheus metrics usage with relabeling.

These allowlists drop any metrics not referenced in the dashboards, rules, and alerts. To omit or modify the allowlists, modify the corresponding metric_relabel_configs blocks in your Agent configuration. To learn more about analyzing and controlling active series usage, please consult Control Prometheus metrics usage.

Grafana Cloud billing is based on billable series. To learn more about the pricing model, please consult Active series and DPM.

Default active series usage varies depending on your Kubernetes cluster size (number of Nodes) and running workloads (number of Pods, containers, Deployments, etc.).

When testing on a Cloud provider’s Kubernetes offering, the following active series usage was observed:

  • 3 node cluster, 17 running pods, 31 running containers: 3.8k active series
    • The only Pods deployed into the cluster by hand were Grafana Agent and kube-state-metrics; the rest ran in the kube-system Namespace and were managed by the cloud provider
  • From this baseline, active series usage roughly increased by:
    • 1000 active series per additional Node
    • 75 active series per additional Pod (vanilla Nginx Pods were deployed into the cluster)

These are very rough guidelines and results may vary depending on your Cloud provider or Kubernetes version. Note also that these figures are based on the scrape targets configured above, and not additional targets like application metrics, API server metrics, and scheduler metrics.
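
These figures can be turned into a back-of-the-envelope estimate. The sketch below simply extrapolates from the observed baseline (3 Nodes, 17 Pods, roughly 3.8k series) using the per-Node and per-Pod increments quoted above; treat it as a rough planning aid, not a billing calculation:

```python
def estimate_active_series(nodes: int, pods: int) -> int:
    """Rough active-series estimate for the default Kubernetes Monitoring
    scrape targets, extrapolated from an observed baseline of
    3 Nodes / 17 Pods ~= 3,800 active series."""
    BASELINE_SERIES = 3800
    BASELINE_NODES = 3
    BASELINE_PODS = 17
    SERIES_PER_NODE = 1000  # rough increment per additional Node
    SERIES_PER_POD = 75     # rough increment per additional Pod

    return (BASELINE_SERIES
            + max(nodes - BASELINE_NODES, 0) * SERIES_PER_NODE
            + max(pods - BASELINE_PODS, 0) * SERIES_PER_POD)

# A 10-node cluster running 100 Pods:
print(estimate_active_series(10, 100))  # -> 17025
```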


Logs

The default setup instructions roll out a Grafana Agent DaemonSet that collects logs from all Pods running in your cluster and ships them to Grafana Cloud Loki.


Traces

Kubernetes Monitoring will soon support out-of-the-box configuration for shipping traces to your hosted Tempo endpoint. In the meantime, you can get started shipping traces to Grafana Cloud by following the Agent Traces Quickstart, which rolls out a single-replica Agent Deployment that receives traces and remote_writes them to Grafana Cloud.

Grafana Cloud integrations

Grafana Cloud will soon support integrations on Kubernetes as a platform, like the Linux Node integration (node-exporter), Redis integration, MySQL integration, and many more. In the meantime, to use embedded Agent exporters/integrations, you must configure them manually. To learn how to do this, please see integrations_config.
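
For reference, enabling an embedded integration in the Agent's static configuration looks roughly like the following sketch; consult integrations_config for the authoritative schema and any required fields:

```yaml
integrations:
  # Ship the Agent's own internal metrics.
  agent:
    enabled: true
  # Run the embedded node-exporter integration to collect host metrics.
  node_exporter:
    enabled: true
```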

Node-exporter metrics

For node-exporter or host system metrics, you can roll out the node-exporter Helm Chart and add the following Agent scrape config job:

              . . .
              - job_name: integrations/node-exporter
                bearer_token_file: /var/run/secrets/
                kubernetes_sd_configs:
                  - namespaces:
                      names:
                        - NODE_EXPORTER_NAMESPACE_HERE
                    role: pod
                relabel_configs:
                  - action: keep
                    regex: prometheus-node-exporter.*
                    source_labels:
                      - __meta_kubernetes_pod_label_app
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_node_name
                    target_label: instance
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_namespace
                    target_label: namespace
                tls_config:
                  ca_file: /var/run/secrets/
                  insecure_skip_verify: false

This instructs Agent to scrape any Pod whose app label matches prometheus-node-exporter.* (the value is a regular expression). The Helm chart sets this label by default, but if you modify the chart’s values.yaml file or override its other defaults, you may need to adjust this scrape job accordingly. To learn more, please see this helpful set of examples.

Correlating data across metrics, logs, and traces

Documentation for configuring correlation across metrics, logs, and traces, specifically for Kubernetes workloads, is forthcoming.

In the interim, please consult Intro to monitoring Kubernetes with Grafana Cloud. Note that this video was published before the release of Kubernetes Monitoring, so some concepts may differ slightly.

Kubernetes events (beta)

Kubernetes events provide helpful logging information emitted by K8s cluster controllers. Grafana Agent contains an embedded integration that watches for event objects in your clusters, and ships them to Grafana Cloud for long-term storage and analysis. To enable this feature, see Kubernetes events. The setup instructions will enable this feature by default in the Grafana Agent StatefulSet.