Understand Kubernetes Monitoring

Grafana Kubernetes Monitoring lets you view all of your Kubernetes data in one place. If you are shipping kube-state-metrics to Grafana Cloud, you can inspect the health of your clusters, containers, and pods with little or no required configuration. You can also access preconfigured dashboards, alert rules, and recording rules.

This topic provides an overview of Kubernetes Monitoring. To get started right away with Kubernetes Monitoring, select one of the following:

Topic | Description
Configure Kubernetes Monitoring | Install and configure Kubernetes Monitoring to monitor your cluster. Learn what a Kubernetes Monitoring configuration provides, including manifests, dashboards, alert rules, and more.
Monitor an application on Kubernetes | Learn how to deploy an instrumented three-tier (data layer, app logic layer, load-balancing layer) web application into a Kubernetes cluster, and leverage Grafana Cloud’s built-in Kubernetes features for monitoring this application.

For a tour of Kubernetes Monitoring, see Navigate Kubernetes Monitoring.

Kubernetes Monitoring features

Kubernetes Monitoring gives you access to the following:

Component | Description
Manifests | Preconfigured manifests for deploying Grafana Agent, Grafana’s telemetry collector, and kube-state-metrics to your clusters. See kube-state-metrics to learn which metrics Kubernetes Monitoring scrapes by default.
Dashboards | Nine Grafana dashboards for drilling into resource usage and cluster operations, from the multi-cluster level down to individual containers and Pods.
Recording rules | A set of recording rules to speed up dashboard queries.
Alerting rules | A set of alerting rules to alert on conditions, for example, Pods crash looping or Pods stuck in a “not ready” status.
Allowlist | An optional preconfigured allowlist of metrics referenced in the above dashboards, recording rules, and alerting rules, which reduces your active series usage while still giving you visibility into core cluster metrics.
Events | Grafana Agent can be configured with an eventhandler integration that watches for Kubernetes events in your clusters.

We are heavily indebted to the open source kubernetes-mixin project, from which the dashboards, recording rules, and alerting rules have been derived. We will continue to contribute bug fixes and new features upstream.

kube-state-metrics

The following metrics are required to use the Kubernetes Monitoring Cluster navigation feature:

- kube_namespace_status_phase
- container_cpu_usage_seconds_total
- kube_pod_status_phase
- kube_pod_start_time
- kube_pod_container_status_restarts_total
- kube_pod_container_info
- kube_pod_container_status_waiting_reason
- kube_daemonset.*
- kube_replicaset.*
- kube_statefulset.*
- kube_job.*
- kube_node*
- kube_cluster*
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_size_bytes
- node_namespace_pod_container
- container_memory_working_set_bytes
- job="integrations/kubernetes/eventhandler" (for event logs, comes default with Grafana agent)

NOTE: Logs are not required for Kubernetes Monitoring to work, but they provide additional context in some views of the Cluster Navigation tab. Log entries must be shipped to a Loki data source with cluster, namespace, and pod labels.
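
For example, if your log entries carry those labels, a Loki query scoped to a single workload might look like the following (the cluster and pod names here are hypothetical):

    {cluster="my-cluster", namespace="default", pod="my-app-7d9c5b6f4-x2x7l"}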

Alerting Rules

The following alerting rules are preconfigured to help you get up and running with Grafana Cloud alerts. You will be notified when issues arise with your clusters and their workloads.

Kubelet alerts

  • KubeNodeNotReady
  • KubeNodeUnreachable
  • KubeletTooManyPods
  • KubeNodeReadinessFlapping
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletClientCertificateExpiration
  • KubeletClientCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletClientCertificateRenewalErrors
  • KubeletServerCertificateRenewalErrors
  • KubeletDown

Kubernetes system alerts

  • KubeVersionMismatch
  • KubeClientErrors

Kubernetes resource usage alerts

  • KubeCPUOvercommit
  • KubeMemoryOvercommit
  • KubeCPUQuotaOvercommit
  • KubeMemoryQuotaOvercommit
  • KubeQuotaAlmostFull
  • KubeQuotaFullyUsed
  • KubeQuotaExceeded
  • CPUThrottlingHigh

Kubernetes app alerts

  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeDaemonSetRolloutStuck
  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetMisScheduled
  • KubeJobCompletion
  • KubeJobFailed
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut

To learn more, see the upstream kubernetes-mixin’s Kubernetes Alert Runbooks page. You can programmatically update the runbook links in these preconfigured alerting rules to point to your own runbooks, using a tool like cortex-tools or grizzly. For details, see Prometheus and Loki rules with mimirtool and Alerts.
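
As a sketch, a mimirtool workflow for editing runbook links might look like the following; the endpoint and credentials are placeholders for your own Grafana Cloud stack:

    # Point mimirtool at your Grafana Cloud Prometheus (Mimir) endpoint
    export MIMIR_ADDRESS="https://prometheus-us-central1.grafana.net"
    export MIMIR_TENANT_ID="<your Prometheus user ID>"
    export MIMIR_API_KEY="<your Grafana Cloud API key>"

    # Export the current rule groups, edit the runbook_url annotations
    # locally, then load the modified file back
    mimirtool rules print > rules.yaml
    mimirtool rules load rules.yaml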

Recording Rules

Kubernetes Monitoring includes the following recording rules to speed up dashboard queries and alerting rule evaluation:

  • node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  • node_namespace_pod_container:container_memory_working_set_bytes
  • node_namespace_pod_container:container_memory_rss
  • node_namespace_pod_container:container_memory_cache
  • node_namespace_pod_container:container_memory_swap
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
  • namespace_memory:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • namespace_cpu:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
  • namespace_memory:kube_pod_container_resource_limits:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • namespace_cpu:kube_pod_container_resource_limits:sum
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel

Note that recording rules may emit time series with the same metric name, but different labels.

To learn how to modify these programmatically, see Prometheus and Loki rules with mimirtool.
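
To illustrate what these rules buy you, a dashboard panel can read the precomputed series instead of re-aggregating raw cAdvisor metrics on every refresh. The PromQL pair below is a simplified sketch, not the exact upstream rule definition:

    # What a panel would otherwise compute on each refresh
    sum by (namespace, pod) (irate(container_cpu_usage_seconds_total{image!=""}[5m]))

    # Roughly equivalent lookup against the precomputed series
    sum by (namespace, pod) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate)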

Allowlists

By default, Kubernetes Monitoring configures allowlists using Prometheus relabel_config blocks. To learn more about relabel_configs, metric_relabel_configs and write_relabel_configs, see Reducing Prometheus metrics usage with relabeling.

These allowlists drop any metrics not referenced in the dashboards, rules, and alerts. To omit or modify the allowlists, modify the corresponding metric_relabel_configs blocks in your Agent configuration. To learn more about analyzing and controlling active series usage, see Control Prometheus metrics usage.
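
For reference, an allowlist is implemented as a keep action on the metric name. A minimal sketch (the regex below lists only a few of the metrics named above) looks like this:

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: kube_pod_status_phase|kube_pod_container_status_restarts_total|node_cpu_seconds_total
        action: keep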

Grafana Cloud billing is based on billable series. To learn more about the pricing model, see Active series and DPM.

Default active series usage varies depending on your Kubernetes cluster size (number of nodes) and running workloads (number of Pods, containers, Deployments, etc.).

When testing on a cloud provider’s managed Kubernetes offering, the following active series usage was observed:

  • 3-node cluster, 17 running Pods, 31 running containers: 3.8k active series
    • The only Pods deployed into the cluster were Grafana Agent and kube-state-metrics. The rest were running in the kube-system Namespace and managed by the cloud provider.
  • From this baseline, active series usage roughly increased by:
    • 1,000 active series per additional node
    • 75 active series per additional Pod (vanilla Nginx Pods were deployed into the cluster)

These are rough guidelines, and results may vary depending on your cloud provider and Kubernetes version. Note also that these figures reflect only the scrape targets configured above, and do not include additional targets such as application metrics, API server metrics, or scheduler metrics.
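
As a rough worked example using these figures: a 10-node cluster running roughly 100 Pods would land near 3.8k + (7 additional nodes × 1,000) + (83 additional Pods × 75) ≈ 17k active series.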

Logs

The default setup instructions roll out a Grafana Agent DaemonSet to collect logs from all Pods running in your cluster and ship them to Grafana Cloud Loki.

Traces

Kubernetes Monitoring will soon support out-of-the-box configuration for shipping traces to your hosted Tempo endpoint. In the meantime, you can get started shipping traces to Grafana Cloud by following the Ship Kubernetes traces using Grafana Agent guide. This rolls out a single-replica Agent Deployment that receives traces and uses remote_write to ship them to Grafana Cloud.
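
As a rough sketch, a corresponding Agent traces block might look like the following; the OTLP receiver choice, endpoint, and credentials are placeholders rather than the guide’s exact configuration:

    traces:
      configs:
        - name: default
          receivers:
            otlp:
              protocols:
                grpc:
          remote_write:
            - endpoint: tempo-us-central1.grafana.net:443
              basic_auth:
                username: TEMPO_USERNAME
                password: TEMPO_API_KEY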

Direct deployment of Grafana Agent

You can deploy the Grafana Agent directly rather than using the Kubernetes Monitoring interface instructions. To do so, see the Grafana Agent configuration guides. The guides show you how to deploy the following:

  • A single-replica Grafana Agent StatefulSet that collects Prometheus metrics and Kubernetes events from objects in your cluster.
  • The kube-state-metrics Helm chart, which deploys a KSM Deployment and Service, along with some other access control objects.
  • A Grafana Agent DaemonSet that collects logs from Pods in your cluster.

Important: You should have only one job scraping kube-state-metrics. If you have multiple scrape jobs running at the same time, you might see an error similar to the following when you try to view objects in Cluster navigation: execution: found duplicate series for the match group...
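
One quick way to check for this condition is to count how many scrape jobs are producing kube-state-metrics series; more than one job label in the result suggests overlapping scrape configs. A sketch in PromQL:

    count by (job) (kube_pod_info)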

Grafana Cloud integrations

Grafana Cloud does not currently support integrations on Kubernetes as a platform, such as the Linux Node integration (node-exporter), the Redis integration, the MySQL integration, and others. Until this support is available, use the Agent’s embedded exporters and integrations by configuring them manually. To learn how, see integrations_config.

Node-exporter metrics

For node-exporter or host system metrics, you can roll out the node-exporter Helm Chart and add the following Agent scrape config job:

              . . .
              - job_name: integrations/node-exporter
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                kubernetes_sd_configs:
                  - role: pod
                    namespaces:
                      names:
                        - NODE_EXPORTER_NAMESPACE_HERE
                relabel_configs:
                  - action: keep
                    regex: prometheus-node-exporter.*
                    source_labels:
                      - __meta_kubernetes_pod_label_app
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_node_name
                    target_label: instance
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_namespace
                    target_label: namespace
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: false

This instructs the Agent to scrape any Pod with the label app=prometheus-node-exporter.* (the value is a regular expression). The Helm chart configures this label by default, but if you modify the chart’s values.yaml file or other defaults, you may have to adjust this scrape job accordingly. To learn more, see this set of examples.
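
If you haven’t rolled out the chart yet, installing it from the prometheus-community repository generally looks like the following; the release name and namespace are your choice, and this sketch omits any values.yaml customization:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install node-exporter prometheus-community/prometheus-node-exporter \
      --namespace NODE_EXPORTER_NAMESPACE_HERE --create-namespace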

Data correlation across metrics, logs, and traces

Because Prometheus and Grafana Loki share the same label metadata for your Kubernetes cluster, you can move between correlated Kubernetes metrics and logs.
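
For example, with shared labels you can flip between a metrics query and a logs query for the same workload; the cluster and workload names below are hypothetical:

    # PromQL: CPU usage for a workload's containers
    rate(container_cpu_usage_seconds_total{cluster="my-cluster", namespace="default", pod=~"my-app.*"}[5m])

    # LogQL: logs for the same workload, using the same labels
    {cluster="my-cluster", namespace="default", pod=~"my-app.*"}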

Documentation for configuring correlation across metrics, logs, and traces specifically for Kubernetes workloads is forthcoming. In the interim, see Intro to monitoring Kubernetes with Grafana Cloud. Note that this video was published before the release of Kubernetes Monitoring, so some concepts may differ slightly.

Kubernetes events (beta)

Kubernetes events provide helpful logging information emitted by Kubernetes cluster controllers. Grafana Agent includes an embedded integration that watches for event objects in your clusters and ships them to Grafana Cloud for long-term storage and analysis. To enable this feature, see Set up Kubernetes event monitoring. The setup instructions enable this feature by default in the Grafana Agent StatefulSet.
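
If you manage the Agent configuration yourself, the embedded integration is enabled with a small block like the following sketch; the cache path here is arbitrary, and the logs_instance value must match a logs instance defined elsewhere in your Agent config:

    integrations:
      eventhandler:
        cache_path: "/var/lib/agent/eventhandler.cache"
        logs_instance: "default"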