Configure Kubernetes Monitoring
Use the Kubernetes Monitoring interface to set up monitoring of your Kubernetes cluster and to install preconfigured dashboards and alerts.
- You must have the Admin role to install dashboards and alerts.
- If you don’t want to use the preconfigured Agent manifests, you can manually deploy and scrape kube-state-metrics. However, if you don’t use Grafana Agent, you cannot monitor events.
Deploy Kubernetes Monitoring
Navigate to your Grafana Cloud instance.
Click the Kubernetes Monitoring icon (ship wheel).
Click Start sending data.
Select whether you want to use Grafana Agent to send telemetry data to Grafana Cloud or to manually send kube-state-metrics.
Follow the instructions to bring your Kubernetes data into Grafana Cloud.
The instructions explain how to:
Configure using Grafana Agent or Grafana Agent Operator, or by manually sending kube-state-metrics.
Install the preconfigured dashboards and alerts.
Optionally, configure Agent to send logs.
Note: If you’ve installed the dashboards and alerts, but haven’t deployed and configured Agent to scrape metrics and collect logs, the Agent configuration instructions will show you how to deploy the following:
- A single-replica Grafana Agent StatefulSet that collects Prometheus metrics and Kubernetes events from objects in your cluster.
- The kube-state-metrics Helm chart, which deploys a KSM Deployment and Service along with some other access control objects.
- A Grafana Agent DaemonSet that collects logs from Pods in your cluster.
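After applying the manifests, you can quickly check that all three workloads came up. The commands below are a sketch; the namespace and label selector are assumptions, so adjust them to wherever you deployed the components:

```shell
# List the Agent StatefulSet, the logs DaemonSet, and the
# kube-state-metrics Deployment (namespace assumed to be "default").
kubectl get statefulset,daemonset,deployment -n default

# Confirm the Agent Pods are Running (label selector is an assumption;
# check your manifests for the actual labels).
kubectl get pods -n default -l app=grafana-agent
```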
Note that the following metrics are required to use the Kubernetes Monitoring Cluster navigation feature:
- kube_namespace_status_phase
- container_cpu_usage_seconds_total
- kube_pod_status_phase
- kube_pod_start_time
- kube_pod_container_status_restarts_total
- kube_pod_container_info
- kube_pod_container_status_waiting_reason
- kube_daemonset.*
- kube_replicaset.*
- kube_statefulset.*
- kube_job.*
- kube_node*
- kube_cluster*
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_size_bytes
- node_namespace_pod_container
- container_memory_working_set_bytes
- job="integrations/kubernetes/eventhandler" (for event logs; included by default with Grafana Agent)
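You can spot-check that these metrics are arriving by querying one of them in Explore against your hosted Prometheus data source. A sketch (the cluster label value is a placeholder for your own cluster name):

```promql
# Count of Pods per phase, using one of the required kube-state-metrics series
sum by (phase) (kube_pod_status_phase{cluster="my-cluster"})
```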
Note: Logs are not required for Kubernetes Monitoring to work, but they provide additional context in some views of the Cluster navigation tab. Log entries must be shipped to a Loki data source.
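Once logs are flowing into Loki, you can confirm the event logs with a LogQL query like the following sketch; the job label comes from the default Agent event-handler configuration listed above:

```logql
{job="integrations/kubernetes/eventhandler"} |= "Warning"
```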
Configured scrape targets
If you configure Kubernetes Monitoring using Grafana Agent, by default, Agent scrapes the following targets:
- cAdvisor (one per node): cAdvisor is present on each node in your cluster and emits container resource usage metrics like CPU usage, memory usage, and disk usage.
- kubelet (one per node): kubelet is present on each node and emits metrics specific to the kubelet process, such as running Pod and container counts.
- kube-state-metrics (one replica, by default): kube-state-metrics runs as a Deployment and Service in your cluster and emits Prometheus metrics that track the state of objects in your cluster, like Pods, Deployments, DaemonSets, and more.
The default ConfigMap configures an allowlist to drop all metrics not referenced in the Kubernetes Monitoring dashboards, alerts, and recording rules. You can optionally modify this allowlist, replace it with a denylist (by using the drop directive), omit it entirely, or move it to the remote_write level so that it applies globally to all configured scrape jobs. To learn more, see Reducing Prometheus metrics usage with relabeling.
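As a sketch of what such an allowlist looks like, the following metric_relabel_configs snippet keeps only two of the required metric names and drops everything else (the names are taken from the required-metrics list above; extend the regex for a real allowlist):

```yaml
metric_relabel_configs:
  # Allowlist: keep only metrics whose name matches the regex,
  # drop everything else.
  - source_labels: [__name__]
    action: keep
    regex: kube_pod_status_phase|container_cpu_usage_seconds_total
  # A denylist is the inverse: drop matching metrics, keep the rest.
  # - source_labels: [__name__]
  #   action: drop
  #   regex: go_.*
```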
Scrape application Pod metrics
Kubernetes Monitoring does not scrape application Prometheus metrics by default, but you can configure Grafana Agent to also scrape application Prometheus metrics, like those available at the standard /metrics endpoint on Pods.

To add a scrape job targeting all /metrics endpoints on your cluster Pods, do the following:
Add the following to the bottom of your Agent scrape config:
```yaml
# . . .
- job_name: "kubernetes-pods"
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Example relabel to scrape only Pods that have
    # "example.io/should_be_scraped = true" annotation.
    # - source_labels: [__meta_kubernetes_pod_annotation_example_io_should_be_scraped]
    #   action: keep
    #   regex: true
    #
    # Example relabel to customize metric path based on Pod
    # "example.io/metric_path = <metric path>" annotation.
    # - source_labels: [__meta_kubernetes_pod_annotation_example_io_metric_path]
    #   action: replace
    #   target_label: __metrics_path__
    #   regex: (.+)
    #
    # Example relabel to scrape only single, desired port for the Pod
    # based on Pod "example.io/scrape_port = <port>" annotation.
    # - source_labels: [__address__, __meta_kubernetes_pod_annotation_example_io_scrape_port]
    #   action: replace
    #   regex: ([^:]+)(?::\d+)?;(\d+)
    #   replacement: $1:$2
    #   target_label: __address__
    # Expose Pod labels as metric labels
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    # Expose Pod namespace as metric namespace label
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    # Expose Pod name as metric pod label
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
```
This config adds every defined Pod container port to Agent's scrape targets, discovered using Agent's Kubernetes service discovery mechanism. You can optionally uncomment the relevant sections to customize the metrics path (the default is /metrics), specify a scrape port, or use Pod annotations to declaratively specify which targets Agent should scrape in your Pod manifests. To learn more, see the examples in the official Prometheus project repo.
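For example, if you uncomment the annotation-based relabels above, a Pod that opts in to scraping might carry annotations like these. This is a sketch using the example.io annotation keys from the config; the Pod name, image, and port are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                              # placeholder name
  annotations:
    example.io/should_be_scraped: "true"    # opt this Pod in to scraping
    example.io/metric_path: "/metrics"      # path where metrics are exposed
    example.io/scrape_port: "8080"          # single port Agent should scrape
spec:
  containers:
    - name: my-app
      image: my-app:latest                  # placeholder image
      ports:
        - containerPort: 8080
```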
To learn more about configuring Agent, see Configure Grafana Agent from the Agent docs. To learn more about available kubernetes_sd_configs labels and roles (we used the pod role here), see kubernetes_sd_config from the Prometheus docs.
Deploy the updated config into your cluster using kubectl apply -f:

```shell
kubectl apply -f <YOUR-CONFIGMAP>.yaml
```
Restart the Agent to pick up the config changes:

```shell
kubectl rollout restart deployment/grafana-agent
```
For a complete example, see Monitor an app on Kubernetes using Grafana Agent.
You can manually connect your Kubernetes data to Grafana Cloud, rather than using the manifests provided in the Kubernetes Monitoring configuration, using one of the following methods:
Directly deploy Grafana Agent for metrics, logs, and traces.
Use Grafana Agent Operator (beta) to deploy and configure Grafana Agent using a Prometheus-style operator and Kubernetes custom resources.
Use remote_write to ship Prometheus metrics if you have an existing Prometheus deployment, and want to ship your Kubernetes metrics to Grafana Kubernetes Monitoring.
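For the existing-Prometheus path, the remote_write stanza is a short addition to your Prometheus configuration. A minimal sketch, where the endpoint URL, username, and API key are placeholders for your own Grafana Cloud stack's values (find them on your stack's Prometheus details page):

```yaml
remote_write:
  - url: https://<YOUR-PROMETHEUS-ENDPOINT>/api/prom/push  # placeholder push endpoint
    basic_auth:
      username: "<YOUR-INSTANCE-ID>"   # placeholder: Grafana Cloud Prometheus instance ID
      password: "<YOUR-API-KEY>"       # placeholder: Grafana Cloud API key with metrics:write
```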
Reinstall or upgrade Kubernetes Monitoring
Grafana Agent, dashboards, alerting rules, recording rules, kube-state-metrics, and Kubernetes manifests are updated regularly. You must update these components manually to take advantage of any updates. See how to update Kubernetes Monitoring components.
Related Grafana Cloud resources
How to set up and visualize synthetic monitoring at scale with Grafana Cloud
Learn how to use Kubernetes, Grafana Loki, and Grafana Cloud’s synthetic monitoring feature to set up your infrastructure's checks in this GrafanaCONline session.
Using Grafana Cloud to drive manufacturing plant efficiency
This GrafanaCONline session describes how Grafana helps a 75-year-old manufacturing company with product quality and equipment maintenance.