Manage your configuration

Kubernetes Monitoring gathers metrics, logs, and events, and calculates costs for your infrastructure. It also provides recording rules, alerting rules, and allowlists.

How Kubernetes Monitoring works

Kubernetes Monitoring uses the following to provide its data and visualizations.

cAdvisor

cAdvisor runs on each Node in your Cluster (one instance per Node) and emits container resource usage metrics such as CPU, memory, and disk usage. Alloy collects these metrics and sends them to Grafana Cloud.

Cluster events

Kubernetes Cluster controllers emit events about the lifecycle of Pods, Deployments, and Nodes within the Cluster. Alloy pulls these Cluster events from the Kubernetes API server, converts them into log lines, and sends them to Loki in Grafana Cloud.
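
As an illustration of how events can become log lines, the following is a minimal sketch using the Alloy loki.source.kubernetes_events component. The component labels (cluster_events, grafana_cloud) and the Loki URL are placeholders, and the configuration that the Helm chart actually generates may differ.

```alloy
// Sketch only: labels and the URL below are hypothetical placeholders.
loki.source.kubernetes_events "cluster_events" {
  // Write each event as a logfmt-formatted log line.
  log_format = "logfmt"

  // An empty list watches events in all namespaces.
  namespaces = []

  forward_to = [loki.write.grafana_cloud.receiver]
}

loki.write "grafana_cloud" {
  endpoint {
    // Replace with the push URL of your own Loki instance.
    url = "https://<your-loki-host>/loki/api/v1/push"
  }
}
```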

Grafana Alloy

Grafana Alloy:

Collects all metrics, Cluster events, and Pod logs
Receives traces pushed from applications on Clusters
Sends the data to Grafana Cloud

kube-state-metrics

The kube-state-metrics service listens to Kubernetes API server events and exposes Prometheus metrics that document the state of your Cluster’s objects. Over a thousand different metrics provide the status, capacity, and health of individual containers, Pods, deployments, and other resources. Kubernetes Monitoring uses kube-state-metrics (one replica, by default) to enable you to see the links between Cluster, Node, Pod, and container.

kube-state-metrics:

Is a service that generates metrics from Kubernetes objects without modification
Runs as a single replica by default, rather than on each Node
Provides metrics on the state of objects in your Cluster (Pods, Deployments, DaemonSets)

The Kubernetes Monitoring Cluster navigation feature requires the following metrics:

kube_namespace_status_phase
container_cpu_usage_seconds_total
kube_pod_status_phase
kube_pod_start_time
kube_pod_container_status_restarts_total
kube_pod_container_info
kube_pod_container_status_waiting_reason
kube_daemonset.+
kube_replicaset.+
kube_statefulset.+
kube_job.+
kube_node.+
kube_cluster.+
node_cpu_seconds_total
node_memory_MemAvailable_bytes
node_filesystem_size_bytes
node_namespace_pod_container
container_memory_working_set_bytes

kubelet

kubelet (one per Node):

Is the primary “Node agent” present on each Node in the Cluster
Emits metrics specific to the kubelet process like kubelet_running_pods and kubelet_running_container_count
Ensures containers are running
Provides metrics on Pods and their containers

Grafana Alloy collects these metrics and sends them to Grafana Cloud.

Kubernetes mixins

Kubernetes Monitoring is heavily indebted to the open source kubernetes-mixin project, from which the recording and alerting rules are derived. Grafana Labs continues to contribute bug fixes and new features upstream.

Node Exporter

The Prometheus exporter node-exporter runs as a DaemonSet on the Cluster to:

Gather metrics on hardware and OS for Linux Nodes in the Cluster
Emit Prometheus metrics for the health and state of the Nodes in your Cluster

Grafana Alloy collects these metrics and sends them to Grafana Cloud.

OpenCost

Kubernetes Monitoring uses the combination of OpenCost and Grafana to allow you to monitor and manage costs related to your Kubernetes Cluster. For more details, refer to Manage costs.

Pod logs

Alloy pulls Pod logs from the workloads running within containers, and sends them to Loki.

Note
Kubernetes Monitoring doesn’t require Pod logs to work, but Pod logs do provide additional context in some views of the Cluster navigation tab. Log entries must be sent to a Loki data source with cluster, namespace, and pod labels.
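
The label requirement in the note above can be sketched in Alloy as follows. This is an illustrative pipeline rather than the configuration the Helm chart generates: the component labels, the Cluster name my-cluster, and the Loki URL are all hypothetical.

```alloy
// Discover every Pod in the Cluster.
discovery.kubernetes "pods" {
  role = "pod"
}

// Copy discovery metadata into the namespace and pod labels.
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
}

// Tail container logs through the Kubernetes API.
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.write.grafana_cloud.receiver]
}

loki.write "grafana_cloud" {
  endpoint {
    // Placeholder: replace with the push URL of your own Loki instance.
    url = "https://<your-loki-host>/loki/api/v1/push"
  }

  // Attach the cluster label to every log line (hypothetical Cluster name).
  external_labels = {
    cluster = "my-cluster",
  }
}
```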

Traces

Traces generated by applications within the Cluster are pushed to Grafana Alloy. The address options listed during the process of configuring with the Helm chart contain the configuration endpoints where traces can be pushed.
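
For orientation, a trace receiver in Alloy might look like the sketch below. It assumes applications push traces over OTLP on the conventional ports 4317 (gRPC) and 4318 (HTTP). The addresses listed by the Helm chart are authoritative and may differ, and the exporter otelcol.exporter.otlp.grafana_cloud is a hypothetical component defined elsewhere in your configuration.

```alloy
// Sketch only: ports and the exporter reference are assumptions.
otelcol.receiver.otlp "traces" {
  // Accept OTLP over gRPC.
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  // Accept OTLP over HTTP.
  http {
    endpoint = "0.0.0.0:4318"
  }

  // Hand received traces to a hypothetical exporter.
  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input]
  }
}
```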

Windows Exporter

When monitoring Windows Nodes, the configuration installs the windows-exporter DaemonSet to ensure metrics are available for scraping.

Metrics management and control

The following are ways to control and manage metrics:

Reduce usage.
Identify unnecessary or duplicate metrics.
Analyze usage.
Use allowlists.

Identify unnecessary or duplicate metrics

To identify unnecessary or duplicate metrics that could come from within your Cluster, you can:

Use the Cardinality page to discover on a Cluster-by-Cluster basis where all your active series are coming from. From the main menu, click Configuration and then the Cardinality tab.
Analyze current metrics usage and associated costs from the billing and usage dashboard located in your Grafana instance.

Analyze usage

For techniques to analyze usage, refer to Analyze Prometheus metrics costs.

Reduce usage with allowlists, relabeling, and metrics tuning

To reduce metrics to only those you want to receive, you can use and refine an allowlist. Out of the box, Kubernetes Monitoring has allowlists configured with Prometheus metric_relabel_configs blocks.

You can remove or modify allowlists by editing the corresponding metric_relabel_configs blocks in your Alloy configuration. To learn more about analyzing and controlling the metrics you want to receive through relabeling, refer to Reduce metrics costs by filtering collected and forwarded metrics.

Refer to the Custom Metrics Tuning section of the Helm chart to learn more about refining an existing allowlist as well as creating a custom allowlist.
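
As a rough illustration of what an allowlist looks like, the following Alloy sketch keeps only series whose metric name matches a short list and drops everything else. The metric names are taken from the Cluster navigation list earlier on this page. The component label allowlist and the prometheus.remote_write.grafana_cloud target are hypothetical, and the allowlists shipped with Kubernetes Monitoring are considerably longer.

```alloy
// Sketch only: a trimmed-down allowlist, not the shipped default.
prometheus.relabel "allowlist" {
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]

  rule {
    // __name__ holds the metric name of each series.
    source_labels = ["__name__"]

    // The regex is fully anchored, so each alternative must match the whole name.
    regex = "kube_pod_status_phase|kube_pod_container_info|kube_node.+"

    // Series that do not match are dropped.
    action = "keep"
  }
}
```

Adding a metric to the allowlist then amounts to appending another alternative to the regex.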

Billable series

Grafana Cloud billing is based on billable series. To learn more about the pricing model, refer to Active series and DPM.

The number of active series collected by default varies depending on your Kubernetes Cluster size (number of Nodes) and running workloads (number of Pods, containers, Deployments, and so on).

When testing on a Cloud provider’s Kubernetes offering, the following active series usage was observed:

3-Node Cluster, 17 running Pods, 31 running containers: 3.8k active series
- The only Pods deployed into the Cluster were Grafana Agent and kube-state-metrics. The rest were running in the kube-system Namespace and were managed by the cloud provider.

From this baseline, active series usage roughly increased by:
- 1000 active series per additional Node
- 75 active series per additional Pod (vanilla Nginx Pods were deployed into the Cluster)

These are very rough guidelines, and results may vary depending on your Cloud provider or Kubernetes version. These figures cover only the scrape targets described above and exclude additional targets such as application metrics, API server metrics, and scheduler metrics.
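
To turn the observations above into a back-of-the-envelope estimate, you can treat the increments as additive and linear. This is an assumption made here for illustration, not a formula from the billing model. With N Nodes and P Pods, and the 3-Node, 17-Pod baseline:

```latex
\[
S(N, P) \approx 3800 + 1000\,(N - 3) + 75\,(P - 17)
\]
% Worked example for a hypothetical Cluster with N = 10 Nodes and P = 100 Pods:
\[
S(10, 100) \approx 3800 + 1000 \cdot 7 + 75 \cdot 83 = 3800 + 7000 + 6225 = 17025
\]
```

That is roughly 17k active series for such a Cluster. The per-Pod figure was measured with vanilla Nginx Pods, so workloads that expose more containers or metrics can land well above this estimate.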

Logs management

You can control and manage logs by:

Only collecting logs from Pods in certain namespaces
Dropping logs based on content

For more on analyzing, customizing, and de-duplicating logs, refer to Logs in Explore.

Limit logs to only Pods from certain namespaces

By default, Kubernetes Monitoring gathers logs from Pods in every namespace. However, you may only want or need logs from Pods in certain namespaces.

In the Grafana Alloy configuration, you can add a relabeling rule (the Alloy equivalent of a Prometheus relabel_configs entry) that either keeps Pods from the namespaces you want, or drops Pods from the namespaces that you don’t want.

For example, this rule restricts log collection to the production and staging namespaces:

rule {
  source_labels = ["namespace"]
  regex = "production|staging"
  action = "keep"
}
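
The inverse approach, dropping namespaces you don't want, can be sketched the same way. The example below places the rule inside a discovery.relabel component to show where it might live. The component labels, the discovery.kubernetes.pods reference, and the namespace names are hypothetical, and it assumes an earlier rule has already populated the namespace label.

```alloy
// Sketch only: labels, targets, and namespace names are assumptions.
discovery.relabel "filter_namespaces" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["namespace"]

    // Pods in either of these namespaces are excluded from log collection.
    regex  = "kube-system|monitoring"
    action = "drop"
  }
}
```

A keep rule is usually easier to maintain when you want logs from only a few namespaces, while a drop rule suits the case where only a few namespaces are noisy.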

Drop logs based on content

Similar to filtering by namespace, you can use Loki processing rules to further process and optionally drop log lines.

For example, this processing stage drops any line that contains debug or DEBUG:

stage.drop {
  expression = ".*(debug|DEBUG).*"
}
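
A stage like this runs inside a loki.process component. The following sketch shows one possible arrangement with a case-insensitive variant of the expression, so that mixed-case spellings such as Debug are also dropped. The component label pod_logs, the forward target, and the counter reason debug_line are hypothetical.

```alloy
// Sketch only: labels and the forward target are assumptions.
loki.process "pod_logs" {
  forward_to = [loki.write.grafana_cloud.receiver]

  stage.drop {
    // (?i) makes the RE2 expression case-insensitive.
    expression = "(?i).*debug.*"

    // Custom reason reported for lines dropped by this stage.
    drop_counter_reason = "debug_line"
  }
}
```

Because the expression matches anywhere in the line, it also drops lines that merely mention the word in a message body. Tighten the pattern if you only want to match a log level field.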

Recording rules

Kubernetes Monitoring includes the following recording rules to speed up queries and alerting rule evaluation:

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
node_namespace_pod_container:container_memory_working_set_bytes
node_namespace_pod_container:container_memory_rss
node_namespace_pod_container:container_memory_cache
node_namespace_pod_container:container_memory_swap
cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
namespace_memory:kube_pod_container_resource_requests:sum
cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
namespace_cpu:kube_pod_container_resource_requests:sum
cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
namespace_memory:kube_pod_container_resource_limits:sum
cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
namespace_cpu:kube_pod_container_resource_limits:sum
namespace_workload_pod:kube_pod_owner:relabel
namespace_workload_pod:kube_pod_owner:relabel
namespace_workload_pod:kube_pod_owner:relabel

Note
Recording rules may emit time series with the same metric name, but different labels. To modify these programmatically, refer to Set up Alerting for Cloud.

Alerting rules

Kubernetes Monitoring comes with preconfigured alerting rules to alert on conditions such as “Pods crash looping” and “Pods getting stuck in not ready”. The following alerting rules create alerts to notify you when issues arise with your Clusters and their workloads.

Kubelet alerts

KubeNodeNotReady
KubeNodeUnreachable
KubeletTooManyPods
KubeNodeReadinessFlapping
KubeletPlegDurationHigh
KubeletPodStartUpLatencyHigh
KubeletClientCertificateExpiration
KubeletClientCertificateExpiration
KubeletServerCertificateExpiration
KubeletServerCertificateExpiration
KubeletClientCertificateRenewalErrors
KubeletServerCertificateRenewalErrors
KubeletDown

Kubernetes system alerts

KubeVersionMismatch
KubeClientErrors

Kubernetes resource usage alerts

KubeCPUOvercommit
KubeMemoryOvercommit
KubeCPUQuotaOvercommit
KubeMemoryQuotaOvercommit
KubeQuotaAlmostFull
KubeQuotaFullyUsed
KubeQuotaExceeded
CPUThrottlingHigh

Kubernetes alerts

KubePodCrashLooping
KubePodNotReady
KubeDeploymentGenerationMismatch
KubeDeploymentReplicasMismatch
KubeStatefulSetReplicasMismatch
KubeStatefulSetGenerationMismatch
KubeStatefulSetUpdateNotRolledOut
KubeDaemonSetRolloutStuck
KubeContainerWaiting
KubeDaemonSetNotScheduled
KubeDaemonSetMisScheduled
KubeJobCompletion
KubeJobFailed
KubeHpaReplicasMismatch
KubeHpaMaxedOut

To learn more, refer to the upstream Kubernetes-Mixin’s Kubernetes Alert Runbooks page. To programmatically update the runbook links in these preconfigured alerts so that they point to your own runbooks, use a tool such as cortex-tools or grizzly.

Get support

To open a support ticket, navigate to your Grafana Cloud Portal, and click Open a Support Ticket.
