Manage your configuration

Kubernetes Monitoring gathers metrics, logs, and events, and calculates costs for your infrastructure. It also provides recording rules, alerting rules, and allowlists.

How Kubernetes Monitoring works

Kubernetes Monitoring uses the following to provide its data and visualizations.

cAdvisor

cAdvisor runs on each Node in your Cluster (one per Node) and emits container resource usage metrics such as CPU usage, memory usage, and disk usage. Alloy collects these metrics and sends them to Grafana Cloud.

Cluster events

Kubernetes Cluster controllers emit information about events concerning the lifecycle of Pods, Deployments, and Nodes within the Cluster. Alloy pulls these Cluster events using the Kubernetes API server, converts them into log lines, and sends them to Loki in Grafana Cloud.

Grafana Alloy

Grafana Alloy:

  • Collects all metrics, Cluster events, and Pod logs
  • Receives traces pushed from applications on Clusters
  • Sends the data to Grafana Cloud

kube-state-metrics

The kube-state-metrics service listens to Kubernetes API server events and exposes Prometheus metrics that document the state of your Cluster’s objects. Over a thousand different metrics provide the status, capacity, and health of individual containers, Pods, deployments, and other resources. Kubernetes Monitoring uses kube-state-metrics (one replica, by default) to enable you to see the links between Cluster, Node, Pod, and container.

kube-state-metrics:

  • Is a service that generates metrics from Kubernetes objects without modification, and runs as a single replica by default
  • Provides metrics on the state of objects in your Cluster (Pods, Deployments, DaemonSets)

The Kubernetes Monitoring Cluster navigation feature requires the following metrics:

  • kube_namespace_status_phase
  • container_cpu_usage_seconds_total
  • kube_pod_status_phase
  • kube_pod_start_time
  • kube_pod_container_status_restarts_total
  • kube_pod_container_info
  • kube_pod_container_status_waiting_reason
  • kube_daemonset.+
  • kube_replicaset.+
  • kube_statefulset.+
  • kube_job.+
  • kube_node.+
  • kube_cluster.+
  • node_cpu_seconds_total
  • node_memory_MemAvailable_bytes
  • node_filesystem_size_bytes
  • node_namespace_pod_container
  • container_memory_working_set_bytes
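
If you maintain a custom allowlist, it can be useful to confirm that it still lets these metrics through. The following is a minimal Python sketch, not part of Kubernetes Monitoring. It assumes that entries ending in `.+` are regular expressions that must match the whole metric name, as Prometheus-style relabeling does. The function name `missing_required` and the sample inputs are hypothetical.

```python
import re


def missing_required(required_patterns, seen_metric_names):
    """Return the required entries that match none of the seen metric names.

    Each entry is treated as a fully anchored regular expression
    (an assumption based on Prometheus-style relabeling semantics).
    """
    seen = list(seen_metric_names)
    return [
        pattern
        for pattern in required_patterns
        if not any(re.fullmatch(pattern, name) for name in seen)
    ]


# A few entries taken from the list above.
required = ["kube_pod_status_phase", "kube_daemonset.+", "kube_job.+"]

# Hypothetical metric names that your allowlist currently lets through.
seen = ["kube_pod_status_phase", "kube_daemonset_status_number_ready"]

print(missing_required(required, seen))
```

With these sample inputs, the only entry reported as missing is `kube_job.+`, because no metric name beginning with `kube_job` is in the list.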

kubelet

kubelet (one per Node):

  • Is the primary “Node agent” present on each Node in the Cluster
  • Emits metrics specific to the kubelet process like kubelet_running_pods and kubelet_running_container_count
  • Ensures containers are running
  • Provides metrics on Pods and their containers

Grafana Alloy collects these metrics and sends them to Grafana Cloud.

Kubernetes mixins

Kubernetes Monitoring is heavily indebted to the open source kubernetes-mixin project, from which the recording and alerting rules are derived. Grafana Labs continues to contribute bug fixes and new features upstream.

Node Exporter

The Prometheus exporter node-exporter runs as a DaemonSet on the Cluster to:

  • Gather metrics on hardware and OS for Linux Nodes in the Cluster
  • Emit Prometheus metrics for the health and state of the Nodes in your Cluster

Grafana Alloy collects these metrics and sends them to Grafana Cloud.

OpenCost

Kubernetes Monitoring uses the combination of OpenCost and Grafana to allow you to monitor and manage costs related to your Kubernetes Cluster. For more details, refer to Manage costs.

Pod logs

Alloy pulls Pod logs from the workloads running within containers, and sends them to Loki.

Note

Kubernetes Monitoring doesn’t require Pod logs to work, but Pod logs do provide additional context in some views of the Cluster navigation tab. Log entries must be sent to a Loki data source with cluster, namespace, and pod labels.

Traces

Applications within the Cluster push the traces they generate to Grafana Alloy. The address options shown when you configure Kubernetes Monitoring with the Helm chart list the endpoints to which traces can be pushed.

Windows Exporter

When monitoring Windows Nodes, the configuration installs the windows-exporter DaemonSet to ensure metrics are available for scraping.

Metrics management and control

The following are ways to control and manage metrics:

  • Reduce usage.
  • Identify unnecessary or duplicate metrics.
  • Analyze usage.
  • Use allowlists.

Identify unnecessary or duplicate metrics

To identify unnecessary or duplicate metrics that could come from within your Cluster, you can analyze current metrics usage and associated costs from the billing and usage dashboard located in your Grafana instance.

Analyze usage

For techniques to analyze usage, refer to Analyze Prometheus metrics costs.

Reduce usage with allowlists, relabeling, and metrics tuning

To reduce metrics to only those you want to receive, you can use and refine an allowlist. Out of the box, Kubernetes Monitoring has allowlists configured with Prometheus metric_relabel_configs blocks.

You can remove or modify allowlists by editing the corresponding metric_relabel_configs blocks in your Alloy configuration. To learn more about analyzing and controlling the metrics you want to receive through relabeling, refer to Reduce metrics costs by filtering collected and forwarded metrics.

Refer to the Custom Metrics Tuning section of the Helm chart to learn more about refining an existing allowlist as well as creating a custom allowlist.

Billable series

Grafana Cloud billing is based on billable series. To learn more about the pricing model, refer to Active series and DPM.

The number of active series collected by default varies depending on your Kubernetes Cluster size (number of Nodes) and running workloads (number of Pods, containers, Deployments, and so on).

When testing on a Cloud provider’s Kubernetes offering, the following active series usage was observed:

  • 3-Node Cluster, 17 running Pods, 31 running containers: 3.8k active series
    • The only Pods deployed into the Cluster were Grafana Agent and kube-state-metrics. The rest were running in the kube-system Namespace and managed by the cloud provider
  • From this baseline, active series usage roughly increased by:
    • 1000 active series per additional Node
    • 75 active series per additional Pod (vanilla Nginx Pods were deployed into the Cluster)

These are very rough guidelines and results may vary depending on your Cloud provider or Kubernetes version. Note also that these figures are based on the scrape targets configured above, and not additional targets such as application metrics, API server metrics, and scheduler metrics.
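
To turn the figures above into a quick estimate, you can extrapolate linearly from the observed baseline. The following Python sketch does only that arithmetic. The linear model, the function name `estimate_active_series`, and the choice to never go below the baseline are assumptions made for illustration, so treat the result as an order-of-magnitude guide rather than a billing prediction.

```python
# Figures taken from the test observations above.
BASELINE_SERIES = 3800  # 3-Node Cluster, 17 running Pods
BASELINE_NODES = 3
BASELINE_PODS = 17
SERIES_PER_EXTRA_NODE = 1000
SERIES_PER_EXTRA_POD = 75


def estimate_active_series(nodes: int, pods: int) -> int:
    """Roughly estimate active series for a Cluster of the given size.

    Clusters smaller than the baseline are clamped to the baseline,
    because the source gives no figures below it (an assumption).
    """
    extra_nodes = max(nodes - BASELINE_NODES, 0)
    extra_pods = max(pods - BASELINE_PODS, 0)
    return (
        BASELINE_SERIES
        + extra_nodes * SERIES_PER_EXTRA_NODE
        + extra_pods * SERIES_PER_EXTRA_POD
    )


# Example: 10 Nodes and 117 Pods.
print(estimate_active_series(10, 117))
```

For 10 Nodes and 117 Pods, this gives 3800 + 7 × 1000 + 100 × 75 = 18300 active series, before any application, API server, or scheduler metrics are added.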

Logs management

You can control and manage logs by:

  • Only collecting logs from Pods in certain namespaces
  • Dropping logs based on content

For more on analyzing, customizing, and de-duplicating logs, refer to Logs in Explore.

Limit logs to only Pods from certain namespaces

By default, Kubernetes Monitoring gathers logs from Pods in every namespace. However, you may only want or need logs from Pods in certain namespaces.

In the Grafana Alloy configuration, you can add a relabel_configs block that either keeps Pods from the namespaces you want, or drops Pods from the namespaces that you don’t want.

For example, this would restrict to only the production and staging namespaces:

txt
rule {
  source_labels = ["namespace"]
  regex = "production|staging"
  action = "keep"
}
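
Before changing the configuration, you may want to check which namespaces a rule like this would let through. The following Python sketch emulates the keep and drop actions outside of Alloy. It assumes the regex must match the entire label value, as in Prometheus-style relabeling. The function name `filter_namespaces` and the namespace names are hypothetical.

```python
import re


def filter_namespaces(namespaces, regex, action="keep"):
    """Emulate a keep or drop relabeling rule on the namespace label.

    The regex is matched against the whole value (fully anchored),
    which is an assumption based on Prometheus relabeling semantics.
    """
    if action not in ("keep", "drop"):
        raise ValueError("action must be 'keep' or 'drop'")
    pattern = re.compile(regex)
    result = []
    for namespace in namespaces:
        matched = pattern.fullmatch(namespace) is not None
        # keep retains matches; drop retains everything else.
        if matched == (action == "keep"):
            result.append(namespace)
    return result


namespaces = ["production", "staging", "staging-eu", "kube-system"]

print(filter_namespaces(namespaces, "production|staging", action="keep"))
print(filter_namespaces(namespaces, "kube-system", action="drop"))
```

Under the anchoring assumption, `staging-eu` is not kept by `production|staging`. To include it, the pattern would need to be widened, for example to `production|staging.*`.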

Drop logs based on content

Just as you can filter logs to specific namespaces, you can use Loki processing stages to further process and optionally drop log lines.

For example, this processing stage drops any line that contains debug or DEBUG:

txt
stage.drop {
  expression  = ".*(debug|DEBUG).*"
}
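
To preview which log lines such an expression would remove, you can run the same pattern over a few sample lines. The following Python sketch is an emulation only. It uses Python's `re` module rather than Loki's regex engine, and the function name `apply_drop_stage` and the sample lines are hypothetical. Because the pattern begins and ends with `.*`, it behaves the same whether or not the expression is anchored.

```python
import re

# Same expression as in the stage above.
DROP_EXPRESSION = ".*(debug|DEBUG).*"


def apply_drop_stage(lines, expression=DROP_EXPRESSION):
    """Return only the lines that the drop expression does not match."""
    pattern = re.compile(expression)
    return [line for line in lines if pattern.search(line) is None]


sample_lines = [
    "level=debug msg=cache miss",
    "DEBUG starting worker",
    "Debug mode enabled",
    "level=info msg=ready",
]

for line in apply_drop_stage(sample_lines):
    print(line)
```

The match is case-sensitive, so the mixed-case line `Debug mode enabled` is not dropped. If you want to drop it too, extend the alternation accordingly.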

Recording rules

Kubernetes Monitoring includes the following recording rules to speed up queries and alerting rule evaluation:

  • node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  • node_namespace_pod_container:container_memory_working_set_bytes
  • node_namespace_pod_container:container_memory_rss
  • node_namespace_pod_container:container_memory_cache
  • node_namespace_pod_container:container_memory_swap
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
  • namespace_memory:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • namespace_cpu:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
  • namespace_memory:kube_pod_container_resource_limits:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • namespace_cpu:kube_pod_container_resource_limits:sum
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel

Note

Recording rules may emit time series with the same metric name, but different labels. To modify these programmatically, refer to Set up Alerting for Cloud.

Alerting rules

Kubernetes Monitoring comes with preconfigured alerting rules to alert on conditions such as “Pods crash looping” and “Pods getting stuck in not ready”. The following alerting rules create alerts to notify you when issues arise with your Clusters and their workloads.

Kubelet alerts

  • KubeNodeNotReady
  • KubeNodeUnreachable
  • KubeletTooManyPods
  • KubeNodeReadinessFlapping
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletClientCertificateExpiration
  • KubeletClientCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletClientCertificateRenewalErrors
  • KubeletServerCertificateRenewalErrors
  • KubeletDown

Kubernetes system alerts

  • KubeVersionMismatch
  • KubeClientErrors

Kubernetes resource usage alerts

  • KubeCPUOvercommit
  • KubeMemoryOvercommit
  • KubeCPUQuotaOvercommit
  • KubeMemoryQuotaOvercommit
  • KubeQuotaAlmostFull
  • KubeQuotaFullyUsed
  • KubeQuotaExceeded
  • CPUThrottlingHigh

Kubernetes alerts

  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeDaemonSetRolloutStuck
  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetMisScheduled
  • KubeJobCompletion
  • KubeJobFailed
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut

To learn more, refer to the upstream Kubernetes-Mixin’s Kubernetes Alert Runbooks page. To programmatically update the runbook links in these preconfigured alerts so that they point to your own runbooks, use a tool such as cortex-tools or grizzly.

Get support

To open a support ticket, navigate to your Grafana Cloud Portal, and click Open a Support Ticket.