Manage your configuration

Kubernetes Monitoring gathers metrics, logs, and events, and calculates costs for your infrastructure. It also provides recording rules, alerting rules, and allowlists.

Metrics control and management

The following are ways to control and manage metrics:

  • Reduce usage
  • Identify unnecessary or duplicate metrics
  • Analyze usage
  • Use allowlists

Reduce usage

The best way to control and manage your metrics is to use the techniques detailed in Reduce Kubernetes Metrics usage.

Identify unnecessary or duplicate metrics

To identify unnecessary or duplicate metrics that could come from within your cluster, analyze your metrics usage as described in the next section.

Analyze usage

Refer to Analyzing metrics usage with Grafana Explore for more techniques.

Use allowlists

By default, Kubernetes Monitoring configures allowlists using Prometheus metric_relabel_configs blocks. To learn more about metric_relabel_configs, refer to Reduce Prometheus metrics usage with relabeling.

These allowlists trim the collected metrics down to a useful set. To remove or adjust the allowlists, edit the corresponding metric_relabel_configs blocks in your Agent configuration. To learn more about analyzing and controlling active series usage, refer to Control Prometheus metrics usage.
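As an illustrative sketch only (not the exact configuration Kubernetes Monitoring ships), an allowlist in a static-mode Agent scrape config looks roughly like the following; the job name and regex here are placeholders:

```yaml
scrape_configs:
  - job_name: integrations/kubernetes/kube-state-metrics
    # kubernetes_sd_configs, authorization, and relabel_configs omitted for brevity.
    metric_relabel_configs:
      # Keep only series whose metric name matches the regex; everything else
      # from this target is dropped before it is sent to Grafana Cloud.
      - source_labels: [__name__]
        regex: kube_pod_status_phase|kube_pod_container_status_restarts_total|kube_node_info
        action: keep
```

Removing a name from the regex stops sending that metric; deleting the whole block disables the allowlist for that target.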

Grafana Cloud billing is based on billable series. To learn more about the pricing model, refer to Active series and DPM.

Default active series usage varies depending on your Kubernetes cluster size (number of Nodes) and running workloads (number of Pods, containers, Deployments, etc.).

When testing on a Cloud provider’s Kubernetes offering, the following active series usage was observed:

  • 3-Node cluster, 17 running Pods, 31 running containers: 3.8k active series
    • The only Pods deployed into the cluster were Grafana Agent and kube-state-metrics. The rest were running in the kube-system Namespace and were managed by the cloud provider.
  • From this baseline, active series usage roughly increased by:
    • 1000 active series per additional Node
    • 75 active series per additional Pod (vanilla Nginx Pods were deployed into the cluster)

These are very rough guidelines and results may vary depending on your Cloud provider or Kubernetes version. Note also that these figures are based on the scrape targets configured above, and not additional targets such as application metrics, API server metrics, and scheduler metrics.
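As a rough illustration only, scaling the baseline above to a 5-Node cluster with 20 additional application Pods would work out to approximately 3.8k + (2 × 1,000) + (20 × 75) ≈ 7.3k active series. Treat this as a sanity check for your own usage, not a guarantee.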

Logs management

To analyze, customize, and deduplicate logs, refer to Logs in Explore.

How Kubernetes Monitoring works

We use kube-state-metrics to generate metrics from Kubernetes objects without modification, and send these metrics to Grafana Cloud. We use Grafana Agent to collect logs from your Kubernetes objects and send them to Loki in Grafana Cloud.

We are heavily indebted to the open source kubernetes-mixin project, from which the dashboards, recording rules, and alerting rules have been derived. We will continue to contribute bug fixes and new features upstream.

Metrics

Kubernetes Monitoring scrapes the following items to provide metrics:

  • cAdvisor: Daemon that provides information about running containers, including metrics on container resource usage (CPU, memory, and disk).
  • kubelet: Primary “node agent” that runs on each Node in the cluster and ensures containers are running. Provides metrics on Pods and their containers.
  • kube-state-metrics: Service that generates metrics from Kubernetes objects without modification. Provides metrics on the state of objects in your cluster (Pods, Deployments, DaemonSets). Required for the cluster navigation feature.
  • node-exporter: Prometheus exporter that gathers hardware and OS metrics for Linux Nodes in the cluster.
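As a simplified sketch of how one of these targets can be scraped (not the exact configuration that Kubernetes Monitoring generates), cAdvisor metrics are exposed through the kubelet and can be collected with node-role service discovery:

```yaml
scrape_configs:
  - job_name: integrations/kubernetes/cadvisor
    kubernetes_sd_configs:
      - role: node                # one scrape target per Node in the cluster
    scheme: https
    metrics_path: /metrics/cadvisor
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      # The kubelet serving certificate is often not signed for the Node address,
      # so verification is commonly relaxed for this target.
      insecure_skip_verify: true
```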

kube-state-metrics

The following metrics are required to use the Kubernetes Monitoring cluster navigation feature:

```
- kube_namespace_status_phase
- container_cpu_usage_seconds_total
- kube_pod_status_phase
- kube_pod_start_time
- kube_pod_container_status_restarts_total
- kube_pod_container_info
- kube_pod_container_status_waiting_reason
- kube_daemonset.*
- kube_replicaset.*
- kube_statefulset.*
- kube_job.*
- kube_node*
- kube_cluster*
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_size_bytes
- node_namespace_pod_container
- container_memory_working_set_bytes
- job="integrations/kubernetes/eventhandler" (for event logs; included by default with Grafana Agent)
```
Note: Logs are not required for Kubernetes Monitoring to work, but they provide additional context in some views of the Cluster navigation tab. Log entries must be sent to a Loki data source with cluster, namespace, and pod labels.

Logs

Kubernetes Monitoring uses Agent Flow mode to collect logs from all Pods running in your cluster and send them to Loki in Grafana Cloud.

Events

Kubernetes events provide helpful logging information emitted by Kubernetes cluster controllers. Agent Flow mode contains an embedded integration that watches for event objects in your clusters, and sends them to Grafana Cloud for long-term storage and analysis.

An Eventhandler deployed by Kubernetes Monitoring watches for Kubernetes events in your clusters.
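If you run the Agent in static mode instead of Flow mode, the equivalent capability is the Agent's eventhandler integration, which is also where the job="integrations/kubernetes/eventhandler" label listed earlier comes from. A minimal sketch, assuming a logs instance named default is already configured:

```yaml
integrations:
  eventhandler:
    # File where the integration checkpoints the last shipped event,
    # so events are not re-sent after an Agent restart.
    cache_path: /etc/eventhandler/eventhandler.cache
    # Logs instance (defined under the top-level logs block) that receives
    # the event log lines.
    logs_instance: default
```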

Cost calculations

Kubernetes Monitoring uses OpenCost and Grafana’s experience in managing costs related to Kubernetes. For more details, refer to Manage costs.

Recording rules

Kubernetes Monitoring includes the following recording rules to speed up dashboard queries and alerting rule evaluation:

  • node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  • node_namespace_pod_container:container_memory_working_set_bytes
  • node_namespace_pod_container:container_memory_rss
  • node_namespace_pod_container:container_memory_cache
  • node_namespace_pod_container:container_memory_swap
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
  • namespace_memory:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • namespace_cpu:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
  • namespace_memory:kube_pod_container_resource_limits:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • namespace_cpu:kube_pod_container_resource_limits:sum
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel
Note: Recording rules may emit time series with the same metric name, but different labels. To modify these programmatically, refer to Set up Alerting for Cloud.
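As a simplified sketch of what one of these rules looks like (the full definitions come from the kubernetes-mixin project and are more involved), a recording rule pairs the recorded metric name with the expression it precomputes:

```yaml
groups:
  - name: k8s.rules
    rules:
      # Precomputes per-container CPU usage so dashboards and alerts can read
      # the recorded series instead of re-evaluating irate() on every query.
      - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
        expr: |
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{image!=""}[5m])
          )
```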

Alerting rules

Kubernetes Monitoring comes with alerting rules that alert on conditions such as Pods crash looping or Pods getting stuck in a not-ready state. The following alerting rules are preconfigured so that you are notified when issues arise with your clusters and their workloads.

Kubelet alerts

  • KubeNodeNotReady
  • KubeNodeUnreachable
  • KubeletTooManyPods
  • KubeNodeReadinessFlapping
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletClientCertificateExpiration
  • KubeletClientCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletClientCertificateRenewalErrors
  • KubeletServerCertificateRenewalErrors
  • KubeletDown

Kubernetes system alerts

  • KubeVersionMismatch
  • KubeClientErrors

Kubernetes resource usage alerts

  • KubeCPUOvercommit
  • KubeMemoryOvercommit
  • KubeCPUQuotaOvercommit
  • KubeMemoryQuotaOvercommit
  • KubeQuotaAlmostFull
  • KubeQuotaFullyUsed
  • KubeQuotaExceeded
  • CPUThrottlingHigh

Kubernetes alerts

  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeDaemonSetRolloutStuck
  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetMisScheduled
  • KubeJobCompletion
  • KubeJobFailed
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut

To learn more, refer to the upstream Kubernetes-Mixin’s Kubernetes Alert Runbooks page. You can programmatically update the alerting rule links in these preconfigured alerts to point to your own runbooks, using a tool like cortex-tools or grizzly.
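To show where those links live, here is a simplified sketch of an alerting rule with a runbook_url annotation (the upstream expression and thresholds differ, and the URL below is a placeholder); tools such as cortex-tools or grizzly can rewrite the annotation across all preconfigured rules at once:

```yaml
groups:
  - name: kubernetes-apps
    rules:
      - alert: KubePodCrashLooping
        # Simplified: fires when a container keeps restarting over a 15-minute window.
        expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[5m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Pod is crash looping.
          # Placeholder; point this at your own runbook.
          runbook_url: https://example.com/runbooks/KubePodCrashLooping
```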

Get support

To open a support ticket, navigate to your Grafana Cloud Portal, and click Open a Support Ticket.