Kubernetes Integration

The Kubernetes integration allows you to monitor and alert on resource usage and operations in a Kubernetes cluster. Kubernetes is an open-source container orchestration system that automates software container deployment, scaling, and management.

The Kubernetes integration provides the following:

  • Preconfigured manifests for deploying Grafana Agent and kube-state-metrics to your clusters.
  • 10 Grafana dashboards to drill into resource usage and cluster operations, from the multi-cluster level down to individual containers and Pods.
  • A set of recording rules to speed up dashboard queries.
  • A set of alerting rules to alert on problematic conditions, such as Pods crash looping or getting stuck in a “not ready” status.
  • A preconfigured (optional) allowlist of metrics referenced in the above dashboards, recording rules, and alerting rules to reduce your active series usage while still giving you visibility into core cluster metrics.

The dashboards, recording rules, and alerting rules are derived from the open source kubernetes-mixin project, to which we are heavily indebted. We will continue to contribute bug fixes and new features upstream.

Installing the Kubernetes Integration

Prerequisites

To install the Kubernetes integration, you’ll need the following:

  • A Kubernetes cluster.
  • The kubectl command-line tool installed and available on your machine.
  • The helm command-line tool installed and available on your machine. Our instructions only use Helm to deploy kube-state-metrics (KSM) to your cluster. If you do not wish to use Helm, you can deploy KSM using your preferred deployment tools, adjusting Grafana Agent’s scrape config accordingly.
  • A Grafana Cloud account.

Install the Kubernetes Integration

Navigate to your Hosted Grafana instance. You can find this in the Cloud Portal.

From here, click on Onboarding (lightning bolt icon) in the menu on the left, and then Walkthrough.

Click on Kubernetes and then Install Integration.

You’ll see a series of instructions for deploying the following:

  • Grafana Agent ServiceAccount, ClusterRole, ClusterRoleBinding, and single-replica Deployment.
  • Kube-state-metrics Helm chart (which deploys a KSM Deployment and Service, along with some other access control objects).
  • Grafana Agent ConfigMap, to configure Agent to scrape Prometheus metrics from cAdvisor, kubelet, and kube-state-metrics endpoints in your cluster.

Deploy Grafana Agent resources

Run the following command from your shell to install the Grafana Agent into the default Namespace of your Kubernetes cluster:

MANIFEST_URL=https://raw.githubusercontent.com/grafana/agent/main/production/kubernetes/agent-bare.yaml NAMESPACE=default /bin/sh -c "$(curl -fsSL https://raw.githubusercontent.com/grafana/agent/release/production/kubernetes/install-bare.sh)" | kubectl apply -f -

This installs a single-replica Grafana Agent Deployment into your cluster and configures RBAC permissions for the Agent. If you would like to deploy the Agent into a different Namespace, change the NAMESPACE=default variable, ensuring that this Namespace already exists. The Agent will not run until it is configured.
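
To confirm the rollout succeeded, you can check the Deployment's status and logs. This is a quick sanity check using the grafana-agent Deployment name from the manifest above; adjust the Namespace if you changed it:

kubectl get deployment grafana-agent -n default
kubectl logs -n default deployment/grafana-agent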

Deploy kube-state-metrics Helm chart

Run the following commands from your shell to install kube-state-metrics into the default Namespace of your cluster:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update && helm install ksm prometheus-community/kube-state-metrics --set image.tag=v2.2.0

This installs the kube-state-metrics Helm chart into your cluster. kube-state-metrics watches Kubernetes resources in your cluster and emits Prometheus metrics that can be scraped by Grafana Agent. To learn more, please see the kube-state-metrics docs.

If you would like to deploy kube-state-metrics into a different Namespace, use the namespaceOverride parameter in the chart's values.yaml.
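
For example, a hypothetical install into a monitoring Namespace might look like the following (the monitoring Namespace is illustrative; the ksm release name matches the command above):

kubectl create namespace monitoring
helm install ksm prometheus-community/kube-state-metrics --namespace monitoring --set namespaceOverride=monitoring --set image.tag=v2.2.0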

Configure the Grafana Agent

Note: The default scrape interval is 60s.

Paste the following script into your shell and run it to configure the Grafana Agent:

cat <<'EOF' |
kind: ConfigMap
metadata:
  name: grafana-agent
apiVersion: v1
data:
  agent.yaml: |
    server:
      http_listen_port: 12345
    prometheus:
      wal_directory: /tmp/grafana-agent-wal
      global:
        scrape_interval: 60s
        external_labels:
          cluster: cloud
      configs:
      - name: integrations
        remote_write:
        - url: YOUR_REMOTE_WRITE_PUSH_URL
          basic_auth:
            username: YOUR_REMOTE_WRITE_USERNAME
            password: YOUR_REMOTE_WRITE_API_KEY
        scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: integrations/kubernetes/cadvisor
          kubernetes_sd_configs:
              - role: node
          metric_relabel_configs:
              - source_labels: [__name__]
                regex: container_network_transmit_packets_total|kubelet_certificate_manager_server_ttl_seconds|storage_operation_duration_seconds_bucket|node_namespace_pod_container:container_memory_swap|container_fs_writes_total|container_network_receive_bytes_total|kube_daemonset_status_desired_number_scheduled|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|rest_client_requests_total|node_namespace_pod_container:container_memory_working_set_bytes|kubernetes_build_info|kube_node_status_capacity|kubelet_pleg_relist_duration_seconds_bucket|kubelet_running_pods|storage_operation_errors_total|kubelet_running_containers|kube_daemonset_status_number_misscheduled|kube_job_failed|kube_statefulset_status_replicas|kube_job_status_succeeded|container_cpu_cfs_throttled_periods_total|kube_statefulset_status_update_revision|process_resident_memory_bytes|kubelet_pod_start_duration_seconds_count|kubelet_running_container_count|container_fs_writes_bytes_total|machine_memory_bytes|kubelet_cgroup_manager_duration_seconds_count|node_namespace_pod_container:container_memory_rss|kubelet_node_config_error|kubelet_runtime_operations_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_job_spec_completions|kube_statefulset_status_current_revision|kube_statefulset_replicas|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_node_name|kubelet_pod_worker_duration_seconds_bucket|go_goroutines|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_runtime_operations_errors_total|kube_daemonset_status_number_available|kube_deployment_status_replicas_available|up|storage_operation_duration_seconds_count|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_updated|kube_node_status_condition|kube_node_status_allocatable|rest_client_request_duration_seconds_bucket|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|kubelet_pleg_relist_duration_seconds_count|kube_pod_owner|namespace_cpu:kube_pod_container_resource_requests:sum|kube_horizontalpodautoscaler_spec_max_replicas|kube_statefulset_status_replicas_ready|container_fs_reads_total|node_namespace_pod_container:container_memory_cache|container_network_transmit_packets_dropped_total|kubelet_volume_stats_inodes_used|kube_node_spec_taint|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_pod_info|kubelet_cgroup_manager_duration_seconds_bucket|process_cpu_seconds_total|container_memory_cache|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_count|volume_manager_total_volumes|namespace_cpu:kube_pod_container_resource_limits:sum|kube_deployment_metadata_generation|kube_replicaset_owner|container_memory_swap|kubelet_certificate_manager_client_ttl_seconds|kube_resourcequota|container_fs_reads_bytes_total|kubelet_runtime_operations_total|kube_horizontalpodautoscaler_status_desired_replicas|kube_pod_status_phase|kube_horizontalpodautoscaler_spec_min_replicas|kubelet_server_expiration_renew_errors|kube_pod_container_resource_limits|container_network_transmit_bytes_total|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kube_pod_container_status_waiting_reason|container_network_receive_packets_total|kube_namespace_created|namespace_workload_pod|kube_pod_container_resource_requests|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_deployment_status_replicas_updated|kube_statefulset_status_observed_generation|kube_deployment_status_observed_generation|container_cpu_cfs_periods_total|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kubelet_certificate_manager_client_expiration_renew_errors|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kube_daemonset_updated_number_scheduled|kubelet_volume_stats_inodes|kube_node_info|kube_deployment_spec_replicas|container_memory_rss|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_volume_stats_available_bytes
                action: keep
          relabel_configs:
              - replacement: kubernetes.default.svc.cluster.local:443
                target_label: __address__
              - regex: (.+)
                replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
                source_labels:
                  - __meta_kubernetes_node_name
                target_label: __metrics_path__
          scheme: https
          tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: false
              server_name: kubernetes
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: integrations/kubernetes/kubelet
          kubernetes_sd_configs:
              - role: node
          metric_relabel_configs:
              - source_labels: [__name__]
                regex: container_network_transmit_packets_total|kubelet_certificate_manager_server_ttl_seconds|storage_operation_duration_seconds_bucket|node_namespace_pod_container:container_memory_swap|container_fs_writes_total|container_network_receive_bytes_total|kube_daemonset_status_desired_number_scheduled|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|rest_client_requests_total|node_namespace_pod_container:container_memory_working_set_bytes|kubernetes_build_info|kube_node_status_capacity|kubelet_pleg_relist_duration_seconds_bucket|kubelet_running_pods|storage_operation_errors_total|kubelet_running_containers|kube_daemonset_status_number_misscheduled|kube_job_failed|kube_statefulset_status_replicas|kube_job_status_succeeded|container_cpu_cfs_throttled_periods_total|kube_statefulset_status_update_revision|process_resident_memory_bytes|kubelet_pod_start_duration_seconds_count|kubelet_running_container_count|container_fs_writes_bytes_total|machine_memory_bytes|kubelet_cgroup_manager_duration_seconds_count|node_namespace_pod_container:container_memory_rss|kubelet_node_config_error|kubelet_runtime_operations_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_job_spec_completions|kube_statefulset_status_current_revision|kube_statefulset_replicas|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_node_name|kubelet_pod_worker_duration_seconds_bucket|go_goroutines|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_runtime_operations_errors_total|kube_daemonset_status_number_available|kube_deployment_status_replicas_available|up|storage_operation_duration_seconds_count|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_updated|kube_node_status_condition|kube_node_status_allocatable|rest_client_request_duration_seconds_bucket|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|kubelet_pleg_relist_duration_seconds_count|kube_pod_owner|namespace_cpu:kube_pod_container_resource_requests:sum|kube_horizontalpodautoscaler_spec_max_replicas|kube_statefulset_status_replicas_ready|container_fs_reads_total|node_namespace_pod_container:container_memory_cache|container_network_transmit_packets_dropped_total|kubelet_volume_stats_inodes_used|kube_node_spec_taint|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_pod_info|kubelet_cgroup_manager_duration_seconds_bucket|process_cpu_seconds_total|container_memory_cache|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_count|volume_manager_total_volumes|namespace_cpu:kube_pod_container_resource_limits:sum|kube_deployment_metadata_generation|kube_replicaset_owner|container_memory_swap|kubelet_certificate_manager_client_ttl_seconds|kube_resourcequota|container_fs_reads_bytes_total|kubelet_runtime_operations_total|kube_horizontalpodautoscaler_status_desired_replicas|kube_pod_status_phase|kube_horizontalpodautoscaler_spec_min_replicas|kubelet_server_expiration_renew_errors|kube_pod_container_resource_limits|container_network_transmit_bytes_total|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kube_pod_container_status_waiting_reason|container_network_receive_packets_total|kube_namespace_created|namespace_workload_pod|kube_pod_container_resource_requests|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_deployment_status_replicas_updated|kube_statefulset_status_observed_generation|kube_deployment_status_observed_generation|container_cpu_cfs_periods_total|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kubelet_certificate_manager_client_expiration_renew_errors|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kube_daemonset_updated_number_scheduled|kubelet_volume_stats_inodes|kube_node_info|kube_deployment_spec_replicas|container_memory_rss|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_volume_stats_available_bytes
                action: keep
          relabel_configs:
              - replacement: kubernetes.default.svc.cluster.local:443
                target_label: __address__
              - regex: (.+)
                replacement: /api/v1/nodes/${1}/proxy/metrics
                source_labels:
                  - __meta_kubernetes_node_name
                target_label: __metrics_path__
          scheme: https
          tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: false
              server_name: kubernetes
        - job_name: integrations/kubernetes/kube-state-metrics
          kubernetes_sd_configs:
              - role: service
          metric_relabel_configs:
              - source_labels: [__name__]
                regex: container_network_transmit_packets_total|kubelet_certificate_manager_server_ttl_seconds|storage_operation_duration_seconds_bucket|node_namespace_pod_container:container_memory_swap|container_fs_writes_total|container_network_receive_bytes_total|kube_daemonset_status_desired_number_scheduled|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|rest_client_requests_total|node_namespace_pod_container:container_memory_working_set_bytes|kubernetes_build_info|kube_node_status_capacity|kubelet_pleg_relist_duration_seconds_bucket|kubelet_running_pods|storage_operation_errors_total|kubelet_running_containers|kube_daemonset_status_number_misscheduled|kube_job_failed|kube_statefulset_status_replicas|kube_job_status_succeeded|container_cpu_cfs_throttled_periods_total|kube_statefulset_status_update_revision|process_resident_memory_bytes|kubelet_pod_start_duration_seconds_count|kubelet_running_container_count|container_fs_writes_bytes_total|machine_memory_bytes|kubelet_cgroup_manager_duration_seconds_count|node_namespace_pod_container:container_memory_rss|kubelet_node_config_error|kubelet_runtime_operations_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_job_spec_completions|kube_statefulset_status_current_revision|kube_statefulset_replicas|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_node_name|kubelet_pod_worker_duration_seconds_bucket|go_goroutines|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_runtime_operations_errors_total|kube_daemonset_status_number_available|kube_deployment_status_replicas_available|up|storage_operation_duration_seconds_count|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_updated|kube_node_status_condition|kube_node_status_allocatable|rest_client_request_duration_seconds_bucket|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|kubelet_pleg_relist_duration_seconds_count|kube_pod_owner|namespace_cpu:kube_pod_container_resource_requests:sum|kube_horizontalpodautoscaler_spec_max_replicas|kube_statefulset_status_replicas_ready|container_fs_reads_total|node_namespace_pod_container:container_memory_cache|container_network_transmit_packets_dropped_total|kubelet_volume_stats_inodes_used|kube_node_spec_taint|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_pod_info|kubelet_cgroup_manager_duration_seconds_bucket|process_cpu_seconds_total|container_memory_cache|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_count|volume_manager_total_volumes|namespace_cpu:kube_pod_container_resource_limits:sum|kube_deployment_metadata_generation|kube_replicaset_owner|container_memory_swap|kubelet_certificate_manager_client_ttl_seconds|kube_resourcequota|container_fs_reads_bytes_total|kubelet_runtime_operations_total|kube_horizontalpodautoscaler_status_desired_replicas|kube_pod_status_phase|kube_horizontalpodautoscaler_spec_min_replicas|kubelet_server_expiration_renew_errors|kube_pod_container_resource_limits|container_network_transmit_bytes_total|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kube_pod_container_status_waiting_reason|container_network_receive_packets_total|kube_namespace_created|namespace_workload_pod|kube_pod_container_resource_requests|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_deployment_status_replicas_updated|kube_statefulset_status_observed_generation|kube_deployment_status_observed_generation|container_cpu_cfs_periods_total|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kubelet_certificate_manager_client_expiration_renew_errors|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kube_daemonset_updated_number_scheduled|kubelet_volume_stats_inodes|kube_node_info|kube_deployment_spec_replicas|container_memory_rss|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_volume_stats_available_bytes
                action: keep
          relabel_configs:
              - action: keep
                regex: ksm-kube-state-metrics
                source_labels:
                  - __meta_kubernetes_service_name
    
EOF
(export NAMESPACE=default && kubectl apply -n $NAMESPACE -f -)

Be sure to replace:

  • url with your hosted Prometheus push URL.
  • username with your hosted Prometheus username.
  • password with your Grafana Cloud API key.
  • cluster with your cluster name (the default is set to cloud).

You can find your URL, username, and password details in the Cloud Portal.
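
For illustration, a filled-in remote_write block might look like the following (the URL and username here are examples only; use the values shown in your Cloud Portal):

remote_write:
- url: https://prometheus-us-central1.grafana.net/api/prom/push
  basic_auth:
    username: "123456"
    password: <your Grafana Cloud API key>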

If you deployed the Agent to a non-default Namespace in the previous step, replace NAMESPACE=default in this command with the new Namespace.

This ConfigMap configures:

  • Agent to scrape the cAdvisor, kubelet, and kube-state-metrics endpoints in your cluster, and ship scraped metrics to Grafana Cloud.
  • metric_relabel_configs to allowlist only metrics referenced in dashboards, alerting rules, and recording rules from the Kubernetes integration; all other metrics are dropped. You can add metrics to this allowlist (see the sketch below) or remove it entirely by omitting the metric_relabel_configs stanzas.
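
For example, to keep one additional metric, you could prepend it to the keep regex. This is a sketch only: my_app_requests_total is a hypothetical metric name, and the placeholder stands in for the long allowlist regex shown in the ConfigMap above:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: my_app_requests_total|<existing allowlist regex>
    action: keep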

To learn more about configuring the Agent, please see Configure Grafana Agent from the Agent docs. After deploying the ConfigMap, you should restart the Agent deployment.

Restart Grafana Agent

If you modify the Agent’s ConfigMap, you will need to restart the Agent Pod to pick up configuration changes. Use kubectl rollout to restart the Agent:

kubectl rollout restart deployment/grafana-agent

At this point, kube-state-metrics and Grafana Agent should be up and running in your cluster, and Agent should be scraping the kubelet and cAdvisor /metrics endpoints on each node, as well as the kube-state-metrics Service. From here, you can modify your Agent config to scrape other targets in your cluster.
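
To spot-check that the Agent is up, you can port-forward to its HTTP server (port 12345, per the http_listen_port set in the ConfigMap above) and request its self-reported metrics. This assumes the Agent exposes its own telemetry at /metrics on that port:

kubectl port-forward -n default deployment/grafana-agent 12345 &
curl -s http://localhost:12345/metrics | head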

Reinstall or upgrade the Integration

To reinstall the Kubernetes integration or upgrade from a previous version, first uninstall the integration:

Warning: Uninstalling an integration will delete its associated dashboard folder and alert and recording rule namespace. Any custom dashboards or alerts added to the default locations for this integration will also be removed.

  • Click on Onboarding (lightning bolt in the left-hand navigation bar), and then Integrations Management.
  • Click on the Kubernetes integration.
  • Click on Uninstall and then Uninstall integration.

From here, you can reinstall the integration using Walkthrough. This will install the latest version of the Kubernetes integration into your hosted Grafana instance, and provision the latest version of the Grafana Agent Kubernetes manifests.

Scraping Application Pod Metrics

By default, the Kubernetes integration only scrapes cAdvisor (1 per node), kubelet (1 per node), and kube-state-metrics (1 replica by default) endpoints. You can also configure Grafana Agent to scrape application Prometheus metrics, like those available at the standard /metrics endpoint on Pods.

For example, to add a scrape job targeting all /metrics endpoints on your cluster Pods, add the following to the bottom of your Agent scrape config:

  . . .
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Example relabel to scrape only pods that have
      # "example.io/should_be_scraped = true" annotation.
      #  - source_labels: [__meta_kubernetes_pod_annotation_example_io_should_be_scraped]
      #    action: keep
      #    regex: true
      #
      # Example relabel to customize metric path based on pod
      # "example.io/metric_path = <metric path>" annotation.
      #  - source_labels: [__meta_kubernetes_pod_annotation_example_io_metric_path]
      #    action: replace
      #    target_label: __metrics_path__
      #    regex: (.+)
      #
      # Example relabel to scrape only single, desired port for the pod
      # based on pod "example.io/scrape_port = <port>" annotation.
      #  - source_labels: [__address__, __meta_kubernetes_pod_annotation_example_io_scrape_port]
      #    action: replace
      #    regex: ([^:]+)(?::\d+)?;(\d+)
      #    replacement: $1:$2
      #    target_label: __address__
      # Expose Pod labels as metric labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Expose Pod namespace as metric namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      # Expose Pod name as metric name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

This config adds every defined Pod container port to Agent’s scrape targets, discovered using Agent’s Kubernetes service discovery mechanism. You can optionally uncomment the relevant sections to customize the metrics path (the default is /metrics), specify a scrape port, or use Pod annotations to declaratively specify which targets Agent should scrape in your Pod manifests. To learn more, please see the examples in the official Prometheus project repo.
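
If you uncomment the annotation-based relabels above, a Pod opting in to scraping might declare annotations like the following. This is a sketch: the my-app names, image, and port are illustrative, and the example.io annotation keys match the commented examples in the scrape config:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    example.io/should_be_scraped: "true"
    example.io/metric_path: "/metrics"
    example.io/scrape_port: "8080"
spec:
  containers:
    - name: my-app
      image: my-app:latest
      ports:
        - containerPort: 8080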

To learn more about configuring the Agent, please see Configure Grafana Agent from the Agent docs. To learn more about available kubernetes_sd_configs roles (we used the pod role here) and labels, please see kubernetes_sd_config from the Prometheus docs.

You can update your Agent configuration by modifying the ConfigMap and redeploying it. After editing the above ConfigMap, deploy it into your cluster using kubectl apply -f:

kubectl apply -f your_configmap.yaml

Next, restart the Agent to pick up the config changes:

kubectl rollout restart deployment/grafana-agent

Configured Scrape Targets

By default, Agent scrapes the following targets:

  • cAdvisor, which is present on each node in your cluster and emits container resource usage metrics like CPU usage, memory usage, and disk usage
  • kubelet, which is present on each node and emits metrics specific to the kubelet process like kubelet_running_pods and kubelet_running_container_count
  • kube-state-metrics, which runs as a Deployment and Service in your cluster and emits Prometheus metrics that track the state of objects in your cluster, like Pods, Deployments, DaemonSets, and more

The default ConfigMap configures an allowlist to drop all metrics not referenced in the Kubernetes integration dashboards, alerts, and recording rules. You can optionally modify this allowlist, replace it with a denylist (by using the drop directive), omit it entirely, or move it to the remote_write level so that it applies globally to all configured scrape jobs. To learn more, please see Reducing Prometheus metrics usage with relabeling.
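
For example, a hypothetical denylist applied globally at the remote_write level might look like the following (the go_.* pattern is illustrative; any series whose name matches it would be dropped before being shipped to Grafana Cloud):

remote_write:
- url: YOUR_REMOTE_WRITE_PUSH_URL
  basic_auth:
    username: YOUR_REMOTE_WRITE_USERNAME
    password: YOUR_REMOTE_WRITE_API_KEY
  write_relabel_configs:
    - source_labels: [__name__]
      regex: go_.*
      action: drop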

Dashboards

The Kubernetes integration includes 10 dashboards out of the box to help you get started with observing and monitoring your Kubernetes clusters and their workloads. This set includes the following:

  • (Home) Kubernetes Integration, the principal integration dashboard that displays high-level cluster resource usage and integration configuration status.

  • Kubernetes / Compute Resources (7 dashboards), a set of dashboards to drill down into resource usage by the following levels:

    • Multi-cluster
    • Cluster
    • Namespace (by Pods)
    • Namespace (by workloads, like Deployments or DaemonSets)
    • Node
    • Pods and containers
    • Workloads (Deployments, DaemonSets, StatefulSets, etc.)

    These dashboards contain links to sub-objects, so you can jump from cluster, to Namespace, to Pod, etc.

  • Kubernetes / Kubelet, a dashboard that helps you understand Kubelet performance on your Nodes, and provides useful summary metrics like the number of running Pods, Containers, and Volumes on a given Node.

  • Kubernetes / Persistent Volumes, a dashboard that helps you understand usage of your configured PersistentVolumes.

Alerting Rules

The Kubernetes integration includes the following alerting rules to help you get up and running with Grafana Cloud alerts and get notified when issues arise with your clusters and their workloads:

Kubelet alerts:

  • KubeNodeNotReady
  • KubeNodeUnreachable
  • KubeletTooManyPods
  • KubeNodeReadinessFlapping
  • KubeletPlegDurationHigh
  • KubeletPodStartUpLatencyHigh
  • KubeletClientCertificateExpiration
  • KubeletClientCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeletClientCertificateRenewalErrors
  • KubeletServerCertificateRenewalErrors
  • KubeletDown

Kubernetes system alerts:

  • KubeVersionMismatch
  • KubeClientErrors

Kubernetes resource usage alerts:

  • KubeCPUOvercommit
  • KubeMemoryOvercommit
  • KubeCPUQuotaOvercommit
  • KubeMemoryQuotaOvercommit
  • KubeQuotaAlmostFull
  • KubeQuotaFullyUsed
  • KubeQuotaExceeded
  • CPUThrottlingHigh

Kubernetes app alerts:

  • KubePodCrashLooping
  • KubePodNotReady
  • KubeDeploymentGenerationMismatch
  • KubeDeploymentReplicasMismatch
  • KubeStatefulSetReplicasMismatch
  • KubeStatefulSetGenerationMismatch
  • KubeStatefulSetUpdateNotRolledOut
  • KubeDaemonSetRolloutStuck
  • KubeContainerWaiting
  • KubeDaemonSetNotScheduled
  • KubeDaemonSetMisScheduled
  • KubeJobCompletion
  • KubeJobFailed
  • KubeHpaReplicasMismatch
  • KubeHpaMaxedOut

To learn more, see the upstream kubernetes-mixin’s Kubernetes Alert Runbooks page. You can programmatically update the alerting rule links in these preconfigured alerts to point to your own runbooks, using a tool like cortex-tools or grizzly (see the sketch below). To learn more, see Prometheus and Loki rules with cortextool and Alerts.
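
As a sketch, you could pull the integration’s rule groups with cortextool, edit the runbook annotations locally, and load them back. The endpoint, instance ID, and API key placeholders below come from your Cloud Portal, and rules.yaml is a hypothetical local file:

cortextool rules print --address=<your-prometheus-endpoint> --id=<your-instance-id> --key=<your-api-key>
cortextool rules load --address=<your-prometheus-endpoint> --id=<your-instance-id> --key=<your-api-key> rules.yaml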

Recording Rules

The Kubernetes integration includes the following recording rules to speed up dashboard queries and alerting rule evaluation:

  • node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  • node_namespace_pod_container:container_memory_working_set_bytes
  • node_namespace_pod_container:container_memory_rss
  • node_namespace_pod_container:container_memory_cache
  • node_namespace_pod_container:container_memory_swap
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
  • namespace_memory:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • namespace_cpu:kube_pod_container_resource_requests:sum
  • cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
  • namespace_memory:kube_pod_container_resource_limits:sum
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • namespace_cpu:kube_pod_container_resource_limits:sum
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel
  • namespace_workload_pod:kube_pod_owner:relabel

Note that recording rules may emit time series with the same metric name, but different labels.

To learn how to modify these programmatically, please see Prometheus and Loki rules with cortextool.

Metrics and Usage

By default, the Kubernetes integration configures allowlists using Prometheus relabel_config blocks. To learn more about relabel_configs, metric_relabel_configs and write_relabel_configs, please see Reducing Prometheus metrics usage with relabeling.

These allowlists drop any metrics not referenced in integration dashboards, rules, and alerts. To omit or modify the allowlists, modify the corresponding metric_relabel_configs blocks in your Agent configuration. To learn more about analyzing and controlling active series usage, please consult Control Prometheus metrics usage.
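
For example, to see which metrics contribute the most active series after the allowlist is applied, you can run a standard PromQL cardinality query like the following in Grafana’s Explore view:

topk(10, count by (__name__) ({__name__=~".+"}))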

Grafana Cloud billing is based on billable series. To learn more about the pricing model, please consult Active series and DPM.

Default active series usage varies depending on your Kubernetes cluster size (number of Nodes) and running workloads (number of Pods, containers, Deployments, etc.).

When testing on a cloud provider’s Kubernetes offering, the following active series usage was observed:

  • 3 node cluster, 17 running pods, 31 running containers: 3.8k active series
    • The only Pods deployed into the cluster were Grafana Agent and kube-state-metrics. The rest were running in the kube-system Namespace and managed by the cloud provider
  • From this baseline, active series usage roughly increased by:
    • 1000 active series per additional Node
    • 75 active series per additional Pod (vanilla Nginx Pods were deployed into the cluster)

These are very rough guidelines and results may vary depending on your cloud provider or Kubernetes version. Note also that these figures are based on the scrape targets configured above, and do not include additional targets like application metrics, API server metrics, and scheduler metrics.

Logs

The Kubernetes integration will soon support out-of-the-box configuration for shipping logs to your hosted Loki endpoint. In the meantime, you can get started shipping Pod logs to Grafana Cloud by following the Agent Logs Quickstart. This will roll out a DaemonSet of Grafana Agents into your cluster that will tail container logs on each Node and remote_write these to Grafana Cloud.

Traces

The Kubernetes integration will soon support out-of-the-box configuration for shipping traces to your hosted Tempo endpoint. In the meantime, you can get started shipping traces to Grafana Cloud by following the Agent Traces Quickstart. This will roll out a single-replica Agent Deployment that will receive Traces and remote_write these to Grafana Cloud.

Grafana Cloud Integrations

Grafana Cloud will soon support integrations on Kubernetes as a platform, like the Linux Node Integration (node-exporter), Redis integration, MySQL integration, and many more. In the meantime, to use embedded Agent exporters/integrations, you must configure them manually. To learn how to do this, please see integrations_config from the Agent docs.
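
As a sketch, enabling the Agent’s embedded node_exporter integration might look like the following in agent.yaml. Consult integrations_config in the Agent docs for the exact keys supported by your Agent version; the remote_write placeholders match those used earlier in this guide:

integrations:
  prometheus_remote_write:
    - url: YOUR_REMOTE_WRITE_PUSH_URL
      basic_auth:
        username: YOUR_REMOTE_WRITE_USERNAME
        password: YOUR_REMOTE_WRITE_API_KEY
  node_exporter:
    enabled: true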

Node-exporter metrics

For node-exporter or host system metrics, you can roll out the node-exporter Helm Chart and add the following Agent scrape config job:

. . .
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: integrations/node-exporter
  kubernetes_sd_configs:
    - namespaces:
        names:
          - NODE_EXPORTER_NAMESPACE_HERE
      role: pod
  relabel_configs:
    - action: keep
      regex: node-exporter
      source_labels:
        - __meta_kubernetes_pod_label_name
    - action: replace
      source_labels:
        - __meta_kubernetes_pod_node_name
      target_label: instance
    - action: replace
      source_labels:
        - __meta_kubernetes_namespace
      target_label: namespace
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
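
A minimal sketch for deploying the chart follows; the node-exporter release name is illustrative, and NODE_EXPORTER_NAMESPACE_HERE matches the placeholder in the scrape config above. Note that the keep relabel above matches Pods carrying a name: node-exporter label, so adjust the regex or source label to match whatever labels your chart release actually applies to its Pods:

helm install node-exporter prometheus-community/prometheus-node-exporter --namespace NODE_EXPORTER_NAMESPACE_HERE --create-namespace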

Correlating data across Metrics, Logs, and Traces

Documentation for configuring correlation across metrics, logs, and traces, specifically for Kubernetes workloads, is forthcoming.

In the interim, please consult Intro to monitoring Kubernetes with Grafana Cloud. Note that this video was published prior to the release of the current version of the Kubernetes integration, so some concepts may differ slightly.