Help build the future of open source observability software Open positions

Check out the open source projects we support Downloads

Grot cannot remember your choice unless you click the consent notice at the bottom.

How to use Argo CD to configure Kubernetes Monitoring in Grafana Cloud

How to use Argo CD to configure Kubernetes Monitoring in Grafana Cloud

May 23, 2023 12 min

Since Kubernetes Monitoring launched in Grafana Cloud last year, we have introduced highly customizable dashboards and powerful analytics features. We’ve also focused on how to make monitoring and managing resource utilization within your fleet easier and more efficient. 

But what’s an easy way to add resources to your cluster while using Kubernetes Monitoring? ArgoCD has become the established tool for implementing GitOps workflows, simplifying the process of deploying and managing Kubernetes resources. If you are using Argo CD to manage your Kubernetes deployments, you can easily configure Grafana Cloud’s Kubernetes Monitoring solution to work with your cluster. 

In this blog post, we’ll walk you through how to use Argo CD to configure Kubernetes Monitoring in Grafana Cloud. 

What is Argo CD?

Argo CD is a declarative, open source GitOps continuous delivery tool that helps to automate the deployment of applications to Kubernetes clusters. It provides a centralized and standardized way of managing Kubernetes resources in large and complex environments. By using Argo CD, users can easily manage their Kubernetes resources and applications using Git repositories, allowing for version control and auditability.

Argo CD’s key features and capabilities include the ability to manually or automatically deploy applications to a Kubernetes cluster; automatic synchronization of application state to the current version of declarative configuration; a web user interface and command-line interface; and the ability to detect and remediate configuration drift. It also has role-based access control (RBAC), which allows for multi-cluster management, single sign-on (SSO) with various providers, and support for webhooks that trigger actions.

Why pair Argo CD with Kubernetes Monitoring in Grafana Cloud?

Kubernetes Monitoring in Grafana Cloud simplifies monitoring your Kubernetes fleet by:

  • Deploying Grafana Agent with useful defaults for collecting metrics that aid in cardinality management.
  • Providing custom queries to collect useful logs.
  • Offering preconfigured dashboards, alert rules, and recording rules for users to plug into their existing infrastructure. 

But there is some manual effort required to deploy the necessary resources for Kubernetes Monitoring in Grafana Cloud. Users have to copy the configurations and run the commands, which are outlined in Grafana Cloud. While easy enough to do, this process can quickly become error-prone, especially in larger, more complex environments where multiple teams may be deploying resources across multiple clusters. Plus each change must be verified and deployed to each cluster by hand.

As a result, GitOps tools like Argo CD can help to automate Kubernetes resource deployment. Argo CD takes a declarative, GitOps approach to Kubernetes resource management, allowing for automated and standardized deployments across multiple clusters. This method not only saves time, but also reduces the risk of deployment errors and inconsistencies.

There are three key advantages to deploying Kubernetes Monitoring with Argo CD:

  1. You’ll use a GitOps-based approach to manage resources required by Grafana Cloud’s Kubernetes Monitoring solution.
  2. It’s easier to make changes in a controlled environment to Grafana Agent configuration and resources.
  3. The Git repository would be maintained as a single source of truth for your Kubernetes resources.
Architecture diagram of Argo CD set up with Grafana Cloud using arrows and logos.

Prerequisites for configuring Argo CD with Grafana Cloud

Before you begin, make sure you have the following available:

  • Argo CD installed and running in your Kubernetes cluster
  • A Git repository to store configuration files
  • A Grafana Cloud account (You can sign up for free today!)

Metrics and events: Deploy Grafana Agent ConfigMap and StatefulSet

Create Grafana Agent ConfigMap

Note: The default scrape interval is set to 60s. Scrape intervals lower than 60s may result in increased costs (DPM > 1). To learn more, see our documentation on active series and DPM.

Replace
${NAMESPACE} with your namespace
${METRICS_USERNAME} with Prometheus Instance ID
${PROMETHEUS_URL} with the push endpoint URL of the Prometheus Instance
${LOGS_USERNAME} with Loki Username ${LOKI_URL} with the push endpoint URL of the Loki Instance
${GRAFANA_CLOUD_API_KEY}  with your Grafana Cloud API Key

Paste the following script into your shell and run it to create the metrics Agent ConfigMap configuration file.

cat >> metrics.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
 name: grafana-agent
 namespace: ${NAMESPACE}
data:
 agent.yaml: |   
   metrics:
     wal_directory: /var/lib/agent/wal
     global:
       scrape_interval: 60s
       external_labels:
         cluster: cloud
     configs:
     - name: integrations
       remote_write:
       - url: ${PROMETHEUS_URL}
         basic_auth:
           username:${METRICS_USERNAME}
           password:${GRAFANA_CLOUD_API_KEY}
       scrape_configs:
       - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
         job_name: integrations/kubernetes/cadvisor
         kubernetes_sd_configs:
             - role: node
         metric_relabel_configs:
             - source_labels: [__name__]
               regex: container_fs_writes_bytes_total|container_memory_working_set_bytes|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_rss|kube_pod_owner|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|container_cpu_cfs_periods_total|kubelet_pleg_relist_interval_seconds_bucket|kube_statefulset_status_update_revision|container_fs_writes_total|kube_pod_container_resource_requests|kube_deployment_metadata_generation|kubelet_certificate_manager_server_ttl_seconds|kube_node_status_allocatable|kubelet_running_pod_count|volume_manager_total_volumes|kube_pod_container_resource_limits|kube_persistentvolumeclaim_resource_requests_storage_bytes|rest_client_requests_total|kubelet_volume_stats_inodes_used|kube_job_status_start_time|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_memory_cache|node_namespace_pod_container:container_memory_swap|container_fs_reads_bytes_total|kube_horizontalpodautoscaler_status_current_replicas|kubelet_pod_start_duration_seconds_count|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_workload_pod|kubelet_running_containers|kubernetes_build_info|kube_node_status_condition|node_namespace_pod_container:container_memory_cache|kubelet_cgroup_manager_duration_seconds_bucket|kube_pod_status_phase|container_network_transmit_packets_dropped_total|kubelet_runtime_operations_errors_total|kube_statefulset_status_current_revision|container_memory_swap|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kube_daemonset_status_desired_number_scheduled|kube_deployment_status_replicas_available|machine_memory_bytes|kube_node_info|kubelet_running_pods|kubelet_server_expiration_renew_errors|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|container_network_transmit_packets_total|go_goroutines|kube_pod_container_status_waiting_reason|kube_horizontalpodautoscaler_spec_min_replicas|kube_job_failed|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_current_number_scheduled|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_node_name|kube_statefulset_replicas|namespace_cpu:kube_pod_container_resource_limits:sum|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_statefulset_status_observed_generation|container_network_transmit_bytes_total|kubelet_volume_stats_available_bytes|kubelet_certificate_manager_client_ttl_seconds|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_bucket|storage_operation_errors_total|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_info|process_resident_memory_bytes|kubelet_pleg_relist_duration_seconds_count|kube_daemonset_status_number_misscheduled|container_network_receive_packets_dropped_total|node_filesystem_size_bytes|kube_statefulset_status_replicas_ready|kube_deployment_status_replicas_updated|kube_deployment_status_observed_generation|kube_deployment_spec_replicas|kube_statefulset_status_replicas_updated|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_namespace_status_phase|node_namespace_pod_container:container_memory_working_set_bytes|kubelet_pod_start_duration_seconds_bucket|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes|kube_daemonset_status_updated_number_scheduled|kube_replicaset_owner|container_network_receive_bytes_total|namespace_cpu:kube_pod_container_resource_requests:sum|container_fs_reads_total|kube_node_status_capacity|kube_node_spec_taint|storage_operation_duration_seconds_count|kubelet_pod_worker_duration_seconds_count|kube_resourcequota|node_filesystem_avail_bytes|kubelet_cgroup_manager_duration_seconds_count|kube_horizontalpodautoscaler_status_desired_replicas|namespace_memory:kube_pod_container_resource_limits:sum|container_cpu_cfs_throttled_periods_total|kubelet_node_config_error|kube_job_status_active|kube_daemonset_status_number_available|kube_statefulset_status_replicas|container_network_receive_packets_total|kube_pod_status_reason|kubelet_running_container_count|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
               action: keep
         relabel_configs:
             - replacement: kubernetes.default.svc.cluster.local:443
               target_label: __address__
             - regex: (.+)
               replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
               source_labels:
                 - __meta_kubernetes_node_name
               target_label: __metrics_path__
         scheme: https
         tls_config:
             ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
             insecure_skip_verify: false
             server_name: kubernetes
       - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
         job_name: integrations/kubernetes/kubelet
         kubernetes_sd_configs:
             - role: node
         metric_relabel_configs:
             - source_labels: [__name__]
               regex: container_fs_writes_bytes_total|container_memory_working_set_bytes|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_rss|kube_pod_owner|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|container_cpu_cfs_periods_total|kubelet_pleg_relist_interval_seconds_bucket|kube_statefulset_status_update_revision|container_fs_writes_total|kube_pod_container_resource_requests|kube_deployment_metadata_generation|kubelet_certificate_manager_server_ttl_seconds|kube_node_status_allocatable|kubelet_running_pod_count|volume_manager_total_volumes|kube_pod_container_resource_limits|kube_persistentvolumeclaim_resource_requests_storage_bytes|rest_client_requests_total|kubelet_volume_stats_inodes_used|kube_job_status_start_time|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_memory_cache|node_namespace_pod_container:container_memory_swap|container_fs_reads_bytes_total|kube_horizontalpodautoscaler_status_current_replicas|kubelet_pod_start_duration_seconds_count|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_workload_pod|kubelet_running_containers|kubernetes_build_info|kube_node_status_condition|node_namespace_pod_container:container_memory_cache|kubelet_cgroup_manager_duration_seconds_bucket|kube_pod_status_phase|container_network_transmit_packets_dropped_total|kubelet_runtime_operations_errors_total|kube_statefulset_status_current_revision|container_memory_swap|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kube_daemonset_status_desired_number_scheduled|kube_deployment_status_replicas_available|machine_memory_bytes|kube_node_info|kubelet_running_pods|kubelet_server_expiration_renew_errors|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|container_network_transmit_packets_total|go_goroutines|kube_pod_container_status_waiting_reason|kube_horizontalpodautoscaler_spec_min_replicas|kube_job_failed|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_current_number_scheduled|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_node_name|kube_statefulset_replicas|namespace_cpu:kube_pod_container_resource_limits:sum|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_statefulset_status_observed_generation|container_network_transmit_bytes_total|kubelet_volume_stats_available_bytes|kubelet_certificate_manager_client_ttl_seconds|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_bucket|storage_operation_errors_total|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_info|process_resident_memory_bytes|kubelet_pleg_relist_duration_seconds_count|kube_daemonset_status_number_misscheduled|container_network_receive_packets_dropped_total|node_filesystem_size_bytes|kube_statefulset_status_replicas_ready|kube_deployment_status_replicas_updated|kube_deployment_status_observed_generation|kube_deployment_spec_replicas|kube_statefulset_status_replicas_updated|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_namespace_status_phase|node_namespace_pod_container:container_memory_working_set_bytes|kubelet_pod_start_duration_seconds_bucket|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes|kube_daemonset_status_updated_number_scheduled|kube_replicaset_owner|container_network_receive_bytes_total|namespace_cpu:kube_pod_container_resource_requests:sum|container_fs_reads_total|kube_node_status_capacity|kube_node_spec_taint|storage_operation_duration_seconds_count|kubelet_pod_worker_duration_seconds_count|kube_resourcequota|node_filesystem_avail_bytes|kubelet_cgroup_manager_duration_seconds_count|kube_horizontalpodautoscaler_status_desired_replicas|namespace_memory:kube_pod_container_resource_limits:sum|container_cpu_cfs_throttled_periods_total|kubelet_node_config_error|kube_job_status_active|kube_daemonset_status_number_available|kube_statefulset_status_replicas|container_network_receive_packets_total|kube_pod_status_reason|kubelet_running_container_count|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
               action: keep
         relabel_configs:
             - replacement: kubernetes.default.svc.cluster.local:443
               target_label: __address__
             - regex: (.+)
               replacement: /api/v1/nodes/${1}/proxy/metrics
               source_labels:
                 - __meta_kubernetes_node_name
               target_label: __metrics_path__
         scheme: https
         tls_config:
             ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
             insecure_skip_verify: false
             server_name: kubernetes
       - job_name: integrations/kubernetes/kube-state-metrics
         kubernetes_sd_configs:
             - role: pod
         metric_relabel_configs:
             - source_labels: [__name__]
               regex: container_fs_writes_bytes_total|container_memory_working_set_bytes|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_rss|kube_pod_owner|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|container_cpu_cfs_periods_total|kubelet_pleg_relist_interval_seconds_bucket|kube_statefulset_status_update_revision|container_fs_writes_total|kube_pod_container_resource_requests|kube_deployment_metadata_generation|kubelet_certificate_manager_server_ttl_seconds|kube_node_status_allocatable|kubelet_running_pod_count|volume_manager_total_volumes|kube_pod_container_resource_limits|kube_persistentvolumeclaim_resource_requests_storage_bytes|rest_client_requests_total|kubelet_volume_stats_inodes_used|kube_job_status_start_time|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_memory_cache|node_namespace_pod_container:container_memory_swap|container_fs_reads_bytes_total|kube_horizontalpodautoscaler_status_current_replicas|kubelet_pod_start_duration_seconds_count|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_workload_pod|kubelet_running_containers|kubernetes_build_info|kube_node_status_condition|node_namespace_pod_container:container_memory_cache|kubelet_cgroup_manager_duration_seconds_bucket|kube_pod_status_phase|container_network_transmit_packets_dropped_total|kubelet_runtime_operations_errors_total|kube_statefulset_status_current_revision|container_memory_swap|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kube_daemonset_status_desired_number_scheduled|kube_deployment_status_replicas_available|machine_memory_bytes|kube_node_info|kubelet_running_pods|kubelet_server_expiration_renew_errors|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|container_network_transmit_packets_total|go_goroutines|kube_pod_container_status_waiting_reason|kube_horizontalpodautoscaler_spec_min_replicas|kube_job_failed|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_current_number_scheduled|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_node_name|kube_statefulset_replicas|namespace_cpu:kube_pod_container_resource_limits:sum|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_statefulset_status_observed_generation|container_network_transmit_bytes_total|kubelet_volume_stats_available_bytes|kubelet_certificate_manager_client_ttl_seconds|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_bucket|storage_operation_errors_total|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_info|process_resident_memory_bytes|kubelet_pleg_relist_duration_seconds_count|kube_daemonset_status_number_misscheduled|container_network_receive_packets_dropped_total|node_filesystem_size_bytes|kube_statefulset_status_replicas_ready|kube_deployment_status_replicas_updated|kube_deployment_status_observed_generation|kube_deployment_spec_replicas|kube_statefulset_status_replicas_updated|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_namespace_status_phase|node_namespace_pod_container:container_memory_working_set_bytes|kubelet_pod_start_duration_seconds_bucket|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes|kube_daemonset_status_updated_number_scheduled|kube_replicaset_owner|container_network_receive_bytes_total|namespace_cpu:kube_pod_container_resource_requests:sum|container_fs_reads_total|kube_node_status_capacity|kube_node_spec_taint|storage_operation_duration_seconds_count|kubelet_pod_worker_duration_seconds_count|kube_resourcequota|node_filesystem_avail_bytes|kubelet_cgroup_manager_duration_seconds_count|kube_horizontalpodautoscaler_status_desired_replicas|namespace_memory:kube_pod_container_resource_limits:sum|container_cpu_cfs_throttled_periods_total|kubelet_node_config_error|kube_job_status_active|kube_daemonset_status_number_available|kube_statefulset_status_replicas|container_network_receive_packets_total|kube_pod_status_reason|kubelet_running_container_count|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
               action: keep
         relabel_configs:
             - action: keep
               regex: kube-state-metrics
               source_labels:
                 - __meta_kubernetes_pod_label_app_kubernetes_io_name
       - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
         job_name: integrations/node_exporter
         kubernetes_sd_configs:
             - namespaces:
                 names:
                     - default
               role: pod
         metric_relabel_configs:
             - source_labels: [__name__]
               regex: container_fs_writes_bytes_total|container_memory_working_set_bytes|kube_statefulset_metadata_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_rss|kube_pod_owner|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|container_cpu_cfs_periods_total|kubelet_pleg_relist_interval_seconds_bucket|kube_statefulset_status_update_revision|container_fs_writes_total|kube_pod_container_resource_requests|kube_deployment_metadata_generation|kubelet_certificate_manager_server_ttl_seconds|kube_node_status_allocatable|kubelet_running_pod_count|volume_manager_total_volumes|kube_pod_container_resource_limits|kube_persistentvolumeclaim_resource_requests_storage_bytes|rest_client_requests_total|kubelet_volume_stats_inodes_used|kube_job_status_start_time|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_memory_cache|node_namespace_pod_container:container_memory_swap|container_fs_reads_bytes_total|kube_horizontalpodautoscaler_status_current_replicas|kubelet_pod_start_duration_seconds_count|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_workload_pod|kubelet_running_containers|kubernetes_build_info|kube_node_status_condition|node_namespace_pod_container:container_memory_cache|kubelet_cgroup_manager_duration_seconds_bucket|kube_pod_status_phase|container_network_transmit_packets_dropped_total|kubelet_runtime_operations_errors_total|kube_statefulset_status_current_revision|container_memory_swap|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kube_daemonset_status_desired_number_scheduled|kube_deployment_status_replicas_available|machine_memory_bytes|kube_node_info|kubelet_running_pods|kubelet_server_expiration_renew_errors|container_cpu_usage_seconds_total|namespace_workload_pod:kube_pod_owner:relabel|container_network_transmit_packets_total|go_goroutines|kube_pod_container_status_waiting_reason|kube_horizontalpodautoscaler_spec_min_replicas|kube_job_failed|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_current_number_scheduled|namespace_memory:kube_pod_container_resource_requests:sum|kubelet_node_name|kube_statefulset_replicas|namespace_cpu:kube_pod_container_resource_limits:sum|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_statefulset_status_observed_generation|container_network_transmit_bytes_total|kubelet_volume_stats_available_bytes|kubelet_certificate_manager_client_ttl_seconds|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_bucket|storage_operation_errors_total|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_info|process_resident_memory_bytes|kubelet_pleg_relist_duration_seconds_count|kube_daemonset_status_number_misscheduled|container_network_receive_packets_dropped_total|node_filesystem_size_bytes|kube_statefulset_status_replicas_ready|kube_deployment_status_replicas_updated|kube_deployment_status_observed_generation|kube_deployment_spec_replicas|kube_statefulset_status_replicas_updated|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_namespace_status_phase|node_namespace_pod_container:container_memory_working_set_bytes|kubelet_pod_start_duration_seconds_bucket|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes|kube_daemonset_status_updated_number_scheduled|kube_replicaset_owner|container_network_receive_bytes_total|namespace_cpu:kube_pod_container_resource_requests:sum|container_fs_reads_total|kube_node_status_capacity|kube_node_spec_taint|storage_operation_duration_seconds_count|kubelet_pod_worker_duration_seconds_count|kube_resourcequota|node_filesystem_avail_bytes|kubelet_cgroup_manager_duration_seconds_count|kube_horizontalpodautoscaler_status_desired_replicas|namespace_memory:kube_pod_container_resource_limits:sum|container_cpu_cfs_throttled_periods_total|kubelet_node_config_error|kube_job_status_active|kube_daemonset_status_number_available|kube_statefulset_status_replicas|container_network_receive_packets_total|kube_pod_status_reason|kubelet_running_container_count|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
               action: keep
         relabel_configs:
             - action: keep
               regex: prometheus-node-exporter.*
               source_labels:
                 - __meta_kubernetes_pod_label_app_kubernetes_io_name
             - action: replace
               source_labels:
                 - __meta_kubernetes_pod_node_name
               target_label: instance
             - action: replace
               source_labels:
                 - __meta_kubernetes_namespace
               target_label: namespace
         tls_config:
             ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
             insecure_skip_verify: false
      
   integrations:
     eventhandler:
       cache_path: /var/lib/agent/eventhandler.cache
       logs_instance: integrations
   logs:
     configs:
     - name: integrations
       clients:
       - url: ${GRAFANA_CLOUD_API_KEY}
         basic_auth:
           username: ${LOGS_USERNAME}
           password: ${LOKI_URL}
         external_labels:
           cluster: cloud
           job: integrations/kubernetes/eventhandler
       positions:
         filename: /tmp/positions.yaml
       target_config:
         sync_period: 10s

This ConfigMap configures the Grafana Agent StatefulSet to scrape the cadvisor, kubelet, and kube-state-metrics endpoints in your cluster. It also configures Grafana Agent to collect Kubernetes events from your cluster’s control plane and ship these to your Grafana Cloud Loki instance. To learn more about configuring Grafana Agent, see our  Grafana Agent configuration reference docs.

Create Grafana Agent StatefulSet

Replace ${NAMESPACE} with your namespace and paste the following script into your shell and run it to create the configuration file for metrics Grafana Agent StatefulSet.

cat >> metrics-agent.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
 name: grafana-agent
 namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: grafana-agent
rules:
- apiGroups:
 - ""
 resources:
 - nodes
 - nodes/proxy
 - services
 - endpoints
 - pods
 - events
 verbs:
 - get
 - list
 - watch
- nonResourceURLs:
 - /metrics
 verbs:
 - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
 name: grafana-agent
roleRef:
 apiGroup: rbac.authorization.k8s.io
 kind: ClusterRole
 name: grafana-agent
subjects:
- kind: ServiceAccount
 name: grafana-agent
 namespace: ${NAMESPACE}
---
apiVersion: v1
kind: Service
metadata:
 labels:
   name: grafana-agent
 name: grafana-agent
 namespace: ${NAMESPACE}
spec:
 clusterIP: None
 ports:
 - name: grafana-agent-http-metrics
   port: 80
   targetPort: 80
 selector:
   name: grafana-agent
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
 name: grafana-agent
 namespace: ${NAMESPACE}
spec:
 replicas: 1
 selector:
   matchLabels:
     name: grafana-agent
 serviceName: grafana-agent
 template:
   metadata:
     labels:
       name: grafana-agent
   spec:
     containers:
     - args:
       - -config.expand-env=true
       - -config.file=/etc/agent/agent.yaml
       - -enable-features=integrations-next
       - -server.http.address=0.0.0.0:80
       env:
       - name: HOSTNAME
         valueFrom:
           fieldRef:
             fieldPath: spec.nodeName
       image: grafana/agent:v0.32.1
       imagePullPolicy: IfNotPresent
       name: grafana-agent
       ports:
       - containerPort: 80
         name: http-metrics
       volumeMounts:
       - mountPath: /var/lib/agent
         name: agent-wal
       - mountPath: /etc/agent
         name: grafana-agent
     serviceAccountName: grafana-agent
     volumes:
     - configMap:
         name: grafana-agent
       name: grafana-agent
 updateStrategy:
   type: RollingUpdate
 volumeClaimTemplates:
 - apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: agent-wal
     namespace: ${NAMESPACE}
   spec:
     accessModes:
     - ReadWriteOnce
     resources:
       requests:
         storage: 5Gi

Kube-state-metrics

kube-state-metrics watches Kubernetes resources in your cluster and emits Prometheus metrics that can be scraped by Grafana Agent. To learn more, see the kube-state-metrics docs.

Replace ${Argo CD_NAMESPACE} with your namespace in which Argo CD is deployed and ${NAMESPACE} with the namespace where you want to deploy Kubernetes Monitoring resources. Then paste the following script into your shell and run it to create the configuration file for kube-state-metrics Argo CD application.

Node_exporter

Node_exporter is used to collect host metrics from each node in your cluster, and emits them as Prometheus metrics.

Replace ${Argo CD_NAMESPACE} with your namespace in which Argo CD is deployed and ${NAMESPACE} with the namespace where you want to deploy Kubernetes Monitoring resources. Then paste the following script into your shell and run it to create the configuration file for node_exporter Argo CD application.

cat >> node-exporter.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
 name: nodeexporter
 namespace: ${Argo CD_NAMESPACE}
spec:
 destination:
   name: ''
   namespace: ${NAMESPACE}
   server: 'https://kubernetes.default.svc'
 source:
   path: ''
   repoURL: 'https://prometheus-community.github.io/helm-charts'
   targetRevision: 4.16.0
   chart: prometheus-node-exporter
 sources: []
 project: default
 syncPolicy:
   automated:
     prune: true
     selfHeal: true
   retry:
     limit: 2
     backoff:
       duration: 5s
       maxDuration: 3m0s
       factor: 2
   syncOptions:
     - CreateNamespace=true

Logs: Deploy Grafana Agent ConfigMap and DaemonSet

Create Grafana Agent ConfigMap

Replace
${NAMESPACE} with your namespace
${METRICS_USERNAME} with Prometheus Instance ID
${PROMETHEUS_URL} with the push endpoint URL of the Prometheus Instance
${LOGS_USERNAME} with Loki Username
${LOKI_URL} with the push endpoint URL of the Loki Instance
${GRAFANA_CLOUD_API_KEY} 

Paste the following script into your shell and run it to create the logs Agent ConfigMap configuration file.

cat >> logs.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
 name: grafana-agent-logs
 namespace: ${NAMESPACE}
data:
 agent.yaml: |   
   metrics:
     wal_directory: /tmp/grafana-agent-wal
     global:
       scrape_interval: 60s
       external_labels:
         cluster: cloud
     configs:
     - name: integrations
       remote_write:
       - url: ${PROMETHEUS_URL}
         basic_auth:
           username: ${METRICS_USERNAME}
           password: ${GRAFANA_CLOUD_API_KEY}
   integrations:
     prometheus_remote_write:
     - url: ${PROMETHEUS_URL}
       basic_auth:
         username: ${METRICS_USERNAME}
         password: ${GRAFANA_CLOUD_API_KEY}
    
    
   logs:
     configs:
     - name: integrations
       clients:
       - url: ${LOKI_URL}
         basic_auth:
           username: ${LOGS_USERNAME}
           password: ${GRAFANA_CLOUD_API_KEY}
         external_labels:
           cluster: cloud
       positions:
         filename: /tmp/positions.yaml
       target_config:
         sync_period: 10s
       scrape_configs:
       - job_name: integrations/kubernetes/pod-logs
         kubernetes_sd_configs:
           - role: pod
         pipeline_stages:
           - docker: {}
         relabel_configs:
           - source_labels:
               - __meta_kubernetes_pod_node_name
             target_label: __host__
           - action: replace
             replacement: $1
             separator: /
             source_labels:
               - __meta_kubernetes_namespace
               - __meta_kubernetes_pod_name
             target_label: job
           - action: replace
             source_labels:
               - __meta_kubernetes_namespace
             target_label: namespace
           - action: replace
             source_labels:
               - __meta_kubernetes_pod_name
             target_label: pod
           - action: replace
             source_labels:
               - __meta_kubernetes_pod_container_name
             target_label: container
           - replacement: /var/log/pods/*$1/*.log
             separator: /
             source_labels:
               - __meta_kubernetes_pod_uid
               - __meta_kubernetes_pod_container_name
             target_label: __path__

Create Grafana Agent DaemonSet

Replace ${NAMESPACE} with your namespace and paste the following script into your shell and run it to create the configuration file for Logs Grafana Agent DaemonSet.

cat >> logs-agent.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
 name: grafana-agent-logs
 namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
 name: grafana-agent-logs
rules:
- apiGroups:
 - ""
 resources:
 - nodes
 - nodes/proxy
 - services
 - endpoints
 - pods
 - events
 verbs:
 - get
 - list
 - watch
- nonResourceURLs:
 - /metrics
 verbs:
 - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
 name: grafana-agent-logs
roleRef:
 apiGroup: rbac.authorization.k8s.io
 kind: ClusterRole
 name: grafana-agent-logs
subjects:
- kind: ServiceAccount
 name: grafana-agent-logs
 namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
 name: grafana-agent-logs
 namespace: ${NAMESPACE}
spec:
 minReadySeconds: 10
 selector:
   matchLabels:
     name: grafana-agent-logs
 template:
   metadata:
     labels:
       name: grafana-agent-logs
   spec:
     containers:
     - args:
       - -config.expand-env=true
       - -config.file=/etc/agent/agent.yaml
       - -server.http.address=0.0.0.0:80
       env:
       - name: HOSTNAME
         valueFrom:
           fieldRef:
             fieldPath: spec.nodeName
       image: grafana/agent:v0.32.1
       imagePullPolicy: IfNotPresent
       name: grafana-agent-logs
       ports:
       - containerPort: 80
         name: http-metrics
       securityContext:
         privileged: true
         runAsUser: 0
       volumeMounts:
       - mountPath: /etc/agent
         name: grafana-agent-logs
       - mountPath: /var/log
         name: varlog
       - mountPath: /var/lib/docker/containers
         name: varlibdockercontainers
         readOnly: true
     serviceAccountName: grafana-agent-logs
     tolerations:
     - effect: NoSchedule
       operator: Exists
     volumes:
     - configMap:
         name: grafana-agent-logs
       name: grafana-agent-logs
     - hostPath:
         path: /var/log
       name: varlog
     - hostPath:
         path: /var/lib/docker/containers
       name: varlibdockercontainers
 updateStrategy:
   type: RollingUpdate

Git repository setup

The ideal setup is to have changes that are made to the main branch of your Git repository automatically synced within Kubernetes. To achieve this, follow the steps below:

  1. Create a folder in your Git repository (For this post, let’s create a folder named kubernetes-monitoring).
  2. Push all the manifest files for earlier steps (metrics.yaml, metrics-agent.yaml, kube-state-metrics.yaml, node-exporter.yaml, logs.yaml and logs-agent.yaml) into the newly created folder.

Argo CD application setup

Screenshot of Argo CD web interface for creating a new app.
ArgoCD New App web interface.

In the Argo CD web interface, click on the New App button. Enter a name for the application, such as grafana-kubernetes-monitoring, and select the your desired project name (if there is none, select default) and sync policy as automatic. Select Prune Resources and Self Heal options to ensure configuration in Git Repository is always maintained within the Kubernetes cluster. Also tick the Auto-Create Namespace option and the Retry options. 

Under Source, enter the URL for your Git repository along with the revision (by default it is HEAD). Set the path to the folder (kubernetes-monitoring). 

Under Destination, enter your cluster URL. If it’s the same cluster where Argo CD is deployed, use https://kubernetes.default.svc and the namespace for creating the application resource. The namespace should be the same as the one where Argo CD is deployed.

Select Directory Recurse as its useful when having configuration files within nested folders

Finally, click on the Create button to create the application.

You can also run the below command to create the Argo CD application that monitors your Git Repository folder for changes. Make sure to replace ${GIT_REPO_URL} with your Git Repository URL; ${Argo CD_NAMESPACE} with your namespace in which Argo CD is deployed; and ${NAMESPACE} with the namespace where you want to deploy Kubernetes Monitoring resources.

cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
 name: kubernetes-monitoring
 namespace: ${Argo CD_NAMESPACE}
spec:
 destination:
   name: ''
   namespace: ${NAMESPACE}
   server: 'https://kubernetes.default.svc'
 source:
   path: kubernetes-monitoring
   repoURL: '${GIT_REPO_URL}'
   targetRevision: HEAD
   directory:
     recurse: true
 sources: []
 project: default
 syncPolicy:
   automated:
     prune: true
     selfHeal: true
   retry:
     limit: 3
     backoff:
       duration: 5s
       maxDuration: 3m0s
       factor: 2
   syncOptions:
     - CreateNamespace=true

Once the application has been created, the Kubernetes manifest defined in the Git repository will be automatically deployed.

Install prebuilt Grafana dashboards and alerts

Note: You must have the Admin role to install dashboards and alerts in Grafana Cloud.

Navigate to your Grafana Cloud instance and in the navigation menu, select Kubernetes under the Monitoring section.

Go to the configuration and click on Install dashboards and alert rules.

The configuration has been already completed, so you can now start viewing dashboards.

Once the metrics and logs from your Kubernetes cluster are being ingested into Grafana Cloud, you can:

  • Easily navigate around your Kubernetes setup using a single UI. Begin with a cluster view and drill down to specific Kubernetes pods with the preconfigured Grafana dashboards.
  • Use node observability. Find out more about node health along with pod density and node logs.
  • Use the cluster efficiency dashboard. Understand your resource usage and use the information for capacity planning and optimizing costs and efficiency. 
  • Stay on top of your cluster. Use the preconfigured Grafana alerts to monitor your fleet.

Next steps for monitoring Kubernetes clusters with Grafana Cloud 

Once you have settled into your new Argo CD configuration, you can continue to expand on your observability strategy with Kubernetes Monitoring in Grafana Cloud.

You can leverage the latest integrations available within the Kubernetes Monitoring solution such as Cert Manager, CockroachDB and more. To configure scraping application pod metrics, you can refer to our detailed documentation. And while Kubernetes Monitoring currently covers metrics and logs, to get a complete timeline into data transactions, request durations, and bottlenecks, it is also important to use traces. You can configure your applications to send traces to Grafana Agent, which can then be configured to send that tracing data to Grafana Cloud’s hosted tracing instance.

The easiest way to get started with Kubernetes Monitoring is with Grafana Cloud. We have a generous free forever tier and plans for every use case. Sign up for free today!