Send Kubernetes metrics and logs using the OpenTelemetry Collector

If you currently have an OpenTelemetry Collector-based system in your Cluster, use these instructions.

Note: If you do not have an OpenTelemetry Collector-based system set up in your Cluster, consider configuring with Grafana Agent Flow mode instead, as it offers more features and easier integration.

These instructions include:

  • Using the OpenTelemetry Collector to send metrics to Grafana Cloud.
  • Enabling logs with the OpenTelemetry Logs Collector.
  • Enabling the eventhandler integration.

After connecting, you will see your resources, as well as their metrics and logs, in the Grafana Cloud Kubernetes integration.

Before you begin

Before you begin the configuration steps, have the following available (a quick verification sketch follows the list):

  • A Kubernetes Cluster with role-based access control (RBAC) enabled
  • A Grafana Cloud account. To create an account, navigate to Grafana Cloud, and click Create free account.
  • The kubectl command-line tool installed on your local machine, configured to connect to your Cluster
  • The helm command-line tool installed on your local machine. If you already have working kube-state-metrics and node-exporter instances installed in your Cluster, you can skip this requirement.
  • A working OpenTelemetry Collector deployment. For more information, refer to the OpenTelemetry Collector documentation.
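
To verify the tooling, a check along these lines may help. This is a minimal sketch; it assumes your kubeconfig's current context points at the target Cluster and only reads Cluster state:

bash
# Confirm kubectl is installed and can reach the Cluster
kubectl version
kubectl cluster-info
# Confirm helm is installed
helm version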

Configuration steps

Follow these steps to configure sending metrics and logs:

  1. Set up the exporters for metrics.
  2. Configure the OpenTelemetry Collector for metrics.
  3. Configure the remote write exporter to send metrics to Grafana Cloud.
  4. Configure the OpenTelemetry Collector for logs.
  5. Set up the Kubernetes integration in Grafana Cloud.

Set up exporters

The Grafana Cloud Kubernetes integration requires metrics from specific exporters. Some are embedded in the kubelet and others require deployment. These are:

  • Kubelet metrics for utilization and efficiency analysis (embedded)
  • cAdvisor for usage statistics on a container level (embedded)
  • kube-state-metrics to display available resources (requires deployment)
  • node-exporter for Node-level metrics (requires deployment)

The OpenTelemetry Collector's built-in Kubernetes Cluster Receiver, Kubelet Stats Receiver, and Kubernetes Events Receiver return different metrics than these exporters, so Grafana Kubernetes Monitoring cannot use them.

If you already have kube-state-metrics and node_exporter instances deployed in your Cluster, skip the next two steps.

Set up kube-state-metrics

Run the following commands from your shell to install kube-state-metrics into the default namespace of your Kubernetes Cluster:

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install ksm prometheus-community/kube-state-metrics --set image.tag="v2.8.2" -n "default"

To deploy kube-state-metrics into a different namespace, replace default in the command above with a different value.
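
To check that the release came up, a quick look at the Pod and its metrics endpoint may help. This is a hedged example: it assumes the default namespace and the Service name the chart derives from the ksm release name, so adjust if yours differ:

bash
# The kube-state-metrics Pod should be Running and Ready
kubectl get pods -n default -l app.kubernetes.io/name=kube-state-metrics
# Optionally confirm it serves metrics on port 8080
# (Service name follows the <release>-kube-state-metrics pattern used by the chart)
kubectl port-forward -n default svc/ksm-kube-state-metrics 8080:8080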

Set up node_exporter

Run the following commands from your shell to install node_exporter into the default namespace of your Kubernetes Cluster:

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install nodeexporter prometheus-community/prometheus-node-exporter -n "default"

This will create a DaemonSet to expose metrics on every Node in your Cluster.

To deploy the node_exporter into a different namespace, replace default in the command above with a different value.
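
To confirm the DaemonSet scheduled a Pod on each Node, a check like the following may help (it assumes the default namespace used above and the app.kubernetes.io/name label set by the chart):

bash
# DESIRED, CURRENT, and READY should all equal the number of Nodes
kubectl get daemonset -n default -l app.kubernetes.io/name=prometheus-node-exporter
kubectl get nodes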

Configure the OpenTelemetry Metrics Collector

To configure the OpenTelemetry Collector:

  • Add targeted endpoints for scraping.
  • Include the remote write exporter to send metrics to Grafana Cloud.
  • Link collected metrics to the remote write exporter.

Add scraping endpoints

Add the following to your OpenTelemetry Collector configuration. The configuration can usually be found in a ConfigMap.

yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: integrations/kubernetes/cadvisor
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - replacement: kubernetes.default.svc.cluster.local:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
            server_name: kubernetes
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: integrations/kubernetes/kubelet
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - replacement: kubernetes.default.svc.cluster.local:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$${1}/proxy/metrics
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
            server_name: kubernetes
        - job_name: integrations/kubernetes/kube-state-metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: keep
              regex: kube-state-metrics
              source_labels:
                - __meta_kubernetes_pod_label_app_kubernetes_io_name
        - job_name: integrations/node_exporter
          kubernetes_sd_configs:
            - namespaces:
                names:
                  - default
              role: pod
          relabel_configs:
            - action: keep
              regex: prometheus-node-exporter.*
              source_labels:
                - __meta_kubernetes_pod_label_app_kubernetes_io_name
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_node_name
              target_label: instance
            - action: replace
              source_labels:
                - __meta_kubernetes_namespace
              target_label: namespace

This configuration adds four scrape jobs, each with its own target discovery and relabeling rules:

  1. All Nodes, scraping their cAdvisor endpoint (integrations/kubernetes/cadvisor)
  2. All Nodes, scraping their Kubelet metrics endpoint (integrations/kubernetes/kubelet)
  3. All Pods with the app.kubernetes.io/name=kube-state-metrics label, scraping their /metrics endpoint (integrations/kubernetes/kube-state-metrics)
  4. All Pods matching the name prometheus-node-exporter.* in the default namespace, scraping their /metrics endpoint (integrations/node_exporter)

Warning: For the Kubernetes integration to work correctly, these job labels must match exactly; otherwise, your Cluster won't appear in the dashboards.

To reduce the amount of metrics sent to Grafana Cloud, add the following to every scrape target:

yaml
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    regex: 'kubelet_running_containers|go_goroutines|kubelet_runtime_operations_errors_total|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|kubelet_volume_stats_inodes_used|kubelet_certificate_manager_server_ttl_seconds|namespace_workload_pod:kube_pod_owner:relabel|kubelet_node_config_error|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_limits:sum|container_memory_working_set_bytes|container_fs_reads_bytes_total|kube_node_status_condition|namespace_cpu:kube_pod_container_resource_requests:sum|kubelet_server_expiration_renew_errors|container_fs_writes_total|kube_horizontalpodautoscaler_status_desired_replicas|node_filesystem_avail_bytes|kube_pod_status_reason|node_filesystem_size_bytes|kube_deployment_spec_replicas|kube_statefulset_metadata_generation|namespace_workload_pod|storage_operation_duration_seconds_count|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_container_resource_limits|kube_statefulset_status_replicas_updated|node_namespace_pod_container:container_memory_rss|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_pleg_relist_interval_seconds_bucket|kube_job_status_start_time|kube_deployment_status_observed_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_cache|kube_resourcequota|kube_horizontalpodautoscaler_spec_min_replicas|namespace_memory:kube_pod_container_resource_requests:sum|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_daemonset_status_number_available|kube_job_failed|storage_operation_errors_total|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|container_fs_writes_bytes_total|kube_statefulset_replicas|kube_replicaset_owner|container_network_receive_bytes_total|volume_manager_total_volumes|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_desired_number_scheduled|kube_pod_container_status_waiting_reason|process_cpu_seconds_total|kube_node_status_allocatable|kube_deployment_status_replicas_available|kube_daemonset_status_updated_number_scheduled|container_network_receive_packets_total|container_memory_rss|container_cpu_usage_seconds_total|kube_namespace_status_phase|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kubelet_volume_stats_available_bytes|kube_deployment_status_replicas_updated|kubelet_running_container_count|kube_node_info|container_network_transmit_packets_dropped_total|kubelet_certificate_manager_client_ttl_seconds|kube_pod_owner|kubelet_volume_stats_inodes|kubelet_runtime_operations_total|container_cpu_cfs_throttled_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_running_pod_count|container_network_transmit_packets_total|kubelet_node_name|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_ready|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_node_spec_taint|kubelet_pleg_relist_duration_seconds_bucket|kube_pod_status_phase|container_cpu_cfs_periods_total|kube_deployment_metadata_generation|node_namespace_pod_container:container_memory_cache|kube_statefulset_status_current_revision|kubelet_pleg_relist_duration_seconds_count|container_fs_reads_total|kube_statefulset_status_update_revision|container_network_receive_packets_dropped_total|kube_pod_info|kubelet_running_pods|process_resident_memory_bytes|kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_count|kube_node_status_capacity|container_network_transmit_bytes_total|rest_client_requests_total|kubernetes_build_info|machine_memory_bytes|kube_statefulset_status_replicas|container_memory_swap|kube_job_status_active|kubelet_pod_start_duration_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|node_namespace_pod_container:container_memory_swap|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*'

This filters out unnecessary metrics and reduces the number of active series sent to Grafana Cloud.

Set up RBAC for OpenTelemetry Metrics Collector

This configuration uses the built-in Kubernetes service discovery, so the service account running the OpenTelemetry Collector needs additional permissions beyond the default set. The following ClusterRole provides a good starting point:

yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups:
      - ''
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      - events
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get

To bind this to a ServiceAccount, use the following ClusterRoleBinding:

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector # replace with your service account name
    namespace: default # replace with your namespace
roleRef:
  kind: ClusterRole
  name: otel-collector
  apiGroup: rbac.authorization.k8s.io
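
After saving the two manifests, apply them and spot-check the resulting permissions. This is a hedged sketch: the file name is arbitrary, and kubectl auth can-i --as requires that your own user has impersonation rights:

bash
kubectl apply -f otel-collector-rbac.yaml
# Impersonate the service account and verify it can list Pods and reach the Node proxy
kubectl auth can-i list pods --as=system:serviceaccount:default:otel-collector
kubectl auth can-i get nodes/proxy --as=system:serviceaccount:default:otel-collector

Both checks should print yes.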

Configure the remote write exporter

To send the metrics to Grafana Cloud, add the following to your OpenTelemetry Collector configuration:

yaml
exporters:
  prometheusremotewrite:
    external_labels:
      cluster: 'your-cluster-name'
    endpoint: 'https://PROMETHEUS_USERNAME:ACCESS_POLICY_TOKEN@PROMETHEUS_URL/api/prom/push'

To retrieve your connection information:

  1. Go to your Grafana Cloud account.
  2. Select the correct organization in the dropdown menu.
  3. Select your desired stack in the main navigation on the left.
  4. Click the Send Metrics button on the Prometheus card. You will find your connection information on the page that displays; a hypothetical filled-in example follows this list.
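
Filled in, the exporter block might look roughly like the following. Every value here is a made-up placeholder; substitute the username, token, and hostname shown on your stack's connection page:

yaml
exporters:
  prometheusremotewrite:
    external_labels:
      cluster: 'your-cluster-name'
    # Hypothetical credentials and hostname for illustration only
    endpoint: 'https://123456:glc_exampletoken@prometheus-example.grafana.net/api/prom/push'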

For the token, it is recommended that you create an access policy with the metrics:write scope and generate a token from that policy.

Next, link the collected metrics to the remote write exporter. It is also good practice to add a batch processor, which improves performance.

Add the following to the OpenTelemetry Collector configuration:

yaml
processors:
  batch:
service:
  pipelines:
    metrics/prod:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]

After restarting your OpenTelemetry Collector, you should see the first metrics arriving in Grafana Cloud after a few minutes.
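
To pick up the new configuration, restart the Collector workload. For example, assuming the Collector runs as a Deployment named otel-collector in the default namespace (adjust to your setup):

bash
kubectl rollout restart deployment/otel-collector -n default
kubectl rollout status deployment/otel-collector -n default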

Configure the OpenTelemetry Logs Collector

Kubernetes writes container logs to files on each Node, so you must schedule a Pod on every Node to collect these files. Do this with a separate DaemonSet.

The following configuration file sets up the OpenTelemetry Collector to collect logs from the default Kubernetes logging location. Make sure you use the same Cluster name as for your metrics; otherwise, the correlation won't work.

yaml
# This is a new configuration file - do not merge this with your metrics configuration!
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    include_file_name: false
    operators:
      # Find out which format is used by kubernetes
      - type: router
        id: get-format
        routes:
          - output: parser-docker
            expr: 'body matches "^\\{"'
          - output: parser-crio
            expr: 'body matches "^[^ Z]+ "'
          - output: parser-containerd
            expr: 'body matches "^[^ Z]+Z"'
      # Parse CRI-O format
      - type: regex_parser
        id: parser-crio
        regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.999999999Z07:00'
      # Parse CRI-Containerd format
      - type: regex_parser
        id: parser-containerd
        regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # Parse Docker format
      - type: json_parser
        id: parser-docker
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: move
        from: attributes.log
        to: body
      # Extract metadata from file path
      - type: regex_parser
        id: extract_metadata_from_filepath
        regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]{36})\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
        parse_from: attributes["log.file.path"]
        cache:
          size: 128 # default maximum amount of Pods per Node is 110
      # Rename attributes
      - type: move
        from: attributes["log.file.path"]
        to: resource["filename"]
      - type: move
        from: attributes.container_name
        to: resource["container"]
      - type: move
        from: attributes.namespace
        to: resource["namespace"]
      - type: move
        from: attributes.pod_name
        to: resource["pod"]
      - type: add
        field: resource["cluster"]
        value: 'your-cluster-name' # Set your cluster name here

processors:
  resource:
    attributes:
      - action: insert
        key: loki.format
        value: raw
      - action: insert
        key: loki.resource.labels
        value: pod, namespace, container, cluster, filename
exporters:
  loki:
    endpoint: https://LOKI_USERNAME:ACCESS_POLICY_TOKEN@LOKI_URL/loki/api/v1/push
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [resource]
      exporters: [loki]

When you configure the DaemonSet, you must mount the correct directories for the collector to access the logs. For a detailed example, refer to the example deployment in the opentelemetry-collector-contrib repository.
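
As a rough sketch of the wiring involved (the names here are illustrative and not taken from that example), the DaemonSet's Pod template needs hostPath volumes for the log directories, mounted read-only into the collector container:

yaml
# Fragment of a DaemonSet Pod spec - adapt names and paths to your manifest
containers:
  - name: otel-logs-collector
    volumeMounts:
      - name: varlogpods
        mountPath: /var/log/pods
        readOnly: true
      - name: varlibdockercontainers # only needed with the Docker runtime
        mountPath: /var/lib/docker/containers
        readOnly: true
volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers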

Set up the Kubernetes integration in Grafana Cloud

The Kubernetes integration comes with a set of predefined dashboards and recording/alerting rules. To install them, navigate to the Kubernetes integration configuration page at Observability -> Kubernetes -> Configuration, and click the Install dashboards and alert rules button.

After these steps, you will see your resources and metrics in the Kubernetes Integration.

Troubleshoot absence of resources

If the Kubernetes integration shows no resources, navigate to the Explore page in Grafana and enter the following query:

promql
up{cluster="your-cluster-name"}

This query should return at least one series for each of the scrape targets defined above. If you do not see any series or some of the series have a value of 0, enable debug logging in the OpenTelemetry Collector with the following config snippet:

yaml
service:
  telemetry:
    logs:
      level: 'debug'

If you can see the collected metrics but the Kubernetes integration does not list your resources, make sure that each time series has a cluster label set and the job label matches the names in the configuration above.
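
One way to check both labels at once is to group the up series by job for your Cluster:

promql
count by (job) (up{cluster="your-cluster-name"})

The result should contain exactly the job values used above: integrations/kubernetes/cadvisor, integrations/kubernetes/kubelet, integrations/kubernetes/kube-state-metrics, and integrations/node_exporter.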

Set up Kubernetes Event monitoring (beta)

Grafana Agent bundles an eventhandler integration that watches for Kubernetes Events in your clusters and ships these to Grafana Cloud Loki. Kubernetes controllers emit Events as they perform operations in your cluster (like starting containers, scheduling Pods, etc.) and these can be a rich source of logging information to help you debug, monitor, and alert on your Kubernetes workloads. Generally, these Events can be queried using kubectl get event or kubectl describe; with the eventhandler integration enabled, you can query these directly from Grafana Cloud.

Before you begin

To begin, you need the following:

  • A Kubernetes Cluster
  • The kubectl command-line tool installed and available on your machine
  • A Grafana Cloud account or Loki instance that will receive log entries

Deployment options

The eventhandler integration is one of several integrations embedded directly into Grafana Agent. You can run the integration in several ways:

  • A dedicated Grafana Agent StatefulSet running only the eventhandler integration
  • As part of an existing Agent Deployment or StatefulSet

Note: Although you can run the integration without persistent storage, we recommend running it with dedicated disk storage (a StatefulSet or a Deployment with a PersistentVolume and PersistentVolumeClaim) to take advantage of its caching feature. Kubernetes Events have a lifespan of an hour; after an hour, they are deleted from the Cluster’s internal key-value store. If you restart the integration within that hour, eventhandler re-ships any Events still present in the Cluster’s internal store unless it can read its cache file.

Option 1: Run a dedicated eventhandler

To run a dedicated eventhandler StatefulSet, refer to eventhandler_config in the Grafana Agent documentation for full configuration instructions. These docs provide sample manifests and configuration for an Agent StatefulSet running only the eventhandler integration.

You can also use a Deployment with a PersistentVolume and PersistentVolumeClaim or use Node-local storage, but these methods are outside the scope of this guide and require modifying the provided manifests and instructions.

Option 2: Enable eventhandler in an existing Agent Deployment or StatefulSet

To enable the eventhandler integration in an existing Grafana Agent setup or to avoid running another Agent in your Cluster, you can modify your existing Agent’s configuration to enable the integration.

Note: If you’re using a Deployment, you should attach persistent disk storage and configure the integration’s cache_path appropriately to take advantage of eventhandler’s Event caching. This isn’t necessary, but it prevents double-shipping Cluster Events to Loki if the Agent restarts. To learn more about configuring a PersistentVolume for storage, refer to Configure a Pod to Use a PersistentVolume for Storage.

  1. Enable the integration

    Modify your existing Agent configuration by adding the following stanza to your Agent’s agent.yaml or ConfigMap:

    yaml
    server:
      . . .
    metrics:
      . . .
    integrations:
      eventhandler:
        cache_path: "/etc/eventhandler/eventhandler.cache"
        logs_instance: "default"
      . . .

    This block enables the integration and instructs it to cache the last Event shipped at the path provided by cache_path. For a full configuration reference, refer to eventhandler_config from the Agent documentation.

  2. Enable the logs instance.

    Add the following block of Agent logs config:

    yaml
    server: . . .
    metrics: . . .
    integrations:
      ## see above
      . . .
    logs:
      configs:
        - name: default
          clients:
            ## you may need to replace this with a different endpoint
            - url: https://logs-prod-us-central1.grafana.net/api/prom/push
              basic_auth:
                username: YOUR_LOKI_USER
                password: YOUR_LOKI_ACCESS_POLICY_TOKEN
              external_labels:
                cluster: 'cloud'
                job: 'integrations/kubernetes/eventhandler'
          positions:
            filename: /tmp/positions0.yaml

    This block enables an instance of Agent’s logs subsystem (embedded promtail) and configures it with the appropriate Loki credentials:

    • The default logs instance determines where Events get shipped as Loki log lines. You can also set default labels on log lines using the external_labels parameter. Its name must match logs_instance in the integrations config block.

    For full logs_config reference, refer to logs_config from the Agent docs.

    You can find your Loki credentials in your org’s Grafana Cloud Portal.

  3. Run eventhandler

    To run eventhandler, you need to pass in the following flag when you run Agent:

    bash
    -enable-features=integrations-next

    This enables the latest version of the Agent integration subsystem. To learn more, refer to Integrations Revamp.

    A full Kubernetes container spec should be similar to this one:

    yaml
    containers:
      - name: agent
        image: grafana/agent:latest
        imagePullPolicy: IfNotPresent
        args:
          - -config.file=/etc/agent/agent.yaml
          - -enable-features=integrations-next
        command:
          - /bin/grafana-agent
        env:
          - name: HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        ports:
          - containerPort: 12345
            name: http-metrics
        volumeMounts:
          ## Should use a ConfigMap volume, stores Agent config
          - name: grafana-agent
            mountPath: /etc/agent
          ## Optional, but should use a persistent volume, stores Event cache
          - name: eventhandler-cache
            mountPath: /etc/eventhandler

    You should modify these parameters depending on your architecture and configured PersistentVolumes and ConfigMaps.
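
    For reference, the volumes section backing those mounts might look like the following sketch; the ConfigMap and PersistentVolumeClaim names are assumptions, so use your own:

    yaml
    volumes:
      ## ConfigMap holding agent.yaml
      - name: grafana-agent
        configMap:
          name: grafana-agent
      ## PersistentVolumeClaim backing the Event cache
      - name: eventhandler-cache
        persistentVolumeClaim:
          claimName: eventhandler-cache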

  4. Add ClusterRole events permission

    You also need to allow Agent’s ClusterRole to access the events resource from the Kubernetes API:

    yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: grafana-agent
    rules:
      - apiGroups:
          - ''
        resources:
          - nodes
          - nodes/proxy
          - services
          - endpoints
          - pods
          ## added "events" here
          - events
        verbs:
          - get
          - list
          - watch
      - nonResourceURLs:
          - /metrics
        verbs:
          - get

    eventhandler only requires the get, list, and watch verbs for the events resource, but for clarity we’ve appended the required permission to the default ClusterRole provided by the Kubernetes integration (which also allows Prometheus service discovery).

Please surface any issues with this integration in the Grafana Agent GitHub Repo or on the Grafana Labs Community Slack (in #agent).

eventhandler is enabled by default in the latest version of the Kubernetes Monitoring agent manifests.