Send Kubernetes metrics and logs using the OpenTelemetry Collector

If you currently have an OpenTelemetry Collector-based system in your Cluster, use these instructions.

Note: If you do not have an OpenTelemetry Collector-based system set up in your Cluster, consider configuring with Grafana Agent Flow mode instead, as it offers more features and easier integration.

These instructions include:

  • Using the OpenTelemetry Collector to send metrics to Grafana Cloud.
  • Enabling logs with the OpenTelemetry Logs Collector.
  • Enabling the eventhandler integration.

After connecting, you will see your resources, as well as their metrics and logs, in the Grafana Cloud Kubernetes integration.

Before you begin

Before you begin the configuration steps, have the following available (a quick verification sketch follows the list):

  • A Kubernetes Cluster with role-based access control (RBAC) enabled
  • A Grafana Cloud account. To create an account, navigate to Grafana Cloud, and click Create free account.
  • The kubectl command-line tool installed on your local machine, configured to connect to your Cluster
  • The helm command-line tool installed on your local machine. If you already have working kube-state-metrics and node-exporter instances installed in your Cluster, you can skip this requirement.
  • A working OpenTelemetry Collector deployment. For more information, refer to the OpenTelemetry Collector documentation.
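
To verify the tooling, a check along these lines may help. This is a minimal sketch; it assumes your kubeconfig's current context points at the target Cluster and only reads Cluster state:

bash
# Confirm kubectl is installed and can reach the Cluster
kubectl version
kubectl cluster-info
# Confirm helm is installed
helm version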

Configuration steps

Follow these steps to configure sending metrics and logs:

  1. Set up the exporters for metrics.
  2. Configure the OpenTelemetry Collector for metrics.
  3. Configure the remote write exporter to send metrics to Grafana Cloud.
  4. Configure the OpenTelemetry Collector for logs.
  5. Set up the Kubernetes integration in Grafana Cloud.

Set up exporters

The Grafana Cloud Kubernetes integration requires metrics from specific exporters. Some are embedded in the kubelet and others require deployment. These are:

  • Kubelet metrics for utilization and efficiency analysis (embedded)
  • cAdvisor for usage statistics on a container level (embedded)
  • kube-state-metrics to display available resources (requires deployment)
  • node-exporter for Node-level metrics (requires deployment)

The OpenTelemetry Collector's built-in Kubernetes Cluster Receiver, Kubelet Stats Receiver, and Kubernetes Events Receiver return different metrics than these exporters, so Grafana Kubernetes Monitoring cannot use them.

If you already have kube-state-metrics and node_exporter instances deployed in your Cluster, skip the next two steps.

Set up kube-state-metrics

Run the following commands from your shell to install kube-state-metrics into the default namespace of your Kubernetes Cluster:

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install ksm prometheus-community/kube-state-metrics --set image.tag="v2.8.2" -n "default"

To deploy kube-state-metrics into a different namespace, replace default in the command above with a different value.
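
To check that the release came up, a quick look at the Pod and its metrics endpoint may help. This is a hedged example: it assumes the default namespace and the Service name the chart derives from the ksm release name, so adjust if yours differ:

bash
# The kube-state-metrics Pod should be Running and Ready
kubectl get pods -n default -l app.kubernetes.io/name=kube-state-metrics
# Optionally confirm it serves metrics on port 8080
# (Service name follows the <release>-kube-state-metrics pattern used by the chart)
kubectl port-forward -n default svc/ksm-kube-state-metrics 8080:8080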

Set up node_exporter

Run the following commands from your shell to install node_exporter into the default namespace of your Kubernetes Cluster:

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install nodeexporter prometheus-community/prometheus-node-exporter -n "default"

This will create a DaemonSet to expose metrics on every Node in your Cluster.

To deploy the node_exporter into a different namespace, replace default in the command above with a different value.
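
To confirm the DaemonSet scheduled a Pod on each Node, a check like the following may help (it assumes the default namespace used above and the app.kubernetes.io/name label set by the chart):

bash
# DESIRED, CURRENT, and READY should all equal the number of Nodes
kubectl get daemonset -n default -l app.kubernetes.io/name=prometheus-node-exporter
kubectl get nodes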

Configure the OpenTelemetry Metrics Collector

To configure the OpenTelemetry Collector:

  • Add targeted endpoints for scraping.
  • Include the remote write exporter to send metrics to Grafana Cloud.
  • Link collected metrics to the remote write exporter.

Add scraping endpoints

Add the following to your OpenTelemetry Collector configuration. The configuration can usually be found in a ConfigMap.

yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: integrations/kubernetes/cadvisor
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - replacement: kubernetes.default.svc.cluster.local:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
            server_name: kubernetes
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: integrations/kubernetes/kubelet
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - replacement: kubernetes.default.svc.cluster.local:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$${1}/proxy/metrics
              source_labels:
                - __meta_kubernetes_node_name
              target_label: __metrics_path__
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false
            server_name: kubernetes
        - job_name: integrations/kubernetes/kube-state-metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: keep
              regex: kube-state-metrics
              source_labels:
                - __meta_kubernetes_pod_label_app_kubernetes_io_name
        - job_name: integrations/node_exporter
          kubernetes_sd_configs:
            - namespaces:
                names:
                  - default
              role: pod
          relabel_configs:
            - action: keep
              regex: prometheus-node-exporter.*
              source_labels:
                - __meta_kubernetes_pod_label_app_kubernetes_io_name
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_node_name
              target_label: instance
            - action: replace
              source_labels:
                - __meta_kubernetes_namespace
              target_label: namespace

This configuration adds four scrape jobs, each with its own target discovery and relabeling rules:

  1. All Nodes, scraping their cAdvisor endpoint (integrations/kubernetes/cadvisor)
  2. All Nodes, scraping their Kubelet metrics endpoint (integrations/kubernetes/kubelet)
  3. All Pods with the app.kubernetes.io/name=kube-state-metrics label, scraping their /metrics endpoint (integrations/kubernetes/kube-state-metrics)
  4. All Pods matching the name prometheus-node-exporter.* in the default namespace, scraping their /metrics endpoint (integrations/node_exporter)

Warning: For the Kubernetes integration to work correctly, these job labels must match exactly; otherwise, your Cluster won't appear in the dashboards.

To reduce the amount of metrics sent to Grafana Cloud, add the following to every scrape target:

yaml
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    regex: 'kubelet_running_containers|go_goroutines|kubelet_runtime_operations_errors_total|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|kubelet_volume_stats_inodes_used|kubelet_certificate_manager_server_ttl_seconds|namespace_workload_pod:kube_pod_owner:relabel|kubelet_node_config_error|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_limits:sum|container_memory_working_set_bytes|container_fs_reads_bytes_total|kube_node_status_condition|namespace_cpu:kube_pod_container_resource_requests:sum|kubelet_server_expiration_renew_errors|container_fs_writes_total|kube_horizontalpodautoscaler_status_desired_replicas|node_filesystem_avail_bytes|kube_pod_status_reason|node_filesystem_size_bytes|kube_deployment_spec_replicas|kube_statefulset_metadata_generation|namespace_workload_pod|storage_operation_duration_seconds_count|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_container_resource_limits|kube_statefulset_status_replicas_updated|node_namespace_pod_container:container_memory_rss|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_pleg_relist_interval_seconds_bucket|kube_job_status_start_time|kube_deployment_status_observed_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_cache|kube_resourcequota|kube_horizontalpodautoscaler_spec_min_replicas|namespace_memory:kube_pod_container_resource_requests:sum|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_daemonset_status_number_available|kube_job_failed|storage_operation_errors_total|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|container_fs_writes_bytes_total|kube_statefulset_replicas|kube_replicaset_owner|container_network_receive_bytes_total|volume_manager_total_volumes|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_desired_number_scheduled|kube_pod_container_status_waiting_reason|process_cpu_seconds_total|kube_node_status_allocatable|kube_deployment_status_replicas_available|kube_daemonset_status_updated_number_scheduled|container_network_receive_packets_total|container_memory_rss|container_cpu_usage_seconds_total|kube_namespace_status_phase|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kubelet_volume_stats_available_bytes|kube_deployment_status_replicas_updated|kubelet_running_container_count|kube_node_info|container_network_transmit_packets_dropped_total|kubelet_certificate_manager_client_ttl_seconds|kube_pod_owner|kubelet_volume_stats_inodes|kubelet_runtime_operations_total|container_cpu_cfs_throttled_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_running_pod_count|container_network_transmit_packets_total|kubelet_node_name|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_ready|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_node_spec_taint|kubelet_pleg_relist_duration_seconds_bucket|kube_pod_status_phase|container_cpu_cfs_periods_total|kube_deployment_metadata_generation|node_namespace_pod_container:container_memory_cache|kube_statefulset_status_current_revision|kubelet_pleg_relist_duration_seconds_count|container_fs_reads_total|kube_statefulset_status_update_revision|container_network_receive_packets_dropped_total|kube_pod_info|kubelet_running_pods|process_resident_memory_bytes|kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_count|kube_node_status_capacity|container_network_transmit_bytes_total|rest_client_requests_total|kubernetes_build_info|machine_memory_bytes|kube_statefulset_status_replicas|container_memory_swap|kube_job_status_active|kubelet_pod_start_duration_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|node_namespace_pod_container:container_memory_swap|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*'

This filters out unnecessary metrics and reduces the number of active series sent to Grafana Cloud.

Set up RBAC for OpenTelemetry Metrics Collector

This configuration uses the built-in Kubernetes service discovery, so the service account running the OpenTelemetry Collector needs additional permissions beyond the default set. The following ClusterRole provides a good starting point:

yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups:
      - ''
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      - events
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get

To bind this to a ServiceAccount, use the following ClusterRoleBinding:

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector # replace with your service account name
    namespace: default # replace with your namespace
roleRef:
  kind: ClusterRole
  name: otel-collector
  apiGroup: rbac.authorization.k8s.io
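
After saving the two manifests, apply them and spot-check the resulting permissions. This is a hedged sketch: the file name is arbitrary, and kubectl auth can-i --as requires that your own user has impersonation rights:

bash
kubectl apply -f otel-collector-rbac.yaml
# Impersonate the service account and verify it can list Pods and reach the Node proxy
kubectl auth can-i list pods --as=system:serviceaccount:default:otel-collector
kubectl auth can-i get nodes/proxy --as=system:serviceaccount:default:otel-collector

Both checks should print yes.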

Configure the remote write exporter

To send the metrics to Grafana Cloud, add the following to your OpenTelemetry Collector configuration:

yaml
exporters:
  prometheusremotewrite:
    external_labels:
      cluster: 'your-cluster-name'
    endpoint: 'https://PROMETHEUS_USERNAME:ACCESS_POLICY_TOKEN@PROMETHEUS_URL/api/prom/push'

To retrieve your connection information:

  1. Go to your Grafana Cloud account.
  2. Select the correct organization in the dropdown menu.
  3. Select your desired stack in the main navigation on the left.
  4. Click the Send Metrics button on the Prometheus card. You will find your connection information on the page that displays; a hypothetical filled-in example follows this list.
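
Filled in, the exporter block might look roughly like the following. Every value here is a made-up placeholder; substitute the username, token, and hostname shown on your stack's connection page:

yaml
exporters:
  prometheusremotewrite:
    external_labels:
      cluster: 'your-cluster-name'
    # Hypothetical credentials and hostname for illustration only
    endpoint: 'https://123456:glc_exampletoken@prometheus-example.grafana.net/api/prom/push'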

For the token, it is recommended that you create an access policy with the metrics:write scope and generate a token from that policy.

Next, link the collected metrics to the remote write exporter. It is also good practice to add a batch processor, which improves performance.

Add the following to the OpenTelemetry Collector configuration:

yaml
processors:
  batch:
service:
  pipelines:
    metrics/prod:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]

After restarting your OpenTelemetry Collector, you should see the first metrics arriving in Grafana Cloud after a few minutes.
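
To pick up the new configuration, restart the Collector workload. For example, assuming the Collector runs as a Deployment named otel-collector in the default namespace (adjust to your setup):

bash
kubectl rollout restart deployment/otel-collector -n default
kubectl rollout status deployment/otel-collector -n default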

Configure the OpenTelemetry Logs Collector

Kubernetes writes container logs to files on each Node, so you must schedule a Pod on every Node to collect these files. Do this with a separate DaemonSet.

The following configuration file sets up the OpenTelemetry Collector to collect logs from the default Kubernetes logging location. Make sure you use the same Cluster name as for your metrics; otherwise, the correlation won't work.

yaml
# This is a new configuration file - do not merge this with your metrics configuration!
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    include_file_name: false
    operators:
      # Find out which format is used by kubernetes
      - type: router
        id: get-format
        routes:
          - output: parser-docker
            expr: 'body matches "^\\{"'
          - output: parser-crio
            expr: 'body matches "^[^ Z]+ "'
          - output: parser-containerd
            expr: 'body matches "^[^ Z]+Z"'
      # Parse CRI-O format
      - type: regex_parser
        id: parser-crio
        regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.999999999Z07:00'
      # Parse CRI-Containerd format
      - type: regex_parser
        id: parser-containerd
        regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # Parse Docker format
      - type: json_parser
        id: parser-docker
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: move
        from: attributes.log
        to: body
      # Extract metadata from file path
      - type: regex_parser
        id: extract_metadata_from_filepath
        regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]{36})\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
        parse_from: attributes["log.file.path"]
        cache:
          size: 128 # default maximum amount of Pods per Node is 110
      # Rename attributes
      - type: move
        from: attributes["log.file.path"]
        to: resource["filename"]
      - type: move
        from: attributes.container_name
        to: resource["container"]
      - type: move
        from: attributes.namespace
        to: resource["namespace"]
      - type: move
        from: attributes.pod_name
        to: resource["pod"]
      - type: add
        field: resource["cluster"]
        value: 'your-cluster-name' # Set your cluster name here

processors:
  resource:
    attributes:
      - action: insert
        key: loki.format
        value: raw
      - action: insert
        key: loki.resource.labels
        value: pod, namespace, container, cluster, filename
exporters:
  loki:
    endpoint: https://LOKI_USERNAME:ACCESS_POLICY_TOKEN@LOKI_URL/loki/api/v1/push
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [resource]
      exporters: [loki]

When you configure the DaemonSet, you must mount the correct directories for the collector to access the logs. For a detailed example, refer to the example deployment in the opentelemetry-collector-contrib repository.
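
As a rough sketch of the wiring involved (the names here are illustrative and not taken from that example), the DaemonSet's Pod template needs hostPath volumes for the log directories, mounted read-only into the collector container:

yaml
# Fragment of a DaemonSet Pod spec - adapt names and paths to your manifest
containers:
  - name: otel-logs-collector
    volumeMounts:
      - name: varlogpods
        mountPath: /var/log/pods
        readOnly: true
      - name: varlibdockercontainers # only needed with the Docker runtime
        mountPath: /var/lib/docker/containers
        readOnly: true
volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers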

Set up the Kubernetes integration in Grafana Cloud

The Kubernetes integration comes with a set of predefined dashboards and recording/alerting rules. To install them, navigate to the Kubernetes integration configuration page at Observability -> Kubernetes -> Configuration, and click the Install dashboards and alert rules button.

After these steps, you will see your resources and metrics in the Kubernetes Integration.

Troubleshoot absence of resources

If the Kubernetes integration shows no resources, navigate to the Explore page in Grafana and enter the following query:

promql
up{cluster="your-cluster-name"}

This query should return at least one series for each of the scrape targets defined above. If you do not see any series or some of the series have a value of 0, enable debug logging in the OpenTelemetry Collector with the following config snippet:

yaml
service:
  telemetry:
    logs:
      level: 'debug'

If you can see the collected metrics but the Kubernetes integration does not list your resources, make sure that each time series has a cluster label set and the job label matches the names in the configuration above.
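
One way to check both labels at once is to group the up series by job for your Cluster:

promql
count by (job) (up{cluster="your-cluster-name"})

The result should contain exactly the job values used above: integrations/kubernetes/cadvisor, integrations/kubernetes/kubelet, integrations/kubernetes/kube-state-metrics, and integrations/node_exporter.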

Set up Kubernetes Event monitoring (beta)

Grafana Agent bundles an eventhandler integration that watches for Kubernetes Events in your clusters and ships these to Grafana Cloud Loki. Kubernetes controllers emit Events as they perform operations in your cluster (like starting containers, scheduling Pods, etc.) and these can be a rich source of logging information to help you debug, monitor, and alert on your Kubernetes workloads. Generally, these Events can be queried using kubectl get event or kubectl describe; with the eventhandler integration enabled, you can query these directly from Grafana Cloud.

Before you begin

To begin, you need the following:

  • A Kubernetes Cluster
  • The kubectl command-line tool installed and available on your machine
  • A Grafana Cloud account or Loki instance that will receive log entries

Deployment options

The eventhandler integration is one of several integrations embedded directly into Grafana Agent. You can run the integration in several ways:

  • A dedicated Grafana Agent StatefulSet running only the eventhandler integration
  • As part of an existing Agent Deployment or StatefulSet

Note: Although you can run the integration without persistent storage, we recommend running it with dedicated disk storage (a StatefulSet or a Deployment with a PersistentVolume and PersistentVolumeClaim) to take advantage of its caching feature. Kubernetes Events have a lifespan of an hour; after an hour, they are deleted from the Cluster’s internal key-value store. If you restart the integration within that hour, eventhandler re-ships any Events still present in the Cluster’s internal store unless it can read its cache file.

Option 1: Run a dedicated eventhandler

To run a dedicated eventhandler StatefulSet, refer to eventhandler_config in the Grafana Agent documentation for full configuration instructions. These docs provide sample manifests and configuration for an Agent StatefulSet running only the eventhandler integration.

You can also use a Deployment with a PersistentVolume and PersistentVolumeClaim or use Node-local storage, but these methods are outside the scope of this guide and require modifying the provided manifests and instructions.

Option 2: Enable eventhandler in an existing Agent Deployment or StatefulSet

To enable the eventhandler integration in an existing Grafana Agent setup or to avoid running another Agent in your Cluster, you can modify your existing Agent’s configuration to enable the integration.

Note: If you’re using a Deployment, you should attach persistent disk storage and configure the integration’s cache_path appropriately to take advantage of eventhandler’s Event caching. This isn’t necessary, but it prevents double-shipping Cluster Events to Loki if the Agent restarts. To learn more about configuring a PersistentVolume for storage, refer to Configure a Pod to Use a PersistentVolume for Storage.

  1. Enable the integration

    Modify your existing Agent configuration by adding the following stanza to your Agent’s agent.yaml or ConfigMap:

    yaml
    server:
      . . .
    metrics:
      . . .
    integrations:
      eventhandler:
        cache_path: "/etc/eventhandler/eventhandler.cache"
        logs_instance: "default"
      . . .

    This block enables the integration and instructs it to cache the last Event shipped at the path provided by cache_path. For a full configuration reference, refer to eventhandler_config from the Agent documentation.

  2. Enable the logs instance.

    Add the following block of Agent logs config:

    yaml
    server: . . .
    metrics: . . .
    integrations:
      ## see above
      . . .
    logs:
      configs:
        - name: default
          clients:
            ## you may need to replace this with a different endpoint
            - url: https://logs-prod-us-central1.grafana.net/api/prom/push
              basic_auth:
                username: YOUR_LOKI_USER
                password: YOUR_LOKI_ACCESS_POLICY_TOKEN
              external_labels:
                cluster: 'cloud'
                job: 'integrations/kubernetes/eventhandler'
          positions:
            filename: /tmp/positions0.yaml

    This block enables an instance of Agent’s logs subsystem (embedded promtail) and configures it with the appropriate Loki credentials:

    • The default logs instance determines where Events get shipped as Loki log lines. You can also set default labels on log lines using the external_labels parameter. Its name must match logs_instance in the integrations config block.

    For full logs_config reference, refer to logs_config from the Agent docs.

    You can find your Loki credentials in your org’s Grafana Cloud Portal.

  3. Run eventhandler

    To run eventhandler, you need to pass in the following flag when you run Agent:

    bash
    -enable-features=integrations-next

    This enables the latest version of the Agent integration subsystem. To learn more, refer to Integrations Revamp.

    A full Kubernetes container spec should be similar to this one:

    yaml
    containers:
      - name: agent
        image: grafana/agent:latest
        imagePullPolicy: IfNotPresent
        args:
          - -config.file=/etc/agent/agent.yaml
          - -enable-features=integrations-next
        command:
          - /bin/grafana-agent
        env:
          - name: HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        ports:
          - containerPort: 12345
            name: http-metrics
        volumeMounts:
          ## Should use a ConfigMap volume, stores Agent config
          - name: grafana-agent
            mountPath: /etc/agent
          ## Optional, but should use a persistent volume, stores Event cache
          - name: eventhandler-cache
            mountPath: /etc/eventhandler

    You should modify these parameters depending on your architecture and configured PersistentVolumes and ConfigMaps.
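
    For reference, the volumes section backing those mounts might look like the following sketch; the ConfigMap and PersistentVolumeClaim names are assumptions, so use your own:

    yaml
    volumes:
      ## ConfigMap holding agent.yaml
      - name: grafana-agent
        configMap:
          name: grafana-agent
      ## PersistentVolumeClaim backing the Event cache
      - name: eventhandler-cache
        persistentVolumeClaim:
          claimName: eventhandler-cache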

  4. Add ClusterRole events permission

    You also need to allow Agent’s ClusterRole to access the events resource from the Kubernetes API:

    yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: grafana-agent
    rules:
      - apiGroups:
          - ''
        resources:
          - nodes
          - nodes/proxy
          - services
          - endpoints
          - pods
          ## added "events" here
          - events
        verbs:
          - get
          - list
          - watch
      - nonResourceURLs:
          - /metrics
        verbs:
          - get

    eventhandler only requires the get, list, and watch verbs for the events resource, but for clarity we’ve appended the required permission to the default ClusterRole provided by the Kubernetes integration (which also allows Prometheus service discovery).

Please surface any issues with this integration in the Grafana Agent GitHub Repo or on the Grafana Labs Community Slack (in #agent).

eventhandler is enabled by default in the latest version of the Kubernetes Monitoring agent manifests.