Troubleshoot Kubernetes Monitoring

This section includes common errors encountered while installing and configuring Kubernetes Monitoring components and tools you can use to troubleshoot.

Couldn’t load repositories file

If you receive the following message when running the chart installation generated by Grafana Cloud, run helm init:

Error: Couldn't load repositories file (/root/.helm/repository/repositories.yaml).

This is a common error for new installations of Kubernetes and K3s.

Duplicate metrics

Certain metric data sources (such as Node Exporter or kube-state-metrics) may already exist on the Cluster. When you deploy with the Kubernetes Monitoring Helm chart, these data sources are installed even if they are already present on your Cluster.

  1. Visit the Metrics status tab to view any duplicates.
  2. Remove the duplicates or adjust the Helm chart values to use the existing ones and skip deploying another instance.

Invalid argument 300s

If you receive the following message when running the chart installation generated by Grafana Cloud, you’re using an old version of Helm:

Error: invalid argument 300s for --timeout flag: strconv.ParseInt: parsing 300s: invalid syntax

Update to Helm version 3.
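The strconv.ParseInt text in the message suggests that the older Helm client reads the --timeout value as a whole number, while Helm 3 accepts a duration such as 300s. The following Python sketch only mimics those two parsing behaviors to show why the same value fails in one and succeeds in the other. The function names are hypothetical and this is not Helm's own code.

```python
import re

# Seconds per unit for the simple duration forms handled in this sketch.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout_as_int(value: str) -> int:
    """Mimic an integer-only flag parser: '300' works, '300s' does not."""
    return int(value)  # raises ValueError for '300s'


def parse_timeout_as_duration(value: str) -> int:
    """Mimic a duration parser for values like '300s' or '5m'; returns seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    print(parse_timeout_as_duration("300s"))
    try:
        parse_timeout_as_int("300s")
    except ValueError as err:
        print("integer parser rejected the value:", err)
```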

Kubernetes cluster unreachable

For K3s deployments, you may receive the following message when running the chart installation generated by Grafana Cloud:

Error: Kubernetes cluster unreachable: Get http://localhost:8080/version: dial tcp 127.0.0.1:8080: connect: connection refused

If so, execute the following command before you run Helm:

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

Missing data

Here are some tips for missing data.

Debug Metrics

For any panel, click the menu icon and select Debug metrics for this panel.

Accessing the menu for the panel to show the menu options

Debug Metrics lists all metrics used for the panel along with any errors found.

Debug Metrics for the panel

CPU usage negative and missing data

If you did not install Kubernetes Monitoring with the Helm chart and instead deployed the OTel collector as a DaemonSet, you could have issues with CPU usage data. The OTel collector should be deployed as a Deployment. With a DaemonSet, multiple samples may be written out of order to the same time series. This can cause Kubernetes Monitoring to show:

  • Negative rates for CPU usage
  • Gaps in usage showing on Optimization panels
  • Unevenly spaced data points indicative of multiple sample ingestion, which may also be interpreted as counter resets
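To see why out-of-order writes produce these symptoms, consider a simplified model of a CPU counter. The Python sketch below uses made-up sample values and hypothetical helper names. It is not how Grafana Cloud computes rates, but it shows how one misplaced sample yields a negative delta, or an inflated increase when a drop is treated as a counter reset.

```python
def naive_deltas(values):
    """Difference between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def reset_aware_increase(values):
    """Total increase, treating any drop as a counter reset to zero."""
    total = 0
    for earlier, later in zip(values, values[1:]):
        # After a presumed reset, the whole new value counts as increase.
        total += (later - earlier) if later >= earlier else later
    return total


# Counter samples as stored when a single writer scrapes the target.
in_order = [100, 110, 120, 130]
# Same counter, but a second writer's older sample (110) lands after 120.
interleaved = [100, 120, 110, 130]

if __name__ == "__main__":
    print(naive_deltas(in_order))             # [10, 10, 10]
    print(naive_deltas(interleaved))          # [20, -10, 20]
    print(reset_aware_increase(in_order))     # 30
    print(reset_aware_increase(interleaved))  # 150
```

In this model the true increase is 30 in both cases. The interleaved series reports either a negative step or a much larger total, depending on how the drop is interpreted.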

CPU usage panels missing data

If there is no CPU usage data, the scrape intervals of the collector and the data source may not match. The default scrape interval for Grafana Alloy is 60 seconds. If the scrape interval for your data source is not 60 seconds, the mismatch may interfere with the CPU usage rate calculation.

To resolve, synchronize the scrape interval for the collector and data source.

  • If you configured the data source (meaning it wasn’t automatically provisioned by Grafana Cloud), change the scrape interval for the data source to match the collector.
  • If the data source was provisioned for you by Grafana Cloud, contact support to request the scrape interval for the data source be changed to match the collector.
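The effect of a mismatch can be sketched with a simplified model. A rate needs at least two samples inside its query window, and the Python example below assumes the window is four times the scrape interval configured on the data source. That factor, the function names, and the timestamps are assumptions made for illustration, not a description of the exact calculation used by Kubernetes Monitoring.

```python
def samples_in_window(actual_interval_s, window_s, now_s):
    """Timestamps of scrapes (taken every actual_interval_s) in (now - window, now]."""
    start = now_s - window_s
    return [
        t for t in range(0, now_s + 1, actual_interval_s)
        if start < t <= now_s
    ]


def can_compute_rate(actual_interval_s, configured_interval_s, now_s=600):
    """A rate needs at least two samples inside the query window."""
    # Assumption: the window is four times the configured scrape interval.
    window_s = 4 * configured_interval_s
    return len(samples_in_window(actual_interval_s, window_s, now_s)) >= 2


if __name__ == "__main__":
    # Collector scrapes every 60s, data source configured for 15s.
    print(can_compute_rate(60, 15))  # False
    # Both set to 60s.
    print(can_compute_rate(60, 60))  # True
```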

Data missing in a panel

If a panel in Kubernetes Monitoring seems to be missing data or shows a “No data” message, you can use either the Debug Metrics feature or open the query for the panel in Explore to determine which query is failing.

This can occur when new features are released. For example, if you see no data in the network bandwidth and saturation panels, it is likely you need to upgrade to the newest version of the Helm chart.

Data missing for a provider

If your cloud service provider name is not showing up on the Cluster list page, a provider_id is likely missing. This occurs with some types of Clusters, such as those on an internal provider or on bare metal. To ensure your provider shows up, create a relabeling rule for the provider:

metrics:
  kube-state-metrics:
    extraMetricRelabelingRules: |-
      rule {
        source_labels = ["__name__", "provider_id", "node"]
        separator = "@"
        regex = "kube_node_info@@(.*)"
        replacement = "<cluster provider id>://${1}"
        action = "replace"
        target_label = "provider_id"
      }

Replace <cluster provider id> with the provider ID you would like to appear in the Kubernetes Monitoring Cluster list page.
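To reason about what this rule matches, it can help to mimic it outside the collector. The Python sketch below assumes Prometheus-style relabeling semantics: the source labels are joined with the separator, the regex must match the whole joined string, and the first capture group is substituted into the replacement. The function name and the my-provider value are hypothetical.

```python
import re

SOURCE_LABELS = ("__name__", "provider_id", "node")


def apply_provider_rule(labels, provider="my-provider"):
    """Return a copy of labels with provider_id filled in when it is empty."""
    # Missing labels are treated as empty strings before joining with "@".
    joined = "@".join(labels.get(name, "") for name in SOURCE_LABELS)
    # "@@" only appears when provider_id is empty, so existing IDs are untouched.
    match = re.fullmatch(r"kube_node_info@@(.*)", joined)
    if match is None:
        return dict(labels)
    return {**labels, "provider_id": f"{provider}://{match.group(1)}"}


if __name__ == "__main__":
    bare_metal = {"__name__": "kube_node_info", "node": "node-1"}
    print(apply_provider_rule(bare_metal)["provider_id"])  # my-provider://node-1
```

A series that already carries a provider_id, or a metric other than kube_node_info, does not match and is returned unchanged.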

Efficiency usage data missing

If CPU and memory usage within any table shows no data, it could be due to missing Node Exporter metrics. Navigate to the Metrics status tab to determine what is not being reported.

Metrics missing

If metrics are missing even though the Metrics status tab is showing that the configuration is set up as you intended, check for an incorrectly configured label for the Node Exporter instance.

Make sure the Node Exporter instance label is set to the Node name. The labels for kube-state-metrics node and Node Exporter instance must contain the same values.
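One way to reason about this check is as a set comparison between the two label values. The Python sketch below uses made-up label values and a hypothetical helper name. In practice you would take the values from your own kube-state-metrics and Node Exporter series.

```python
def nodes_without_matching_instance(ksm_nodes, exporter_instances):
    """Node names from kube-state-metrics with no identical Node Exporter instance."""
    return sorted(set(ksm_nodes) - set(exporter_instances))


if __name__ == "__main__":
    ksm_nodes = ["node-a", "node-b"]
    # The second instance uses an address instead of the Node name.
    exporter_instances = ["node-a", "10.0.0.7:9100"]
    print(nodes_without_matching_instance(ksm_nodes, exporter_instances))  # ['node-b']
```

Any Node returned here has no Node Exporter series with a matching instance label, so its metrics cannot be matched up.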

Methodology for missing metrics

It’s helpful to keep in mind the different phases of metrics gathering when debugging.

Discovery

Find the metric source. In this phase, find out whether the tool that gathers metrics is working. For example, is Node Exporter running? Can Alloy find Node Exporter? The configuration may be incorrect, for example if Alloy is looking in a particular namespace or for a specific label that does not match the source.

Scraping

Ask whether the metrics were gathered correctly. As an example, most metric sources use HTTP, but the metric source you are trying to find uses HTTPS. Identify whether the configuration is set for scraping HTTPS.

Processing

Ask whether metrics were correctly processed. With Kubernetes Monitoring, metrics are filtered down to a small subset of useful metrics, so check whether the metric you expect was filtered out.

Delivery

In this phase, metrics are sent to Grafana Cloud. If there is an issue, there are likely no metrics being delivered. This can occur if your account’s limit for metrics has been reached. Check the Usage Insights - 5 - Metrics Ingestion dashboard.

List of Grafana Cloud dashboards with Metrics Ingestion dashboard highlighted

Displaying

In this phase, a metric is not showing up in the Kubernetes Monitoring GUI. If you’ve determined the metrics are being delivered but some are not displaying, there may be a missing or incorrect label for the metric. Check the Metrics status tab.
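The phases are most useful when checked in order, because a failure in an early phase explains every later symptom. As a rough illustration, the Python sketch below walks the five phases and reports the first one whose check failed. The results dictionary is hypothetical and stands in for checks you perform manually.

```python
PHASES = ["Discovery", "Scraping", "Processing", "Delivery", "Displaying"]


def first_failing_phase(results):
    """Return the earliest phase whose check did not pass, or None if all passed."""
    for phase in PHASES:
        # A phase with no recorded result is treated as not yet verified.
        if not results.get(phase, False):
            return phase
    return None


if __name__ == "__main__":
    results = {"Discovery": True, "Scraping": True, "Processing": False}
    print(first_failing_phase(results))  # Processing
```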

Pod logs missing

If you are not seeing Pod logs and your platform is AWS EKS Fargate, these logs cannot be gathered using a hostpath volume mount. Instead, you can use API-based log gathering. For greater detail, refer to EKS Fargate.

Network metrics missing

If you have deployed on the AWS EKS Fargate platform, AWS prevents a level of access that Node Exporter requires to gather metrics for the network panels. EKS Fargate provides on-demand compute for Kubernetes objects instead of the traditional means where these objects run on Nodes.

Port conflicts and Node Exporter

Node Exporter opens a host networking port. If another Node Exporter is already in use, the two exporters conflict over the same default port. To disable the Helm chart’s Node Exporter and use the existing exporter, customize the Helm chart by adding the following to the YAML file:

yaml
clusterMetrics:
  node-exporter:
    enabled: true
    deploy: false

Workload data missing

If you are seeing Pod resource usage but not workload usage data, the recording rules and alert rules are likely not installed.

  1. Navigate to the Configuration page.
  2. Click the Metrics status tab.
  3. In the Workload Recording Rule panel, click Install to install alert rules and recording rules.

Troubleshooting deployment with Helm chart

Two issues commonly occur when a Helm chart is not configured correctly:

  • Duplicate metrics
  • Missing metrics

If you have configured Kubernetes Monitoring with the Grafana Kubernetes Monitoring Helm chart, here are some general troubleshooting techniques:

  • Within Kubernetes Monitoring, view the metrics status.
  • Check for any changes with the command helm template .... This produces an output.yaml file to check the result.
  • Check the configuration with the command helm test --logs. This provides a configuration validation, including all phases of metrics gathering through display.
  • Check the extraConfig section of the Helm chart to ensure this section is not used for modifications. This section is only for additional configuration not already in the chart, and not for modifications to the chart.

View metrics status

To view the status of metrics being collected:

  1. Click Configuration on the menu.
  2. Click the Metrics status tab.
  3. Filter for the Cluster or Clusters you want to see the status of.
Metrics status tab with status indicators for one Cluster

Status icons

Each panel of the Metrics status shows an icon that indicates the status of the incoming data, based on the selected data source, Cluster, and time range:

  • Check mark in a circle (green): Data for this source is being collected. The version of the source or online status also displays (if available).
  • Caution with exclamation mark (yellow): Duplicate data is being collected for the metric source.
  • X in a circle (red): There is no data available for this item within the time range specified, and it appears to be offline.
Metrics status panel with icon warning of multiple metrics
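The three icons can be read as a function of how many instances of a metric source are reporting in the selected time range. The Python sketch below encodes that reading. It assumes that a duplicate means more than one reporting instance, and the function name is hypothetical.

```python
def status_icon(reporting_instances: int) -> str:
    """Map the number of reporting instances of a source to a status color."""
    if reporting_instances < 0:
        raise ValueError("instance count cannot be negative")
    if reporting_instances == 0:
        return "red"     # no data in the selected time range
    if reporting_instances == 1:
        return "green"   # data is being collected
    return "yellow"      # duplicate data for this metric source


if __name__ == "__main__":
    for count in (0, 1, 2):
        print(count, status_icon(count))
```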

Check initial configuration

During initial configuration, a red X in a circle in any box can mean any of the following:

  • The feature was not selected during Cluster configuration.
  • The system is not running correctly.
  • Alloy was not able to gather data correctly.
  • No data was gathered during the time range specified.

View the query with Explore

If something in the metrics status looks incorrect, click the icon next to the panel title. This opens the query in Explore where you can examine the query for any issues, such as an incorrect label.

Look at a historical time range

Use the time range selector to understand what was occurring in the past. In the following example, Cluster events were being collected but are not currently.

Time range of last two days for Metrics status

View documentation for each status

For more information about each status, click the Docs link in each panel.

Error messages

Here are tips for errors you may receive related to configuration.

OpenShift error

With OpenShift’s default SecurityContextConstraints (SCC) of restricted (refer to the SCC documentation for more information), you may run into the following errors while deploying Grafana Alloy using the default generated manifests:

msg="error creating the agent server entrypoint" err="creating HTTP listener: listen tcp 0.0.0.0:80: bind: permission denied"

By default, the Alloy StatefulSet container attempts to bind to port 80, which is only allowed by the root user (0) and other privileged users. With the default restricted SCC on OpenShift, this results in the preceding error.

Events:
  Type     Reason        Age                   From                  Message
  ----     ------        ----                  ----                  -------
  Warning  FailedCreate  3m55s (x19 over 15m)  daemonset-controller  Error creating: pods "grafana-agent-logs-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.runAsUser: Invalid value: 0: must be in the ranges: [1000650000, 1000659999], spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

By default, the Alloy DaemonSet attempts to run as root user, and also attempts to access directories on the host (to tail logs). With the default restricted SCC on OpenShift, this results in the preceding error.

To solve these errors, use the hostmount-anyuid SCC provided by OpenShift, which allows containers to run as root and mount directories on the host.

If this does not meet your security needs, create a new SCC with the required tailored permissions, or investigate running Alloy as a non-root container, which goes beyond the scope of this troubleshooting guide.

To use the hostmount-anyuid SCC, add the following stanza to the alloy and alloy-logs ClusterRoles:

yaml
. . .
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  verbs:
  - use
  resourceNames:
  - hostmount-anyuid
. . .

Update error

If you attempted to upgrade Kubernetes Monitoring with the Update button on the Cluster configuration tab under Configuration and received an error message, complete the following instructions.

Warning

Uninstalling Grafana Alloy deletes its associated alert and recording rule namespace. Alerts added to the default locations are also removed. Save a copy of any customized item if you modified the provisioned version.

  1. Click Uninstall.
  2. Click Install to reinstall.
  3. Complete the instructions in Configure with Grafana Kubernetes Monitoring Helm chart.