
Analyzing and reducing metrics usage with Grafana Mimirtool

In this guide you’ll learn how to use Grafana Mimirtool to identify high-cardinality metrics that are not referenced in your Grafana dashboards or Prometheus rules. Using this list of “unused metrics,” you can then leverage Prometheus’s relabel config feature to drop metrics and labels that you might not want to ship to Grafana Cloud for long-term storage and querying. This can help you reduce your active series usage and monthly bill.

Grafana Mimirtool supports extracting metrics from:

  • Grafana dashboards in a Grafana instance
  • Prometheus alerting and recording rules in a Cloud Prometheus instance
  • Grafana dashboard JSON files
  • Prometheus recording and alerting rule YAML files

Grafana Mimirtool can then compare these extracted metrics to active series in a Prometheus or Cloud Prometheus instance, and output a list of “used” metrics and “unused” metrics:

  • “Used metrics” are metrics that are referenced in a dashboard or rule that you are actively shipping to Grafana Cloud
  • “Unused metrics” are metrics that are not referenced in a dashboard or rule that you are actively shipping to Grafana Cloud

Warning: There are some metrics that you might not have in a dashboard or rule that you may want to query or analyze in the future, especially during an incident. Keep this in mind when choosing which metrics to keep or drop.

This guide covers an end-to-end example of extracting metrics from a dashboard, finding “used” metrics, and generating a relabel_config to only keep those and drop everything else.

Prerequisites

Before you begin this guide, make sure you have the following available:

  • Grafana Mimirtool installed and available on your machine. To learn how to install Grafana Mimirtool, see Installation.
  • A Grafana Cloud account. To create an account, see Grafana Cloud and click on Start for free.
  • A Grafana Cloud API key with the Admin role.
  • An API key for your managed Grafana instance. You can learn how to create a Grafana API key in Create API token.
  • Prometheus or Grafana Agent installed in your environment and configured to ship metrics to Grafana Cloud.

Step 1: Identify metrics referenced in Grafana dashboards

In this step you’ll use mimirtool analyze grafana to extract metrics that are referenced in your managed Grafana dashboards.

Grafana Mimirtool includes two built-in commands for extracting metrics from dashboards:

  • mimirtool analyze grafana, which fetches dashboards from a managed Grafana or OSS Grafana instance
  • mimirtool analyze dashboard, which extracts metrics from dashboard JSON files

To begin, make sure that you are shipping some metrics to your Cloud Prometheus endpoint, and have some dashboards in your managed Grafana instance. If you haven’t already, create an API key for your managed Grafana instance:

Click on the Configuration menu item in the left-hand nav of your managed Grafana instance. Then, click on API Keys:

API key menu item

Create an API key with the Viewer role:

Create API key view

Using your managed Grafana API key, run mimirtool analyze grafana to extract metrics from your Grafana dashboards.

Note: This API key is different from your Grafana Cloud API key, which configures authentication for Cloud Prometheus.

mimirtool analyze grafana --address=https://your_stack_name.grafana.net --key=<Your Grafana API key>

mimirtool downloads dashboards from your managed Grafana instance and parses out the metrics referenced in each dashboard’s PromQL queries. It then saves the output in a file called metrics-in-grafana.json. If it encounters errors while parsing a dashboard, the errors are stored in a parse_errors field of the JSON output.

The output looks similar to the following:

{
  "metricsUsed": [
    "grafanacloud_instance_active_series",
    "grafanacloud_instance_info",
    "grafanacloud_instance_samples_discarded_per_second",
    "grafanacloud_instance_samples_per_second",
    "grafanacloud_logs_instance_bytes_received_per_second"
  ],
  "dashboards": [
    {
      "slug": "",
      "uid": "LximWqMnz",
      "title": "Grafana Cloud Billing/Usage",
      "metrics": [
        "grafanacloud_instance_active_series",
        "grafanacloud_instance_info",
        "grafanacloud_instance_samples_discarded_per_second",
        "grafanacloud_instance_samples_per_second",
        "grafanacloud_logs_instance_bytes_received_per_second",
        "grafanacloud_logs_instance_info"
      ],
      "parse_errors": null
    },
    {
      "slug": "",
      "uid": "zxHdvqM7z",
      "title": "Nodes",
      "metrics": [
        "node_cpu_seconds_total",
        "node_disk_io_time_seconds_total",
        "node_disk_read_bytes_total",
        ...

The metricsUsed object contains all of the metrics that are referenced across all dashboards. Each dashboard also has a metrics array with its extracted metrics.
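To see at a glance how many metrics each dashboard references, you can post-process this file with jq (the same command-line JSON utility used later in this guide); a minimal sketch, assuming jq is installed:

```shell
# Summarize the analyze output: one line per dashboard with its metric count.
# Reads the metrics-in-grafana.json file written by `mimirtool analyze grafana`.
jq -r '.dashboards[] | "\(.title): \(.metrics | length) metrics"' metrics-in-grafana.json
```

Dashboards with unusually large metric counts are a good place to start when deciding what to keep.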

You can follow a similar procedure to extract metrics from:

  • Dashboard JSON files directly (analyze dashboard)
  • Cloud Prometheus rules (analyze ruler)
  • Prometheus rule YAML files (analyze rule-file)

To learn more about these commands, see Grafana Mimirtool.
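The companion commands follow the same pattern. The following invocations are illustrative sketches (not run here; analyze ruler additionally needs your Cloud Prometheus credentials):

```shell
# Extract metrics from a local dashboard JSON export
mimirtool analyze dashboard my-dashboard.json

# Extract metrics from a local Prometheus rule YAML file
mimirtool analyze rule-file my-rules.yaml

# Extract metrics from rules stored in your Cloud Prometheus instance
mimirtool analyze ruler --address=<Your Cloud Prometheus endpoint> --id=<Your instance ID> --key=<Your Cloud API key>
```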

With this list of referenced metrics, you can use analyze prometheus to identify active metrics that are not in this list.

Step 2: Identify unused active metrics

In this step, you’ll use mimirtool analyze prometheus to identify active series that are not referenced in any dashboard or rule.

Warning: There are some metrics that you might not have in a dashboard or rule that you might want to query or analyze in the future, especially during an incident. Keep this in mind when choosing which metrics to keep or drop.

Note that analyze prometheus uses the JSON output from the previous step to determine which metrics are “used.” If you pass in both metrics-in-grafana.json and metrics-in-ruler.json files, it combines them into a single array of “used” metrics.

analyze prometheus can work against any Prometheus API (including Cloud Prometheus and OSS Prometheus).

Run the command with the appropriate parameters:

mimirtool analyze prometheus --address=<Your Cloud Prometheus query endpoint> --id=<Your Cloud Prometheus instance ID> --key=<Your Cloud API key> --log.level=debug

You can find your Prometheus query endpoint and instance ID from the Prometheus panel of the Cloud Web Portal. To learn how to create an API key for your Cloud Prometheus endpoint, see Create an API key.

By default, this command looks for a local file called metrics-in-grafana.json. You can change this with the --grafana-metrics-file flag. To enable more verbose output, use the --log.level=debug flag.
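For example, a sketch of an invocation that points at non-default input files (flag names taken from mimirtool’s analyze prometheus options; not run here, since it requires your Cloud Prometheus credentials):

```shell
mimirtool analyze prometheus \
  --address=<Your Cloud Prometheus query endpoint> \
  --id=<Your Cloud Prometheus instance ID> \
  --key=<Your Cloud API key> \
  --grafana-metrics-file=metrics-in-grafana.json \
  --ruler-metrics-file=metrics-in-ruler.json
```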

Depending on your metric volume, the command might take several minutes to run. When it’s done, you should see something like the following example:

INFO[0000] Found 243 metric names
INFO[0003] 57 active series are being used in dashboards
INFO[0019] 395 active series are NOT being used in dashboards

This indicates that mimirtool found 243 unique metric names.

Note: A given metric can have multiple labels, and an active series is a unique combination of metric name and one or more labels. To learn more, see Prometheus time series.

If you only want to keep metrics that are referenced in your hosted Grafana dashboards, you can reduce your active series usage by 395.

Inspect the command’s output file, prometheus-metrics.json:

{
  "total_active_series": 452,
  "in_use_active_series": 57,
  "additional_active_series": 395,
  "in_use_metric_counts": [
    {
      "metric": "node_cpu_seconds_total",
      "count": 16,
      "job_counts": [
        {
          "job": "integrations/node_exporter",
          "count": 16
        }
      ]
    },
    ...
  ],
  "additional_metric_counts": [
    {
      "metric": "node_scrape_collector_success",
      "count": 39,
      "job_counts": [
        {
          "job": "integrations/node_exporter",
          "count": 39
        }
      ]
    },
    {
      "metric": "node_scrape_collector_duration_seconds",
      "count": 39,
      "job_counts": [
        {
          "job": "integrations/node_exporter",
          "count": 39
        }
      ]
    },
    ...

In the preceding example, there are 452 active series, of which 57 are referenced in Grafana dashboards. Each metric object contains its active series count, with an additional breakdown by job label.

To reduce usage, you can either take an allowlist approach, keeping only the referenced metrics and dropping everything else, or go through the high-cardinality metrics in the additional_metric_counts section and choose which metrics to drop.
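To find the worst offenders quickly, you can sort additional_metric_counts by series count with jq; a minimal sketch, assuming jq is installed:

```shell
# List the five highest-cardinality unused metrics from the
# prometheus-metrics.json file written by `mimirtool analyze prometheus`.
jq -r '.additional_metric_counts | sort_by(-.count) | .[:5][] | "\(.count) \(.metric)"' prometheus-metrics.json
```

Dropping just the top few entries from this list often recovers a large share of the unused active series.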

Step 3: Drop unused active metrics with relabel_config

In this step you’ll learn how to allowlist and denylist metrics extracted in the previous steps.

To learn more about the concepts discussed in this section, see Reducing Prometheus metrics usage with relabeling.

A Prometheus or Grafana Agent configuration has a remote_write block similar to the following:

remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
  basic_auth:
    username: <Your Cloud Prometheus instance ID>
    password: <Your Cloud Prometheus API key>

This block can accept a write_relabel_configs stanza that allows you to relabel, keep, and drop metrics and labels before shipping them to the remote_write endpoint. To learn more about write_relabel_configs parameters, see <relabel_config> from the Prometheus documentation.

If you want to construct an allowlist of metrics, you don’t need the output from analyze prometheus. You can construct the allowlist directly from metrics-in-grafana.json by using the following bash command, after you have installed the jq command-line utility:

jq '.metricsUsed' metrics-in-grafana.json \
| tr -d '", ' \
| sed '1d;$d' \
| grep -v '^grafanacloud' \
| paste -s -d '|' -

This command does the following:

  • Uses jq to extract the metricsUsed object from the metrics-in-grafana.json JSON file
  • Uses tr to remove double quotes, commas, and spaces
  • Uses sed to remove the first and last line (which are array brackets)
  • Uses grep to filter out metrics that begin with grafanacloud (these are from the Billing dashboard)
  • Uses paste to format the output into relabel_config regex format

The output looks similar to the following:

instance:node_cpu_utilisation:rate1m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate1m|instance:node_network_receive_drop_excluding_lo:rate1m|instance:node_network_transmit_bytes_excluding_lo:rate1m|instance:node_network_transmit_drop_excluding_lo:rate1m|instance:node_vmstat_pgmajfault:rate1m|instance_device:node_disk_io_time_seconds:rate1m|instance_device:node_disk_io_time_weighted_seconds:rate1m|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_uname_info|up

You might need to modify the bash command depending on your metric output.
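As an alternative sketch, you can build the same regular expression with jq alone, which avoids the line-oriented tr/sed/grep parsing entirely (assumes jq is installed; the grafanacloud filter matches the one in the pipeline above):

```shell
# Join all used metric names into a relabel_config regex,
# excluding the grafanacloud billing metrics.
jq -r '.metricsUsed | map(select(startswith("grafanacloud") | not)) | join("|")' metrics-in-grafana.json
```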

You can place this regular expression into a keep directive:

remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
  basic_auth:
    username: <Your Cloud Prometheus instance ID>
    password: <Your Cloud Prometheus API key>
  write_relabel_configs:
  - source_labels: [__name__]
    regex: instance:node_cpu_utilisation:rate1m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate1m|instance:node_network_receive_drop_excluding_lo:rate1m|instance:node_network_transmit_bytes_excluding_lo:rate1m|instance:node_network_transmit_drop_excluding_lo:rate1m|instance:node_vmstat_pgmajfault:rate1m|instance_device:node_disk_io_time_seconds:rate1m|instance_device:node_disk_io_time_weighted_seconds:rate1m|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_uname_info|up
    action: keep

This instructs Prometheus or Grafana Agent to keep only metrics whose name matches the regular expression. All other metrics are dropped.

Note: Because this step is in the remote_write block, it only runs before shipping metrics to Grafana Cloud, so these metrics remain available locally. If you are using Grafana Agent, which does not store metrics locally, you no longer have access to them.

Similarly, you can also use the drop directive:

remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
  basic_auth:
    username: <Your Cloud Prometheus instance ID>
    password: <Your Cloud Prometheus API key>
  write_relabel_configs:
  - source_labels: [__name__]
    regex: node_scrape_collector_success|node_scrape_collector_duration_seconds
    action: drop

This drops the node_scrape_collector_success and node_scrape_collector_duration_seconds metrics, which were in the additional_metric_counts section of the analyze prometheus output. This denylisting approach can help you quickly omit the worst-offending metrics to reduce usage, without the more heavy-handed allowlist approach.

You can also drop and keep time series based on labels other than __name__. This can be useful for high-cardinality metrics for which you only need certain label values.
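For example, a sketch of a write_relabel_configs fragment that drops only the idle-mode series of node_cpu_seconds_total (the mode label and its idle value come from node_exporter; adjust the labels and values to your workload). By default, relabeling joins source_labels values with a semicolon before matching the regex:

```yaml
  write_relabel_configs:
  - source_labels: [__name__, mode]
    regex: node_cpu_seconds_total;idle
    action: drop
```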

Conclusion

In this guide you learned how to use mimirtool analyze commands to identify metrics that are referenced in Grafana dashboards. You then identified active series not referenced in dashboards with analyze prometheus, and finally configured Prometheus to allowlist dashboard metrics.

The mimirtool analyze grafana command might encounter parse errors, and mimirtool analyze prometheus only looks at active series. Although you are not billed for inactive series, you might still be storing older metrics that aren’t picked up by mimirtool analyze prometheus.

Finally, there are some metrics that you might not have in a dashboard or rule that you might want to query or analyze in the future, especially during an incident. Keep this in mind when choosing which metrics to keep or drop.

To learn more about analyzing and reducing metric usage, see Control Prometheus metrics usage.