Analyzing and reducing metrics usage with cortextool

In this guide you’ll learn how to use cortextool to identify high-cardinality metrics that are not referenced in your Grafana dashboards or Prometheus rules. Using this list of “unused” metrics, you can then leverage Prometheus’s relabel config feature to drop metrics and labels that you may not want to ship to Grafana Cloud for long-term storage and querying. This can help you reduce your active series usage and your monthly bill.

cortextool supports extracting metrics from:

  • Grafana dashboards in a Grafana instance
  • Prometheus alerting and recording rules in a Cloud Prometheus instance
  • Grafana dashboard JSON files
  • Prometheus recording and alerting rule YAML files

cortextool can then compare these extracted metrics to active series in a Prometheus or Cloud Prometheus instance, and output a list of “used” metrics and “unused” metrics:

  • “Used” metrics are metrics you are actively shipping to Grafana Cloud that are referenced in a dashboard or rule
  • “Unused” metrics are metrics you are actively shipping to Grafana Cloud that are not referenced in any dashboard or rule

Warning: There are some metrics that you may not have in a dashboard or rule that you may want to query or analyze in the future, especially during an incident. Bear this in mind when choosing which metrics to keep or drop.

This guide will cover an end-to-end example of extracting metrics from a dashboard, finding “used” metrics, and generating a relabel_config to only keep those and drop everything else.

Prerequisites

Before you begin with this guide, you should have the following available to you:

  • cortextool installed and available on your machine. To learn how to install cortextool, please see Installation from the cortextool GitHub repo.
  • A Grafana Cloud account. To create an account, please see Grafana Cloud and click on Start for free.
  • A Grafana Cloud API key with the Admin role.
  • An API key for your hosted Grafana instance. You can learn how to create a Grafana API key in Create API Token.
  • Prometheus or the Grafana Agent installed in your environment and configured to ship metrics to Grafana Cloud.

Step 1: Identify metrics referenced in Grafana dashboards

In this step you’ll use cortextool analyse grafana to extract metrics referenced in your hosted Grafana dashboards.

cortextool includes two built-in commands for extracting metrics from dashboards:

  • cortextool analyse grafana, which fetches dashboards from a hosted Grafana or OSS Grafana instance
  • cortextool analyse dashboard, which extracts metrics from dashboard JSON files

To begin, make sure that you are shipping some metrics to your Cloud Prometheus endpoint, and have some dashboards in your hosted Grafana instance. If you haven’t already, create an API key for your hosted Grafana instance:

Click on the Configuration menu item in the left-hand nav of your hosted Grafana instance. Then, click on API Keys:

API key menu item

Create an API key with the Viewer role:

Create API key view

Using your hosted Grafana API key, run cortextool analyse grafana to extract metrics from your Grafana dashboards. Note that this API key is different from your Grafana Cloud API key, which configures authentication for Cloud Prometheus.

cortextool analyse grafana --address=https://your_stack_name.grafana.net --key=<Your Grafana API key>

cortextool will download dashboards from your hosted Grafana instance and parse out the metrics referenced in their PromQL queries. It will save the output to a file called metrics-in-grafana.json. If it encounters errors parsing a dashboard, these will be stored in the parse_errors field of the JSON output.

The output will look something like this:

{
  "metricsUsed": [
    "grafanacloud_instance_active_series",
    "grafanacloud_instance_info",
    "grafanacloud_instance_samples_discarded_per_second",
    "grafanacloud_instance_samples_per_second",
    "grafanacloud_logs_instance_bytes_received_per_second",
    ...
  ],
  "dashboards": [
    {
      "slug": "",
      "uid": "LximWqMnz",
      "title": "Grafana Cloud Billing/Usage",
      "metrics": [
        "grafanacloud_instance_active_series",
        "grafanacloud_instance_info",
        "grafanacloud_instance_samples_discarded_per_second",
        "grafanacloud_instance_samples_per_second",
        "grafanacloud_logs_instance_bytes_received_per_second",
        "grafanacloud_logs_instance_info",
        ...
      ],
      "parse_errors": null
    },
    {
      "slug": "",
      "uid": "zxHdvqM7z",
      "title": "Nodes",
      "metrics": [
        "node_cpu_seconds_total",
        "node_disk_io_time_seconds_total",
        "node_disk_read_bytes_total",
        ...

The metricsUsed field lists every metric referenced across all of your dashboards. Each dashboard entry also has its own metrics array with the metrics extracted from that dashboard.
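
If you want to see how many metrics each dashboard references, you can inspect this file with jq. The following is an optional sketch; the field names match the example output above:

jq '.dashboards[] | {title: .title, metric_count: (.metrics | length)}' metrics-in-grafana.json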

You can follow a similar procedure to extract metrics from:

  • Dashboard JSON files directly (analyse dashboard)
  • Cloud Prometheus rules (analyse ruler)
  • Prometheus rule YAML files (analyse rule-file)

To learn more about these commands, please see the cortextool docs.
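
For reference, the following invocations sketch how these commands are typically run. The flags mirror those used by analyse grafana and analyse prometheus, but they may vary between cortextool versions, so check cortextool analyse --help for the exact syntax. The file names here are placeholders:

# Extract metrics from local dashboard JSON files
cortextool analyse dashboard my-dashboard.json

# Extract metrics from rules loaded into your Cloud Prometheus ruler
cortextool analyse ruler --address=<Your Cloud Prometheus endpoint> --id=<Your Cloud Prometheus instance ID> --key=<Your Cloud API key>

# Extract metrics from local Prometheus rule YAML files
cortextool analyse rule-file my-rules.yaml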

With this list of referenced metrics in place, we can move on to using analyse prometheus to identify active metrics that are not on this list.

Step 2: Identify unused active metrics

In this step we’ll use cortextool analyse prometheus to identify active series that are not referenced in any dashboard or rule.

Warning: There are some metrics that you may not have in a dashboard or rule that you may want to query or analyze in the future, especially during an incident. Bear this in mind when choosing which metrics to keep or drop.

Note that analyse prometheus uses the JSON output from the previous step to determine which metrics are “used.” If you pass in both a metrics-in-grafana.json and a metrics-in-ruler.json file, it will combine them into a single list of “used” metrics.

analyse prometheus can work against any Prometheus API (including Cloud Prometheus and OSS Prometheus).

Run the command with the appropriate parameters:

cortextool analyse prometheus --address=<Your Cloud Prometheus query endpoint> --id=<Your Cloud Prometheus instance ID> --key=<Your Cloud API key> --log.level=debug

You can find your Prometheus query endpoint and instance ID from the Prometheus panel of the Cloud Web Portal. Please see Create an API Key to learn how to create an API key for your Cloud Prometheus endpoint.

This command looks for a local file called metrics-in-grafana.json by default. You can point it at a different file with the --grafana-metrics-file flag. You can also enable more verbose output using the --log.level=debug flag.
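
For example, if you extracted metrics from both dashboards and rules, you can pass both files in a single run. This is a hedged sketch; it assumes the --grafana-metrics-file and --ruler-metrics-file flag names, which may differ in your cortextool version:

cortextool analyse prometheus --address=<Your Cloud Prometheus query endpoint> --id=<Your Cloud Prometheus instance ID> --key=<Your Cloud API key> --grafana-metrics-file=metrics-in-grafana.json --ruler-metrics-file=metrics-in-ruler.json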

Depending on your metric volume, the command may take several minutes to run. When it’s done, you should see something like the following:

INFO[0000] Found 243 metric names
INFO[0003] 57 active series are being used in dashboards
INFO[0019] 395 active series are NOT being used in dashboards

This indicates that cortextool found 243 unique metric names. Note that a given metric name can have many label combinations, and an active series is a unique combination of a metric name and a set of label values. To learn more, please see Prometheus time series.
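
For example, a single metric name can correspond to many active series, one per unique combination of label values. The label values below are purely illustrative:

node_cpu_seconds_total{cpu="0", mode="idle", instance="host-1"}   # one active series
node_cpu_seconds_total{cpu="0", mode="user", instance="host-1"}   # a second active series
node_cpu_seconds_total{cpu="1", mode="idle", instance="host-1"}   # a third active series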

If we keep only the metrics referenced in our hosted Grafana dashboards, we can reduce our active series usage by 395.

Inspect the command’s output file, prometheus-metrics.json:

{
  "total_active_series": 452,
  "in_use_active_series": 57,
  "additional_active_series": 395,
  "in_use_metric_counts": [
    {
      "metric": "node_cpu_seconds_total",
      "count": 16,
      "job_counts": [
        {
          "job": "integrations/node_exporter",
          "count": 16
        }
      ]
    },
    ...
  ],
  "additional_metric_counts": [
    {
      "metric": "node_scrape_collector_success",
      "count": 39,
      "job_counts": [
        {
          "job": "integrations/node_exporter",
          "count": 39
        }
      ]
    },
    {
      "metric": "node_scrape_collector_duration_seconds",
      "count": 39,
      "job_counts": [
        {
          "job": "integrations/node_exporter",
          "count": 39
        }
      ]
    },
    ...

Here we see that we have 452 active series, of which 57 are referenced in Grafana dashboards. Each metric object contains its active series count, with an additional breakdown by job label.

To reduce usage, we can either allowlist the referenced metrics (keeping only those and dropping everything else), or go through the high-cardinality metrics in the additional_metric_counts section and choose individual metrics to drop.
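
To see which unused metrics contribute the most active series, you can sort the additional_metric_counts section by count. A minimal sketch using jq, assuming the prometheus-metrics.json output shown above:

jq -r '.additional_metric_counts[] | "\(.count)\t\(.metric)"' prometheus-metrics.json | sort -rn | head -n 20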

Step 3: Drop unused active metrics with relabel_config

In this step you’ll learn how to allowlist and denylist metrics extracted in the previous steps.

To learn more about the concepts discussed in this section, please see Reducing Prometheus metrics usage with relabeling.

A Prometheus or Grafana Agent config will have a remote_write config block similar to the following:

remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
  basic_auth:
    username: <Your Cloud Prometheus instance ID>
    password: <Your Cloud Prometheus API key>

This block can accept a write_relabel_configs stanza that allows you to relabel, keep, and drop metrics and labels before they are shipped to the remote_write endpoint. To learn more about write_relabel_configs parameters, please see <relabel_config> in the Prometheus docs.

If we want to construct an allowlist of metrics, we don’t need the output from analyse prometheus. We can construct the allowlist directly from metrics-in-grafana.json using the following bash one-liner:

jq '.metricsUsed' metrics-in-grafana.json \
| tr -d '", ' \
| sed '1d;$d' \
| grep -v '^grafanacloud' \
| paste -s -d '|' -

Be sure to install the jq command-line utility before running this command.

This command does the following:

  • Uses jq to extract the metricsUsed object from the metrics-in-grafana.json JSON file
  • Uses tr to remove double quotes, commas, and spaces
  • Uses sed to remove the first and last line (which are array brackets)
  • Uses grep to filter out metrics that begin with grafanacloud (these are from the Billing dashboard)
  • Uses paste to format the output into relabel_config regex format

The output should look something like this:

instance:node_cpu_utilisation:rate1m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate1m|instance:node_network_receive_drop_excluding_lo:rate1m|instance:node_network_transmit_bytes_excluding_lo:rate1m|instance:node_network_transmit_drop_excluding_lo:rate1m|instance:node_vmstat_pgmajfault:rate1m|instance_device:node_disk_io_time_seconds:rate1m|instance_device:node_disk_io_time_weighted_seconds:rate1m|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_uname_info|up

You may need to modify the bash command depending on your metric output.
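
If you prefer to skip the tr and sed steps, a roughly equivalent sketch uses jq’s raw output mode to produce the same pipe-separated list:

jq -r '.metricsUsed[]' metrics-in-grafana.json | grep -v '^grafanacloud' | paste -s -d '|' -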

We can now place this regex into a keep directive:

remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
  basic_auth:
    username: <Your Cloud Prometheus instance ID>
    password: <Your Cloud Prometheus API key>
  write_relabel_configs:
  - source_labels: [__name__]
    regex: instance:node_cpu_utilisation:rate1m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate1m|instance:node_network_receive_drop_excluding_lo:rate1m|instance:node_network_transmit_bytes_excluding_lo:rate1m|instance:node_network_transmit_drop_excluding_lo:rate1m|instance:node_vmstat_pgmajfault:rate1m|instance_device:node_disk_io_time_seconds:rate1m|instance_device:node_disk_io_time_weighted_seconds:rate1m|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_uname_info|up
    action: keep

This instructs Prometheus or the Grafana Agent to keep only metrics whose name matches the regex; all other metrics are dropped. Note that because this step is in the remote_write block, it only runs before metrics are shipped to Grafana Cloud, so the dropped metrics remain available locally in Prometheus. If you are using the Grafana Agent, which does not store metrics locally, you will no longer have access to these metrics.

We can also use the drop directive in a similar fashion:

remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
  basic_auth:
    username: <Your Cloud Prometheus instance ID>
    password: <Your Cloud Prometheus API key>
  write_relabel_configs:
  - source_labels: [__name__]
    regex: node_scrape_collector_success|node_scrape_collector_duration_seconds
    action: drop

This drops the node_scrape_collector_success and node_scrape_collector_duration_seconds metrics, which appeared in the additional_metric_counts section of the analyse prometheus output. This denylisting approach can help you quickly drop the worst-offending metrics to reduce usage, without the more heavy-handed allowlist approach.

You can also drop and keep time series based on labels other than __name__. This can be useful for high-cardinality metrics for which you only need certain labels.
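
For example, the following sketch drops only the series of node_scrape_collector_duration_seconds whose collector label matches a few collectors you don’t need. The collector values here are illustrative; adjust the labels and regex to your own metrics. Note that Prometheus joins multiple source_labels values with a semicolon by default, which is why the regex contains one:

write_relabel_configs:
- source_labels: [__name__, collector]
  regex: node_scrape_collector_duration_seconds;(arp|bonding|btrfs)
  action: drop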

Conclusion

In this guide you learned how to use the cortextool analyse commands to identify metrics referenced in Grafana dashboards. You then identified active series not referenced in any dashboard with analyse prometheus, and finally configured Prometheus to allowlist the dashboard metrics.

Note that cortextool analyse grafana may encounter parse errors, and analyse prometheus only considers active series. You may still be storing older metrics that analyse prometheus does not pick up (however, you are not billed for inactive series).

Finally, there are some metrics that you may not have in a dashboard or rule that you may want to query or analyze in the future, especially during an incident. Bear this in mind when choosing which metrics to keep or drop.

To learn more about analyzing and reducing metric usage, please see Control Prometheus metrics usage.