Analyze and reduce metrics usage with Grafana Mimirtool
In this guide you’ll learn how to use Grafana Mimirtool to identify high-cardinality metrics that are not referenced in your Grafana dashboards or Prometheus rules. Using this list of “unused metrics,” you can then leverage Prometheus’s relabel config feature to drop metrics and labels that you might not want to ship to Grafana Cloud for long-term storage and querying. This can help you reduce your active series usage and monthly bill.
Grafana Mimirtool supports extracting metrics from:
- Grafana dashboards in a Grafana instance
- Prometheus alerting and recording rules in a Cloud Prometheus instance
- Grafana dashboard JSON files
- Prometheus recording and alerting rule YAML files
Grafana Mimirtool can then compare these extracted metrics to active series in a Prometheus or Cloud Prometheus instance, and output a list of “used” metrics and “unused” metrics:
- “Used metrics” are metrics that you are actively shipping to Grafana Cloud and that are referenced in a dashboard or rule
- “Unused metrics” are metrics that you are actively shipping to Grafana Cloud but that are not referenced in any dashboard or rule
Warning: There are some metrics that you might not have in a dashboard or rule that you may want to query or analyze in the future, especially during an incident. Keep this in mind when choosing which metrics to keep or drop.
This guide covers an end-to-end example of extracting metrics from a dashboard, finding “used” metrics, and generating a relabel_config to keep only those metrics and drop everything else.
Prerequisites
Before you begin this guide, make sure you have the following available:
- Grafana Mimirtool installed and available on your machine. To learn how to install Grafana Mimirtool, see Installation.
- A Grafana Cloud account. To create an account, see Grafana Cloud and click on Start for free.
- A Grafana Cloud API key with the Admin role.
- An API key for your managed Grafana instance. You can learn how to create a Grafana API key in Create API token.
- Prometheus or Grafana Agent installed in your environment and configured to ship metrics to Grafana Cloud.
Step 1: Identify metrics referenced in Grafana dashboards
In this step you’ll use mimirtool analyze grafana to extract metrics that are referenced in your managed Grafana dashboards.
Grafana Mimirtool includes two built-in commands for extracting metrics from dashboards:
- mimirtool analyze grafana, which fetches dashboards from a managed Grafana or OSS Grafana instance
- mimirtool analyze dashboard, which extracts metrics from dashboard JSON files
To begin, make sure that you are shipping some metrics to your Cloud Prometheus endpoint, and have some dashboards in your managed Grafana instance. If you haven’t already, create an API key for your managed Grafana instance:
Click Administration in the left-side menu of your managed Grafana instance. Then, click API keys.
Create an API key with the Viewer role.
Using your managed Grafana API key, run mimirtool analyze grafana to extract metrics from your Grafana dashboards.
Note: This API key is different from your Grafana Cloud API key, which configures authentication for Cloud Prometheus.
mimirtool analyze grafana --address=https://your_stack_name.grafana.net --key=<Your Grafana API key>
mimirtool downloads dashboards from your managed Grafana instance and parses out the metrics that are referenced in each dashboard’s PromQL queries. It then saves the output in a file called metrics-in-grafana.json. If it encounters errors while parsing a dashboard, the errors are stored in a parse_errors field of the JSON output.
The output looks similar to the following:
{
"metricsUsed": [
"grafanacloud_instance_active_series",
"grafanacloud_instance_info",
"grafanacloud_instance_samples_discarded_per_second",
"grafanacloud_instance_samples_per_second",
"grafanacloud_logs_instance_bytes_received_per_sec"
],
"dashboards": [
{
"slug": "",
"uid": "LximWqMnz",
"title": "Grafana Cloud Billing/Usage",
"metrics": [
"grafanacloud_instance_active_series",
"grafanacloud_instance_info",
"grafanacloud_instance_samples_discarded_per_second",
"grafanacloud_instance_samples_per_second",
"grafanacloud_logs_instance_bytes_received_per_second",
"grafanacloud_logs_instance_info"
],
"parse_errors": null
},
{
"slug": "",
"uid": "zxHdvqM7z",
"title": "Nodes",
"metrics": [
"node_cpu_seconds_total",
"node_disk_io_time_seconds_total",
"node_disk_read_bytes_total"
The metricsUsed array contains all of the metrics that are referenced across all dashboards. Each dashboard entry also has a metrics array with its extracted metrics.
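Dashboards with parse errors might be missing metrics from this list, so it can be worth surfacing them. Here is a sketch using the jq command-line utility (which Step 3 also relies on) against the output format shown above:
# Show the title and parse errors of any dashboard that failed to parse cleanly
jq '.dashboards[] | select(.parse_errors != null) | {title, parse_errors}' metrics-in-grafana.json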
You can follow a similar procedure to extract metrics from:
- Dashboard JSON files directly (analyze dashboard)
- Cloud Prometheus rules (analyze ruler)
- Prometheus rule YAML files (analyze rule-file), as shown in the sketches below
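The following hedged sketches show the general shape of each command. The file names dashboard.json and rules.yaml are placeholders, and analyze ruler is assumed to take the same address, ID, and key parameters as analyze prometheus:
# Extract metrics from a local dashboard JSON export (file name is a placeholder)
mimirtool analyze dashboard dashboard.json

# Extract metrics from the rules in your Cloud Prometheus instance
mimirtool analyze ruler --address=<Your Cloud Prometheus endpoint> --id=<Your Cloud Prometheus instance ID> --key=<Your Cloud API key>

# Extract metrics from a local rule YAML file (file name is a placeholder)
mimirtool analyze rule-file rules.yaml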
To learn more about these commands, refer to Grafana Mimirtool.
With this list of referenced metrics in place, you can use analyze prometheus to identify active metrics that are not on the list.
Step 2: Identify unused active metrics
In this step, you’ll use mimirtool analyze prometheus to identify active series that are not referenced in any dashboard or rule.
Warning: There are some metrics that you might not have in a dashboard or rule that you might want to query or analyze in the future, especially during an incident. Keep this in mind when choosing which metrics to keep or drop.
Note that analyze prometheus uses the JSON output from the previous step to determine metrics that are “used.” If you pass in both metrics-in-grafana.json and metrics-in-ruler.json files, it constructs one array of “used” metrics.
analyze prometheus can work against any Prometheus API (including Cloud Prometheus and OSS Prometheus).
Run the command with the appropriate parameters:
mimirtool analyze prometheus --address=<Your Cloud Prometheus query endpoint> --id=<Your Cloud Prometheus instance ID> --key=<Your Cloud API key> --log.level=debug
You can find your Prometheus query endpoint and instance ID from the Prometheus panel of the Cloud Web Portal. To learn how to create an API key for your Cloud Prometheus endpoint, see Create an API key.
By default, this command looks for a local file called metrics-in-grafana.json. You can modify this behavior with the --grafana-metrics-file flag. To enable more verbose output, use the --log.level=debug flag.
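If you also generated a metrics-in-ruler.json file, you can point the command at both files explicitly. This sketch assumes your Mimirtool version supports the --ruler-metrics-file flag; verify with mimirtool analyze prometheus --help:
# Pass both "used metrics" files explicitly; flag support assumed, check --help
mimirtool analyze prometheus \
  --address=<Your Cloud Prometheus query endpoint> \
  --id=<Your Cloud Prometheus instance ID> \
  --key=<Your Cloud API key> \
  --grafana-metrics-file=metrics-in-grafana.json \
  --ruler-metrics-file=metrics-in-ruler.json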
Depending on your metric volume, the command might take several minutes to run. When it’s done, you should see something like the following example:
INFO[0000] Found 243 metric names
INFO[0003] 57 active series are being used in dashboards
INFO[0019] 395 active series are NOT being used in dashboards
This indicates that mimirtool found 243 unique metric names.
Note: A given metric can have multiple labels, and an active series is a unique combination of metric name and one or more labels. To learn more, see Prometheus time series.
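For example, a single metric name can map to many active series. The label values here are hypothetical:
node_cpu_seconds_total{cpu="0", mode="idle"}   # one active series
node_cpu_seconds_total{cpu="0", mode="user"}   # a second active series
node_cpu_seconds_total{cpu="1", mode="idle"}   # a third active series
One metric name, three label combinations, three active series.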
If you keep only the metrics that are referenced in your hosted Grafana dashboards, you can drop your active series usage by 395.
Inspect the command’s output file, prometheus-metrics.json:
{
"total_active_series": 452,
"in_use_active_series": 57,
"additional_active_series": 395,
"in_use_metric_counts": [
{
"metric": "node_cpu_seconds_total",
"count": 16,
"job_counts": [
{
"job": "integrations/node_exporter",
"count": 16
}
]
    },
    ...
  ],
  "additional_metric_counts": [
{
"metric": "node_scrape_collector_success",
"count": 39,
"job_counts": [
{
"job": "integrations/node_exporter",
"count": 39
}
]
},
{
"metric": "node_scrape_collector_duration_seconds",
"count": 39,
"job_counts": [
{
"job": "integrations/node_exporter",
"count": 39
}
]
    },
    ...
  ]
}
In the preceding example, there are 452 active series, of which 57 are referenced in Grafana dashboards. Each metric object contains its active series count, with an additional breakdown by job label.
To reduce usage, you can either use an allowlist to keep only the referenced metrics and drop everything else, or go through the high-cardinality metrics in the additional_metric_counts section and choose which ones to drop.
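To find the worst offenders quickly, you can rank the unused metrics by active series count with jq (the same utility Step 3 uses). This is a sketch against the prometheus-metrics.json format shown above:
# List the ten unused metrics with the highest active series counts
jq '.additional_metric_counts | sort_by(-.count) | .[:10]' prometheus-metrics.json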
Step 3: Drop unused active metrics with relabel_config
In this step you’ll learn how to allowlist and denylist metrics extracted in the previous steps.
To learn more about the concepts discussed in this section, see Reducing Prometheus metrics usage with relabeling.
A Prometheus or Grafana Agent configuration has a remote_write or prometheus_remote_write configuration block similar to the following:
remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
basic_auth:
username: <Your Cloud Prometheus instance ID>
password: <Your Cloud Prometheus API key>
This block can accept a write_relabel_configs stanza that allows you to relabel, keep, and drop metrics and labels before shipping them to the remote_write endpoint. To learn more about write_relabel_configs parameters, see <relabel_config> in the Prometheus documentation.
If you want to construct an allowlist of metrics, you don’t need the output from analyze prometheus. You can construct the allowlist directly from metrics-in-grafana.json by using the following bash command, after you have installed the jq command-line utility:
jq '.metricsUsed' metrics-in-grafana.json \
| tr -d '", ' \
| sed '1d;$d' \
| grep -v 'grafanacloud*' \
| paste -s -d '|' -
This command does the following:
- Uses jq to extract the metricsUsed array from the metrics-in-grafana.json JSON file
- Uses tr to remove double quotes, commas, and spaces
- Uses sed to remove the first and last lines (which are array brackets)
- Uses grep to filter out metrics that begin with grafanacloud (these are from the Billing dashboard)
- Uses paste to format the output into relabel_config regex format
The output looks similar to the following:
instance:node_cpu_utilisation:rate1m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate1m|instance:node_network_receive_drop_excluding_lo:rate1m|instance:node_network_transmit_bytes_excluding_lo:rate1m|instance:node_network_transmit_drop_excluding_lo:rate1m|instance:node_vmstat_pgmajfault:rate1m|instance_device:node_disk_io_time_seconds:rate1m|instance_device:node_disk_io_time_weighted_seconds:rate1m|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_uname_info|up
You might need to modify the bash command depending on your metric output.
You can place this regular expression into a keep directive:
remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
basic_auth:
username: <Your Cloud Prometheus instance ID>
password: <Your Cloud Prometheus API key>
write_relabel_configs:
- source_labels: [__name__]
regex: instance:node_cpu_utilisation:rate1m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate1m|instance:node_network_receive_drop_excluding_lo:rate1m|instance:node_network_transmit_bytes_excluding_lo:rate1m|instance:node_network_transmit_drop_excluding_lo:rate1m|instance:node_vmstat_pgmajfault:rate1m|instance_device:node_disk_io_time_seconds:rate1m|instance_device:node_disk_io_time_weighted_seconds:rate1m|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_uname_info|up
action: keep
This instructs Prometheus or Grafana Agent to only keep metrics whose metric name is matched by the regular expression. All other metrics get dropped.
Note: Because this step is in the remote_write block, it applies only to the metrics you ship to Grafana Cloud, so these metrics remain available locally. If you are using Grafana Agent, which does not store metrics locally, you no longer have access to these metrics.
Similarly, you can also use the drop directive:
remote_write:
- url: <Your Cloud Prometheus metrics instance remote_write endpoint>
basic_auth:
username: <Your Cloud Prometheus instance ID>
password: <Your Cloud Prometheus API key>
write_relabel_configs:
- source_labels: [__name__]
regex: node_scrape_collector_success|node_scrape_collector_duration_seconds
action: drop
This drops the node_scrape_collector_success and node_scrape_collector_duration_seconds metrics, which were in the additional_metric_counts section of the analyze prometheus output. This denylisting approach can help you quickly drop the worst-offending metrics to reduce usage, without resorting to the more heavy-handed allowlist approach.
You can also drop and keep time series based on labels other than __name__. This can be useful for high-cardinality metrics for which you only need certain labels.
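As a minimal sketch, the following write_relabel_configs entry drops only the mode="idle" series of node_cpu_seconds_total and keeps every other mode. The metric and label names follow node_exporter conventions, but whether dropping them makes sense depends on your dashboards:
write_relabel_configs:
  - source_labels: [__name__, mode]
    # Values from source_labels are joined with ";" by default, so this
    # regex matches only node_cpu_seconds_total series where mode="idle"
    regex: node_cpu_seconds_total;idle
    action: drop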
Conclusion
In this guide you learned how to use mimirtool analyze commands to identify metrics that are referenced in Grafana dashboards. You then identified active series not referenced in dashboards with analyze prometheus, and finally configured Prometheus to allowlist dashboard metrics.
The mimirtool analyze grafana command might encounter parse errors, and mimirtool analyze prometheus only looks at active series. Although you are not billed for inactive series, you might still be storing older metrics that aren’t picked up by mimirtool analyze prometheus.
Finally, there are some metrics that you might not have in a dashboard or rule that you might want to query or analyze in the future, especially during an incident. Keep this in mind when choosing which metrics to keep or drop.
To learn more about analyzing and reducing metric usage, see Control Prometheus metrics usage.