Controlling and analyzing Hosted Prometheus usage

You can limit what you send to Grafana Cloud in a few ways through configuration. This page also answers a common question: which metrics are the biggest, that is, which have the highest cardinality?

Limiting active series sent to Grafana Cloud

You can use Prometheus write_relabel_configs or metric_relabel_configs to drop series before they are sent to the remote cloud storage.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

To store those metrics locally but not send them to the remote cloud storage, use write_relabel_configs in the remote_write section.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write

If you don’t want to store those metrics locally either, use metric_relabel_configs in the scrape_configs section, which drops them before they are ingested.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs
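
As a minimal sketch of the first option, the fragment below keeps every series in local storage but drops four container metrics from what is forwarded. The url and basic_auth values are placeholders (assumptions, not values taken from this page), and the regex is borrowed from the Kubernetes example in this section.

```yaml
remote_write:
  - url: <remote_write endpoint URL>   # placeholder: use your instance's endpoint
    basic_auth:
      username: <instance ID>          # placeholder
      password: <API key>              # placeholder
    # Applied only to samples sent to this remote endpoint;
    # local storage still keeps the dropped series.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
        action: drop
```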

A common source of “unnecessary” metrics is Kubernetes cluster monitoring. The following example drops some of the high-cardinality metrics collected from Kubernetes:

metric_relabel_configs:
  # Drop container_* series (names matching container_[a-z_]+) whose image label is empty.
  - source_labels: [__name__, image]
    separator: ;
    regex: container_([a-z_]+);
    replacement: $1
    action: drop
  # Drop four specific container metrics by name.
  - source_labels: [__name__]
    separator: ;
    regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
    replacement: $1
    action: drop

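Prometheus fully anchors relabel regexes, so each pattern must match the whole joined value of source_labels. The same structure can be applied to other applications by changing the regex to match the metrics you want to drop.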
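
Before changing the regex, it can help to preview which metric names it would match. The sketch below uses a hypothetical helper, preview_drop, that approximates Prometheus's fully anchored matching with grep -E -x. Prometheus uses RE2 syntax, so patterns that rely on features outside POSIX extended regular expressions may behave differently, and the helper only looks at the metric name, not at other labels.

```shell
# preview_drop REGEX
# Hypothetical helper (not part of Prometheus): reads metric names on stdin,
# one per line, and prints those that REGEX matches in full.
# grep -x requires the whole line to match, which approximates Prometheus's
# anchored relabel regexes for simple patterns.
preview_drop() {
  grep -E -x -e "$1" || true   # "|| true": finding no match is not an error here
}

# Sample input for illustration only.
printf '%s\n' \
  container_tasks_state \
  container_cpu_usage_seconds_total \
  container_network_tcp_usage_total \
| preview_drop 'container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)'
```

With this sample input, the helper prints container_tasks_state and container_network_tcp_usage_total. In practice, the input could be a list of your own metric names, such as the one produced by the metric-name listing in the Metadata API method section.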

Finding the biggest metrics

The “Grafana Cloud Billing/Usage” dashboard reports the total number of time series. To find which metrics contribute most to that total, use one of the following methods.

PromQL method

If your instance is small (fewer than 100K series), you can use a PromQL query. This article from robustperception.io shares a query that you can run in the Prometheus expression browser:
https://www.robustperception.io/which-are-my-biggest-metrics

topk(10, count by (__name__)({__name__=~".+"}))

If you try this in Grafana Explore against your Hosted Prometheus instance, you will get this error message:

“expanding series: query must contain metric name”

To protect against expensive queries, Cloud Prometheus refuses to run queries without a metric name if the time range is more than 5 minutes, so set the time range to Last 5 minutes. If the database has hundreds of thousands of series, the query will time out anyway; in that case, use one of the other methods listed below.

Metadata API method

If you have a large instance (more than 100K series), then we suggest using the metadata API instead.
https://prometheus.io/docs/prometheus/latest/querying/api/

This method uses the metadata API, which queries the entire index, so the counts may not accurately reflect the currently active series. Use the Query API method below if you want to limit the report to a point in time.

The following scripts use bash syntax and require jq. They also assume GNU sort (for the -n option).

Variables to define:

login="<instance ID>:<grafana.com Viewer API token>"
url=https://prometheus-us-central1.grafana.net/api/prom

You can find the instance ID and the URL on your Hosted Metrics details page in the grafana.com customer portal.

Get the list of all metric names sorted by name:

curl -s -u "$login" "$url/api/v1/label/__name__/values" | jq -r ".data[]" | sort

Get the list of all metric names sorted by cardinality:

curl -s -u "$login" "$url/api/v1/label/__name__/values" | jq -r ".data[]" \
| while read -r metric; do
    count=$(curl -s -u "$login" "$url/api/v1/series" -d "match[]=$metric" | jq -r ".data|length")
    echo "$count $metric"
  done \
| sort -n
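
The script above sorts in ascending order, so the biggest metrics end up at the bottom of a potentially long list. As a sketch, the hypothetical helper below (top_metrics is not an existing tool) reads the same "<count> <metric>" lines and prints only the N largest, along with each metric's share of the total. The sample counts are made up for illustration.

```shell
# top_metrics [N]
# Hypothetical helper: reads "<count> <metric>" lines on stdin and prints the
# N largest metrics (default 10) with their share of the total series count.
top_metrics() {
  local n=${1:-10}
  sort -rn | awk -v n="$n" '
    { count[NR] = $1; name[NR] = $2; total += $1 }
    END {
      if (total == 0) exit          # nothing to report
      for (i = 1; i <= NR && i <= n; i++)
        printf "%d %s %.1f%%\n", count[i], name[i], 100 * count[i] / total
    }'
}

# Made-up sample counts, for illustration only.
printf '%s\n' \
  '5 up' \
  '15 node_cpu_seconds_total' \
  '80 http_request_duration_seconds_bucket' \
| top_metrics 2
```

To apply it to real data, replace the final sort -n in the script above with top_metrics 20, for example.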

Similarly, the following script counts the number of unique values for each label name in the database:

curl -s -u "$login" "$url/api/v1/labels" \
| jq -r ".data[]" \
| while read -r label; do
    count=$(curl -s -u "$login" "$url/api/v1/label/$label/values" \
    | jq -r ".data|length")
    echo "$count $label"
  done \
| sort -n

Query API method

This method uses the Query API, which takes a time parameter.

The following scripts use bash syntax and require jq. They also assume GNU sort (for the -n option).

Variables to define:

login="<instance ID>:<grafana.com Viewer API token>"
url=https://prometheus-us-central1.grafana.net/api/prom

When there are hundreds of thousands of series, use the script below to get the list of metric names and their cardinality. It first gets the list of metric names (using the __name__ meta-label), then runs a count() query for each metric, which spreads out the load enough that the queries don’t time out. The entire loop takes a few minutes to complete.

now=$(date +%s)
curl -s -u "$login" "$url/api/v1/label/__name__/values" \
| jq -r ".data[]" \
| while read -r metric; do
    count=$(curl -s \
        -u "$login" \
        --data-urlencode "query=count({__name__=\"$metric\"})" \
        --data-urlencode "time=$now" \
        "$url/api/v1/query" \
    | jq -r ".data.result[0].value[1]")
    echo "$count $metric"
  done
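
The loop above prints its results unsorted, and it prints null as the count for any metric name that is in the index but has no series at the chosen time, because the count() query then returns an empty result. As a sketch, the hypothetical filter below (active_only is not an existing tool) drops those lines and sorts the rest with the largest first. The sample lines are made up for illustration.

```shell
# active_only
# Hypothetical filter: reads "<count> <metric>" lines on stdin, keeps only
# lines whose count is a plain number (dropping "null"), and sorts them in
# descending numeric order.
active_only() {
  awk '$1 ~ /^[0-9]+$/' | sort -rn
}

# Made-up sample output of the loop, for illustration only.
printf '%s\n' 'null old_metric' '12 up' '340 http_requests_total' | active_only
```

To use it with the loop above, pipe the loop's output into it by appending | active_only after done.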

You can also list the series that belong to a specific metric. Set metric to the metric name first:

now=$(date +%s)
metric="<metric name>"
curl -s \
    -u "$login" \
    --data-urlencode "query=$metric" \
    --data-urlencode "time=$now" \
    "$url/api/v1/query" \
| jq -c ".data.result[].metric"