Analyzing Grafana Cloud Graphite usage


This page explains how to:

  • Find the most commonly used metric prefixes or patterns among your metrics.
  • Find the metrics with the highest cardinality (distinct combinations of tag values).

Knowing this can help you reduce your Grafana Cloud Graphite usage.

The “Grafana Cloud Billing/Usage” dashboard reports the total number of active time series.

You might want to find out what is driving this number up. Often the largest groups of metrics have the same prefix, the same pattern, or the same name (if you use tags to differentiate metrics).

There are several methods for investigation.

The following scripts require you to authenticate with grafana.com credentials. You can find the instance ID and the URL on your Grafana Cloud details page in the Grafana Cloud Portal.

Analysis of a metrics listing

Below, we explain two approaches to obtain a metrics listing, which is a text file with one line per metric. For the analysis examples below, we assume we have a file metrics-index.txt with these contents:

stats.count.app1._root.visits
stats.timers.app1._root.timer.mean
stats.timers.app1._root.timer.lower
stats.timers.app1._root.timer.upper
stats.timers.app1._root.timer.upper_99
servers.dc1.server1.disk_free
servers.dc1.server2.disk_free
servers.dc1.server3.disk_free

Breakdown per prefix

Graphite users tend to classify their metrics by a prefix, which is typically a string of two or more “nodes” (dot-separated strings of the metric name). In this example, we get an overview of the different prefixes used, along with their counts.

cut -f1,2 -d. metrics-index.txt | sort | uniq -c | sort -n
      1 stats.count
      3 servers.dc1
      4 stats.timers

This shows at a glance which prefixes are the most common. In this example, we can see that half of our metrics (4 of 8) are statsd timers.
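The same breakdown can be sketched in Python, which makes it easy to vary the prefix depth. This is a minimal sketch, not part of the Grafana tooling: the function name prefix_counts and the depth parameter are illustrative assumptions, and the sample list simply repeats the metrics-index.txt contents shown above.

```python
from collections import Counter

def prefix_counts(metric_names, depth=2):
    """Count metrics per prefix made of the first `depth` dot-separated nodes."""
    counts = Counter()
    for name in metric_names:
        name = name.strip()
        if not name:
            continue  # ignore blank lines in the listing
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += 1
    return counts

# Same contents as the example metrics-index.txt.
SAMPLE = [
    "stats.count.app1._root.visits",
    "stats.timers.app1._root.timer.mean",
    "stats.timers.app1._root.timer.lower",
    "stats.timers.app1._root.timer.upper",
    "stats.timers.app1._root.timer.upper_99",
    "servers.dc1.server1.disk_free",
    "servers.dc1.server2.disk_free",
    "servers.dc1.server3.disk_free",
]

if __name__ == "__main__":
    counts = prefix_counts(SAMPLE)
    total = sum(counts.values())
    # Smallest first, like `sort -n` in the shell pipeline.
    for prefix, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{n:7d} {prefix} ({n / total:.0%})")
```

To analyze a real listing, you could pass an open file instead of SAMPLE, for example prefix_counts(open("metrics-index.txt")).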

Estimating savings when aggregating metrics

In this example, we simulate the effect of introducing an aggregation. Each set of metrics that differ only in the server name is reduced to a single metric per datacenter that covers all servers. For our example listing, this reduces the set of metrics by 25% (from 8 to 6).

$ wc -l metrics-index.txt
8 metrics-index.txt
$ sed 's/^servers\.\(dc[0-9]\+\)\.server[0-9]\+\.*/servers.\1.servers_total./' metrics-index.txt | sort | uniq | wc -l
6
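The same estimate can be sketched in Python, which also reports the relative saving. This is only an illustration under assumptions: the helper names simulate_aggregation and reduction are hypothetical, the regular expression is a Python rendering of the sed pattern above (with a single literal dot after the server node), and the list repeats the example metrics-index.txt.

```python
import re

# Collapse the server node into one aggregate node per datacenter.
SERVER_NODE = re.compile(r"^servers\.(dc[0-9]+)\.server[0-9]+\.")

def simulate_aggregation(metric_names):
    """Return the set of metric names that would remain after aggregating."""
    return {SERVER_NODE.sub(r"servers.\1.servers_total.", n) for n in metric_names}

def reduction(metric_names):
    """Return (count before, count after, fraction saved)."""
    before = len(set(metric_names))
    after = len(simulate_aggregation(metric_names))
    saved = (before - after) / before if before else 0.0
    return before, after, saved

# Same contents as the example metrics-index.txt.
METRICS = [
    "stats.count.app1._root.visits",
    "stats.timers.app1._root.timer.mean",
    "stats.timers.app1._root.timer.lower",
    "stats.timers.app1._root.timer.upper",
    "stats.timers.app1._root.timer.upper_99",
    "servers.dc1.server1.disk_free",
    "servers.dc1.server2.disk_free",
    "servers.dc1.server3.disk_free",
]

if __name__ == "__main__":
    before, after, saved = reduction(METRICS)
    print(f"{before} -> {after} metrics ({saved:.0%} fewer)")
```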

Manual drilldown with the Metrics Find API

The Find API can be used to explore the subtree matching a certain query. This can be a useful first step in diagnosis.

You can manually query the /metrics/find endpoint which is documented in Finding metrics.

Note: This /metrics/find endpoint is for untagged metrics. Tagged metrics are neither matched nor returned.

Example:

curl -u "${USER_ID}:${API_KEY}" \
  -s -G \
  --data-urlencode "query=stats.*" \
  "${GRAPHITE_ENDPOINT}/metrics/find/" \
| jq -r '.[].text'
...
count
timers

The result shows the possible children. Drill down by modifying the query string and appending .* or any other supported pattern, for example query=stats.count.*

Automating find queries using the walk_metrics.py script

The walk_metrics.py script automates the process of recursively calling the /metrics/find API. It explores an entire hierarchy under a given root prefix (which might be "" to cover the entire metrics space).

The output is a metrics listing as described above: a list of all the metric names seen under the provided prefix, one per line. The script is multi-threaded, so the output is not perfectly sorted; you might need to pipe it through the sort utility.

Scanning through the list visually might make certain patterns obvious, but various kinds of precise analysis can be done using nothing more than standard shell tools, as shown in the section above.

Note:

  • This does not work when Graphite tags are used.
  • If there are hundreds of thousands of series, then the script might take over an hour to finish.
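The recursive idea behind such a walk can be sketched as follows. This is not the walk_metrics.py implementation: it is single-threaded, and it assumes that each node returned by /metrics/find carries an id and a leaf field, as in Graphite's default tree JSON response. The names walk, http_find, fake_find, and the stats.up metric are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request

def walk(find, query="*"):
    """Yield every leaf metric name reachable from `query`.

    `find` is any callable that takes a query string and returns the
    decoded JSON list from /metrics/find.
    """
    for node in find(query):
        if node.get("leaf"):
            yield node["id"]
        else:
            # Branch node: descend one level.
            yield from walk(find, node["id"] + ".*")

def http_find(endpoint, user, token):
    """Build a `find` callable that queries a real endpoint with Basic Auth."""
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()

    def find(query):
        url = f"{endpoint}/metrics/find?" + urllib.parse.urlencode({"query": query})
        req = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    return find

# A tiny in-memory tree standing in for the API, for illustration only.
TREE = {
    "*": [{"id": "stats", "leaf": 0}],
    "stats.*": [{"id": "stats.count", "leaf": 0}, {"id": "stats.up", "leaf": 1}],
    "stats.count.*": [{"id": "stats.count.visits", "leaf": 1}],
}

def fake_find(query):
    return TREE.get(query, [])

if __name__ == "__main__":
    for name in walk(fake_find):
        print(name)
```

Passing the lookup in as a callable keeps the traversal logic testable without network access; swapping fake_find for http_find(...) would run the same walk against a live endpoint.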

Installation:

mkdir walk_metrics
cd walk_metrics
wget https://raw.githubusercontent.com/grafana/cloud-graphite-scripts/master/query/walk_metrics.py
chmod +x walk_metrics.py

The script requires the requests library. You can install it in a virtualenv, but that is outside the scope of this documentation. Alternatively, you can install it system-wide:

  • sudo dnf install python3-requests (RedHat based distributions)
  • sudo apt install python3-requests (Debian based distributions)

Using the walk_metrics.py script:

usage: walk_metrics.py [-h] --url URL [--prefix PREFIX] [--user USER] [--password PASSWORD] [--concurrency CONCURRENCY] [--from SERIESFROM] [--depth DEPTH]

optional arguments:
  -h, --help            show this help message and exit
  --url URL             Graphite URL
  --prefix PREFIX       Metrics prefix
  --user USER           Basic Auth username
  --password PASSWORD   Basic Auth password
  --concurrency CONCURRENCY
                        Concurrency
  --from SERIESFROM     Only get series that have been active since this time.
  --depth DEPTH         Maximum depth to traverse. If set, then the branches at the depth are printed.

Example of using the walk_metrics.py script:

./walk_metrics.py \
  --url https://graphite-us-central1.grafana.net/graphite \
  --user <user> \
  --password <API Token> \
  --from=-1w \
| tee metrics-index.txt

# Optionally:
sort metrics-index.txt > metrics-index-sorted.txt

The --from parameter is documented in Finding metrics.

The countSeries() function

If you know which metrics you need to monitor, you can use the countSeries() function in Graphite’s query language to count the number of nodes found in a seriesList.

Note: This function does not resolve the pattern recursively. For example, countSeries(foo.*) would take into account foo.bar but not foo.x.y.z. This is why people sometimes query countSeries(foo.*)&countSeries(foo.*.*)&countSeries(foo.*.*.*). If the backend returns an error saying that you’re trying to query too many series at once, try one of the other approaches on this page.

For more information, refer to the countSeries documentation.

Note: This also works for tagged metrics. If you have these metrics:

foo.bar;t=v1
foo.bar;t=v2

Then countSeries(seriesByTag('name=foo.bar')) will return 2.
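Because countSeries() does not recurse, one query expression per depth is needed for untagged metrics. The sketch below generates those expressions and, given a metrics listing, shows how many series sit at each depth, which indicates how many wildcard levels are worth querying. The function names are hypothetical, and the listing is an excerpt of the example metrics-index.txt.

```python
from collections import Counter

def count_series_targets(prefix, max_depth):
    """Build one countSeries() expression per wildcard depth below `prefix`."""
    return [f"countSeries({prefix}{'.*' * depth})" for depth in range(1, max_depth + 1)]

def series_per_depth(metric_names, prefix):
    """Count series by the number of nodes they have below `prefix`."""
    head = prefix + "."
    return Counter(
        name[len(head):].count(".") + 1
        for name in metric_names
        if name.startswith(head)
    )

# Excerpt of the example metrics-index.txt.
LISTING = [
    "stats.count.app1._root.visits",
    "stats.timers.app1._root.timer.mean",
    "stats.timers.app1._root.timer.lower",
    "stats.timers.app1._root.timer.upper",
    "stats.timers.app1._root.timer.upper_99",
]

if __name__ == "__main__":
    for target in count_series_targets("foo", 3):
        print(target)
    print(dict(series_per_depth(LISTING, "stats")))
```

For the excerpt, all series under stats sit four or five nodes below the prefix, so shallower patterns such as stats.* would match only branches.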

Measuring cardinality via carbon-relay-ng

All of the above methods query the API of your Grafana Cloud Graphite service to obtain insights. You can also get insights from carbon-relay-ng itself. Carbon-relay-ng is the agent typically used to send data to Grafana Cloud Graphite, and you can use its features to analyze the cardinality of the metrics traffic passing through it.

Setting up aggregations to capture insights for specific series

This is done by leveraging the aggregator functionality.

Let’s say you have metrics in a format like this flowing into your carbon-relay-ng, and you would like to know how many metrics per datacenter (dc) are seen during each 10-second interval.

servers.dc1.foo 123 1599854045
servers.dc1.bar 123 1599854046
servers.dc2.foo 123 1599854045

This can easily be achieved with a count aggregation like so:

[[aggregation]]
# count how many metrics are seen each 10s, broken down by dc
function = 'count'
regex = '^servers\.(dc[0-9]+)\..*'
format = 'aggregate_count.servers.$1'
interval = 10
wait = 10

This causes the relay to emit time series such as aggregate_count.servers.dc1 and aggregate_count.servers.dc2 that measure, at each point in time, how many metrics (points) are seen for each datacenter. This is not quite the same as counting active series, but it is a good proxy measure, especially if all metrics are sent at the same interval.

Note: The wait parameter is important. It should be set to the max time delay expected in the data. See the carbon-relay-ng aggregator documentation for more info.
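To build intuition for what a count aggregation produces, the following sketch mimics the bucketing in plain Python. It is a simplified model, not carbon-relay-ng's implementation: it ignores the wait parameter, labels each bucket by the start of its interval (an assumption about timestamp alignment), and supports only a single $1 capture group. The function name count_aggregate is hypothetical.

```python
import re
from collections import Counter

def count_aggregate(lines, regex, fmt, interval):
    """Count points per (output key, interval bucket)."""
    pattern = re.compile(regex)
    counts = Counter()
    for line in lines:
        name, _value, ts = line.split()
        match = pattern.match(name)
        if not match:
            continue  # metric is not covered by this aggregation
        key = fmt.replace("$1", match.group(1))
        # Assumption: buckets are labelled by the start of the interval.
        bucket = int(ts) - int(ts) % interval
        counts[(key, bucket)] += 1
    return counts

# The example points shown above.
POINTS = [
    "servers.dc1.foo 123 1599854045",
    "servers.dc1.bar 123 1599854046",
    "servers.dc2.foo 123 1599854045",
]

if __name__ == "__main__":
    result = count_aggregate(
        POINTS,
        regex=r"^servers\.(dc[0-9]+)\..*",
        fmt="aggregate_count.servers.$1",
        interval=10,
    )
    for (key, bucket), n in sorted(result.items()):
        print(key, n, bucket)
```

With a 10-second interval, the two dc1 points fall into the same bucket and produce a count of 2, while dc2 produces a count of 1.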

Deriving insights for existing aggregations

Each aggregator defined in carbon-relay-ng emits interesting metrics that give you good clues into the volume (and the reduction of volume) of data they process.

Note: These metrics pertain to the entire aggregator and are not segmented by output key as in the example above.

service_is_carbon-relay-ng.instance_is_$instance.mtype_is_counter.unit_is_Metric.direction_is_in.aggregator_is_*
service_is_carbon-relay-ng.instance_is_$instance.mtype_is_counter.unit_is_Metric.direction_is_out.aggregator_is_*

The meanings are quite simple: the number of points going into an aggregator, and the number of points flushed out of it. These are counters, so use perSecond() to see the rate per second.
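As a rough illustration of what a per-second rate over a counter means, the sketch below derives rates from consecutive samples and compares the in and out rates of one aggregator. It only loosely mirrors perSecond(): the handling of counter resets (returning None) is an assumption of this sketch, and the sample values are invented for illustration.

```python
def per_second(samples):
    """Turn (timestamp, counter_value) samples, sorted by time, into rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or v1 < v0:
            rates.append((t1, None))  # bad spacing or counter reset
        else:
            rates.append((t1, (v1 - v0) / elapsed))
    return rates

# Invented counter samples for one aggregator, 10 seconds apart.
POINTS_IN = [(0, 0), (10, 3000), (20, 6000)]
POINTS_OUT = [(0, 0), (10, 20), (20, 40)]

if __name__ == "__main__":
    rate_in = per_second(POINTS_IN)[-1][1]
    rate_out = per_second(POINTS_OUT)[-1][1]
    print(f"in: {rate_in}/s, out: {rate_out}/s, ratio: {rate_in / rate_out:.0f}:1")
```

Comparing the two rates gives a sense of how much the aggregator reduces the volume of points.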

Exploration methods that only work for tagged metrics

Many of the approaches above also work for tagged metrics, but a couple of additional API endpoints are available to drill into tags specifically.

FindSeries API

/tags/findSeries is similar to /metrics/find, except that for a given query it returns the nameWithTags of every matching metric. This way you can find out how many metrics match a given combination of tags.

curl -s -u "${USER_ID}:${API_KEY}" "${GRAPHITE_ENDPOINT}/tags/findSeries?expr=os=ubuntu"
["foo.bar;os=ubuntu;tag1=tag2","foo.bar;os=ubuntu;tag1=tag1"]
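Once you have such a response, counting series per tag value takes only a few lines. The sketch below parses the name;tag=value format shown above; the function names parse_series and count_by_tag are hypothetical, and the input is the example response.

```python
import json
from collections import Counter

def parse_series(name_with_tags):
    """Split 'name;tag1=v1;tag2=v2' into (name, {tag: value})."""
    name, *pairs = name_with_tags.split(";")
    tags = dict(pair.split("=", 1) for pair in pairs)
    return name, tags

def count_by_tag(series, tag):
    """Count series per value of `tag` (None if a series lacks the tag)."""
    return Counter(parse_series(s)[1].get(tag) for s in series)

# The example response shown above.
RESPONSE = '["foo.bar;os=ubuntu;tag1=tag2","foo.bar;os=ubuntu;tag1=tag1"]'

if __name__ == "__main__":
    series = json.loads(RESPONSE)
    print(len(series), "matching series")
    print(dict(count_by_tag(series, "tag1")))
```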

Tag value counts using /tags/terms

The /tags/terms endpoint returns, for a given query, a breakdown of the tag values seen along with the number of series for each value.

Example:

curl -u ${USER_ID}:${API_KEY} "$GRAPHITE_ENDPOINT/tags/terms?expr=datacenter=dc1&expr=server=web01&tags=rack"

{
  "totalSeries": 5892,
  "terms": {
    "rack": {
      "a1": 2480,
      "a2": 465,
      "b1": 2480,
      "b2": 467
    }
  }
}
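A response like this is easiest to interpret as shares of the total. The following sketch ranks the values of one tag by series count; the function name term_shares is hypothetical, and the input is the example response above.

```python
def term_shares(response, tag):
    """Return (value, series count, share of total) rows, largest first."""
    total = response["totalSeries"]
    terms = response["terms"][tag]
    rows = [(value, count, count / total) for value, count in terms.items()]
    return sorted(rows, key=lambda row: -row[1])

# The example /tags/terms response shown above.
RESPONSE = {
    "totalSeries": 5892,
    "terms": {
        "rack": {"a1": 2480, "a2": 465, "b1": 2480, "b2": 467},
    },
}

if __name__ == "__main__":
    for value, count, share in term_shares(RESPONSE, "rack"):
        print(f"{value:4s} {count:6d} {share:6.1%}")
```

In the example, racks a1 and b1 together account for 4960 of the 5892 series, so they are the natural place to start looking for savings.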