Reducing Graphite metrics usage
Once you have identified excess active metrics sent to Grafana Cloud Graphite, you typically want to reduce their volume.
There are various ways to achieve this:
- Stop sending them to carbon-relay-ng. This is typically cumbersome because monitoring agents such as collectd and diamond have limited options, usually a matter of completely turning off a plugin, and statsd implementations typically have no option to drop traffic. The other downside of this approach is that if the traffic is also sent to destinations other than Grafana Cloud, they will see the reduction in data too.
- Use the carbon-relay-ng blacklist, a simple list of regular expressions applied to your metrics. Any metric matching an expression is dropped. This works on both tagged and untagged metrics. If your carbon-relay-ng sends traffic to multiple routes, none of them will see the dropped traffic. (There is also a validation step that lets you drop data if, for example, it contains characters known to be problematic in Graphite. This is a very coarse way to drop traffic, but worth mentioning.)
- Carbon-relay-ng routes all have configuration to filter which traffic they accept. Substrings, prefixes, and regular expressions can be used to select only the desired metrics (but only one setting of each kind per route).
- Finally, a popular option is to use Carbon-relay-ng aggregators which reduce one or more sets of metrics to a single metric (per set). The config documentation lists a few examples but for completeness we will describe an example below.
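To make the blacklist behavior from the list above concrete, here is a minimal Python sketch of the idea: a metric is dropped as soon as any expression in the list matches its name. This is not carbon-relay-ng code or configuration. The patterns, the `is_dropped` and `filter_lines` helpers, and the `app.debug.requests` metric are hypothetical, and unanchored matching is an assumption.

```python
import re

# Hypothetical patterns; real blacklist entries depend on your metric names.
BLACKLIST = [
    re.compile(r"^servers\.dc[0-9]+\.server[0-9]+\.cpu\.core[0-9]+\.idle$"),
    re.compile(r"\.debug\."),
]

def is_dropped(metric_name: str) -> bool:
    """Return True if any blacklist expression matches the metric name."""
    # Assumption: expressions are unanchored unless they use ^ or $ themselves.
    return any(pattern.search(metric_name) for pattern in BLACKLIST)

def filter_lines(lines):
    """Keep only lines whose metric name (the first field) is not blacklisted."""
    return [line for line in lines if not is_dropped(line.split()[0])]

lines = [
    "servers.dc1.server1.cpu.core0.idle 10 1600255845",
    "servers.dc1.server1.cpu.core0.user 90 1600255845",
    "app.debug.requests 3 1600255845",
]
kept = filter_lines(lines)
for line in kept:
    print(line)
```

Testing candidate expressions against a sample of real metric names in this way can help confirm that a pattern drops only what you intend before it goes into the relay configuration.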
As an aggregation example, suppose you have the following metrics:
    servers.dc1.server1.cpu.core0.idle 10 1600255845
    servers.dc1.server1.cpu.core0.user 90 1600255845
    servers.dc1.server1.cpu.core1.idle 30 1600255845
    servers.dc1.server1.cpu.core1.user 70 1600255845
    servers.dc1.server2.cpu.core0.idle 5 1600255845
    servers.dc1.server2.cpu.core0.user 95 1600255845
    servers.dc1.server2.cpu.core1.idle 15 1600255845
    servers.dc1.server2.cpu.core1.user 85 1600255845
A common aggregation strategy would be to aggregate the per-core metrics to per-machine metrics, like so:
    [[aggregation]]
    function = 'avg'
    prefix = 'servers.dc'
    regex = '^servers\.(dc[0-9]+)\.(server[0-9]+)\.cpu\.core[0-9]+\.(.*)'
    format = 'servers.$1.$2.cpu.all_cores.$3'
    dropRaw = true
    interval = 5
    wait = 10
This will reduce the input to the following output:
    servers.dc1.server1.cpu.all_cores.user 80.000000 1600255845
    servers.dc1.server2.cpu.all_cores.idle 10.000000 1600255845
    servers.dc1.server2.cpu.all_cores.user 90.000000 1600255845
    servers.dc1.server1.cpu.all_cores.idle 20.000000 1600255845
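To see where these numbers come from, the following Python sketch emulates the `avg` aggregation offline: it applies the regex to each metric name, fills the `$1`, `$2`, ... placeholders of the format string with the captured groups, and averages all values that end up under the same output name and timestamp. It illustrates the arithmetic only and is not how carbon-relay-ng is implemented. The `aggregate_avg` helper is hypothetical, and interval bucketing and the wait period are ignored.

```python
import re
from collections import defaultdict

LINES = """\
servers.dc1.server1.cpu.core0.idle 10 1600255845
servers.dc1.server1.cpu.core0.user 90 1600255845
servers.dc1.server1.cpu.core1.idle 30 1600255845
servers.dc1.server1.cpu.core1.user 70 1600255845
servers.dc1.server2.cpu.core0.idle 5 1600255845
servers.dc1.server2.cpu.core0.user 95 1600255845
servers.dc1.server2.cpu.core1.idle 15 1600255845
servers.dc1.server2.cpu.core1.user 85 1600255845
""".splitlines()

REGEX = r"^servers\.(dc[0-9]+)\.(server[0-9]+)\.cpu\.core[0-9]+\.(.*)"
FORMAT = "servers.$1.$2.cpu.all_cores.$3"

def aggregate_avg(lines, regex, fmt):
    """Average the values of all metrics that rewrite to the same output name."""
    pattern = re.compile(regex)
    groups = defaultdict(list)
    for line in lines:
        name, value, timestamp = line.split()
        match = pattern.match(name)
        if match is None:
            continue  # this metric is not part of the aggregation
        # Replace each $N placeholder with the text of capture group N.
        out_name = re.sub(r"\$(\d+)", lambda ph: match.group(int(ph.group(1))), fmt)
        groups[(out_name, int(timestamp))].append(float(value))
    return {key: sum(values) / len(values) for key, values in groups.items()}

result = aggregate_avg(LINES, REGEX, FORMAT)
for (name, timestamp), average in sorted(result.items()):
    print(f"{name} {average:.6f} {timestamp}")
```

For example, `server1` reports idle values of 10 and 30 on its two cores, which average to the 20.000000 shown above. The same helper can be used to try out other regex and format pairs against sample data before changing the relay configuration.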
By tweaking the regex and format, you can generate aggregates across all machines and their cores, resulting in only per-datacenter metrics, like so:
    [[aggregation]]
    function = 'avg'
    prefix = 'servers.dc'
    regex = '^servers\.(dc[0-9]+)\.server[0-9]+\.cpu\.core[0-9]+\.(.*)'
    format = 'servers.$1.all_machines.cpu.all_cores.$2'
    dropRaw = true
    interval = 5
    wait = 10
This results in:
    servers.dc1.all_machines.cpu.all_cores.idle 15.000000 1600255845
    servers.dc1.all_machines.cpu.all_cores.user 85.000000 1600255845
- The dropRaw parameter makes sure the original input data is discarded.
- The prefix parameter doesn’t add any filtering beyond the regular expression, but it improves performance. A cheap prefix check lets the relay skip the regular expression for metrics that cannot match and therefore don’t belong in the aggregator.
- The wait parameter is important. It should be set to the maximum delay expected in the data. See the carbon-relay-ng aggregator documentation for more info.
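The roles of these parameters can be sketched in a few lines of Python. This is a simplified model and not the relay's actual code: the helper names are hypothetical, and both the timestamp alignment in `bucket_start` and the flush rule in `can_flush` are assumptions about how `interval` and `wait` behave. Check the carbon-relay-ng aggregator documentation for the exact semantics.

```python
import re

PREFIX = "servers.dc"
REGEX = re.compile(r"^servers\.(dc[0-9]+)\.(server[0-9]+)\.cpu\.core[0-9]+\.(.*)")
DROP_RAW = True
INTERVAL = 5  # seconds
WAIT = 10     # seconds

def feeds_aggregator(name: str) -> bool:
    """Cheap prefix test first; the regex only runs when the prefix matches."""
    return name.startswith(PREFIX) and REGEX.match(name) is not None

def forward_raw(name: str) -> bool:
    """With dropRaw enabled, metrics consumed by the aggregator are not passed on."""
    return not (DROP_RAW and feeds_aggregator(name))

def bucket_start(timestamp: int) -> int:
    """Assumed behavior: align a timestamp down to the start of its interval."""
    return timestamp - timestamp % INTERVAL

def can_flush(bucket_ts: int, now: int) -> bool:
    """Assumed rule: a bucket is final once `wait` seconds have passed."""
    return now >= bucket_ts + WAIT

print(feeds_aggregator("servers.dc1.server1.cpu.core0.idle"))  # aggregated
print(forward_raw("apps.web.requests"))                        # unrelated metric passes through
print(bucket_start(1600255847))                                # aligned to the 5-second bucket
print(can_flush(1600255845, now=1600255850))                   # still waiting for late data
```

Under this model, a point that arrives more than `wait` seconds after its bucket's timestamp would miss the flush, which is why `wait` should cover the largest delay you expect in the data.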