
Apply recommended rules for metrics aggregation

The aggregations service provides a way for you to aggregate metrics into lower-cardinality versions of themselves. You can define and apply your own aggregation rules, or apply the rules recommended by the recommendations service.

Aggregation rule format

The aggregations service expects the following format:

  • metric (string): The metric name or metric name matcher to which the aggregation rule applies.
  • match_type (string, optional): The type of matching to perform against the value of the metric field. For valid values, see Substring matchers. If you do not specify match_type, the value is exact.
  • drop (bool, optional): If set to true, the entire metric is dropped instead of aggregated. If you set this to true, you cannot use the drop_labels and aggregations fields. If you do not specify drop, the value is false.
  • drop_labels (string array): The list of labels to aggregate away; these labels are not present in the aggregated metric. You can specify either drop_labels or keep_labels, but not both in the same rule.
  • keep_labels (string array): The list of labels to retain; all labels not in the list are dropped. You can specify either keep_labels or drop_labels, but not both in the same rule.
  • aggregations (string array): The list of aggregation functions to apply to the metric or metrics matched by this rule. For valid values, see Supported aggregation types.
  • aggregation_interval (string duration, optional): The interval of samples that are included in a single emitted aggregated sample. For valid values, see Configure the aggregation interval. If you set aggregation_interval, you must also set the aggregation_delay field.
  • aggregation_delay (string duration, optional): The delay after which an aggregated sample is emitted; samples that arrive during this window are still included in the aggregation. For valid values, see Configure the aggregation interval. If you set aggregation_delay, you must also set the aggregation_interval field.

The following example shows an aggregation rule for the metric proxy_sql_queries_total:

json
{
  "metric": "proxy_sql_queries_total",
  "drop_labels": ["container", "instance", "namespace", "pod"],
  "aggregations": ["sum:counter"]
}

Supported aggregation types

The following values are supported for the aggregations field of an aggregation rule:

  • sum:counter: The running sum of all increases of raw series values. Applicable to counter type metrics, and correctly accounts for counter resets. A counter type metric is conceptually similar to elevation gain. For example, if a cyclist counts their elevation gain by peak, they can sum several peaks’ worth of elevation gain to understand how much they’ve climbed in total. The elevation gain for each peak over time is a raw series. If you specify the sum:counter aggregation with "drop_labels": ["peak"] for this metric, the per-peak raw series are aggregated into one series that tells the cyclist the total amount they climbed over time. From this aggregated data, they can no longer tell how much they have climbed for a given peak.
  • sum: The sum of all values across the aggregated series at a given timestamp. The sum aggregation is not useful for counter type metrics; for those, use sum:counter instead.
  • min: The minimum of all values across the aggregated series at a given timestamp.
  • max: The maximum of all values across the aggregated series at a given timestamp.
  • count: The number of raw series that feed into the aggregated series at a given timestamp.
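Conceptually, sum:counter first computes the increase of each raw series, accounting for resets, and then sums those per-series increases. The following Python sketch illustrates the idea only; it is not the service’s actual implementation:

```python
def counter_increases(samples):
    """Total increase of a single counter series, accounting for resets.

    When a counter resets (the value drops), the new value itself is the
    increase since the reset, mirroring how PromQL treats counter resets.
    """
    total = 0.0
    prev = None
    for value in samples:
        if prev is None:
            pass  # the first sample only establishes the baseline
        elif value >= prev:
            total += value - prev
        else:
            total += value  # counter reset: counter restarted from zero
        prev = value
    return total


def sum_counter(series):
    """sum:counter across raw series: the sum of per-series increases."""
    return sum(counter_increases(s) for s in series)


# Two raw series (e.g. per-peak elevation gain); the second one resets.
print(sum_counter([[0, 5, 9], [0, 4, 1]]))  # prints 14.0
```

In the second series, the drop from 4 to 1 is treated as a reset, so its total increase is 4 + 1 = 5 rather than 1.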

Substring matchers

By default, a rule is applied to the metric name specified in the rule’s metric field. In addition, Adaptive Metrics allows you to write rules that apply to all metrics whose names match a given prefix or suffix. To apply rules to all such metrics, use the optional field match_type in your rule and set it to prefix or suffix.

The match_type field supports the following values:

  • exact: Apply the rule to the metric whose name is specified in the rule’s metric field. Because metric names are unique, the rule applies to at most one metric.
  • prefix: Apply the rule to all metrics whose names start with the string in the rule’s metric field.
  • suffix: Apply the rule to all metrics whose names end with the string in the rule’s metric field.

An example rule that matches all metrics beginning with http_requests_total_, and that aggregates away their instance label using the sum:counter function, looks as follows:

json
{
  "metric": "http_requests_total_",
  "match_type": "prefix",
  "drop_labels": ["instance"],
  "aggregations": ["sum:counter"]
}

In the following configuration, the metric http_requests_total_abc is matched by two rules. However, because an exact match takes precedence over a prefix match, both the instance and pod labels are aggregated away for http_requests_total_abc:

json
[
  {
    "metric": "http_requests_total_",
    "match_type": "prefix",
    "drop_labels": ["instance"],
    "aggregations": ["sum:counter"]
  },
  {
    "metric": "http_requests_total_abc",
    "drop_labels": ["instance", "pod"],
    "aggregations": ["sum:counter"]
  }
]

If multiple substring matchers match a metric, the first match always wins. Consider a rule file with the following two rules:

json
[
  {
    "metric": "http_requests_total_",
    "match_type": "prefix",
    "drop_labels": ["instance"],
    "aggregations": ["sum:counter"]
  },
  {
    "metric": "_abc",
    "match_type": "suffix",
    "drop_labels": ["pod"],
    "aggregations": ["sum:counter"]
  }
]

In this scenario, the metric http_requests_total_abc is matched by both rules. Because neither rule is an exact match, the first rule in the list takes precedence. This means that the instance label, not the pod label, is aggregated away for http_requests_total_abc.
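The precedence logic described above can be sketched in Python. The select_rule function below is purely illustrative and not part of the Adaptive Metrics API:

```python
def select_rule(metric_name, rules):
    """Pick the rule that applies to a metric name.

    An exact match always wins; among prefix and suffix matchers,
    the first matching rule in the list wins.
    """
    first_substring_match = None
    for rule in rules:
        match_type = rule.get("match_type", "exact")  # exact is the default
        if match_type == "exact":
            if rule["metric"] == metric_name:
                return rule  # exact match takes precedence over everything
        elif match_type == "prefix":
            if metric_name.startswith(rule["metric"]) and first_substring_match is None:
                first_substring_match = rule
        elif match_type == "suffix":
            if metric_name.endswith(rule["metric"]) and first_substring_match is None:
                first_substring_match = rule
    return first_substring_match


rules = [
    {"metric": "http_requests_total_", "match_type": "prefix", "drop_labels": ["instance"]},
    {"metric": "_abc", "match_type": "suffix", "drop_labels": ["pod"]},
]

# Neither rule is exact, so the first substring match wins:
print(select_rule("http_requests_total_abc", rules)["drop_labels"])  # prints ['instance']
```

Appending an exact rule for http_requests_total_abc to this list would make select_rule return that rule instead, regardless of its position.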

Configure an aggregation

As an illustration, think of a power grid that monitors the energy consumption of houses on different city streets. An example metric that expresses building consumption could be electrical_throughput_total with labels street_name and building_number. Given that you only care about the total energy consumption per street and the average consumption per building on a street, you could configure two aggregations where one sums the consumption of all buildings in a street and the other counts the buildings of the street.

Because the metric electrical_throughput_total is a counter, use the sum:counter aggregation (instead of the sum aggregation) to handle counter resets correctly:

json
{
  "metric": "electrical_throughput_total",
  "drop_labels": ["building_number"],
  "aggregations": ["sum:counter", "count"]
}

Based on the preceding configuration, the aggregation service would discard the label building_number from the aggregated metric electrical_throughput_total. In its place, it would compute and store aggregated values per street for this metric.

The sum:counter aggregation function computes the total electrical throughput of every street in the street_name label set. The count aggregation function computes the count of buildings per street. These two values can be used to compute an average consumption per building for each street.
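To see how the two aggregations combine, the following Python sketch simulates the sum and count at a single timestamp using made-up values (for simplicity, it ignores the counter-reset handling that sum:counter performs across time):

```python
from collections import defaultdict

# Hypothetical raw samples at one timestamp:
# (street_name, building_number) -> throughput value.
raw = {
    ("elm_st", "1"): 10.0,
    ("elm_st", "2"): 14.0,
    ("oak_ave", "7"): 9.0,
}

# Dropping building_number leaves street_name as the aggregation key.
sums = defaultdict(float)
counts = defaultdict(int)
for (street, _building), value in raw.items():
    sums[street] += value   # the "sum" aggregate per street
    counts[street] += 1     # the "count" aggregate per street

# Average consumption per building, derived from the two aggregates.
for street in sums:
    print(street, sums[street] / counts[street])
```

For elm_st this yields a sum of 24.0 across 2 buildings, so the derived average is 12.0 per building.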

However, because the building_number label has been discarded, it is no longer possible to understand how much power a specific building consumes.

Limits on the aggregation service

The Adaptive Metrics aggregation service enforces limits on the number of series that can be aggregated. If this limit is exceeded, the aggregation service begins to discard incoming samples.

When this happens, you will see an increase in aggregator-too-many-aggregated-series errors in the Discarded Metrics Samples panel of your billing dashboard.

If you are hitting this limit and would like to request an increase, contact Grafana Labs Support.

Drop a metric

You can also configure an aggregation rule that drops the entire metric. If you don’t want to persist any time series at all for the electrical_throughput_total metric from the example in Configure an aggregation, configure a rule as follows:

json
{
  "metric": "electrical_throughput_total",
  "drop": true
}

This might be useful in cases where a metric originates in many different locations and it would be hard to configure every site of origin to drop the metric on the client side.

Note: Generally, aggregation is preferable to dropping a metric entirely. By aggregating a metric, you can usually reduce its cardinality by 80-90% while keeping a lower-fidelity version of it in the database, which can be useful when investigating an incident. If you drop a metric, you reduce costs slightly more, but you eliminate all traces of the metric; it no longer appears in the metric-name browser in Grafana Explore.

Configure the aggregation interval and the DPM of the aggregated metric

The number of data points per minute (DPM) that are stored for the aggregated metric depends directly on the aggregation interval of the metric, which is the interval at which the aggregated samples are emitted.

To configure the aggregation interval per aggregation rule, you can specify the following two optional parameters:

  • aggregation_interval is the interval at which the aggregated samples are emitted. The default value is 30s, which results in a DPM of 2. The valid values for aggregation_interval are: 15s, 30s and 60s.
  • aggregation_delay is the delay after which the aggregated samples are emitted. The default value is 90s. The valid values for aggregation_delay are: 15s, 30s, 60s, 1m30s, 2m, 2m30s, and 3m.

Either set both fields or leave them both empty.

If you want to reduce the DPM of the aggregated metric to 1, set the aggregation_interval to 1m. You do not need to change the aggregation_delay setting for that, but you need to specify it explicitly in the configuration. Set aggregation_delay to 90s to keep the default value.

You can decrease the aggregation_delay to emit the aggregated samples earlier, at the risk of excluding samples that arrive late (because of a lagging remote-write client, for example). The total delay between the time a raw sample arrives at Grafana Cloud and the time the aggregated sample becomes queryable is the sum of the aggregation_interval and the aggregation_delay.
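The relationship between the interval, the resulting DPM, and the total delay reduces to simple arithmetic, as this illustrative Python sketch shows:

```python
def dpm(aggregation_interval_seconds):
    """Data points per minute for a given aggregation interval."""
    return 60 / aggregation_interval_seconds


def total_delay(interval_seconds, delay_seconds):
    """Time from a raw sample arriving at Grafana Cloud to the
    aggregated sample becoming queryable: interval + delay."""
    return interval_seconds + delay_seconds


print(dpm(30))              # prints 2.0 (the default 30s interval)
print(dpm(60))              # prints 1.0 (reduce DPM to 1 with a 1m interval)
print(total_delay(60, 90))  # prints 150 (1m interval + default 90s delay)
```

With a 1m interval and the default 90s delay, an aggregated sample becomes queryable up to 150 seconds after its raw samples arrive.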

Note: You can set global values for aggregation_interval and aggregation_delay that will apply by default to all aggregation rules as a configuration option for your metrics instance. Open a support ticket in the Cloud Portal to request this.