Apply recommended rules for metrics aggregation
The aggregations service provides a way for you to aggregate metrics into lower-cardinality versions of themselves. Users can define and apply their own aggregation rules, or apply the rules recommended by the recommendations service.
Aggregation rule format
The aggregations service expects the following format:
Field name | Data type | Description |
---|---|---|
metric | string | The metric name or metric name matcher to which the aggregation rule applies. |
match_type | string (optional) | The type of matching to perform against the value of the metric field. For valid values, see Substring matchers. If you do not specify match_type, the value is exact. |
drop | bool (optional) | If set to true, the entire metric is dropped instead of aggregated. If you set this to true, you cannot use the drop_labels and aggregations fields. If you do not specify drop, the value is false. |
drop_labels | string array | The list of labels to aggregate away; these labels are not present in the aggregated metric. You can specify either drop_labels or keep_labels, but you can't use both fields within the same rule. |
keep_labels | string array | The list of labels to retain. All labels not specified in the list are dropped. You can specify either keep_labels or drop_labels, but you can't use both fields within the same rule. |
aggregations | string array | The list of aggregation functions to apply to the metric or metrics matched by this rule. For valid values, see Supported aggregation types. |
aggregation_interval | string duration (optional) | The interval of samples that are included in a single emitted aggregated sample. See Configure the aggregation interval for valid values. If you set aggregation_interval, you must also specify the aggregation_delay field. |
aggregation_delay | string duration (optional) | The delay after which the aggregated samples are emitted. See Configure the aggregation interval for valid values. If you set aggregation_delay, you must also specify the aggregation_interval field. |
The following example shows an aggregation rule for the metric proxy_sql_queries_total:
{
"metric": "proxy_sql_queries_total",
"drop_labels": ["container", "instance", "namespace", "pod"],
"aggregations": ["sum:counter"]
}
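To illustrate the mutual-exclusion constraints in the table above, a rule could be checked client-side with a helper like the following. This is a hypothetical sketch for clarity; the service performs its own validation:

```python
def validate_rule(rule: dict) -> None:
    """Check an aggregation rule against the field constraints
    described in the table above. Raises ValueError on violations."""
    if "metric" not in rule:
        raise ValueError("'metric' is required")
    if rule.get("match_type", "exact") not in ("exact", "prefix", "suffix"):
        raise ValueError("match_type must be exact, prefix, or suffix")
    if rule.get("drop", False):
        # drop: true cannot be combined with drop_labels, keep_labels,
        # or aggregations.
        if any(k in rule for k in ("drop_labels", "keep_labels", "aggregations")):
            raise ValueError("'drop: true' excludes label and aggregation fields")
    elif "drop_labels" in rule and "keep_labels" in rule:
        raise ValueError("specify either drop_labels or keep_labels, not both")
    # aggregation_interval and aggregation_delay must be set together.
    if ("aggregation_interval" in rule) != ("aggregation_delay" in rule):
        raise ValueError("set both aggregation_interval and aggregation_delay, or neither")
```

For example, the proxy_sql_queries_total rule above passes, while a rule that sets both drop_labels and keep_labels is rejected.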
Supported aggregation types
The following values are supported for the aggregations field of an aggregation rule:
Aggregation function | Definition |
---|---|
sum:counter | The running sum of all increases of raw series values. Applicable to counter type metrics, and correctly accounts for counter resets. A counter type metric is conceptually similar to elevation gain. For example, if a cyclist counts their elevation gain by peak, they can sum several peaks’ worth of elevation gain to understand how much they’ve climbed in total. The elevation gain for each peak over time is a raw series. If you specify the sum:counter aggregation with "drop_labels": ["peak"] for this metric, the per-peak raw series would be aggregated into one series that would tell the cyclist the total amount they climbed over time. From this aggregated data, they can no longer tell how much they have climbed in total for a given peak. |
sum | The sum of all values across the aggregated series at a given time stamp. The sum aggregation is not useful for counter type metrics; for counter type metrics, use sum:counter instead. |
min | The minimum of all values across all the aggregated series at a given time stamp. |
max | The maximum of all values across all the aggregated series at a given time stamp. |
count | The number of raw series that feed into the aggregated series at a given time stamp. |
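To make the sum:counter semantics concrete, the following sketch computes the running sum of increases for the cyclist example, handling counter resets the way Prometheus-style counters do. The sample data is hypothetical, and this is an illustration rather than the service's actual implementation:

```python
def counter_increase(samples: list) -> float:
    """Total increase of a raw counter series, treating any decrease
    as a counter reset (the counter restarted from zero)."""
    total = 0.0
    prev = samples[0]
    for value in samples[1:]:
        if value >= prev:
            total += value - prev
        else:
            # Reset: count the post-reset value in full.
            total += value
        prev = value
    return total

# Per-peak raw series of the cyclist's elevation-gain counter.
per_peak = {
    "peak_a": [0, 100, 250],      # no resets: climbed 250
    "peak_b": [0, 300, 50, 120],  # reset between 300 and 50: climbed 420
}

# sum:counter with "drop_labels": ["peak"] yields one aggregated series
# of total climb; the per-peak breakdown is no longer recoverable.
total_climb = sum(counter_increase(s) for s in per_peak.values())
```

A naive sum of the raw values would miscount across the reset in peak_b; accounting for resets yields a total climb of 670 here.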
Substring matchers
By default, a rule is applied to the metric name specified in the rule's metric field. In addition, Adaptive Metrics allows you to write rules that apply to all metrics whose names match a given prefix or suffix. To apply rules to all such metrics, use the optional field match_type in your rule and set it to prefix or suffix.
The match_type field supports the following values:
exact: Apply the rule to the metric whose name is specified in the rule's metric field. Because metric names are unique, the rule only applies to one metric.
prefix: Apply the rule to all metrics whose names start with the string in the rule's metric field.
suffix: Apply the rule to all metrics whose names end with the string in the rule's metric field.
An example rule that matches all metrics beginning with http_requests_total_, and that aggregates away their instance label using the sum:counter function, looks as follows:
{
"metric": "http_requests_total_",
"match_type": "prefix",
"drop_labels": ["instance"],
"aggregations": ["sum:counter"]
}
In the following scenario, the metric http_requests_total_abc has two rules that potentially apply. However, because an exact match takes precedence over a prefix match, both the instance and pod labels are aggregated away for http_requests_total_abc:
[
{
"metric": "http_requests_total_",
"match_type": "prefix",
"drop_labels": ["instance"],
"aggregations": ["sum:counter"]
},
{
"metric": "http_requests_total_abc",
"drop_labels": ["instance", "pod"],
"aggregations": ["sum:counter"]
}
]
If multiple substring matchers match a metric, the first match always wins. Consider a rule file with the following two rules:
[
{
"metric": "http_requests_total_",
"match_type": "prefix",
"drop_labels": ["instance"],
"aggregations": ["sum:counter"]
},
{
"metric": "_abc",
"match_type": "suffix",
"drop_labels": ["pod"],
"aggregations": ["sum:counter"]
}
]
In this scenario, the metric http_requests_total_abc is matched by both rules. Because neither rule is an exact match, the first rule in the list takes precedence. This means that the instance label, not the pod label, is aggregated away for http_requests_total_abc.
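The precedence rules described in this section can be sketched as a small resolver. This is a hypothetical helper for illustration, not the service's actual code:

```python
def matches(rule: dict, metric: str) -> bool:
    """Test a rule's metric matcher against a metric name."""
    match_type = rule.get("match_type", "exact")
    if match_type == "exact":
        return metric == rule["metric"]
    if match_type == "prefix":
        return metric.startswith(rule["metric"])
    return metric.endswith(rule["metric"])  # suffix

def resolve(rules: list, metric: str):
    """Pick the rule that applies to a metric: an exact match wins over
    substring matches; among substring matches, the first rule in the
    list takes precedence. Returns None if no rule matches."""
    candidates = [r for r in rules if matches(r, metric)]
    for rule in candidates:
        if rule.get("match_type", "exact") == "exact":
            return rule
    return candidates[0] if candidates else None
```

Applied to the two rule files above, the resolver picks the exact rule (dropping instance and pod) in the first file, and the first-listed prefix rule (dropping only instance) in the second.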
Configure an aggregation
As an illustration, think of a power grid that monitors the energy consumption of houses on different city streets. An example metric that expresses building consumption could be electrical_throughput_total with labels street_name and building_number. Given that you only care about the total energy consumption per street and the average consumption per building on a street, you could configure two aggregations: one sums the consumption of all buildings on a street, and the other counts the buildings on the street. Since the metric electrical_throughput_total is a counter, you need to use the sum:counter aggregation (instead of the sum aggregation) to handle counter resets correctly:
{
"metric": "electrical_throughput_total",
"drop_labels": ["building_number"],
"aggregations": ["sum:counter", "count"]
}
Based on the preceding configuration, the aggregation service would discard the label building_number from the aggregated metric electrical_throughput_total. In its place, it would compute and store aggregated values per street for this metric. The sum:counter aggregation function computes the total electrical throughput of every street in the street_name label set. The count aggregation function computes the count of buildings per street. These two values can be used to compute an average consumption per building for each street. However, because the building_number label has been discarded, it is no longer possible to understand how much power a specific building consumes.
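The per-street aggregation can be sketched as follows. The per-building increases are hypothetical sample data, and the sketch covers only a single aggregation interval:

```python
from collections import defaultdict

# Hypothetical increases of electrical_throughput_total over one
# aggregation interval, keyed by (street_name, building_number).
increases = {
    ("elm_st", "1"): 12.0,
    ("elm_st", "2"): 8.0,
    ("oak_ave", "1"): 5.0,
}

street_sum = defaultdict(float)  # what sum:counter emits per street
street_count = defaultdict(int)  # what count emits per street
for (street, _building), increase in increases.items():
    # building_number is in drop_labels, so it is aggregated away.
    street_sum[street] += increase
    street_count[street] += 1

# Average consumption per building on each street, derived at query time.
average = {street: street_sum[street] / street_count[street] for street in street_sum}
```

For elm_st, the aggregated series report a total of 20.0 across 2 buildings, giving an average of 10.0 per building, but no longer reveal how much building 1 or 2 consumed individually.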
Limits on the aggregation service
The Adaptive Metrics aggregation service enforces limits on the number of series that can be aggregated. If this limit is exceeded, the aggregation service begins to discard incoming samples. When this happens, you see an increase in aggregator-too-many-aggregated-series errors in the Discarded Metrics Samples panel of your billing dashboard.
If you are hitting this limit and would like to request an increase, contact Grafana Labs Support.
Drop a metric
You can also configure an aggregation rule that causes the entire metric to be dropped. If you don't want to persist any time series at all for electrical_throughput_total, from the example in Configure an aggregation, you would configure a rule as follows:
{
"metric": "electrical_throughput_total",
"drop": true
}
This might be useful in cases where a metric originates in many different locations and it would be hard to configure every site of origin to drop the metric on the client side.
Note: Generally, aggregation is preferable to dropping a metric entirely. By aggregating a metric, you can usually reduce its cardinality by 80-90% while keeping some reference to it in the database, such as a lower-fidelity version of it. This can be useful during the investigation of an incident. If you drop a metric, you reduce costs a bit more, but you eliminate all traces of the metric. This means that you do not see this metric when looking in the metric-name browser in Grafana Explore.
Configure the aggregation interval and the DPM of the aggregated metric
The number of data points per minute (DPM) that are stored for the aggregated metric depends directly on the aggregation interval of the metric, which is the interval at which the aggregated samples are emitted.
To configure the aggregation interval per aggregation rule, you can specify the following two optional parameters:
aggregation_interval is the interval at which the aggregated samples are emitted. The default value is 30s, which results in a DPM of 2. The valid values for aggregation_interval are: 15s, 30s, and 60s.
aggregation_delay is the delay after which the aggregated samples are emitted. The default value is 90s. The valid values for aggregation_delay are: 15s, 30s, 60s, 1m30s, 2m, 2m30s, and 3m.
Either set both fields or leave them both empty.
If you want to reduce the DPM of the aggregated metric to 1, set the aggregation_interval to 1m. You do not need to change the aggregation_delay setting for that, but you need to specify it explicitly in the configuration. Set aggregation_delay to 90s to keep the default value.
You can decrease the aggregation_delay in order to emit the aggregated samples earlier, at the risk of excluding samples that are written late (because of a lagging remote write client, for example).
The total delay between the time of the raw sample arriving at Grafana Cloud and the time that the aggregated sample becomes queryable is the sum of the aggregation_interval and the aggregation_delay.
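The arithmetic in this section can be captured in two small helper functions. This is a sketch of the relationships described above; the service itself only accepts the valid duration values listed earlier:

```python
def dpm(aggregation_interval_seconds: float) -> float:
    """Data points per minute stored for the aggregated metric:
    one aggregated sample is emitted per interval."""
    return 60 / aggregation_interval_seconds

def total_delay_seconds(interval_seconds: float, delay_seconds: float) -> float:
    """Seconds between a raw sample arriving at Grafana Cloud and the
    aggregated sample becoming queryable: interval plus delay."""
    return interval_seconds + delay_seconds
```

The default 30s interval yields a DPM of 2; a 1m (60s) interval yields a DPM of 1, and with the default 90s delay the aggregated sample becomes queryable up to 150 seconds after the raw sample arrives.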
Note: You can set global values for aggregation_interval and aggregation_delay that apply by default to all aggregation rules as a configuration option for your metrics instance. Open a support ticket in the Cloud Portal to request this.