Understand recommended rules for metrics aggregation

The recommendations service scans your Grafana Cloud account and identifies unused and partially used metrics based on your existing dashboards, recording rules, alerting rules, and the last 30 days of query history. Based on these findings, it creates recommended rules for metrics aggregation.

Unused metrics versus partially used metrics

  • An unused metric is a metric that has not been queried in the last 30 days and is not used in an existing dashboard, recording rule, or alerting rule.
  • A partially used metric has been queried at least once in the last 30 days or is used in at least one dashboard, alert, or recording rule. However, all usages only touch a subset of its labels; some of its labels are not used.
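The distinction can be sketched in Python. This is an illustrative model only, not the service's implementation: the MetricUsage type and its fields are hypothetical, and the sketch assumes that usage has already been collected into a count plus the set of labels that any usage depends on.

```python
from dataclasses import dataclass, field


@dataclass
class MetricUsage:
    """Hypothetical summary of how one metric is used (not a Grafana API type)."""
    all_labels: set                                 # every label present on the metric's series
    used_labels: set = field(default_factory=set)   # labels that at least one usage depends on
    usage_count: int = 0                            # dashboards + rules + queries (last 30 days)


def classify(usage: MetricUsage) -> str:
    """Classify a metric following the two definitions above."""
    if usage.usage_count == 0:
        return "unused"
    if usage.used_labels < usage.all_labels:  # strict subset: some labels are never used
        return "partially used"
    return "fully used"
```

Under this model, a metric with labels pod and job that is only ever queried as sum by (job) would be classified as partially used, because the pod label is never needed.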

Based on the preceding analysis, the recommendations service generates a set of aggregation rules that you can apply using the aggregation service. Applying these recommended rules reduces the cardinality of your unused and partially used metrics while preserving every existing usage of those metrics: your dashboards, rules, and previously run queries continue to work as before.

Recommendations format

By default, recommendations are returned in the same format as aggregation rules, so you can apply the recommended aggregations to the aggregations service without any additional editing.

The recommendations also include updated versions of your existing rules, except for rules that the service recommends you remove. The intent is that you can compare the recommendations file with your existing list of rules to see the differences between the current rule state and the recommended rule state.
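One way to make that comparison is sketched below in Python. It assumes both files use the default (non-verbose) format and have already been loaded into lists of dictionaries with the metric, drop_labels, and aggregations fields shown in the example later on this page. The diff_rules helper is hypothetical and is not part of any Grafana tooling.

```python
def _signature(rule: dict) -> tuple:
    """Order-insensitive view of what a rule does to its metric."""
    return (sorted(rule.get("drop_labels", [])), sorted(rule.get("aggregations", [])))


def diff_rules(current: list, recommended: list) -> dict:
    """Group metric names by how the recommended rules differ from the current ones."""
    cur = {rule["metric"]: rule for rule in current}
    rec = {rule["metric"]: rule for rule in recommended}
    shared = cur.keys() & rec.keys()
    return {
        "add": sorted(rec.keys() - cur.keys()),      # only in the recommendations
        "remove": sorted(cur.keys() - rec.keys()),   # only in the current rules
        "update": sorted(m for m in shared if _signature(cur[m]) != _signature(rec[m])),
        "keep": sorted(m for m in shared if _signature(cur[m]) == _signature(rec[m])),
    }
```

The sketch treats a rule as unchanged when its dropped labels and aggregations match regardless of list order. That is a simplifying assumption, not documented behavior.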

For more information about each recommendation, you can use an optional verbose flag (--verbose).

Here’s an example of a recommendations file when the --verbose flag is added:

json
[
  {
    "metric": "multitenantproxy_sql_query_total",
    "drop_labels": ["container", "instance", "namespace", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "keep",
    "usages_in_rules": 0,
    "usages_in_queries": 13,
    "usages_in_dashboards": 98
  },
  {
    "metric": "cortex_bucket_store_indexheader_lazy_load_duration_seconds_count",
    "drop_labels": ["component", "container", "instance", "namespace", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "remove",
    "usages_in_rules": 0,
    "usages_in_queries": 13,
    "usages_in_dashboards": 98
  },
  {
    "metric": "thanos_objstore_bucket_operation_duration_seconds_bucket",
    "drop_labels": ["bucket", "container", "instance", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "add",
    "kept_labels": ["cluster", "component", "job", "le", "namespace", "operation"],
    "usages_in_rules": 0,
    "usages_in_queries": 0,
    "usages_in_dashboards": 98,
    "total_series_before_aggregation": 50295,
    "total_series_after_aggregation": 19425
  },
  {
    "metric": "thanos_objstore_bucket_operation_duration_seconds_count",
    "drop_labels": ["bucket", "instance", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "update",
    "usages_in_rules": 0,
    "usages_in_queries": 0,
    "usages_in_dashboards": 98
  }
]

For an explanation of the default fields, see Aggregation rule format. The fields added by the --verbose flag are explained here:

  • recommended_action denotes the recommended change from your current rules. Valid values are add, keep, remove, and update.
    • add: The recommendations service recommends aggregating a metric that is not currently aggregated.
    • keep: The recommendations service recommends that you keep an existing aggregation rule as-is, without modification.
    • remove: The recommendations service recommends that you remove an existing aggregation rule. This happens when the service detects changes in how the aggregated metric is used in your environment.
    • update: The recommendations service recommends that you update an existing aggregation rule by modifying the labels being aggregated on that metric or the aggregation functions being computed. Like remove, this recommendation is based on detected changes in how the aggregated metric is used in your environment.
  • usages_in_rules is the number of times this metric was found in an alerting or recording rule.
  • usages_in_queries is the number of times this metric was found in the last 30 days of query logs.
  • usages_in_dashboards is the number of times this metric was found in Grafana dashboards.

In the case of an add recommendation, more fields are present:

  • kept_labels reflects the set of labels that will remain after this rule is applied.
  • total_series_before_aggregation is the number of series for this metric before aggregation.
  • total_series_after_aggregation is the estimated number of series for this metric after aggregation.
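As a rough illustration of how these fields can be used, the following Python sketch counts recommendations per action and estimates the series reduction from add recommendations. It assumes the input is the JSON text of a verbose recommendations file like the example above. The summarize function is a hypothetical helper, not part of the Adaptive Metrics tooling.

```python
import json
from collections import Counter


def summarize(recommendations_json: str):
    """Return (count of recommendations per action, estimated series saved by 'add' rules)."""
    recs = json.loads(recommendations_json)
    actions = Counter(rec["recommended_action"] for rec in recs)
    # Only 'add' recommendations carry the before/after series estimates.
    saved = sum(
        rec["total_series_before_aggregation"] - rec["total_series_after_aggregation"]
        for rec in recs
        if rec["recommended_action"] == "add"
    )
    return actions, saved
```

For the add recommendation in the example above, the estimated reduction is 50295 - 19425 = 30870 series. Because total_series_after_aggregation is an estimate, the result is one as well.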

If you’re just getting started with Adaptive Metrics, none of your metrics have aggregations applied yet, so every recommendation has recommended_action set to add.

Note

Recommendations are based on a snapshot of the currently applied rules seen by the recommendation engine at the time it most recently ran. This means that the recommended rule set might be out of sync with the rules currently used by the aggregations service.

Why am I not seeing any recommendations?

If you do not see any recommendations, check to see if the recommendations service has run. You can do this by sending an HTTP request to the /aggregations/recommendations endpoint and checking the Last-Modified header returned in the response. If you are using curl, use the -v flag to see the headers.

  • If the header is missing, then the recommendations service has not run. Recommendations are only generated for active Grafana Cloud instances. If you haven’t recently logged into your hosted Grafana, simply log in. This will trigger the recommendations service to generate a new set of recommendations. This usually only takes about an hour to complete, but can take up to 24 hours.

    If you recently logged into your hosted Grafana and your recommendations response is still empty and missing a Last-Modified header, then open a support ticket.

  • If the header is present, its value is the last time your recommendations were updated.

    If the recommendations service has zero recommendations, this means that upon analyzing your Grafana Cloud Metrics account, the recommendations service found no metrics that it judged to be candidates for aggregation. A metric can be marked “not a candidate for aggregation” for several reasons:

    • The metric is used in a recording or alerting rule. Recording and alerting rules can be very sensitive to delays in when samples are received, and aggregation introduces a short delay in when samples are ingested.
    • The metric’s cardinality is too low. Aggregation introduces some overhead, so it is not cost efficient to aggregate metrics with fewer than 100 time series.
    • It is not possible to aggregate the metric without breaking any dashboards, rules, or historic queries of that metric. This generally happens when the metric is used in a way that is mathematically incompatible with aggregation. Common cases include:
      • The metric is used without any aggregation function, such as in a metric_name{label="value"} query. Such a query needs all possible labels on that metric, so no label can be safely dropped without changing the query result.
      • The metric is used with a non-associative aggregation function, for example stddev. Only associative aggregation functions are supported.
    • The metric is a Prometheus summary-type metric. Summaries store percentiles and percentiles cannot be aggregated. For example, there is no way to calculate a global 95th percentile of latency given the set of 95th percentiles of latencies of individual components.
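The mathematical constraint behind the last two reasons can be illustrated with a small Python sketch. The data and the combines_exactly helper are made up for illustration. The helper checks whether applying a function to per-group results gives the same answer as applying it to all raw samples. The median stands in for a percentile (p50).

```python
import math
import statistics


def combines_exactly(agg, groups) -> bool:
    """True if aggregating the per-group results equals aggregating all raw samples."""
    partials = [agg(group) for group in groups]
    everything = [sample for group in groups for sample in group]
    return math.isclose(agg(partials), agg(everything))


# Hypothetical latency samples from two components.
groups = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]]

for name, agg in [
    ("sum", sum),
    ("max", max),
    ("stddev", statistics.pstdev),
    ("median (p50)", statistics.median),
]:
    print(name, combines_exactly(agg, groups))
```

With this data, sum and max survive the two-step computation, while stddev and the median do not. This is why pre-aggregated series cannot answer stddev queries, and why percentiles stored by summaries cannot be merged into a global percentile.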

How often are recommendations updated?

The recommendations engine triggers when any of the following conditions are met:

  • A new usage pattern is detected in a Grafana Dashboard.
  • A new usage pattern is detected in a Recording or Alerting rule.
  • A new usage pattern is detected in a query log.
  • A change in the currently applied aggregation rules is detected.

Each of these conditions is checked once per hour to determine if new recommendations should be generated. If none of the preceding conditions are true, the recommendations engine triggers once the current set of recommendations is at least 12h old. This time-based trigger ensures that changes in the stored time series data are considered, even if usage patterns remain constant.
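The hourly check described above can be summarized as a single predicate. The following Python sketch is an interpretation of this description, not the engine's actual code. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta

# Recommendations are regenerated once they are at least this old, per the text above.
MAX_RECOMMENDATION_AGE = timedelta(hours=12)


def should_trigger(
    new_usage_pattern: bool,      # new pattern in a dashboard, rule, or query log
    rules_changed: bool,          # change in the currently applied aggregation rules
    last_generated: datetime,     # when the current recommendations were produced
    now: datetime,
) -> bool:
    """Decide, at one hourly check, whether new recommendations should be generated."""
    if new_usage_pattern or rules_changed:
        return True
    return now - last_generated >= MAX_RECOMMENDATION_AGE
```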

For the purpose of the Adaptive Metrics recommendations engine, usage patterns are considered new when any of the following conditions are true:

  • A metric is used with a new aggregation function
  • A metric is grouped by a new label
  • A metric is filtered by a new label matcher
  • A metric is used in a new location, e.g. when a metric that was previously only used in dashboards is now used in a recording rule.
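These four conditions can be modeled as a comparison against previously seen usages of the same metric. The Python sketch below is a simplified interpretation. The Usage type, its fields, and the string form of the matchers are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical record of one usage of a metric."""
    metric: str
    function: str          # aggregation function, e.g. "sum"
    group_by: frozenset    # labels the metric is grouped by
    matchers: frozenset    # label matchers, e.g. 'env="prod"'
    location: str          # "dashboard", "rule", or "query"


def is_new_pattern(usage: Usage, seen: list) -> bool:
    """True if the usage differs from all prior usages in any of the four ways above."""
    prior = [u for u in seen if u.metric == usage.metric]
    seen_functions = {u.function for u in prior}
    seen_group_by = set().union(*(u.group_by for u in prior))
    seen_matchers = set().union(*(u.matchers for u in prior))
    seen_locations = {u.location for u in prior}
    return (
        usage.function not in seen_functions
        or not usage.group_by <= seen_group_by
        or not usage.matchers <= seen_matchers
        or usage.location not in seen_locations
    )
```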

After the recommendations engine is triggered, it can take some time to produce a new set of recommendations. Typically this is only a few minutes, but during peak times the job waits in a queue before being processed.

With all of the preceding points considered, you can expect that new usage patterns show up in recommendations within 2h and changes in the underlying data show up in recommendations within 24h.

Time when recommendations for aggregations were last generated

Recommendations are updated for each Grafana Cloud stack according to the preceding logic as long as the following statements are true:

  • The hosted Grafana instance provisioned with the stack has been recently used.
  • You’re sending at least one active time series to your Grafana Cloud Metrics account.

To understand when aggregation recommendations were last generated for your stack, check the Last-Modified header in the HTTP response of the /aggregations/recommendations endpoint.
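If you script this check, the header can be interpreted with the Python standard library. The sketch below assumes you have already made the HTTP request and collected the response headers into a mapping. The last_generated helper is hypothetical, and how you authenticate and which base URL you call depend on your stack.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_generated(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified time, or None if the header is missing.

    A missing header suggests the recommendations service has not run yet.
    """
    # HTTP header names are case-insensitive, so compare in lowercase.
    value = next(
        (v for k, v in headers.items() if k.lower() == "last-modified"),
        None,
    )
    if value is None:
        return None
    return parsedate_to_datetime(value)
```

A return value of None corresponds to the missing-header case described earlier. Any other value is the time your recommendations were last updated.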

Make the recommendations service ignore a query

The recommendations service collects usage information from your Grafana Cloud stack to ensure that the recommended aggregations honor your existing usage patterns. In some cases, you might want a certain query to be ignored when computing recommendations. A common case is that you are inspecting a metric but aren’t yet sure whether it will be useful for observing your systems.

To do so, use an empty label selector for the __ignore_usage__ label in your PromQL query. For example, the query sum by (pod) (container_cpu_seconds_total) would become sum by (pod) (container_cpu_seconds_total{__ignore_usage__=""}). Adding this label selector will have no effect on the query results, but will signal to the recommendations service to not try to retain compatibility with that query. This means that it may recommend an aggregation that would result in that query no longer returning data.

If your query has multiple vector selectors, add the __ignore_usage__ label selector to each of them. For example, the query sum by (pod) (request_duration_sum) / sum by (pod) (request_duration_count) would become sum by (pod) (request_duration_sum{__ignore_usage__=""}) / sum by (pod) (request_duration_count{__ignore_usage__=""}).
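If you generate exploratory queries from code, a small helper can add the selector consistently. The following Python sketch is illustrative only. The selector function is hypothetical, and it assumes that label values need no escaping.

```python
from typing import Mapping, Optional


def selector(
    metric: str,
    matchers: Optional[Mapping[str, str]] = None,
    ignore_usage: bool = False,
) -> str:
    """Build a PromQL vector selector, optionally marked to be ignored by recommendations."""
    parts = [f'{name}="{value}"' for name, value in (matchers or {}).items()]
    if ignore_usage:
        # An empty matcher on __ignore_usage__ does not change the query result.
        parts.append('__ignore_usage__=""')
    if not parts:
        return metric
    return metric + "{" + ", ".join(parts) + "}"


# Every vector selector in the query gets the marker.
duration_sum = selector("request_duration_sum", ignore_usage=True)
duration_count = selector("request_duration_count", ignore_usage=True)
query = f"sum by (pod) ({duration_sum}) / sum by (pod) ({duration_count})"
```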

Queries ignored by Adaptive Metrics

Adaptive Metrics is able to analyze the vast majority of PromQL queries, but there are a few cases where queries are ignored.

  • The query does not contain a metric name. Queries like count({env="prod"}) don’t contain a metric name and are ignored. These queries fail if any metric that matches the vector selector has been aggregated. Depending on your goal, it may still be possible to rewrite the query so that it works. For more information, refer to Troubleshoot your aggregated metrics query.
  • The query matches all metrics. Queries like count({__name__=~".+"}) match all metrics and are incompatible with the transparent query support in Adaptive Metrics. These queries fail after enabling Adaptive Metrics, but can often be rewritten to a working form depending on your goals. For more information, refer to How can I count the aggregated series?.
  • The query makes use of dashboard variables. Grafana allows queries like sum($metric_total), which make use of a $metric variable. When a dashboard variable dynamically specifies the metric name, Adaptive Metrics cannot attribute the query to a specific set of metrics. Any time the dashboard is loaded, the actual query still appears in query logs and is considered normally, so this is typically only an issue for unused dashboards. This limitation does not apply to range variables such as $__rate_interval, which are correctly analyzed.

Recommendations for metrics used in alerting or recording rules

Adaptive Metrics strongly recommends removing aggregations on metrics used in alerting or recording rules. Aggregation delay can lead to silent alerts or degradation of recording rules, potentially causing unnoticed incidents.

Caution

Using aggregations in alerting or recording rules is unsupported behavior and can lead to unexpected issues.