Manage metrics costs via Adaptive Metrics

Note: Adaptive Metrics is currently in public preview. Submit a request here to gain access to this tool. Grafana Labs offers support on a best-effort basis, and breaking changes might occur prior to the feature being made generally available.

Adaptive Metrics consists of a recommendations service that generates suggested aggregation rules, and an aggregations service that consumes and implements those rules.

You can interact with both of these services via an HTTP API, a CLI tool, or both.

Note: The CLI and API are in early stages of development and therefore subject to change.

Supported metrics formats

While Grafana Cloud accepts metrics data in a variety of formats, Adaptive Metrics is only compatible with a subset of these formats:

Metrics format | Supported? | Notes
Prometheus | Yes | Fully supported. However, if you do not send metric metadata, few recommendations will be generated. Metric metadata is sent by default in newer versions of Prometheus and the Grafana Agent, but will not be sent if intentionally disabled or if running an older version where the default is to not send.
OpenTelemetry | Yes | Recommendations are limited because metadata is not sent.
Influx Line protocol | Yes | Recommendations are limited because metadata is not sent.
Datadog | No |
Graphite | No |

Note: Adaptive Metrics uses Prometheus metric metadata stored in your Grafana Hosted Metrics instance to ensure recommendations are safe to apply mathematically. For example, for a counter-type metric, recommendations by Adaptive Metrics ensure that counter resets are taken into account during aggregation. If metric metadata is not available for a metric, and Adaptive Metrics is unable to infer a metric’s type from its name or usage patterns, no recommendation will be produced for that metric. If you are using a metric format other than Prometheus, metric metadata is not preserved. As a result, there are fewer recommendations for those metrics.

Recommendations service

The recommendations service scans your Grafana Cloud account and identifies unused and partially used metrics based on your existing dashboards, recording rules, alerting rules, and the last 30 days of query history.

Unused metrics versus partially used metrics

  • An unused metric is a metric that has not been queried in the last 30 days and is not used in an existing dashboard, recording rule, or alerting rule.
  • A partially used metric has been queried at least once in the last 30 days or is used in at least one dashboard, alert, or recording rule. However, all usages only touch a subset of its labels; some of its labels are not used.

Based on the preceding analysis, the recommendations service generates a set of aggregation rules that you can apply using the aggregation service. By applying these recommended rules you can reduce the cardinality of your unused and partially used metrics and do so in a way that guarantees that your existing usages of that metric are respected. Your dashboards, rules, and previous queries that you ran will continue to work as before.
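The distinction between unused and partially used metrics can be illustrated with a short Python sketch. This is only a model of the definitions above, not how the recommendations service is implemented; the function name and the way usages are represented (one set of label names per dashboard, rule, or query) are assumptions made for illustration.

```python
def classify_metric(all_labels, used_label_sets):
    """Classify a metric by how its labels are used.

    all_labels: set of label names the metric carries.
    used_label_sets: one set of label names per usage (dashboard, rule, or query).
    Returns (classification, labels that no usage touches).
    """
    if not used_label_sets:
        # No usage at all: every label is a candidate for aggregation.
        return "unused", set(all_labels)
    used = set().union(*used_label_sets)
    unused_labels = set(all_labels) - used
    if unused_labels:
        return "partially used", unused_labels
    return "fully used", set()
```

In this model, the labels returned alongside a "partially used" result are the ones a recommendation could drop without affecting any existing usage.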

Understand the recommendations format

By default, recommendations are returned in the same format as aggregation rules. This allows the user to apply the recommended aggregations to the aggregations service with no additional editing required.

Where appropriate, the recommendations service also provides updated versions of existing rules. It omits suggestions for any existing rules that it recommends that you remove. The intent is that the recommendations file could then be compared to your existing list of rules to highlight differences between the current state and the recommended state.
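One way to perform that comparison is sketched below in Python. It assumes both files are JSON arrays of rules keyed by a unique metric name; the function names are hypothetical, and treating rules as equal when their label and aggregation sets match is a simplification chosen for this example.

```python
def _shape(rule):
    # Compare rules by content, ignoring the order of list entries.
    return (
        frozenset(rule.get("drop_labels", [])),
        frozenset(rule.get("aggregations", [])),
        bool(rule.get("drop", False)),
    )

def diff_rules(current, recommended):
    """Group metric names by how the recommended rules differ from the current ones."""
    cur = {rule["metric"]: rule for rule in current}
    rec = {rule["metric"]: rule for rule in recommended}
    return {
        "add": sorted(m for m in rec if m not in cur),
        # Rules recommended for removal are omitted from the recommendations.
        "remove": sorted(m for m in cur if m not in rec),
        "update": sorted(m for m in rec if m in cur and _shape(rec[m]) != _shape(cur[m])),
        "keep": sorted(m for m in rec if m in cur and _shape(rec[m]) == _shape(cur[m])),
    }
```

Load the two files with json.load and pass the resulting lists to diff_rules to see which metrics fall into each group.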

An optional verbose flag (--verbose) can be used to retrieve more information about each recommendation.

Here’s an example of a recommendations file when the --verbose flag is added:

[
  {
    "metric": "multitenantproxy_sql_query_total",
    "drop_labels": [
      "container",
      "instance",
      "namespace",
      "pod"
    ],
    "aggregations": [
      "sum:counter"
    ],
    "recommended_action": "keep",
    "usages_in_rules": 0,
    "usages_in_queries": 13,
    "usages_in_dashboards": 98
  },
  {
    "metric": "cortex_bucket_store_indexheader_lazy_load_duration_seconds_count",
    "drop_labels": [
      "component",
      "container",
      "instance",
      "namespace",
      "pod"
    ],
    "aggregations": [
      "sum:counter"
    ],
    "recommended_action": "remove",
    "usages_in_rules": 0,
    "usages_in_queries": 13,
    "usages_in_dashboards": 98
  },
  {
    "metric": "thanos_objstore_bucket_operation_duration_seconds_bucket",
    "drop_labels": [
      "bucket",
      "container",
      "instance",
      "pod"
    ],
    "aggregations": [
      "sum:counter"
    ],
    "recommended_action": "add",
    "keep_labels": [
      "cluster",
      "component",
      "job",
      "le",
      "namespace",
      "operation"
    ],
    "usages_in_rules": 0,
    "usages_in_queries": 0,
    "usages_in_dashboards": 98,
    "total_series_before_aggregation": 50295,
    "total_series_after_aggregation": 19425
  },
  {
    "metric": "thanos_objstore_bucket_operation_duration_seconds_count",
    "drop_labels": [
      "bucket",
      "instance",
      "pod"
    ],
    "aggregations": [
      "sum:counter"
    ],
    "recommended_action": "update",
    "usages_in_rules": 0,
    "usages_in_queries": 0,
    "usages_in_dashboards": 98
  }
]

For an explanation of the default fields, see Aggregation rule format. The --verbose flag adds the following fields:

  • recommended_action denotes the recommended change from your current rules. Valid values are add, keep, remove, and update.
    • add: The recommendations service recommends aggregating a metric that is not currently aggregated.
    • keep: The recommendations service recommends that you keep an aggregation rule in place as-is, without modification.
    • remove: The recommendations service recommends that you remove an aggregation rule that is currently in place. This happens due to changes detected in the usage of the aggregated metric in your environment.
    • update: The recommendations service recommends that you update an aggregation rule that is currently in place, by modifying the labels being aggregated on that metric or the aggregation functions being computed. Like remove, this recommendation is based on changes detected in the usage of the aggregated metric in your environment.
  • usages_in_rules is the number of times this metric was found in an alerting or recording rule.
  • usages_in_queries is the number of times this metric was found in the last 30 days of query logs.
  • usages_in_dashboards is the number of times this metric was found in Grafana dashboards.

In the case of an add recommendation, more fields are present:

  • keep_labels reflects the set of labels that will remain after this rule is applied.
  • total_series_before_aggregation is the number of series for this metric before aggregation.
  • total_series_after_aggregation is the estimated number of series for this metric after aggregation.
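The verbose fields lend themselves to a quick summary of what a recommendations file would change. The following Python sketch is one possible way to do that; the function name is hypothetical, and it assumes the input is the parsed JSON array returned with the verbose option, where only add recommendations carry the two series-count fields.

```python
from collections import Counter

def summarize(recommendations):
    """Count recommendations per action and estimate the series saved by 'add' rules."""
    actions = Counter(rec["recommended_action"] for rec in recommendations)
    saved = sum(
        rec["total_series_before_aggregation"] - rec["total_series_after_aggregation"]
        for rec in recommendations
        if rec["recommended_action"] == "add"
    )
    return dict(actions), saved
```

For the add recommendation in the example file, the estimated saving is 50295 - 19425 = 30870 series.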

If you’re just getting started with Adaptive Metrics, none of your metrics should have aggregation applied. Every recommendation will have recommended_action set to add.

Note:

Recommendations are based on a snapshot of the currently applied rules seen by the recommendation engine at the time it most recently ran. Currently, the recommendations engine runs once every 24 hours.

This means that the recommended rule set might be out of sync with the rules currently used by the aggregations service.

Aggregations service

The aggregations service provides a way for you to aggregate metrics into lower cardinality versions of themselves. Users can define and apply their own aggregation rules, or apply the rules recommended by the recommendations service.

Aggregation rule format

The following example shows an aggregation rule for the metric proxy_sql_queries_total:

{
  "metric": "proxy_sql_queries_total",
  "drop_labels": [
    "container",
    "instance",
    "namespace",
    "pod"
  ],
  "aggregations": [
    "sum:counter"
  ]
}

A description of the fields follows:

  • metric is the name of the metric to be aggregated.
  • drop_labels is a list of the labels that will be removed from the metric via aggregation.
  • aggregations is a list of the aggregation functions to apply to the metric.

Supported aggregation types

The following values are supported for the aggregations field of an aggregation rule:

"sum:counter"
"sum"
"min"
"max"
"count"

Configure an aggregation

As an illustration, think of a power grid that monitors the energy consumption of houses on different city streets. An example metric that expresses building consumption could be electrical_throughput_total with labels street_name and building_number. Given that you only care about maximum and minimum consumption at a per-street level (as opposed to detailed consumption data for every building), you could configure an aggregation rule as follows:

{
	"metric": "electrical_throughput_total",
	"drop_labels": [
		"building_number"
	],
	"aggregations": [
		"max",
		"min"
	]
}

Based on the preceding configuration, the aggregation service would discard the label building_number from the aggregated metric electrical_throughput_total. In its place, it would compute and store aggregated values per street for this metric. This means that it would compute and persist the maximum (max) and minimum (min) values of the electrical throughput on every street in the label set. The specific building or buildings that consumed the maximum and minimum amounts of electricity would no longer be identifiable.
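To make the effect of this rule concrete, the following Python sketch applies the same idea to a handful of in-memory samples taken at a single point in time. It is a conceptual model only: the function name and sample data are invented for illustration, it says nothing about how the aggregation service stores or names the resulting series, and it leaves out sum:counter, which needs counter-reset handling.

```python
def aggregate(samples, drop_labels, aggregations):
    """Group samples by the labels that remain and apply each aggregation.

    samples: list of (labels_dict, value) pairs for one timestamp.
    Returns {remaining_labels_tuple: {aggregation_name: value}}.
    """
    funcs = {"max": max, "min": min, "sum": sum, "count": len}
    groups = {}
    for labels, value in samples:
        # Remove the dropped labels; what is left identifies the aggregated series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        groups.setdefault(kept, []).append(value)
    return {
        kept: {agg: funcs[agg](values) for agg in aggregations}
        for kept, values in groups.items()
    }

samples = [
    ({"street_name": "Main St", "building_number": "1"}, 5.0),
    ({"street_name": "Main St", "building_number": "2"}, 9.0),
    ({"street_name": "Oak Ave", "building_number": "7"}, 3.0),
]
per_street = aggregate(samples, ["building_number"], ["max", "min"])
```

Here per_street holds one entry per street, each with a max and a min value, and the building_number label no longer appears anywhere in the result.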

Drop a metric

You can also configure an aggregation rule that causes the entire metric to be dropped. If you don’t want to persist any time series at all for electrical_throughput_total, from the example in Configure an aggregation, you would configure a rule as follows:

{
	"metric": "electrical_throughput_total",
	"drop": true
}

This might be useful in cases where a metric originates in many different locations and it would be hard to configure every site of origin to drop the metric on the client side.

CLI workflow

Understand the high-level workflow with the CLI:

  1. Download recommendations of what metrics to aggregate.
  2. Use those recommendations to create your own set of aggregation rules.
  3. Upload that set of aggregation rules.

The CLI also enables you to view, edit, and delete existing aggregation rules that have already been applied.

Use the Adaptive Metrics CLI

To use the CLI tool, you’ll need the following key information:

  • URL: In the form https://<your-grafana-cloud-prom-url>.grafana.net/. To find your URL value, go to your grafana.com account and check the Details page of your hosted Prometheus endpoint.
  • TENANT: The numeric instance ID where Adaptive Metrics is set up. To find your TENANT value, go to your grafana.com account and check the Details page of your hosted Prometheus endpoint for Username / Instance ID.
  • KEY: An API key with the appropriate permissions. If you are using Grafana Cloud API keys, make sure that KEY is an API key with the Admin role. If you are using Grafana Cloud Access Policies, make sure KEY is an API key with metrics:read and metrics:write scopes for the stack ID where you have enabled Adaptive Metrics.

After you have these values, set up and run the CLI:

  1. Download the Adaptive Metrics CLI:

    Go to the URL that is based on the build that corresponds to your platform:

  2. Launch the CLI using the following command:

    ./adaptive-cli.<your-distro> --user $TENANT --url $URL --password $KEY

    Substitute the URL, TENANT, and KEY values described at the start of this section for $TENANT, $URL, and $KEY in the previous command.

  3. Use the show recommendations command to pull down the most recently generated recommendations from the recommendations service.

For built-in help documentation about the CLI tool, launch the tool in interactive mode (adding the --repl flag) and then type --help.

Example aggregation rule

Each aggregation rule looks similar to this:

{
  "metric": "agent_request_duration_seconds_sum",
  "drop_labels": [
    "container",
    "instance",
    "method",
    "namespace",
    "pod",
    "provider",
    "status_code",
    "ws"
  ],
  "aggregations": [
    "sum:counter"
  ]
}

In the preceding example:

  • metric is the name of the metric to be aggregated.
  • drop_labels is an array of the labels that will be removed by the aggregations service.
  • aggregations is an array of the aggregation types to calculate for this metric. For the accepted values, see Supported aggregation types.

You can use an aggregation rule file to define multiple rules simultaneously.

The following example rule file is an array of one or more aggregation rules:

[
  {
    "metric": "agent_request_duration_seconds_sum",
    "drop_labels": [
      "namespace",
      "pod"
    ],
    "aggregations": [
      "sum:counter"
    ]
  },
  {
    "metric": "prometheus_request_duration_seconds_sum",
    "drop_labels": [
      "container",
      "instance",
      "ws"
    ],
    "aggregations": [
      "sum:counter"
    ]
  }
]
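Before uploading a rule file, it can help to check it locally. The Python sketch below is a minimal, unofficial validator: the function name and messages are hypothetical, it checks only the fields described on this page, and the service itself may apply additional checks.

```python
# Aggregation values listed under "Supported aggregation types" on this page.
SUPPORTED_AGGREGATIONS = ("sum:counter", "sum", "min", "max", "count")

def validate_rules(rules):
    """Return a list of human-readable problems found in a list of rules."""
    problems = []
    for index, rule in enumerate(rules):
        metric = rule.get("metric")
        if not isinstance(metric, str) or not metric:
            problems.append(f"rule {index}: metric must be a non-empty string")
            continue
        if rule.get("drop") is True:
            continue  # a rule that drops the whole metric needs no other fields
        for agg in rule.get("aggregations", []):
            if agg not in SUPPORTED_AGGREGATIONS:
                problems.append(f"{metric}: unsupported aggregation {agg!r}")
    return problems
```

An empty result means the file passed these basic checks; it does not guarantee that the upload will be accepted.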

Apply aggregation rules

After you add (create aggregations), modify (edit aggregations), or delete (delete aggregations) an aggregation rule, the CLI’s show aggregations command reflects the change. Use this command to get the most current picture of which aggregation rules are active in your environment.

There is a delay between uploading new aggregation rules and those metrics aggregations taking effect in your environment. In most cases, the delay is approximately 5-10 minutes, but we currently have no mechanism to let you know precisely when new aggregations take effect.

To check whether a change is live, query the metric whose aggregation rule you added or changed and look at the value of the __dropped_labels__ label. After this value reflects the changes you’ve made, you’ll know your updated aggregation rules are live in your environment.

We currently limit how often new aggregation rules can be applied. Although you can upload as many new versions of your aggregation rules as you like, those updates are only applied once every 10 minutes. If you make multiple updates in quick succession, the system applies your first received (oldest) update. Then, 10 minutes later, the most recently received update is applied. The intermediate updates never get applied.
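A small simulation can make this throttling behavior easier to follow. The Python sketch below is a model built only from the description above, not the service's implementation; the function name, the minute-based timestamps, and the assumption that a held-back update is applied exactly when the 10-minute window ends are all illustrative.

```python
def applied_updates(arrivals, interval=10):
    """Simulate which uploaded versions get applied.

    arrivals: list of (minute, version) pairs sorted by minute.
    Returns a list of (minute_applied, version) pairs.
    """
    applied = []
    next_allowed = None  # earliest minute at which another update may be applied
    pending = None       # most recent update received while waiting
    for minute, version in arrivals:
        # Apply the held-back update if its window ended before this arrival.
        if pending is not None and minute >= next_allowed:
            applied.append((next_allowed, pending))
            next_allowed += interval
            pending = None
        if next_allowed is None or minute >= next_allowed:
            applied.append((minute, version))
            next_allowed = minute + interval
        else:
            # Newer uploads replace older ones that are still waiting.
            pending = version
    if pending is not None:
        applied.append((next_allowed, pending))
    return applied
```

With four uploads at minutes 0, 1, 2, and 3, this model applies the first one immediately and the fourth one at minute 10; the second and third are never applied.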

Adaptive Metrics API

The Adaptive Metrics CLI is a wrapper around an API. You can use the underlying API directly if you choose. This API is under active development and is subject to change.

List recommendations

Download our recommendations for metrics to aggregate using the following command. URL, TENANT, and KEY are the values described in Use the Adaptive Metrics CLI.

curl -u "$TENANT:$KEY" "$URL/aggregations/recommendations"

KEY must have metrics:read scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.

You can use an optional verbose flag to retrieve more information about each recommendation:

curl -u "$TENANT:$KEY" "$URL/aggregations/recommendations?verbose=true"
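If you prefer to call the endpoint from a script rather than curl, the request can be built with the Python standard library. The sketch below only constructs the request; the function name is hypothetical, and it assumes the endpoint accepts HTTP basic authentication with TENANT as the user and KEY as the password, which is what curl -u sends.

```python
import base64
import urllib.request

def recommendations_request(url, tenant, key, verbose=False):
    """Build a GET request for the recommendations endpoint."""
    endpoint = url.rstrip("/") + "/aggregations/recommendations"
    if verbose:
        endpoint += "?verbose=true"
    # Equivalent of curl -u "$TENANT:$KEY": HTTP basic authentication.
    token = base64.b64encode(f"{tenant}:{key}".encode()).decode()
    return urllib.request.Request(endpoint, headers={"Authorization": f"Basic {token}"})
```

Pass the returned object to urllib.request.urlopen and parse the response body with json.load to work with the recommendations as Python lists and dictionaries.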

List current recommendations configuration

Download the current configuration of the recommendations service using the following command:

curl -u "$TENANT:$KEY" "$URL/aggregations/recommendations/config"

KEY must have metrics:read scope if using Grafana Cloud Access Policies. If using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.

The only tunable parameter exposed by the recommendations service is the keep_labels parameter. This parameter allows the user to define a comma-separated list of labels that they never want recommended for aggregation. This can be useful at organizations where certain labels are always expected on metrics, regardless of whether or not those labels have been recently queried.

An example response from the /recommendations/config endpoint would look as follows:

{
  "keep_labels": [
    "instance",
    "pod"
  ]
}

The preceding response indicates that the recommendations service has been configured to never recommend aggregating the instance or pod labels.

Update recommendations configuration

Upload new recommendations configuration using the following command:

curl -u "$TENANT:$KEY" --request POST --data @config.json "$URL/aggregations/recommendations/config"

KEY must have metrics:write scope if using Grafana Cloud Access Policies. If using Grafana Cloud API keys, KEY must be for Admin or MetricsPublisher roles.

This command uses the same endpoint described in List current recommendations configuration and expects the same JSON format.
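Generating config.json from a script avoids hand-editing mistakes such as trailing commas, which make the file invalid JSON. The following Python sketch shows one way to do it; the function name is hypothetical, and removing duplicates and sorting the labels are conveniences chosen for this example rather than requirements of the API.

```python
import json

def build_recommendations_config(keep_labels):
    """Return the JSON text for a recommendations configuration."""
    unique_labels = sorted(set(keep_labels))
    return json.dumps({"keep_labels": unique_labels}, indent=2)

config_text = build_recommendations_config(["pod", "instance", "pod"])
```

Write config_text to config.json and upload it with the curl command shown earlier in this section.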

List currently applied aggregation rules

Download your existing aggregation rules:

curl -u "$TENANT:$KEY" "$URL/aggregations/rules"

KEY must have metrics:read scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.

Upload new aggregation rules

Uploading new aggregation rules is a multi-step process:

  1. Fetch the currently applied rules.
  2. Modify rules locally.
  3. Upload rules back.

Fetch the currently applied rules

Use this command:

curl -u "$TENANT:$KEY" -D headers.txt "$URL/aggregations/rules" > rules.json

KEY must have metrics:read scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.

The preceding command uses the same endpoint described in List currently applied aggregation rules, but adds a -D headers.txt argument.

The -D headers.txt argument stores the headers in a file called headers.txt. This step is required if you want to then upload a new rule file, for example if you want to update the existing aggregation rules you have in place. The information in these headers ensures there are no update collisions. An update collision is the scenario where multiple users try to edit the rules file at the same time and overwrite one another’s changes.
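The collision check relies on the ETag value saved in headers.txt being sent back as an If-Match header on upload, which is what the upload script later on this page does. As a sketch of that step in Python, the hypothetical helper below extracts the value from the saved header text.

```python
def if_match_from_headers(raw_headers):
    """Turn the ETag response header into an If-Match request header.

    raw_headers: the text content of headers.txt as written by curl -D.
    Returns {"If-Match": value}, or an empty dict if no ETag header is present.
    """
    for line in raw_headers.splitlines():
        name, separator, value = line.partition(":")
        # Header names are case-insensitive; the status line has no colon.
        if separator and name.strip().lower() == "etag":
            return {"If-Match": value.strip()}
    return {}
```

The returned dictionary can be merged into the headers of the request that uploads the modified rules.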

Modify the rules locally

Use your editor of choice to modify the rules.json file downloaded in the prior step.
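Because an upload replaces the whole rule file, adding rules through the API means merging them into the downloaded rules.json first. The Python sketch below shows one way to do that merge; the function name is hypothetical, and it assumes each metric has at most one rule, so a new rule for an existing metric replaces the old one.

```python
def merge_rules(existing, new_rules):
    """Merge new rules into the existing list, keyed by metric name."""
    merged = {rule["metric"]: rule for rule in existing}
    for rule in new_rules:
        # A new rule for an already-present metric replaces the old rule.
        merged[rule["metric"]] = rule
    return list(merged.values())
```

Load rules.json with json.load, merge your additions, and write the result back with json.dump before uploading.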

Upload rules back

The API supports uploading an entire rules file.

Warning: THIS ACTION WILL OVERWRITE YOUR EXISTING RULE FILE. If you prefer to append to your existing rules, you must use the CLI instead.

To upload your modified rules.json file from the previous step, use the following shell script:

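#!/usr/bin/env bash
# Usage: ./rules_upload.sh <rules_file.json>
TMPFILE=$(mktemp)
trap 'rm -f "$TMPFILE"' EXIT

# Rewrite the ETag response header saved in headers.txt as an If-Match request header.
grep -i '^etag:' headers.txt | tr -d '\r' | sed 's/^[Ee][Tt][Aa][Gg]:/If-Match:/' > "$TMPFILE"

# Upload the rules file passed as the first argument.
curl --request POST --header @"$TMPFILE" --data-binary @"$1" -u "$TENANT:$KEY" "$URL/aggregations/rules"

KEY must have metrics:write scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or MetricsPublisher roles.

The grep pipeline reads the headers.txt file created by the previous curl call that pulled down the existing aggregation rules, extracts the ETag header, and writes it to a temporary file as an If-Match header. It does not change headers.txt itself.

The curl --request POST command uploads your new rules file together with the If-Match header.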

Save the shell script as rules_upload.sh.

To run that script, use the following command:

./rules_upload.sh <your_new_rules_file.json>

Replace <your_new_rules_file.json> with the name of the rules file you wish to upload.

Aggregation service: requirements on sample age

We can only aggregate raw samples that are relatively recent. For metrics that are being aggregated, Grafana Cloud rejects any sample that arrives more than 60 seconds late, that is, any sample for which the difference between the wall clock time at which it arrives at Grafana Cloud and the timestamp on the sample (which indicates when it was collected) is greater than 60 seconds.

If Grafana Cloud rejects samples for this reason, you will see an increase in forwarded-samples-too-old errors on the Discarded Metrics Samples panel of your billing dashboard.

This sample age requirement only applies to samples that belong to metrics that are being aggregated.
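The age rule can be expressed as a simple comparison. The following Python sketch is an illustration of the rule as described above, with a hypothetical function name and both times given in seconds since the Unix epoch; it is not the check Grafana Cloud actually runs.

```python
MAX_SAMPLE_AGE_SECONDS = 60

def is_too_old(sample_timestamp, arrival_time, max_age=MAX_SAMPLE_AGE_SECONDS):
    """Return True if a sample for an aggregated metric would arrive too late.

    sample_timestamp: when the sample was collected.
    arrival_time: wall clock time at which the sample reaches Grafana Cloud.
    """
    return (arrival_time - sample_timestamp) > max_age
```

A sample collected at second 1000 and received at second 1061 is 61 seconds old and would be rejected under this rule, while one received at second 1060 would not.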

Why this happens

To compute an aggregation, we must wait for all raw samples associated with that metric to arrive. We don’t know how many samples will arrive, nor can we wait indefinitely on those samples, because the longer we wait, the longer the delay before the data can be queried or seen in dashboards.

If a sample arrives after our configured waiting time, it does not get taken into account during the computation of the aggregated value. Because our metrics database is immutable once the aggregation has been computed, we cannot update the aggregated value to reflect this late arriving data point.

Troubleshooting

If you encounter issues querying a metric that has been aggregated, see Troubleshoot your aggregated metrics query. For any other questions or feedback, contact your Customer Success Manager or file a Support request.

Security warning when running the CLI on macOS

If you try to run the CLI on macOS and get a security warning that it can’t be opened because Apple cannot check it, perform the following steps:

  1. Open System Settings.
  2. Navigate to Privacy & Security.
  3. Scroll down to Security.
  4. Locate the option to run the CLI.