Manage metrics costs via Adaptive Metrics
Note: Adaptive Metrics is currently in public preview. Submit a request here to gain access to this tool. Grafana Labs offers support on a best-effort basis, and breaking changes might occur prior to the feature being made generally available.
Adaptive Metrics consists of a recommendations service that generates suggested aggregation rules, and an aggregations service that consumes and implements those rules.
You can interact with both of these services via an HTTP API, a CLI tool, or both.
Note: The CLI and API are in early stages of development and therefore subject to change.
Supported metrics formats
While Grafana Cloud accepts metrics data in a variety of formats, Adaptive Metrics is only compatible with a subset of these formats:
| Metrics format | Supported? | Notes |
|---|---|---|
| Prometheus | Yes | Fully supported. However, if you do not send metric metadata, few recommendations will be generated. Metric metadata is sent by default in newer versions of Prometheus and the Grafana Agent, but will not be sent if intentionally disabled or if running an older version where the default is to not send. |
| OpenTelemetry | Yes | Recommendations are limited because metadata is not sent. |
| Influx Line protocol | Yes | Recommendations are limited because metadata is not sent. |
| Datadog | No | |
| Graphite | No | |
Note: Adaptive Metrics uses Prometheus metric metadata stored in your Grafana Hosted Metrics instance to ensure recommendations are safe to apply mathematically. For example, for a counter-type metric, recommendations by Adaptive Metrics ensure that counter resets are taken into account during aggregation. If metric metadata is not available for a metric, and Adaptive Metrics is unable to infer a metric’s type from its name or usage patterns, no recommendation will be produced for that metric. If you are using a metric format other than Prometheus, metric metadata is not preserved. As a result, there are fewer recommendations for those metrics.
Recommendations service
The recommendations service scans your Grafana Cloud account and identifies unused and partially used metrics based on your existing dashboards, recording rules, alerting rules, and the last 30 days of query history.
Unused metrics versus partially used metrics
- An unused metric is a metric that has not been queried in the last 30 days and is not used in an existing dashboard, recording rule, or alerting rule.
- A partially used metric has been queried at least once in the last 30 days or is used in at least one dashboard, alert, or recording rule. However, all usages only touch a subset of its labels; some of its labels are not used.
Based on the preceding analysis, the recommendations service generates a set of aggregation rules that you can apply using the aggregation service. By applying these recommended rules you can reduce the cardinality of your unused and partially used metrics and do so in a way that guarantees that your existing usages of that metric are respected. Your dashboards, rules, and previous queries that you ran will continue to work as before.
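The distinction between unused and partially used metrics can be sketched in a few lines of Python. This is only an illustration of the definitions above, not how the recommendations service is implemented; the MetricUsage type and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetricUsage:
    """Hypothetical summary of how one metric is used (not an API type)."""
    all_labels: set[str]
    used_labels: set[str] = field(default_factory=set)
    usage_count: int = 0  # queries in the last 30 days + dashboards + rules


def classify(usage: MetricUsage) -> str:
    """Return 'unused', 'partially used', or 'fully used'."""
    if usage.usage_count == 0:
        return "unused"
    # Labels that no query, dashboard, or rule ever references.
    unused_labels = usage.all_labels - usage.used_labels
    return "partially used" if unused_labels else "fully used"


def droppable_labels(usage: MetricUsage) -> list[str]:
    """Labels that could be aggregated away without affecting known usages."""
    return sorted(usage.all_labels - usage.used_labels)
```

For example, a metric with labels pod and job whose usages only reference job would be classified as partially used, with pod as the only candidate label to drop.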
Understand the recommendations format
By default, recommendations are returned in the same format as aggregation rules. This allows the user to apply the recommended aggregations to the aggregations service with no additional editing required.
Where appropriate, the recommendations service also provides updated versions of existing rules. It omits suggestions for any existing rules that it recommends that you remove. The intent is that the recommendations file could then be compared to your existing list of rules to highlight differences between the current state and the recommended state.
An optional verbose flag (--verbose) can be used to retrieve more information about each recommendation.
Here’s an example of a recommendations file when the --verbose flag is added:
[
  {
    "metric": "multitenantproxy_sql_query_total",
    "drop_labels": ["container", "instance", "namespace", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "keep",
    "usages_in_rules": 0,
    "usages_in_queries": 13,
    "usages_in_dashboards": 98
  },
  {
    "metric": "cortex_bucket_store_indexheader_lazy_load_duration_seconds_count",
    "drop_labels": ["component", "container", "instance", "namespace", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "remove",
    "usages_in_rules": 0,
    "usages_in_queries": 13,
    "usages_in_dashboards": 98
  },
  {
    "metric": "thanos_objstore_bucket_operation_duration_seconds_bucket",
    "drop_labels": ["bucket", "container", "instance", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "add",
    "keep_labels": ["cluster", "component", "job", "le", "namespace", "operation"],
    "usages_in_rules": 0,
    "usages_in_queries": 0,
    "usages_in_dashboards": 98,
    "total_series_before_aggregation": 50295,
    "total_series_after_aggregation": 19425
  },
  {
    "metric": "thanos_objstore_bucket_operation_duration_seconds_count",
    "drop_labels": ["bucket", "instance", "pod"],
    "aggregations": ["sum:counter"],
    "recommended_action": "update",
    "usages_in_rules": 0,
    "usages_in_queries": 0,
    "usages_in_dashboards": 98
  }
]
For an explanation of the default fields, see Aggregation rule format. The --verbose flag adds the following fields:
- recommended_action denotes the recommended change from your current rules. Valid values are add, keep, remove, and update.
  - add: The recommendations service recommends aggregating a metric that is not currently aggregated.
  - keep: The recommendations service recommends that you keep an aggregation rule in place as-is, without modification.
  - remove: The recommendations service recommends that you remove an aggregation rule that is currently in place. This happens due to changes detected in the usage of the aggregated metric in your environment.
  - update: The recommendations service recommends that you update an aggregation rule that is currently in place, by modifying the labels being aggregated on that metric or the aggregation functions being computed. Like remove, this recommendation is based on changes detected in the usage of the aggregated metric in your environment.
- usages_in_rules is the number of times this metric was found in an alerting or recording rule.
- usages_in_queries is the number of times this metric was found in the last 30 days of query logs.
- usages_in_dashboards is the number of times this metric was found in Grafana dashboards.
In the case of an add recommendation, more fields are present:
- keep_labels reflects the set of labels that will remain after this rule is applied.
- total_series_before_aggregation is the number of series for this metric before aggregation.
- total_series_after_aggregation is the estimated number of series for this metric after aggregation.
If you’re just getting started with Adaptive Metrics, none of your metrics should have aggregation applied. Every recommendation will have recommended_action set to add.
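As a rough illustration of how the verbose fields can be used, the following Python sketch counts recommendations per recommended_action and estimates how many series the add recommendations would save. It assumes the input is the parsed JSON array returned with the verbose flag; the summarize function is a hypothetical helper, not part of the CLI or API.

```python
from collections import Counter


def summarize(recommendations: list[dict]) -> dict:
    """Count recommendations per action and estimate series saved."""
    actions = Counter(rec["recommended_action"] for rec in recommendations)
    saved = sum(
        rec["total_series_before_aggregation"] - rec["total_series_after_aggregation"]
        for rec in recommendations
        # Series counts are only present on 'add' recommendations.
        if "total_series_before_aggregation" in rec
        and "total_series_after_aggregation" in rec
    )
    return {"actions": dict(actions), "estimated_series_saved": saved}


# Values taken from the 'add' entry in the example file: 50295 - 19425 = 30870.
example = [
    {"metric": "a", "recommended_action": "keep"},
    {
        "metric": "b",
        "recommended_action": "add",
        "total_series_before_aggregation": 50295,
        "total_series_after_aggregation": 19425,
    },
]
summary = summarize(example)
```

Because total_series_after_aggregation is itself an estimate, the computed saving should be read as approximate.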
Note: Recommendations are based on a snapshot of the currently applied rules seen by the recommendations engine at the time it most recently ran. Currently, the recommendations engine runs once every 24 hours. This means that the recommended rule set might be out of sync with the rules currently used by the aggregations service.
Aggregations service
The aggregations service provides a way for you to aggregate metrics into lower cardinality versions of themselves. Users can define and apply their own aggregation rules, or apply the rules recommended by the recommendations service.
Aggregation rule format
The following example shows an aggregation rule for the metric proxy_sql_queries_total:
{
  "metric": "proxy_sql_queries_total",
  "drop_labels": ["container", "instance", "namespace", "pod"],
  "aggregations": ["sum:counter"]
}
A description of the fields follows:
- metric is the name of the metric to be aggregated.
- drop_labels is a list of the labels that will be removed from the metric via aggregation.
- aggregations is a list of the aggregation functions to apply to the metric.
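Before uploading rules, it can help to check them locally. The following Python sketch validates a single rule against the fields described above and the values listed under Supported aggregation types. The validate_rule function is hypothetical, and the requirement that drop_labels and aggregations be non-empty is an assumption made for this example; the service may apply different checks.

```python
SUPPORTED_AGGREGATIONS = {"sum:counter", "sum", "min", "max", "count"}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems found in one aggregation rule (empty if none)."""
    problems = []
    metric = rule.get("metric")
    if not isinstance(metric, str) or not metric:
        problems.append("'metric' must be a non-empty string")
    if rule.get("drop") is True:
        # A drop rule (see Drop a metric) discards the whole metric,
        # so no labels or aggregations are expected.
        return problems
    drop_labels = rule.get("drop_labels")
    if not isinstance(drop_labels, list) or not drop_labels:
        problems.append("'drop_labels' must be a non-empty list (assumption)")
    aggregations = rule.get("aggregations")
    if not isinstance(aggregations, list) or not aggregations:
        problems.append("'aggregations' must be a non-empty list (assumption)")
    else:
        for agg in aggregations:
            if not isinstance(agg, str) or agg not in SUPPORTED_AGGREGATIONS:
                problems.append(f"unsupported aggregation: {agg!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large rule file at once.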
Supported aggregation types
The following values are supported for the aggregations field of an aggregation rule:
- "sum:counter"
- "sum"
- "min"
- "max"
- "count"
Configure an aggregation
As an illustration, think of a power grid that monitors the energy consumption of houses on different city streets. An example metric that expresses per-building consumption could be electrical_throughput_total with labels street_name and building_number. Given that you only care about maximum and minimum consumption at a per-street level (as opposed to detailed consumption data for every building), you could configure an aggregation rule as follows:
{
  "metric": "electrical_throughput_total",
  "drop_labels": ["building_number"],
  "aggregations": ["max", "min"]
}
Based on the preceding configuration, the aggregation service would discard the label building_number from the aggregated metric electrical_throughput_total.
In its place, it would compute and store aggregated values per street for this metric.
This means that it would compute and persist the maximum (max) and minimum (min) values of the electrical throughput on every street in the label set. The specific building or buildings that consumed the maximum and minimum amounts of electricity would no longer be identifiable.
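The effect of this rule can be illustrated with a small Python sketch that groups sample values by the labels that remain after building_number is dropped. It is a simplified model for a single point in time, with made-up street names and values; the real aggregation service works on streams of samples, and the aggregate function here is hypothetical.

```python
from collections import defaultdict


def aggregate(samples, drop_labels, aggregations):
    """Group samples by the labels that remain and apply each aggregation."""
    funcs = {"max": max, "min": min}
    groups = defaultdict(list)
    for labels, value in samples:
        # The remaining labels identify the aggregated series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        groups[kept].append(value)
    return {
        kept: {name: funcs[name](values) for name in aggregations}
        for kept, values in groups.items()
    }


# Made-up readings for three buildings on two streets.
samples = [
    ({"street_name": "Main St", "building_number": "1"}, 120.0),
    ({"street_name": "Main St", "building_number": "2"}, 95.5),
    ({"street_name": "Oak Ave", "building_number": "7"}, 40.0),
]
result = aggregate(samples, {"building_number"}, ["max", "min"])
```

In the result, the two Main St buildings collapse into one entry holding a max of 120.0 and a min of 95.5, and nothing in that entry says which building produced either value.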
Drop a metric
You can also configure an aggregation rule that causes the entire metric to be dropped. If you don’t want to persist any time series at all for electrical_throughput_total, from the example in Configure an aggregation, you would configure a rule as follows:
{
  "metric": "electrical_throughput_total",
  "drop": true
}
This might be useful in cases where a metric originates in many different locations and it would be hard to configure every site of origin to drop the metric on the client side.
CLI workflow
Understand the high-level workflow with the CLI:
- Download recommendations of what metrics to aggregate.
- Use those recommendations to create your own set of aggregation rules.
- Upload that set of aggregation rules.
The CLI also enables you to view, edit, and delete existing aggregation rules that have already been applied.
Use the Adaptive Metrics CLI
To use the CLI tool, you’ll need the following key information:
- URL: In the form https://<your-grafana-cloud-prom-url>.grafana.net/. To find your URL value, go to your grafana.com account and check the Details page of your hosted Prometheus endpoint.
- TENANT: The numeric instance ID where Adaptive Metrics is set up. To find your TENANT value, go to your grafana.com account and check the Details page of your hosted Prometheus endpoint for Username / Instance ID.
- KEY: An API key with the appropriate permissions. If you are using Grafana Cloud API keys, make sure that KEY is an API key with the Admin role. If you are using Grafana Cloud Access Policies, make sure KEY is an API key with metrics:read and metrics:write scopes for the stack ID where you have enabled Adaptive Metrics.
Download the Adaptive Metrics CLI from the URL for the build that corresponds to your platform:
- https://dl.grafana.com/files/adaptive-cli/adaptive-cli.linux.amd64
  SHA256 Sum: 3431da1dd9d1f041391c0646409882fe1e324ca52f44c019eeb8603c084a844e
- https://dl.grafana.com/files/adaptive-cli/adaptive-cli.linux.arm64
  SHA256 Sum: 93c30f8ced6c37e84b5e9946af170c751e0966fdf56ed6881d7a31b447263d73
- https://dl.grafana.com/files/adaptive-cli/adaptive-cli.darwin.amd64
  SHA256 Sum: 407ee409c6758af32065e475ddfc905cc520b84a2ef565fa912374c1866b8f61
- https://dl.grafana.com/files/adaptive-cli/adaptive-cli.darwin.arm64
  SHA256 Sum: 2a5379357a8ec6e1eeaec3d0b6c55d44762e02a06c4ad1d3b19226390fde040c
Launch the CLI using the following command:
./adaptive-cli.<your-distro> --user $TENANT --url $URL --password $KEY
Substitute the TENANT, URL, and KEY values described at the start of this section for $TENANT, $URL, and $KEY in the previous command.
Use the show recommendations command to pull down the most recently generated recommendations from the recommendations service.
For built-in help documentation about the CLI tool, launch the tool in interactive mode (adding the --repl flag) and then type --help.
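The SHA256 sums published next to each download link can be used to check that the binary was downloaded intact. The following Python sketch shows one way to do that with the standard library. The function names are hypothetical, and the file path in the usage note is an example.

```python
import hashlib


def file_chunks(path: str, size: int = 1 << 20):
    """Yield the file's contents in fixed-size chunks to limit memory use."""
    with open(path, "rb") as handle:
        while chunk := handle.read(size):
            yield chunk


def sha256_of_chunks(chunks) -> str:
    """Return the hex SHA256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_published_sum(path: str, published: str) -> bool:
    """Compare a downloaded file against the sum published on this page."""
    return sha256_of_chunks(file_chunks(path)) == published.strip().lower()
```

For example, matches_published_sum("adaptive-cli.linux.amd64", "3431da1d...") should return True for an intact Linux amd64 download, where the second argument is the full published sum for that build.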
Example aggregation rule
Each aggregation rule looks similar to this:
{
  "metric": "agent_request_duration_seconds_sum",
  "drop_labels": [
    "container",
    "instance",
    "method",
    "namespace",
    "pod",
    "provider",
    "status_code",
    "ws"
  ],
  "aggregations": ["sum:counter"]
}
In the preceding example:
- metric is the name of the metric to be aggregated.
- drop_labels is an array of the labels that will be removed by the aggregations service.
- aggregations is an array of the aggregation types to calculate for this metric. For the valid values, see Supported aggregation types.
You can use an aggregation rule file to define multiple rules simultaneously.
The following example rule file is an array of one or more aggregation rules:
[
  {
    "metric": "agent_request_duration_seconds_sum",
    "drop_labels": ["namespace", "pod"],
    "aggregations": ["sum:counter"]
  },
  {
    "metric": "prometheus_request_duration_seconds_sum",
    "drop_labels": ["container", "instance", "ws"],
    "aggregations": ["sum:counter"]
  }
]
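If you start from a verbose recommendations download, one possible way to turn it into a rule file like the one above is to keep only the rule fields and skip entries whose recommended_action is remove. The following Python sketch assumes that approach; the to_rule_file helper is hypothetical, and you might prefer to review each recommendation by hand instead.

```python
import json

# Fields that make up an aggregation rule; the other fields in a verbose
# recommendation are informational only.
RULE_FIELDS = ("metric", "drop_labels", "aggregations")


def to_rule_file(verbose_recommendations: list[dict]) -> str:
    """Build rule-file JSON from verbose recommendations."""
    rules = [
        {name: rec[name] for name in RULE_FIELDS if name in rec}
        for rec in verbose_recommendations
        # A 'remove' recommendation means the rule should no longer exist.
        if rec.get("recommended_action") != "remove"
    ]
    return json.dumps(rules, indent=2)
```

The returned string can be written to a file and uploaded with the CLI or the API.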
Apply aggregation rules
After you add (create aggregations), modify (edit aggregations), or delete (delete aggregations) an aggregation rule, the CLI’s show aggregations command reflects the change. Use this command to get the most current picture of which aggregation rules are active in your environment.
There is a delay between uploading new aggregation rules and those metrics aggregations taking effect in your environment. In most cases, the delay is approximately 5-10 minutes, but we currently have no mechanism to let you know precisely when new aggregations take effect.
To check, query a metric whose aggregation rule you added or changed and look at the value of its __dropped_labels__ label. When this value reflects the changes you’ve made, your updated aggregation rules are live in your environment.
We currently limit how often new aggregation rules can be applied. Although you can upload as many new versions of your aggregation rules as you like, those updates are only applied once every 10 minutes. If you make multiple updates in quick succession, the system applies your first received (oldest) update. Then, 10 minutes later, the most recently received update is applied. The intermediate updates never get applied.
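The throttling behavior described above can be modeled with a short simulation. This is a simplified sketch for building intuition, not the service's implementation: the applied_updates function is hypothetical, and it assumes the first upload is applied at the moment it is received.

```python
APPLY_INTERVAL = 600  # seconds; the 10-minute limit described above


def applied_updates(uploads: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Given (upload_time, version) pairs sorted by time, return the
    (apply_time, version) pairs that would actually be applied."""
    applied = []
    next_slot = None  # earliest time the next update may be applied
    pending = None    # most recent upload still waiting for its slot
    for when, version in uploads:
        # Flush a waiting update whose slot opened before this upload arrived.
        if pending is not None and when >= next_slot:
            applied.append((next_slot, pending))
            next_slot += APPLY_INTERVAL
            pending = None
        if next_slot is None or when >= next_slot:
            applied.append((when, version))  # nothing in the way: apply now
            next_slot = when + APPLY_INTERVAL
        else:
            pending = version  # replaces any earlier upload still waiting
    if pending is not None:
        applied.append((next_slot, pending))
    return applied


# Four uploads one minute apart: only the first and the last are applied.
history = applied_updates([(0, "v1"), (60, "v2"), (120, "v3"), (180, "v4")])
```

In this model, v2 and v3 are never applied, which matches the description of intermediate updates being skipped.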
Adaptive Metrics API
The Adaptive Metrics CLI is a wrapper around an API. You can use the underlying API directly if you choose. This API is under active development and is subject to change.
List recommendations
Download our recommendations for metrics to aggregate using the following command. URL, TENANT, and KEY are the values described in Use the Adaptive Metrics CLI.
curl -u "$TENANT:$KEY" "$URL/aggregations/recommendations"
KEY must have metrics:read scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.
You can use an optional verbose flag to retrieve more information about each recommendation:
curl -u "$TENANT:$KEY" "$URL/aggregations/recommendations?verbose=true"
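If you prefer to call the API from a script instead of curl, the same request can be built with Python's standard library. This is a minimal sketch that only constructs the request; the recommendations_request function is a hypothetical helper, and the endpoint path and verbose parameter are taken from the curl commands above.

```python
import base64
import urllib.parse
import urllib.request


def recommendations_request(url: str, tenant: str, key: str, verbose: bool = False):
    """Build (but do not send) a GET request for the recommendations endpoint."""
    # Strip a trailing slash so the path is not doubled.
    endpoint = url.rstrip("/") + "/aggregations/recommendations"
    if verbose:
        endpoint += "?" + urllib.parse.urlencode({"verbose": "true"})
    token = base64.b64encode(f"{tenant}:{key}".encode()).decode()
    request = urllib.request.Request(endpoint, method="GET")
    # Equivalent to curl's -u "$TENANT:$KEY" (HTTP basic authentication).
    request.add_header("Authorization", f"Basic {token}")
    return request
```

To send the request, pass it to urllib.request.urlopen and parse the response body as JSON.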
List current recommendations configuration
Download the current configuration of the recommendations service using the following command:
curl -u "$TENANT:$KEY" "$URL/aggregations/recommendations/config"
KEY must have metrics:read scope if using Grafana Cloud Access Policies. If using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.
The only tunable parameter exposed by the recommendations service is the keep_labels parameter. This parameter allows the user to define a list of labels that they never want recommended for aggregation. This can be useful at organizations where certain labels are always expected on metrics, regardless of whether or not those labels have been recently queried.
An example response from the /recommendations/config endpoint would look as follows:
{
  "keep_labels": [
    "instance",
    "pod"
  ]
}
The preceding response indicates that the recommendations service has been configured to never recommend aggregating the instance or pod labels.
Update recommendations configuration
Upload new recommendations configuration using the following command:
curl -u "$TENANT:$KEY" --request POST --data @config.json "$URL/aggregations/recommendations/config"
KEY must have metrics:write scope if using Grafana Cloud Access Policies. If using Grafana Cloud API keys, KEY must be for Admin or MetricsPublisher roles.
This command uses the same endpoint described in List current recommendations configuration and expects the same JSON format.
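The config.json file referenced in the upload command can be written by hand or generated. As a small sketch, the following Python function produces the JSON body in the format shown in the example response; the config_json helper is hypothetical, and the deduplication and sorting are conveniences rather than API requirements.

```python
import json


def config_json(keep_labels: list[str]) -> str:
    """Return the JSON body for the recommendations configuration."""
    # Deduplicate and sort so repeated runs produce identical output.
    payload = {"keep_labels": sorted(set(keep_labels))}
    return json.dumps(payload, indent=2)
```

Writing the returned string to config.json gives you a file you can pass to curl with --data @config.json.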
List currently applied aggregation rules
Download your existing aggregation rules:
curl -u "$TENANT:$KEY" "$URL/aggregations/rules"
KEY must have metrics:read scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.
Upload new aggregation rules
Uploading new aggregation rules is a multi-step process:
- Fetch the currently applied rules.
- Modify rules locally.
- Upload rules back.
Fetch the currently applied rules
Use this command:
curl -u "$TENANT:$KEY" -D headers.txt "$URL/aggregations/rules" > rules.json
KEY must have metrics:read scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or Viewer roles.
The preceding command uses the same endpoint described in List currently applied aggregation rules, but adds a -D headers.txt argument, which stores the response headers in a file called headers.txt.
This step is required if you want to then upload a new rule file, for example if you want to update the existing aggregation rules you have in place. The information in these headers ensures there are no update collisions. An update collision is the scenario where multiple users try to edit the rules file at the same time and overwrite one another’s changes.
Modify the rules locally
Use your editor of choice to modify the rules.json file downloaded in the prior step.
Upload rules back
The API supports uploading an entire rules file.
Warning: THIS ACTION WILL OVERWRITE YOUR EXISTING RULE FILE. If you prefer to append to your existing rules, you must use the CLI instead.
To upload your modified rules.json file from the previous step, use the following shell script:
#!/usr/bin/env bash
# Usage: ./rules_upload.sh <rules_file.json>
TMPFILE=$(mktemp)
trap 'rm "$TMPFILE"' EXIT
# Turn the saved ETag response header into an If-Match request header.
cat headers.txt | grep -i '^etag:' | sed 's/^ETag:/If-Match:/i' > "$TMPFILE"
curl --request POST --header @"$TMPFILE" --data-binary @"$1" -u "$TENANT:$KEY" "$URL/aggregations/rules"
KEY must have metrics:write scope if you are using Grafana Cloud Access Policies. If you are using Grafana Cloud API keys, KEY must be for Admin or MetricsPublisher roles.
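The cat headers.txt pipeline reads the headers.txt file created by the previous curl call that pulled down the existing aggregation rules, extracts the ETag header, and writes it to a temporary file as an If-Match header. The headers.txt file itself is not modified.
The curl --request POST command uploads your new rules file together with that If-Match header.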
Save the shell script as rules_upload.sh.
To run that script, use the following command:
./rules_upload.sh <your_new_rules_file.json>
Replace <your_new_rules_file.json> with the name of the rules file you wish to upload.
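The header rewriting done by the grep and sed pipeline in the upload script can also be expressed in Python, which may be useful if you script the upload on a platform without those tools. This sketch only performs the header conversion; the if_match_header function is a hypothetical helper.

```python
def if_match_header(raw_headers: str) -> str:
    """Turn the ETag line from saved response headers into an If-Match header."""
    for line in raw_headers.splitlines():
        # Split on the first colon only, since header values can contain colons.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "etag":
            return f"If-Match: {value.strip()}"
    raise ValueError("no ETag header found; fetch the rules again with -D headers.txt")
```

Sending the rules back with this header lets the server reject the upload if the rules changed after you fetched them, which is how update collisions are avoided.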
Aggregation service: requirements on sample age
We can only aggregate raw samples that are relatively recent. For metrics that are being aggregated, Grafana Cloud rejects samples that arrive more than 60 seconds late. That is, if the difference between the wall clock time at which a sample arrives at Grafana Cloud and the timestamp on that sample (which indicates when it was collected) is greater than 60 seconds, Grafana Cloud rejects that sample.
If Grafana Cloud rejects samples for this reason, you will see an increase in forwarded-samples-too-old errors on the Discarded Metrics Samples panel of your billing dashboard.
This sample age requirement only applies to samples that belong to metrics that are being aggregated.
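The age check described above amounts to a single comparison. The following Python sketch expresses it with both times in seconds since the epoch; the function and constant names are hypothetical, and it applies only to samples of metrics that are being aggregated.

```python
MAX_SAMPLE_AGE_SECONDS = 60  # limit stated above for aggregated metrics


def is_too_old(sample_timestamp: float, arrival_time: float) -> bool:
    """Return True if a sample would be rejected for arriving too late."""
    # Rejected only when the delay is strictly greater than the limit.
    return arrival_time - sample_timestamp > MAX_SAMPLE_AGE_SECONDS
```

A sample collected at t=1000 that arrives at t=1061 would be rejected, while one that arrives at t=1060 would still be accepted.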
Why this happens
To compute an aggregation, we must wait for all raw samples associated with that metric to arrive. We don’t know how many samples will arrive, nor can we wait indefinitely for them, because the longer we wait, the longer the delay before the data can be queried or shown in dashboards.
If a sample arrives after our configured waiting time, it does not get taken into account during the computation of the aggregated value. Because our metrics database is immutable once the aggregation has been computed, we cannot update the aggregated value to reflect this late arriving data point.
Troubleshooting
If you encounter issues querying a metric that has been aggregated, see Troubleshoot your aggregated metrics query. For any other questions or feedback, contact your Customer Success Manager or file a Support request.
Security warning when running the CLI on macOS
If you try to run the CLI on macOS and get a security warning that it can’t be opened because Apple cannot check it, perform the following steps:
- Open System Settings.
- Navigate to Privacy & Security.
- Scroll down to Security.
- Locate the option to run the CLI.