TraceQL metrics functions
TraceQL supports rate, count_over_time, min_over_time, max_over_time, avg_over_time, quantile_over_time, histogram_over_time, and compare functions.
Available functions
These functions can be added as an operator at the end of any TraceQL query.
rate
- Calculates the number of matching spans per second.
count_over_time
- Counts the number of matching spans per time interval (refer to the step API parameter).
min_over_time
- Returns the minimum value for the specified attribute across all matching spans per time interval (refer to the step API parameter).
max_over_time
- Returns the maximum value for the specified attribute across all matching spans per time interval (refer to the step API parameter).
avg_over_time
- Returns the average value for the specified attribute across all matching spans per time interval (refer to the step API parameter).
quantile_over_time
- The quantile of the values in the specified interval.
histogram_over_time
- Evaluate frequency distribution over time. Example:
histogram_over_time(duration) by (span.foo)
compare
- Used to split the stream of spans into two groups: a selection and a baseline. The function returns time-series for all attributes found on the spans to highlight the differences between the two groups.
The rate function
The rate function calculates the number of spans per second that match the given span selectors.
Parameters
None.
Examples
The following query shows the rate of errors by service and span name. This is a TraceQL-specific way of gathering rate metrics that would otherwise be generated by the span metrics processor.
For example, this query:
{ status = error } | rate() by (resource.service.name, name)
is equivalent to running a rate query over the metrics generated from spans by the span metrics processor.
This example calculates the rate of the erroring spans coming from the service foo.
Rate is a spans/sec quantity.
{ resource.service.name = "foo" && status = error } | rate()
Combined with the by() operator, this can be even more powerful.
{ resource.service.name = "foo" && status = error } | rate() by (span.http.route)
This example still rates the erroring spans in the service foo, but the metrics are broken down by HTTP route.
This might let you determine that /api/sad had a higher rate of erroring spans than /api/happy, for example.
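To make the spans/sec arithmetic concrete, here is a minimal Python sketch of what a grouped rate amounts to for a single step interval. It is an illustration only, not Tempo's implementation: the rate_by helper, the span dictionaries, and the 60-second step are assumptions made for this example.

```python
from collections import Counter

def rate_by(spans, step_seconds, key):
    """Per-second rate of spans in one step interval, grouped by `key`.

    Hypothetical helper for illustration; not part of Tempo.
    """
    counts = Counter(span[key] for span in spans)
    return {group: n / step_seconds for group, n in counts.items()}

# Made-up erroring spans observed during a single 60-second step.
spans = (
    [{"span.http.route": "/api/sad"}] * 30
    + [{"span.http.route": "/api/happy"}] * 6
)

rates = rate_by(spans, step_seconds=60, key="span.http.route")
print(rates)  # {'/api/sad': 0.5, '/api/happy': 0.1}
```

In this made-up data, /api/sad errors at 0.5 spans/sec, five times the rate of /api/happy.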
The count_over_time function
The count_over_time() function counts the number of matching spans per time interval.
The time interval that the count is computed over is set by the step parameter.
For more information, refer to the step API parameter.
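As a rough sketch of where step fits, the following Python snippet builds a range-query URL that carries a TraceQL metrics query together with a step value. The /api/metrics/query_range path and the q, start, end, and step parameter names are assumptions based on Tempo's HTTP API and should be checked against the API documentation for your version. The base URL, the timestamps, and the build_query_range_url helper are hypothetical.

```python
from urllib.parse import urlencode

def build_query_range_url(base_url, query, start, end, step):
    """Build a metrics range-query URL (hypothetical helper).

    Assumes the endpoint path and the parameter names q, start, end, step;
    verify them against your Tempo version's API documentation.
    """
    params = urlencode({"q": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/metrics/query_range?{params}"

url = build_query_range_url(
    "http://localhost:3200",  # hypothetical Tempo address
    '{ name = "GET /:endpoint" } | count_over_time() by (span.http.status_code)',
    start=1700000000,  # made-up Unix timestamps, one hour apart
    end=1700003600,
    step="1m",         # one data point per minute
)
print(url)
```

With a one-hour range and a step of one minute, each returned series would hold roughly 60 data points.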
Example
This example counts the number of spans with name "GET /:endpoint" broken down by status code. You might see that there are 10 "GET /:endpoint" spans with status code 200 and 15 "GET /:endpoint" spans with status code 400.
{ name = "GET /:endpoint" } | count_over_time() by (span.http.status_code)
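The bucketing that the step parameter implies can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Tempo's implementation: the count_over_time helper below is hypothetical, spans are plain dictionaries, and start is a made-up timestamp in seconds.

```python
from collections import Counter

def count_over_time(spans, step_seconds, key):
    """Count spans per (interval start, group) bucket (illustrative only)."""
    counts = Counter()
    for span in spans:
        # Align each span's start time to the beginning of its step interval.
        bucket = span["start"] - span["start"] % step_seconds
        counts[(bucket, span[key])] += 1
    return dict(counts)

# Made-up spans; "start" is in seconds.
spans = [
    {"start": 5, "span.http.status_code": 200},
    {"start": 42, "span.http.status_code": 200},
    {"start": 59, "span.http.status_code": 400},
    {"start": 61, "span.http.status_code": 200},
]

counts = count_over_time(spans, step_seconds=60, key="span.http.status_code")
for (bucket, status), n in sorted(counts.items()):
    print(f"interval starting at {bucket}s, status {status}: {n} span(s)")
```

Each distinct status code becomes its own series, with one count per step interval.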
The min_over_time, max_over_time, and avg_over_time functions
The min_over_time() function lets you aggregate numerical attributes by calculating their minimum value.
For example, you could calculate the minimum duration of a group of spans, or the minimum value of a custom attribute you’ve attached to your spans, like span.shopping.cart.entries.
The time interval that the minimum is computed over is set by the step parameter.
The max_over_time() function lets you aggregate numerical values by computing their maximum value, such as the all-important span duration.
The time interval that the maximum is computed over is set by the step parameter.
The avg_over_time() function lets you aggregate numerical values by computing their average value, such as the all-important span duration.
The time interval that the average is computed over is set by the step parameter.
For more information, refer to the step API parameter.
Parameters
Numerical field that you want to calculate the minimum, maximum, or average of.
Examples
This example computes the minimum duration for each http.target of all spans named "GET /:endpoint".
Any numerical attribute on the span is fair game.
{ name = "GET /:endpoint" } | min_over_time(duration) by (span.http.target)
This example computes the minimum status code value of all spans named "GET /:endpoint".
{ name = "GET /:endpoint" } | min_over_time(span.http.status_code)
This example computes the maximum duration for each http.target of all spans named "GET /:endpoint".
{ name = "GET /:endpoint" } | max_over_time(duration) by (span.http.target)
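This example computes the maximum value of the span.http.response.size attribute across all spans named "GET /:endpoint".
{ name = "GET /:endpoint" } | max_over_time(span.http.response.size)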
This example computes the average duration for each http.status_code of all spans named "GET /:endpoint".
{ name = "GET /:endpoint" } | avg_over_time(duration) by (span.http.status_code)
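This example computes the average value of the span.http.response.size attribute across all spans named "GET /:endpoint".
{ name = "GET /:endpoint" } | avg_over_time(span.http.response.size)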
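The three functions differ only in the aggregation applied to each step interval. The Python sketch below illustrates that idea. It is not Tempo's implementation: the aggregate_over_time helper is hypothetical, and the spans, timestamps, and response sizes are made up. The sketch also assumes that spans missing the attribute are skipped.

```python
from statistics import mean

def aggregate_over_time(spans, field, step_seconds, agg):
    """Apply `agg` (min, max, or mean) to `field` per step interval.

    Hypothetical helper; assumes spans without the attribute are skipped.
    """
    buckets = {}
    for span in spans:
        if field not in span:
            continue
        # Align each span's start time to the beginning of its step interval.
        bucket = span["start"] - span["start"] % step_seconds
        buckets.setdefault(bucket, []).append(span[field])
    return {bucket: agg(values) for bucket, values in sorted(buckets.items())}

# Made-up spans; "start" is in seconds, sizes are in bytes.
field = "span.http.response.size"
spans = [
    {"start": 10, field: 100},
    {"start": 30, field: 300},
    {"start": 70, field: 500},
    {"start": 80},  # no response size recorded on this span
]

minimums = aggregate_over_time(spans, field, 60, min)
maximums = aggregate_over_time(spans, field, 60, max)
averages = aggregate_over_time(spans, field, 60, mean)
print(minimums, maximums, averages)
```

For the first interval, the made-up sizes 100 and 300 give a minimum of 100, a maximum of 300, and an average of 200.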
The quantile_over_time and histogram_over_time functions
The quantile_over_time() and histogram_over_time() functions let you aggregate numerical values, such as the all-important span duration.
You can specify multiple quantiles in the same query.
The example below computes the 99th, 90th, and 50th percentile of the duration attribute on all spans with name GET /:endpoint.
{ name = "GET /:endpoint" } | quantile_over_time(duration, .99, .9, .5)
You can group by any span or resource attribute.
{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target)
Quantiles aren’t limited to span duration.
Any numerical attribute on the span is fair game.
To demonstrate this flexibility, consider this nonsensical quantile on span.http.status_code:
{ name = "GET /:endpoint" } | quantile_over_time(span.http.status_code, .99, .9, .5)
This computes the 99th, 90th, and 50th percentile of the values of the status_code attribute for all spans named GET /:endpoint.
This is unlikely to tell you anything useful (what does a median status code of 347 mean?), but it works.
As a further example, imagine a custom attribute like span.temperature.
You could use a similar query to know what the 50th percentile and 95th percentile temperatures were across all your spans.
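To show what a quantile over an arbitrary numerical attribute means, here is a small Python sketch using the nearest-rank definition. It is an exact computation for illustration only. Tempo's own calculation may differ, for example by estimating quantiles from bucketed values. The quantile helper and the temperature readings are made up.

```python
import math

def quantile(values, q):
    """Nearest-rank quantile: the smallest value with at least a fraction q
    of the data at or below it. Illustrative only."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Made-up span.temperature values from ten spans.
temperatures = [18.5, 19.0, 20.5, 21.0, 22.5, 23.0, 24.5, 25.0, 30.0, 41.0]

p50 = quantile(temperatures, 0.5)
p95 = quantile(temperatures, 0.95)
print(f"p50={p50} p95={p95}")  # p50=22.5 p95=41.0
```

The high quantile is dominated by the single outlier of 41.0, which is the kind of tail behavior that high quantiles are meant to surface.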
The compare function
The compare function splits a set of spans into two groups: a selection and a baseline.
It returns time-series for all attributes found on the spans to highlight the differences between the two groups.
This is a powerful function that’s best understood by using the Comparison tab in Explore Traces. You can also understand this function by looking at the example outputs below.
The function is used like other metrics functions: when it’s placed after any trace query, it converts the query into a metrics query:
...any spanset pipeline... | compare({subset filters}, <topN>, <start timestamp>, <end timestamp>)
Example:
{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})
This function is generally run as an instant query. An instant query gives a single value at the end of the selected time range. Instant queries are quicker to execute, and their results are often easier to understand. When run as a range query, the results may exceed gRPC payload limits.
Parameters
The compare function has four parameters:
- Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If it matches, then the span is considered to be part of the selection. Otherwise, it is part of the baseline. Common filters are expected to be things like {status=error} (what is different about errors?) or {duration>1s} (what is different about slow spans?).
- Optional. The second parameter is the top N values to return per attribute. If an attribute exceeds this limit in either the selection group or baseline group, then only the top N values (based on frequency) are returned, and an error indicator for the attribute is included in the output (see below). Defaults to 10.
- Optional. The third and fourth parameters are the start and end timestamps in Unix nanoseconds, which can be used to constrain the selection window by time, in addition to the filter. For example, the overall query could cover the past hour, and the selection window only a 5 minute time period in which there was an anomaly. These timestamps must both be given, or neither.
Output
The outputs are flat time-series for each attribute/value found in the spans.
Each series has a label __meta_type which denotes which group it is in, either selection or baseline.
Example output series:
{ __meta_type="baseline", resource.cluster="prod" } 123
{ __meta_type="baseline", resource.cluster="qa" } 124
{ __meta_type="selection", resource.cluster="prod" } 456 <--- significant difference detected
{ __meta_type="selection", resource.cluster="qa" } 125
{ __meta_type="selection", resource.cluster="dev"} 126 <--- cluster=dev was found in the highlighted spans but not in the baseline
When an attribute exceeds the top N limit, an error indicator is also present.
This example means the attribute resource.cluster had too many values.
{ __meta_error="__too_many_values__", resource.cluster=<nil> }
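The selection and baseline split, together with the top N limit, can be sketched in Python as follows. This is a simplified illustration of the behavior described in this section, not Tempo's implementation. The compare helper, the span dictionaries, and the predicate are hypothetical. The optional start and end timestamps are left out, and None stands in for the <nil> value shown in the error indicator.

```python
from collections import Counter, defaultdict

def compare(spans, in_selection, top_n=10):
    """Split spans into selection and baseline, then count attribute values.

    Hypothetical helper for illustration; not Tempo's implementation.
    """
    # counts[(group, attribute)] maps each attribute value to its frequency.
    counts = defaultdict(Counter)
    for span in spans:
        group = "selection" if in_selection(span) else "baseline"
        for attr, value in span["attributes"].items():
            counts[(group, attr)][value] += 1

    series, too_many = [], set()
    for (group, attr), values in counts.items():
        if len(values) > top_n:
            too_many.add(attr)  # the attribute exceeded the limit in this group
        # Keep only the top_n most frequent values per group and attribute.
        for value, n in values.most_common(top_n):
            series.append(({"__meta_type": group, attr: value}, n))

    errors = [
        {"__meta_error": "__too_many_values__", attr: None}
        for attr in sorted(too_many)
    ]
    return series, errors

# Made-up spans: three errors (the selection) and two successes (the baseline).
spans = [
    {"status": "error", "attributes": {"resource.cluster": "prod"}},
    {"status": "error", "attributes": {"resource.cluster": "prod"}},
    {"status": "error", "attributes": {"resource.cluster": "dev"}},
    {"status": "ok", "attributes": {"resource.cluster": "prod"}},
    {"status": "ok", "attributes": {"resource.cluster": "qa"}},
]

series, errors = compare(spans, lambda s: s["status"] == "error", top_n=1)
for labels, value in series:
    print(labels, value)
print(errors)
```

With top_n=1, only the most frequent cluster per group is kept, and resource.cluster is flagged because both groups contain more than one distinct value.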