This is documentation for the next version of Tempo. For the latest stable release, go to the latest version.

TraceQL metrics functions

TraceQL supports the rate, count_over_time, min_over_time, max_over_time, avg_over_time, quantile_over_time, histogram_over_time, and compare functions.

Available functions

These functions can be added as an operator at the end of any TraceQL query.

rate
Calculates the number of matching spans per second.
count_over_time
Counts the number of matching spans per time interval (refer to the step API parameter).
min_over_time
Returns the minimum value for the specified attribute across all matching spans per time interval (refer to the step API parameter).
max_over_time
Returns the maximum value for the specified attribute across all matching spans per time interval (refer to the step API parameter).
avg_over_time
Returns the average value for the specified attribute across all matching spans per time interval (refer to the step API parameter).
quantile_over_time
Returns the requested quantiles of the values of the specified attribute across all matching spans per time interval (refer to the step API parameter).
histogram_over_time
Evaluate frequency distribution over time. Example: histogram_over_time(duration) by (span.foo)
compare
Used to split the stream of spans into two groups: a selection and a baseline. The function returns time-series for all attributes found on the spans to highlight the differences between the two groups.

The rate function

The rate function calculates the number of spans per second that match the given span selectors.

Parameters

None.

Examples

The following query shows the rate of errors by service and span name. This is a TraceQL specific way of gathering rate metrics that would otherwise be generated by the span metrics processor.

For example, this query:

{ status = error } | rate() by (resource.service.name, name)

is equivalent to running the corresponding rate query against span-generated metrics.

This example calculates the rate of the erroring spans coming from the service foo. Rate is a spans/sec quantity.

{ resource.service.name = "foo" && status = error } | rate()

Combined with the by() operator, this can be even more powerful.

{ resource.service.name = "foo" && status = error } | rate() by (span.http.route)

This example still rates the erroring spans in the service foo but the metrics are broken down by HTTP route. This might let you determine that /api/sad had a higher rate of erroring spans than /api/happy, for example.
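To make the spans/sec unit concrete, here is a minimal Python sketch of the arithmetic behind a rate series. It is an illustration only, not Tempo's implementation: the function name rate_per_step, the sample timestamps, and the assumption that each value is the bucket's span count divided by the step length are all hypothetical.

```python
import math
from collections import Counter

def rate_per_step(span_times, start, end, step):
    """Return spans/sec for each step-sized bucket covering [start, end)."""
    counts = Counter()
    for t in span_times:
        if start <= t < end:
            counts[int((t - start) // step)] += 1
    n_buckets = math.ceil((end - start) / step)
    # Assumption: rate = number of matching spans in the bucket / step seconds.
    return [counts[i] / step for i in range(n_buckets)]

# Hypothetical start times (in seconds) of spans that matched the filter.
matching = [0, 1, 2, 30, 61, 62, 119]
rates = rate_per_step(matching, start=0, end=120, step=60)
print(rates)  # first bucket holds 4 spans, second bucket holds 3
```

With a 60-second step, four matching spans in a bucket correspond to a rate of 4/60 spans per second.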

The count_over_time function

The count_over_time() function counts the number of matching spans per time interval. The time interval that the count will be computed over is set by the step parameter. For more information, refer to the step API parameter.

Example

This example counts the number of spans with name "GET /:endpoint" broken down by status code. You might see that there are 10 "GET /:endpoint" spans with status code 200 and 15 "GET /:endpoint" spans with status code 400.

{ name = "GET /:endpoint" } | count_over_time() by (span.http.status_code)
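The grouping behaviour of by() can be sketched in Python as one counter per label value and per step-sized interval. This is a simplified model under stated assumptions, not Tempo's code: the span dictionaries, the count_over_time helper, and the 60-second step are made up for illustration.

```python
from collections import defaultdict

def count_over_time(spans, start, step, by):
    """Count spans per step-sized bucket, one series per value of `by`."""
    series = defaultdict(lambda: defaultdict(int))
    for span in spans:
        bucket = int((span["time"] - start) // step)
        series[span[by]][bucket] += 1
    return {label: dict(buckets) for label, buckets in series.items()}

# Hypothetical spans named "GET /:endpoint" (time in seconds).
spans = [
    {"time": 5, "http.status_code": 200},
    {"time": 20, "http.status_code": 200},
    {"time": 45, "http.status_code": 400},
    {"time": 70, "http.status_code": 200},
    {"time": 95, "http.status_code": 400},
]
counts = count_over_time(spans, start=0, step=60, by="http.status_code")
print(counts)  # {200: {0: 2, 1: 1}, 400: {0: 1, 1: 1}}
```

Each key of the outer dictionary plays the role of one returned series, and each inner key is the index of a time interval.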

The min_over_time, max_over_time, and avg_over_time functions

The min_over_time() function lets you aggregate numerical attributes by calculating their minimum value. For example, you could choose to calculate the minimum duration of a group of spans, or you could choose to calculate the minimum value of a custom attribute you’ve attached to your spans, like span.shopping.cart.entries. The time interval that the minimum is computed over is set by the step parameter.

The max_over_time() function lets you aggregate numerical values by computing their maximum value, such as the all-important span duration. The time interval that the maximum is computed over is set by the step parameter.

The avg_over_time() function lets you aggregate numerical values by computing their average value, such as the all-important span duration. The time interval that the average is computed over is set by the step parameter.

For more information, refer to the step API parameter.

Parameters

Numerical field that you want to calculate the minimum, maximum, or average of.

Examples

This example computes the minimum duration for each http.target of all spans named "GET /:endpoint". Any numerical attribute on the span is fair game.

{ name = "GET /:endpoint" } | min_over_time(duration) by (span.http.target)

This example computes the minimum status code value of all spans named "GET /:endpoint".

{ name = "GET /:endpoint" } | min_over_time(span.http.status_code)

This example computes the maximum duration for each http.target of all spans named "GET /:endpoint".

{ name = "GET /:endpoint" } | max_over_time(duration) by (span.http.target)
{ name = "GET /:endpoint" } | max_over_time(span.http.response.size)

This example computes the average duration for each http.status_code of all spans named "GET /:endpoint".

{ name = "GET /:endpoint" } | avg_over_time(duration) by (span.http.status_code)
{ name = "GET /:endpoint" } | avg_over_time(span.http.response.size)
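The three functions differ only in the reduction applied to each interval. A minimal Python sketch, assuming values are grouped into step-sized buckets and then reduced with min, max, or mean (the helper aggregate_over_time and the sample data are hypothetical, not part of Tempo):

```python
from collections import defaultdict
from statistics import mean

def aggregate_over_time(samples, start, step, fn):
    """Group (time, value) samples into step-sized buckets and reduce each with fn."""
    buckets = defaultdict(list)
    for t, value in samples:
        buckets[int((t - start) // step)].append(value)
    return {b: fn(vals) for b, vals in sorted(buckets.items())}

# Hypothetical (time in seconds, duration in milliseconds) pairs.
durations = [(3, 120), (15, 80), (42, 310), (65, 95), (110, 105)]

minimums = aggregate_over_time(durations, 0, 60, min)
maximums = aggregate_over_time(durations, 0, 60, max)
averages = aggregate_over_time(durations, 0, 60, mean)
print(minimums)  # {0: 80, 1: 95}
print(maximums)  # {0: 310, 1: 105}
print(averages)  # {0: 170, 1: 100}
```

Changing the step changes which samples share a bucket, and therefore the value reported for each interval.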

The quantile_over_time and histogram_over_time functions

The quantile_over_time() and histogram_over_time() functions let you aggregate numerical values, such as the all-important span duration. With quantile_over_time(), you can specify multiple quantiles in the same query.

The example below computes the 99th, 90th, and 50th percentile of the duration attribute on all spans with name GET /:endpoint.

{ name = "GET /:endpoint" } | quantile_over_time(duration, .99, .9, .5)

You can group by any span or resource attribute.

{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target)

Quantiles aren’t limited to span duration. Any numerical attribute on the span is fair game. To demonstrate this flexibility, consider this nonsensical quantile on span.http.status_code:

{ name = "GET /:endpoint" } | quantile_over_time(span.http.status_code, .99, .9, .5)

This computes the 99th, 90th, and 50th percentile of the values of the status_code attribute for all spans named GET /:endpoint. This is unlikely to tell you anything useful (what does a median status code of 347 mean?), but it works.

As a further example, imagine a custom attribute like span.temperature. You could use a similar query to know what the 50th percentile and 95th percentile temperatures were across all your spans.
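For readers who want to see what a quantile over a set of values means, the following Python sketch uses the exact nearest-rank definition. This is an assumption made for illustration: the source does not say how Tempo computes quantiles, and a real implementation may approximate them. The function name and sample durations are hypothetical.

```python
import math

def nearest_rank_quantiles(values, *quantiles):
    """Return {q: value} using the nearest-rank definition of a quantile."""
    ordered = sorted(values)
    n = len(ordered)
    result = {}
    for q in quantiles:
        # Nearest rank: the smallest value with at least q*n values at or below it.
        rank = max(1, math.ceil(q * n))
        result[q] = ordered[rank - 1]
    return result

# Hypothetical span durations in milliseconds within one time interval.
durations_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
percentiles = nearest_rank_quantiles(durations_ms, .99, .9, .5)
print(percentiles)  # {0.99: 100, 0.9: 90, 0.5: 50}
```

The same function works for any numerical attribute, which mirrors the point that quantiles are not limited to duration.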

The compare function

The compare function is used to split a set of spans into two groups: a selection and a baseline. It returns time-series for all attributes found on the spans to highlight the differences between the two groups.

This is a powerful function that’s best understood by using the Comparison tab in Explore Traces. You can also understand this function by looking at the example outputs below.

The function is used like other metrics functions: when it’s placed after any trace query, it converts the query into a metrics query: ...any spanset pipeline... | compare({subset filters}, <topN>, <start timestamp>, <end timestamp>)

Example:

{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})

This function is generally run as an instant query. An instant query gives a single value at the end of the selected time range. Instant queries are quicker to execute, and their results are often easier to understand. When run as a range query, the results may exceed gRPC payload limits.

Parameters

The compare function has four parameters:

  1. Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If it matches, then the span is considered to be part of the selection. Otherwise, it is part of the baseline. Common filters are expected to be things like {status=error} (what is different about errors?) or {duration>1s} (what is different about slow spans?).

  2. Optional. The second parameter is the top N values to return per attribute. If an attribute exceeds this limit in either the selection group or baseline group, then only the top N values (based on frequency) are returned, and an error indicator for the attribute is included in the output (see below). Defaults to 10.

  3. Optional. The third and fourth parameters are the start and end timestamps in Unix nanoseconds, which can be used to constrain the selection window by time, in addition to the filter. For example, the overall query could cover the past hour, and the selection window only a 5 minute time period in which there was an anomaly. These timestamps must both be given, or neither.
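The parameter rules above can be sketched as a small Python helper that assembles the query string. The helper build_compare_query is hypothetical and only does string formatting. It also assumes the arguments are positional, so that topN has to be written out (here with the documented default of 10) whenever timestamps are supplied.

```python
def build_compare_query(pipeline, selection, top_n=None, start_ns=None, end_ns=None):
    """Assemble a '<pipeline> | compare(...)' query string."""
    # The timestamps must both be given, or neither.
    if (start_ns is None) != (end_ns is None):
        raise ValueError("start_ns and end_ns must be given together")
    args = [selection]
    if top_n is not None or start_ns is not None:
        # Assumption: arguments are positional, so topN precedes the timestamps.
        args.append(str(10 if top_n is None else top_n))
    if start_ns is not None:
        args += [str(start_ns), str(end_ns)]
    return f"{pipeline} | compare({', '.join(args)})"

query = build_compare_query(
    '{ resource.service.name="a" && span.http.path="/myapi" }',
    "{status=error}",
)
print(query)
# { resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})
```

Passing only one of the two timestamps raises a ValueError, which mirrors the both-or-neither rule.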

Output

The outputs are flat time-series for each attribute/value found in the spans.

Each series has a label __meta_type which denotes which group it is in, either selection or baseline.

Example output series:

{ __meta_type="baseline", resource.cluster="prod" } 123
{ __meta_type="baseline", resource.cluster="qa" } 124
{ __meta_type="selection", resource.cluster="prod" } 456   <--- significant difference detected
{ __meta_type="selection", resource.cluster="qa" } 125
{ __meta_type="selection", resource.cluster="dev" } 126  <--- cluster=dev was found in the highlighted spans but not in the baseline

When an attribute reaches the topN limit, the output also includes an error indicator. This example means the attribute resource.cluster had too many values.

{ __meta_error="__too_many_values__", resource.cluster=<nil> }
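One way to work with this output is to split the series by __meta_type and look for attribute values that appear in only one group. The Python sketch below uses the example series shown earlier, represented as (labels, value) pairs. That in-memory representation is an assumption made for illustration, not the wire format returned by Tempo.

```python
# Example output series, modelled as (labels, value) pairs.
series = [
    ({"__meta_type": "baseline", "resource.cluster": "prod"}, 123),
    ({"__meta_type": "baseline", "resource.cluster": "qa"}, 124),
    ({"__meta_type": "selection", "resource.cluster": "prod"}, 456),
    ({"__meta_type": "selection", "resource.cluster": "qa"}, 125),
    ({"__meta_type": "selection", "resource.cluster": "dev"}, 126),
    ({"__meta_error": "__too_many_values__", "resource.cluster": None}, None),
]

def split_groups(series):
    """Return {'baseline': {...}, 'selection': {...}} keyed by attribute/value pairs."""
    groups = {"baseline": {}, "selection": {}}
    for labels, value in series:
        if "__meta_type" not in labels:
            continue  # skip error indicator series
        key = tuple(sorted((k, v) for k, v in labels.items() if k != "__meta_type"))
        groups[labels["__meta_type"]][key] = value
    return groups

groups = split_groups(series)
only_in_selection = sorted(set(groups["selection"]) - set(groups["baseline"]))
print(only_in_selection)  # [(('resource.cluster', 'dev'),)]
```

Here resource.cluster="dev" is reported because it occurs in the selection but never in the baseline.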