Inside PromQL: A closer look at the mechanics of a Prometheus query
Even though I’m a Prometheus maintainer and work inside the Prometheus code, I found many of the details of PromQL, the Prometheus query language, obscure. Many times I would look something up, or go deep into the code to find out exactly what it did, only to forget it again the next month.
So, trying to live up to my job title of Distinguished Engineer at Grafana Labs, I resolved to write the definitive guide: what really happens when I execute a PromQL query?
At PromCon 2024 last month, I shared my findings with the broader Prometheus community in a session called “Inside a PromQL Query: Understanding the Mechanics.”
In this blog post, I recap and expand on some of the main points I touched on during that talk. My hope is to offer you a peek under the hood of Prometheus, and a better understanding of how data flows from its source to its final destination at the API. Read on to learn more, and to explore additional resources you can use to further dig into this (extensive) topic.
Note: This is a description of the PromQL implementation as of the Prometheus 2.54 release and 3.0 beta release. Details are likely to change over time.
PromQL overview
Before diving into the mechanics of a PromQL query, it’s important to reiterate a few fundamental concepts.
The purpose of a PromQL query is to express which metrics you want to use, and which operations you want to perform on those metrics, to compute a result.
Typically, these queries are used to populate a dashboard, and there are two primary query types:
- An instant query, to look at a single point in time.
- A range query, to see variation over a time range.
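As a rough illustration of the two query types, the sketch below builds the request URLs for Prometheus' HTTP API endpoints /api/v1/query and /api/v1/query_range. The server address and the timestamps are assumptions made up for the example, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9090"  # assumed address of a local Prometheus server


def instant_query_url(query: str, time: float) -> str:
    """URL for an instant query: evaluate the expression at one timestamp."""
    return f"{BASE_URL}/api/v1/query?" + urlencode({"query": query, "time": time})


def range_query_url(query: str, start: float, end: float, step: str) -> str:
    """URL for a range query: evaluate the expression at each step from start to end."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?" + urlencode(params)


expr = "sum by (status) (rate(http_requests_total[5m]))"
print(instant_query_url(expr, 1727740800))
print(range_query_url(expr, 1727737200, 1727740800, "60s"))
```

The only difference on the wire is the endpoint and the parameters: one timestamp for an instant query, versus a start, an end, and a step for a range query.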
PromQL is defined in great detail in the documentation, so we won’t go too deep here, but briefly a query is built up from:
- Selectors, with a metric name and label matchers. For example, http_requests_total{status="200"}.
- Functions, such as abs to take the absolute value or rate to compute the rate of increase per second.
- Aggregations, like sum and max, with optional dimensions, e.g., sum by (status).
- Operators, like > and +.
An example query is sum by (status) (rate(http_requests_total[5m]) > 0).
This takes the 5-minute rate of HTTP requests, filters to those that had some requests, then sums by status.
Inside Prometheus, the module that does most of the work is called the “engine.” PromQL is processed by the parser to create an Abstract Syntax Tree (AST) which the engine executes, pulling data from storage, and then outputs the result. (We’ll cover ASTs more below.)
Parsing
The parser takes the textual form of PromQL and transforms it into a data structure to drive the PromQL engine. As an example, let’s take this query:
rate(http_requests_total[5m]) > 0
The parser will turn it into the following tree:
As mentioned above, the technical name for this is an Abstract Syntax Tree. Simply put, it’s a data structure with nodes representing selectors, operators, functions, etc. It has one root (at the top of the picture — computer trees grow downwards), and you can “walk” down the links to find everything that was in the original query.
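To make the tree for rate(http_requests_total[5m]) > 0 concrete, here is a minimal sketch in Python. The node names mirror the ones used by the Prometheus parser (BinaryExpr, Call, MatrixSelector, VectorSelector, NumberLiteral), but the fields and the walk helper are simplified assumptions for illustration, not the real Go types.

```python
from dataclasses import dataclass


@dataclass
class VectorSelector:
    name: str


@dataclass
class MatrixSelector:
    selector: VectorSelector
    range_seconds: int


@dataclass
class Call:
    func: str
    args: list


@dataclass
class NumberLiteral:
    value: float


@dataclass
class BinaryExpr:
    op: str
    lhs: object
    rhs: object


def children(node):
    """Return the child nodes of an AST node (empty for leaves)."""
    if isinstance(node, BinaryExpr):
        return [node.lhs, node.rhs]
    if isinstance(node, Call):
        return node.args
    if isinstance(node, MatrixSelector):
        return [node.selector]
    return []


def walk(node, depth=0):
    """Visit the root first, then each subtree, yielding (depth, node)."""
    yield depth, node
    for child in children(node):
        yield from walk(child, depth + 1)


# rate(http_requests_total[5m]) > 0
ast = BinaryExpr(
    ">",
    Call("rate", [MatrixSelector(VectorSelector("http_requests_total"), 300)]),
    NumberLiteral(0),
)

for depth, node in walk(ast):
    print("  " * depth + type(node).__name__)

# Walking the tree finds everything in the query, for example the selectors.
selectors = [node.name for _, node in walk(ast) if isinstance(node, VectorSelector)]
```

The printed indentation shows the shape of the tree: the comparison is the root, with the rate call on one side and the literal 0 on the other.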
PromQL execution
The first thing the PromQL engine does is walk the AST and call Select() for every selector node. Select() is an interface implemented by Prometheus’ built-in storage and any remote storage that Prometheus is configured to talk to. We’ll focus on how the built-in storage works.
How selectors are looked up in the index
To illustrate how series are looked up, we’ll use some sample data. We’re counting requests to an HTTP server, split by HTTP method and response status. In addition to these labels, each series has an ID, which is just an arbitrary number.
ID | Series |
13 | http_requests_total{method="GET",status="200"} |
42 | http_requests_total{method="GET",status="404"} |
23 | http_requests_total{method="PUT",status="200"} |
21 | http_requests_total{method="PUT",status="404"} |
73 | http_requests_total{method="PUT",status="500"} |
A selector in PromQL looks like a series, but it doesn’t have to specify all labels. For example, http_requests_total{status="200"} should fetch the two series that have that status.
The index starts with label names. Under each name are all the values for that label, and against each value is a list of series IDs. Traditionally, in databases, these lists are given the unusual name “postings.”
The name of the metric is just another label, with the special name __name__.
So, a selector like http_requests_total{status="200"} is equivalent to {__name__="http_requests_total",status="200"}.
To look up a selector in the index, we take the postings list for each part of the selector and intersect them, which gives us the series that match every part.
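The lookup described above can be sketched with plain dictionaries. This is only an in-memory illustration built from the sample table. The real index is an on-disk structure, and the function names here are made up for the example.

```python
# Sample series from the table above, keyed by series ID.
SERIES = {
    13: {"__name__": "http_requests_total", "method": "GET", "status": "200"},
    42: {"__name__": "http_requests_total", "method": "GET", "status": "404"},
    23: {"__name__": "http_requests_total", "method": "PUT", "status": "200"},
    21: {"__name__": "http_requests_total", "method": "PUT", "status": "404"},
    73: {"__name__": "http_requests_total", "method": "PUT", "status": "500"},
}


def build_index(series):
    """Build label name -> label value -> sorted list of series IDs (postings)."""
    index = {}
    for series_id, labels in series.items():
        for name, value in labels.items():
            index.setdefault(name, {}).setdefault(value, []).append(series_id)
    for values in index.values():
        for postings in values.values():
            postings.sort()
    return index


def select(index, matchers):
    """Intersect the postings lists of every name=value equality matcher."""
    result = None
    for name, value in matchers.items():
        postings = set(index.get(name, {}).get(value, []))
        result = postings if result is None else result & postings
    return sorted(result or [])


index = build_index(SERIES)
print(index["status"])
print(select(index, {"__name__": "http_requests_total", "status": "200"}))
```

The selector http_requests_total{status="200"} intersects the postings for __name__ (all five series) with the postings for status="200", leaving series 13 and 23.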
A selector can use not-equals, in which case we take away any series that match. First, we evaluate all the parts of the selector that add to the result, and then we remove the series matched by the not-equals parts.
Selectors can match a regular expression, like status=~".00". Regular expressions are matched against each possible value to find out which series to include.
Prometheus has special handling for regular expressions that can only match a fixed set of strings, like status=~"200|404". In this case, the string values are looked up directly in the index, avoiding a scan of all possible values.
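Here is a hedged sketch of the three cases above (not-equals, a general regular expression, and a fixed set of alternatives), using the postings for the status label from the sample data. The helper names are hypothetical, and the test for "fixed set of strings" is a deliberately crude stand-in for the analysis Prometheus actually performs. PromQL regular expressions are fully anchored, hence the use of fullmatch.

```python
import re

# Postings for the "status" label from the sample data: value -> series IDs.
STATUS_POSTINGS = {"200": [13, 23], "404": [21, 42], "500": [73]}
ALL_SERIES = {13, 21, 23, 42, 73}


def not_equal(postings, value, candidates):
    """status!="value": start from the candidates and take away the matches."""
    return candidates - set(postings.get(value, []))


def regex_match(postings, pattern):
    """status=~"pattern": test the anchored pattern against every label value."""
    compiled = re.compile(pattern)
    matched = set()
    for value, ids in postings.items():
        if compiled.fullmatch(value):
            matched.update(ids)
    return matched


def set_match(postings, pattern):
    """Shortcut for patterns such as "200|404" made only of literal alternatives."""
    alternatives = pattern.split("|")
    if not all(re.fullmatch(r"\w+", alt) for alt in alternatives):
        return None  # not a fixed set of strings: fall back to regex_match
    matched = set()
    for value in alternatives:
        matched.update(postings.get(value, []))  # direct lookup, no scan
    return matched


print(sorted(not_equal(STATUS_POSTINGS, "200", ALL_SERIES)))
print(sorted(regex_match(STATUS_POSTINGS, ".00")))
print(sorted(set_match(STATUS_POSTINGS, "200|404")))
```

The difference in cost is visible in the loops: regex_match touches every value of the label, while set_match only touches the values named in the pattern.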
For lots of low-level detail on how the postings index is stored on disk, see Ganesh Vernekar’s blog post TSDB: Persistent Block and its Index.
This section described what happens when the PromQL engine calls Select() to get series. Next, we will look at how the individual data points, or samples, are fetched from series.
Sample timestamps
To illustrate the concepts of series and samples, let’s look at a simple query, shown here in Grafana.
The query returned 14 time series, each relating to a different source of data. Each sample has a value, plotted on the vertical axis and a timestamp, which Grafana lays out horizontally. Each series is drawn in a different color. The timestamps are in nice, regular intervals, according to the step parameter; in this case, Grafana picked 1 minute. Having all points lined up is what you want if you’re going to compare the values, or do arithmetic on them.
However, real-world data is not usually so neat. In Grafana, if you select Type: Instant but put the whole dashboard range in as a range vector selector [$__range], you can see the underlying data points. This little trick is extremely useful when working with PromQL to understand what goes into the computation.
In our example, each data source provides data at a different time offset, so the points as a whole look scattered across the panel.
When PromQL takes a sample from each time series, all at the same timestamp, this is called an instant vector. A range vector, on the other hand, is a set of time series containing a range of data points over time for each series. In PromQL, a range vector is specified with a duration in square brackets, like [5m] for five minutes.
Note that “range vector” is different from “range query.” A range vector is found inside a PromQL expression, while a range query executes a whole PromQL expression at regular intervals from a start time to an end time.
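The distinction can be sketched in a few lines: a range query produces a list of evaluation timestamps, and at each of those timestamps a range vector selector picks out a window of raw samples. The sample data and function names below are invented for illustration, and treating the window as open on the left and closed on the right is an assumption of this sketch.

```python
def step_timestamps(start, end, step):
    """Evaluation times of a range query: start, start + step, ... up to end."""
    return list(range(start, end + 1, step))


def range_vector(samples, eval_time, window):
    """Samples of one series that fall inside the window ending at eval_time."""
    return [(t, v) for t, v in samples if eval_time - window < t <= eval_time]


# (timestamp in seconds, value) pairs for a single series, at irregular times.
samples = [(5, 1.0), (20, 2.0), (35, 4.0), (50, 7.0), (65, 11.0)]

# A range query from t=30 to t=60 with a 30s step, using a [30s] range vector.
for t in step_timestamps(30, 60, 30):
    print(t, range_vector(samples, t, 30))
```

The outer loop is the range query, and the call inside it is the range vector. The two durations are independent of each other.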
Expanding series to samples
Down at the code level inside the PromQL engine, the call to Select() gives us an “iterator” abstraction; as each selector node is visited, the nodes are turned into sets of Series objects. Every time we need a data point, we seek to the time requested minus the lookback time, and then step forward, sample by sample, until we hit the time requested.
A “sample” can be either a floating-point number or (if the currently experimental native histograms feature is enabled) a complete histogram.
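The seek-then-step pattern can be sketched over a sorted list of samples. This is a simplification: a Python list stands in for the storage iterator, staleness markers are ignored, and the exact treatment of the window edges is an assumption. The 5-minute value is Prometheus' default lookback.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 300  # Prometheus' default lookback is 5 minutes


def sample_at(samples, eval_time, lookback=LOOKBACK_SECONDS):
    """Return the latest (timestamp, value) within the lookback window, or None."""
    timestamps = [t for t, _ in samples]
    # "Seek": jump to the first sample after eval_time - lookback.
    i = bisect_right(timestamps, eval_time - lookback)
    latest = None
    # "Step": move forward sample by sample until we pass the requested time.
    while i < len(samples) and samples[i][0] <= eval_time:
        latest = samples[i]
        i += 1
    return latest


samples = [(0, 1.0), (60, 2.0), (120, 3.0)]
print(sample_at(samples, 130))
print(sample_at(samples, 420))
```

At t=130 the most recent sample is the one at t=120. At t=420 that sample is a full 300 seconds old, so it falls out of the window and the series produces no point.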
PromQL functions
PromQL has about 75 functions defined, which can be grouped into categories based on how they work:
- Simple functions that take in one data point and output one result. Examples include abs and sqrt. Given an instant vector, they loop over all series, compute the result, and output an instant vector.
- A couple of very similar functions that take extra parameters, such as clamp.
- A set of date/time functions, such as month, year, and timestamp, that either work the same as above or, if given no parameter, operate on the current time that PromQL is looking at.
- Histogram functions that only work on native histogram samples.
- Functions that operate on a range vector, like rate, max_over_time, and resets. These compute a result from the whole set of points in the time range window for each series.
- label_join and label_replace, which work on series and not on samples, and are handled specially.
- Sorting functions that take an instant vector and sort by either values or labels.
For most functions, the output set of series corresponds one-to-one with the input set of series; however, the series name (the __name__ label) is removed on the grounds that f(foo) is not foo. The one exception is the function last_over_time, because its values do exactly match the original values from the series, just shifted in time.
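Two of the categories above can be sketched side by side: a simple function such as abs, and a range-vector function such as max_over_time. The data layout (lists of label/value pairs) and the metric name temperature_celsius are assumptions chosen for this example, not the engine's internal representation.

```python
def drop_name(labels):
    """f(foo) is not foo, so the output series loses its __name__ label."""
    return {k: v for k, v in labels.items() if k != "__name__"}


def apply_simple(func, instant_vector):
    """One sample in, one sample out, for every series in the instant vector."""
    return [(drop_name(labels), func(value)) for labels, value in instant_vector]


def apply_over_time(func, range_vector):
    """Reduce each series' window of samples to a single value."""
    return [
        (drop_name(labels), func(v for _, v in samples))
        for labels, samples in range_vector
    ]


labels = {"__name__": "temperature_celsius", "room": "lab"}  # hypothetical series
instant = [(labels, -3.5)]
windowed = [(labels, [(0, -3.5), (60, 1.0), (120, 0.5)])]

print(apply_simple(abs, instant))      # like abs(temperature_celsius)
print(apply_over_time(max, windowed))  # like max_over_time(temperature_celsius[3m])
```

In both cases there is exactly one output series per input series, and only the room label survives.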
PromQL aggregations
Aggregations look like functions, but work very differently. For example, this aggregation will evaluate the rate part then sum by status:
sum by (status) (rate(http_requests_total[5m]))
There are three styles of aggregation:
- sum, avg, count, stdvar, stddev, or quantile: These produce one output series for each group specified in the expression, with just the labels from by(...).
- topk, bottomk, limitk, or limit_ratio: Output has the same labels as the input, but just k (which is a number, and the first parameter to the aggregation) of them per group.
- count_values: Output has the by(...) labels plus one more, which is the one being counted.
In each case, we construct an object for each output series, and loop over the inputs to accumulate the desired result.
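For the first style, the accumulate-per-group loop might look like the sketch below, which mimics sum by (status). The input values are invented per-second rates for the five sample series, and the function name is made up for this example.

```python
from collections import defaultdict


def sum_by(instant_vector, by):
    """sum by (<labels>): one accumulator per distinct group of label values."""
    groups = defaultdict(float)
    for labels, value in instant_vector:
        # The group key keeps only the labels listed in by(...).
        key = tuple((name, labels[name]) for name in by if name in labels)
        groups[key] += value
    return [(dict(key), total) for key, total in groups.items()]


# Invented rates for the five sample series (name already dropped by rate()).
rates = [
    ({"method": "GET", "status": "200"}, 2.0),
    ({"method": "GET", "status": "404"}, 0.5),
    ({"method": "PUT", "status": "200"}, 1.0),
    ({"method": "PUT", "status": "404"}, 0.25),
    ({"method": "PUT", "status": "500"}, 0.125),
]

for labels, total in sum_by(rates, ["status"]):
    print(labels, total)
```

Five input series collapse into three output series, one per status, each carrying only the status label.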
PromQL operators
PromQL has many operators, as you would expect from a rich language:
- Arithmetic binary operators: +, -, *, /, %, ^
- Comparison binary operators: ==, !=, >, <, >=, <=
- Logical/set binary operators: and, or, unless
When a binary operator is used between two instant vectors, PromQL goes through a process of matching labels. Here is a basic example, where PromQL will match labels on each side one-for-one:
mem_total_mb - mem_free_mb
As with functions, there is no __name__ label on the output.
A common pattern using operators in PromQL is to “join” a series with some values with an “info” series containing more labels, and where every value is 1. To illustrate:
disk_mb * on (host) group_left(team) host_info
In the expression, on gives the list of labels to match on, and group_left says the left-hand side can have multiple matches. The extra labels listed in brackets after group_left are added from the other side.
The PromQL engine first extracts the “signature” of each series, with just the on labels, then constructs a map (hash table) of signatures on the “one” side, and then goes through the “many” side and finds the match.
Finally, the result series are assembled, taking all labels (except __name__) from the “many” side and the extra labels listed after group_left from the “one” side.
Binary operators can be one of the most computationally expensive parts of PromQL, because the process of building a hash table and looking for matches is repeated on every time step, in case series have started or stopped since the last step.
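The matching steps for disk_mb * on (host) group_left(team) host_info can be sketched as a small hash join. The series values and the device label are invented for the example, and the function is a simplification covering only multiplication for a single time step.

```python
def signature(labels, on):
    """The values of the on(...) labels, used as the hash-table key."""
    return tuple(labels.get(name, "") for name in on)


def multiply_group_left(many, one, on, include):
    """Sketch of: many * on(<on>) group_left(<include>) one."""
    # Build a hash table of signatures for the "one" side.
    one_side = {}
    for labels, value in one:
        sig = signature(labels, on)
        if sig in one_side:
            raise ValueError(f"duplicate series on the 'one' side for {sig}")
        one_side[sig] = (labels, value)

    result = []
    # Go through the "many" side and find the match for each series.
    for labels, value in many:
        match = one_side.get(signature(labels, on))
        if match is None:
            continue  # no partner on the other side: no output series
        one_labels, one_value = match
        out = {k: v for k, v in labels.items() if k != "__name__"}
        for name in include:
            if name in one_labels:
                out[name] = one_labels[name]
        result.append((out, value * one_value))
    return result


disk_mb = [
    ({"__name__": "disk_mb", "host": "a", "device": "sda"}, 500.0),
    ({"__name__": "disk_mb", "host": "a", "device": "sdb"}, 250.0),
    ({"__name__": "disk_mb", "host": "b", "device": "sda"}, 100.0),
    ({"__name__": "disk_mb", "host": "c", "device": "sda"}, 75.0),
]
host_info = [
    ({"__name__": "host_info", "host": "a", "team": "storage"}, 1.0),
    ({"__name__": "host_info", "host": "b", "team": "web"}, 1.0),
]

for labels, value in multiply_group_left(disk_mb, host_info, ["host"], ["team"]):
    print(labels, value)
```

Because every info series has the value 1, the multiplication leaves the disk values unchanged and the only visible effect is the added team label. Host c has no info series, so it drops out of the result.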
Sorting
Data from an instant query comes back in an arbitrary order, unless you include a function like sort or sort_by_labels to request a particular order. On the other hand, range queries are always sorted alphabetically by labels after all query processing, so including a sorting function in the PromQL for a range query is just wasting time.
Output
PromQL queries are most commonly sent to Prometheus over its HTTP API, where the results come back in JSON. The translation to JSON is done by the web/api package of Prometheus, outside of the PromQL engine.
If an error occurs while executing the query, you will not receive any data, but instead an error string together with an errorType that classifies the error into a category, such as bad_data or timeout. Inside the PromQL engine, errors are reported using Go’s panic call, used like an exception in Java or C++. It avoids the code having to check for errors at every point along the way.
If the query does complete successfully, in addition to the data you requested, you may also receive a set of "infos" and/or "warnings". Info is milder (for instance, a counter name not ending in _total), while warnings are more serious and likely indicate a problem with your query. An example of a warning is a quantile parameter outside the range 0..1.
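A client might handle the two outcomes along these lines. The JSON bodies below are hand-written to follow the shape of the Prometheus HTTP API response (status, data, errorType, error, warnings, infos). They were not captured from a real server, and the message texts in them are invented.

```python
import json


def parse_response(body: str):
    """Return (result, notes) on success; raise if the API reported an error."""
    payload = json.loads(body)
    if payload["status"] == "error":
        raise RuntimeError(f'{payload["errorType"]}: {payload["error"]}')
    notes = payload.get("warnings", []) + payload.get("infos", [])
    return payload["data"]["result"], notes


# Hand-written example bodies; the message strings are invented.
ok_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"status": "200"}, "value": [1727740800, "3"]}],
    },
    "infos": ["example info message"],
})
error_body = json.dumps({
    "status": "error",
    "errorType": "bad_data",
    "error": "example parse error",
})

result, notes = parse_response(ok_body)
print(result, notes)
try:
    parse_response(error_body)
except RuntimeError as err:
    print("query failed:", err)
```

Sample values arrive as strings inside the JSON, so a real client would convert them to numbers before doing arithmetic.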
You can also ask for query execution statistics by adding &stats=1 to your query URL. This will get you an extra section at the end of the JSON with timing and data size details, such as queryPreparationTime and execTotalTime. These measurements are also collected into a summary metric prometheus_engine_query_duration_seconds.
How can you see inside?
For even more detail than metrics or query execution statistics, you can use tracing and profiling to gain visibility into your queries.
Tracing
Prometheus is instrumented with OpenTelemetry tracing, so if you have a distributed tracing system like Jaeger or Grafana Tempo, you can use that to view individual queries and see how they were executed.
This example is the query sum by(job, mode) (rate(node_cpu_seconds_total[1m])) / on(job) group_left sum by(job)(rate(node_cpu_seconds_total[1m])).
Traces are made up of spans, each with a start and end time. In the viewer, time goes from left to right, and flow from top to bottom. The indentation of trace spans shows how the PromQL engine executes each node in the AST. In this case, a BinaryExpr node calls two AggregateExpr nodes, each of which calls a Call node representing the rate() function. Data fetching for range vector selectors appears as part of each Call span.
Profiling
Prometheus is also instrumented for CPU and memory profiling, courtesy of the Go language runtime, so we can take a look at CPU usage. The image below shows Prometheus executing a mix of queries over a 30-second period, so it can’t be mapped directly to an individual query, but you can still see how the calls break down. This is a flame graph view, in which calls from one function to another go from top to bottom, and the width of each bar indicates how much of the profile time was spent in that function.
How to learn more
As I mentioned in my PromCon 2024 talk, there’s a lot to cover in terms of PromQL query execution and how to make your own queries more effective. While I hope this blog post and my talk are helpful resources — you can find the full slide deck here (3.4MB) — I realize you could go much, much deeper into this topic.
If you’re learning PromQL, Grafana’s builder mode for the Prometheus query editor is very useful to see how expressions build up, and makes it easy to see what operations are available. PromLens, a stand-alone tool and another query builder for PromQL, takes this a step further and helps you build, understand, and optimize your queries. The Prometheus documentation is the ultimate reference, but this helpful PromQL cheat sheet may be easier to digest.
Lastly, there is a large Grafana Community Slack and several Prometheus community channels that you can use to seek advice or swap tips. I look forward to seeing you there.