Introduction to PromQL, the Prometheus Query Language

Published: 4 Feb 2020

PromQL is the query language that is part of Prometheus. In addition to PromQL, Prometheus provides a scraper that fetches metrics from instances (any application providing metrics) and a time series database (TSDB), which stores these metrics over time.

This introduction to PromQL will be largely decoupled from specific tools and the non-PromQL parts of Prometheus, in order to focus on the features of the language itself.

I recommend reading Ivana Huckova’s blog post, How to Explore Prometheus with Easy ‘Hello World’ Projects, which has useful tips and links on getting a Prometheus server and a database set up together with Grafana.

As supplements to this post, check out this excellent recorded talk by Ian Billett from PromCon EU 2019: PromQL for Mere Mortals, and Timber’s handy cheat sheet, PromQL for Humans.

Data types

Prometheus uses three data types for metrics: the scalar, the instant vector, and the range vector. The most fundamental data type of Prometheus is the scalar, which represents a floating-point value. Examples of scalars include 0, 18.12, and 1000000. All calculations in Prometheus are floating-point operations.

When you group together scalars as a set of metrics in a single point in time, you get the instant vector data type. When you run a query asking for only the name of a metric, such as bicycle_distance_meters_total, the response is an instant vector. Since metrics have both names and labels (which I’ll cover a little later), a single name may contain many values, and that is why it’s a vector rather than just a scalar.

An array of instant vectors over time gives you the range vector. Neither Grafana nor the built-in Prometheus expression browser makes graphs out of range vectors directly; instead, they use instant vectors or scalars independently calculated for different points in time.

Because of this, range vectors are typically wrapped in a function that transforms them into an instant vector (such as rate, delta, or increase) before being plotted. Syntactically, you get a range vector when you query an instant vector and append a time selector such as [5m]. The simplest range vector query is an instant vector with a time selector, such as bicycle_distance_meters_total[1h].

Labels

Some earlier metric systems had only the metric name for distinguishing different metrics from each other. This means that you might end up with metric names such as bicycle.distance.meters.monark.7 to distinguish a 7-geared Monark bicycle from a 2-geared Brompton bicycle (bicycle.distance.meters.brompton.2). In Prometheus, we use labels for that. A label is written after the metric name in the format {label="value"}.

This means that our previous two bikes are rewritten as bicycle_distance_meters_total{brand="monark",gears="7"} and bicycle_distance_meters_total{brand="brompton",gears="2"}.

When querying based on metrics, the same format is used. As such, bicycle_distance_meters_total would give us the mileage of all bikes in this example; bicycle_distance_meters_total{gears="7"} would limit the resulting set to all 7-geared bicycles. This allows us much more flexibility without having to resort to weird regex magic as with the older format.

Negation and regular expressions (in Google’s RE2-format) are supported by replacing = with either !=, !~, or =~ for not equal, not matching, and matching respectively. (When selecting multiple values for variables in Grafana, they are represented in a format compatible with =~).

The downside with labels is that the seemingly innocent query bicycle_distance_meters_total can actually return thousands of values, and it’s sometimes not intuitive which queries will end up heavy on the Prometheus server or the client from which you’re querying Prometheus.

Metric types

Prometheus conceptually has four different metric types. The metric types are all represented by one or more scalar values with some different conventions dictating how to use them and for what purpose they’re useful.

Counter and Gauge are basic metric types, both of which store a scalar. A counter only counts up (a reset to zero can happen on restart), whereas a gauge can go both up and down. bicycle_distance_meters_total is a counter since the number of meters a bike has traveled cannot decrease, whereas bicycle_speed_meters_per_second would have to be a gauge to allow for decreased speeds. By convention, counters end with _total to help the user distinguish between counters and gauges at a glance.

The third metric type is Histogram, which offers a single interface for measuring three different things:

  1. <metric>_count is a counter that stores the total number of data points.

  2. <metric>_sum is a gauge that stores the value of all data points added together. The sum can be used as a counter for all histograms where negative values are impossible.

  3. <metric>_bucket is a collection of counters in which a label is used to support calculating the distribution of the values. The buckets are cumulative, so all buckets that are applicable for a value are increased by one on insertion of a sample. There is a +Inf bucket, which should hold the same value as _count.

For a bicycle race, the number of cyclists finishing by the number of hours it takes them to finish could be stored in a histogram with the buckets 21600, 25200, 28800, 32400, 36000, 39600, +Inf. (Time is by convention stored in seconds; this is one bucket per hour for the range [6, 11] hours.)

If there are 2 cyclists finishing in slightly less than 7 hours, 5 more in less than 8 hours, 3 more in less than 10 hours, and a sole cyclist finishing two days later, the buckets would be represented something like this. (For the purpose of this example, I’ve made up the value for the sum, but of course it can be anything, depending on the values in each bucket.)

race_duration_seconds_bucket{le="21600"} 0
race_duration_seconds_bucket{le="25200"} 2
race_duration_seconds_bucket{le="28800"} 7
race_duration_seconds_bucket{le="32400"} 7
race_duration_seconds_bucket{le="36000"} 10
race_duration_seconds_bucket{le="39600"} 10
race_duration_seconds_bucket{le="+Inf"} 11
race_duration_seconds_count 11
race_duration_seconds_sum 511200
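Since the buckets are cumulative, every observation increments all buckets whose upper bound covers it, plus the count and the sum. A minimal Python sketch of that behavior, using the race’s bucket bounds (the observe helper is made up for illustration; a real client library does this for you):

```python
import math

# Upper bounds of the race-duration buckets, in seconds (6 to 11 hours).
BOUNDS = [21600, 25200, 28800, 32400, 36000, 39600, math.inf]

buckets = {le: 0 for le in BOUNDS}
count = 0
total = 0.0

def observe(value):
    """Record one sample: every bucket whose bound covers it is incremented."""
    global count, total
    count += 1
    total += value
    for le in BOUNDS:
        if value <= le:
            buckets[le] += 1

observe(24000)  # a cyclist finishing in just under 7 hours
observe(27000)  # a cyclist finishing in just under 8 hours
print(buckets[25200], buckets[math.inf], count)
```

Note that the second observation lands in the 28800 bucket and every larger one, but not in 25200, which is why the +Inf bucket always equals the count.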

By having a common convention for how to store histograms, Prometheus can provide functions such as histogram_quantile (which calculates quantiles for a histogram – I’ll go into the details of that further down), and external tools such as Grafana can recognize the format and provide histogram features. Since histograms are “just” a collection of counters, the histograms don’t increase the complexity of Prometheus at large.

When using histograms, knowing how the buckets work and that everything above the largest bucket is simply stored as “more than the largest bucket” can help you understand what kind of accuracy you can get from histograms, and by extension, what accuracy you can expect from a calculation.

For instance, having a +Inf bucket with a significantly higher value than the largest bucket might be an indicator that your buckets are misconfigured (and that the values you’re getting from Prometheus are unreliable).

The final metric type is Summary. It is similar to a histogram, but it is a quantile gauge calculated by the client in order to achieve higher accuracy for that quantile. The precalculated quantiles for summaries cannot be aggregated in a meaningful way. You can study the quantiles from individual instances of your service, but you cannot aggregate them to a fleet-wide quantile.

One common use case for quantiles is as service level indicators (i.e. SLI/SLO/SLA) to know how large a portion of the incoming requests to a server is slower than say 50ms. With a histogram in which one of the buckets is <0.05 seconds, it’s possible to say with high accuracy how many of the requests were not handled within that time. Adding more buckets will make it possible to calculate quantiles, which gives you an idea of the performance. With summaries, this aggregation is not at all possible.

To summarize: histograms require you to have some level of insight into the distribution of your values in order to set up appropriate buckets, whereas summaries lack reliable aggregation operations.

Functions and operators

Metrics can be useful by themselves, but in order to maximize their utility, some kind of manipulation is necessary, which is why Prometheus provides a number of operators and functions for manipulation.

Aggregation operators

Aggregation operators reduce one instant vector to another instant vector with the same or fewer label sets, either by aggregating the values across label sets or by selecting one or more distinct sets, depending on the operator. The simplest form of aggregation operator would appear as avg(bicycle_speed_meters_per_second), which gives you the overall average speed of the bicycles in the set.

If you want to be able to differentiate bicycles by the labels brand and gears, you can instead use avg(bicycle_speed_meters_per_second) by (brand, gears). by can be replaced with without if you want to discard a label for the new vector instead of selecting which ones you’d like to keep.

There are a number of aggregations available, the most prominent being the hopefully self-explanatory sum, min, max, and avg. Some of the more complex aggregators take additional parameters, such as topk(3, bicycle_speed_meters_per_second), which gives you the overall three highest speeds.
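The grouping semantics of by can be sketched in a few lines of Python; the avg_by helper and its sample data are made up for illustration, assuming an instant vector represented as (labels, value) pairs:

```python
from collections import defaultdict

def avg_by(samples, keep):
    """Average an instant vector per group, keeping only the labels in
    `keep` — a sketch of PromQL's avg(...) by (...)."""
    groups = defaultdict(list)
    for labels, value in samples:
        key = tuple((k, labels[k]) for k in keep)
        groups[key].append(value)
    return {key: sum(vs) / len(vs) for key, vs in groups.items()}

# Hypothetical samples of bicycle_speed_meters_per_second.
samples = [
    ({"brand": "monark", "gears": "7"}, 6.0),
    ({"brand": "monark", "gears": "7"}, 8.0),
    ({"brand": "brompton", "gears": "2"}, 5.0),
]
print(avg_by(samples, ["brand", "gears"]))
```

The two Monark samples collapse into one averaged series, while the Brompton keeps its own, which is exactly what the by clause does with label sets.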

Binary operators

The arithmetic binary operators (+, -, *, /, % [modulo/clock counting], ^ [power]) can operate on a combination of instant vectors and scalars, which can lead to a bit of a world of pain if you’re trying to be mathematically sound. So I’ll summarize the different cases and how to handle the weird cases that come with doing vector arithmetic.

Scalar-to-scalar arithmetic is, at its core, the arithmetic from primary school. Scalar-to-vector arithmetic is almost as simple: for every value in the vector, apply the calculation with the scalar. (If you have bicycle_speed_meters_per_second and want to express it in the more familiar kilometers per hour (km/h), that is done with bicycle_speed_meters_per_second * 3.6.)

Vector-to-vector arithmetic is where it becomes really interesting. Label matching comes into play: samples whose label sets match exactly are paired up, and the operation is applied between them. All other values are discarded. Example: bicycle_speed_meters_per_second / bicycle_cadence_revolutions_per_minute.

Since that’s often not what you want, you can add an on (or ignoring) operator to the right of the binary operator, and you’ll end up limiting the set of labels being used for the comparison before running it. However, all labels that are not used for the comparison are thrown away. This would be bicycle_speed_meters_per_second / on (gears) bicycle_cadence_revolutions_per_minute.

What if you want to keep those labels? Well, you can keep all labels from the left-hand side by adding group_left after the on or ignoring keyword. When you do that, the value of the right-hand side will be applied to each of the left-hand-side samples that match the labels from on. In practice, this looks like bicycle_speed_meters_per_second / on (gears) group_left bicycle_cadence_revolutions_per_minute. There is also a group_right, which instead keeps the labels of the right-hand side.
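The matching behavior can be sketched in Python; the divide_on helper and its sample data are hypothetical, and this simplification assumes the right-hand side has exactly one sample per match group (real PromQL enforces that and errors otherwise):

```python
def divide_on(left, right, on, group_left=False):
    """Sketch of `left / on (...) [group_left] right` matching semantics."""
    index = {}
    for labels, value in right:
        key = tuple(labels[k] for k in on)
        index[key] = value  # assumes the right side is unique per key
    out = []
    for labels, value in left:
        key = tuple(labels[k] for k in on)
        if key not in index:
            continue  # no match on the right: the sample is dropped
        # group_left keeps all left-hand labels; otherwise only the
        # labels named in `on` survive.
        kept = labels if group_left else {k: labels[k] for k in on}
        out.append((kept, value / index[key]))
    return out

speed = [({"brand": "monark", "gears": "7"}, 7.2)]
cadence = [({"gears": "7"}, 90.0)]
print(divide_on(speed, cadence, ["gears"], group_left=True))
```

Without group_left, the brand label would be dropped from the result; with it, the full left-hand label set survives.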

In addition to the arithmetic operators, there are also the comparison (==, !=, >, <, >=, <=) and set (and, or, unless) operations.

Comparison operations are defined for instant vectors and scalars. A comparison between two scalars returns either 0 for false or 1 for true and requires the bool keyword after the comparator. For instant vectors, when compared to a scalar, every data point for which the comparison is true is kept, and the others are thrown away.

When comparing two instant vectors, the logic is similar, but per set of labels, both the labels and the values are compared. When the operation returns false, or there is no metric with a matching set of labels on the opposite side, the value is thrown away; otherwise it’s kept. If you want 0 or 1 instead of keeping or tossing the value, you can add the keyword bool after the comparator.

The set operators operate on instant vectors and work by inspecting the label sets on the metrics. For and, if a label set exists on both the left-hand side and right-hand side, the value of the left-hand side is returned; otherwise, nothing. For or, all label sets for the left-hand side are returned, as are the sets for the right-hand side that don’t exist on the left-hand side. And finally, unless returns the values on the left-hand side for which the label set does not also exist on the right-hand side.
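Those three rules are easy to express in Python over the same (labels, value) representation; the helpers and sample data here are made up for illustration:

```python
def key(labels):
    """A hashable identity for a label set."""
    return frozenset(labels.items())

def vec_and(left, right):
    """Keep left-hand samples whose label set also exists on the right."""
    rkeys = {key(l) for l, _ in right}
    return [(l, v) for l, v in left if key(l) in rkeys]

def vec_or(left, right):
    """All of the left, plus right-hand samples not present on the left."""
    lkeys = {key(l) for l, _ in left}
    return left + [(l, v) for l, v in right if key(l) not in lkeys]

def vec_unless(left, right):
    """Keep left-hand samples whose label set does NOT exist on the right."""
    rkeys = {key(l) for l, _ in right}
    return [(l, v) for l, v in left if key(l) not in rkeys]

left = [({"brand": "monark"}, 7.0), ({"brand": "brompton"}, 5.0)]
right = [({"brand": "monark"}, 1.0)]
print(vec_and(left, right))
print(vec_unless(left, right))
```

Note that in all three cases the values returned come from the side that "wins"; the right-hand values themselves never appear in an and or unless result.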

Functions

Functions in Prometheus work much like functions in programming in general, but are limited to a pre-defined set. It’s important to know that most of Prometheus’s functions are approximative and extrapolate the result – which occasionally turns what should be integer calculations into floating point values, and also means that Prometheus is really bad to use when exactness is required (for example, for billing purposes).

Some particularly useful functions are delta, increase, and rate. Each takes a range vector as input and returns an instant vector. delta operates on gauges and returns the difference between the start and end of the range. increase and rate operate on counters and return the amount the counter has increased over the specified time. increase gives the total increase over the time window, and rate gives the per-second increase. rate(bicycle_distance_meters_total[1h]) should be the same as increase(bicycle_distance_meters_total[1h]) / 3600. Because increase and rate have logic to handle restarts when the value is reset to zero, it’s important to avoid using them with gauges that go up and down. That could end up looking like a restart to the functions, resulting in a nonsensical value.
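The reset handling is the interesting part: whenever a counter sample is lower than the previous one, the function assumes a restart from zero. A simplified Python sketch of that idea (the real Prometheus implementation additionally extrapolates to the edges of the window, which this sketch omits):

```python
def increase(samples):
    """Total growth of a counter over a range of samples (oldest first),
    treating any drop in value as a counter reset from zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Counter reset: the counter restarted at 0, so the whole
            # current value counts as growth.
            total += cur
        else:
            total += cur - prev
    return total

def rate(samples, window_seconds):
    """Per-second growth over the window — increase divided by its length."""
    return increase(samples) / window_seconds

# 100 -> 150 (+50), reset to 10 (+10), 10 -> 30 (+20): total increase 80.
print(increase([100, 150, 10, 30]))
```

This also shows why feeding a gauge into these functions is dangerous: a legitimate decrease is indistinguishable from a reset and gets counted as fresh growth.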

To make sense out of the histogram buckets, the histogram_quantile function takes two arguments: first, the quantile that should be calculated, and second, the instant vector of buckets. For our earlier race example, with a few more labels added, this could be histogram_quantile(0.95, sum(race_duration_seconds_bucket) by (le)), which returns the time in which the 95th-percentile racer would finish. The reason we sum the buckets with by (le) before performing the quantile calculation is that the quantile is calculated per unique combination of labels. This allows us to graph things such as the median time per number of gears on the bike with histogram_quantile(0.5, sum(race_duration_seconds_bucket) by (le, gears)).
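Under the hood, the calculation finds the bucket the requested rank falls into and interpolates linearly within it; if the rank lands in the +Inf bucket, the upper bound of the largest finite bucket is returned. A rough Python sketch of this logic, simplified from the real implementation (which also handles NaN values and malformed buckets), applied to the race data from above:

```python
import math

def histogram_quantile(q, buckets):
    """Sketch of PromQL's histogram_quantile. `buckets` is a list of
    (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    i = next(i for i, (_, cc) in enumerate(buckets) if cc >= rank)
    upper, count = buckets[i]
    if math.isinf(upper):
        # Rank falls beyond the largest finite bucket: all we can say
        # is "at least this much", so return that bound.
        return buckets[-2][0]
    lower, below = buckets[i - 1] if i > 0 else (0, 0)
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

race = [(21600, 0), (25200, 2), (28800, 7), (32400, 7),
        (36000, 10), (39600, 10), (math.inf, 11)]
print(histogram_quantile(0.95, race))
print(histogram_quantile(0.5, race))
```

For this data, the 0.95 quantile lands in the +Inf bucket (our straggler finishing two days later), so the answer is clamped to 39600 seconds, illustrating the accuracy limits discussed earlier.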

Read more

PromQL is a domain-specific language with a syntax that hides a few surprises and isn’t always intuitively understandable. I originally wrote an earlier version of this post before I joined Grafana Labs because I wanted to understand the syntax used by Prometheus better, and because I noticed that a lot of PromQL queries out in the wild do not match up with the intentions of their authors. This post is a summary of my experimentation and of reading Prometheus’s documentation and source code. For the full details, I’d recommend the official Prometheus documentation, which covers pretty much the content I’ve written here and a lot more.
