Even More Blazin' Fast PromQL

18 Dec, 2019 5 min

At KubeCon San Diego, I presented an updated and revised version of my talk “Blazin’ Fast PromQL”. In this blog post I’ll give you a brief write-up of the talk and the steps to reproduce the results yourself.

Motivation: High Cardinality Meets Single-Threading

Conventional wisdom in Prometheus-land says not to include more than ~100k time series in a single PromQL query. This is a reasonable limit when you consider that Prometheus instances are co-located with the jobs they monitor: in multi-team and multi-region deployments, there is a natural limit to the amount of data in a single Prometheus server. The reasoning breaks down when you want to pull together all the data from a fleet of Prometheus instances into a single place – at that point you will quickly exceed both the storage and the query capability of a single machine.

Let’s put some numbers to this: At Grafana Labs we run ~10 geographically distributed Kubernetes clusters, each with at least a single pair of Prometheus instances. Each instance has on average ~300k active series, the biggest having ~2m, and in total we have ~6m. A query for the container_cpu_usage_seconds_total metric on a single cluster is only around ~30k series; across all clusters it would be ~300k.

We use Grafana Cloud’s Prometheus service to aggregate data from all of our Prometheus instances in a single place, both for long-term storage and to be able to run queries over all our data. Grafana Cloud’s Prometheus service is based on Cortex, an open source CNCF project I started over three years ago to build a horizontally scalable, highly available Prometheus service. Cortex is committed to getting as close to 100% Prometheus compatibility as possible – and as such, we reuse the PromQL engine (and much of the storage format) from Prometheus itself.

But the PromQL engine in Prometheus is “single threaded” (well, single goroutine’d :-)). This has so far not been an issue for Prometheus, especially when you consider that Prometheus itself has a single-binary, single-process, no-dependencies model that eschews distributed systems. But individual CPU cores are not getting much faster – just more abundant – so we can’t rely on Moore’s Law to accelerate our PromQL queries. The rest of this post is about the various techniques we’ve used to parallelize and shard PromQL over multiple cores (and multiple machines).

Step 1: Time-based Parallelization and Caching

Our first attempt to solve this problem involved caching the results of repetitive queries, a la Trickster, and splitting long PromQL queries up along the time axis. This brought huge performance improvements to the vast majority of queries, but didn’t solve the high cardinality problems described above. We’ve gone into more depth about these techniques in a previous post, so I won’t repeat myself here.
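
To make the splitting concrete, here’s a minimal sketch in Go of how a long query range can be cut into interval-aligned sub-ranges that can be executed in parallel and cached independently. The names here (timeRange, splitByInterval) are hypothetical – this is an illustration of the idea, not the Cortex implementation:

package main

import (
	"fmt"
	"time"
)

// timeRange stands in for the start/end of a PromQL range query.
type timeRange struct {
	start, end time.Time
}

// splitByInterval cuts a long query range into sub-ranges aligned to the given
// interval (for example 24h). Each sub-range can then be executed in parallel,
// and its result cached under a key derived from its aligned boundaries.
func splitByInterval(r timeRange, interval time.Duration) []timeRange {
	var out []timeRange
	for start := r.start; start.Before(r.end); {
		// Truncate aligns the next boundary to the interval, so repeated
		// queries over overlapping ranges produce identical sub-queries.
		next := start.Truncate(interval).Add(interval)
		if next.After(r.end) {
			next = r.end
		}
		out = append(out, timeRange{start: start, end: next})
		start = next
	}
	return out
}

func main() {
	now := time.Now().UTC()
	week := timeRange{start: now.Add(-7 * 24 * time.Hour), end: now}
	for _, sub := range splitByInterval(week, 24*time.Hour) {
		fmt.Println(sub.start.Format(time.RFC3339), "->", sub.end.Format(time.RFC3339))
	}
}

Aligning the sub-range boundaries to the interval is what makes the results cache effective: a dashboard that refreshes the same 7-day query over and over keeps producing the same day-long sub-queries, so the interior days can be served from cache and only the partial days at the edges need to be re-executed.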

As part of the talk we gave a demo of running the parallelizing, caching query frontend against a normal Prometheus server, demonstrating that these techniques aren’t limited to Cortex. Give it a try yourself:

$ # Run a Grafana instance to compare the results
$ docker run -d -p 3000:3000 grafana/grafana

$ # Grab & build Cortex, and run an instance of the query frontend
$ git clone https://github.com/cortexproject/cortex
$ cd cortex
$ go build ./cmd/cortex
$ ./cortex \
  -config.file=./docs/configuration/prometheus-frontend.yml \
  -frontend.downstream-url=http://demo.robustperception.io:9090

Go to http://localhost:3000 and log in with the username/password “admin/admin”. Add two Prometheus datasources: one pointed at http://demo.robustperception.io:9090 (Robust Perception’s demo Prometheus) and another at http://host.docker.internal:9091 (the query frontend you just started).

Using Grafana’s explore mode, run the query histogram_quantile(0.50, sum by (job, le) (rate(prometheus_http_request_duration_seconds_bucket[1m]))) against Robust Perception’s Prometheus instance over 7 days. It should take ~5-6 seconds. Now run the same query against the local Cortex frontend; it should take ~2 seconds. With a cold cache, all this is showing is the effect of parallelizing the query. Now run the query again – this time hitting the cached results – and see that it only takes ~100ms.
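
If you’d rather reproduce the comparison outside Grafana, a small Go sketch like the one below times the same range query against both endpoints. It assumes the demo setup above: the query frontend listening on localhost:9091 and proxying the standard Prometheus /api/v1/query_range API (which it must be doing for the Grafana datasource to work):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// timeQuery issues one range query against a Prometheus-compatible endpoint
// and prints how long the request took.
func timeQuery(base, query string, start, end time.Time, step time.Duration) error {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprint(start.Unix()))
	params.Set("end", fmt.Sprint(end.Unix()))
	params.Set("step", fmt.Sprintf("%ds", int(step.Seconds())))

	began := time.Now()
	resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the timing covers the full response, not just the headers.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}
	fmt.Printf("%-40s %s in %v\n", base, resp.Status, time.Since(began))
	return nil
}

func main() {
	const q = `histogram_quantile(0.50, sum by (job, le) (rate(prometheus_http_request_duration_seconds_bucket[1m])))`
	end := time.Now()
	start := end.Add(-7 * 24 * time.Hour)

	// Direct to Prometheus first, then via the local Cortex query frontend.
	for _, base := range []string{
		"http://demo.robustperception.io:9090",
		"http://localhost:9091",
	} {
		if err := timeQuery(base, q, start, end, 5*time.Minute); err != nil {
			fmt.Println(base, "failed:", err)
		}
	}
}

Run it once with a cold cache and then again immediately afterwards to see the difference between the parallelized and the cached behaviour.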

Don’t look at the absolute numbers for these query timings – they’re pretty anecdotal, depending on things like your internet connection and how many people are using the Prometheus instance, and will change from Prometheus release to release. But the relative difference between them has been pretty consistent for months now.

Now go and deploy the query frontend in front of all your Prometheus instances (and even Thanos), and watch the load on your Prometheus instances fall and your dashboards render faster.

Step 2: Aggregation Sharding

New to the KubeCon talk was the concept of aggregation sharding. This is where we effectively take queries that look like this…

sum(rate(foo{bar="baz"}[1m]))

…and turn them into queries that look like this…

sum(
  sum(rate(foo{bar="baz", shard="1"}[1m]))
  +
  sum(rate(foo{bar="baz", shard="2"}[1m]))
  + ...
  +
  sum(rate(foo{bar="baz", shard="16"}[1m]))
)

…executing each of the “partial” aggregations on a different worker. Early results from test and staging environments show this can result in a ~5x reduction in query latency for high-cardinality queries.
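
The execution side is, conceptually, a small fan-out/fan-in: fire off the partial aggregations concurrently, wait for them all, then apply the outer sum. The sketch below shows the shape of it with a trivial stand-in for the downstream call – in reality each partial aggregation would be dispatched to a different worker over the network:

package main

import (
	"fmt"
	"sync"
)

// evalPartial is a stand-in for dispatching one partial aggregation, e.g.
// sum(rate(foo{bar="baz", shard="3"}[1m])), to a worker and returning its
// result. Here it just fakes a per-shard value.
func evalPartial(shard int) float64 {
	return float64(shard)
}

func main() {
	const shards = 16
	results := make([]float64, shards)

	// Fan out: run the 16 partial aggregations concurrently, one per shard.
	var wg sync.WaitGroup
	for i := 0; i < shards; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = evalPartial(i + 1)
		}(i)
	}
	wg.Wait()

	// Fan in: the outer sum() merges the partial results.
	var total float64
	for _, r := range results {
		total += r
	}
	fmt.Println("sum of partial aggregations:", total)
}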

There is a lot of subtlety here: ensuring labels are propagated correctly between the partial aggregations; ensuring things like binary operations are not sharded; translating queries that cannot easily be sharded into ones that can (such as translating an average into a sum divided by a count). The Cortex query frontend is starting to look a lot like a query planner with a healthy dose of MapReduce thrown in…
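
As a rough illustration of that last translation, here’s a sketch that builds the sharded form of an aggregation as a string and uses it to rewrite an average as a sum divided by a count. The function names, the shard label, and the string-building are all illustrative – a real implementation would rewrite the parsed query, not raw text:

package main

import (
	"fmt"
	"strings"
)

// shardedAgg builds the "sum of partial aggregations" form of
// op(rate(metric{matchers}[rng])), injecting a shard matcher into the selector.
// Note the merge step is always a sum, even for count: the total count is the
// sum of the per-shard counts.
func shardedAgg(op, metric, matchers, rng string, shards int) string {
	parts := make([]string, 0, shards)
	for i := 1; i <= shards; i++ {
		sel := fmt.Sprintf(`%s{%s, shard="%d"}`, metric, matchers, i)
		parts = append(parts, fmt.Sprintf("%s(rate(%s[%s]))", op, sel, rng))
	}
	return "sum(\n  " + strings.Join(parts, "\n  +\n  ") + "\n)"
}

func main() {
	// avg(rate(foo{bar="baz"}[1m])) can't be assembled from partial averages,
	// but it can be rewritten as a sharded sum divided by a sharded count.
	sum := shardedAgg("sum", "foo", `bar="baz"`, "1m", 16)
	count := shardedAgg("count", "foo", `bar="baz"`, "1m", 16)
	fmt.Printf("(\n%s\n)\n/\n(\n%s\n)\n", sum, count)
}

The output is the sum of partial sums divided by the sum of partial counts – the same value as the original average, but assembled from pieces that can each run on a different worker.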

This is still very much a work in progress, and I want to give a big shout-out to Owen Diehl and Cyril Tovena for this amazing work. Once we have this in production with Cortex, we’re really excited about making the technique available to things like Thanos, where it will allow us to push the partial aggregations out to the edge Prometheus instances rather than shipping the chunk data across the wide area network.

Sharding and Caching Results

TL;DR: Faster, Higher-Cardinality Queries

If you’ve gotten this far – well done!

In this blog post we’ve given you a glimpse into how we’re pushing the limits of what you can do with the Prometheus query language. A combination of time-based parallelization, results caching, and aggregation sharding allows you to execute queries that would previously have taken tens of seconds with sub-second latency. This lets you run queries over all your Prometheus data at once – something that is key for use cases like capacity planning and performance regression analysis. And we’re committed to making this technology available to the entire Prometheus ecosystem – not just Cortex, but “normal” Prometheus, Thanos, and more.