At KubeCon + CloudNativeCon in Barcelona last week, Weaveworks’ Bryan Boreham and I did a deep-dive session on Cortex, an OSS Apache-licensed CNCF Sandbox project. A horizontally scalable, highly available, long term storage for Prometheus, Cortex powers Grafana Cloud’s hosted Prometheus.
During our talk, we focused on the steps that we’ve taken to make Cortex’s query performance awesome.
Cortex embeds the Prometheus PromQL engine, and mates it to Cortex’s own scale-out storage engine. This allows us to stay feature compatible with queries users can run against their Prometheus instance. Queries are handled by a set of stateless, scale-out query jobs. You could add more to “increase” performance. But this only improved the handling of concurrent queries. Single queries were still handled by a single process.
It may sound obvious, but we have to be able to answer queries in Cortex using only the information in the user’s query – the label matchers and the time range. First we find the matching series in the query using the matchers. Then we find all the chunks for those series, for the time range we’re interested in, using a secondary index. We then fetch those chunks into memory, and merge and deduplicate them in a format that the Prometheus PromQL engine can understand. Finally, we pass this to the PromQL engine to execute the query and return the result to the user.
In the past year, we’ve made several optimizations.
Optimization 1: Batch iterators for merging results
Cortex stores multiple copies of the data you send it in heavily-compressed chunks. To run a query, you have to fetch this data, merge it, and deduplicate it.
The initial technique used to do this was very naive. We would decompress the data in memory and merge the decompressed data. This was very fast, but it used a lot of memory and caused the query processes to OOM (out of memory) when large queries were sent to them.
So we started using iterators to dedupe the compressed chunks in a streaming fashion, without decompressing all the chunks. This was very efficient and used very little memory – but the performance was terrible. We used a heap to store the iterators, and operations on the heap (finding the next iterator) were terribly expensive.
We then moved to a technique that used batching-iterators. Instead of fetching a single sample on every iteration, we fetched a batch. We still used the heap, but we had to use it significantly less. This was almost as fast as the original method, and used almost the same amount of memory as the pure iterator-based approach.
Optimization 2: caching . . . everything
As explained, Cortex first consults the index to work out what chunks to fetch, then fetches the chunks, merges them, and executes the query on the result.
We added a series of memcached clusters everywhere possible – in front of the index, in front of the chunks, etc. These were very effective at reducing peak loads on the underlying data and massively improved the average query latency.
In Cortex, the index is always changing. We had to tweak the write path to ensure that the index could be cached. We had to make sure the ingesters held on to data that they had already written to the index for 15 minutes, so that entries in the chunk index could be considered valid for up to 15 minutes.
Optimization 3: Query Parallelization and Results Caching
We added a new job in the query pipeline: the query frontend.
This job is responsible for aligning, splitting, caching, and queuing queries.
Aligning: We align the start and end time of the incoming queries with their step. This helps make the results more cacheable. Grafana 6.0 does this by default now.
Splitting: We split queries that have a large time range into multiple smaller queries, so we can execute them in parallel.
Caching: If the exact same query is asked twice, we can return the previous result. We can also detect partial overlaps between queries sent and results cached, stitching together cached results with results from the query service.
Queuing: We put queries into per-tenant queues to ensure a single big query from one customer doesn’t denial of service (DoS) smaller queries from other customers. We then dispatch queries in order, and in parallel.
We’ve made numerous other tweaks to improve performance.
We optimized the JSON marshalling and unmarshalling, which has a big effect on queries that return a very high number of series.
We added HTTP response compression, so users on the ends of slower links can still get fast query responses.
We have hashed and sharded index rows to guarantee a better load distribution on our index. Plus that means we can look up smaller rows in parallel.
In short, every layer of the stack has been optimized!
And the results are clear: Cortex, when run in Grafana Cloud, achieves <50ms average response time and <500ms 99th percentile response time across all our production clusters. We’re pretty proud of these results, and we hope you’ll notice the improvements too.
But we’re not done yet. We have plans, in collaboration with the Thanos team, to further parallelize big queries at the PromQL layer. This should make Cortex even better for high cardinality workloads.
Most large-scale, clustered TSDBs talk about ingestion performance, and in fact, ingesting millions of samples per second is hard. We spent the first three years of Cortex’s life talking about ingestion challenges. But we have now moved on to talking about query performance. I think this is a good indication of the maturity of the Cortex project, and makes me feel good.