Cortex v1.1 released with improved reliability and performance

Published: 21 May 2020

Today we’re releasing Cortex 1.1, the first (minor) release since Cortex went GA in March, over 6 weeks ago. This release represents more than 140 commits by over 30 different authors from 9 different companies. In this post we’re going to give you some of the highlights of this release. For more details, check out the changelog.

More features and reliability

With this release, Cortex now supports the Prometheus /api/v1/metadata API. The Grafana Cloud Agent will send metric metadata to Cortex, allowing you to access your metrics’ metadata (HELP, TYPE, and UNIT) within Grafana’s Explore view.
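
As a rough illustration, the sketch below parses a response in the shape documented for the Prometheus /api/v1/metadata endpoint. The sample payload and the `summarize_metadata` helper are made up for this example; a real call would be an HTTP GET against your Cortex query endpoint, with whatever authentication and tenant headers your deployment requires.

```python
import json

# Hypothetical sample payload, shaped like a Prometheus /api/v1/metadata response.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "http_requests_total": [
      {"type": "counter", "help": "Total number of HTTP requests.", "unit": ""}
    ],
    "process_resident_memory_bytes": [
      {"type": "gauge", "help": "Resident memory size in bytes.", "unit": "bytes"}
    ]
  }
}
"""


def summarize_metadata(body: str) -> dict:
    """Return {metric_name: (type, help, unit)} using the first entry per metric."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("metadata request did not succeed")
    summary = {}
    for name, entries in payload.get("data", {}).items():
        if not entries:
            continue  # metric listed without any metadata entry
        first = entries[0]
        summary[name] = (
            first.get("type", ""),
            first.get("help", ""),
            first.get("unit", ""),
        )
    return summary
```

Calling `summarize_metadata(SAMPLE_RESPONSE)` yields one (TYPE, HELP, UNIT) tuple per metric name, which is roughly the information Explore surfaces next to a metric.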

In the v1.0 release, we added an experimental Write-Ahead-Log (WAL) for samples that haven’t been committed to the chunk store yet. This ensures that, should a machine fail, those samples aren’t lost. After extensive experience testing and running the WAL in production, we’ve marked the feature as production-ready in 1.1.
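
The idea behind a write-ahead log fits in a few lines. The toy below is not Cortex's WAL: the `ToyIngester` class, the JSON-lines record format, and the single-file layout are all assumptions chosen for brevity. It only shows the ordering that matters: append to the log first, update memory second, and replay the log on restart.

```python
import json
import os
import tempfile


class ToyIngester:
    """Toy in-memory sample store backed by an append-only log (illustrative only)."""

    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.samples = {}  # series name -> list of (timestamp, value)
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        # Rebuild in-memory state from whatever was logged before the restart.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self.samples.setdefault(rec["series"], []).append(
                    (rec["ts"], rec["value"])
                )

    def append(self, series: str, ts: int, value: float) -> None:
        # 1. Persist the record before the write is considered accepted.
        record = {"series": series, "ts": ts, "value": value}
        self._wal.write(json.dumps(record) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 2. Only then update the in-memory view.
        self.samples.setdefault(series, []).append((ts, value))

    def close(self) -> None:
        self._wal.close()


def demo() -> dict:
    """Write two samples, drop the process state, and recover it from the log."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        first = ToyIngester(path)
        first.append("up", 1000, 1.0)
        first.append("up", 2000, 0.0)
        first.close()  # stands in for a crash: memory is gone, the log is not
        recovered = ToyIngester(path)
        state = dict(recovered.samples)
        recovered.close()
        return state
```

A production WAL also has to deal with torn writes, checkpointing, and truncation once samples reach the chunk store; none of that is modeled here.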

Even faster queries

This release features a novel optimization for a specific type of query: regular expression selectors with many chained OR cases, e.g., {foo=~"bar|baz|blip|..."}. These kinds of queries are commonly generated by Grafana dashboards using template variables. Inspired by a similar optimization in Prometheus’s TSDB, we have removed the need to use a regular expression when performing index lookups in the ingester. In certain cases, this can result in up to a 100x improvement in query performance:

➜  cortex git:(regex-opt-ingester) ✗ benchcmp old.txt new.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                             old ns/op     new ns/op     delta
BenchmarkSetRegexLookup/select_all-8                  435911065     882075        -99.80%
BenchmarkSetRegexLookup/select_two-8                  247328056     23848         -99.99%
BenchmarkSetRegexLookup/select_half-8                 327012500     530910        -99.84%
BenchmarkSetRegexLookup/select_none-8                 231561666     24398         -99.99%
BenchmarkSetRegexLookup/equality_matcher-8            2474          2488          +0.57%
BenchmarkSetRegexLookup/regex_(non-set)_matcher-8     274129027     276701117     +0.94%
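
To make the idea concrete, here is a minimal sketch of the general technique, not the code from Cortex or Prometheus. It checks whether a pattern such as `bar|baz|blip` is nothing more than literal alternatives and, if so, looks each alternative up in the index directly. The `find_set_matches` and `lookup` helpers, the metacharacter list, and the toy inverted index are assumptions made for this example.

```python
import re
from typing import Dict, List, Optional, Set

# Characters that give a pattern real regex semantics. If any appear, the
# pattern is not a plain set of literals and we fall back to regex matching.
_REGEX_META = set(".+*?()[]{}^$\\")


def find_set_matches(pattern: str) -> Optional[List[str]]:
    """Return the literal alternatives of 'a|b|c', or None if a regex is needed."""
    if any(ch in _REGEX_META for ch in pattern):
        return None
    parts = pattern.split("|")
    if any(part == "" for part in parts):
        return None  # empty alternatives have special meaning; keep the slow path
    return parts


def lookup(index: Dict[str, Set[int]], pattern: str) -> Set[int]:
    """Resolve a regex matcher against a toy index of {label value -> series IDs}."""
    values = find_set_matches(pattern)
    result: Set[int] = set()
    if values is not None:
        # Fast path: one dictionary lookup per alternative, no regex involved.
        for value in values:
            result |= index.get(value, set())
        return result
    # Slow path: test every label value against the fully anchored regex.
    compiled = re.compile(pattern)
    for value, ids in index.items():
        if compiled.fullmatch(value):
            result |= ids
    return result
```

The fast path costs one lookup per alternative, while the slow path scans every label value. That difference is in line with the benchmark above, where set-style patterns speed up dramatically and the non-set regex case is essentially unchanged.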

We’ve also made a series of improvements to the Cassandra chunk storage backend, adding TLS host verification and adding the option to limit the concurrency on reads to prevent overwhelming the database.
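
Limiting read concurrency is a general pattern, and the sketch below shows one common way to do it with a semaphore. It does not use Cortex's configuration or its Cassandra client: `LimitedReader`, `fake_backend_read`, and the numbers are hypothetical stand-ins.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class LimitedReader:
    """Wraps a read function so that at most max_concurrent calls run at once."""

    def __init__(self, read_fn, max_concurrent: int):
        self._read_fn = read_fn
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def read(self, key):
        with self._slots:  # blocks while all slots are in use
            return self._read_fn(key)


def demo(max_concurrent: int = 2, requests: int = 8) -> int:
    """Issue many reads in parallel and return the peak concurrency the backend saw."""
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0}

    def fake_backend_read(key):
        with lock:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
        time.sleep(0.01)  # pretend to wait on the database
        with lock:
            state["in_flight"] -= 1
        return key

    reader = LimitedReader(fake_backend_read, max_concurrent)
    with ThreadPoolExecutor(max_workers=requests) as pool:
        list(pool.map(reader.read, range(requests)))
    return state["peak"]
```

However many callers queue up, the backend never sees more than `max_concurrent` reads in flight; excess requests simply wait for a free slot.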

Finally, we’ve embedded the query frontend component into the single-process Cortex deployment and simplified how to configure it, preventing common misconfigurations. The query frontend is the Cortex component that implements query sharding, parallelization, and caching, and it dramatically improves query performance.

Edging closer to production-ready block storage

For the past ~6 months we’ve been working with the Thanos team to integrate their block-and-object-storage approach into Cortex. With this release, we’ve added caching of chunk data, improved the memory usage of the ingesters, and introduced blocks sharding via a new store-gateway service.

The store-gateway service sits between queriers and long-term storage and allows the blocks storage to horizontally scale the read path. When the store-gateway is enabled, blocks are sharded and replicated across all store-gateway instances, and then, at query time, the querier fetches relevant series from the minimum set of store-gateway instances holding the required blocks. Before this change, all blocks were loaded on every single querier, thus imposing a vertical scalability limit; with the introduction of the store-gateway, blocks are no longer loaded into queriers, and we can now horizontally scale querying block indexes.
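
The sharding and query-time selection described above can be sketched as follows. This is a toy model under stated assumptions, not Cortex's implementation: it assigns blocks with rendezvous hashing purely for brevity, and it approximates the "minimum set" of instances with a greedy cover. The instance names, block IDs, and function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Set


def owners(block_id: str, instances: List[str], replication: int) -> List[str]:
    """Pick `replication` instances for a block via rendezvous (highest-score) hashing."""

    def score(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}/{block_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(instances, key=score, reverse=True)[:replication]


def shard(blocks: List[str], instances: List[str], replication: int) -> Dict[str, Set[str]]:
    """Return {instance -> set of blocks it holds}."""
    held: Dict[str, Set[str]] = {instance: set() for instance in instances}
    for block in blocks:
        for instance in owners(block, instances, replication):
            held[instance].add(block)
    return held


def pick_instances(required: Set[str], held: Dict[str, Set[str]]) -> List[str]:
    """Greedily choose a small set of instances that together hold `required`."""
    remaining = set(required)
    chosen: List[str] = []
    while remaining:
        # Take the instance covering the most still-missing blocks.
        best = max(sorted(held), key=lambda i: len(held[i] & remaining))
        covered = held[best] & remaining
        if not covered:
            raise LookupError(f"no instance holds blocks: {sorted(remaining)}")
        chosen.append(best)
        remaining -= covered
    return chosen
```

With four instances and a replication factor of two, each block lives on exactly two instances, and a query touching three blocks contacts at most three instances, often fewer. Adding instances spreads the blocks further, which is the horizontal scaling the paragraph describes.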

Cortex v1.1 and the future

Since the v1.0 release we’ve seen an uptick in interest in Cortex, matched by an increase in development velocity from the community. I would like to encourage you to try out this release and have a play with Cortex – and if that all looks a bit daunting to you, check out Grafana Cloud, where Cortex is used behind the scenes to power our Prometheus service.

For more on Cortex, you can also watch the on-demand recording of the recent Taking Prometheus to Scale with Cortex webinar I did with Goutham Veeramachaneni.
