Grafana Labs has been running Cortex for more than a year to power Hosted Prometheus in Grafana Cloud. We’re super happy: it’s been incredibly stable and has recently gotten insanely fast. Here’s what you need to know about Cortex, what we’ve done with it over the past year, and what we plan to do in the coming months.
What is Cortex?
Cortex is a Prometheus-compatible time series database that has a different take on some of the tradeoffs Prometheus makes. Cortex is a CNCF Sandbox project.
Cortex is horizontally scalable; it is not limited to the performance of a single machine. It can be clustered to pool resources from multiple machines, so capacity grows as you add nodes instead of being capped by one box. Cortex is also highly available: it replicates data across multiple machines so that it can tolerate machine failures with no effect on users.
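To make the replication idea concrete, here is a minimal sketch of one common way to spread series across machines: a consistent-hash ring with a replication factor. It is an illustration under assumptions, not Cortex’s actual code; the class name, the token count, and the replication factor of 3 are all hypothetical choices for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring (hypothetical, for illustration only)."""

    def __init__(self, nodes, tokens_per_node=64):
        # Each node owns several tokens so load spreads evenly.
        self._tokens = sorted(
            (_hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._keys = [token for token, _ in self._tokens]
        self._node_count = len(set(nodes))

    def replicas(self, series, replication_factor=3):
        """Return the distinct nodes that should hold copies of a series."""
        want = min(replication_factor, self._node_count)
        start = bisect.bisect_right(self._keys, _hash(series))
        chosen = []
        # Walk clockwise from the series' position, skipping repeat owners.
        for offset in range(len(self._tokens)):
            node = self._tokens[(start + offset) % len(self._tokens)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == want:
                    break
        return chosen
```

With a ring like this, losing one machine leaves the other copies of each series where they were, which is the property that lets a cluster ride out machine failures.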
These two Cortex features enable you to run a central Cortex cluster and have multiple Prometheis send their data there. You can then query all your data in one place with one query and get a globally aggregated view of your metrics.
This is super useful when you run multiple, geographically distributed Kubernetes clusters, each with its own dedicated Prometheus servers. Have them all send metrics to your Cortex cluster and run global queries, aggregating data from multiple clusters, in one place.
Capacity Planning and Long-Term Trends
Cortex provides durable, long-term storage for Prometheus; it can store data in a number of storage backends (Google Bigtable, GCS, AWS DynamoDB, S3, Cassandra, etc.). Cortex uses this storage to offer fast queries over historical data.
Long-term storage allows capacity planning and long-term trend analysis. You can go back a year and see how much CPU you were using, so you can plan the next year’s growth. You can also look at things like long-term performance trends to help identify releases that made latency worse, for instance.
One Cluster, Many Teams
Cortex supports native multitenancy; there can be multiple, isolated instances within a single Cortex cluster.
Multitenancy allows many teams to securely share a single Cortex cluster in isolated, independent Prometheus “instances” – without the overhead of having to operate multiple separate clusters. Simply put, there’s less cognitive load.
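As a rough mental model of that isolation, the sketch below keeps a separate series map per tenant and requires a tenant ID on every read and write. It is a toy in-memory illustration, not how Cortex is implemented; the class and method names are hypothetical, and how the tenant ID reaches the cluster (for example, via a request header) is left out as an assumption.

```python
from collections import defaultdict


class MultiTenantStore:
    """Toy store: one isolated namespace of series per tenant ID."""

    def __init__(self):
        # tenant_id -> {series name -> [(timestamp, value), ...]}
        self._tenants = defaultdict(lambda: defaultdict(list))

    def write(self, tenant_id, series, timestamp, value):
        if not tenant_id:
            raise ValueError("every write must carry a tenant ID")
        self._tenants[tenant_id][series].append((timestamp, value))

    def query(self, tenant_id, series):
        if not tenant_id:
            raise ValueError("every query must carry a tenant ID")
        # .get() avoids creating empty entries for unknown tenants on read.
        tenant = self._tenants.get(tenant_id, {})
        return list(tenant.get(series, []))


store = MultiTenantStore()
store.write("team-a", "up", 1, 1.0)
store.write("team-b", "up", 1, 0.0)
```

Both teams write a series called `up`, but a query scoped to `team-a` only ever sees `team-a`’s samples.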
Cortex Progress in the Past Year
With the acquisition of Kausal, Grafana Labs invested heavily in the Cortex project in the last year. Here are some changes we’ve driven:
“Easy-to-Use Cortex”: We’ve built a single-process, single-binary monolithic Cortex architecture that makes it easier to get started and kick the tires. The same binary can also be run as a set of disaggregated microservices in production. This work was heavily inspired by the success of a similar approach in Loki.
Query Performance: Over the past year we have built a parallelizing, caching query engine for Cortex. We have optimized Cortex’s indexing and query processing. Some queries have gotten 100x faster. We can now achieve ~40ms average query latency and <400ms P99 latency for our heaviest workloads in production clusters.
HA Ruler: You can now horizontally scale your recording rules, pre-aggregating much more data and helping make queries faster.
Ingesting Data from HA Prometheus Pairs: Cortex has always been highly available, but it has relied on a single source of truth for its data – a single Prometheus node per cluster. We now support highly available Prometheus pairs (or more) to make the whole pipeline highly redundant and replicated. Data is deduplicated on ingestion.
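For the query performance item, the general shape of a parallelizing, caching query path can be sketched as follows: split a long range query into interval-aligned pieces, run the pieces concurrently, and cache each piece’s result. This is a simplified illustration, not the Cortex query engine; the one-day split interval, the worker count, and the `backend` callable are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

DAY = 24 * 60 * 60  # assumed split interval, in seconds


def split_range(start, end, interval=DAY):
    """Split [start, end) into sub-ranges aligned to interval boundaries."""
    parts = []
    cursor = start
    while cursor < end:
        boundary = (cursor // interval + 1) * interval
        nxt = min(boundary, end)
        parts.append((cursor, nxt))
        cursor = nxt
    return parts


class CachingQuerier:
    """Runs sub-queries in parallel and remembers their results."""

    def __init__(self, backend, interval=DAY):
        # backend is a hypothetical callable: (query, start, end) -> samples
        self.backend = backend
        self.interval = interval
        self.cache = {}

    def _fetch(self, query, part):
        key = (query, part)
        if key not in self.cache:
            self.cache[key] = self.backend(query, *part)
        return self.cache[key]

    def query_range(self, query, start, end):
        parts = split_range(start, end, self.interval)
        with ThreadPoolExecutor(max_workers=8) as pool:
            # pool.map preserves the order of the sub-ranges.
            chunks = list(pool.map(lambda p: self._fetch(query, p), parts))
        return [sample for chunk in chunks for sample in chunk]
```

Aligning the pieces to fixed boundaries is what makes the cache useful: a dashboard that re-requests the last 30 days mostly hits pieces it has already computed.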
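For the HA Prometheus item, one way to picture deduplication on ingestion is to elect a single replica per cluster, accept samples only from it, and fail over to another replica if the elected one goes quiet. The sketch below shows that idea under assumptions: the class name, the 30-second failover timeout, and the `cluster`/`replica` identifiers are hypothetical, not Cortex’s actual configuration.

```python
class HADeduper:
    """Accept samples from one elected replica per cluster."""

    def __init__(self, failover_timeout=30.0):
        self.failover_timeout = failover_timeout
        # cluster -> (elected replica, time we last accepted from it)
        self._elected = {}

    def accept(self, cluster, replica, now):
        """Return True if a sample from this replica should be stored."""
        current = self._elected.get(cluster)
        if current is None:
            # First replica seen for this cluster becomes the elected one.
            self._elected[cluster] = (replica, now)
            return True
        elected, last_seen = current
        if replica == elected:
            self._elected[cluster] = (replica, now)
            return True
        if now - last_seen > self.failover_timeout:
            # Elected replica has gone quiet: fail over to this one.
            self._elected[cluster] = (replica, now)
            return True
        # Duplicate from a non-elected replica: drop it.
        return False
```

While both replicas are healthy, only one copy of each sample is stored; if the elected replica stops sending, the other takes over after the timeout.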
Cortex Going Forward
Cortex is an inherently stateful app, which makes techniques like continuous deployment challenging. At Grafana Labs we have been deploying every week, usually on Monday: we roll the latest master out to our dev and staging environments, run it for a couple of days to catch bugs, then promote it to prod.
The Cortex master branch is already incredibly stable, and others running Cortex deploy master too. We also give high importance to backwards compatibility, adding new features off by default, behind feature flags. But this means that unless operators keep a close eye on the changes, they lose out on improvements. While nobody has voiced concerns so far, we don’t think this is a viable solution long term.
We will cut the first release of Cortex imminently and plan to cut a new release every month with a detailed changelog, so that folks can follow what’s going on and how to update to the latest and greatest.
We at Grafana Labs will still be running master to make sure our users get the best of Cortex. We will ensure that master stays very stable for those wanting to deploy the bleeding edge.
Are we done? Not yet. Our path to 1.0 includes adding a WAL (write-ahead log) for increased durability and deprecating old flags and index schema versions. And our post-1.0 goals include using Prometheus TSDB blocks to make Cortex an order of magnitude cheaper to run.
Interested in learning more? Join the Cortex project on GitHub.