
When theory meets scale: Operating Mimir and Tempo in production
Operating distributed databases at scale can be challenging. It is not just about algorithms, but also about surviving failure, unpredictable load, and human mistakes.
In this session, Grafana Mimir maintainer Marco Pracucci and Grafana Tempo maintainer Marty Disibio share hard-earned lessons from growing these two databases to massive scale in Grafana Cloud. Although Mimir and Tempo serve different telemetry signals, they share the same core architecture and design principles.
Marty and Marco start by presenting the architecture and the key decisions that allow Mimir and Tempo to scale reliably and cost-effectively. Then they move to practice, showing how those principles hold up in production – and where they don’t – by contrasting the real bottlenecks, breaking points, and operational failure modes they see in metrics versus traces.
Marty Disibio (00:00):
Hi, hello. Welcome. First, I would like to start with a little preface. Grafana is an open company, but open doesn't just mean open source. Of course, 80% of our code is open source, but it also means we're transparent about our culture and operations, and we hope that this talk is an example of that. Today, we'll share with you a few real incidents that we had in Grafana Cloud, some lessons that we learned the hard way, and how those lessons shaped the improvements we've made to our databases. Okay, let's introduce ourselves. I'm Marty, an engineer and maintainer of Grafana Tempo, and Tempo is our distributed tracing database.
Marco Pracucci (00:46):
Hello, I'm Marco. I'm a Grafana Mimir maintainer. Similarly to Tempo, Mimir is a distributed time series database, so it's for metrics, and it supports both OpenTelemetry and Prometheus metrics. Both databases are open source, they share the same architecture, and they've been designed to run at a massive scale. But these databases are also complex, and sometimes they can fail in unexpected ways. Today we will be very open with you, and we will share some real incidents we had and what we learned from them. The first case we want to show you is what we call the query of death. Let me introduce it with a real incident we had in March 2023, about three years ago.
(01:42):
This was our largest Mimir cluster at that time. At some point, several alerts fired: the ingestion path was down, and queries were failing or experiencing high latency. We take a look at the cluster and we notice that the ingester CPU is maxed out. The Kubernetes nodes where these ingesters were running have their CPU saturated. Even worse, the garbage collector can't keep up, and one after another the ingesters go down, out of memory. Now, a quick refresher of the Mimir architecture. The ingester is a stateful component that holds the most recent metrics received, before they are available for querying through the object storage. In the old architecture, the one we were running in 2023, the write path, the ingestion path, was writing directly into the ingesters. And then on the read path, when a query came in, the querier was querying the most recent data from the ingesters and the older data from the object storage through the store gateways.
(03:02):
This means that with the old architecture, if the ingesters were down or unavailable, both the ingestion path and the read path were unavailable at the same time. Now, back to the incident. The symptoms of this incident were clear, the ingesters were overloaded, but the root cause was still unknown. So we grabbed the CPU profiles, and we notice something unexpected: the regular expression engine is consuming all the CPU and memory. Now, regular expressions are used by some PromQL label matchers, so we do expect to run the engine in the ingesters during the index lookup phase, but we don't expect the engine to take all the CPU and memory. Since regular expressions are used in queries, we check the in-flight queries through the logs, and we find that a customer is running a bunch of queries with very long regular expressions, about 40 kilobytes each.
(03:59):
To try to stop the bleeding, we block these queries and we immediately contact the customer. But unfortunately, the system does not recover. Turns out the problem was even worse. When one of those queries was canceled or timed out, under the hood inside the ingester process, the regular expression engine was still running until all the label values had been checked. Now, the fix was easy: just periodically checking for cancellation. But I'm showing you the screenshot of the real code, just as an example of an edge case that we hadn't considered. When we built this four-line for loop, we hadn't considered that there could be an edge case where this for loop would take 15 minutes to run instead of 15 milliseconds.
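As a rough illustration of that kind of fix (a minimal Go sketch with hypothetical names, not the actual Mimir code), a hot loop like this can check the query context every few iterations so that a canceled or timed-out query stops burning CPU:

```go
package example

import (
	"context"
	"regexp"
)

// matchLabelValues evaluates a regex matcher against many label values.
// The important change after the incident: periodically check whether the
// query context has been canceled, so a query that times out actually stops
// consuming CPU inside the ingester instead of running to completion.
func matchLabelValues(ctx context.Context, re *regexp.Regexp, values []string) ([]string, error) {
	var matched []string
	for i, v := range values {
		// Check for cancellation every N iterations; the check is cheap,
		// but doing it on every single value would still add overhead.
		if i%1024 == 0 {
			if err := ctx.Err(); err != nil {
				return nil, err // canceled or deadline exceeded
			}
		}
		if re.MatchString(v) {
			matched = append(matched, v)
		}
	}
	return matched, nil
}
```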
(04:59):
So after stopping the bleeding by blocking the queries and fixing this bug, we started working on longer-term improvements, and these are the most significant ones. First of all, we introduced what we call regular expression unrolling. The core idea is that many regular expressions can be rewritten as simpler operations, like a set of equality matchers, prefix matchers, suffix matchers, and so on. We support many of them. When a regex is unrolled, we can completely skip the regex engine. This both makes the index lookup faster and reduces resource utilization. Just to give you a rough idea, in Grafana Cloud nowadays, 97% of the regular expressions used in PromQL completely skip the engine, thanks to the unrolling. Second, we introduced a cost-based planner. Nowadays, when a query is executed, we evaluate the cost of each label matcher in the query, and we run the cheapest and most selective ones first.
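To give a flavor of the unrolling idea before coming back to the planner, here is a minimal sketch that rewrites a few common regex shapes into plain string operations. The function names and the exact shapes handled are hypothetical, not Mimir's actual matcher code:

```go
package example

import "strings"

// matcher is a simplified stand-in for a label matcher that can be evaluated
// without invoking the regex engine.
type matcher func(value string) bool

// tryUnrollRegex rewrites a few common regex shapes into cheap string
// operations. It returns (nil, false) for anything it doesn't recognize,
// in which case the caller falls back to the real regex engine.
func tryUnrollRegex(pattern string) (matcher, bool) {
	const special = `.+?*()[]{}\^$`
	switch {
	case strings.Contains(pattern, "|") && !strings.ContainsAny(pattern, special):
		// Alternation of literals, e.g. "foo|bar|baz" -> a set of equality matchers.
		set := map[string]struct{}{}
		for _, alt := range strings.Split(pattern, "|") {
			set[alt] = struct{}{}
		}
		return func(v string) bool { _, ok := set[v]; return ok }, true

	case strings.HasSuffix(pattern, ".*") && !strings.ContainsAny(strings.TrimSuffix(pattern, ".*"), special+"|"):
		// A literal followed by ".*", e.g. "api_.*" -> a prefix matcher.
		prefix := strings.TrimSuffix(pattern, ".*")
		return func(v string) bool { return strings.HasPrefix(v, prefix) }, true

	case strings.HasPrefix(pattern, ".*") && !strings.ContainsAny(strings.TrimPrefix(pattern, ".*"), special+"|"):
		// ".*" followed by a literal, e.g. ".*_total" -> a suffix matcher.
		suffix := strings.TrimPrefix(pattern, ".*")
		return func(v string) bool { return strings.HasSuffix(v, suffix) }, true
	}
	return nil, false
}
```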
(06:12):
The remaining matchers, like complex regular expression matchers, are typically evaluated as a filtering step in a later stage. The main difference is that the set of values against which we evaluate these complex regular expressions is much, much smaller, thanks to the lazy label matchers. And we also introduced an overload protection mechanism. The ingesters continuously monitor their CPU and memory utilization, and they can reject queries if they are overloaded. The core idea is that rejecting queries for a short period of time is better than having ingesters go out of memory, which takes much longer to recover from. But this was not a one-off incident. Back in 2023, we had seen multiple incidents caused by a query of death. And the actual issue was that when the ingesters were overloaded, both ingestion and the query path were down at the same time. But ideally, queries should never be able to take down ingestion.
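A minimal sketch of what such an overload protection mechanism can look like (hypothetical names, memory-only for brevity; the real mechanism also tracks CPU, and this is not Mimir's actual implementation):

```go
package example

import (
	"errors"
	"runtime"
	"sync/atomic"
	"time"
)

// errOverloaded is returned to the querier instead of executing the query.
// Rejecting queries for a short period beats letting the ingester go out of
// memory, which takes far longer to recover from.
var errOverloaded = errors.New("ingester overloaded, query rejected")

// overloadProtector periodically samples heap usage and flips an "overloaded"
// flag once usage crosses a threshold; the query path checks the flag before
// doing any work.
type overloadProtector struct {
	heapLimitBytes uint64
	overloaded     atomic.Bool
}

func newOverloadProtector(heapLimitBytes uint64) *overloadProtector {
	p := &overloadProtector{heapLimitBytes: heapLimitBytes}
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for range ticker.C {
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			p.overloaded.Store(m.HeapAlloc > p.heapLimitBytes)
		}
	}()
	return p
}

// checkQueryAdmission is called at the start of every query on the ingester.
func (p *overloadProtector) checkQueryAdmission() error {
	if p.overloaded.Load() {
		return errOverloaded
	}
	return nil
}
```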
(07:18):
So Marty, what did we do?
Marty Disibio (07:24):
This led us to make a bigger change. What we needed was a new architecture with stronger isolation guarantees, and in the 3.0 releases of both Mimir and Tempo, that's what we're introducing: a new Kafka-based architecture that completely decouples ingestion from the read path. So, as we saw in the previous incident, a burst of queries can no longer take down ingestion, and vice versa. And in the 3.0 releases, in this new architecture, there are just two dependencies: object storage, the same as before, and Kafka, or any Kafka-compatible backend. In Grafana Cloud we use WarpStream, which is a cloud-native Kafka implementation.
(08:13):
Yeah. Okay. So now here's another story, this time involving Tempo and what we thought were tiny traces. And this story begins with an alert in one of our cells. Queriers are running out of memory and OOMing, and requests are failing, not just a little, but a lot. And at first glance, we're not sure what the problem is. The request rate is typical, and ingestion hasn't changed. And because queriers are crashing, they're taking down a lot of concurrent requests, so it's hard for us to pinpoint the problem, but it looks like trace lookups are failing more than other read requests. So that's where we focus our attention, and we have to dig a little bit deeper. We are able to get a memory profile off of one of the queriers in the few moments before it OOMs, and what we see are large memory allocations reading the Parquet dictionary.
(09:02):
And our first thought was maybe these are very large traces or something like that. But when we pick out some traces that are failing every time and start digging into them, we find the opposite. They're actually very small, only about 500 spans. And what we also didn't expect is that this small trace wasn't received in a short period of time, but had actually trickled in, just a few spans at a time, over seven days. And so it wasn't written to just one Parquet file in object storage like usual, but actually many files over those seven days. And this still didn't explain why the memory allocation was so high. We knew where and how, but not the why. So we had to dig a little bit deeper, and we get one of those traces that's failing every time. But to take a step back and talk about dictionary encoding: it's pretty common in columnar formats, and it's what we use in Parquet.
(10:02):
And the way it works is that to read the row in green, we first have to unpack the dictionary at the top of the column, and then we are able to extract the value referenced by that row. Normally dictionaries are very small and efficient, like this, but when we focused on one of those traces that was failing every time, we started to see the real answer, which is that this trace had very large, high-cardinality JSON attached to it. And it wasn't just this trace; virtually the entire workload was following this pattern of trickling spans in with large JSON over days at a time. And so putting it all together into storage had actually created huge dictionaries in every block. And so to read just one of those tiny traces, we had to unpack gigabytes of dictionaries into memory. And that was the real source of the OOMs.
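A toy sketch of dictionary encoding, just to illustrate why reading one row requires unpacking the whole dictionary (illustrative Go, not Tempo's actual Parquet code):

```go
package example

// dictionaryColumn is a toy model of dictionary encoding as used by columnar
// formats such as Parquet. Each row stores a small integer index into a
// dictionary of distinct values, so repeated values are cheap -- but reading
// even a single row requires the dictionary, and when the column holds huge
// high-cardinality values (like multi-kilobyte JSON), the dictionary itself
// becomes enormous.
type dictionaryColumn struct {
	dict    []string // distinct values, stored once per column chunk
	indexes []uint32 // one entry per row, pointing into dict
}

func encode(values []string) dictionaryColumn {
	col := dictionaryColumn{}
	seen := map[string]uint32{}
	for _, v := range values {
		idx, ok := seen[v]
		if !ok {
			idx = uint32(len(col.dict))
			col.dict = append(col.dict, v)
			seen[v] = idx
		}
		col.indexes = append(col.indexes, idx)
	}
	return col
}

// readRow shows why the dictionary has to be unpacked first: the row itself
// only holds an index, so the value cannot be materialized without it.
func (c dictionaryColumn) readRow(row int) string {
	return c.dict[c.indexes[row]]
}
```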
(10:50):
So the steps that we took to fix it: first, we were able to block some of the queries that were failing the most. We had some tunables on the server side, so we could adjust concurrency to spread out the memory allocations and reduce the OOMs, and this helped address a lot of them. But this incident had actually shown that we had an area in Tempo that was completely undefined. There was no limit to how long an attribute could be, and Tempo had no protection from this write pattern. So we introduced an attribute length limit, and everything longer than that is truncated. This has the intentional side effect of keeping the dictionaries small. These changes helped stabilize things, but they're not a great answer. Limiting data isn't what we really want to do. What we really wanted was to support the large attributes and support this workload.
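The attribute length limit itself is conceptually simple; here is a minimal sketch (the limit value and function name are made up, not Tempo's actual configuration):

```go
package example

// maxAttributeLength is a hypothetical limit in bytes; the real limit name
// and default in Tempo may differ.
const maxAttributeLength = 2048

// truncateAttribute cuts an attribute value down to the limit. Everything
// beyond the limit is dropped, which has the intentional side effect of
// keeping the Parquet dictionaries small.
func truncateAttribute(value string) (truncated string, wasTruncated bool) {
	if len(value) <= maxAttributeLength {
		return value, false
	}
	return value[:maxAttributeLength], true
}
```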
(11:41):
So what we really needed to do was change the way we store the data. And that's what we've been working on.
(11:50):
In Tempo 3, we have a new block format, still based on Parquet, an evolution of our format that gives us more precise control over dictionary usage on a per-attribute basis. And for the types of traces that we saw in this incident, we're seeing good results: 95% memory reduction on those lookups. So it would've prevented that incident. It also has some other benefits. It reduces the amount of overread from object storage, because we're no longer having to read those large dictionaries containing unrelated values. But the real goal, finally, is that we want to support that workload, so it will let us greatly raise that attribute length limit. Cool.
Marco Pracucci (12:36):
Back to Mimir. The third and last case we want to show you. In this case, we had an incident where a few slow queries ended up slowing down many fast queries. Like most incidents, it started with an alert firing. In this case, a specific customer was experiencing very high query latency. So we started digging into it, and we found out that most of the queries were not slow because of execution; they were slow just because they were waiting in a queue, specifically the query scheduler queue. Now, a quick refresher: the query scheduler is a component that holds a per-tenant queue with all the queries that need to be executed, and dispatches these queries to the queriers for execution. Then, when the querier executes a query, the data can come from two different places: the ingesters, which hold the most recent data, or the object storage, through the store gateways, where we store all the longer-term data.
(13:48):
Depending on the query, a query may hit just the ingesters, just the store gateways, or both of them. Typically, the majority of queries hit just the ingesters. Back to the incident. We find out that a small number of queries, the ones hitting the store gateways, were actually slow. The other queries, the ones hitting the ingesters, were fast, but they were stuck in the query scheduler queue. Why? At that time, the query scheduler had a per-tenant, first-in-first-out queue. So we did have fairness across different tenants, but within a single tenant, every query was treated the same.
(14:37):
Imagine what happens: a slow query enters the system and occupies one query worker for that specific tenant. Then we have a bunch of fast queries. They get executed quickly, but then another slow query comes in and occupies another query worker. Gradually, these slow queries, even if they arrive at a low rate, occupy the query workers until all of them are occupied. And then, when the next query comes in, it'll just be stuck in the queue waiting for a free worker, even if that query's execution would be fast. And again, this was not an isolated case. In the past, we had seen the store gateways slowing down all the queries multiple times. So what did we do? Well, first of all, we redesigned the query scheduler queue, and we introduced a multidimensional queue. The core idea is to split the per-tenant queue in the query scheduler into multiple lanes, depending on which Mimir component the query will hit under the hood: just the ingesters, just the store gateways, or both of them.
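A minimal sketch of the multi-lane idea (hypothetical types, not the actual Mimir query scheduler):

```go
package example

// queryLane identifies which backing components a query will touch.
type queryLane int

const (
	laneIngestersOnly queryLane = iota
	laneStoreGatewaysOnly
	laneBoth
	laneCount
)

type query struct {
	tenantID string
	lane     queryLane
	// ... the request itself
}

// tenantQueues holds one queue per lane for a single tenant, so slow
// store-gateway queries can no longer starve fast ingester-only queries.
type tenantQueues struct {
	lanes [laneCount][]query
}

func (t *tenantQueues) enqueue(q query) {
	t.lanes[q.lane] = append(t.lanes[q.lane], q)
}

// dequeue round-robins across lanes starting from startLane, so no single
// lane can monopolize the tenant's query workers.
func (t *tenantQueues) dequeue(startLane int) (query, bool) {
	for i := 0; i < int(laneCount); i++ {
		lane := (startLane + i) % int(laneCount)
		if len(t.lanes[lane]) > 0 {
			q := t.lanes[lane][0]
			t.lanes[lane] = t.lanes[lane][1:]
			return q, true
		}
	}
	return query{}, false
}
```

In this sketch, per-tenant fairness still sits above the lanes; the lanes only decide which queries within a single tenant compete for the same workers.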
(15:51):
This was a very simple change, but with a significant impact. After this change, slow store gateway queries were not able to slow down fast ingester queries anymore. But there's more: why were the store gateways slow? Why, back then, was querying the long-term data slow? Well, there were multiple reasons that we addressed over time, but in this presentation, I want to focus on two of them. First of all, the store gateways used an on-disk cache loaded using memory mapping. When we built it, we didn't know, but we learned the hard way, that in Go, the language we use for our databases, when data is loaded using memory mapping from disk and a page fault is triggered by the kernel to load the data from disk into memory, the Go processor responsible for running the goroutines gets stuck until the page fault is served.
(16:58):
Now, we didn't notice this at small scale; with a low rate of page faults, you don't even notice. But at a large scale, this became a real problem, because the process was stuck or hanging from time to time, and even if only for very short periods of time, this was increasing the latency. Second, store gateways and queriers were buffering all the data in memory before sending anything back to the client. This had two consequences. First of all, high memory utilization peaks, and sometimes store gateways going out of memory. Second, slow queries, especially high-cardinality queries. So what did we do? Well, as you can guess, we got rid of memory mapping, and we replaced the cache with plain direct disk I/O syscalls. But more importantly, and way more complex, we redesigned the Mimir query engine from scratch. And this took a huge effort. The new Mimir query engine is streaming-based.
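As a rough sketch of the first of those changes, replacing the memory-mapped cache with explicit read syscalls (hypothetical function, not Mimir's actual cache code):

```go
package example

import "os"

// readCacheEntry reads a cache entry with an explicit pread-style syscall
// instead of touching a memory-mapped region. With a real syscall, the Go
// runtime knows the goroutine is blocked on I/O and keeps scheduling other
// goroutines; a page fault on an mmap'd region blocks the OS thread invisibly
// and stalls the scheduler instead.
func readCacheEntry(f *os.File, offset int64, length int) ([]byte, error) {
	buf := make([]byte, length)
	if _, err := f.ReadAt(buf, offset); err != nil {
		return nil, err
	}
	return buf, nil
}
```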
(18:09):
The data is streamed from the object storage or the cache to the store gateways, the store gateways stream the data to the queriers, and the queriers run the new Mimir query engine, which processes the query in a streaming way as the data arrives. So the query is evaluated while we're still reading the data from disk. And the same happens for the data coming from the ingesters: the ingesters stream the data to the queriers, and they feed this data into the query engine.
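A toy sketch contrasting the buffered and streaming approaches (illustrative Go, not the real Mimir query engine types):

```go
package example

// seriesChunk is a stand-in for a batch of samples streamed from an ingester
// or a store gateway.
type seriesChunk struct {
	values []float64
}

// sumBuffered mimics the old approach: collect every chunk in memory first,
// then evaluate. Peak memory grows with the total amount of data selected.
func sumBuffered(fetchAll func() []seriesChunk) float64 {
	chunks := fetchAll() // everything resident in memory at once
	var total float64
	for _, c := range chunks {
		for _, v := range c.values {
			total += v
		}
	}
	return total
}

// sumStreaming mimics the new approach: fold each chunk into the running
// result as it arrives, so memory stays bounded and evaluation overlaps with
// reading the data.
func sumStreaming(next func() (seriesChunk, bool)) float64 {
	var total float64
	for {
		c, ok := next()
		if !ok {
			return total
		}
		for _, v := range c.values {
			total += v
		}
	}
}
```

The point of the streaming shape is that peak memory is bounded by what's in flight, and evaluation overlaps with reading the data, which is where the latency win for long time range and high-cardinality queries comes from.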
(18:40):
So nowadays, with Mimir 3.0 and the new Mimir query engine, the data is not fully loaded into memory anymore at query time; it gets processed gradually. This reduced resource utilization both in the store gateways and the queriers, but even more importantly, it significantly reduced the query latency for long time range queries and high-cardinality queries. Marty?
Marty Disibio (19:12):
Okay, we've walked you through a few incidents here, how we handled them, and the improvements that came out of them. And in hindsight, some of this might seem obvious, but the truth is that it's hard to predict where things will break. And when you're creating something new, you can't prevent unknown unknowns. There will just have to be some lessons you learn the hard way. But there are steps you can take to help yourself. In the design phase, focus on isolation and a smaller blast radius. The new architecture that we've introduced in each database is a good example of that; it totally decouples the read and write paths. And on-call: at Grafana, we strive to have a healthy on-call. What this looks like is the philosophy of "You build it, you run it." So the same engineers creating the databases are also the ones running them in the cloud.
(20:01):
We have a follow-the-sun rotation, and also no competing priorities. So it's not just encouraged but, on some teams, the rule that when you're on call, project work stops. And short term, when an incident does happen, try to have a lot of tools in your toolkit that you already know how to use. Things like runbooks and scripts eliminate guessing when something's on fire. And try to build mechanisms that you can use to stop the bleeding, like feature flags, circuit breakers, or being able to block the problematic queries. And long term: keep asking questions until you get down to the real root cause, and deeply understand what happened. And use those learnings to continuously improve the system. This is how we've shaped our databases here at Grafana, and how we run them in Grafana Cloud. Thank you.
Speakers

Marco Pracucci
Principal Software Engineer — Grafana Labs

Marty Disibio
Principal Software Engineer — Grafana Labs