Addressing metric overload: a deep dive on Adaptive Metrics

2024-10-04 10 min

When it comes to metrics, development teams often have a “better safe than sorry” mindset of collecting anything and everything. But that can translate to explosive growth of time series data, which can lead to higher costs and noisy signals.

It’s a challenge we saw many of our users facing, which is why we developed Adaptive Metrics, a feature in Grafana Cloud that helps cut costs by aggregating unused and partially used metrics into lower cardinality versions.

Adaptive Metrics was the subject of the most recent episode of “Grafana’s Big Tent,” hosted by Grafana Labs Engineering Director Mat Ryer. Mat was joined by Patrick Oyarzun and Mauro Stettler, both Principal Software Engineers who work on Adaptive Metrics, as well as by Oren Lion, Director of Software Engineering, Productivity Engineering, at TeleTracking, which used the feature to cut its Grafana Cloud Metrics spending by 50%.

In this blog post, we’ll share some of the highlights from the first half of the show, which focused on the problems organizations are facing and how Adaptive Metrics can help. You’ll also get a peek at what the engineers behind Adaptive Metrics are planning next. Check out the full episode if you want to learn more about the feature’s hackathon origins, Mat’s attempts at podcast-driven software development, the roadmap, and the technical challenges that go into developing and maintaining Adaptive Metrics.

Note: The following are highlights from episode 4, season 2 of “Grafana’s Big Tent.” The transcript below has been edited for length and clarity.

How metrics can quickly add up

Oren Lion: What I’ve found is that 40% of the time series come from custom metrics, and the other 60% come from dependency metrics. Now, to get a picture of this, here’s a concrete example of custom metrics. I’ve got 200 microservices. Each service produces 500 time series and is scaled to three pods. So that’s 300,000 time series pushing from that cluster. But let’s say we migrate to a new cluster and run blue/green. Now we’re pushing 600,000 time series for a time. And just turning to dependencies, Promtail runs as a DaemonSet, so there’s one pod per node. It publishes 275 time series per pod, and on a 40-node cluster that’s about 11,000 time series.

So how do we get to a million time series? It was fast and easy. When you work with the teams, they’re thinking ahead about design and about how to monitor their service and its dependencies, but they fall short of estimating and tracking the cost to monitor a service.
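To follow the arithmetic, here is a minimal sketch of the numbers Oren walks through above (the service, series, and node counts are the illustrative figures from his example, not measurements):

```python
# Back-of-the-envelope math from Oren's example (illustrative numbers only).

services = 200            # microservices in the cluster
series_per_service = 500  # time series each service exposes
replicas = 3              # pods per service

custom_series = services * series_per_service * replicas
print(f"steady state: {custom_series:,} series")           # 300,000

# During a blue/green migration both clusters push at once.
print(f"during blue/green: {custom_series * 2:,} series")  # 600,000

# Promtail runs as a DaemonSet: one pod per node.
nodes = 40
promtail_series_per_pod = 275
print(f"Promtail alone: {nodes * promtail_series_per_pod:,} series")  # 11,000
```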

Mat Ryer: Well, it’s a tough problem, because you don’t really always know what you’re going to need later. So there’s definitely this attitude of “We’ll just record everything, because we’re better safe than sorry.”

Oren: It’s a verbosity problem. And controlling verbosity is an observability problem. Look at logs and metrics. With logs, you solve the problem of controlling verbosity with log level… Metrics have verbosity, but it comes in the form of cardinality. So we’re generally not as good at filtering metrics in relabel configs as we are at setting a log level for logs. And so there are ways to identify high cardinality, but nothing like a log level. There’s no simple way to just dial it down.

Mat: Well, you’re actually on a podcast with a couple of the engineers here. So if you’ve got ideas to pitch them, by all means. Mauro, what do you think? Label levels for metrics?

Mauro Stettler: Yeah, I agree with everything that Oren said. I would like to add one more thing about where the high cardinality comes from. I think in some cases the cardinality is simply an artifact of how the metrics get produced. Usually, when the metrics are produced, there are many service instances producing metrics about themselves, and very often you have multiple instances doing the same thing; they’re basically just replicas of the same service. And due to how the metrics are produced and collected, each time series produced by each of those instances gets a unique label assigned to it. But depending on what the metric represents, you might not actually need to know which of those instances has, I don’t know, increased the counter.
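To make that concrete, here is a small, hypothetical illustration (the metric, job, and pod names are made up): three replicas of one service expose the same counter, and only the pod label differs, so one logical series is stored three times.

```python
# Hypothetical label sets scraped from three replicas of the same service.
# The series are identical except for the per-instance pod label.
series = [
    {"__name__": "http_requests_total", "job": "checkout", "status": "200", "pod": "checkout-7f9c-abc12"},
    {"__name__": "http_requests_total", "job": "checkout", "status": "200", "pod": "checkout-7f9c-def34"},
    {"__name__": "http_requests_total", "job": "checkout", "status": "200", "pod": "checkout-7f9c-ghi56"},
]

# If no query ever asks "which pod?", the pod label only multiplies cardinality.
distinct_without_pod = {
    tuple(sorted((k, v) for k, v in s.items() if k != "pod")) for s in series
}
print(len(series), "stored series ->", len(distinct_without_pod), "logical series without the pod label")
```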

The problem with a piecemeal approach

Mat: How do people solve this today then? Is it a case of you sort of have to go through all your code, look at all the places you’re producing metrics, and try and trim it down? Does that ever work?

Mauro: Very often what I’ve seen in larger organizations is that there are strict rollout policies, which make it impossible to deploy fixes quickly. For example, a new change gets deployed, which blows up the cardinality because a new label has been added, which has a really high cardinality. Then the team of developers maintaining that application realizes the problem, they want to fix it, but it takes weeks to get the fix into production.

Patrick Oyarzun: Yeah. And there have been solutions around for a while that try to help control cardinality. Typically, it takes one of two forms. One is just dropping some metrics entirely. For example, say you’re monitoring Kubernetes. You might find a list somewhere of metrics you don’t really need, according to somebody’s opinion or some philosophy they’ve applied, and you might decide to just drop all of those outright. That still requires changing relabel configs, which is sometimes hard.

The other way people try to do this is by dropping individual labels off of their metrics. Maybe you realize you have a redundant label, or a label that is increasing cardinality but that you don’t care about. The problem with doing that, though, is that a lot of the time you’ll run into errors in the database you’re sending to. Lots of time series databases, including Prometheus (and anything that’s trying to look like Prometheus), require that every series has a unique label set. If you try to, say, just drop a pod label on a metric, usually you’ll start getting errors like “duplicate sample received” or “same timestamp, but a different value.”

And so it’s not as simple as just “don’t send the data.”
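Here is a minimal sketch of the failure mode Patrick describes, using made-up series and values: stripping the pod label leaves several samples with the same label set but different values, which is exactly what a Prometheus-compatible database rejects, so an aggregating write path has to combine them instead.

```python
from collections import defaultdict

# Per-pod samples for one metric at the same timestamp (hypothetical values).
samples = [
    ({"__name__": "http_requests_total", "job": "checkout", "pod": "a"}, 10.0),
    ({"__name__": "http_requests_total", "job": "checkout", "pod": "b"}, 7.0),
    ({"__name__": "http_requests_total", "job": "checkout", "pod": "c"}, 12.0),
]

def drop_label(labels, name):
    """Return the label set without the given label, as a hashable key."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != name))

collisions = defaultdict(list)
for labels, value in samples:
    collisions[drop_label(labels, "pod")].append(value)

# Three samples now share one label set with different values: the
# "duplicate sample" situation the database rejects.
print({key: values for key, values in collisions.items() if len(values) > 1})

# An aggregating write path combines the colliding values instead
# (summing, for a counter) so a single valid series remains.
aggregated = {key: sum(values) for key, values in collisions.items()}
print(aggregated)  # one series with value 29.0
```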

What exactly is Adaptive Metrics?

Mauro: Adaptive Metrics consists of two parts. The first part is what we call the recommendations engine, which analyzes a user’s series index. It looks at all of the series that the user currently has, and at the usage of those series. Based on that information, it tries to identify labels that raise the series cardinality but that, according to the usage patterns, the user doesn’t actually need because they never use them. It then generates recommendations saying, “Label X could be dropped,” and this will reduce your active series count by some number.

Then the second part is what we call the aggregator, which is part of the metrics ingestion pipeline when you send data to Grafana Cloud. The aggregator lets the user apply and implement those recommendations by defining rule sets that say, “OK, I want to drop this label from metric X,” and the aggregator then performs the necessary aggregation on that data in order to generate an aggregate with reduced cardinality, according to the recommendation.
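As a rough mental model only (this is not the actual rule format or implementation used by Adaptive Metrics), the aggregator can be pictured as grouping incoming samples of a rule’s metric by the labels that remain after the drop list, then combining each group:

```python
from collections import defaultdict

# A hypothetical rule in the spirit of a recommendation; the real rule
# format used by Grafana Cloud is not shown here.
rule = {"metric": "http_requests_total", "drop_labels": {"pod", "instance"}, "aggregate": sum}

def apply_rule(samples, rule):
    """Pass non-matching metrics through untouched; group matching ones by
    the labels that remain after dropping, then aggregate each group."""
    groups = defaultdict(list)
    for labels, value in samples:
        if labels.get("__name__") != rule["metric"]:
            yield labels, value
            continue
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in rule["drop_labels"]))
        groups[key].append(value)
    for key, values in groups.items():
        yield dict(key), rule["aggregate"](values)

samples = [
    ({"__name__": "http_requests_total", "job": "checkout", "pod": "a"}, 10.0),
    ({"__name__": "http_requests_total", "job": "checkout", "pod": "b"}, 7.0),
    ({"__name__": "up", "job": "checkout", "pod": "a"}, 1.0),
]
# The "up" sample passes through as-is; the two matching samples collapse
# into one series without the pod label, with value 17.0.
print(list(apply_rule(samples, rule)))
```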

Patrick: We actually have a tool that has existed for a while, called Mimirtool, that can automate a lot of that. And it’s open source, anybody can use it. It’ll tell you which metrics are used, basically.

What Adaptive Metrics does is it goes a step further. So instead of just telling you that the “Kubernetes API server request duration seconds” bucket is used, it’ll also tell you whether or not every label is used. And it’ll also know that in the places it is used, it’s only ever used in a sum-of-rate PromQL expression. And because we know all of that at once, we can actually tell you with confidence, “Hey, if you drop the pod label on that, it’s not going to affect anything.” All of your dashboards will still work, you’ll still get paged for your SLOs.
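To see why that is safe when the only queries are sum-of-rate, consider a toy calculation with hypothetical counter values: summing per-pod rates and taking the rate of a series pre-summed across pods give the same answer (counter resets aside), so a query that never groups by pod cannot tell the difference.

```python
# Hypothetical counter values from three pods at two scrapes, 60 seconds apart.
t0 = {"pod-a": 1000.0, "pod-b": 800.0, "pod-c": 1200.0}
t1 = {"pod-a": 1120.0, "pod-b": 895.0, "pod-c": 1343.0}
window = 60.0

# sum(rate(...)) evaluated over the raw per-pod series ...
sum_of_rates = sum((t1[p] - t0[p]) / window for p in t0)

# ... versus the same query over a series pre-summed across pods at ingest.
rate_of_sum = (sum(t1.values()) - sum(t0.values())) / window

assert abs(sum_of_rates - rate_of_sum) < 1e-9  # identical, counter resets aside
print(sum_of_rates, rate_of_sum)
```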

It also adapts over time. It’s been common for a long time to have these public data sets of all the metrics that you probably want to keep for Kubernetes, Kafka, Redis, or any popular technology. What Adaptive Metrics does is basically it’ll find all of that dynamically, and then over time, as you start using more, or stop using some of it, it’ll start generating new recommendations.

Maybe you can aggregate a little more aggressively because you transitioned from classic histograms to native histograms, a feature related to the histogram cardinality issue Oren was talking about. Once you do that migration, Adaptive Metrics might notice, “Hey, that old classic histogram is unused now, and you can get rid of it,” even if you don’t have control of the application that’s generating that data.

So it’s really this feedback loop that I think has made Adaptive Metrics start to stand out. And we’ve been using it internally: at this point, we apply the latest recommendations every weekday morning, and we’ve been doing that for quite a long time now. Nobody reviews them, and it has generally worked pretty well.

Where do we go from here?

Patrick: The word I keep using when I think about [next steps] is “just-in-time metrics.” It’s something that I think we’re just scratching the surface of, but I can imagine a future state where you generally are paying very little for your metric storage, and then something goes wrong and you can turn on the fire hose, so to speak, of like “I want to know everything.” And I think it’s not just metrics. Grafana in general is developing Adaptive Telemetry across all observability signals. There could be a future state where when you turn on that fire hose, it includes all signals, and you’re saying all at once, “In this one region, I know I’m having an issue, and I want to stop dropping metrics, logs, traces, profiles. I want to just get everything for the next hour.” And then I can do my investigation, I can do my forensics, and then I can start saving money again. I think it’s realistic that we could get to that point eventually.

Mat: Yeah. I quite like the idea that you declare an incident and then it just turns on automatically. It levels up everything for that period, because there’s an incident happening.

Mauro: Just to extend on that: I think it’s even possible that we will get to a point where you can not only turn on the fire hose right now, but turn it on for the last hour, too. That would be the coolest feature we could build. Because I think the biggest blocker for people is that they’re afraid they’re going to drop information and regret it later. If they were able to go back in time by just one or two hours, even if it’s not very long, I think that would really help a lot of people not worry about dropping labels we say they don’t need.

Mat: Yeah, this is cool. I mean, I don’t think we’re allowed to do podcast-driven roadmap planning, but it feels like that’s where we are. Mauro, just to be clear, that feature would use some kind of buffer to store the data. You wouldn’t actually try and solve time travel and send people back an hour, would you? That’s too far.

Mauro: That’s one possibility, but I don’t have a design doc for that one. But we do have a design doc for the solution with the buffer, because we actually do have the buffer already. We just need to use it.

Mat: I see, cool. Yeah, that does make sense. Do the easier one first, and then time travel later—save it for a different hackathon.

“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.