
‘Grafana’s Big Tent’ podcast: Prometheus 3.0, native histograms, and the future of metrics
A hobby project inside a music-streaming company. A side effort to address monitoring gaps. Some of the most influential open source projects started as internal tools that were never expected to see the light of day.
That’s how Prometheus started.
In this episode of “Grafana’s Big Tent” podcast, hosts Mat Ryer, Principal Software Engineer at Grafana Labs, and Tom Wilkie, Grafana Labs CTO, sit down with Julius Volz, co-creator of Prometheus and founder of PromLabs and PromCon, and Richard “RichiH” Hartmann, Senior Developer Programs Director at Grafana Labs and Prometheus maintainer, to talk about the journey from Prometheus 1.x to Prometheus 3.0 — and what it means for the future of metrics and observability.
You can watch the full episode in the YouTube video below, or listen on Spotify or Apple Podcasts.

Note: The following are highlights from episode 2, season 3 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
The origins of Prometheus: an observability side project
Julius Volz: In 2012, I came from Google to SoundCloud. At the same time, Matt Proud [formerly the Director of Technical Infrastructure/Platforms at SoundCloud] had made the same move. We were tasked with making SoundCloud more reliable. SoundCloud was always down, slow, and unreliable. They already had a cluster scheduler — self-built, in-house, and very improvised — but we couldn’t really monitor everything properly with the existing tools in the open source world.
Both Matt and I really started missing the monitoring tool we had at Google called Borgmon. Over a couple of months, we kept coming back to the same realization: if we want to improve things, we need better insight and better visibility. Today, you’d call it observability. So we just started building what became Prometheus, at first in our free time, then more and more in SoundCloud time.
Eventually, Prometheus was mature enough, documented enough, and had enough components in its ecosystem — metrics exporters and so on — that we decided to fully publish it, both as an open source project, but also with a blog post from SoundCloud explaining what it was.
A year later, Prometheus joined the CNCF as the second project after Kubernetes.
Why text? The deliberate simplicity behind Prometheus metrics
Mat Ryer: One thing that struck me when I first encountered Prometheus many years ago was that it was just a sort of text-based format. When you get the metrics, it's just text. That always struck me as either very naive, or someone’s really thought about this. Which one was it, Julius?
Julius: Obviously, someone really thought about it. So, there are multiple layers of history to this. If you look at what Borgmon did internally at Google at the time, that was also a text-based format. Though, curiously, the first pre-alpha version of Prometheus actually used the JSON format. It didn't work too well, because you needed to parse the entire JSON body.
Eventually, we came up with both a protobuf format and a text-based format, and then with Prometheus 2, the protobuf format was kind of ditched in favor of the text format. It works really well, and it's also highly optimized in the ingestion path, where we try not to allocate any memory for metrics we've already seen. And the really nice thing about this format is the low barrier of entry. You can even emit it from a shell script, in an environment that doesn't have a full stack for modern programming languages.
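To make the "it's just text" point concrete, here is a minimal sketch in Go using the official client_golang library; the metric name and port are hypothetical, chosen just for illustration. A scrape of /metrics returns nothing more than plain text lines.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical counter; the metric name and help text are made up for illustration.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "myapp_http_requests_total",
	Help: "Total number of HTTP requests handled.",
})

func main() {
	prometheus.MustRegister(requestsTotal)
	requestsTotal.Inc()

	// Scraping http://localhost:8080/metrics returns plain text lines like:
	//
	//   # HELP myapp_http_requests_total Total number of HTTP requests handled.
	//   # TYPE myapp_http_requests_total counter
	//   myapp_http_requests_total 1
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because the exposition format is just lines of text, anything that can print lines like the ones in the comment over HTTP, a shell script included, can expose metrics that Prometheus will happily scrape.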
Prometheus and Kubernetes: similar ideas, different paths
Tom Wilkie: I think it's a common misconception that Prometheus and Kubernetes grew up together, came from the same place, and were built to work together. They do work incredibly well together, but actually, the Kubernetes integrations in Prometheus came relatively late.
If 2012 is when you started working on Prometheus, it was three years before the Kubernetes service discovery. Is that right, Julius?
Julius: Well, yes, but we only announced Prometheus to the world in January 2015. Very shortly after, people from Red Hat added service discovery for Kubernetes, and then soon after that, the Kubernetes components themselves — API server, etcd, kubelet — added native Prometheus metrics. So there was great cross-compatibility between the two systems pretty quickly, but it didn't come from a shared origin.
Kubernetes was inspired by Google's Borg cluster scheduler, and Prometheus was inspired by Google's Borgmon, the tool used to monitor services running on Borg, as well as the Borg clusters themselves. It makes sense that they would work very well together philosophically, with a label-based data model and service discovery.
The journey from 2.x to 3.0
Mat: So, hold on. We have Prometheus 3.0 — but when was the last major release before that? And why did it take so long? It’s been seven years since 2.0?
Julius: Yeah, I mean, this really shows how stable we’ve been. We’re an infrastructure project; we want to be somewhat conservative, and we want to be adopted and trusted by people. We managed to build things into Prometheus 2.0 that were well thought-out enough to survive for seven years without having to be broken — which was really nice for our users. And we made it really clear on our documentation page which elements of the Prometheus API surface are stable.
Mostly, that has meant that for the past seven years people could just grab a new version of Prometheus, and it would just keep working.
Of course, we started accumulating more and more small things, and a few bigger things, that we did want to change in a breaking way in Prometheus. That's where the idea for Prometheus 3.0 came from.
Performance gains — even while the world adds more metrics
RichiH: Personally, my favorite part of Prometheus 3.0 is probably performance. Even though this is a long-running program and project, and even though we have made major improvements over the years, we were still able to massively improve the overall performance, both in CPU and in memory.
It was between, I think, 3x and 7x in our tests between early 2.x versions and 3.x. So contrary to a lot of other software — in particular, end user-facing software that just becomes slower, more expensive, and more of a memory hog over time — we actually got more performance out of something that is vital to most businesses these days.
Tom: I love that Prometheus is getting faster and more efficient, but I also feel like people are just throwing more and more metrics at it. If we're four times better with memory, people are just throwing four times as many metrics into it. So there's definitely this attitude of "Just add a metric for everything, because Prometheus can handle it." And while that's brilliant, it's also kind of leading to a huge volume of metrics.
Native histograms: when your visibility goes “HD”
RichiH: With the old histograms, you basically had to know the properties of your system to choose bucket boundaries, whereas the native histograms just do what you want automatically, more or less. So you no longer need this “observe, improve, observe, improve” cycle that you used to go through unless you were already a subject matter expert or had a rough idea of what your data would look like.
Tom: I’m super excited about native histograms. It’s a big step forward, and I think storing the really high-definition histograms long-term is a relatively unique approach that Prometheus has taken. My understanding is that most other systems, while maybe using high-definition histograms for transport, effectively sample them at ingestion and only store the resulting percentiles. But we store the full raw histogram in a very, very efficient way, forever. And so you can go in ad hoc in the future and ask arbitrary questions like, "What was the performance on this day, at this percentile?"
The other thing I've really loved about the native histograms in production is we've actually learned a ton about the behavior of our services, which was previously hidden behind super over-aggregated data. You're all going to cringe at this analogy, but it’s like going from standard definition to high-definition TV. Now I'm actually seeing bimodalities where there's a group of requests that are 30 milliseconds and a group of requests that are 300 milliseconds. And I never knew that about these services before. I think it's fascinating.
Julius: The only thing I would add is that not only is it easier to configure and higher resolution, it's also way, way cheaper at the same time. It's way more efficient than the old histograms, because the whole histogram sample can be stored in one time series vs. many — like, one for each bucket. And then the encoding of that sample is a very efficient binary format. It's all very well managed and very efficient, and so it's just better on every dimension.
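To illustrate the configuration difference RichiH and Julius are describing, here is a hedged sketch using the Go client library, client_golang; the metric names, bucket boundaries, and growth factor are hypothetical. A classic histogram needs explicit bucket boundaries chosen up front, while a native histogram only needs a bucket growth factor and picks its buckets on its own.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Classic histogram: you have to guess useful bucket boundaries up front,
// and each boundary becomes its own time series. Names and values here are hypothetical.
var classicLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "myapp_request_duration_seconds",
	Help:    "Request latency as a classic histogram.",
	Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5},
})

// Native histogram: set a bucket growth factor (1.1 means neighboring bucket
// boundaries differ by at most roughly 10%) and the sparse, exponential buckets
// are chosen automatically as observations arrive.
var nativeLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:                        "myapp_request_duration_seconds_native",
	Help:                        "Request latency as a native histogram.",
	NativeHistogramBucketFactor: 1.1,
})

func main() {
	prometheus.MustRegister(classicLatency, nativeLatency)
	classicLatency.Observe(0.042)
	nativeLatency.Observe(0.042)

	// Querying the native histogram later needs no `le` label juggling, e.g. in PromQL:
	//   histogram_quantile(0.99, sum(rate(myapp_request_duration_seconds_native[5m])))
}
```

Note that, around the 3.0 release, native histogram ingestion in Prometheus is still gated behind a feature flag and scraped via the protobuf format, so check the current documentation for the exact setup.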
What’s next?
Tom: So tell me what's coming. When can we expect Prometheus 4.0?
Julius: I don't think we know yet. We want to be pragmatic about it, and we want to strike a balance between bumping the major version every year or so and waiting another seven years. I think that was maybe a bit too long. But whenever enough features have accumulated that would necessitate breaking something, we'll get together, discuss it, and say, "Oh yeah, it's worth cutting a 4.0."
Tom: I know one of the things the team's working on a lot is a new governance structure. What's the idea there, Richi?
RichiH: So that's a thing where we learned from OpenTelemetry. OpenTelemetry has a very, very open and inclusive governance, where the contribution threshold to become a member with standing — or a voting member, in OpenTelemetry's terminology — is very low. The effect is that, basically, you can much more easily justify to your manager or your spouse why you're investing time in this thing, because you have a stamp of official approval. So the likelihood of people sticking around and contributing more over time is higher.
Prometheus has historically had a pretty high bar. We've already lowered it quite substantially: basically, anyone who is a maintainer of anything is already a Prometheus team member. That is the main goal behind the governance change. We want to really broaden who can call themselves a member of Prometheus and widen the contributor base massively.
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com.