The future of Prometheus and its ecosystem
During PromCon, I gave a talk titled “The Future of Prometheus and its Ecosystem.” I want to share the key highlights with you.
Since PromCon 2019, we have seen quite a bit of movement in Prometheus.
2.14 saw us introduce a new React-based UI. While it has not yet reached feature parity with the current UI, it will finally enable us to work on and improve the UI once again.
2.15 is largely defined by the metadata API. The Prometheus exposition format has allowed HELP, TYPE, etc. for ages, but Prometheus used to simply discard that data. Now that we keep it, we also expose it through an API; Grafana, for example, is already taking advantage of this to give users extra context about the time series they're working with.
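To make this concrete, here is a minimal sketch of where that metadata lives in the exposition format: HELP and TYPE ride along as comment lines above the samples, and this is the data Prometheus used to throw away. The metric name and label values below are made up for illustration; this is not Prometheus's own parser, just a toy one.

```python
import re

# Hypothetical scrape output; HELP and TYPE are the metadata lines.
EXPOSITION = """\
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get"} 1027
"""

def parse_metadata(text):
    """Collect HELP and TYPE comments, keyed by metric name."""
    meta = {}
    for line in text.splitlines():
        m = re.match(r"# (HELP|TYPE) (\S+) (.+)", line)
        if m:
            kind, name, value = m.groups()
            meta.setdefault(name, {})[kind] = value
    return meta

print(parse_metadata(EXPOSITION))
# → {'http_requests_total': {'HELP': 'Total number of HTTP requests handled.',
#                            'TYPE': 'counter'}}
```

Keeping this mapping around, instead of discarding it at scrape time, is exactly what lets an API hand tools like Grafana the type and description of a series.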
2.16 largely centered on stability and various improvements, though both the ability to select non-UTC time zones in the UI and the query log feature had been requested since 2016; it was nice to finally close those.
2.18 improved tracing and added support for exemplars in the exposition format. Exemplars are the first user-visible effect of OpenMetrics on Prometheus. Combine the two with Grafana, and you can jump from a spike in a metric straight to a corresponding trace.
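A quick sketch of how an exemplar rides along on an OpenMetrics sample line: after the value, " # " introduces a small labelset (typically carrying a trace ID), the exemplar's own value, and optionally a timestamp. The metric name and trace ID below are invented; this is a toy splitter, not OpenMetrics's grammar in full.

```python
import re

# Hypothetical OpenMetrics line with an attached exemplar.
LINE = ('http_request_duration_seconds_bucket{le="0.25"} 751 '
        '# {trace_id="KOO5S4vxi0o"} 0.148')

def split_exemplar(line):
    """Split a line into the sample part, exemplar labels, and exemplar value."""
    sample, _, exemplar = line.partition(" # ")
    labels = dict(re.findall(r'(\w+)="([^"]*)"', exemplar))
    value = float(exemplar.rsplit(" ", 1)[-1])
    return sample, labels, value

sample, labels, value = split_exemplar(LINE)
print(labels["trace_id"], value)  # → KOO5S4vxi0o 0.148
```

That embedded trace ID is the hook a frontend needs to offer a "jump to trace" link next to the data point.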
2.19 reduced memory footprint by a cool 50%. It's both exciting and scary that Prometheus is already so efficient yet still has so much optimization potential. I think this graph captures it beautifully.
2.20 has the longest changelog of any release since 2.6(!). The highlight feature is probably our native support for service discovery for Docker Swarm and DigitalOcean. But what’s more important than the fact that we support two random SD mechanisms is that we’re uncrusting parts of Prometheus and taking a fresh look at many old decisions and consensus positions. The world has changed – we arguably changed the world – and we need to take this into account, here and in other places.
Wrapping up the recap, node_exporter has reached 1.0 and includes EXPERIMENTAL TLS support. The Cloud Native Computing Foundation sponsored an audit by Cure53 of node_exporter in general and our TLS implementation in particular. This carries twice the bang for the buck: not only does the TLS code get audited before we copy it to other exporters, but node_exporter also tends to be our default experimentation exporter, from which we copy other patterns as well.
I sometimes get the feeling that we, as a project, are resting on our laurels. Some time ago, I started a Grafana-internal brainstorming with the motto “Prometheus is not feature-complete” and asked Red Hat to do the same. We tossed this into a Prometheus-team-wide, widely-scoped brainstorming document about ALL the things. It serves as a basis for discussion with specific topics broken into dev summit discussion points once they’re ripe.
Last year, we had two dev summits: one after KubeCon EU, one after PromCon. I planned to do the same in 2020, but COVID hit. We hadn't had one this year yet, and I think we found a solution: shorter, more frequent, and virtual. We are doing 4-hour blocks of dev summit instead of 1-2 full days all at once. Our first dev summit was on 2020-07-10, and the next one will most likely be on 2020-08-07. We will keep doing them until our backlog is empty; and the backlog keeps growing as we add more items from the brainstorming doc.
I want to do two things now:
First, I want to walk you through what we agreed on. I don't think it would be particularly useful, or fair, to walk through what we're still brainstorming on. The topics below appear in the order we discussed them, which was determined by quickly voting on everything and talking about what got the most votes first.
And second, I want to share a bit of background on how I run those meetings and what some of the underlying considerations are; as a matter of fact, I will probably write a blog post or two about chairing working sessions soon.
Metadata

Metadata is starting to be useful in Prometheus (see 2.15 above). We need to do more with metadata, e.g., propagate it through remote read/write. Something not covered by the consensus below is the interesting question of what to do if metadata changes or flaps, or if it's abused as an attack vector.
CONSENSUS: We want to support more metadata. Work will be done in [design doc](https://docs.google.com/document/d/1XiZePSjwU4X5iaIgCIvLzljJzl8lRAdvputuborUcaQ/edit). CONSENSUS: https://github.com/prometheus/prometheus/pull/6815 is acceptable as an EXPERIMENTAL stopgap. It is likely to change for 3.x.
Workflow changes and s/master/main/
The janitorial stuff around our workflows is self-explanatory, but the verbiage cleanup might not be. We're serious about spearheading verbiage cleanup; while it's not the most important thing we can do, it is something we can do. We're waiting for GitHub to provide tooling, and once that's available, we will try to offer the work to a paid intern through Community Bridge. I am talking with the Linux Foundation and CNCF about potentially doing this across all projects. Between touching a myriad of projects, working with a lot of different tooling, and getting to know a lot of people, it's an awesome opportunity for someone trying to get into the space. Contact me if you're interested: @TwitchiH's DMs are open, or richih at thiswebsite.com.
CONSENSUS: Set “Require status checks to pass before merging” (see figure above) on all prometheus/… repos. The open questions were whether to prevent direct pushes and force pushes to the main branch. CONSENSUS: Disable force-pushes to all main branches. CONSENSUS: Default behavior is to allow pushes to the main branch, but it should be disabled for certain “important” repos, e.g., prometheus/prometheus, at the discretion of the maintainer.
Backfilling

This is one of the oldest feature requests and a good example of how to carefully approach consensus. There are a lot of different opinions within the Prometheus team on this one, and it's hard to find common ground. The way I approached this was to write down a limited and very specific consensus position: “We want to support backfilling over the network at least via blocks which do not overlap with the head block” carries three qualifiers. After a lot of back and forth, and my trying to find a consensus position again and again, it was clear that we couldn't find one immediately. So I made a call for a non-consensus, which was: “We want to support backfilling over the network at least via streams which do not overlap with the head block”. Only by forcing everyone to voice an explicit opinion on this one were we able to reach the final position: “We would like to support backfilling over the network via blocks which do overlap with the head block given proper design”. Every single one of those words was chosen very carefully to express the precise extent, and limits, of the consensus.
CONSENSUS: We want to support Prometheus text exposition format and OpenMetrics, and one well-defined CSV format for importing data. CONSENSUS: We want to support backfilling over the network at least via blocks which do not overlap with the head block. NOT CONSENSUS: We want to support backfilling over the network at least via streams which do not overlap with the head block. CONSENSUS: We would like to support backfilling over the network via blocks which do overlap with the head block given proper design.
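The “overlap with the head block” qualifier above is easy to picture: a block covers a time range, and a backfilled block is only uncontroversial if its range ends before the currently open head block's range begins. Here is an illustrative sketch under that assumption; the type, field names, and timestamps are all made up for illustration, not Prometheus's actual TSDB structures.

```python
from typing import NamedTuple

class Block(NamedTuple):
    min_t: int  # inclusive start, milliseconds since epoch
    max_t: int  # exclusive end

def overlaps_head(block: Block, head: Block) -> bool:
    """True if the candidate block's time range intersects the head block's."""
    return block.min_t < head.max_t and head.min_t < block.max_t

head = Block(min_t=1_000, max_t=2_000)  # the currently open head block
old = Block(min_t=0, max_t=1_000)       # ends where the head begins: safe
late = Block(min_t=900, max_t=1_100)    # reaches into the head: the contested case

print(overlaps_head(old, head), overlaps_head(late, head))  # → False True
```

The consensus above covers the `False` case; the `True` case is the one that is only wanted “given proper design”.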
Vendoring

Another one of the more janitorial topics. As a side note, this is one of my criticisms of Go: it was designed in a world in which a single mono-repo is the norm. Google keeps all, or most, of its general code in one repository, which has a lot of advantages but does not translate well into the rest of the world. Go is slowly but surely moving away from this legacy.
But there’s another thing here. I wrote down the consensus position below almost at the start of the discussion. It was clear that we would at least try it. It was clear that Ben Kochie would volunteer. And it was clear that we would try it in node_exporter. We usually try workflow improvements, it’s always Ben who volunteers, and node_exporter is the template we copy into other exporters. Yet, it was important to let the discussion run for some time and let people come to this conclusion on their own instead of putting a finished position in front of them.
CONSENSUS: Remove it in node_exporter, see if we like it.
Mailing lists & IRC
Google is blocked in China, and our mailing lists run through Google. We agreed to try and make sure people can subscribe via email. I tested this afterwards and firstname.lastname@example.org works. https://email@example.com/maillist.html for archives if you need them.
Plus, we made sure that you can use IRC through Matrix as some of us violate xkcd 1782.
CONSENSUS: See if it is possible to subscribe to our MLs without a Google account and enable this; still require subscription for posting. Link to public ML archive in our docs/community section, and explain how to subscribe without Google account. Disable identification requirements in IRC channels to make using Matrix easier; revert if there are any attacks.
Documentation

Let me quote myself from the dev summit: “The first thing I disliked about Prometheus in 2015 was the documentation; in 2020, I despise it. It's not human-friendly and almost useless if you don't already know what you're doing, or at least have all the concepts down.” Long story short, we will work to improve this:
CONSENSUS: We will restructure the documentation into three sections: a user manual (the default, with a low threshold for contribution); a reference (what we currently have, with PromQL examples of input/output); and guides (preferably on a modular basis). We will ask Diana Payton for a style guide, overall concept, etc. for all documentation. We will try not to break links unless it really makes sense.
And if you would like to work on this, we’re currently looking for a tech writer for Grafana Cloud who would also be working on the official, upstream Prometheus documentation. We do take our community commitments seriously, after all.