We’ve been experimenting with new ways to use and operate Prometheus over the past year. Every successful Grafana Agent experiment turns into an upstream contribution for the whole Prometheus community to benefit from.
In this blog post, I go over the history of the Agent’s successful — and not so successful — experiments.
Brief history of Grafana Agent
The initial version of Grafana Agent that we announced in March 2020 was pretty minimal. We essentially just took Prometheus and removed the TSDB storage, which is something you may not have wanted to use if you primarily consumed your metrics through the remote write endpoint. But that was just one-third of what we ultimately wanted to build: a single Agent for all of Grafana Labs' opinionated observability stack.
We were able to complete that goal across the remainder of 2020. In September, we released v0.6.0 with support for Grafana Loki logs, and a month later we released v0.7.0 with support for Grafana Tempo traces.
Over time, the Grafana Agent turned into a lot of glue code that tied a bunch of different projects together. Even so, we wanted to see what we could build on top of that glue, so we’ve been layering experimental features onto it.
Experimenting with Prometheus
I’ve always been interested in the idea of being able to run an Agent on every single machine in my environment and have it seamlessly collect logs, metrics, and traces without having to think about sharding or scaling.
Our first experimental feature was an attempt at this vision, and it was present all the way back in the initial release: “host filtering mode.” Host filtering mode made each Agent filter out any target that wasn’t running on the same machine as that Agent. This allowed you to run one Agent per machine, as long as every Agent had the same config file. Unfortunately, this required every Agent to perform the same service discovery. That fell apart quickly in large Kubernetes clusters, since Agents would start to overwhelm the API with requests to watch resources.
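To make the trade-off concrete, here is a minimal Python sketch of the host filtering idea. It is an illustration only, not the Agent's actual implementation: the `Target` type, the `host_filter` function, and the node names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A discovered scrape target (hypothetical type for illustration)."""
    address: str
    node: str  # the machine the target runs on


def host_filter(targets, local_node):
    """Keep only the targets running on the same machine as this Agent."""
    return [t for t in targets if t.node == local_node]


# Every Agent discovers the full target list, then discards most of it.
# That repeated discovery is the cost described above.
discovered = [
    Target("10.0.0.1:9100", "node-a"),
    Target("10.0.0.2:9100", "node-b"),
    Target("10.0.0.3:8080", "node-a"),
]

local_targets = host_filter(discovered, "node-a")
```

The filter itself is cheap. The expensive part is that `discovered` has to be built independently by every Agent in the cluster.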
Two months later, we swung the sharding pendulum in the other direction, exploring a “scraping service mode” in May’s v0.3.0 release. The scraping service mode allows you to cluster your Agents and have them distribute discovery and scrape load by reading configuration files from a configuration store.
Since Agents were only performing service discovery for configuration files they took ownership of, this fixed the service discovery problem. But it introduced a new problem: If one of your configuration files discovered one million targets, and another file only discovered one target, the million targets would be assigned to a single machine. This was due to the dependence on static sharding of config files instead of runtime sharding of discovered targets.
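The imbalance can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the Agent assigns work: the `owner` function uses simple hash-modulo assignment as a stand-in, the agent and file names are made up, and the large file has 10,000 targets rather than one million to keep the example quick.

```python
import hashlib
from collections import Counter


def owner(key, agents):
    """Pick an agent for a key using a stable hash (stand-in for real sharding)."""
    digest = hashlib.sha256(key.encode()).digest()
    return agents[int.from_bytes(digest[:8], "big") % len(agents)]


agents = ["agent-0", "agent-1", "agent-2"]

# Hypothetical config files and the targets each one discovers.
configs = {
    "big.yaml": [f"big-target-{i}" for i in range(10_000)],
    "small.yaml": ["small-target-0"],
}

# Static sharding: a whole config file, and every target it discovers,
# lands on a single agent.
static_load = Counter()
for name, targets in configs.items():
    static_load[owner(name, agents)] += len(targets)

# Runtime sharding: each discovered target is assigned on its own.
runtime_load = Counter()
for targets in configs.values():
    for target in targets:
        runtime_load[owner(target, agents)] += 1
```

With static sharding, whichever agent owns `big.yaml` carries at least 10,000 targets. With per-target assignment, the same targets spread roughly evenly across all three agents.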
We’re still working on this experiment, and we will keep refining it over time. We’ve recently published a design document for distributing scrape load amongst clustered Agents and another resource for making the configuration store optional. These are two proposals that I believe, once combined, will make it significantly easier to distribute scrape load without having to think about how many Agents you run or how to partition your Prometheus servers.
We’ve explored some other features, too. In June’s v0.4.0 release, we introduced the idea of “integrations,” which bundle in the most common Prometheus exporters, so you can get started with observability faster. And in January’s v0.10.0 release, we introduced support for Amazon’s Signature Version 4 (SigV4) signing process, which allows the Grafana Agent to write to Amazon Managed Service for Prometheus without the need for a proxy.
Upstreaming successful experiments
You could say that the entire concept of a remote write-only Prometheus was the original experiment. We had a pretty good idea that users of Grafana Cloud — our observability platform integrating metrics, logs, and traces with Grafana, managed as a service — didn’t want to use Prometheus if they were just sending all of their data to us, but we had no idea how popular it would be outside of that. The Agent concept turned out to be a success, and users of open source projects like Cortex found valid use cases for it. The demand grew so much that on December 17, 2020, the Prometheus team agreed to cover Agent-like functionality in the official project.
This was our first opportunity to take our experiments and make them official. One month later, I proposed kick-starting the Prometheus Agent by contributing our code upstream, which we’ve received consensus for and are planning to do over the next few months.
SigV4 was another one of our smaller experiments, and it proved solid pretty quickly. In January, the Prometheus team also agreed to experiment with SigV4 as part of the official project. Since this is a smaller piece of code, we contributed this one first, and the PR for that is currently in review.
Future of Grafana Agent and Prometheus Agent
Over time, I’ve grown to really like the model of doing extremely experimental things downstream. The Grafana Agent user base is always going to be much smaller than the Prometheus user base, which gives us a less-populated sandbox to play in as we try new things. With fewer use cases to consider, we can throw feature spaghetti at a wall and contribute only the finest, wall-sticking spaghetti upstream. We have the ability to iterate quickly, learn from our mistakes, and build up a case for features that the broader community may be able to benefit from.
Parts of Prometheus are used as libraries by more and more projects and vendors, so Prometheus tends to be more conservative about the inner workings of its codebase. In contrast, the Grafana Agent is focused on delivering stable solutions for customer requirements as soon as possible. This setup benefits everyone: Users of the Grafana Agent get stable features immediately, while users of the Prometheus Agent get only features that they can vendor in as a library.
As an example, the Grafana Agent has included several exporters and their configurations as stable features since June 2020, whereas Prometheus published design documents for feature parity this March.
I’m thrilled to keep playing around with new things, empowering users to easily collect their metrics, and contributing everything that works to Prometheus. If you want to help us with these experiments, we’re hiring!
Thanks for reading!