Why we created a Prometheus Agent mode from the Grafana Agent
This feature has a fairly long history:
- On 2020-03-18, we announced Grafana Agent, a subset of Prometheus that didn’t need the TSDB for sending metrics with remote write.
- On 2020-12-17, the Prometheus team held a dev summit where they agreed to cover Grafana Agent-like functionality in the official Prometheus project.
- On 2021-01-27, I published a proposal for how we could do this by moving Grafana Agent code upstream.
- On 2021-05-05, the initial PR to move Grafana Agent code upstream was opened.
- On 2021-10-29, that PR was merged into Prometheus’s main branch.
It took a little while to get to where we are today, but I’m thrilled that we were able to use the Grafana Agent code to enable agent-like functionality in the prometheus/prometheus repository. In his announcement blog post, Prometheus maintainer Bartłomiej Płotka wrote that Agent mode enables easier horizontal scalability for ingestion, making it “a game changer for certain deployments in the CNCF ecosystem.”
Grafana Agent had a single goal when we first announced it in 2020: to become a cloud-optimized subset of Prometheus that used the same battle-tested Prometheus code. This worked by trimming down Prometheus into an “agent mode” that reuses all Prometheus code except anything that relies on the TSDB.
This gave it a few unique trade-offs. You use less memory and storage when sending metrics to a Prometheus-compatible remote write endpoint, but you lose the local features that rely on the TSDB, like querying, recording rules, and alerting.
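To make the trade-off concrete, here is a minimal configuration sketch for the upstream Prometheus Agent mode. It assumes Prometheus is started with the `--enable-feature=agent` flag; the job name, scrape target, and remote write URL are placeholders chosen for illustration, not values from this post.

```yaml
# prometheus.yml: a minimal sketch for running Prometheus in agent mode.
# Assumed start command:
#   prometheus --enable-feature=agent --config.file=prometheus.yml
global:
  scrape_interval: 15s

# Scraping works the same way as in a regular Prometheus server.
scrape_configs:
  - job_name: node                      # hypothetical job name
    static_configs:
      - targets: ["localhost:9100"]     # hypothetical scrape target

# Samples are forwarded to a remote endpoint instead of being
# stored and queried locally.
remote_write:
  - url: https://metrics.example.com/api/v1/write   # placeholder endpoint
```

Note what is missing: there are no `rule_files` or `alerting` sections, because recording rules and alerts depend on the local TSDB. In this setup they would be evaluated by the system behind the remote write endpoint.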
Nineteen releases later, a lot has changed! We’ve since added support for embedded Prometheus exporters called integrations, a scalable mode for operators hosting the Grafana Agent as a service, support for Grafana Loki-compatible logs, and finally OTLP traces support for Grafana Tempo.
Today, I would say Grafana Agent now has three primary goals as a project:
- Be the best companion to an opinionated open source telemetry stack: Prometheus for metrics, Grafana Loki for logs, and Grafana Tempo for traces. These three projects have a lot in common and play really well together.
- Enable new use cases (ways in which someone would use Grafana Agent to forward telemetry data) that empower the first goal. Both host filtering mode and the concept of having a Prometheus without storage are examples of this.
- Share proven use cases with the broader ecosystem to strengthen open source observability as a whole. We did this when we shared our SigV4 support with Prometheus, and Prometheus Agent is yet another example.
I won’t pretend that the first goal isn’t vague. “Best” is subjective, though I consider it to be measured by a combination of performance, compatibility, and enabled use cases. The enabled use cases carry over into the second goal, which is also where I think the project can be the most exciting for users, maintainers, and contributors.
Of course, we’re not claiming to be masterminds of innovation and bravely tackling use cases that nobody else is supporting. Projects like the OpenTelemetry Collector are simultaneously supporting a similar Prometheus Agent use case. Based on this, you might be wondering if it makes sense for two similar solutions to exist in parallel.
I would argue that open source enables something I’d like to call “asynchronous collaboration.” The idea is that multiple projects can work together by exploring multiple approaches in parallel. This parallelization eventually leads to stronger solutions that benefit everyone. Asynchronous collaboration is more commonly known as “competition,” but I’d like to give it its own term here to highlight its benefits to the observability ecosystem.
Asynchronous collaboration lets projects home in on a problem from multiple perspectives, allowing for faster iteration toward a more mature approach. Right now, Grafana Agent and OpenTelemetry Collector make different trade-offs, and it’s too early to know if there’s a future where they converge. As long as they exist in parallel, the two solutions can iterate independently and learn lessons from each other.
This type of collaboration is something that can only be done with open source and open standards, and I strongly believe it leads to a better future compared to starting with a singular approach today.
When to contribute?
Of course, asynchronous collaboration only holds value if there is an intent to eventually end the cycle, whether through code donation or by merging two different ideas into one new one. Once the cycle ends, both projects become stronger by supporting broader use cases across a wider community.
I believe Prometheus Agent is the largest recent example of the code donation route. We’ve proven the agent concept (i.e., the TSDB-less core of the Grafana Agent) for Prometheus over the last 20 months. With the concept proven, it was a perfect time to end the cycle and move it upstream. Moving the code specifically to the Prometheus project makes sense, given how the Grafana Agent metrics code is fully dependent on other Prometheus code.
Prometheus officially supporting the agent use case opens the door for new agent-specific features that may not have made sense before. These features could also eventually make it into Grafana Agent, strengthening both projects.
You might also imagine how ending an asynchronous collaboration cycle invites more cycles: There’s now an opportunity for both Prometheus and Grafana Agent to explore new use cases in parallel.
Now that Prometheus Agent is merged, our first step is to dedicate time to helping it become as stable as Grafana Agent has been.
Next, we want to help share the code with projects that could benefit from it. For example, Thanos (for the Thanos Ruler) currently uses a copy of the Grafana Agent metrics code and can now be changed to use the official upstream solution.
Of course, we’ll also update the Grafana Agent to use the Prometheus Agent code so there’s only one code base.
Finally, we’ll start discussions around other agent-specific use cases we’ve identified that can be shared with Prometheus Agent.
Given the tightly defined scope of Grafana Agent, I’ve always liked thinking of it as the perfect home for enabling new use cases that will one day benefit a broader audience.
I’d love to welcome anyone to contribute their ideas and use cases and help kickstart as many asynchronous collaboration cycles for the ecosystem as we can. If you want to get involved, please bring your ideas as an issue to the Grafana Agent repository, have a chat with us on the public #agent channel in our Slack, or join our monthly community calls.
Of course, Prometheus’s own community of developers deserves a shout-out here. Please visit their repository if you’re interested in contributing with your own ideas, issues, or code for Prometheus Agent. Similarly, if you’re interested in supporting OpenTelemetry’s agent use case, their repository contains a contributing guide to help you get started.