Grafana Beyla: what’s new and what’s next for the open source eBPF auto-instrumentation tool
It’s been a year since Grafana Labs announced the general availability of Grafana Beyla, our open source OpenTelemetry and Prometheus eBPF auto-instrumentation tool to help you easily get started with application observability.
As a Beyla maintainer, I wanted to take a minute to reflect on what we’ve accomplished with Grafana Beyla since then, what we have learned about supporting an eBPF tool in production, and, in general, how exciting this whole journey has been.
Looking back: what we’ve accomplished with Beyla
We’ve grown a community
Working on this project has almost felt like running a startup within Grafana Labs, but after nearly two years of that work, I really don’t see Beyla as a Grafana-only project. We have ten times more external contributors than Grafanistas, and many of those external contributions have significantly shaped Beyla as we know it today. David Ashpole, for example, contributed the patches required to run Beyla with a minimal set of permissions; Khushi Jain contributed the Beyla Helm chart, which is by far the most common way people install Beyla today; and Darek Barecki contributed the direction detection code for the network observability component.
If I had to pick one, I would say building an open source community around Beyla has been our greatest accomplishment since our 1.0 release a year ago. We owe so much to our community members for bringing fresh ideas and perspectives, and reporting bugs!
We’ve made application and service graph metrics easy
Since day one, Beyla has supported both metrics and traces, but metrics is where Beyla has really shined. With our multi-process support and language-agnostic approach to observability, Beyla has become the go-to tool for easily getting started with OpenTelemetry. From an end user’s perspective, it’s simple: add Beyla to your cluster and get immediate visibility into what’s going on with your applications.
Since our 1.0 release, which only tracked the HTTP and gRPC protocols, we’ve added support for HTTP/2, SQL, Redis, and Kafka, making Beyla a much more useful observability tool. We started tracking network and connection metrics, which allows users to build solutions for service graphs. We also worked hard to pick the right defaults, balancing ease of use against the risk of generating too many metrics.
After all this effort, it was great to see the full OpenTelemetry Demo instrumented with a single Beyla daemonset deployment earlier this year. The one Beyla instance produced service-level application metrics for all the different technologies and protocols used to implement the services in the OpenTelemetry Demo. We tested the OpenTelemetry Demo by stripping away all existing instrumentation code and monitoring services, while keeping the bare-bones uninstrumented applications talking to each other.
Below, we show a screenshot of RED (rate, error, and duration) metrics, as well as service graph metrics, generated for the OpenTelemetry Demo Checkout service directly from Beyla.
And here, we show the Asserts RCA (root cause analysis) Workbench, using data correlation between network and application metrics, generated by the single Beyla instrumentation instance.
Metrics… check. What about distributed traces?
While we made metrics easy, our distributed tracing support has remained limited, which has made for some interesting customer support conversations. Users expect getting started with distributed traces to be as easy as it is with metrics. And why not?
There are a number of eBPF quirks that make writing to user-space memory difficult, and we absolutely need the ability to write that memory for trace context propagation. This was the focus of my KubeCon North America 2024 talk, “So you want to write memory with eBPF?” Having said that, we’ve done a lot of innovation and research in this area, and we expect to provide full, generic distributed trace support in our upcoming 2.0 release. Most of the patches for generic distributed traces are already in our codebase, so very soon they’ll be on by default and fully supported.
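To make the challenge concrete, here is a minimal sketch of what context propagation looks like from the eBPF side: a uprobe that copies a previously prepared traceparent value into the traced process’s own outgoing header buffer with the bpf_probe_write_user helper. This is not Beyla’s actual probe code; the attach point, map layout, and argument handling are purely illustrative.

```c
// Illustrative sketch only -- not Beyla's probes. Compile as a libbpf-style
// program, e.g.: clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c probe.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL"; // bpf_probe_write_user is GPL-only

#define TRACEPARENT_LEN 55 // "00-" + 32 hex trace ID + "-" + 16 hex span ID + "-01"

struct tp_value {
    char buf[TRACEPARENT_LEN];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);              // pid_tgid of the in-flight request
    __type(value, struct tp_value);  // header value prepared by an earlier probe
} outgoing_ctx SEC(".maps");

// Hypothetical attach point: a user-space function whose first argument is a
// pointer to the buffer where the outgoing HTTP headers are being built.
SEC("uprobe/write_outgoing_headers")
int inject_traceparent(struct pt_regs *ctx)
{
    char *hdr_buf = (char *)PT_REGS_PARM1(ctx);
    __u64 id = bpf_get_current_pid_tgid();

    struct tp_value *tp = bpf_map_lookup_elem(&outgoing_ctx, &id);
    if (!tp)
        return 0;

    // The hard part: writing into the traced process's memory. The helper can
    // fail (for example, if the destination page is not resident), so the
    // return value must always be checked.
    if (bpf_probe_write_user(hdr_buf, tp->buf, TRACEPARENT_LEN) != 0)
        return 0;

    bpf_map_delete_elem(&outgoing_ctx, &id);
    return 0;
}
```

Even this toy version hints at why the real thing is hard: the helper is GPL-only and best-effort, the header buffer has to be located precisely in the target process, and runtimes that manage their own memory or move their stacks turn that location into a moving target.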
We’ve learned how to scale and deploy eBPF in production
Resource utilization matters in large clusters
While our 1.0 release worked, it wasn’t fully optimized in terms of resource utilization. To deploy Beyla in our own production environments at Grafana Labs, we had to do a lot of memory and processing optimizations. Deploying separate eBPF probes for every single instrumented application was never going to scale, so we had to learn how to share everything, from eBPF maps to the actual probes.
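As an illustration of what “sharing everything” means at the eBPF level, the sketch below (illustrative names and layout, not Beyla’s code) declares a single ring buffer pinned to the BPF filesystem, so every probe, across every instrumented process, reports events into the same buffer instead of each one allocating its own.

```c
// Illustrative sketch only: one pinned ring buffer shared by all probes.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct http_event {
    __u32 pid;
    __u32 status;
    __u64 duration_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);         // one shared 1 MiB buffer...
    __uint(pinning, LIBBPF_PIN_BY_NAME);  // ...pinned under /sys/fs/bpf, so
                                          // every program reuses the same map
} events SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int emit_event(void *ctx)
{
    struct http_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0; // buffer full: drop the event rather than block

    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->status = 0;       // filled in by real probes
    e->duration_ns = 0;  // filled in by real probes
    bpf_ringbuf_submit(e, 0);
    return 0;
}
```

A single user-space consumer can then drain one buffer for the whole node, which keeps memory overhead roughly flat as the number of instrumented applications grows.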
eBPF limitations can be tricky
Sometimes the limitations of certain eBPF features are buried in hard-to-find documentation, so we learned a couple of things the hard way: for example, that kernel return probes are limited and cannot be relied on, or that setting up user return probes on applications that can move the stack will cause those applications to crash. There are many different ways to accomplish the same thing in eBPF, and the right approach will change over time, depending on your growing feature set.
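To show where those constraints push you in practice, here is a sketch of the usual entry-only pattern (the attach points and the per-thread key are illustrative, not Beyla’s actual probes): record state when a function is entered, then close it out at a later hook the request is known to pass through, rather than trusting a kernel or user return probe.

```c
// Illustrative sketch only: avoid return probes by pairing an entry uprobe
// with a later, entry-style hook.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u64);    // pid_tgid; real code needs a smarter key for
                           // runtimes that move work between threads
    __type(value, __u64);  // request start timestamp, in ns
} request_start SEC(".maps");

// Entry probe: nothing on the target's stack is patched, so this is safe even
// for runtimes that grow or move their stacks.
SEC("uprobe/handle_request")
int handle_request_entry(void *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 now = bpf_ktime_get_ns();
    bpf_map_update_elem(&request_start, &id, &now, BPF_ANY);
    return 0;
}

// The "exit" is observed at a stable kernel hook the request passes through
// anyway (here, the write syscall), not at a kretprobe/uretprobe that might
// be silently missed or crash the process.
SEC("tracepoint/syscalls/sys_enter_write")
int close_request(void *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *start = bpf_map_lookup_elem(&request_start, &id);
    if (!start)
        return 0;

    __u64 duration_ns = bpf_ktime_get_ns() - *start;
    (void)duration_ns; // a real probe would emit this to user space
    bpf_map_delete_elem(&request_start, &id);
    return 0;
}
```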
Users care about eBPF permissions
eBPF has a bad reputation in certain circles because of the overall perception that it requires super user privileges, privileged containers, or the equivalent of system administrator access. This hasn’t been true for a while; however, sometimes perceptions are difficult to change.
While building eBPF tools, it’s easy to simply run everything as a privileged container or with super user privileges. However, a lot of users have an issue with privileged containers in production, and rightfully so — it’s a security risk.
We had to do a lot of work to structure our code and features so that Beyla always asks for the minimal set of permissions, and gracefully degrades functionality if certain permissions are not granted to it.
Looking ahead: what’s next for Beyla
We are proud of the community we’ve built over the last year and have big plans for the future. Here are a few things we’re already working on.
Leveraging the OpenTelemetry eBPF Profiler
We’re thrilled to see the incredible progress made on the OpenTelemetry eBPF profiler. This new tool is the result of a collaborative effort across many contributors and represents a significant step towards establishing profiling as a core OpenTelemetry signal. It lays a strong foundation for integrating traces and profiles, allowing us to provide deeper, code-level insights without requiring manual instrumentation.
We believe that combining traces and profiles is the future of observability. Profiling delivers contextual stack traces that complement tracing, offering more precise insights than manually added spans ever could. After all, no matter how many trace points you add, there’s always one missing when it comes to pinpointing the root cause of a slow transaction.
Better support for the top five programming languages
Beyla’s strength has always been the ability to instrument applications written in any programming language, and we don’t want to shift away from that focus. However, we do intend to add better language support for the top five programming languages for web service developers (Java, .NET, Node.js, Python, and Go). This will allow us to further enrich the insights Beyla provides out of the box. For example, we hope to provide basic JVM runtime metrics without any JVM agents or logging enabled.
All in all, we’re really excited about how Grafana Beyla has evolved over the past year. Thank you to all of our community members who have contributed to the open source project — and we look forward to what’s to come!