How eBPF makes observability awesome
You might not think that a coffee concentration meter, the Hubble telescope, or bees have anything to do with each other — let alone observability — but all that means is you probably haven’t listened to the latest episode of “Grafana’s Big Tent.”
All of those subjects came up during a conversation about eBPF, a technology that allows you to attach your own programs to different points of the Linux kernel. It’s also the basis for Grafana Beyla, an open source eBPF-based auto-instrumentation tool that helps you easily get started with application observability for Go, C/C++, Rust, Python, Ruby, Java, NodeJS, .NET, and more.
Leading the podcast discussion are co-hosts Tom Wilkie, Grafana Labs CTO, and Matt Toback, Grafana Labs VP of Culture, who are joined by Mario Macías, Grafana Labs Senior Software Engineer, and Liz Rice, Chief Open Source Officer at Isovalent, a leader in open source cloud native networking and security.
You can read some of the show’s highlights below, but listen to the full episode to find out more about eBPF, including what it shouldn’t be used for, how using eBPF tools impacts performance, eBPF for Windows, and even what eBPF has to do with Conway’s Game of Life.
Note: The following are highlights from episode 6, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
Intro to eBPF and kernel basics
Tom Wilkie: We are here to talk about eBPF — in particular how eBPF makes observability awesome. I’m not sure how familiar our audience is actually going to be with the term eBPF, so Liz, what is eBPF? Why is it useful?
Liz Rice: The letters used to stand for Extended Berkeley Packet Filter, but it does so much more than packet filtering, so officially it doesn’t stand for anything anymore. What it allows us to do is dynamically load programs into the kernel to change the way the kernel behaves.
The kernel is the privileged part of the operating system that can interface with hardware. So your applications are actually asking the kernel to do things on their behalf, and the kernel’s also coordinating all your various different applications that might be running at the same time. The kernel is involved in everything that your applications are doing, and that means if we can add instrumentation into it — which we can, using eBPF — we can be in a really powerful place to observe and even affect what’s actually happening across the whole system and all of our applications, if we so choose.
Matt Toback: What did the world look like before?
Liz: You had a couple of choices. You could make bespoke changes to the kernel and build your own kernel — and that’s not an exercise for the faint-hearted. Or, you could try to get your changes accepted into the upstream kernel — also probably not an exercise for the faint-hearted, because Linux is the most widely used operating system on the planet. If you want something very bespoke, you’d have to persuade the Linux community that your change is good for everybody, not just for you.
Your other option would be a kernel module, and that’s a completely legitimate choice, a way of having a custom extension to the kernel. The only problem is that when the kernel crashes, it brings down your whole machine. There is no rescuing a crashed kernel. All software has bugs, and if there’s a bug in the kernel module, it could easily crash the kernel. A lot of people have shied away from using them or using other people’s third-party kernel modules because of the risk that it’s going to crash your machine.
We all know what crashed kernels look like, because we all saw the blue screen of death plastered across airports and train stations worldwide recently. That was Windows, but it’s the same principle. Crashing the kernel is a bad day for a computer.
The difference with eBPF is the programs that we load into the kernel go through this verification process to make sure that they are safe to run and won’t crash the kernel. And that’s the big step forward between kernel module programming and eBPF.
Examining use cases
Tom: One of my pet peeves about eBPF, or Rust, or blockchain, or AI/ML is people are focusing on the technology and not on what problems it solves. What are people using eBPF for?
Mario Macías: When you inject a program into the kernel and can see the memory of the kernel at a given state — or even of the network stack or of an application — then you can get runtime information that might help you provide visibility into what’s going on inside the kernel or your application. This is for observability, but also for security. You can subscribe to multiple events in your system and then create a trace log.
More recently, there are people suggesting that eBPF could be used for hot patching of your system so you can fix security or stability issues.
Liz: There are a couple of cases of that. One is the idea of a packet of death. Maybe there’s a bug in the kernel where a network packet formed in a particular way — with an incorrect buffer length, say — will crash the kernel if someone crafts a packet like that. There were mitigations distributed in the form of eBPF programs that can look at each packet before it gets processed by the stack, so a packet of death can be discarded.
I also believe there were some mitigations around some of the Spectre-related vulnerabilities that could be distributed in eBPF form.
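The packet-of-death mitigation Liz describes boils down to validating a packet before the kernel stack ever processes it. Real mitigations are eBPF programs attached at hooks like XDP; the Go sketch below models the same check in userspace, using an invented 4-byte header layout purely for illustration: drop any packet whose declared payload length doesn’t fit its buffer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Verdicts loosely mirroring the XDP return codes XDP_DROP and XDP_PASS.
const (
	Drop = iota
	Pass
)

// checkPacket illustrates the kind of sanity check an eBPF program can
// perform before the kernel stack sees a packet. Hypothetical format: a
// 4-byte header whose last two bytes declare the payload length. If the
// declared length overruns the actual buffer, this is the malformed
// "packet of death" and it gets dropped at the edge.
func checkPacket(buf []byte) int {
	const hdrLen = 4
	if len(buf) < hdrLen {
		return Drop // too short to even carry the header
	}
	declared := int(binary.BigEndian.Uint16(buf[2:4]))
	if declared > len(buf)-hdrLen {
		return Drop // declared payload length overruns the buffer
	}
	return Pass
}

func main() {
	good := []byte{0x01, 0x00, 0x00, 0x02, 0xaa, 0xbb} // declares 2 payload bytes, has 2
	bad := []byte{0x01, 0x00, 0xff, 0xff, 0xaa}        // declares 65535 payload bytes
	fmt.Println(checkPacket(good) == Pass) // true
	fmt.Println(checkPacket(bad) == Drop)  // true
}
```

The point of doing this in eBPF rather than in an application is placement: the check runs before the buggy kernel code path is reached, so the bad packet never triggers the crash.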
Matt: As eBPF has grown in popularity, has there also been a growing amount of folks that are using this as an attack vector?
Liz: Yes. Everything is an arms race, particularly in the security world. You can do all sorts of incredibly powerful things with eBPF, and if you are a bad person, those could be malicious things. Don’t run any old eBPF that you just pull off the internet without checking where it came from.
The benefits of eBPF
Tom: Why would I build a network in eBPF? Why wouldn’t I just use the capabilities that exist in the Linux kernel?
Liz: A lot of it is about performance and scalability, particularly when I talk about cloud native and container networking. One of the things that we can do with eBPF is essentially short-circuit some of the networking paths to miss out bits that really don’t need to be run.
For example, if you’ve got a container, it’s typically got its own network namespace and it’s running in a host that also has its own network namespace. A packet coming into that machine through the ethernet connection is going to traverse the networking stack traditionally in the host namespace, then go across a virtual ethernet connection into the container net space and then get processed again. Essentially, you can end up with duplicate processing.
With eBPF, we can be smart about only doing that processing once. In fact, there’s a new capability called netkit, where we can now achieve zero-overhead networking, essentially bypassing all of the excess network processing and getting exactly the same networking performance out of a container as you would from just running an application on the host — which is quite a significant jump in performance over traditional container networking.
While we are processing a packet and deciding where to send it, we can also apply things like security policies to those packets, so it’s super high-performance.
eBPF at Grafana Labs and beyond
Tom: What are we doing with eBPF at Grafana Labs?
Mario: We have a couple of products using eBPF. One is Grafana Beyla, which attaches eBPF programs at different levels of the operating system, including the network stack, and is able to instrument your running applications and services in different protocols — both servers and clients — and provide application-level metrics like requests per second, errors, duration, and so on. In the case of HTTP, it’s also able to decorate that information with the path, the request size, the return code, and so on.
The advantage of that is that you don’t need to modify your applications or reconfigure them or redeploy them. You have your running applications, you deploy Grafana Beyla on the host, and automatically — thanks to eBPF — those small eBPF programs, deployed at different levels, will hook onto different application events and provide that information.
Also, Grafana has another tool named Pyroscope, which is able to profile the performance of your application by function using flame graphs.
Tom: To make sure I understand correctly, Beyla is like an auto instrumentation agent written in eBPF, that you install on the machine. And then anything you run on that machine, you’re automatically going to get rich metrics and traces out of those systems without having to modify the binaries at all.
Mario: Yeah. It’s a normal application written in Go that loads those eBPF programs. Those programs in the kernel space communicate with Beyla in userspace, then Beyla reconstructs all the information and sends it.
By the way, it uses the Cilium eBPF library from Isovalent. It’s amazing. I think many other big projects are using them because it facilitates the life of eBPF developers a lot.
Tom: So Beyla’s not the only eBPF observability game in town. What are Isovalent and Cilium bringing to the observability space?
Liz: There’s a subproject within Cilium called Tetragon, which is security observability and also enforcement, optionally. So this is really focusing on observing events that are outside of some security policy if you are interested in detecting and possibly preventing suspicious looking activity.
One of the really nice things is that in Tetragon, we can filter those events within the kernel and only actually report the events that are outside of policy. For example, for a long time you’ve been able to use eBPF to, let’s say, report on somebody trying to open a file. There have been eBPF-based security tools that have then been able to take all of the file-opening events and filter them in userspace so that you can say, “Ah! Here is somebody trying to open your /etc/shadow file, and that seems like a bad idea.”
Tom: For our non-technical listeners, /etc/shadow is where your passwords are stored.
Liz: With Tetragon, rather than passing information about every single file access to see whether or not it is actually /etc/shadow, we can do that filtering inside the eBPF program in the kernel, and that makes it dramatically more efficient.
Good news for platform teams (and application developers)
Tom: I think we’ve seen a shift where application developers inside large organizations no longer have to worry about where their software runs, how it’s scheduled, how it recovers from hardware failure — all these kinds of things. With eBPF auto instrumentation, is telemetry collection and instrumentation becoming a function of the platform as well?
Liz: I think that’s absolutely true. We’ve seen that a lot over the evolution of Cilium. Isovalent’s customers are typically in platform teams, and they have networking requirements they’re going to have to solve. They might have some security problems that they or a security team need to solve between them.
You can have so much of this embedded into the underlying platform that application developers don’t really need to care; they can have a service offered to them that gives them visibility into which services are talking to which other services, or that automatically gives them encrypted connections across the network. These are benefits that just roll out automatically to all of the application development teams.
Being able to observe how your applications are behaving becomes a service that the platform team offers to the development teams. I think that’s a very common model.
Tom: I guess the good thing about this is it also becomes more consistent across an organization, instead of every application team picking their own vendor and their own technology for understanding their systems. In these large organizations where we’ve all gotta get along and we’ve all gotta talk to each other — and that goes for our software as well — having a common observability, networking, and security standard across teams is kind of useful.
Liz: And if you are only instrumenting in the kernel, the application teams don’t even need to do any work. It’s kind of free, essentially, for them.
Looking ahead
Tom: What does the future hold for Beyla, and what does the future hold for Cilium and eBPF?
Mario: For Beyla, we have a few main lines of future direction. One is to make it even more automatic to get instant instrumentation. It’s already pretty simple to deploy, but we still want to make it simpler for some users. We would also like to get better information about some languages and protocols. We already support some protocols like Kafka, but we would like to add other message queues, other databases, and so on.
A third line of work is to let Beyla run with even lower privileges — or at least to let users say, “Okay, I’ll drop some capabilities even if I lose some functionality,” because there are some sysadmins who are still reluctant to grant programs admin capabilities.
Liz: With eBPF, there’s quite a lot of work going on at the moment around things like the verifier, the threat model, the security model, auditing the verifier, and so on. I think it’s getting quite a lot of attention, particularly now that people are seeing the dangers of tooling that directly accesses an operating system kernel. A lot of us in the eBPF community are saying, “eBPF is a great solution to that,” but we probably need more research and security analysis to really convince people that that’s the case.
One thing we haven’t mentioned, and I probably should, is that Isovalent was acquired by Cisco earlier this year. Cisco has been incredibly supportive of our team continuing to invest time in Cilium. Cilium is a Cloud Native Computing Foundation project. Because it’s a foundation-owned, community-run project, it’s no longer Isovalent’s alone, but we do invest a lot of resources into it. So the fact that Cisco is a hundred percent behind us continuing to do that is fantastic for all of us and it’s fantastic for the future of the project.
So Cilium will carry on moving from strength to strength. We’re very much looking at things like how you seamlessly network between your cloud native workloads and workloads that have been running in your on-prem environments for however many years or decades. You want to be able to access those and communicate between those and your newfangled Kubernetes workloads, so that’s quite a big area of focus for us right now — and obviously Tetragon and the security that we’re providing with that. One of the ways in which that technology is being used is in a product that Cisco is building called Hypershield, which takes eBPF, some fancy AI things, and some fancy dual data plane things.
Tom: We managed to go almost the entire podcast without mentioning AI.
Liz: Sorry, I failed at the last hurdle. AI had to be mentioned! [laughs]
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.