eBPF application instrumentation for Java: challenges, design, and real-world examples

Java is one of the most widely used programming languages for enterprise applications, powering everything from monoliths to large-scale microservice architectures. Frameworks such as Spring Boot and Quarkus, together with a rich ecosystem of ORM, messaging, and communication libraries, have made application-level observability relatively straightforward through the OpenTelemetry Java agent.

However, there are many real-world scenarios where modifying application code or JVM startup parameters is not possible. The OpenTelemetry eBPF Instrumentation (OBI) project focuses on these cases, offering a powerful alternative by enabling observability without code changes or JVM configuration modifications.

In this talk, Nikola Grcevski, Principal Software Engineer at Grafana Labs and a maintainer of OBI, and Endre Sara, co-founder of Causely, explore the challenges of instrumenting Java applications using eBPF, including the diversity of JDK distributions and versions, differences in JVM internals, and the combinatorial explosion of frameworks and libraries that makes generic instrumentation difficult. The problem becomes even harder when applications communicate over TLS-encrypted protocols, such as HTTPS, gRPC, encrypted database connections, and secure messaging systems, where payloads are opaque to traditional eBPF techniques.

This session explains the design decisions and implementation details behind OBI's approach to these challenges, including how OBI correlates low-level kernel events with higher-level application semantics and how it deals with encrypted communication paths. Nikola and Endre also discuss the trade-offs involved, highlighting what information can be reliably extracted today and where limitations remain.

To ground the discussion, the talk includes several real-world examples: a Spring Boot application communicating with Keycloak over HTTPS, a Spring Boot application using gRPC to interact with Google Pub/Sub, and a Quarkus application using TLS-encrypted PostgreSQL and Kafka.

Nikola Grcevski (00:01):

Hey everyone, how are things going? Good, I see people leaving already, so.

Endre Sara (00:07):

All that Java.

Nikola Grcevski (00:08):

All that Java, nobody cares about Java. My name is Nikola Grcevski. I'm a software engineer at Grafana Labs, and I'm here with-

Endre Sara (00:16):

Endre Sara, I'm a co-founder at Causely. We are providing a causal reasoning platform on top of semantic data. If you're interested, feel free to find me afterwards, but here we are talking about-

Nikola Grcevski (00:27):

Yeah, talk about Java, yeah. And OpenTelemetry eBPF Instrumentation. Okay, let's go.

Endre Sara (00:34):

So, why are we here? I think Java is still the most popular programming language, especially in large enterprises and financial institutions, with many popular frameworks like Hibernate, various asynchronous messaging libraries, Spring Boot, and Quarkus. The OpenTelemetry Java SDK is also probably the most comprehensive and the most mature instrumentation library, but it is not always easy to get it into production. There are several things that we want to talk through: why we are talking about Java, what the specific challenges around Java are, especially diving deep into TLS-encrypted traffic, and an overview of OBI. Then we'll move on to a couple of examples that we share, so you can actually try this out if you want to.

Nikola Grcevski (01:35):

Yeah, so tell us about some of the problems that you encountered trying to instrument Java applications.

Endre Sara (01:41):

So, one of the most obvious things is how do you get the native auto-instrumentation into production. For example, with one of our large customers, as we engaged with them, within weeks they had Java instrumentation in their staging environment, and then it took several months, more than six months, to get the first production application instrumented, not because of any technical reason, but simply for operational reasons around how to change the code. And then with a number of other customers, there are a bunch of third-party applications that you simply cannot touch because you don't have the ability to rebuild them or to change the way they start. Similarly with legacy systems and older versions of JVMs. And then we're also working with a number of financial customers where injecting code into the runtime of the application is difficult, just being able to measure the risk, the impact of those things.

(02:34):

So for a number of these things, while I think it is an amazing capability to be able to use native auto instrumentation, sometimes people just simply don't have the luxury to do so.

Nikola Grcevski (02:46):

Yeah, and on top of that, there's a whole ecosystem of people running GraalVM native, where they wanna shave, I don't know, a couple of hundred milliseconds off JVM startup, so they build their applications into native from Java and the agent cannot be loaded. Okay, so let's talk about eBPF and Java. This has been a sore point for a project that I've worked on since the beginning, Grafana Beyla: Java we were never able to do well. And there's a couple of reasons for this, but primarily it has to do with how to instrument user space. In eBPF, which is a technology that allows us to build probes into an existing running Linux kernel and existing running applications, you can set user space probes. But for that to happen, you need to have a file. Everything is a file in Linux. And unfortunately for us, Java generated code sits in anonymous memory regions.

(03:48):

They're not files. So you can say, well, okay, maybe you can do some hacks such as ptrace into Java, stop it, remap its memory regions into memory-mapped files. That's doable. We've actually attempted it and it works to some extent, but Java is a dynamically compiled language, which means that run to run, Java does not have the same code. And the reason for that is because it starts by interpreting the bytecode. At this stage, it collects profiling information about what's important to optimize, and that drives the compiler, which optimizes the code in multiple stages. So your code is actually never the same from run to run on any Java application. So building instrumentation at the assembly level, essentially the binary level, would make us cast a really wide net of where we have to put these probes. And I don't want to get into the business of, oh, I ran this application once and it worked.

(04:50):

Next time I ran it, there was no instrumentation. So it was an approach we never saw as viable. So we wanted to solve primarily the TLS problem that Endre mentioned. And initially when this problem was brought up around Java, I said, it's not that important. There's not many Java servers that run TLS. And I was wrong. Not Java servers, but anytime you touch any cloud service, let's say you are getting a cloud managed service from your favorite cloud vendor, that's gonna go over TLS. So even though many people terminate TLS at load balancers, there's still quite a few people that will maybe even use Java TLS on the server, and on top of that, most client calls to external services are done through TLS. So it's a very big problem. So for a long time, we needed to solve this to be able to do this with eBPF: an environment where you can kind of just plug stuff in and don't have to modify any configuration or any environment variables or anything like that.

(05:55):

It should just work out of the box, no restarts. So this is a summary of what we ended up doing. And let's walk you through the steps of how we achieved this. This can get pretty technical, so bear with me. All right, so OpenTelemetry eBPF Instrumentation at a glance. This used to be known as Grafana Beyla, and that product still exists. We donated it to OTel about a year ago, and we now build Beyla on top of this. It uses eBPF: it installs kernel probes, network programs, and user probes to try to capture as much telemetry as possible at the kernel level, and then converts this telemetry into traces and metrics after enriching it from various signals, such as Kubernetes, now we have cloud metadata, we're enriching with host-level information and so on. You can then transform this data further by using the OpenTelemetry Collector or Alloy, if you wish.

(06:52):

And so this next picture is sort of a diagram of how this works in practice. Essentially, you install this software on your Linux host as something like a system process, which translates to something like a DaemonSet in Kubernetes. Once it plugs into all the levels of your favorite applications, it can monitor multiple at a time. And then this telemetry, once it's enriched, you can either send to an OTLP endpoint somewhere in the cloud, or enrich it further yourself and keep it local in your own on-prem databases and so on. So how does this instrumentation happen? This is what we were able to do a year ago. When I gave a similar presentation to this at GrafanaCON last year, we had this scenario. And in this case, I'm talking about a Java server that's handling HTTP, REST APIs, and at the backend is talking to Postgres.

(07:52):

So, if you will, with the instrumentation we typically do, we see the protocol show up in the kernel buffers. We read those and we can say, "Oh, that's HTTP on the incoming side." We extract the route, we extract the method, the payload, the headers, all this information, and then we have HTTP. On the backend, we see, "Oh, that protocol looks like Postgres." We parse it and extract the SQL statement. From the SQL statement, we find the operation, we find the database name and so on. This is all good. And for the last bit over here, I'm assuming a very simplistic scenario where it's the same thread that handles the incoming request and the outgoing Postgres request. In that scenario, we're able to correlate between the two, incoming and outgoing. Now, unfortunately for us, this is not a situation that most Java applications actually run in.
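As a rough illustration of the protocol sniffing described here, the sketch below classifies a plaintext buffer as either an HTTP/1.x request or a PostgreSQL simple Query message and pulls out the method and path, or the SQL operation. It is not OBI's parser: the function names (`parse_http_request`, `parse_pg_simple_query`, `classify`) are hypothetical, and it assumes only the simple `Q` message, whereas real drivers often use the extended query protocol.

```python
import struct
from typing import Optional, Tuple

HTTP_METHODS = {b"GET", b"HEAD", b"POST", b"PUT", b"PATCH", b"DELETE", b"OPTIONS"}


def parse_http_request(buf: bytes) -> Optional[Tuple[str, str]]:
    """Return (method, path) if the buffer starts with an HTTP/1.x request line."""
    line = buf.split(b"\r\n", 1)[0]
    parts = line.split(b" ")
    if len(parts) != 3 or parts[0] not in HTTP_METHODS:
        return None
    if not parts[2].startswith(b"HTTP/1."):
        return None
    path = parts[1].split(b"?", 1)[0]  # drop the query string
    return parts[0].decode("ascii"), path.decode("ascii", "replace")


def parse_pg_simple_query(buf: bytes) -> Optional[Tuple[str, str]]:
    """Return (operation, sql) for a PostgreSQL simple Query ('Q') message."""
    if len(buf) < 5 or buf[:1] != b"Q":
        return None
    # The 4-byte big-endian length counts itself but not the type byte.
    (length,) = struct.unpack("!I", buf[1:5])
    if length < 5 or len(buf) < 1 + length:
        return None
    body = buf[5 : 1 + length]
    if not body.endswith(b"\x00"):  # the query string is NUL-terminated
        return None
    sql = body[:-1].decode("utf-8", "replace").strip()
    if not sql:
        return None
    operation = sql.split(None, 1)[0].upper()
    return operation, sql


def classify(buf: bytes) -> Optional[Tuple[str, str, str]]:
    """Try each protocol in turn; None means the payload is opaque to us."""
    http = parse_http_request(buf)
    if http is not None:
        return ("http",) + http
    pg = parse_pg_simple_query(buf)
    if pg is not None:
        return ("postgres",) + pg
    return None
```

A TLS record fed to `classify` matches neither parser and returns `None`, which is the gap the rest of the talk is about.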

(08:50):

It looks like this in most cases. So first and foremost, I'll talk about the threads. It's rare that an incoming request is handled by the same thread as the outgoing request. Java applications typically have thread pools. So we have an incoming thread pool that handles all the HTTP requests, and we have an outgoing thread pool that handles the JDBC connections. Now, what about the traffic that the kernel sees? We are no longer able to see the actual payloads because it's just gibberish. It's TLS encrypted. So while for other languages we can tap into libssl or BoringSSL and so on, Java implements TLS in Java. There are exceptions to this rule where people purposely use OpenSSL or BoringSSL through JNI, but the norm is you use the TLS encryption libraries built into the JDK itself. And therefore, we can't see anything: no instrumentation for these applications.
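To make the thread-pool problem concrete, here is a minimal sketch of the kind of link that has to be recorded when work hops from one thread to another: which thread submitted a task and which worker thread ran it. The `LinkingPool` class and its `parent_of` map are hypothetical names for illustration, and this is Python rather than the JDK executors the talk is about, so treat it as a picture of the idea and not as how any real agent is implemented.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class LinkingPool:
    """Wraps a thread pool and remembers who handed work to each worker."""

    def __init__(self, max_workers: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        # worker thread id -> id of the thread that submitted its current task
        self.parent_of: Dict[int, int] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> Future:
        parent = threading.get_ident()  # captured on the submitting thread

        def run() -> Any:
            # Runs on the worker thread: record the hand-off before the task.
            with self._lock:
                self.parent_of[threading.get_ident()] = parent
            return fn(*args)

        return self._pool.submit(run)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

With such a link, an outgoing call seen on a worker thread can be traced back to the thread that was handling the incoming request, which is what thread-id-based correlation alone cannot do once pools are involved.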

(09:53):

So what we've done in OBI is we looked for a while at the ability to instrument with uprobes, and we gave up because of the complexities I explained. So we decided to build a tiny Java agent that can be dynamically injected. So a running process will accept this agent, and we load it with a very small number of probes that just fill in the gaps we're not able to see at the kernel level. So what does this mean? This Java agent only instruments TLS and the thread pools in Java. It's bare bones and doesn't instrument any application libraries. We wanted to keep the same concept as the rest of OpenTelemetry eBPF Instrumentation: try to keep it as generic as possible so that we can instrument the world without having to instrument every individual library. So it doesn't matter if it's built on top of Spring Boot or it's using Netty Reactor or it's using Apache something.

(11:00):

It doesn't really matter. At the end of the day, it's still HTTPS. If we can tap into the protocol at TLS time, we can actually capture and get all of that data. And we've added one other piece of technology on top of this, where we harvest the routes embedded in the symbols of the generated Java classes. So for example, you may have in Spring Boot a path that says /users/{id}. When you see that route coming in on the buffers from the kernel, it's /users/123. So we do that matching. We understand that this /users/123 matches the Spring Boot path, just like Spring Boot would do. So how does this actually work in practice? I know it's maybe a lot to consume, but to make OBI work really well with any sort of programming language, we need two pieces of information.
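The route matching mentioned here can be pictured with a small sketch: given a set of route templates harvested from the application, map a concrete request path back to the template it belongs to. The helper names (`compile_route`, `match_route`) and the example templates are assumptions made for illustration; this is one simple way to do the matching, not a description of OBI's actual algorithm or of Spring's full path-matching rules.

```python
import re
from typing import List, Optional, Pattern


def compile_route(template: str) -> Pattern[str]:
    """Turn a template such as /users/{id} into a regex, one segment per placeholder."""
    parts = []
    for segment in template.strip("/").split("/"):
        if segment.startswith("{") and segment.endswith("}"):
            parts.append("[^/]+")  # a placeholder matches exactly one path segment
        else:
            parts.append(re.escape(segment))
    return re.compile("^/" + "/".join(parts) + "/?$")


def match_route(path: str, templates: List[str]) -> Optional[str]:
    """Return the template matching the path, preferring fewer placeholders."""
    path = path.split("?", 1)[0]  # ignore the query string
    for template in sorted(templates, key=lambda t: t.count("{")):
        if compile_route(template).match(path):
            return template
    return None


# Hypothetical templates, as they might be harvested from an application.
ROUTES = ["/users/{id}", "/users/{id}/orders/{orderId}", "/users/me"]
```

Reporting `/users/{id}` instead of `/users/123` keeps the route attribute low-cardinality, so requests for different users aggregate under a single route in metrics and traces.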

(12:04):

We need connection information. So we need to know who's talking to whom, so we can find the IP addresses and ports and from there try to enrich this information from Kubernetes. And we need the buffers, so we can actually analyze the protocol and figure out it's HTTP and Postgres on these sides. So let's take the one example we needed to handle. Java internally, as we found once we added the probes, has two modes to handle TLS. One is synchronous, where it's the secure socket. It's the class internally that everybody uses, and there's plenty of Java libraries that use that approach. It's the same class that handles both the communication with the peer and the SSL encryption. Great. We could extract both the connection information and the unencrypted buffers from the same instrumentation. However, other implementations of SSL encryption in Java use two separate paths.

(13:10):

For example, Netty. They have their own high-performance socket implementation, but they use the SSLEngine from Java to do the encryption. And so the actual work here is asynchronous. One part does the encryption, the other part does the actual communication to the other side. So how do we correlate these buffers? It was actually an innovation by my manager, Fabian Stäber, who brought us the idea. He said, "Well, you can track all the threads you want, but you'll end up in a situation where you miss and then you cannot correlate. However, if encryption is done right, the encrypted buffer should be unique enough for you to create a key. If the buffer is the same all the time, then what you end up with is encryption that can be broken." So what we do is take a sufficient number of bytes of the encrypted text at the time of encryption and use that as a key in an in-memory map to the actual buffer that we wanna read.

(14:13):

That happens at encryption time. When this encrypted text is sent over the wire to the other side, this is when we extract the key again and look up the map, and at that point we have both the connection information and the buffer that we wanna read. So, now that we have this information, we looked into how to pass it to the eBPF side. It's all great, we found it in Java. So this agent now does a very fast system call, ioctl, where it communicates a Java buffer that has been created for us to read. This ioctl call goes into the kernel, and this is the chance for us to set a kprobe with OBI that will intercept that call, read the information, and pass it to the rest of the OBI machinery to actually produce the telemetry that we need.
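The ciphertext-keyed correlation described above can be sketched as a small in-memory map: at encryption time, store the plaintext under a prefix of the ciphertext; at socket-write time, use the same prefix to find the plaintext and pair it with the connection. The class name `CipherCorrelator`, the 32-byte prefix, and the entry limit are all assumptions chosen for illustration, not values taken from OBI.

```python
from typing import Dict, Optional, Tuple

Connection = Tuple[str, int, str, int]  # (src ip, src port, dst ip, dst port)


class CipherCorrelator:
    """Links plaintext seen at encryption time to the socket that later sends it."""

    def __init__(self, key_bytes: int = 32, max_entries: int = 1024) -> None:
        self._key_bytes = key_bytes      # assumed prefix length used as the key
        self._max_entries = max_entries  # bound memory if a write never shows up
        self._pending: Dict[bytes, bytes] = {}

    def on_encrypt(self, plaintext: bytes, ciphertext: bytes) -> None:
        """Called where encryption happens: only the buffers are known here."""
        if len(self._pending) >= self._max_entries:
            # dicts keep insertion order, so this evicts the oldest entry
            self._pending.pop(next(iter(self._pending)))
        self._pending[ciphertext[: self._key_bytes]] = plaintext

    def on_socket_write(
        self, connection: Connection, ciphertext: bytes
    ) -> Optional[Tuple[Connection, bytes]]:
        """Called where bytes hit the wire: only the connection is known here."""
        plaintext = self._pending.pop(ciphertext[: self._key_bytes], None)
        if plaintext is None:
            return None  # never saw this buffer being encrypted
        return connection, plaintext
```

The design leans on the property quoted in the talk: properly encrypted output should not repeat, so a long enough prefix acts as a practically unique key without any thread tracking.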

(15:04):

So we just filled in the missing pieces that we needed to get Java to work. So Endre here has been playing with this for the time that we've been developing it. So why don't you tell us a little bit about the things you've done with this?

Endre Sara (15:19):

Sure, so we are working with a number of customers. For example, there is this one customer that's running an edge security application, and they've got various Java clients, Spring Boot clients, that are talking to Keycloak. Now, Keycloak is natively instrumented, but obviously it's more important how the clients are interacting with Keycloak, which is over HTTPS. We have other customers in the digital marketing space that extensively use Google Pub/Sub, also over SSL, and then a large betting company that has a number of Java applications. Most of their applications are in Java and they're using managed Postgres, actually other managed databases as well, over TLS, as well as Kafka. Now, I can't bring their applications here, so I thought I'd write something that represents them, and all of them are public or open source. Each of these can be deployed by a simple Helm chart in your own Kubernetes cluster.

(16:14):

If you want to try any of these combinations to see how they work with Beyla, I actually have a branch that does native OTel instrumentation on each one of them, and then a branch that removes all the OTel instrumentation, so that you can see how they work with Beyla, or without Beyla with the native instrumentation. Have a lot of fun. Please tell me if there is anything that I could do differently about them, but try it if you want to.

Nikola Grcevski (16:40):

And we've had a number of bug reports coming from him. He always tries some new framework and some new protocol and whatever, and it breaks and he says, "Hey, I tried this and it didn't work." And then we make it better. So a lot of trial and error, but we have a bunch of trade-offs still. Why don't you tell us what works, what doesn't?

Endre Sara (16:59):

So a number of things work very well. I tried a number of scenarios along the lines of the examples that I mentioned, for both HTTP and gRPC communication over TLS. A number of these applications are Spring Boot, in the way that they communicate with each other. I tested with both a managed Postgres service in AWS and an MSK broker in AWS over TLS. And in general, the trace context propagation works across these components, except with TLS.

Nikola Grcevski (16:59):

Yeah.

Endre Sara (16:59):

Right? One of my examples is really in Quarkus. So if you want to try it out, you could also compile it in different ways. So I think that among the things that we still have to work on, one of them is the TLS-based context propagation.

Nikola Grcevski (17:54):

Yeah, that doesn't work yet.

Endre Sara (17:55):

And then the other thing that Nikola was alluding to, and this is a bit detailed, is that if you are running the latest, latest version of the Kafka library, and you're running the latest, latest version of the Kafka broker, they negotiate a V13 protocol between them, and the currently released Beyla supports V12. I think that your PR is out. I'm not sure if it has been merged for V13. It will be soon in Beyla as well. So these things are pretty cool, that they are coming up as we're basically checking the protocols. But I think that what's really important about all of these is that while they look like separate features, what Nikola mentioned about writing this once and then allowing all kinds of libraries, all kinds of frameworks, all kinds of communication to be instrumented at the same time is pretty powerful. You don't have to pick a special Beyla version for each of Confluent versus MSK versus Google Pub/Sub versus other messaging protocols.

Nikola Grcevski (19:00):

Yeah, and I was gonna say, we did add thread support, but there's reactive libraries that we don't support. Like I said, we wanted to keep it JVM specific. So JDK alone, we haven't added any of the reactive libraries. So if you use RxJava or any of that stuff, thread correlation may not work. So why don't we look at when you should choose eBPF versus the Java agent, and when it's not one or the other?

Endre Sara (19:25):

So first of all, while this looks like a comparison, we are really thinking about these as complementary and not competitors. As we look at our existing customer base, we walk into customers where they have some instrumentation but not everything, or have nothing, but they want to move somewhere. So really what I see is that the combination of these choices is what allows customers to be successful in instrumenting their applications. Now, there are a couple of cases where, if you want to fully instrument every single library, and especially if you have custom business metrics, then the Java agent is probably a better choice. Conversely, if you cannot modify the code, if you're running third-party applications, and probably even more interestingly, if you just want zero-touch, day-zero instrumentation of all of your applications, then OBI and Beyla are probably the best choice.

(20:23):

What practically happens in most of our customer cases is that we roll out Beyla everywhere. And then they say, well, this particular application, I think I can instrument this better, either because it's too chatty or because there are specific business metrics or specific communications that I want to instrument. And what's really cool about OBI and Beyla is that they recognize if an application has native instrumentation and don't instrument it. So what we end up with, almost organically and naturally, is the combination of all the things that you really care about instrumented as accurately as you wish with the Java agent, and everything else just broadly and instantly instrumented using OBI and Beyla. And of course, most of our customers don't only run Java, don't only run a single language. So it's like, okay, fine, I know how to instrument Java, but I have this Python code, now I have this Node.js code.

(21:23):

And with Beyla, you basically get full, semantically consistent instrumentation of all of your applications.

Nikola Grcevski (21:31):

So key takeaways?

Endre Sara (21:35):

Yeah, so I think it's an amazing possibility to use Beyla in your environment, especially where you cannot deploy your own Java agent. So try it. I think it's really cool that it's really focusing on application instrumentation. I mean, people care about network-level statistics, but what I think is really important is to be able to understand the behavior of the system, the interaction between the services, even down to the API, Kafka topic, SQL database table, and query level. And you basically get this out of the box with OBI, including with TLS. And then make the right choices about where you want to use the Java SDK. And please join us.

(22:30):

Any feedback, contribution, issues?

Nikola Grcevski (22:34):

Yeah, issues, pull requests, very much welcome from anyone. Yeah, that's it. These are some links and resources we have if you want to get a link to the project and participate. But yeah, that's it. I think we're good. Thank you.

Speakers

Nikola Grcevski
Principal Software Engineer — Grafana Labs
Endre Sara
Co-Founder — Causely
