Why OpenTelemetry instrumentation needs both eBPF and SDKs

Fabian Stäber

Edwin Onattu

2025-12-18•13 min

As a vendor-neutral open standard, OpenTelemetry has become the default choice for application instrumentation. However, it’s important to remember that OpenTelemetry isn’t a single technology — it’s an ecosystem. Under the hood, it provides multiple options for instrumenting your applications.

In this blog post, we explore two instrumentation approaches: OpenTelemetry eBPF Instrumentation and runtime-specific OpenTelemetry SDKs, like the OpenTelemetry Java agent.

We’ll explore the benefits of each option, how eBPF instrumentation and SDK instrumentation complement each other, and why we recommend a hybrid model, combining eBPF for comprehensive baseline metrics and SDKs for deeper insights.

SDKs and eBPF: A quick overview

Before we get into the differences between SDKs and eBPF for instrumentation, and why it’s beneficial to use both, here’s a brief overview of each technology.

SDKs

Software Development Kits (SDKs) are runtime-specific instrumentation tools provided by the OpenTelemetry community. SDKs provide auto-instrumentation features, like the OpenTelemetry Java agent, as well as APIs for manual instrumentation. The details of an SDK vary by programming language.

eBPF

Extended Berkeley Packet Filter (eBPF) is a Linux kernel feature that runs sandboxed programs inside the kernel, which makes it possible to auto-instrument applications without modifying them. It’s used by OpenTelemetry eBPF Instrumentation (OBI) and by Grafana Beyla, the open source eBPF-based, zero-code instrumentation tool.

Earlier this year, Grafana Labs donated Beyla to OpenTelemetry, under the OBI project name. As part of the donation, we also announced that Beyla will continue to exist as Grafana Labs’ distribution of the upstream OBI project.

The benefits of using eBPF and SDKs in parallel

Modern observability demands both breadth and depth, and no single instrumentation technique delivers it all. This is why our recommendation is a hybrid approach where you instrument backend services with eBPF and SDKs at the same time. This might sound counterintuitive at first, because the initial impulse is to avoid duplicate instrumentation. However, eBPF and SDKs are not alternatives to each other; they complement each other.

In their ObservabilityCON 2025 presentation "Deploying OpenTelemetry with Grafana Cloud," Ted Young, Developer Programs Director at Grafana Labs and a co-founder of the OpenTelemetry project, and Edwin Onattu, Staff Product Manager at Grafana Labs, introduced the "Hierarchy of Observability Needs." This hierarchy starts with infrastructure visibility at the bottom and leads up to custom metrics at the top.

Orange four-layer pyramid: bottom Infrastructure Visibility, Baseline Service Visibility, Deep Transaction Insights, top Custom Logic.

The instrumentation strategy presented in this post covers the top three layers: baseline service visibility, deep transaction insights, and custom logic. In the next sections, we’ll explain why we recommend eBPF for baseline service visibility, but recommend SDKs for deep transaction insights and custom logic.

Four reasons for choosing eBPF for comprehensive baseline metrics

At the foundational layer of an observability strategy, every system needs a dependable set of comprehensive baseline metrics. These are the universal signals that provide unified insights into service health and behavior at a fundamental level — and this is where eBPF-based instrumentation shines. Even when runtime-specific SDKs are available, relying on eBPF for this baseline offers clear advantages.

Here are four reasons why.

1. eBPF instrumentation is a platform feature

While SDK instrumentation is an application feature, eBPF instrumentation is a platform feature designed to be rolled out to all cluster nodes, for example as a DaemonSet on Kubernetes. That way, it provides comprehensive baseline metrics for all applications running on the cluster.

Because eBPF operates at the network layer rather than the application layer, request rates, error rates, latencies, and service graphs become universally available independent of the programming language or runtime, and independent of how applications are deployed. This includes applications where no SDK is available, like legacy applications running on Java versions older than Java 8 or on Python versions older than Python 3.9.

2. eBPF metrics represent what clients see

eBPF instrumentation looks at applications from the outside, on the network level, while SDK instrumentation looks at internal behavior. This may result in significantly different data.

To illustrate this, we ran a simple demo with a modern Spring Boot REST service. As shown in the screenshot, at 10:16 we increased the number of parallel requests to overload the REST service. Tomcat’s thread pool got exhausted and requests were queued.

eBPF metrics correctly showed request times going up, which accurately represents what clients saw.

eBPF chart: 95% latency steady ~0.9s, sharp jump around 10:16 to ~3.3s, then remains elevated around 3.2-3.3s.

SDK metrics provided by the OpenTelemetry Java agent didn’t show an increase in latency.

Dark dashboard graph titled "SDK," showing a flat green 95% latency line at about 800 ms between 10:10 and 10:20.

The reason is that SDKs observe the times when a request is actively being serviced, which does not include queue times happening before a request gets handled. Conversely, eBPF captures the full request times, including queue times.
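The difference can be reproduced with a small queueing model. The following is a minimal sketch, not the demo described above: the function name `simulate` and all parameters are hypothetical, and times are plain integers in milliseconds. It models a fixed-size worker pool and records two latencies per request: the one a client would see (queue time plus service time) and the one in-process timing would see (service time only).

```python
import heapq


def simulate(num_requests, arrival_interval_ms, service_ms, workers):
    """Return (client_latencies, server_latencies) for a fixed-size worker pool."""
    # Min-heap holding the time at which each worker becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    client, server = [], []
    for i in range(num_requests):
        arrival = i * arrival_interval_ms
        worker_free = heapq.heappop(free_at)
        # The request waits in the queue if every worker is still busy.
        start = max(arrival, worker_free)
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        client.append(finish - arrival)  # includes queue time (outside view)
        server.append(finish - start)    # handling time only (inside view)
    return client, server


# Normal load: requests arrive more slowly than two workers can serve them.
normal_client, normal_server = simulate(10, 1000, 800, workers=2)
# Overload: requests arrive ten times faster, so the queue keeps growing.
overload_client, overload_server = simulate(10, 100, 800, workers=2)
```

Under overload, the in-process latencies stay flat at the service time while the client-side latencies keep climbing, which is the same pattern the two charts show.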

3. eBPF tools provide service graph metrics out of the box

Service graph metrics show the relationship between services, providing information on which service calls which other service. This is the basis for features like Grafana Tempo service graphs, Grafana Cloud Knowledge Graph, and inbound and outbound calls in Grafana Cloud Application Observability.

A diagram of a service graph, showing green circular service nodes labeled checkout, currency, shipping, quote, email, payment, kafka, accounting, linked by green arrows.

SDKs don’t provide service graph metrics out of the box. The workaround is to generate service graph metrics from spans provided by distributed tracing. Service graph metrics can be generated either by the OpenTelemetry Collector using the service graph connector or by the Tempo database using Tempo’s built-in service graph metrics generator.

That said, generating service graph metrics from spans can cause overhead, both for the backend services being monitored and for the metrics generation pipeline. Distributed tracing needs to be configured with a 100% span sample rate, providing a span for every network call. The metrics generation pipeline needs to bring together client and server spans for each call, which requires caching and may cause significant memory usage.
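To make the caching requirement concrete, here is a simplified sketch of span pairing. It is not the implementation used by the OpenTelemetry Collector or Tempo; the span dictionaries and their keys (`span_id`, `parent_id`, `kind`, `service`) are assumptions made for illustration. The point is that whichever half of a call arrives first has to be held in memory until its counterpart shows up.

```python
from collections import Counter


def build_service_graph(spans):
    """Pair client and server spans into (caller, callee) edge counts.

    Returns the edge counts and the number of spans still waiting for a match.
    """
    pending_clients = {}  # client span_id -> calling service
    pending_servers = {}  # parent_id of a server span -> called service
    edges = Counter()
    for span in spans:
        if span["kind"] == "client":
            callee = pending_servers.pop(span["span_id"], None)
            if callee is None:
                # Server side not seen yet: cache the client span.
                pending_clients[span["span_id"]] = span["service"]
            else:
                edges[(span["service"], callee)] += 1
        elif span["kind"] == "server" and span["parent_id"] is not None:
            caller = pending_clients.pop(span["parent_id"], None)
            if caller is None:
                # Client side not seen yet: cache the server span.
                pending_servers[span["parent_id"]] = span["service"]
            else:
                edges[(caller, span["service"])] += 1
    unmatched = len(pending_clients) + len(pending_servers)
    return edges, unmatched


example_spans = [
    {"span_id": "a", "parent_id": None, "kind": "client", "service": "checkout"},
    {"span_id": "a1", "parent_id": "a", "kind": "server", "service": "payment"},
    # Server span arrives before its client span.
    {"span_id": "b1", "parent_id": "b", "kind": "server", "service": "shipping"},
    {"span_id": "b", "parent_id": None, "kind": "client", "service": "checkout"},
    # Client span whose server span never arrives (for example, sampled out).
    {"span_id": "c", "parent_id": None, "kind": "client", "service": "checkout"},
]
edges, unmatched = build_service_graph(example_spans)
```

Unmatched spans stay in the cache until they are paired or evicted, which is where the memory pressure comes from, and any span that is sampled out leaves a call uncounted.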

With OBI and Beyla, generating service graph metrics from spans is not necessary because these tools can provide service graph metrics out of the box. This significantly reduces overhead both for the services being monitored and for the processing pipelines.

4. eBPF tools are a central place for making new features universally available

OBI and Beyla have an active community of contributors who continuously add new features. One example is network metrics for observing communication between hosts. Network metrics often provide valuable insights, as they surface dependencies that are not captured by just looking at applications. Other examples are inter-zone traffic metrics for observing communications between availability zones, or process metrics for observing CPU usage, memory usage, disk I/O, and network I/O.

As eBPF metrics work on a network and kernel level, new features become immediately available across technologies. For example, when eBPF tools add support for the message queue protocol AMQP, that protocol will be supported for all programming languages and runtimes from day one. With SDKs, each SDK for each programming language needs to implement this independently.

Three reasons to add SDK auto-instrumentation for deeper insights

eBPF excels at establishing a comprehensive service health baseline with minimal overhead, but understanding why a request failed or slowed down requires application-level instrumentation. That’s where SDK instrumentation becomes essential.

1. Distributed tracing

While eBPF supports distributed tracing to some extent, this support is limited. OBI offers production-ready tracing for Go, Node.js, Python, Ruby, and for frameworks that don't switch threads while handling a request. However, tracing is not supported for frameworks that do switch threads, such as reactive Java frameworks. For these frameworks, eBPF instrumentation will provide spans representing single network calls, but these spans may be stand-alone and not correlated with the right trace.

Moreover, traces provided by SDKs offer deeper insights than traces provided by eBPF. For example, the OpenTelemetry Java agent instruments internal layers of a Java application, and provides internal spans representing calls to Spring Boot handlers, Hibernate transaction boundaries, and more. Furthermore, the Java agent will attach events such as thrown exceptions, including the stack trace. This level of detail is beyond the scope of eBPF instrumentation.

A tracing timeline showing demo POST /demo/add (47.05ms) with nested UserRepository.save, Session.persist, SELECT/UPDATE queries, Transaction.commit and INSERT bars.

So, while eBPF tracing is great for some languages like Go, the general best practice is to use SDKs for tracing. However, when you use a hybrid approach with eBPF for baseline metrics and SDKs for tracing, you won't need to generate metrics from spans, so the 100% span sample rate described under reason 3 in the eBPF section above is no longer required. Instead, you can configure the SDK with a reasonable head sampling rate for traces to mitigate the performance impact on applications.
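In practice, head sampling is usually configured rather than coded, for example through the standard `OTEL_TRACES_SAMPLER` and `OTEL_TRACES_SAMPLER_ARG` environment variables. To illustrate the idea behind ratio-based head sampling, here is a simplified sketch. The function name `should_sample` is hypothetical, and this is not the exact algorithm of any particular SDK: it only shows how a decision can be derived deterministically from the trace ID, so that every service handling the same trace reaches the same verdict.

```python
def should_sample(trace_id_hex, ratio):
    """Decide from the trace ID alone whether a trace is kept."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value.
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # Keep the trace if that value falls below the configured fraction of 2**64.
    return low_bits < int(ratio * (1 << 64))


# The same trace ID always yields the same decision.
decision = should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)
```

Because eBPF already observes every request for the baseline metrics, lowering the ratio here only reduces the number of detailed traces, not the accuracy of request rates or latencies.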

2. Trace contexts for application logs

Some SDKs provide a way to automatically enrich log lines with the trace ID if the log line has been logged while the server was processing a network request. This is very useful for debugging. For example, if you see a distributed trace with an HTTP status 500, you can filter logs by trace ID and view all errors that have been logged in the context of that trace.

This feature is currently only provided by SDKs, not by eBPF instrumentation.
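The mechanism behind this kind of log enrichment can be sketched with the Python standard library alone. In the sketch below, `current_trace_id`, `TraceIdFilter`, `make_logger`, and `handle_request` are hypothetical names; a real SDK manages the trace context itself, and the context variable here is only a stand-in for it.

```python
import contextvars
import io
import logging

# Stand-in for the active trace context that an SDK would maintain.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def make_logger(stream):
    """Build a logger that prints the trace ID in front of each message."""
    logger = logging.getLogger("demo.trace_logs")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id):
    """Simulate request handling: logs emitted here carry the trace ID."""
    token = current_trace_id.set(trace_id)
    try:
        logger.error("payment failed")
    finally:
        current_trace_id.reset(token)


stream = io.StringIO()
demo_logger = make_logger(stream)
handle_request(demo_logger, "4bf92f3577b34da6a3ce929d0e0e4736")
demo_logger.info("idle")  # logged outside of any request
```

Log lines written while a request is being handled carry that request's trace ID, so filtering logs by trace ID returns exactly the lines that belong to the trace.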

3. Runtime metrics

SDKs provide runtime-specific metrics like metrics on heap usage and garbage collection activity for the JDK. These metrics complement the process metrics provided by eBPF instrumentation.

How combining eBPF and SDKs works in practice

Combining eBPF and SDKs is relatively simple. OBI and Beyla automatically detect SDKs and avoid duplication of signals. If an SDK pushes spans to an OTLP endpoint, the eBPF tools will turn off their tracing feature. If an SDK pushes metrics to an OTLP endpoint, eBPF tools will turn off all metrics defined in OpenTelemetry’s semantic conventions, because these are provided by the SDK. eBPF instrumentation will still provide span metrics, service graph metrics, network metrics, and process metrics.

There is one caveat, though: if Grafana Tempo or the OpenTelemetry Collector generates metrics from spans, there will be conflicts if, simultaneously, eBPF instrumentation provides these metrics directly. To avoid this, you have to set the resource attribute span.metrics.skip=true on the spans to skip metrics generation.

So, when you instrument a service with an SDK for tracing, set the following environment variable:

OTEL_RESOURCE_ATTRIBUTES="span.metrics.skip=true"

This is a standard OpenTelemetry environment variable supported by all SDKs. SDKs will add the resource attribute to trace data, causing metrics generators to skip metrics generation for these spans.
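The value of `OTEL_RESOURCE_ATTRIBUTES` is a comma-separated list of `key=value` pairs. The sketch below shows how such a value can be parsed and how a metrics generator might act on the attribute. Both function names are hypothetical, and the skip check is an assumption made for illustration; the actual logic inside Tempo or the OpenTelemetry Collector may differ.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw):
    """Parse 'key1=value1,key2=value2' into a dictionary."""
    attributes = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # ignore malformed entries
        key, value = pair.split("=", 1)
        # Values may be percent-encoded, so decode them.
        attributes[key.strip()] = unquote(value.strip())
    return attributes


def should_generate_span_metrics(resource_attributes):
    """Illustrative check: skip spans whose resource opts out."""
    return resource_attributes.get("span.metrics.skip", "").lower() != "true"


resource = parse_resource_attributes(
    "service.name=checkout,span.metrics.skip=true"
)
generate = should_generate_span_metrics(resource)
```

Additional attributes, such as `service.name`, can be set in the same variable by separating the pairs with commas.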

Why you can skip span metrics generation from SDK traces

The above configuration step often raises an important question: if we disable span-derived metrics from SDK traces, do we lose visibility? The short answer is no. In a hybrid model, disabling span metrics generation from SDK traces does not reduce observability coverage.

eBPF instrumentation provides span metrics and service graph metrics directly, without relying on trace sampling. Because it operates at the network and kernel level, it observes all requests, producing unbiased baseline metrics that reflect what actually happened in the system.

SDKs are best used for deep transactional insights. They provide high-fidelity traces with rich semantic and library-level context for debugging and root cause analysis, rather than for generating baseline metrics.

If application-level metrics are required, many SDKs can emit semantic convention metrics directly. These metrics are explicit and sampling-independent, but they are not consistently available across all runtimes and cannot provide service graph metrics — coverage that eBPF provides consistently.

Other key considerations

Use OpenTelemetry APIs to instrument your business logic

With eBPF instrumentation for comprehensive baseline metrics and SDKs for deep transactional insights, your instrumentation strategy is already in a good place.

However, explicitly calling the OpenTelemetry API in your code to expose insights into the business logic will take your instrumentation experience one step further, and can significantly streamline root cause analysis.

Auto-instrumentation can only capture technical insights, like status codes or durations; it cannot determine the intent of the instrumented services. Explicitly instrumenting your business logic can provide this context. For example, if an online shop provides different payment methods, you can add a counter metric to track the total number of payments by payment method. Moreover, you can add a span attribute to mark the distributed span with the payment method selected.
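The payment example can be sketched as follows. To keep the snippet dependency-free, `SketchCounter` and `SketchSpan` are hypothetical stand-ins that only mimic the shape of a counter's `add(amount, attributes)` call and a span's `set_attribute(key, value)` call; in a real service you would obtain the counter and the current span from the OpenTelemetry API instead. The metric name and attribute key are likewise assumptions.

```python
from collections import defaultdict


class SketchCounter:
    """Stand-in for a counter instrument, keyed by attribute set."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(int)

    def add(self, amount, attributes=None):
        key = tuple(sorted((attributes or {}).items()))
        self.values[key] += amount


class SketchSpan:
    """Stand-in for the span that represents the current request."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


payments_total = SketchCounter("shop.payments")  # hypothetical metric name


def process_payment(span, method):
    """Business logic hook: record which payment method was used."""
    payments_total.add(1, {"payment.method": method})
    span.set_attribute("payment.method", method)


span = SketchSpan()
for chosen_method in ["credit_card", "paypal", "credit_card"]:
    process_payment(span, chosen_method)
```

With the counter, a dashboard can break down payments by method; with the span attribute, a failing trace immediately shows which payment method was involved.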

So, once you have the lower layers in the instrumentation hierarchy covered, we recommend adding custom metrics to make your business logic observable.

Integration with Grafana Cloud

We recently launched Instrumentation Hub, a control plane for remote discovery and selective auto-instrumentation, now available in public preview for all Grafana Cloud users.

As the first step, we automated eBPF-based instrumentation for comprehensive baseline service metrics and infrastructure monitoring. The next step is to expand this and include library-level SDK auto-instrumentation.

Security

While OBI is a relatively new project, it grew out of Grafana Beyla, which is a mature, production-ready, enterprise-grade product.

eBPF instrumentation is generally low-risk, because it doesn’t modify the applications being monitored. If anything goes wrong, you can terminate eBPF instrumentation. The Linux kernel will automatically remove all eBPF probes and applications will run as before.

eBPF instrumentation does not require root access or privileged containers. It is sufficient to provide the instrumentation tool with a bespoke set of Linux kernel capabilities.

eBPF programs run in a sandboxed environment and must pass a strict verification process before execution, which minimizes the risk of security or stability issues. Before an eBPF program is loaded into the kernel, it is analyzed by a verifier to ensure it is safe. The verifier checks that the program cannot enter infinite loops, does not perform unsafe memory operations, and stays within defined limits on complexity and code size.

Learn more about Beyla security, permissions, and capabilities.

Limitations and next steps

There are currently some limitations with eBPF instrumentation that apply to both Grafana Beyla and the OBI project:

It’s Linux-only for the moment. eBPF instrumentation isn’t supported yet on Windows, but the eBPF for Windows project is gaining steam and will pave the way for future coverage.
Service graph metrics currently only work on Kubernetes. On other platforms you might see IP addresses rather than service names, which makes these metrics less meaningful.

We are actively working on closing these gaps. Our goal is to make eBPF instrumentation universally available for all hosts and platforms.

Apart from that, we are working on tighter integration between eBPF and SDKs. For example, we want eBPF instrumentation to support Exemplars for navigating from eBPF metrics to SDK traces.

Moreover, we want to provide a unified way to roll out eBPF instrumentation and SDK instrumentation. Where exactly you draw the line between signals that should be provided by eBPF and signals that should be provided by SDKs depends on various factors like the technology, operating system, and runtime versions. Ideally, a unified instrumentation tool will apply the right technologies so that users always get the best of both worlds.

Getting started

If you are a Grafana Cloud user, the easiest way to get started is to follow the onboarding instructions in the Grafana Cloud Instrumentation Hub: navigate to Connections → Collector in Grafana Cloud, install Alloy, and manage your instrumentation remotely.
