
Continuous profiling for native code: Understanding the what, why, and how
It’s hard to imagine deploying any application today without observability. Logs have been around since the early days of mainframes, metrics became standard with early Unix systems, and tracing gained traction in the mid-2010s as distributed architectures took off. Together, these three signals have become the foundation of how we understand and operate modern software.
Profiling, as a debugging practice, has also been around for a long time. Then, in the mid-2010s, a number of products emerged that gave rise to profiles as the fourth signal of observability. Initially, continuous profiling was exclusive to managed languages, but recent advances in eBPF have made it universally applicable.
In this post, we’ll explore the benefits of continuous profiling, and walk through an example of using it to gain visibility into the performance of a native-language codebase.
Benefits of continuous profiling
When developing an application, adding metrics and logs is a conscious effort that requires planning and forethought; the developer has to think ahead about measurable parameters and insert measurement points. While this is a good practice, there are cases where this may not result in full visibility. These include:
- Interaction of multiple system changes: Simultaneous changes in different parts of the system may not affect performance individually, but their combined effect can have an unexpected impact.
- Requirement change: A function suddenly or gradually becomes widely used.
- An issue with something widely used that would not make sense to instrument manually. For example, even though Hash DoS attacks are a known risk, it wouldn't make sense to add metrics to every HashMap use in the code.
- Something a developer simply didn’t think about.
Unlike the manual process of adding log messages and application-level metrics to code, continuous profiling — being based on sampling — surfaces unknown unknowns in performance, without introducing high costs.
What tangible benefits can a team gain from using continuous profiling? Let’s go over a few possibilities:
Proactive issue detection
By continuously sampling performance in production, it’s possible to detect performance issues proactively, reducing performance-related SLI impact.
Differential performance analysis
By combining live performance sampling with techniques like A/B testing or blue-green deployment, it is possible to analyze the impact of code/configuration changes in near-real-time, and minimize impact and support overhead in case of degraded performance.
Cost savings in cloud deployments
Even with sustained use discounts in cloud environments, CPU usage directly translates to incurred cost. Having deep visibility into CPU spend — with specific targets highlighted — can help drive both short- and long-term cost savings.
SLI improvement in on-premises environments
It's slightly harder to achieve short-term cost savings in on-premises environments with servers that have already been bought. However, continuous profiling can still help reduce CPU usage, which lowers service latency, leaves less headroom lost to capacity incidents, and, over the long term, reduces the need to procure new hardware.
Identification of “unknown unknowns”
Most importantly, profiling does not make assumptions about how the code is structured or executed. It provides an objective quantification of CPU resource spend, uncovering inefficiencies which were, for whatever reason, not predicted.
Continuous profiling for native code
While tools like Profiler and Perfmon have made runtime instrumentation relatively straightforward for managed languages, achieving similar visibility for native code has always been a challenge.
The higher effort required to develop and maintain a performance-critical code base can also make development teams understandably wary of running profiling tools in production environments. One common concern is how the addition of profiling will impact performance and, more importantly, the correctness of the observed program.
Thankfully, the recent adoption of eBPF as the tool of the trade for profiling helps mitigate these risks. As eBPF routines are executed in kernel space, the performance impact on the profiled application is minimal; at the same time, JIT compilation with built-in correctness checks ensures that profiling hooks do not have unintended effects on the program being observed. As a result, it is now possible to profile native applications with a performance impact comparable to running in a container.
As of v0.136, the OpenTelemetry Collector supports continuous profiling, using the experimental profiles signal. This implementation is available in the ebpf-profiler repository. At the moment, only amd64/arm64 Linux targets are supported; however, this covers the majority of current use cases in on-premises and cloud deployments.
Grafana Alloy — Grafana Labs’ OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles — builds on this by additionally offering native stack unwinding and C++/Rust function name demangling.
We’ll make use of both these features next.
A step-by-step example: profiling CPU usage of a native, off-the-shelf application
One of the more demanding applications of low-level languages is database development, which requires both a consistent commitment to coding discipline and a strong performance culture.
In this example, we’ll apply off-the-shelf tools to gain insights into the performance of an open source C++ application, without introducing dependencies or code changes.
Setup
To perform the experiment, we’ll use the following:
- 2 c7a.2xlarge AWS instances
- Grafana Alloy installed on one server, running as root
- Grafana Cloud Profiles, a hosted continuous profiling tool powered by Grafana Pyroscope
- A database executable built in C++. I’ve chosen DuckDB, as it’s easy to build from source and benchmark.
We can then install DuckDB on both servers, generate a TPC-H dataset, and configure Alloy on one machine using a fairly simple configuration file:
discovery.process "all" {}

discovery.relabel "service" {
  targets = discovery.process.all.targets

  rule {
    action        = "replace"
    source_labels = ["__meta_process_exe"]
    regex         = ".*/([^/]+)$"
    target_label  = "service_name"
    replacement   = "$1"
  }
}

pyroscope.ebpf "process" {
  demangle    = "templates"
  sample_rate = 20
  forward_to  = [pyroscope.write.remote.receiver]
  targets     = discovery.relabel.service.output
}

pyroscope.write "remote" {
  endpoint {
    url = "....."

    basic_auth {
      username = "...."
      password = env("GRAFANA_TOKEN")
    }
  }

  external_labels = {
    "instance" = env("HOSTNAME")
  }
}

The relabeling rule is required so that the process executable name is picked up as the service name, which is one way to set up profiles.
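To make the relabel rule more concrete, here is a quick illustration, in Python purely for demonstration, of what the regex extracts from a process executable path (the path itself is an example value, not taken from the experiment):

import re

# The same regex the relabel rule uses to turn __meta_process_exe into service_name.
exe_path = "/usr/local/bin/duckdb"            # example value of __meta_process_exe
match = re.search(r".*/([^/]+)$", exe_path)
print(match.group(1))                         # prints "duckdb", used as service_name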
A quick look into Profiles Drilldown view shows us the data is flowing:

Note the demangled C++ names, thanks to built-in demangling support in Alloy on Linux.
Benchmark with DuckDB
To validate the performance impact, we can use a simple Python script that populates a TPC-H database and then benchmarks queries against it using DuckDB’s own TPC-H extension and internal timer (“.timer on”).
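The exact script isn't reproduced here, but a minimal sketch of such a harness, assuming the duckdb and numpy Python packages and timing in Python rather than with DuckDB's internal timer, could look like the following; the database path, scale factor, and output format are illustrative:

import time
import duckdb
import numpy as np

RUNS = 200                                  # runs per query, no warmup

con = duckdb.connect("tpch.duckdb")
con.execute("INSTALL tpch")
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 10)")          # scale factor is an assumption

for q in range(1, 23):                      # TPC-H Q1..Q22
    timings_ms = []
    for _ in range(RUNS):
        start = time.perf_counter()
        con.execute(f"PRAGMA tpch({q})").fetchall()
        timings_ms.append((time.perf_counter() - start) * 1000)
    p50, p75, p90, p95, p99, p100 = np.percentile(timings_ms, [50, 75, 90, 95, 99, 100])
    print(f"Q{q}: {p50:.2f} {p75:.2f} {p90:.2f} {p95:.2f} {p99:.2f} {p100:.2f}")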
Here are the results of the benchmark with 200 runs per query. We are not including warmups to ensure we can see the effects of profiling on cold runs. Run time is in milliseconds, with quantiles calculated using NumPy:
Results with profiling on:
| Query | p50 | p75 | p90 | p95 | p99 | p100 |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 | 295 | 296 | 296 | 297 | 304.07 | 433 |
| Q2 | 177 | 188 | 196 | 198.1 | 208.02 | 228 |
| Q3 | 264.5 | 270 | 279 | 282.05 | 287 | 413 |
| Q4 | 224 | 226 | 228 | 229 | 234.03 | 336 |
| Q5 | 710 | 714 | 720.2 | 729 | 737.08 | 867 |
| Q6 | 64 | 65 | 65 | 65 | 65.01 | 172 |
| Q7 | 672 | 675 | 678 | 680 | 690.05 | 861 |
| Q8 | 346 | 348 | 350 | 351 | 354 | 529 |
| Q9 | 3202.5 | 3237.25 | 3303.4 | 3336.1 | 3398.28 | 3484 |
| Q10 | 831 | 835 | 840.1 | 845 | 849 | 1052 |
| Q11 | 122 | 123 | 123 | 124 | 126 | 163 |
| Q12 | 163 | 164 | 165 | 165 | 165.01 | 279 |
| Q13 | 877 | 884.25 | 896.1 | 903 | 912.03 | 2172 |
| Q14 | 163 | 166 | 168 | 170 | 171.28 | 307 |
| Q15 | 104 | 105 | 106 | 107 | 108.01 | 244 |
| Q16 | 163.5 | 165 | 166 | 167 | 168 | 194 |
| Q17 | 229 | 229 | 230 | 230 | 231 | 345 |
| Q18 | 1945 | 1956 | 1967 | 1977.25 | 1992.04 | 2074 |
| Q19 | 298 | 300 | 303 | 304 | 323.01 | 479 |
| Q20 | 148 | 150 | 151 | 152 | 153.02 | 305 |
| Q21 | 2027 | 2054 | 2072.5 | 2094.5 | 2151.19 | 2195 |
| Q22 | 145 | 146 | 147 | 147 | 155.13 | 170 |
Results with profiling off:
| Query | p50 | p75 | p90 | p95 | p99 | p100 |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 | 294 | 294 | 295 | 295 | 296 | 453 |
| Q2 | 174.5 | 183 | 189 | 193 | 201.09 | 213 |
| Q3 | 257.5 | 263 | 272.1 | 274 | 284.04 | 398 |
| Q4 | 222 | 224.25 | 226 | 227 | 229.01 | 315 |
| Q5 | 703 | 706 | 709 | 712.05 | 732.04 | 862 |
| Q6 | 65 | 65 | 65 | 65 | 65.01 | 169 |
| Q7 | 665 | 669 | 672 | 673 | 678.06 | 849 |
| Q8 | 344 | 346 | 349 | 350.05 | 352.02 | 520 |
| Q9 | 3208 | 3251.25 | 3287 | 3316.35 | 3391.05 | 3425 |
| Q10 | 837 | 843 | 848 | 852.05 | 859.03 | 1560 |
| Q11 | 122 | 123 | 123 | 124 | 124.01 | 141 |
| Q12 | 165 | 166 | 167 | 167 | 168 | 299 |
| Q13 | 893 | 899 | 904.1 | 911.05 | 922.01 | 2243 |
| Q14 | 159 | 161 | 163.1 | 167 | 169.06 | 304 |
| Q15 | 104 | 104 | 104 | 105 | 105.01 | 241 |
| Q16 | 164 | 165 | 166 | 166 | 168.02 | 193 |
| Q17 | 229 | 230 | 230 | 231 | 231 | 346 |
| Q18 | 1956 | 1966 | 1975.1 | 1988 | 2024.73 | 2101 |
| Q19 | 300 | 302 | 304 | 305 | 308.02 | 479 |
| Q20 | 148 | 149 | 150 | 151 | 152.02 | 323 |
| Q21 | 2045 | 2073 | 2113 | 2129.25 | 2161.01 | 2166 |
| Q22 | 147 | 148 | 149 | 150 | 150.01 | 169 |
As expected, there are some outliers at p100 where the non-profiled run is slower because of statistical noise, but the general performance impact is consistently within a few percentage points.
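As a rough sanity check, the p50 overhead can be quantified directly from the tables above. Here is a small example using a few of the queries, with values copied from the results:

# p50 run times in ms, as (profiling on, profiling off), copied from the tables above.
p50 = {
    "Q5":  (710.0, 703.0),
    "Q9":  (3202.5, 3208.0),
    "Q18": (1945.0, 1956.0),
}

for query, (on_ms, off_ms) in p50.items():
    overhead_pct = (on_ms - off_ms) / off_ms * 100
    print(f"{query}: {overhead_pct:+.1f}%")   # prints roughly +1.0%, -0.2%, -0.6%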
Analysis with Drilldown
Let’s see if making that performance tradeoff gave us any useful insights.
Using the Drilldown → Profiles screen again, we can filter for the duckdb executable by label and see the distribution of own/nested CPU time across functions:

Further context is provided by the flame graph view. We can limit it to the ExecutorTask::Execute function here for better visibility:

Thanks to stack walking and demangling, we see a clear list of functions that use the most CPU in our scenario. If we were to attempt to optimize the runtime of the TPC-H benchmark in DuckDB, our first candidates for optimization would be:
- duckdb::JoinHashTable::GetRowPointers
- duckdb::JoinHashTable::InsertHashes
- duckdb::JoinHashTable::ScanStructure::AdvancePointers
- duckdb::GroupedAggregateHashTable::FindOrCreateGroupsInternal
- duckdb::(anonymous namespace)::StringCompress<duckdb::uhugeint_t>
Benchmark with PostgreSQL
DuckDB is a relatively niche product, so let’s compare the impact on PostgreSQL. After installing it on the same machines and running pgbench, we can see a similar picture.
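The exact invocation isn't shown below, but based on the reported parameters (scale factor 100, one client, one thread, 1,000 transactions), the runs could be reproduced with something like the following sketch; the database name is an assumption and must already exist:

import subprocess

DB = "pgbench_test"  # assumed database name, e.g. created beforehand with createdb

# Initialize pgbench tables at scale factor 100, then run 1 client / 1 thread / 1000 transactions.
subprocess.run(["pgbench", "-i", "-s", "100", DB], check=True)
subprocess.run(["pgbench", "-c", "1", "-j", "1", "-t", "1000", DB], check=True)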
Results with profiling on:
pgbench (16.10 (Ubuntu 16.10-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type:
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 1.292 ms
initial connection time = 1.764 ms
tps = 774.093266 (without initial connection time)

Results with profiling off:
pgbench (16.10 (Ubuntu 16.10-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type:
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 1.252 ms
initial connection time = 1.757 ms
tps = 798.648049 (without initial connection time)

For Postgres, the performance impact of profiling is within 4%. One caveat is that a custom, non-stripped build of Postgres is required to get full stack walking capabilities. However, immediate access to detailed performance data may be worth the tradeoff.
Outcome
With off-the-shelf software components, we were able to get a detailed look into the performance of a highly optimized codebase and identify clear targets for reducing its CPU usage. The observed performance impact was minimal, within a few percentage points.
What we didn’t explore
There are a number of capabilities that we didn’t explore here, but are worth pointing out:
- Off-CPU profiling: Both OpenTelemetry and Alloy profilers can also monitor off-CPU events, meaning periods when a thread is not actively executing on the CPU.
- Performance comparison: Grafana Cloud Profiles supports a diff view to analyze changes in application performance over time.
These capabilities can further enhance the value teams get out of eBPF profiling.
Wrapping up
The state of observability is constantly evolving. Just a few years ago, getting real-time visibility into the performance of a native codebase in production required a continuous investment in custom-built tools — and specialized engineering skills.
Now, thanks to engineering efforts across the community, teams can gain detailed and comprehensive insights into native codebases with off-the-shelf components and continuous profiling.
I invite the community to try the tools and services described in this blog and to provide feedback — we are always listening on our community Slack.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!



