
Continuous profiling for native code: Understanding the what, why, and how
It’s hard to imagine deploying any application today without observability. Logs have been around since the early days of mainframes, metrics became standard with early Unix systems, and tracing gained traction in the mid-2010s as distributed architectures took off. Together, these three signals have become the foundation of how we understand and operate modern software.
Profiling, as a debugging practice, has also been around for a long time. Then, in the mid-2010s, a number of products emerged that gave rise to profiles as the fourth signal of observability. Initially, continuous profiling was exclusive to managed languages, but recent advances in eBPF have made it universally applicable.
In this post, we’ll explore the benefits of continuous profiling, and walk through an example of using it to gain visibility into the performance of a native-language codebase.
Benefits of continuous profiling
When developing an application, adding metrics and logs is a conscious effort that requires planning and forethought; the developer has to think ahead about measurable parameters and insert measurement points. While this is a good practice, there are cases where this may not result in full visibility. These include:
- Interaction of multiple system changes: Simultaneous changes in different parts of the system may not affect performance individually, but their combined effect can have an unexpected impact.
- Requirement change: A function suddenly or gradually becomes widely used.
- An issue with something widely used that would not make sense to instrument manually. For example, even though Hash DoS attacks are a known risk, it wouldn't make sense to add metrics to every HashMap use in the code.
- Something a developer simply didn’t think about.
Unlike the manual process of adding log messages and application-level metrics to code, continuous profiling — being based on sampling — surfaces unknown unknowns in performance, without introducing high costs.
What tangible benefits can a team gain from using continuous profiling? Let’s go over a few possibilities:
Proactive issue detection
By continuously sampling performance in production, it’s possible to detect performance issues proactively, reducing performance-related SLI impact.
Differential performance analysis
By combining live performance sampling with techniques like A/B testing or blue-green deployment, it is possible to analyze the impact of code/configuration changes in near-real-time, and minimize impact and support overhead in case of degraded performance.
Cost savings in cloud deployments
Even with sustained use discounts in cloud environments, CPU usage directly translates to incurred cost. Having deep visibility into CPU spend — with specific targets highlighted — can help drive both short- and long-term cost savings.
SLI improvement in on-premises environments
It's slightly harder to achieve short-term cost savings in on-premises environments with servers that have already been bought. However, continuous profiling can still help reduce CPU usage, which lowers service latency, leaves less headroom lost to capacity incidents, and, over the long term, reduces the need to procure new hardware.
Identification of “unknown unknowns”
Most importantly, profiling does not make assumptions about how the code is structured or executed. It provides an objective quantification of CPU resource spend, uncovering inefficiencies which were, for whatever reason, not predicted.
Continuous profiling for native code
While tools like Profiler and Perfmon have made runtime instrumentation relatively straightforward for managed languages, achieving similar visibility for native code has always been a challenge.
The higher effort required to develop and maintain a performance-critical code base can also make development teams understandably wary of running profiling tools in production environments. One common concern is how the addition of profiling will impact performance and, more importantly, the correctness of the observed program.
Thankfully, the recent adoption of eBPF as the tool of the trade for profiling helps mitigate these risks. As eBPF routines are executed in kernel space, the performance impact on the profiled application is minimal; at the same time, JIT compilation with built-in correctness checks ensures that profiling hooks do not have unintended effects on the program being observed. As a result, it is now possible to profile native applications with a performance impact comparable to running in a container.
As of v0.136, the OpenTelemetry Collector supports continuous profiling, using the experimental profiles signal. This implementation is available in the ebpf-profiler repository. At the moment, only amd64/arm64 Linux targets are supported; however, this covers the majority of current use cases in on-premises and cloud deployments.
Grafana Alloy — Grafana Labs’ OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles — builds on this by additionally offering native stack unwinding and C++/Rust function name demangling.
We’ll make use of both these features next.
A step-by-step example: profiling CPU usage of a native, off-the-shelf application
One of the more demanding applications of low-level languages is database development, which requires both a consistent commitment to coding discipline and a strong performance culture.
In this example, we’ll apply off-the-shelf tools to gain insights into the performance of an open source C++ application, without introducing dependencies or code changes.
Setup
To perform the experiment, we’ll use the following:
- 2 c7a.2xlarge AWS instances
- Grafana Alloy installed on one server, running as root
- Grafana Cloud Profiles, a hosted continuous profiling tool powered by Grafana Pyroscope
- A database executable built in C++. I’ve chosen DuckDB, as it’s easy to build from source and benchmark.
We can then install DuckDB on both servers, generate a TPC-H dataset, and configure Alloy on one machine using a fairly simple configuration file:
discovery.process "all" {}

discovery.relabel "service" {
  targets = discovery.process.all.targets

  rule {
    action        = "replace"
    source_labels = ["__meta_process_exe"]
    regex         = ".*/([^/]+)$"
    target_label  = "service_name"
    replacement   = "$1"
  }
}

pyroscope.ebpf "process" {
  demangle    = "templates"
  sample_rate = 20
  forward_to  = [pyroscope.write.remote.receiver]
  targets     = discovery.relabel.service.output
}

pyroscope.write "remote" {
  endpoint {
    url = "....."

    basic_auth {
      username = "...."
      password = env("GRAFANA_TOKEN")
    }
  }

  external_labels = {
    "instance" = env("HOSTNAME")
  }
}

The relabeling rule is required so that the process executable name is picked up as the service name, which is one way to set up profiles.
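To make the relabel rule more concrete, here is a quick illustration, in Python purely for demonstration, of what the regex extracts from a process executable path (the path itself is an example value, not taken from the experiment):

import re

# The same regex the relabel rule uses to turn __meta_process_exe into service_name.
exe_path = "/usr/local/bin/duckdb"            # example value of __meta_process_exe
match = re.search(r".*/([^/]+)$", exe_path)
print(match.group(1))                         # prints "duckdb", used as service_name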
A quick look into Profiles Drilldown view shows us the data is flowing:

Note the demangled C++ names, thanks to built-in demangling support in Alloy on Linux.
Benchmark with DuckDB
To validate the performance impact, we can use a simple Python script that populates a TPC-H database and then benchmarks queries against it using DuckDB’s own TPC-H extension and internal timer (“.timer on”).
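The exact script isn't reproduced here, but a minimal sketch of such a harness, assuming the duckdb and numpy Python packages and timing in Python rather than with DuckDB's internal timer, could look like the following; the database path, scale factor, and output format are illustrative:

import time
import duckdb
import numpy as np

RUNS = 200                                  # runs per query, no warmup

con = duckdb.connect("tpch.duckdb")
con.execute("INSTALL tpch")
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 10)")          # scale factor is an assumption

for q in range(1, 23):                      # TPC-H Q1..Q22
    timings_ms = []
    for _ in range(RUNS):
        start = time.perf_counter()
        con.execute(f"PRAGMA tpch({q})").fetchall()
        timings_ms.append((time.perf_counter() - start) * 1000)
    p50, p75, p90, p95, p99, p100 = np.percentile(timings_ms, [50, 75, 90, 95, 99, 100])
    print(f"Q{q}: {p50:.2f} {p75:.2f} {p90:.2f} {p95:.2f} {p99:.2f} {p100:.2f}")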
Here are the results of the benchmark with 200 runs per query. We are not including warmups to ensure we can see the effects of profiling on cold runs. Run time is in milliseconds, with quantiles calculated using NumPy:
Results with profiling on:
| Query | p50 | p75 | p90 | p95 | p99 | p100 |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 | 295 | 296 | 296 | 297 | 304.07 | 433 |
| Q2 | 177 | 188 | 196 | 198.1 | 208.02 | 228 |
| Q3 | 264.5 | 270 | 279 | 282.05 | 287 | 413 |
| Q4 | 224 | 226 | 228 | 229 | 234.03 | 336 |
| Q5 | 710 | 714 | 720.2 | 729 | 737.08 | 867 |
| Q6 | 64 | 65 | 65 | 65 | 65.01 | 172 |
| Q7 | 672 | 675 | 678 | 680 | 690.05 | 861 |
| Q8 | 346 | 348 | 350 | 351 | 354 | 529 |
| Q9 | 3202.5 | 3237.25 | 3303.4 | 3336.1 | 3398.28 | 3484 |
| Q10 | 831 | 835 | 840.1 | 845 | 849 | 1052 |
| Q11 | 122 | 123 | 123 | 124 | 126 | 163 |
| Q12 | 163 | 164 | 165 | 165 | 165.01 | 279 |
| Q13 | 877 | 884.25 | 896.1 | 903 | 912.03 | 2172 |
| Q14 | 163 | 166 | 168 | 170 | 171.28 | 307 |
| Q15 | 104 | 105 | 106 | 107 | 108.01 | 244 |
| Q16 | 163.5 | 165 | 166 | 167 | 168 | 194 |
| Q17 | 229 | 229 | 230 | 230 | 231 | 345 |
| Q18 | 1945 | 1956 | 1967 | 1977.25 | 1992.04 | 2074 |
| Q19 | 298 | 300 | 303 | 304 | 323.01 | 479 |
| Q20 | 148 | 150 | 151 | 152 | 153.02 | 305 |
| Q21 | 2027 | 2054 | 2072.5 | 2094.5 | 2151.19 | 2195 |
| Q22 | 145 | 146 | 147 | 147 | 155.13 | 170 |
Results with profiling off:
| Query | p50 | p75 | p90 | p95 | p99 | p100 |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 | 294 | 294 | 295 | 295 | 296 | 453 |
| Q2 | 174.5 | 183 | 189 | 193 | 201.09 | 213 |
| Q3 | 257.5 | 263 | 272.1 | 274 | 284.04 | 398 |
| Q4 | 222 | 224.25 | 226 | 227 | 229.01 | 315 |
| Q5 | 703 | 706 | 709 | 712.05 | 732.04 | 862 |
| Q6 | 65 | 65 | 65 | 65 | 65.01 | 169 |
| Q7 | 665 | 669 | 672 | 673 | 678.06 | 849 |
| Q8 | 344 | 346 | 349 | 350.05 | 352.02 | 520 |
| Q9 | 3208 | 3251.25 | 3287 | 3316.35 | 3391.05 | 3425 |
| Q10 | 837 | 843 | 848 | 852.05 | 859.03 | 1560 |
| Q11 | 122 | 123 | 123 | 124 | 124.01 | 141 |
| Q12 | 165 | 166 | 167 | 167 | 168 | 299 |
| Q13 | 893 | 899 | 904.1 | 911.05 | 922.01 | 2243 |
| Q14 | 159 | 161 | 163.1 | 167 | 169.06 | 304 |
| Q15 | 104 | 104 | 104 | 105 | 105.01 | 241 |
| Q16 | 164 | 165 | 166 | 166 | 168.02 | 193 |
| Q17 | 229 | 230 | 230 | 231 | 231 | 346 |
| Q18 | 1956 | 1966 | 1975.1 | 1988 | 2024.73 | 2101 |
| Q19 | 300 | 302 | 304 | 305 | 308.02 | 479 |
| Q20 | 148 | 149 | 150 | 151 | 152.02 | 323 |
| Q21 | 2045 | 2073 | 2113 | 2129.25 | 2161.01 | 2166 |
| Q22 | 147 | 148 | 149 | 150 | 150.01 | 169 |
As expected, there are some outliers at p100 where the non-profiled run is slower because of statistical noise, but the general performance impact is consistently within a few percentage points.
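As a rough sanity check, the p50 overhead can be quantified directly from the tables above. Here is a small example using a few of the queries, with values copied from the results:

# p50 run times in ms, as (profiling on, profiling off), copied from the tables above.
p50 = {
    "Q5":  (710.0, 703.0),
    "Q9":  (3202.5, 3208.0),
    "Q18": (1945.0, 1956.0),
}

for query, (on_ms, off_ms) in p50.items():
    overhead_pct = (on_ms - off_ms) / off_ms * 100
    print(f"{query}: {overhead_pct:+.1f}%")   # prints roughly +1.0%, -0.2%, -0.6%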
Analysis with Drilldown
Let’s see if making that performance tradeoff gave us any useful insights.
Using the Drilldown → Profiles screen again, we can filter for the duckdb executable by label and see the distribution of own/nested CPU time across functions:

Further context is provided by the flame graph view. We can limit it to the ExecutorTask::Execute function here for better visibility:

Thanks to stack walking and demangling, we see a clear list of functions that use the most CPU in our scenario. If we were to attempt to optimize the runtime of the TPC-H benchmark in DuckDB, our first candidates for optimization would be:
- duckdb::JoinHashTable::GetRowPointers
- duckdb::JoinHashTable::InsertHashes
- duckdb::JoinHashTable::ScanStructure::AdvancePointers
- duckdb::GroupedAggregateHashTable::FindOrCreateGroupsInternal
- duckdb::(anonymous namespace)::StringCompress<duckdb::uhugeint_t>
Benchmark with PostgreSQL
DuckDB is a relatively niche product, so let’s compare the impact on PostgreSQL. After installing it on the same machines and running pgbench, we can see a similar picture.
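The exact invocation isn't shown below, but based on the reported parameters (scale factor 100, one client, one thread, 1,000 transactions), the runs could be reproduced with something like the following sketch; the database name is an assumption and must already exist:

import subprocess

DB = "pgbench_test"  # assumed database name, e.g. created beforehand with createdb

# Initialize pgbench tables at scale factor 100, then run 1 client / 1 thread / 1000 transactions.
subprocess.run(["pgbench", "-i", "-s", "100", DB], check=True)
subprocess.run(["pgbench", "-c", "1", "-j", "1", "-t", "1000", DB], check=True)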
Results with profiling on:
pgbench (16.10 (Ubuntu 16.10-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type:
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 1.292 ms
initial connection time = 1.764 ms
tps = 774.093266 (without initial connection time)

Results with profiling off:
pgbench (16.10 (Ubuntu 16.10-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type:
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 1.252 ms
initial connection time = 1.757 ms
tps = 798.648049 (without initial connection time)

For Postgres, the performance impact of profiling is within 4%. One caveat is that a custom, non-stripped build of Postgres is required to get full stack walking capabilities. However, immediate access to detailed performance data may be worth the tradeoff.
Outcome
With off-the-shelf software components, we were able to get a detailed look into the performance of a highly optimized codebase and identify clear targets for reducing its CPU usage. The observed performance impact was minimal, within a few percentage points.
What we didn’t explore
There are a number of capabilities that we didn’t explore here, but are worth pointing out:
- Off-CPU profiling: Both OpenTelemetry and Alloy profilers can also monitor off-CPU events, meaning periods when a thread is not actively executing on the CPU.
- Performance comparison: Grafana Cloud Profiles supports a diff view to analyze changes in application performance over time.
These capabilities can further enhance the value teams get out of eBPF profiling.
Wrapping up
The state of observability is constantly evolving. Just a few years ago, getting real-time visibility into the performance of a native codebase in production required a continuous investment in custom-built tools — and specialized engineering skills.
Now, thanks to engineering efforts across the community, teams can gain detailed and comprehensive insights into native codebases with off-the-shelf components and continuous profiling.
I invite the community to try the tools and services described in this blog and to provide feedback — we are always listening on our community Slack.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!



