Continuous profiling for native code: Understanding the what, why, and how

2025-11-14 10 min

It’s hard to imagine deploying any application today without observability. Logs have been around since the early days of mainframes, metrics became standard with early Unix systems, and tracing gained traction in the mid-2010s as distributed architectures took off. Together, these three signals have become the foundation of how we understand and operate modern software.

Profiling, as a debugging practice, has also been around for a long time. Then, in the mid-2010s, a number of products emerged that gave rise to profiles as the fourth signal of observability. Initially, continuous profiling was exclusive to managed languages, but recent advances in eBPF have made it universally applicable.

In this post, we’ll explore the benefits of continuous profiling, and walk through an example of using it to gain visibility into the performance of a native-language codebase.

Benefits of continuous profiling

When developing an application, adding metrics and logs is a conscious effort that requires planning and forethought; the developer has to think ahead about measurable parameters and insert measurement points. While this is a good practice, there are cases where this may not result in full visibility. These include:

  • Interaction of multiple system changes: Simultaneous changes in different parts of the system may not affect performance individually, but their combined effect can have an unexpected impact.
  • Requirement change: A function suddenly or gradually becomes widely used.
  • An issue with something so widely used that it would not make sense to instrument manually. For example, even though Hash DoS attacks exist, it would not be practical to add metrics to every HashMap use in the code.
  • Something a developer simply didn’t think about.

Unlike the manual process of adding log messages and application-level metrics to code, continuous profiling — being based on sampling — surfaces unknown unknowns in performance, without introducing high costs. 
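To make the sampling idea concrete, here is a minimal, illustrative sketch in Python. It is not how the eBPF-based profilers discussed later in this post work; it only demonstrates the principle of periodically capturing call stacks and counting which ones dominate:

# Illustrative only: a toy sampling profiler that periodically records the
# main thread's call stack and counts how often each stack appears.
import collections
import sys
import threading
import time
import traceback

samples = collections.Counter()

def sampler(interval=0.01, duration=2.0):
    # Every `interval` seconds, record the main thread's current call stack.
    end = time.time() + duration
    main_id = threading.main_thread().ident
    while time.time() < end:
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            stack = ";".join(f.name for f in traceback.extract_stack(frame))
            samples[stack] += 1
        time.sleep(interval)

def busy_work():
    return sum(i * i for i in range(200_000))

t = threading.Thread(target=sampler, daemon=True)
t.start()
while t.is_alive():
    busy_work()

# The most frequently observed stacks approximate where CPU time was spent.
for stack, count in samples.most_common(3):
    print(count, stack)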

What tangible benefits can a team gain from using continuous profiling? Let’s go over a few possibilities:

Proactive issue detection

By continuously sampling performance in production, it’s possible to detect performance issues proactively, reducing performance-related SLI impact.

Differential performance analysis

By combining live performance sampling with techniques like A/B testing or blue-green deployment, it is possible to analyze the impact of code/configuration changes in near-real-time, and minimize impact and support overhead in case of degraded performance.

Cost savings in cloud deployments

Even with sustained use discounts in cloud environments, CPU usage directly translates to incurred cost. Having deep visibility into CPU spend — with specific targets highlighted — can help drive both short- and long-term cost savings.

SLI improvement in on-premises environments

It’s slightly harder to achieve short-term cost savings in on-premises environments with servers that have already been bought. However, continuous profiling can still reduce CPU usage, which lowers service latency, leaves fewer openings for capacity incidents, and, over the long term, reduces the need to procure new hardware.

Identification of “unknown unknowns”

Most importantly, profiling does not make assumptions about how the code is structured or executed. It provides an objective quantification of CPU resource spend, uncovering inefficiencies which were, for whatever reason, not predicted.

Continuous profiling for native code

While tools like Profiler and Perfmon have made runtime instrumentation relatively straightforward for managed languages, achieving similar visibility for native code has always been a challenge.

The higher effort required to develop and maintain a performance-critical code base can also make development teams understandably wary of running profiling tools in production environments. One common concern is how the addition of profiling will impact performance and, more importantly, the correctness of the observed program.

Thankfully, the recent adoption of eBPF as the tool of choice for profiling helps mitigate these risks. Because eBPF routines are executed in kernel space, the performance impact on the profiled application is minimal; at the same time, JIT compilation with built-in correctness checks ensures that profiling hooks do not have unintended effects on the program being observed. As a result, it is now possible to profile native applications with a performance impact comparable to running in a container.

As of v0.136, the OpenTelemetry Collector supports continuous profiling, using the experimental profiles signal. This implementation is available in the ebpf-profiler repository. At the moment, only amd64/arm64 Linux targets are supported; however, this covers the majority of current use cases in on-premises/cloud deployments.

Grafana Alloy — Grafana Labs’ OpenTelemetry Collector distribution with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles — builds on this by additionally offering native stack unwinding and C++/Rust function name demangling.

We’ll make use of both these features next.

A step-by-step example: profiling CPU usage of a native, off-the-shelf application

One of the more demanding applications of low-level languages is database development, which requires both a consistent commitment to coding discipline and a strong performance culture.

In this example, we’ll apply off-the-shelf tools to gain insights into the performance of an open source C++ application, without introducing dependencies or code changes.

Setup

To perform the experiment, we’ll use the following:

  • 2 c7a.2xlarge AWS instances
  • Grafana Alloy installed on one server, running as root
  • Grafana Cloud Profiles, a hosted continuous profiling tool powered by Grafana Pyroscope 
  • A database executable built in C++. I’ve chosen DuckDB, as it’s easy to build from source and benchmark.

Next, we can install DuckDB on both servers, generate a TPC-H dataset, and configure Alloy on one machine, using a fairly simple configuration file:

// Discover all running processes on the host.
discovery.process "all" {}

// Derive a service_name label from the basename of the process executable.
discovery.relabel "service" {
  targets = discovery.process.all.targets

  rule {
    action        = "replace"
    source_labels = ["__meta_process_exe"]
    regex         = ".*/([^/]+)$"
    target_label  = "service_name"
    replacement   = "$1"
  }
}

// Collect CPU profiles via eBPF; demangle C++ names (including templates).
pyroscope.ebpf "process" {
  demangle = "templates"
  sample_rate = 20
  forward_to = [pyroscope.write.remote.receiver]
  targets = discovery.relabel.service.output
}

// Send the collected profiles to Grafana Cloud Profiles.
pyroscope.write "remote" {
  endpoint {
    url = "....."

    basic_auth {
      username = "...."
      password = env("GRAFANA_TOKEN")
    }
  }
  external_labels = {
    "instance" = env("HOSTNAME")
  }
}

The relabeling rule is required so that the executable name is picked up as the service_name label, which is one way to organize profiles.
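As a quick illustration of what that rule does (Python uses \1 where the Alloy rule uses $1), the regex simply keeps the basename of the executable path; the path below is a hypothetical example value:

# Hypothetical example value of __meta_process_exe; the regex keeps only the
# basename, which becomes the service_name label.
import re

exe_path = "/usr/local/bin/duckdb"
service_name = re.sub(r".*/([^/]+)$", r"\1", exe_path)
print(service_name)  # -> duckdb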

A quick look at the Profiles Drilldown view shows us the data is flowing:

[Screenshot: a flame graph of the collected performance data, with a tooltip showing detailed metrics for one frame]

Note the demangled C++ names, thanks to built-in demangling support in Alloy on Linux.
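For readers unfamiliar with name mangling, here is a small, hedged illustration of what demangling does. This is not Alloy’s internal mechanism; it just applies the standard c++filt tool to a trivially mangled symbol:

# Illustration only: demangle a simple C++ symbol with the standard c++filt
# binary (shipped with binutils); this is not how Alloy does it internally.
import subprocess

mangled = "_Z3foov"  # mangled form of the C++ function foo()
result = subprocess.run(["c++filt", mangled], capture_output=True, text=True)
print(result.stdout.strip())  # -> foo()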

Benchmark with DuckDB

To validate the performance impact, we can use a simple Python script that populates a TPC-H database and then benchmarks queries against it using DuckDB’s own TPC-H extension and internal timer (“.timer on”).

Here are the results of the benchmark with 200 runs per query. We are not including warmups to ensure we can see the effects of profiling on cold runs. Run time is in milliseconds, with quantiles calculated using NumPy:
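The exact script isn’t included here, so below is a minimal sketch of that kind of harness, assuming the duckdb Python package with its tpch extension and an assumed scale factor; unlike the original setup, it times queries from Python rather than with the CLI’s “.timer on”, but the structure is the same:

# Sketch of a TPC-H benchmark harness for DuckDB. Assumptions: the `duckdb`
# Python package, NumPy, and a scale factor of 10 (not stated in the post).
# The original setup used the DuckDB CLI's `.timer on` rather than
# Python-side timing, so treat this as an approximation.
import time

import duckdb
import numpy as np

RUNS = 200

con = duckdb.connect("tpch.db")
con.execute("INSTALL tpch")
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 10)")            # populate the TPC-H tables

for query in range(1, 23):                    # Q1..Q22
    timings_ms = []
    for _ in range(RUNS):
        start = time.perf_counter()
        con.execute(f"PRAGMA tpch({query})").fetchall()
        timings_ms.append((time.perf_counter() - start) * 1000)
    p50, p75, p90, p95, p99, p100 = np.percentile(timings_ms, [50, 75, 90, 95, 99, 100])
    print(f"Q{query}: p50={p50:.1f} p75={p75:.1f} p90={p90:.1f} "
          f"p95={p95:.1f} p99={p99:.1f} p100={p100:.1f} (ms)")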

Results with profiling on:

| Query | p50 (ms) | p75 (ms) | p90 (ms) | p95 (ms) | p99 (ms) | p100 (ms) |
|-------|----------|----------|----------|----------|----------|-----------|
| Q1 | 295 | 296 | 296 | 297 | 304.07 | 433 |
| Q2 | 177 | 188 | 196 | 198.1 | 208.02 | 228 |
| Q3 | 264.5 | 270 | 279 | 282.05 | 287 | 413 |
| Q4 | 224 | 226 | 228 | 229 | 234.03 | 336 |
| Q5 | 710 | 714 | 720.2 | 729 | 737.08 | 867 |
| Q6 | 64 | 65 | 65 | 65 | 65.01 | 172 |
| Q7 | 672 | 675 | 678 | 680 | 690.05 | 861 |
| Q8 | 346 | 348 | 350 | 351 | 354 | 529 |
| Q9 | 3202.5 | 3237.25 | 3303.4 | 3336.1 | 3398.28 | 3484 |
| Q10 | 831 | 835 | 840.1 | 845 | 849 | 1052 |
| Q11 | 122 | 123 | 123 | 124 | 126 | 163 |
| Q12 | 163 | 164 | 165 | 165 | 165.01 | 279 |
| Q13 | 877 | 884.25 | 896.1 | 903 | 912.03 | 2172 |
| Q14 | 163 | 166 | 168 | 170 | 171.28 | 307 |
| Q15 | 104 | 105 | 106 | 107 | 108.01 | 244 |
| Q16 | 163.5 | 165 | 166 | 167 | 168 | 194 |
| Q17 | 229 | 229 | 230 | 230 | 231 | 345 |
| Q18 | 1945 | 1956 | 1967 | 1977.25 | 1992.04 | 2074 |
| Q19 | 298 | 300 | 303 | 304 | 323.01 | 479 |
| Q20 | 148 | 150 | 151 | 152 | 153.02 | 305 |
| Q21 | 2027 | 2054 | 2072.5 | 2094.5 | 2151.19 | 2195 |
| Q22 | 145 | 146 | 147 | 147 | 155.13 | 170 |

Results with profiling off:

| Query | p50 (ms) | p75 (ms) | p90 (ms) | p95 (ms) | p99 (ms) | p100 (ms) |
|-------|----------|----------|----------|----------|----------|-----------|
| Q1 | 294 | 294 | 295 | 295 | 296 | 453 |
| Q2 | 174.5 | 183 | 189 | 193 | 201.09 | 213 |
| Q3 | 257.5 | 263 | 272.1 | 274 | 284.04 | 398 |
| Q4 | 222 | 224.25 | 226 | 227 | 229.01 | 315 |
| Q5 | 703 | 706 | 709 | 712.05 | 732.04 | 862 |
| Q6 | 65 | 65 | 65 | 65 | 65.01 | 169 |
| Q7 | 665 | 669 | 672 | 673 | 678.06 | 849 |
| Q8 | 344 | 346 | 349 | 350.05 | 352.02 | 520 |
| Q9 | 3208 | 3251.25 | 3287 | 3316.35 | 3391.05 | 3425 |
| Q10 | 837 | 843 | 848 | 852.05 | 859.03 | 1560 |
| Q11 | 122 | 123 | 123 | 124 | 124.01 | 141 |
| Q12 | 165 | 166 | 167 | 167 | 168 | 299 |
| Q13 | 893 | 899 | 904.1 | 911.05 | 922.01 | 2243 |
| Q14 | 159 | 161 | 163.1 | 167 | 169.06 | 304 |
| Q15 | 104 | 104 | 104 | 105 | 105.01 | 241 |
| Q16 | 164 | 165 | 166 | 166 | 168.02 | 193 |
| Q17 | 229 | 230 | 230 | 231 | 231 | 346 |
| Q18 | 1956 | 1966 | 1975.1 | 1988 | 2024.73 | 2101 |
| Q19 | 300 | 302 | 304 | 305 | 308.02 | 479 |
| Q20 | 148 | 149 | 150 | 151 | 152.02 | 323 |
| Q21 | 2045 | 2073 | 2113 | 2129.25 | 2161.01 | 2166 |
| Q22 | 147 | 148 | 149 | 150 | 150.01 | 169 |

As expected, there are some outliers at p100 where the non-profiled run is slower because of statistical noise, but the general performance impact is consistently within a few percentage points.

Analysis with Drilldown

Let’s see if making that performance tradeoff gave us any useful insights. 

Using the Profiles Drilldown screen again, we can filter for the duckdb executable by label and see the distribution of own/nested CPU time across functions:

[Screenshot: a list of executables and functions with associated CPU time and percentages]

Further context is provided by the flame graph view. We can limit it to the ExecutorTask::Execute function here for better visibility:

[Screenshot: a flame graph visualizing the time distribution of tasks, totaling 3.97 hours]

Thanks to stack walking and demangling, we see a clear list of functions that use the most CPU in our scenario. If we were to attempt to optimize the runtime of the TPC-H benchmark in DuckDB, our first candidates for optimization would be:

  • duckdb::JoinHashTable::GetRowPointers
  • duckdb::JoinHashTable::InsertHashes
  • duckdb::JoinHashTable::ScanStructure::AdvancePointers
  • duckdb::GroupedAggregateHashTable::FindOrCreateGroupsInternal
  • duckdb::(anonymous namespace)::StringCompress<duckdb::uhugeint_t>

Benchmark with PostgreSQL

DuckDB is a relatively niche product, so let’s compare the impact on PostgreSQL. After installing it on the same machines and running pgbench, we can see a similar picture.
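The exact invocation isn’t shown in the output below, but based on the reported settings (scale factor 100, one client, one thread, 1,000 transactions), a run along these lines should reproduce it; the database name bench is a placeholder:

# Hedged reproduction of the pgbench run reported below, driven from Python.
# Flags are inferred from the output (scale 100, 1 client, 1 thread, 1000
# transactions); "bench" is a placeholder database name.
import subprocess

subprocess.run(["pgbench", "-i", "-s", "100", "bench"], check=True)   # initialize the schema
result = subprocess.run(
    ["pgbench", "-c", "1", "-j", "1", "-t", "1000", "bench"],
    check=True, capture_output=True, text=True,
)
print(result.stdout)   # latency average, initial connection time, tps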

Results with profiling on:

pgbench (16.10 (Ubuntu 16.10-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 1.292 ms
initial connection time = 1.764 ms
tps = 774.093266 (without initial connection time)

Results with profiling off:

pgbench (16.10 (Ubuntu 16.10-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: 
scaling factor: 100
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 1.252 ms
initial connection time = 1.757 ms
tps = 798.648049 (without initial connection time)

For Postgres, the performance impact of profiling is within 4%. One caveat is that a custom, non-stripped build of Postgres is required to enjoy full stack walking capabilities. However, immediate access to detailed performance data may be worth the tradeoff.
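As a quick sanity check on that figure, the throughput drop can be computed directly from the tps numbers reported above:

# Throughput drop computed from the two pgbench runs above.
tps_off, tps_on = 798.648049, 774.093266
print(f"{(tps_off - tps_on) / tps_off:.1%}")  # -> 3.1%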

Outcome

With off-the-shelf software components, we were able to get a detailed look into the performance of a highly optimized codebase, and get clear targets for optimizing the CPU usage. The observed performance impact was minimal — within a few percentage points.

What we didn’t explore

There are a number of capabilities that we didn’t explore here, but are worth pointing out:

  • Off-CPU profiling: Both OpenTelemetry and Alloy profilers can also monitor off-CPU events, meaning periods when a thread is not actively executing on the CPU.
  • Performance comparison: Grafana Cloud Profiles supports the diff view to analyze changes in application performance over time.

These capabilities can further enhance the value teams get out of eBPF profiling.

Wrapping up

The state of observability is constantly evolving. Just a few years ago, getting real-time visibility into the performance of a native codebase in production required a continuous investment in custom-built tools — and specialized engineering skills.

Now, thanks to engineering efforts across the community, teams can gain detailed and comprehensive insights into native codebases with off-the-shelf components and continuous profiling. 

I invite the community to try the tools and services described in this blog post and to provide feedback; we are always listening on our community Slack.
