How to boost observability ROI with continuous profiling and Grafana Drilldown
For the longest time, observability centered on logs, metrics, and traces, but the growth of more complex systems has made continuous profiling another essential part of maintaining healthy systems. It provides visibility into resource usage and latency down to the code level, delivering key insights for improving performance.
Although the popularity of continuous profiling is on the rise, not everyone realizes that it can be as valuable as, if not more valuable than, the more established observability tools. In a GrafanaCON 2025 presentation, Ryan Perry, one of the founders of Grafana Pyroscope, the open source database behind Grafana Labs’ profiling tool, Grafana Cloud Profiles, set out to change some minds. He highlighted the benefits of Profiles Drilldown (a queryless experience for Pyroscope), and why it’s worth caring about continuous profiling. One big reason: It can reduce spending.
Perry also believes observability works best when you have a whole suite of tools at your disposal, because “you end up getting the best tool for whatever task you’re trying to do—which sometimes is profiling, sometimes it’s traces, logs, metrics, load tests, whatever that might be.”
Profiling is experiencing a shift, and during GrafanaCON, Perry noticed a lot of people mentioning they were doing some sort of profiling, whether that was on data in Amazon S3 or with a small pprof file on a desktop.
“We’re seeing a much more rapid shift from that level of profiling to a more continuous sense, where you can have something that goes wrong in production and you don’t have to try to recreate it to get a profile associated with that,” Perry said. “Instead, you just query the Pyroscope database and you can pull profiles of what’s going on from a performance standpoint from your application, or pull flame graphs.”
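The ad-hoc, one-off profiling Perry contrasts with the continuous approach looks something like the sketch below. It uses Python's stdlib `cProfile` as a stand-in (the talk's examples are pprof-based, so this is an illustration of the workflow, not of Pyroscope itself): you wrap a suspect piece of code, run it once on your machine, and read a one-time report.

```python
import cProfile
import io
import pstats

def slow_concat(n):
    """A deliberately inefficient function to profile."""
    s = ""
    for i in range(n):
        s += str(i)  # repeated string concatenation
    return s

# One-off, "desktop" profiling: wrap a single run and dump a report.
profiler = cProfile.Profile()
profiler.enable()
slow_concat(20_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The limitation Perry points out is exactly this shape: the profile only exists for runs you thought to wrap. A continuous profiler instead collects this data from production all the time, so you can query it after the fact.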
Profiling’s ROI value
When it comes to observability data, Perry said one big question drives a lot of conversations: How much are you collecting versus how much value are you getting back from that data?
One reason Perry thinks profiling is unique compared to the other signals “is that it’s the only tool that really directly optimizes both sides of the ROI equation.”
The information you get from flame graphs can help you increase throughput, improve user experience, and reduce latency—all of which can increase revenues derived by your application, Perry said.

The benefits on the cost side, meanwhile, are unique to profiling, Perry said, because “you’re able to basically see where your costs are going.” It’s commonly used for memory and CPU, but it also can be used to see how much any resource is costing, which helps optimize and minimize those costs.
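As a small stand-in for the memory side of that cost visibility, Python's stdlib `tracemalloc` can attribute allocations to the line of code responsible. This is an illustrative sketch, not how Pyroscope collects memory profiles; `build_cache` is a made-up example of a component quietly holding resources.

```python
import tracemalloc

def build_cache():
    # Hypothetical component that quietly holds a lot of memory.
    return {i: "x" * 100 for i in range(50_000)}

tracemalloc.start()
cache = build_cache()

current, peak = tracemalloc.get_traced_memory()
# The top statistic points at the allocating line, not just the total.
top = tracemalloc.take_snapshot().statistics("lineno")[0]
tracemalloc.stop()

print(f"current={current} bytes, peak={peak} bytes")
print("biggest allocator:", top)
```

The point mirrors Perry's: seeing *where* a resource is spent, rather than only how much, is what makes the cost optimizable.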
“A lot of times—as we’ve seen with many people using this,” Perry added, “it tends to be some easy, low-hanging fruits of things that you didn’t realize were using as much resources as they are.”
Proactive vs. reactive use
Another benefit of profiling that Perry covered was how it can be used as both a proactive and reactive signal.
“With metrics, logs, and traces, a lot of times you’re going to get most of your value from those,” he said. “When something’s on fire, you’re able to get an alert and you kind of go down the path of investigating it.”
On the other hand, as soon as profiling is turned on, you can proactively discover things you wouldn’t otherwise know, such as the biggest bottleneck in your application, or which component will fall over first and run out of memory when traffic spikes.
“On the opposite side,” Perry explained, “when something does happen, now you already are familiar with it.” As a result, you can react to an incident or a latency issue or a customer request and use profiling as needed.
Profiling and cost cutting
According to Perry, the number one question the Grafana Labs team hears when people try using Pyroscope for the first time is, “What’s the overhead and how much is it going to cost me?”
The answer: “What we usually say—and what is true about it—is that it is very low overhead because we’re using sampling profilers,” he said.
For people who had problems with how much overhead profiling added to their application in the past, Perry said it’s likely they were using a different kind of profiler. “This one is just simply sampling the stack trace a hundred times per second to be able to understand what’s going on in your application, and then it’s sending that off to whatever backend.”
Profiling happens in a separate process, unlike metrics, logs, traces, and other signals, where code is usually inserted into the execution path of an application. By default, that incurs much more overhead than profiling ever would, he explained.
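The sampling approach Perry describes can be sketched in a few lines of Python. This toy sampler is an illustration of the technique, not Pyroscope's implementation: it runs a workload in a thread and, roughly 100 times per second, records which function is on top of that thread's stack, instead of instrumenting every call.

```python
import collections
import sys
import threading
import time

def sampling_profiler(target, interval=0.01, duration=0.5):
    """Run `target` in a thread and periodically record the function on
    top of its stack -- the core idea of a sampling profiler, which grabs
    ~100 stack traces per second rather than instrumenting each call."""
    counts = collections.Counter()
    worker = threading.Thread(target=target)
    worker.start()
    deadline = time.monotonic() + duration
    while worker.is_alive() and time.monotonic() < deadline:
        frame = sys._current_frames().get(worker.ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1  # tally the leaf function
        time.sleep(interval)  # 0.01s interval ~= 100 samples/second
    worker.join()
    return counts

def hot_loop():
    """A deliberately CPU-bound function the sampler should catch."""
    x, deadline = 0, time.monotonic() + 0.2
    while time.monotonic() < deadline:
        x = (x * 31 + 7) % 1_000_003

samples = sampling_profiler(hot_loop)
print(samples.most_common(3))
```

Because the sampler only wakes up briefly between sleeps, the workload runs essentially untouched, which is why sampling profilers can stay in the low single digits of overhead while still surfacing the hottest functions.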
Perry noted that one of the most common things people discover when they add profiling for the first time and start using Pyroscope is that they were, perhaps, accidentally spending 10% of their CPU logging something useless that shouldn’t have been logged, or they had too many trace spans, which were causing issues.
Having that visibility into what your application is actually spending its resources on is, in Perry’s admittedly biased opinion, “always worth the 1% overhead to understand what’s going on with the other 99% of your resources that you’re spending.”

As the bar chart above illustrates, observability is just a sliver of the total savings profiling can deliver. Messaging and queuing systems are one of the most popular areas where profiling can save money. “If you ever have overflowing queues, Kafka consumers, Sidekiq workers—whatever your queuing system is—profiling is always a very, very good tool because it’s notoriously an area where you can just send something to your asynchronous land and just forget about it.” That can lead to a lot of bloat and problems, “and profiling will help you fix that,” Perry said.
Real-world use cases
To help highlight the performance and cost-saving benefits of profiling, Perry discussed two companies that have used it successfully: Uber and Shopify.
In the case of Shopify, Perry quoted Elijah McPherson, an engineering director at the company, who told him, “I have no idea how I did my job before continuous profiling.” McPherson is focused on efficiency and performance, which are critical when it comes to providing fast, reliable experiences to Shopify’s customers and the company’s own bottom line. On Black Friday, Shopify processes so much money through its platform that one minute of downtime can mean almost $3 million in lost revenue.
To prepare for one of their holiday shopping seasons, they started using profiling in order to get zero downtime or resolve whatever downtime they had as fast as possible. “They actually had some really cool examples that they found using Pyroscope and profiling too,” Perry said. “They made some improvements to various Rails libraries, system libraries, and things like that. They were able to then decrease their CPU by over 20%, which at that scale is a massive amount that they’re then saving on their infrastructure.”
Uber, Perry explained, used profiling to decrease the number of cores it needed, deploying it across its entire infrastructure. The payoff? The company was able to cut more than 10,000 cores, “saving a lot of money on their infrastructure.”
Getting more value from your profiles with Grafana Drilldown
Perry went on to discuss how you can get even more value from continuous profiling using Profiles Drilldown to “drill down into the root cause of an issue,” which helps users save time and resolve issues faster—all without writing any queries.
Although you start at a high level, you can easily drill down by profile types or the labels for a service. A bottleneck may show up clearly in a flame graph, for example, but there are other areas to look at as well.
Profiles, in fact, let you observe more than you would see using metrics alone. If you notice something using a lot of CPU, for example, you can see which functions are associated with it, and that can tell you “specifically why you’re running out of CPU as opposed to just showing you that you’re using a lot of CPU,” Perry explained. “You know down to the line of code.”
He went on to describe how you can connect the function details section to GitHub and see the actual function that is running and what is going on with it. And he noted that if you’re using traces, you can sort them to see, perhaps, which trace was the slowest, then examine a profile associated with one specific trace span. “You know more than you would with just tracing,” Perry said.
To learn more, check out Perry’s presentation, “Why you should care about continuous profiling and how to get started with Profiles Drilldown in Grafana.” In addition to the benefits outlined above, Perry describes how profile data is efficiently converted to flame graphs and walks through a demo using a sample rideshare app to show what the Grafana Drilldown app looks like and how you would solve an issue inside of a UI.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!