How continuous profiling can help track resource usage, reduce latencies, and more

• 8 Jul, 2022 • 9 min

In 2019, Polar Signals Founder and CEO Frederic Branczyk predicted that continuous profiling would be the future of observability.

Today, he’s making that future a reality with his open source continuous profiling tool, Parca. In this episode of “Grafana’s Big Tent” podcast, our hosts Matt Toback and Tom Wilkie chatted with Frederic about how he got his start in the continuous profiling world and how he’s built an active open source community around Parca.

Listen in to learn how continuous profiling can help you optimize resource usage and reduce latency, what Frederic is still figuring out about building a business, and how investing in automation leads to value.

Note: This transcript has been edited for length and clarity.

The rise of continuous profiling

Tom Wilkie: So Frederic, how did you go from Prometheus to profiling?

Frederic Branczyk: In 2018, I read a paper by Google where they described how by always profiling absolutely everything in their infrastructure, they actually have the knowledge to do something about resource usages. More importantly, they can actually understand what would be the biggest wins.

Google described in this paper that they were consistently able to cut down on infrastructure costs every quarter doing this. From some Googlers, I heard that some of the numbers were multiple percentage points every quarter.

I wanted this tool! And at this time, there was really nothing out there that did this. I was working on these super performance-sensitive pieces of software, Prometheus and Kubernetes, and I just felt like I could have used this.

Tom: I remember in 2019, you and I gave a keynote at KubeCon and you predicted the rise of continuous profiling. Are you just on a mission to fulfill your own predictions?

Frederic: That’s right. It’s like they say – if you want to predict the future, you’ve got to build it yourself, right?

Matt Toback: We’re lucky you didn’t predict something diabolical . . .

Frederic: There’s always time for that! After I read that paper, I saw this opportunity and gap in the market. Because of my experience with Prometheus, I felt like I was in a position to build this tool. So I put together this barely compiling, barely working proof of concept that I also very creatively called ConProf, for continuous profiling.

Then in 2020 I decided to quit Red Hat. There wasn’t any company solely focused on continuous profiling, so I started the company Polar Signals to make it my full-time job.

After we spent some time at Polar Signals understanding the space, the technology, and working with a couple of users and customers, we learned a ton. We compiled all that knowledge into the open source Parca project. Parca is the evolution of the ConProf project. The whole point of Parca is to make everything zero configuration. You shouldn’t need to change anything about your setup, and you should automatically get profiling. That’s the philosophy of the Parca Agent.

What is continuous profiling and how does it work?

Matt: So what exactly is continuous profiling?

Frederic: One example that’s easy to understand is on-demand profiling. Say you see an increase in resource usage – say, CPU usage. And then you take a one-time profile to try to figure out what’s using more resources.

Continuous profiling is essentially doing this all the time. There are a couple of different types of CPU profiles, but the ones we concern ourselves with are sampling profiles. A sampling profile means that about 100 times per second, we look at the current stack trace of the program. Based on the stack traces we observe, we can statistically infer how much time was spent in these functions. That’s what CPU profiling is – just the aggregation of the stack traces that we’ve observed over time.
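To make the sampling idea concrete, here is a minimal Python sketch. It is not from the episode and is not Parca's code; the function names, stack contents, and sample counts are invented for illustration. It counts how often each stack trace was observed and, assuming a fixed rate of 100 samples per second, estimates how much time each function was on the stack.

```python
from collections import Counter

SAMPLE_HZ = 100  # assumed sampling rate ("about 100 times per second")


def aggregate(samples):
    """Count how many times each distinct stack trace was observed."""
    return Counter(tuple(stack) for stack in samples)


def estimated_seconds(profile, sample_hz=SAMPLE_HZ):
    """Estimate cumulative time per function.

    Each observed sample stands for 1/sample_hz seconds, attributed to
    every function present on that stack (a recursive function is
    counted once per sample).
    """
    totals = Counter()
    for stack, count in profile.items():
        for function in set(stack):
            totals[function] += count
    return {function: n / sample_hz for function, n in totals.items()}


# Hypothetical samples: one second of observations at 100 Hz.
samples = (
    [["main", "handle_request", "parse_json"]] * 60
    + [["main", "handle_request", "query_db"]] * 30
    + [["main", "gc"]] * 10
)

profile = aggregate(samples)
seconds = estimated_seconds(profile)

for function, secs in sorted(seconds.items(), key=lambda kv: -kv[1]):
    print(f"{function:15s} ~{secs:.2f}s")
```

The estimate is statistical: a function that runs for less than one sampling interval may never be observed, which is the trade-off that keeps sampling cheap.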

When you have all this data over time, you can compare the entire lifetime of a version of a process to a newly rolled out version. Or you can compare two different points in time. Let’s say there’s a CPU or memory spike. We can actually understand what was different in our processes down to the line number.
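The comparison described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Parca implements it: a profile is modelled as a mapping from a stack (a tuple of hypothetical `file:line` frames) to a sample count, and the diff reports how each stack's share of total samples changed between two points in time.

```python
def diff_profiles(before, after):
    """Return (stack, change in share of samples) pairs, biggest increase first.

    Shares are used instead of raw counts so that profiles covering
    different durations stay comparable.
    """
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    deltas = {}
    for stack in set(before) | set(after):
        share_before = before.get(stack, 0) / total_before
        share_after = after.get(stack, 0) / total_after
        deltas[stack] = share_after - share_before
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical profiles taken before and during a CPU spike.
before_spike = {
    ("main.go:10", "server.go:42"): 50,
    ("main.go:10", "db.go:17"): 50,
}
during_spike = {
    ("main.go:10", "server.go:42"): 50,
    ("main.go:10", "db.go:17"): 50,
    ("main.go:10", "codec.go:88"): 100,  # code path that only appears in the spike
}

changes = diff_profiles(before_spike, during_spike)
worst_stack, worst_delta = changes[0]
print(worst_stack, f"{worst_delta:+.0%}")
```

Because the frames carry line numbers, the stack with the largest increase points straight at the code that changed.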

It’s super powerful, and it’s an extension of the other tools already useful in observability, but it shines a different light on our running programs.

Tom: What kind of overheads are entailed if you’re constantly dumping the stack of the running applications?

Frederic: This is a really big concern for people because they already do logging, tracing, and metrics, and they’re already concerned with overhead. And profiling is traditionally viewed as a heavy operation.

Sampling profiling was the first innovation to help with this overhead. We can adjust the sampling rate — 100 times per second, 50 times per second, etc. — to change the overhead. But ultimately, the biggest reduction in overhead we were able to gain was by fundamentally changing the way we obtain the data. Today, we’re doing this using eBPF.

Using eBPF, we capture only the data we need in kernel space, and we export it from kernel space only every ten seconds. That’s where we saw a huge reduction in overhead. Users are seeing less than one percent of CPU overhead while reclaiming 10 to 30 percent of CPU time. In our experience, it’s almost always a win.
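The reason periodic export is cheap can be shown with a plain-Python simulation. This is not eBPF code and does not reflect the Parca Agent's actual implementation: `StackCountMap` is a hypothetical stand-in for a kernel-side map, and the sampling rate and export interval are assumptions taken from the numbers mentioned in the conversation. Samples are aggregated into counts where they are captured, and only the aggregated counts cross the boundary every ten seconds.

```python
from collections import Counter

SAMPLE_HZ = 100         # assumed sampling rate
EXPORT_INTERVAL_S = 10  # export interval mentioned in the conversation


class StackCountMap:
    """Hypothetical stand-in for a kernel-side map: stack -> sample count."""

    def __init__(self):
        self._counts = Counter()

    def record(self, stack):
        # Cheap per-sample work: increment one counter.
        self._counts[stack] += 1

    def drain(self):
        # Hand the aggregated counts to "user space" and start over.
        counts, self._counts = self._counts, Counter()
        return dict(counts)


def simulate(stack_at, duration_s):
    """Sample stack_at(tick) at SAMPLE_HZ and export once per interval.

    Samples left over after the last full interval are not exported
    in this simplified sketch.
    """
    kernel_map = StackCountMap()
    exports = []
    ticks_per_export = SAMPLE_HZ * EXPORT_INTERVAL_S
    for tick in range(duration_s * SAMPLE_HZ):
        kernel_map.record(stack_at(tick))
        if (tick + 1) % ticks_per_export == 0:
            exports.append(kernel_map.drain())
    return exports


def fake_stack(tick):
    """Hypothetical workload: one sample in four is idle."""
    return ("main", "idle") if tick % 4 == 0 else ("main", "work")


exports = simulate(fake_stack, duration_s=30)
print(len(exports), "exports for", 30 * SAMPLE_HZ, "samples")
```

In this simulation, 3,000 samples turn into three exports of two entries each, which is the kind of reduction that aggregating before exporting buys.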

Lessons learned while building an open source business

Tom: So you talked about the overhead on the application that’s being profiled and how easy it is to get it to profile. It sounds like it’s going to generate a lot of information. How do you store and analyze that?

Frederic: To be entirely honest, we’re still figuring it out. We’ve definitely gone through several iterations, and we get better with every iteration. It’s starting to look like a purpose-built columnar store.

The other part is that there’s a lot of metadata involved: function names, file names, line numbers. And if we look at Kubernetes, that involves millions of lines of code. I think we’ve kind of figured out how we’re going to manage that. It’s essentially a key-value store from which we request huge numbers of keys simultaneously, and we’ve built hyperscaled joins in a distributed database based on key-value stores.

Tom: Very cool. So let’s say I’m tracing all my applications 100 times a second, am I going to be worried about my network bill?

Frederic: The amount of resources needed changes almost every day. We’ve gone from maybe 100 gigabytes down to maybe 20, but we have a lot of plans to reduce it even more.

Tom: Nice, nice. Calculating resources needed has a big impact on how you might make a business out of this – deciding what to bill for and what your margins will be. Can you share how you’re thinking about building a business around Parca?

Frederic: It’s essentially the same thing that Grafana Labs does. We intend to offer continuous profiling as a service, so our customers can focus on their business. We’ll take care of storage, scaling, and maybe a few features on top that are interesting for enterprise customers.

Matt: How have the community’s desires for the project either aligned with or differed from where you wanted to take it?

Frederic: Working at Polar Signals directly with users and customers, we got so much helpful feedback. We had our own ideas, but by working with customers in e-commerce, we realized that things like latency optimizations were going to be huge. They actually don’t care as much about cost saving in their infrastructure; they care way more that they can increase their conversion rates by having lower latencies. That was so helpful in understanding what people actually need.

How to create a community around an open source project

Tom: How did you go about building a community around Parca?

Frederic: I wanted to be really intentional about removing what we call “toil” in SRE land – all the manual, repetitive work that doesn’t really produce value. We wanted to keep the focus on actually creating value.

So everything in the release process is entirely automated. When we tag a release on GitHub, that automatically triggers the release pipelines, publishes the changelog, and pushes the container images to a registry. Once the container images have been successfully uploaded, it redeploys the documentation and re-templates everything with the latest version. With this automation, we’ve been able to create 20-25 releases in the two months since Parca was released. It frees up so much time to focus on producing value.

Tom: One of the things we tried to do when we launched Grafana Loki and Grafana Tempo was to try and be really open and accessible as a development team. I love that style of working. Do you do similar things in Parca?

Frederic: Yes, exactly – this is something we wanted to be very intentional about. So we have Parca office hours every two weeks at 5 pm UTC on Tuesdays. And we use Discord, and it’s a joy to use.

Tom: Do you have any advice for anyone who wants to launch their own open source project and build a community around it?

Frederic: In my experience, everyone craves the instant success with open source projects. But I’ve learned that sticking to something and going with it for the long run means you’re so much more likely to actually produce something useful, because you’ll get user feedback.

The ConProf project was not very well visited in the beginning and the documentation wasn’t particularly good. But it still grew to 800 stars simply because every month or two I did another commit. As long as you’re solving something that you’re genuinely interested in, it really helps to just stick with it.

Don’t miss any of the latest episodes of “Grafana’s Big Tent”! You can now subscribe to our new podcast on Apple Podcasts and Spotify.
