Smarter observability with Grafana Cloud: Collect the data you need, when you need it, and know what it costs

In this session, you’ll learn how Grafana Cloud ensures that you’re collecting exactly the data you need, when you need it – and that you understand what it costs, too.

Grafana Labs VP Engineering Dee Kitchen, Distinguished Engineer Sean Porter, and Staff Product Manager Rich Kreitz will demo the latest additions to the Adaptive Telemetry suite, which dynamically adjusts what data gets ingested based on your usage patterns and signals from your infrastructure. Building on the success of Adaptive Metrics and Adaptive Logs, Adaptive Traces and Adaptive Profiles are designed to make every byte worthy of your attention. They will also showcase new capabilities that help you understand what that data means, including automated performance and cost-optimization recommendations from your profiling data, and an LLM-powered approach to trace analysis that surfaces patterns and spans of interest, so you can spend less time searching and more time solving.

Finally, they’ll introduce our revamped billing experience, featuring granular cost attribution, support for the open source FOCUS standard, and a new alerting flow that helps you avoid surprise bills. These improvements are built with FinOps in mind – empowering teams to understand and own their share of observability costs. Whether you’re chasing performance, reliability, or efficiency, Grafana Cloud’s telemetry backends built on Loki, Tempo, Mimir, and Pyroscope are evolving to do more of the heavy lifting for you.

Dee Kitchen (00:00):

I'm actually only going to be up here for a few minutes, because they asked me to come up here and tell you all about the databases, Adaptive Telemetry, and a lot of the cost attribution side of things as well. I'm covering "SaaS economics reimagined": how to run at scale and how to actually do this in an efficient way. But what I didn't want to do is be up here as a manager trying to tell you things, because who cares? So I actually wanted to bring up two practitioners, and we're going to hear from one and then the other. Sean Porter, who you heard from yesterday in the main keynote talking about Adaptive Telemetry and Adaptive Profiles, is going to come up first and give you a deep dive into what that is, how it works, how you can use it, and how it's going to be valuable to you. There is an underlying theme to this: at Sean's startup, he had a phrase that resonated with us, and the phrase was "Every trace worthy of your attention." As soon as we saw that, we thought, yeah, but how about all telemetry? "All telemetry worthy of your attention." You should only pay for what is valuable to you; you shouldn't bear the cost of the rest. So with that in mind, here's Sean to give you the lowdown.

Sean Porter (01:27):

Alright, it's great to be up here again. Thank you, Dee, for that nice intro. First, I'm going to apologize: some of this is going to be almost verbatim from yesterday's keynote, but it'll really stick this time. Adaptive Telemetry is the grand vision we're all working towards, and it's really about getting rid of the noise, extracting the signal, and making every byte stored worthy of attention, whether that's metrics, logs, traces, or profiles. Adaptive Metrics kicked it all off. We have over 3,900 organizations already using it, and in total we've dropped 16 billion active series, on average 1.4 billion a month now, and that's a big, big number. Last year, Adaptive Logs went generally available. We built Adaptive Logs twice as fast as we built Adaptive Metrics. Already there are over 380 organizations using it, and we've dropped a total of 12 petabytes of less valuable log lines.

(02:42):

Today we're here to announce, well, yesterday we announced Adaptive Traces and Adaptive Profiles, and we're going to dive into those solutions in depth. Tracing is powerful. It provides complete end-to-end visibility as requests traverse your complex, distributed systems. Really, it's the telemetry signal for deep visibility. Story time: it's 3:00 AM and a team member is paged. They wake up all groggy and go to their traces. They see that the checkout system is only succeeding about 12% of the time. They can look through that trace, that whole request narrative, and see that the fraud detection component is timing out or experiencing a high rate of errors. With that, they see there's a call to a third party. They look at that, and it turns out the third party made a very small change to their API at, say, 2:45 in the morning. Small change to the code, deploy it, incident is over and resolved. Tracing in this case gave you the full context and made it really easy to dial in on where you needed to address the problem so you could fix it quickly.

(03:57):

The problem is, traces are very noisy. They produce an overwhelming amount of data that is also cost-prohibitive. Most traces represent normal, successful operations, and you generally don't give a hoot about those. The real insights come from the small percentage of traces around errors, performance issues, unusual patterns, and so forth. Extracting the signal from those and paying attention to them is what you need to do with traces. So with Adaptive Traces, we're focused on keeping only the valuable ones, the ones worthy of your attention. You send all of your traces to Grafana Cloud and we'll intelligently sample them. You get immediate cost savings, it gets rid of all that noise, and it makes it a lot easier to go through your traces and figure out what's going on. It really reduces the toil for the day-to-day operator.

(04:56):

A big benefit of this is that it also makes it a lot easier to instrument your systems. Auto-instrumentation is absolutely fantastic, but boy, something like 70% of that information you probably don't even need or use. So you can let Adaptive Traces take care of that toil and reduce it for you. Engineers spend far less time sifting through their traces, helping them identify issues faster. So really, Adaptive Traces is unlocking the true potential and value of traces for you. Now, distributed tracing inherently relies on sampling. It's a fundamental necessity; there's just too much data. Do you really need all the data? No. Most organizations employ a sampling technique referred to as head sampling. This means making a decision at the very beginning of the request, without any understanding or knowledge of what's going to happen for that request. It's just a sophisticated random method of sampling. Adaptive Traces uses what's referred to as tail sampling. At the end of a trace, we bring it all together and we have the complete context. We can inspect the whole trace and make a decision then, when we know everything about it.
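
To make that distinction concrete, here's a minimal Python sketch (purely illustrative, not Grafana's implementation) of the two decision points: head sampling decides at the start of a request with no knowledge of its outcome, while tail sampling decides only once the complete trace is available.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False

@dataclass
class Trace:
    trace_id: str
    spans: list = field(default_factory=list)

def head_sample(sample_rate: float = 0.05) -> bool:
    # Head sampling: decided when the root span starts, before anything
    # about errors or latency is known, so it is effectively a coin flip.
    return random.random() < sample_rate

def tail_sample(trace: Trace, baseline_rate: float = 0.05,
                slow_threshold_ms: float = 10_000) -> bool:
    # Tail sampling: decided once the whole trace has been assembled,
    # so every span can be inspected before choosing to keep or drop.
    if any(s.is_error for s in trace.spans):
        return True  # always keep traces containing errors
    if max((s.duration_ms for s in trace.spans), default=0.0) > slow_threshold_ms:
        return True  # always keep slow traces (longest span as a rough proxy)
    return random.random() < baseline_rate  # small baseline of "normal" traffic
```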

(06:16):

What's also cool about Adaptive Traces is that we can use AI/ML capabilities to look at these full traces at the end. All that context can be very beneficial for those kinds of techniques and models. With Adaptive Traces, you're essentially shifting away from sampling randomly and hoping, toward intelligently sampling what actually matters, what has value. Thankfully, we have a recommendation engine to help with the creation of sampling policies in Adaptive Traces. A sampling policy is what determines which traces are kept and stored, and which are dropped. So when you start using Adaptive Traces, out of the gate it'll give you these foundational policies. These are very broad strokes, not so interesting in themselves, but they certainly help you get an idea of how Adaptive Traces works and how to create policies. Here are three examples. First, capture a probabilistic representation, 5% of everything. This gives you your baseline, and it will include some of the successful operations so you know what the good stuff looks like. Then we have two other policies to capture errors and slow traces so you can compare against that baseline.
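
As a rough illustration, the three foundational policies described above could be expressed as data along these lines (the field names loosely mirror the OpenTelemetry tail-sampling policy types, not Grafana's exact schema):

```python
# Hypothetical data representation of the three foundational policies
# described above. Field names loosely mirror the OpenTelemetry
# tail_sampling processor's policy types, not Grafana's exact schema.
foundational_policies = [
    {"name": "baseline",    "type": "probabilistic", "sampling_percentage": 5},
    {"name": "errors",      "type": "status_code",   "status_codes": ["ERROR"]},
    {"name": "slow-traces", "type": "latency",       "threshold_ms": 10_000},
]
```

In the tail-sampling model being described, a trace kept by any one of these policies is stored; everything else is dropped.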

(07:32):

What's really neat is that after you start with those, we'll continuously analyze your traces, and from that trace data we'll determine what other policies we should recommend to you. Perhaps you have a service that's consuming 80% of your spend and you're not really getting that much value out of it, so let's adjust that and reduce it. Or perhaps you have a few services that only produce traces once an hour, or once a day, a week, or a month; perhaps we should sample those more often. Of course, you can always use custom policies to really laser-focus. You and your teams know what's important to you, and you can represent that as policy. Now, if you and your teams are already familiar with OpenTelemetry and you're attempting to do tail sampling, all of those policies will just work out of the box with Adaptive Traces: copy, paste, away you go. These are a few of the policy types we support, to give you an idea of what capabilities are there for you to use.

(08:37):

Shortly after creating your first policy, you'll see a dramatic reduction in the amount of trace data stored in Tempo. These numbers on the screen are actually from our internal ops cluster. They represent our average rates per second: we're doing 2.5 gigabytes per second in this internal cluster and storing just shy of 400 megabytes a second, so that's an 84% reduction. There are times we get spikes where we go up to seven gigabytes per second, but this is the average. It gives you an idea of the scale and also the effectiveness of Adaptive Traces in action. As mentioned earlier and during the keynote, we continuously analyze your traces and produce recommendations for you, but we also use AI and ML capabilities to monitor your services for weird behavior, for service anomalies. This infographic, if you will, represents a P90 latency for a service. You have your normal operation, then it leaves its steady state, starts thrashing, starts changing, then spikes, and then an incident is triggered. So we have systems automatically looking, at scale, at all of your services for this kind of anomalous behavior, so that we can automatically capture the relevant traces, preemptively assuming that this behavior is eventually going to lead to an incident.
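
For reference, the quoted reduction falls straight out of those two rates; this just reproduces the arithmetic from the ops-cluster example:

```python
ingested_gb_per_s = 2.5   # average rate received by the samplers
stored_gb_per_s   = 0.4   # "just shy of 400 megabytes a second" written to Tempo

reduction = 1 - stored_gb_per_s / ingested_gb_per_s
print(f"{reduction:.0%}")  # -> 84%
```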

(10:11):

So here's an example of one in action. If you go to use Adaptive Traces today, this is what you'll see under policy details. For anomaly policies, the top is a little hard to see at the size of the screen here, but in this case we're using a machine learning model to forecast P90 latency, and we're not so interested in capturing traces above that threshold; we're interested in capturing traces when that threshold changes. So when it spikes, we automatically create a policy that will capture traces with spans matching those attributes, and then we cap it. Maybe an anomaly actually equates to 80% of your trace volume; you wouldn't want to capture everything and blow up your bill or make Tempo keel over. So we actually limit the intake. You can see the forecast, and anything that falls outside of it we're going to capture.
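
A very rough sketch of that behavior, assuming a forecast band already exists for a service's P90 latency (the forecasting model, the policy shape, and the cap are hypothetical placeholders, not Grafana's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Band:
    lower_ms: float
    upper_ms: float

def on_p90_observation(service: str, observed_p90_ms: float, band: Band,
                       active_policies: dict, max_traces_per_s: int = 50) -> None:
    """Add a temporary, capped capture policy when the observed P90 latency
    falls outside the forecast band for a service."""
    if band.lower_ms <= observed_p90_ms <= band.upper_ms:
        return  # inside the forecast band: nothing to capture
    active_policies[service] = {
        "type": "anomaly",
        "match": {"service.name": service},    # capture traces touching this service
        "rate_limit_per_s": max_traces_per_s,  # cap intake so the bill doesn't blow up
    }
```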

(11:14):

And what's really neat is we have this deep investigative workflow, so it's really easy to see, hey, I have an anomaly policy, or I had one. Click the drilldown button, it takes you into the traces drilldown, and you actually see the relevant traces that were captured. It makes it really easy to do investigative triage with traces. Alright, so first I'm going to show you what it looks like to get started with Adaptive Traces. It is very straightforward. Out of the gate, we're going to recommend a single policy called First Policy. I know, not very creative, but it's going to do probabilistic sampling, so it's going to capture a baseline of everything. All I have to do is click this button, apply recommendation. Unfortunately somebody hijacked my stack earlier today, so there's a little bit of weird data here, but you'll see that with that first policy, we start to have a reduction in the traces going through.

(12:13):

Now, out of the gate we recommend those foundational policies, and I can apply them with the click of a button. So if I look at this one here, it's just recommending, hey, let's get slow traces, anything greater than 10 seconds. If I click apply, it creates the policy, and I can go look at my policies. There's that first policy we created by turning on Adaptive Traces, so we're capturing 5%, and then we're capturing slow traces. What's really important to note: I'm sure some of you are looking at that 5% and thinking it's either far too high or far too low for us. No problem. It's quite straightforward to edit that policy and make a change to it. Let's say 10%, but please, for the love of God, rename the policy to something more meaningful, like "sample 10%", and update. There we go. And what's interesting is that the edit is immediately applied to your sampling.

(13:21):

So in the case of our ops cluster, the one I showed the numbers for earlier, we're doing 2.5 gigabytes per second. Some fun nerdy tidbits: there are several hundred samplers behind the scenes, and when you edit that policy, we immediately apply it to all of them. What's also fantastic about Adaptive Traces is that it is extremely robust and reliable. If you and your teams are familiar with OpenTelemetry, while it's fantastic and has all these capabilities, it is hard to make a robust pipeline with those components. While we're leveraging OpenTelemetry standards and componentry with Adaptive Traces, we've made an incredibly robust system that can take these live updates without dropping and losing data. If I go back to the overview page, I can go apply, well, let's just dismiss that one; that's an artifact of a weird bug. And this one I'll apply to grab traces containing errors. Wow, that's an impressive reduction: we're at almost a 99% reduction rate. This will change and settle in; it really depends on what policies are in play. So I have another stack already running. This one has a few more policies, including a current anomaly that's been detected. This stack is using the OTel demo to produce data. If I look in here, let's choose an interesting one. We can just click through.

(15:01):

Alright, this one works pretty well. So this is the checkout cart. You can see the forecast down below, and any time things have fallen outside of what we expected, outside these bands, you can see that those peaks on the forecast correspond with us capturing traces that are relevant to it. If I click drilldown, I can continue that investigative workflow, see what the breakdown looks like and which services were actually involved with that anomaly, and I can look at the traces themselves. This is a very boring trace, not a good one to look at, but you get a sense of the capabilities here. At any time you can also go in here and create a new policy; let's call it "important service". I like the name of that one. Again, if you're familiar with OpenTelemetry and the tail sampling processor that exists there, we support all those policies.

(16:15):

More often than not, you're going to be using what's called an "and" policy type. This allows you to combine multiple policies together. So here in the example, it's giving me a string attribute policy that matches on service name. Perhaps you have an aspect of your infrastructure, or services, that are just too critical to sample too aggressively, or they're too noisy, and you want to drill in on a specific service. And here the example says let's apply a probabilistic sampling policy in addition to that: the traces have to touch that service, and we'll keep 20%. There are a number of policy types in here. You can do rate limiting. You can even, if you're really familiar and really deep in the weeds with OTel, use OTTL conditions. You may not like yourself that much, but sometimes that's what we have to do. So you can really drill in. If you want to learn more about these individual policy types, we have the booth outside; please come by and we can get into the nitty-gritty of those details. Alright, I don't want to use up all the time. Rich, you've got a whole shtick to do as well. Let's switch back to the slides and I'm going to talk about Adaptive Profiles.
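
As a hedged example of the "and" policy being described (field names modeled on the OpenTelemetry tail_sampling processor's and, string_attribute, probabilistic, rate_limiting, and ottl_condition policy types; "important-service" is a made-up service name):

```python
# Hypothetical 'and' policy: keep 20% of the traces that touch a specific
# service. Both sub-policies must match for the trace to be sampled.
important_service_policy = {
    "name": "important-service",
    "type": "and",
    "and_sub_policy": [
        {"type": "string_attribute",
         "key": "service.name",
         "values": ["important-service"]},
        {"type": "probabilistic",
         "sampling_percentage": 20},
    ],
}

# Other policy types mentioned above, for flavor only:
rate_limited  = {"type": "rate_limiting", "spans_per_second": 100}
ottl_filtered = {"type": "ottl_condition",
                 "span_conditions": ['attributes["http.status_code"] == 500']}
```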

(17:36):

Awesome. So yeah, it wasn't enough to do just one. Previously we went from metrics to logs in a year; we had to outdo ourselves and do the last two in a single year. Profiling. Profiling is a very underutilized signal, and most of you probably don't use it, but it's incredibly powerful because it gives you the ability to go from a signal to a line of code. You can point and say: this is the optimization opportunity; changing this line will reduce CPU usage by 5%. And by continuously profiling across your infrastructure all the time, not just looking for optimizations and then shutting it down, it can be an incredibly useful and valuable tool.

(18:23):

So with Adaptive Profiles, you can send all of your profiles, from either Pyroscope SDKs or Alloy collectors, to Grafana Cloud and we'll intelligently sample them. By default, we apply a cost-effective baseline sample rate of only 10%. We've found this fairly aggressive sample rate doesn't compromise the structure of the profiles, meaning you can look at your flame graph and everything looks as it should. Now, when there's an enormous change in the shape of a profile over a window of time, we react to that and increase the sample rate to one hundred percent. Basically, we anticipate the need to look at that profile. It also kicks off a workflow where we run LLM analysis on it and look at what's wrong with it, what's interesting about it, what opportunities relate to it, whether somebody introduced a regression, and so forth.
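
A minimal sketch of that idea, assuming a profile window can be summarized as a mapping from function name to its share of samples (the distance metric and thresholds here are illustrative, not Grafana's actual model):

```python
def profile_shape_distance(prev: dict, curr: dict) -> float:
    """Distance between two profiles summarized as
    {function_name: fraction_of_samples}; 0.0 means an identical shape."""
    keys = set(prev) | set(curr)
    return 0.5 * sum(abs(prev.get(k, 0.0) - curr.get(k, 0.0)) for k in keys)

def next_sample_rate(prev: dict, curr: dict,
                     baseline: float = 0.10, change_threshold: float = 0.25) -> float:
    # Keep the 10% baseline; jump to 100% when the profile shape moves a lot,
    # which is also the moment the insight workflow described above would run.
    return 1.0 if profile_shape_distance(prev, curr) > change_threshold else baseline
```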

(19:26):

I think the true value of this is definitely in these insights. So if we trigger that resolution bump, we run the workflow, and this is an example of the output. Again, this is from a demo stack, so it's not the most interesting profile, but we can make very specific recommendations about the improvements you should make. This one: find nearest vehicle is just calling the time function too often. Maybe you should just call it once and assign it to a variable. It's silly things like this that actually equate to huge reductions in cloud costs. It seems overly simple, but it's things like this (and regular expressions) most of the time.
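
The kind of fix being described is tiny. Here's a hypothetical before-and-after in Python; the function name is invented to echo the demo:

```python
import time

# Before: the clock is read for every vehicle, which a profiler would flag
# as a hot spot inside the (hypothetical) find_nearest_vehicle function.
def find_nearest_vehicle_slow(vehicles, rider):
    candidates = [v for v in vehicles if v.available_since < time.time() - 60]
    return min(candidates, key=rider.distance_to, default=None)

# After: read the clock once, assign it, and reuse it.
def find_nearest_vehicle_fast(vehicles, rider):
    cutoff = time.time() - 60
    candidates = [v for v in vehicles if v.available_since < cutoff]
    return min(candidates, key=rider.distance_to, default=None)
```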

(20:09):

So with Adaptive Profiles, teams can deploy continuous profiling at a scale never seen before, at a lower cost than ever seen before. This is going to give you the ability to optimize your cloud spend and simplify debugging. Alright, a quick demo and then I'll hand it over to Rich. To save myself the trouble of needing a separate stack for this, a disclaimer: this is a screenshot, but it is what you'll see when you come to turn on Adaptive Profiles. All you need to do is click a button to enable it and make sure the LLM plugin is configured for your stack, and that's all it takes. And immediately you'll see roughly a 90% reduction, greater than the 85% average. The reason I quote the average as an 85% reduction is to leave buffer for that anomalous behavior, the spikes in resolution that support the insights workflow.

(21:12):

Here you can see that two days ago there were some insights, but I think we'll go get another service. So, I've applied that 10% baseline to everything, but you can also get more specific per service. If you feel like it's compromising the shape of your profiles, or you just want a higher resolution from the get-go, you can accomplish that. We already have the checkout service and fraud detection, and we can add another here. Now, at any time, if I don't want to wait for an anomaly to occur, I can click a button to increase the sampling rate; you can say you want to increase it for, say, one minute, and give "demo" as the reason. I'm just going to increase it. So just as Adaptive Traces can dynamically apply policies, clicking that button will immediately affect the sample rate for the ingestion of these profiles. Within one minute we'll get some fresh insights in here, but while we're waiting, let's just look at what they can look like up front.

(22:24):

There we go. So this is what an insight typically looks like. It gives you the flame graph and then makes a number of recommendations for how you might want to change your application. If I click the go to drilldown button, you can see it had this sample rate applied to it over this period of time. At any point, I can look at function details, and I could view the code on GitHub if I had that integration working for this component. But in a gist, that's what Adaptive Profiles looks like. You turn it on, it applies that baseline, it looks for anomalies, and when there's anomalous behavior, you get insights. Insights tell you how to change your application to improve its performance. If you're interested in Adaptive Profiles, and I hope you are, you can use this QR code to sign up for the private preview. I'd love to have conversations with you to find out why you're interested in profiles and how this can help you. It's incredibly exciting. If you don't profile today, Adaptive Profiles is a great excuse to start. I'd like to hand it over to Rich, who is going to talk about cost management and billing.

Rich Kreitz (23:43):

Thank you, Sean. Observability is usage-based. This is both a strength and a risk. One chatty service, one debug flag left on, one high-cardinality label, and your costs can spike overnight. Maybe it happens on a Friday deploy. Monday, finance pings engineering and asks, "Did we really mean to spend this?" The answer is usually, we don't know yet. And by the time you do know, it's too late; that money's already spent. These surprises are about more than just numbers. They erode trust. Engineering feels blindsided, finance feels exposed. Conversations that should be about reliability become conversations about budgets. Nobody wants that. So what's the real problem here? Latency has dashboards, errors have SLOs, but costs usually only show up on an invoice. If the first time you see a spike is on a billing statement, you've lost the most important thing you have: time to react.

(24:50):

So we want to make cost a first-class signal. If we can alert when latency passes 500 milliseconds, we should be able to alert when logs cross 500 gigabytes. If we can route incidents to service owners, we should be able to route spend to the teams that created it. And if we can normalize metrics and traces, we should be able to normalize bills so that everyone's looking at the same numbers. This is the shift we want to make with Grafana Cloud. We call it moving from cost chaos to cost clarity, and it happens in three steps. Step one is prevent. We send cost and usage signals to the control loop you already trust. We have easy, human-readable alerts sent to Slack, email, or the webhook of your choice, so you get alerted about your bill while there's still time to act.

(25:48):

Step two is assign. When costs are visible, behaviors change. We use the labels you already have to break one big bill into slices by team and service. Every dollar has an owner, and owners can act. And step three is unify. Every vendor has a different billing language: AWS, GCP, Grafana Cloud. This fragmentation costs your finance and FinOps teams hours to reconcile. With FOCUS, an open standard for cost and usage data, there's one schema that both engineering and finance can use to look at the same numbers the same way. Apples to apples across all your vendors. As we saw with Sean and Adaptive Telemetry, we want you to get the right data at the right resolution so you only pay for signal and not noise. But the real win here is cultural. When costs become visible, behaviors change. Teams can budget, optimize, and look at their spend the same way they deal with latency and reliability. Costs become observable, assignable, and open.

(27:02):

So I'm going to show you a little bit of how we do this in Grafana Cloud. This is our new cost management and billing app, released just last week and now generally available. The team has done an amazing job of visualizing how we present your bill today. The front page gives you a high-level view of your spend and commit, what you're trending month to month, and a bird's-eye view of your breakdown by product. I'm going to click into the usage alerts tab. We've made alerting for billing really easy, all contained within the cost management app. I'm going to go ahead and create a new usage alert. You can now alert on your total spend or on usage, and I'm going to choose logs. Now, as I do this, the graph here shifts to show my current logs usage for the month.

(27:58):

I can also go back in time to see what this trend has been historically, or if I know what my team's budget is, I can put that in as well. So I'm going to choose two terabytes of logs here. Once I put that in, we get the yellow line up there showing what my budget is for the current billing period, this month, and we have predefined alerts for you based on different thresholds. You can add your own here, but I'm going to create alerts because I want to know when my spend gets to 95, 85, and 70% of my budget, and we see those lines show up in the graph. I scroll down, choose where I want the alerts to go, and I'm going to send them to Slack, then create the alerts. That's all it takes. I now have cost management alerts that will fire when my bill gets to these different thresholds within the month. This gives your teams time to act rather than waiting for the end of the month when the bill has already spiked; they can react in real time to address what's driving the spend.
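
Those thresholds are just fractions of the budget; for a two-terabyte logs budget, the alert lines land at roughly these values (illustrative arithmetic only):

```python
budget_tb = 2.0                   # logs budget for the billing period
thresholds = [0.70, 0.85, 0.95]   # the predefined alert thresholds from the demo

for t in thresholds:
    print(f"alert at {t:.0%} of budget -> {budget_tb * t:.2f} TB of logs")
# alert at 70% of budget -> 1.40 TB of logs
# alert at 85% of budget -> 1.70 TB of logs
# alert at 95% of budget -> 1.90 TB of logs
```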

(29:01):

Step two is assign. For assign, we have cost attribution. Cost attribution lets you define labels across all your signals so they map to one team or one service. We let you configure that differently for metrics, logs, and traces, because all these products use different semantics and different ways of using labels, and we can map them all to one team. So if I scroll down, I can see all my different breakdowns for metrics, logs, and traces. We see the usage values, we see the dollar values, and these are mapped to the different labels that I've defined. This is ordered by dollar value, and at the very top we see unattributed. The goal here is for these attributed values to match, cent for cent, your actual bill, so you can actually reconcile this with finance. If you're doing showbacks or chargebacks, it's super important to understand how this works. If you have a bucket of money in unattributed, you're going to want to look at your labeling strategy so you can get teams to label properly and actually charge them.
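
Conceptually, cost attribution is a group-by over usage records keyed by a label such as team, with anything missing the label landing in an unattributed bucket. A small sketch under that assumption (the record shape is invented for illustration):

```python
from collections import defaultdict

# Invented usage records for illustration; in practice these come from your
# metrics, logs, and traces usage, carrying the labels you already emit.
records = [
    {"signal": "logs",    "team": "checkout", "cost_usd": 412.50},
    {"signal": "metrics", "team": "payments", "cost_usd": 231.10},
    {"signal": "traces",  "team": None,       "cost_usd": 98.40},  # missing label
]

def attribute_costs(records):
    totals = defaultdict(float)
    for r in records:
        totals[r["team"] or "unattributed"] += r["cost_usd"]
    return dict(totals)

print(attribute_costs(records))
# {'checkout': 412.5, 'payments': 231.1, 'unattributed': 98.4}
```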

(30:03):

We also support different months. This view is the month-to-date billing period; you can go back in time to see what your attribution was in previous periods. And of course we have an export to CSV, so if you want to import this into spreadsheets or other tooling, we've made that really easy as well. The whole goal with cost attribution is to give teams visibility into their spend so they can manage it, know where they're at, and actually budget for it. If we go back to the slides, the third step is unify. Today, every vendor has a different billing language, different values, different ways they roll things up. That costs teams lots of time reconciling spreadsheets: they're moving data, transforming it, spending hours and hours trying to match what your spend is between CSPs, observability, and all your SaaS products. FOCUS is an open standard for cost and usage data, and Grafana Labs loves open source, so this was a no-brainer for us to adopt. It gives you a common schema so that all your data comes in one format, in the tools you already use. No more spreadsheet wrangling, no more parsers, no more one-off sheets. It gets all the data together so that your teams can spend less time manipulating data and more time understanding your spend and optimizing it.
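
To give a feel for the one-schema idea, here's a hedged sketch of normalizing a vendor line item into a handful of FOCUS-style columns (column names such as BilledCost, ServiceName, ProviderName, and ChargePeriodStart appear in the FOCUS specification, but check the spec for the full, exact schema; the vendor record here is invented):

```python
def to_focus_row(vendor_item: dict) -> dict:
    """Map one vendor-specific line item onto a handful of FOCUS-style
    columns so engineering and finance can compare spend apples to apples."""
    return {
        "ProviderName":      vendor_item["vendor"],
        "ServiceName":       vendor_item["product"],
        "BilledCost":        float(vendor_item["amount"]),
        "BillingCurrency":   vendor_item.get("currency", "USD"),
        "ChargePeriodStart": vendor_item["period_start"],
        "ChargePeriodEnd":   vendor_item["period_end"],
    }

example = {
    "vendor": "Grafana Cloud",
    "product": "Logs",
    "amount": "1234.56",
    "currency": "USD",
    "period_start": "2025-05-01T00:00:00Z",
    "period_end": "2025-06-01T00:00:00Z",
}
print(to_focus_row(example))
```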

(31:32):

So let's bring all this together. As we saw with Sean, Adaptive Telemetry lets you pay for signal and not noise. Alerts help you prevent surprises, so you can catch spend early and actually do something about it. Cost attribution lets you assign cost to different teams so that they can own it. And FOCUS is an open standard that lets you look at all of your billing data in the same format, so you can concentrate on optimizing your spend rather than manipulating data. The outcome is that all of your costs become observable, actionable, and fair. That's Grafana Labs. Thank you very much.
