
Opening Keynote

Join Grafana Labs CEO Raj Dutt, CTO Tom Wilkie, and engineering leaders to kick off ObservabilityCON 2025 with the latest in AI-powered observability in Grafana Cloud. Learn how we’re evolving our open observability cloud to help teams detect, understand, and act faster. At Grafana Labs, open has always been our strategy: open source, open standards, open data, and open minds. As complexity grows and AI reshapes how teams build and operate software, learn how that open strategy is helping organizations get value from their telemetry and turn signals into action – and observability into competitive advantage.

Raj Dutt (00:00:00):

Hello everyone. My name's Raj, as was just announced, one of the co-founders and the CEO of Grafana Labs. It's my great pleasure to welcome you to ObservabilityCON 2025 here in London. Is everyone excited to be here? Make some noise. Yeah, fantastic. I'm really excited to be here. I'm excited, I'm proud, I'm grateful. A whole bunch of emotions. It's really nice to be back here in London. Let me find the clicker. And yeah, so welcome again. We've got a great keynote for you. You're going to be hearing from all sorts of highly credible, highly smart, really passionate people today. I'm really proud to work alongside them.

(00:00:44):

Before we get started, I just wanted to do a quick thank you to our sponsors, Alibaba Cloud, Luciq, Embrace, Causely, Nearform, and a special thanks to our Guru and Pioneer sponsors, Google Cloud and AWS. I have to say, Google Cloud and AWS, our relationship with these hyperscalers is very interesting. We have valued partnerships with them, they're customers, they're vendors. We certainly spend a lot on AWS and GCP, but they're also competitors, right? So hashtag it's complicated. As an open source company, it's often kind of interesting to navigate, but we have a tremendous relationship with Google Cloud and AWS and we really appreciate the partnership. So thank you to all our sponsors for this event. If it wasn't for them, we wouldn't be able to put this on at the prices that we do, so please give it up for our sponsors. Thank you.

(00:01:42):

Alright, so you're at ObservabilityCON here in London. This is the event where we talk about our flagship product that we've been building over the last nine years now: Grafana Cloud, of course. We also have another annual flagship event called GrafanaCON. If you thought you were at GrafanaCON, I apologize. GrafanaCON is.. where's GrafanaCON next year? Barcelona next year, and it should be fun. You're all welcome to join. GrafanaCON is a little bit more of our open source centric event, where we talk about the LGTM stack, all the open source software that we're developing. Grafana Labs as a company, we're absolutely an open source company. We really believe in open source and we'll talk more about that in a little bit, but we put on a tremendous amount of events every year. I think in the last year we've put on several hundred events all over the world. We like to say we're not so much a multinational company as a post-geographic company. Sounds cooler, more futuristic. But one of my great joys is being constantly on the road and visiting Grafanistas, 1,500 Grafanistas all over the world in over 40 countries. I live life on an airplane and I really enjoy it.

(00:03:02):

It's been a tremendous last year at Grafana Labs. Quite a lot of accomplishments. I'd like to share just a few highlights from the last year. Within our open source community, which continues to grow, we're so privileged to be at the center of this really dynamic and vibrant ecosystem and community. At this point, there are over 25 million users of Grafana, and that is mostly open source. Over 99% of those users are open source users. They don't pay us anything. That's by design. That's part of our business model. But we do have over 7,000 customers, which is pretty incredible. There are about a million companies using our software, so if you do the math, you can quickly calculate that less than 1% of the organizations in the world are customers of Grafana Labs. Again, that's by design, right? Speaking of customers and customer revenue, we reached $400 million in annualized revenue last year. This is really an incredible accomplishment. Now, who here is a customer?

(00:04:09):

That's really cool. We really appreciate you helping us get to that number, by the way. And we'd like to get to $500 million as soon as possible. I know we have a few of our account team and salespeople around, so please help us get to $500 million relatively soon. Half a billion would be pretty cool. We've also seen good recognition from organizations like Forbes. We got on the Cloud 100 list this year for the fifth straight year, and we moved up 10 spots to number 13. Same thing with InfraRed: they ranked us as a top transformative company in cloud infrastructure. But the thing that was a pretty big deal internally for us just a few months ago was the change in Grafana Labs' position within the Magic Quadrant for observability. Now, this is the second year we've been in the Magic Quadrant, but this year we saw a dramatic change in our position.

(00:05:01):

We moved up and to the right pretty significantly, and we're really proud of this. This moment in many ways has been sort of 10 years in the making for us. We started with this stack of open source software, started building Grafana Cloud, and it's really quite cool to be all the way to the right, ranked by Gartner as the best in terms of completeness of vision. There are a lot of companies clustered around there that we really respect and admire, and honestly continue to do so. Alright, so what we're building, what Gartner's recognized, what we're all talking about today, is really Grafana Cloud, which is the open observability cloud. The key word here is open, and we don't just mean open source. Of course, we're an open source company. We build a lot of open source software. Probably 80% of the engineering work we do is open source, but it's also about open ecosystems, right?

(00:05:58):

We talk about Big Tent: bringing data together no matter where it lives, which is pretty unique in the observability world. We don't require you to store your data on our platform. Open standards, also very important, right? We were involved in projects that we didn't even create: Graphite many years ago, Prometheus more recently. These are projects that we contribute to upstream. We really believe in the ecosystem. We try to be good citizens. And now most recently, OTel, OpenTelemetry, is such an important standard that we're really participating in, and the stats in terms of our contributions to OTel speak for themselves. And then lastly, open culture. We're extremely transparent as a company and we try to be really upfront and direct with customers and community. Hopefully you feel that when you interact with us. And all of this also ties into the way that we build software, right?

(00:06:55):

We're really inspired by open source projects, right? If you look at projects like Linux and Kubernetes, these are world-class projects that were developed by groups of people around the world who are highly motivated to work asynchronously and develop world-class software. And that's the inspiration for how we do it at Grafana Labs. So there are three themes that we're going to talk about today related to Grafana Cloud, our open observability cloud. One is SaaS economics. Raise your hand if you're frustrated with the ballooning bills of many observability vendors. Yup, so this is top of mind for us and we're going to talk more about this today. Two, complexity simplified: how do we use everything that we have, all the data that we have, all the tools that we have, to simplify what we acknowledge is an increasingly complex environment? And then finally, actually useful AI.

(00:07:52):

And the emphasis here is on "actually useful." Grafana Labs as a company, we're kind of allergic to hype. We're kind of allergic to buzzwords. We wanted to make a lot of noise a year or two ago about AI, AI, AI. We wanted to access the budgets of people that had the mandate to just spend money on AI, but we didn't have anything that we really felt good about, that would really provide a lot of value beyond hype. I think that's changed, and we're really excited to share that with you. So without any further ado, since I am behind on time, I would like to welcome to the stage Tom Wilkie, our CTO, and Sean Porter, our Distinguished Engineer, to tell you more about these topics. Over to you, Tom and Sean. Thank you so much.

Tom Wilkie (00:08:41):

Thank you, Raj. Alright then, we are going to talk about re-imagining SaaS economics, Sean. Awesome. Brilliant. So as Raj gave you a bit of a spoiler, you all know your observability costs and the volume of your telemetry have really been rising exponentially, right? And the common 80/20 rule applies: you're not actually using 80% of the data you're sending to your observability systems. This is something that annoys us. We are actually our own biggest customer. We run a massive instance of Grafana Cloud just to monitor the rest of Grafana Cloud, and we've felt this problem ourselves. So I think we are one of the few, if not the only, observability companies out there to really prioritize this and use not just clever pricing and weird packaging to help you, but also technology to actively reduce your bill and optimize the value you're getting from Grafana Cloud. I'm not an engineer anymore, let's face it. So I'm going to hand over to a real engineer who is actually building this stuff and let Sean take you through it.

Sean Porter (00:09:43):

Thank you, Tom. Awesome. So the grand vision that we're working towards is what we're calling Adaptive Telemetry. Basically a world where we've pruned away the noise and extracted the signal from the data spewing from your systems. The fire hose..

Tom Wilkie (00:09:59):

Very graphic.

Sean Porter (00:09:59):

Oh no, I mean you got to imagine it. So yeah, basically tapping into that fire hose of observability data. This vision is only made possible through the use of AI and machine learning tools. These tools and capabilities allow us to determine the behavior of your systems and your users at scale. This allows us to make informed decisions about what we can keep, drop, aggregate, and so forth. Today I have the pleasure of introducing not one but two new additions to the Adaptive Telemetry suite. Tom, we're running out of signals. We're going to have to invent another one, basically. But before we dive in and explore these new products, I want to take a moment to highlight the success we've had with our existing Adaptive Metrics and Adaptive Logs offerings. We continue to be amazed at the success of these things. Adaptive Metrics continues to have growth in terms of raw adoption as well as the number of series saved.

(00:11:01):

Is that a B? It's a B. We're getting there, Tom. A billion. We're getting there. We've seen a 2x increase in the number of organizations using it.

Tom Wilkie (00:11:09):

That'd be like a hundred million revenue if we didn't do this.

Sean Porter (00:11:12):

I know. Well, don't tell the investors that. So from last year we were doing 640 million saved series a month. Now we're doing 1.4 billion, with a B. That's a big B. The two most prominent features we shipped for Adaptive Metrics this year were label-based segmentation and auto-applied recommendations. So effectively, fine-grained controls for influencing how Adaptive Metrics optimizes your environment, and then automation to maximize your cost savings. You don't need a human being going in there to make changes all the time. You can trust this intelligent, smart expert system to maximize your savings. Adaptive Logs became GA last year at ObsCON and it already serves over 380 organizations, and together they've dropped a total of 12 petabytes of less important log events. So they're saving our planet one log event at a time by using Adaptive Logs. So let's get into the new stuff.
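To make the "saved series" idea concrete, here is a rough, hypothetical sketch of what aggregating away an unused label does. This is not Grafana's implementation, and the metric labels and values are made up; it only illustrates how many raw series can collapse into a few once a label nobody queries (like `pod`) is dropped.

```python
from collections import defaultdict

def aggregate_series(series, drop_labels):
    """Sum values across series that become identical once drop_labels are removed."""
    aggregated = defaultdict(float)
    for labels, value in series:
        # The remaining labels identify the aggregated series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        aggregated[kept] += value
    return dict(aggregated)

# Four raw series, but nobody's dashboards ever query the "pod" label.
raw = [
    ({"job": "api", "pod": "api-1"}, 3.0),
    ({"job": "api", "pod": "api-2"}, 5.0),
    ({"job": "db", "pod": "db-1"}, 2.0),
    ({"job": "db", "pod": "db-2"}, 4.0),
]
out = aggregate_series(raw, drop_labels={"pod"})
# Four series collapse to two, and the per-job totals are preserved.
```

The real system's recommendation engine decides which labels are safe to drop by watching what queries actually use; the sketch only shows the aggregation step itself.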

(00:12:19):

Let's talk about Adaptive Traces. Tracing is powerful. It provides a complete, end-to-end view of requests as they traverse complex and distributed systems. Essentially, this is a signal that gives you really deep visibility across the whole stack. Unfortunately, the sheer volume of trace data can be quite overwhelming and cost prohibitive, potentially worse than logs, believe it or not. Most of those traces represent successful operations. The real insights come from a very small percentage representing errors, performance issues, and unusual patterns. So what does Adaptive Traces do? Well, it only keeps the valuable traces, the traces that are worthy of your attention. So you send all of your distributed traces to Grafana Cloud and Adaptive Traces will intelligently sample them. Doing so will result in immediate cost savings while empowering engineers to better utilize traces as a signal. At the very heart of Adaptive Traces is the sampling policy.

(00:13:27):

This is what's used to evaluate and determine what to keep and what to drop, et cetera. Thankfully, it's really easy and straightforward to get started with Adaptive Traces because we have this recommendation engine. It will get you started with three policies in under three minutes. Or three traces? I hope not, three policies in three minutes. Yeah, three traces would be one heck of a reduction though, wouldn't it? From there, Adaptive Traces will continuously analyze your trace data as we ingest it and then give you further fine-tuned policies to really maximize the efficiency of the system. Really fine-tune it. And it's unique to you. Furthermore, again on the AI/ML side, we're using these technologies to also do anomaly detection and automatically capture the relevant traces that relate to those anomalies. All of this is in service of keeping the traces that deserve your attention, that are worthy of it.
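To give a feel for what a sampling policy decides, here is an illustrative sketch, not the actual Adaptive Traces engine, of classic tail-sampling logic: keep every trace with an error, keep every slow trace, and keep only a small probabilistic baseline of the boring successful ones. The trace shape, thresholds, and policy names here are assumptions for the example.

```python
import random

def keep_trace(trace, slow_ms=500, baseline_rate=0.05, rng=random.random):
    """Decide whether to keep a finished trace, given its spans."""
    # Policy 1: always keep traces that contain an error span.
    if any(span.get("error") for span in trace["spans"]):
        return True
    # Policy 2: always keep slow traces (end-to-end duration over the threshold).
    start = min(s["start_ms"] for s in trace["spans"])
    end = max(s["end_ms"] for s in trace["spans"])
    if end - start >= slow_ms:
        return True
    # Policy 3: keep a small random baseline of everything else.
    return rng() < baseline_rate
```

Because the decision is made after the trace completes (tail sampling), the errors and outliers survive even though the bulk of successful, fast traces are dropped, which is where the 70 to 90% reduction mentioned below comes from.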

(00:14:25):

Now, our preview users have experienced an average of a 70 to 90% reduction in overall stored traces using Adaptive Traces. It's quite impressive. What's also really interesting, Tom, is that they send us two or three times the amount of traces once they start using it. So, brilliant. You can imagine what the TCO numbers look like for this thing. So it wasn't enough to just ship one solution, we had to outdo ourselves and ship a second one. This is Adaptive Profiles. Profiles, as a signal, have a ton of potential. I think they're highly underutilized today. It's the only signal type that can point to a single line of code and say this is a potential problem or an opportunity to optimize. So you send all of your profiles, whether from the Pyroscope SDKs or Alloy, to Grafana Cloud and Adaptive Profiles will intelligently sample them. I know I'm starting to repeat myself, but substitute traces for profiles. Adaptive Profiles uses AI to perform continuous analysis of these profiles.

(00:15:29):

And when something changes in the shape of the profile, we'll dynamically increase the resolution for that profile. So, by default we have a baseline sample rate. When things get weird, we increase that sample rate to get a higher resolution. We generally don't compromise the shape of the profile, but sometimes you need those accurate numbers, the telemetry from it. By turning this thing on, you're going to get an average of an 85% reduction in stored profiles. This is a big deal. We think this is going to unlock continuous profiling at whole new scales. You'll be able to deploy it across all of your services. You'll get a consistent experience and the ability to go down to the code level everywhere. So if you want to learn more about Adaptive Traces and Profiles, be sure to join the breakout session tomorrow morning called Smarter Observability with Grafana Cloud. Now, before I hand it back to Tom to talk about Bring Your Own Cloud, I just want to.. it's nothing bad, Tom.
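The "raise the resolution when the shape changes" idea can be sketched roughly like this. This is assumed logic for illustration, not Grafana's algorithm: treat a profile as a distribution of samples across functions, measure drift from a baseline with a simple total-variation distance, and boost the sample rate when drift crosses a threshold. All function names, rates, and thresholds are made up.

```python
def profile_shape(samples):
    """Normalize per-function sample counts into a distribution (the profile's 'shape')."""
    total = sum(samples.values()) or 1
    return {fn: n / total for fn, n in samples.items()}

def next_sample_rate(baseline, current, low_hz=10, high_hz=100, threshold=0.2):
    """Pick the profiling rate: low while the shape is stable, high when it drifts."""
    a, b = profile_shape(baseline), profile_shape(current)
    # Total-variation distance between the two shapes, in [0, 1].
    drift = sum(abs(a.get(fn, 0) - b.get(fn, 0)) for fn in set(a) | set(b)) / 2
    return high_hz if drift > threshold else low_hz

steady = {"parse": 50, "render": 50}
weird = {"parse": 10, "render": 30, "gc": 60}  # a new hot spot changed the shape
```

With `steady` as the baseline, a matching profile stays at the low rate, while the `weird` profile (where `gc` suddenly dominates) triggers the high rate, which is the higher-resolution capture described above.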

Tom Wilkie (00:16:27):

This is not in the script.

Sean Porter (00:16:28):

No it's not, I've gone off the script now. I just want to say it's been a year since I joined Grafana Labs as part of the tail control acquisition, and it's been a real privilege to join the Adaptive Telemetry team. We shipped two, but we're out of signals now, so we've got to go and invent some more. It's really just the beginning.

Tom Wilkie (00:16:47):

How long till you build Adaptive SQL? No, please. Thank you Sean. That means a lot. So yeah, as Sean said, I'm going to tell you a little bit of a story about this new product we've launched called Bring Your Own Cloud. It's a bit of a weird one, right? We designed and built Grafana Cloud, our open observability cloud, to be really cost effective even at scale. We designed things like Adaptive Telemetry so that your bill doesn't just increase linearly as you send more and more telemetry to us. And then we have really aggressive volume discounting: as you send more and more data, we charge you less per unit of data. This line should really be a curve.
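The "line should really be a curve" point is just tiered pricing arithmetic. Here is a tiny sketch with entirely made-up tiers and prices (not Grafana's actual price list) showing why total cost bends downward as volume grows: each additional band of data is billed at a lower marginal rate.

```python
# Hypothetical tiers: (volume ceiling in GB, price per GB within that band).
TIERS = [(100, 0.50), (1_000, 0.30), (float("inf"), 0.10)]

def monthly_cost(gb):
    """Bill each band of usage at its own marginal rate."""
    cost, used = 0.0, 0.0
    for limit, price in TIERS:
        band = min(gb, limit) - used
        if band <= 0:
            break
        cost += band * price
        used = min(gb, limit)
    return cost

# Average $/GB falls as volume grows: the cost-vs-volume line is a curve.
# monthly_cost(100) -> $0.50/GB average; monthly_cost(2000) -> $0.21/GB average.
```

Under these invented numbers, 10x the volume (100 GB to 1,000 GB) costs only about 6.4x as much, which is the sub-linear curve Tom is describing.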

(00:17:28):

What we've found, though, and this is the last thing here, is that we've got really good at running Grafana Cloud at scale. We're now running 25 regions of Grafana Cloud around the world. Our biggest regions handle over a billion active series now. We do this with really tight control over our costs as well. Internally, we dogfood Grafana and Mimir and all of our technology to really aggressively monitor the spend on our own services. And all of this means we can pass these savings on to you, and that's how we achieve the kind of prices we have. But despite all of that, there are ever bigger and bigger users of our technology. We can do billions of series now. Someone wants tens of billions of series. Those users tend to find it's more cost effective to use our open source software. And that's fine by us. That's kind of why it exists, right?

(00:18:19):

At the very large scale, they can run this themselves. They don't have to pay our margins. At smaller scales, though, there's this fixed cost: you've got to build the expertise and the team to run this software, and it's not trivial. They also don't get the kind of investment that we've made in our operational capabilities for running this stuff at scale and efficiently. They have to develop that themselves. And we're also kind of not particularly happy that they don't get the innovation that we've put into Grafana Cloud. They don't get things like Adaptive Telemetry. They don't get the observability solutions that Myrle and Manoj are going to talk about. They don't get the useful AI that Matt and Dmitry are going to talk about. And so we wanted to find a solution for these users, these absolutely massive scale users. There's not a lot of them, to be really clear.

(00:19:06):

I don't know whether any of them are in the audience, to be honest. But it's those tens of really, really big users where the traditional SaaS economics start to break down. So this is why we've built what we call Bring Your Own Cloud. This is where we, Grafana Labs, will come along and run a region of Grafana Cloud in your AWS or Google or Azure account. You pay the hardware costs, the infrastructure costs, to the cloud provider, and you pay us effectively a fixed license fee. This is really moving away from the consumption-based economics of a traditional public SaaS provider. Fundamentally, this is a non-consumption pricing model, and you get things like Adaptive Telemetry, you get our AI technology, you get all of the solutions we've developed. This is a full Grafana Cloud region in your account, just for you. So we're really excited about this. I expect in this world we will only have a handful of these Bring Your Own Cloud accounts. This is not something that we'll have thousands of; I hope not, at least, because my team has to operate these. But yeah, this is really only aimed at those massive, massive customers. And this is one example of how we are trying to reimagine the traditional SaaS economics.

(00:20:19):

The economic side of public cloud is not the only reason why people can't use Grafana Cloud, right? There are also some customers with particularly strict compliance requirements. For these, the most strict compliance regime is probably FedRAMP, right? This is for selling to the American government. We have just launched Grafana Federal Cloud. This is our FedRAMP compliant environment running in GovCloud. We've done this in partnership with Palantir. We've got FedRAMP High and DoD IL5 compliance.

Sean Porter (00:20:53):

What's that mean, Tom?

Tom Wilkie (00:20:54):

I have no idea. Yeah, I'm sure the team know. I hope so. I hope the team know. This is an interesting one actually because some of the requirements here are that the operations of this cloud have to be done by US citizens on US soil. So we've had to change some of the ways in which we run Grafana Cloud for this region. And the other thing I find fascinating about FedRAMP is it actually doesn't matter if you sell to the federal government or not. If you've got a customer that sells to the federal government, they're going to want FedRAMP themselves. So it has this kind of viral supply chain effect, this compliance. And Raj has this famous saying for all of these kind of things where this just makes us a better business. This just makes us better. This extra kind of rigor we apply to our security and our InfoSec makes it better for everyone.

(00:21:40):

So the last thing to mention about FedRAMP that I kind of like is that what we're seeing with our initial FedRAMP customers is they're not going all in on FedRAMP. They might have 90% of their deployment on Grafana public cloud and that 10% on FedRAMP. And this helps them maintain consistency of operations, so you don't have to learn two different tools. So yeah, that's the wrap-up really. I'm going to hand over now to Myrle and Manoj, but briefly: Adaptive Telemetry, we did it for metrics and logs and now we're doing it for traces and profiles. Bring Your Own Cloud really breaks the traditional consumption-based economics of SaaS and allows us to come and run an absolutely massive scale Grafana Cloud region in your cloud account. And Grafana Federal Cloud allows us to reach the highest and most stringent compliance requirements of anyone in the world. So yeah, I've been told to tease the breakout session first thing tomorrow morning. Yeah, Sean did that. If you want to find out more, please go to Sean's session tomorrow morning. And now I want to introduce Myrle and Manoj. Thank you very much.

Myrle Krantz (00:22:50):

Good morning everybody. Alright, we are going to be having a little conversation. I'm Myrle, this is Manoj.

Manoj Acharya (00:22:59):

I'm Manoj.

Myrle Krantz (00:23:00):

We are going to be having a little conversation about how we make complex things simpler. And the very first place we start is obviously with Grafana. This was the product that Grafana Labs launched with. It's the single pane of glass that we all know and love. And so from the very beginning we were making hard things simpler. And we didn't just build one tool. We also built Mimir, Loki, Tempo, Pyroscope, Faro: an entire forest of tools. I could make a joke here about logs. Like trees support their neighbors, we also contribute to projects outside of Grafana Labs. The best known of these are Prometheus.. fire in logs, I don't know.. and OpenTelemetry.

Manoj Acharya (00:23:53):

Oh, we have been the cool kids in Prometheus, but are we really doing OpenTelemetry now?

Myrle Krantz (00:23:58):

Yeah, we moved up, we're up in spot four on the contributions for the last quarter in OpenTelemetry.

Manoj Acharya (00:24:03):

Oh my god.

Myrle Krantz (00:24:04):

We're doing pretty good.

Manoj Acharya (00:24:05):

We are the cool kids now.

Myrle Krantz (00:24:06):

And we're proud of it. We pride ourselves on making the best open observability tools on the planet. And this continuously adapting forest of tools is the sustainable raw material that we've built our business on. And not just us, our users have made some really cool stuff with our tools, including, and this is just a very small selection, public dashboards for Wikimedia. You can monitor your Fitbit with Grafana's tools. You can monitor your sourdough starter with Grafana tools, or you can go really big. There have been multiple, at least three, public space agencies and a couple of private space companies as well that use Grafana to monitor their rocket launches.

(00:24:51):

Our tools are super powerful, and as our users have used them, we've also learned from our users. There's a huge range of use cases that we can apply this to, but just like most people don't want to build their own furniture, many of our users don't want to build their own observability solutions. They want to put their engineering wood behind other arrows. We've also learned a lot from monitoring our own SaaS offering. So we built the things that we wanted for monitoring our SaaS offering, and we built that for y'all, including a hundred integrations. We can now monitor over a hundred different infrastructure systems, but that was just the beginning. We introduced Kubernetes observability. We introduced monitoring for your web frontend; there, we had to build our own SDK for it. We introduced monitoring for applications; there, we leveraged OpenTelemetry. And we've introduced monitoring for your cloud providers. Remember the "it's complicated" part. All of this leverages open standards and open source technologies. And each of these open observability cloud tools provides detailed insights into a critical component of your system. And as we identify new ecosystem niches in need of deep insight, we grow new solutions. Oops, there's one now.

(00:26:23):

We've got Database Observability. We are announcing it today in public preview. And this is the newest in our suite of open observability cloud solutions. We determined that many of the problems that our users were trying to solve have their roots in your database. And with Database Observability, you can see what the performance of your queries is. You can see if they're degrading, you can view explain plans, and you can even leverage AI to suggest improvements to your queries and to your tables. We're not going to go deeper on that now, but Manoj and Cedric have a session later. In like an hour, right?

Manoj Acharya (00:27:03):

Yep.

Myrle Krantz (00:27:04):

Where you can go really drill in on this one. Super interesting. All of these solutions are a little bit like drawers in a cabinet, right? If you want to find the actual problem in your system, you open a drawer and you look. You're like, is it Kubernetes? Are the scissors in the drawer on the right? You're asking your colleague, do you need to roll something back? That latency problem, is it an increase in requests? Have you under-provisioned your Kubernetes cluster, or has that increase in traffic maybe exposed some inappropriate locking in your SQL database? All of these components that support your system.. they interact with each other. And if it's a compartmentalized solution that you're using to monitor and observe this, then it can give you deep insights into one component. But we've realized that you also need the ability to integrate your understanding of your systems.

(00:28:07):

And if you're looking at a single solution at a time, you may end up having to track a lot of information in your head in order to build a complete picture. And from the beginning, our customers have been using us, using our software to develop a full understanding behind the single pane of glass. So we felt like this was a natural next step to bring this kind of information also behind a single pane of glass. We started to integrate our solutions. If anybody here was here at ObsCON last year, you probably saw some of those early green shoots of that integration effort. So we went and we created a platform to integrate our open observability cloud solutions.

Manoj Acharya (00:28:50):

That sounds familiar, sounds like Asserts to me.

Myrle Krantz (00:28:54):

Right, Asserts. We started with Asserts, took it and made it the foundation of our integrated solution, and took the pieces of it and made them universal across all of the solutions I just told you about. We've got the Knowledge Graph, which can show you a list of all of the components in your system and their relationships to each other. You can look at insights, you can discover the behavior changes in your system, and you can also go to the RCA, the root cause analysis workbench, and use that to discover the ways in which those changes in your system relate to each other.
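As a rough mental model of what a knowledge graph plus RCA workbench does, here is an illustrative sketch, an assumed toy model and not the Asserts implementation: represent components and their dependencies as a graph, then, when one component alerts, walk its transitive dependencies to surface unhealthy suspects. The service names are invented.

```python
# Toy dependency graph: each service maps to the components it depends on.
DEPENDS_ON = {
    "frontend": ["api"],
    "api": ["kubernetes", "postgres"],
    "postgres": [],
    "kubernetes": [],
}

def rca_candidates(component, unhealthy):
    """Return every (transitive) dependency of `component` that is unhealthy."""
    seen, stack, suspects = set(), [component], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node != component and node in unhealthy:
            suspects.append(node)
        stack.extend(DEPENDS_ON.get(node, []))
    return sorted(suspects)

# A latency alert fires on the frontend; the graph points past the healthy
# api layer straight at the struggling database.
suspects = rca_candidates("frontend", unhealthy={"frontend", "postgres"})
```

The point of the sketch is the latency example from the previous section: instead of opening the Kubernetes drawer, then the database drawer, the graph walk connects the symptom to its likely upstream cause in one step.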

(00:29:33):

These kinds of insights into these dependencies, they're critical to many of your daily tasks. So if you're trying to debug a time-sensitive issue, remember that latency increase issue we were just talking about. If you're trying to onboard new people into your system and you need a nice visual way to say, "Okay, these are the pieces, this is how they relate to each other." Or if you're trying to find a way to save money and you're saying like, "Oh, my cloud provider's got a really high bill, why is that?" This can all help you understand that. This is like a Billy Shelf. It's very versatile. Or Manoj, what is your favorite IKEA furniture?

Manoj Acharya (00:30:10):

I know Tom loves Billy, but I like EKTORP, the sofa.

Myrle Krantz (00:30:14):

You like EKTORP?

Manoj Acharya (00:30:15):

Yeah, I just want to chill there. I still have my 50-year-old EKTORP, the reliable one.

Myrle Krantz (00:30:22):

So when you put guests on that, right, you can look at all of your problems together and put them in the same solution and look at them together and see how they interrelate to each other. So it's like a good conversation on the couch.

Manoj Acharya (00:30:35):

Yeah, totally.

Myrle Krantz (00:30:38):

Our users have also been telling us about another hard problem that they're facing. They want to get their data into Grafana Cloud while they're still getting to know our products. So like IKEA's Billy shelf, the problem is the assembly, and users are also facing this problem while trying to find vendor-neutral solutions. So we set out to make it much easier to ingest and instrument your systems, and this is why we engaged in the OpenTelemetry project. This is also why we offer an OpenTelemetry collector distribution called Alloy. Alloy joins the best of the OpenTelemetry and Prometheus worlds. To make Alloy deployments easier, we introduced fleet management in GA in May of this year, and since then our customers have been enjoying the ability to change and roll back configurations for their fleet of Alloy collectors.
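The change-and-roll-back capability described here boils down to keeping a version history per fleet. Here is a minimal sketch of that idea, not the Fleet Management API, with an invented class and config strings purely for illustration:

```python
class FleetConfig:
    """Versioned collector configuration with one-step rollback."""

    def __init__(self, initial):
        self.history = [initial]  # oldest first; last entry is live

    @property
    def current(self):
        return self.history[-1]

    def apply(self, new_config):
        """Roll a new configuration out to the fleet."""
        self.history.append(new_config)

    def rollback(self):
        """Revert to the previous configuration (no-op at the first version)."""
        if len(self.history) > 1:
            self.history.pop()
        return self.current

fleet = FleetConfig("scrape_interval: 60s")
fleet.apply("scrape_interval: 15s")  # roll out a hotter scrape interval
fleet.rollback()                     # too expensive; revert the whole fleet
```

Keeping the full history (rather than just the previous config) is what makes reverting a bad rollout across a large fleet a single, safe operation.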

Manoj Acharya (00:31:32):

But no, aren't you forgetting something? I remember being on stage with you two years ago for something called starting with B..

Myrle Krantz (00:31:39):

You mean Beyla?

Manoj Acharya (00:31:39):

Oh yeah..

Myrle Krantz (00:31:40):

Yeah, that was great. We launched Beyla in '23.

(00:31:44):

We found that Beyla has been really useful for helping users also discover what's on their system. So we went and combined Beyla with fleet management, and we are announcing today in public preview the Instrumentation Hub. You can use Beyla and Alloy together with fleet management to discover the services, namespaces, and clusters that are on your system and see that from Grafana Cloud. And then if you want to instrument them, you can instrument them at the push of a button. So you don't have to search around in config files anymore. You can just say, yo, I want to see the data out of that system.. push button.

(00:32:20):

And the other thing, the other really great thing about this is you can also turn it off. So if there's a part of your system that you're pulling data from into Grafana Cloud and you see, oh well we're not actually using that data, nobody's looking at it. Then in line with our philosophy around Adaptive Telemetry that Sean and Tom just told us about, we wanted to make it easy also for you to turn that off. We want you to have control over your costs. And Ed and Ted are going to take you on an excellent adventure later today through this new feature in the OTel and Grafana Cloud session. So, we have recognized that making something easier should never mean dumbing it down. You don't have the option of hiding from the complexity of your systems. You need the ability to drill down and we will not do you the disservice of trying to hide that complexity from you. We want to give you that holistic understanding of how your components interact in your systems. We give you the tools to navigate that complexity.

 Manoj Acharya (00:33:22):

So, are you saying it's only for the experts or also for the experts? I don't know what I'm saying, but..

Myrle Krantz (00:33:29):

Well, tell you what.. I think it's probably best to understand if you just show us.

 Manoj Acharya (00:33:33):

Okay. Okay, let's do a live demo, and let's see whether the demo gods are going to play nice with me or not. Let's go to the next slide here. Okay, yeah, it's showing up. So we talked about the Knowledge Graph, right? So please welcome the Knowledge Graph, and I mean literally, this is the group that we both work on, the observability group in Grafana Cloud, and this is our menu item. We are a startup in a startup, so we have our own menu item. Let's start with the entity catalog. The entity catalog is literally the newest addition to our product where you can actually see.. I had Ed and Ted help me onboard already, so we'll talk about that later. Now I already have my Kubernetes pods, nodes, et cetera, et cetera. I do have my CloudWatch stuff, my EC2 instances.

(00:34:21):

EC2 instances are talking to Kubernetes pods, and then I have some topics and some volumes. And then I do have the new kid in town, the database, and it looks like I don't run my databases anymore, I just use whatever Amazon provides here, which is called RDS. Okay, I have RDS too. So all the cool kids are here. Now, I'm looking at my services right now and I can take a quick look at.. okay, it's just the last five minutes. I will switch to the last 24 hours here. Oh wow. Okay. Everything changed color and I can see my RED metrics already spiking up and things changing color. So what is all this stuff here? Okay, well, this is surfaced by insights. So I mean, this insights thing is pretty interesting, right? I get this question all the time.

(00:35:08):

So do I write my own alert rule, or do I bring my alert rule? I mean, we provide a lot out of the box, but you can absolutely bring your own, right? And we'll talk a lot more about that in the talk that we have later today. I already have my affected service at the very top, but I want to know if these services are affecting my browser. Faro, Myrle's favorite. So I will switch to frontend. Oh, it looks like it does. My frontend client is also affected by this. And then, okay, this is like my shopping cart, I'm going to just take a quick peek at them. So I'm going to add them to the workbench, and then I do want to look at my new kid, which is the RDS instance here.

(00:35:54):

Okay, okay, hold on. The RDS instance is also complaining about CPU being high. Let's add it to the workbench too. Now let's jump to the workbench to see.. oh wow, okay. So this is like my flame graph, a flame graph of all the insights. Let me zoom in, and I can already see that all these problems happened around the same time, a quick visual cue that we can use to figure things out. But I want to talk about the new kid right now. I'm very biased, so I will talk about the RDS instance. The RDS instance: literally, your CloudWatch metrics are now pulled into Grafana Cloud, and you're able to see that. But not just that, right? I can quickly see that, oh, my CPU is high, and look at my disk.. okay, click.. a lot of reads are happening.

(00:36:43):

And I can jump into.. MySQL. It's running MySQL here, it's running a database here. And this is really, really powerful. You can look at the wait events, sort by duration, and I can then sort by what's the most expensive query here and take a jump into the query, go look at my query samples, explain plans, it just goes on and on. There's even an AI helper to optimize the query. So we'll talk a lot about that across the board; come to the breakout session starting in about an hour. We'll walk you through all of this stuff, but this is really cool stuff, right? But let me go back to the workbench, because I want to finish my troubleshooting here. So the Knowledge Graph, or the insights, we literally took it everywhere, to every corner of Grafana, every product of Grafana, every tool that you use every day, right?

(00:37:38):

Because I could be the expert, or I could be the new kid in the company, coming in and trying to understand my system. So I asked the graph.. hey, okay, can you just tell me what the other possible issues are here? And it quickly walked the graph and fetched everything. I can go back to the graph here and say, oh wow, okay, I am running a microservice architecture and the databases and a frontend are pretty far away. I don't know how many [inaudible], who cares? But go back to the timeline here. If I look here, I can sort by time and score, and I can see that, okay, it looks like something triggered here. Anyway, I want to do a full RCA later today, not right now. But the thing I really want to show you is this. I'm the tracing guy. Like, Sean was the tracing guy; he loves his traces, he only looks at traces, nothing else. He doesn't like my workbench. But I have a tool for him too. So now Sean loves the new view that we put into tracing, where you can go to traces and literally see the insights. Hey, Sean knows his ML/AI, and everybody else now knows, hey, these are the problematic areas, and it brought all the bad traces to the very top. Thanks, Sean. Keep the bad ones; I don't care for the good ones as much. And you can jump in and do all your trace analysis. So it's literally in every corner of Grafana Cloud, right? Every product, every solution of Grafana Cloud is now integrated with the Knowledge Graph and your insights. But this is looking pretty busy. And if you haven't been following us at the different conferences.. at GrafanaCON, we announced Grafana Assistant.

(00:39:22):

So we said, "Hey, how about we take the Assistant's help to actually explain it? Who wants to read all this stuff? I want to sit back and let it talk, right? Chill out there." So let the Assistant do the work of actually reading all the data and analyzing everything. What we essentially did is we took the LLM tools, right? I mean, they're probabilistic, but they need some guidance, and the guidance is, call it RAG or whatever you want to call it. But yeah, literally we gave it all the tools. The Knowledge Graph is literally a tool for Grafana Assistant now. So when I triggered this.. I'm almost seven minutes over my scheduled time right now, so I'm just going to switch to an analysis that I literally ran this morning. And you can see here I'm scrolling to the very top.

(00:40:11):

Yeah, so actually last night at midnight I was running it, just to make sure I'm ready for you guys. So it ran the whole analysis, and it's already showing me what caused what, a complete sequence diagram of all the failures. And I can see here, oh, it looks like the flag.. somebody turned on the feature flag. And around the same time, the databases had massive CPU spikes and connection errors and everything else, and everybody is affected by that. And as you scroll down, it keeps producing.. I mean, it can read logs, search my traces, run Prometheus queries. We have completely made sure that it has access to every tool, right? I know there's a Grafana MCP server, which you all like, but this is like, I don't know, MCP on steroids, right? So think of it: everything we have done, all our products, all our tools, is now feeding into Assistant here. But it's a lot of stuff. And we have a lot more cool kids in our town, Grafana town, building AI [inaudible], so let me call the AI man, my best friend Mat, to show you some more. Thank you.

Mat Ryer (00:41:29):

Thank you so much. Wow, amazing. It's been a long time since I've been called cool, so I'll take it. Yeah. So I'm thrilled to give you an update on Grafana Assistant. We just saw a great integration example there with the Asserts stuff. There's a lot more exciting things to talk about too. If you don't know, Grafana Assistant is a sidebar chat integration; it's baked right into Grafana. This is actually a great use case for LLMs, and this is why we think this is actually useful AI: we've got structured telemetry data, which is what all your information is, and we've got query languages that can access and interact with that data. And now with Assistant, you can use natural language to generate those queries. It can then access the data, and because it's agentic, it can have a look at what it found and iterate from there.
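The loop Mat describes (natural language in, query out, inspect the result, iterate) can be sketched in a few lines. Everything below is an illustrative stand-in, not Grafana Assistant's actual implementation; the query, the fake result series, and the function names are all invented for the example.

```python
# Illustrative sketch of an agentic query loop: generate a query from a
# question, run it, inspect the result, and iterate. All names and values
# here are hypothetical stand-ins, not Grafana Assistant internals.

def generate_query(question, feedback=None):
    """Stand-in for an LLM call that emits a PromQL query."""
    base = 'rate(http_requests_total{status=~"5.."}[5m])'
    return base if feedback is None else base + " > 0"

def run_query(query):
    """Stand-in for executing the query against a datasource."""
    return [0.0, 0.2, 0.9]  # fake series values

def agent_answer(question, max_steps=3):
    feedback = None
    for step in range(max_steps):
        query = generate_query(question, feedback)
        result = run_query(query)
        if result:  # the agent inspects what it found and stops or iterates
            return {"query": query, "result": result, "steps": step + 1}
        feedback = "empty result, broaden the selector"
    return {"query": query, "result": [], "steps": max_steps}
```

The point of the structure is the feedback edge: the agent's next query generation is conditioned on what the previous query actually returned.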

(00:42:24):

So we've found this to be extremely useful. And because this is Grafana, of course it can build dashboards; it can actually help you do things. So this is a new way of interacting with your telemetry data. We saw the Asserts integration; there were some other fantastic integrations with the drilldown apps; we saw the Knowledge Graph; and there are more things to come. Grafana Assistant is for everybody. So if you're new to Grafana, maybe you just need to learn concepts, or if you are onboarding new teams, that can be a challenge. This is a great way to enable that self-serve and let them do that themselves. If you're migrating from another service, which we like, then you can see how concepts map: how do I do this in Grafana? I used to do it like this. All this stuff just becomes available.

(00:43:17):

You can just ask it these questions in natural language. If you are an SRE pro, then you can customize Assistant now and take full control of how it approaches problems. This is a great way to give it particular insights about your infrastructure, or just the best practices that you've built up and the advice you'd give; that's the advice it will pass on to people, and it sort of enables this self-serve here. Non-technical people.. I think I originally didn't put "muggles", I put "non-technical people". I think Tom Wilkie likes to go in and change my slides without telling me sometimes. But yeah, if you are non-technical, you can now, for the first time really ever, interact with telemetry beyond dashboards and get questions answered just through natural language. So everybody can now start to unlock some of this value.

(00:44:17):

And we've been very, very pleased and excited to see all the ways that Grafana Assistant is already helping people. Just check out this graph. And on the wide screen, actually, that spike looks much nicer. So yeah, we've had great stories. There's one from Chris here that I liked, because he was talking about how, if they'd had this at the beginning, they would've cut their support in half, because most of the questions they get are the same kinds of questions asked in slightly different ways. Well, the Assistant doesn't mind that; it can just deal with it and take it in its stride. We've also got Jeremy coming up in a bit from SpotOn, who's going to tell us a little bit more about that. Sean loves traces, and people love traces and profiles, so we've added support for that. There's a lot that's new here in Grafana Assistant.

(00:45:11):

We're ever expanding its capabilities. We've also got support for SQL data sources there. But I'm going to pull out a couple of my favorite things here, which I think you might like too. So, rules let you control how Assistant behaves, what it does, and the advice it gives. You can tell it interesting things about nuances in your setup, and it uses that when thinking about problems. Or if you've spent a long time building up the right way to do things in your case, something unique to you, you can feed that in through these rules. Assistant will take it into account whenever it's answering questions or trying to help people. So, full control here, and you just write this in natural language. You can also now connect MCP servers. So if you want to interact with third-party systems: GitHub has a good MCP server, quite a big one; a few ticketing systems now have them; I think AWS has one. And of course you see them popping up all the time, so we expect more and more to come. And it's also quite cool: if there's something really specific you want to do with Assistant in your world, in your telemetry, you could just build a little custom MCP server and plug it in, and those tools will become available to Assistant. So then it can do anything; it can interact, it can take action. You can fully integrate Assistant into your workflows today.
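To make the "little custom MCP server" idea concrete, here is a heavily simplified, stdlib-only sketch of the pattern: tools registered by name, and a JSON dispatch function an assistant could call. Real MCP servers use the official SDKs and the full protocol; the tool name and ticket fields here are invented for illustration.

```python
import json

# Registry of tools an assistant could discover and call.
TOOLS = {}

def tool(name):
    """Decorator that registers a function as a callable tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("ticket.create")
def create_ticket(title, severity="low"):
    # A real server would call your ticketing system's API here.
    return {"id": "TICKET-1", "title": title, "severity": severity}

def handle_call(request_json):
    """Dispatch one tool call of the form {"tool": ..., "arguments": {...}}."""
    req = json.loads(request_json)
    result = TOOLS[req["tool"]](**req.get("arguments", {}))
    return json.dumps({"result": result})
```

Once the assistant knows a tool's name and argument schema, "it can take action" reduces to exactly this kind of dispatch.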

(00:46:46):

A cool innovation that came from the team was infrastructure memory. This is where agents will scour the telemetry beforehand and build a map of what they see. This then helps Assistant answer questions faster in the future; the agents will consult this memory when making decisions. You can supercharge that with the integration we talked about, with the Knowledge Graph, with Asserts. And yeah, you just see Assistant learning and getting better the more you use it, which is fantastic. But don't just take my word for it; of course I'm going to say it's good. Let's welcome Jeremy White to tell us more.

Jeremy White (00:47:34):

Thank you, Mat. Hi everyone, I'm Jeremy White. I have the privilege of leading the platform engineering team at SpotOn. At SpotOn, we help restaurants run successful businesses. We do this through our point of sale, and we have a suite of other products that help handle things such as taking orders, processing payments, running marketing campaigns, managing employees and tips, and ultimately trying to become more profitable as a business. And our goal is pretty simple: we try to make sure that our restaurants are focused on their guests, not on the underlying tooling. Now for us, observability is about a lot more than just dashboards and alerts. Reliability is really how we make sure our customers are able to operate, because we understand our restaurants depend on us in order to have a successful business. One of the worst nightmares a restaurant can imagine is having to close their doors for some reason and turn away their staff and their guests, and that's something we aim to prevent from happening. And the challenge is bigger than just the cloud. We support hardware that's in thousands of restaurants and kitchens across the country. These are places where you deal with a lot of heat, you deal with a lot of grease. The kitchens are basically Faraday cages, so WiFi is not so great. So we rely on observability to help us understand and stay ahead of a lot of those problems, in order to avoid interrupting their business and their day-to-day operations.

(00:49:04):

Now, observability only works when people are actually using it, and this is one of the areas where we found Grafana Assistant really helped us out. We've gone from new engineers stepping in, not being familiar with where our logs and metrics are for all the different services, to them actively being able to participate in incidents and triaging. And again, this is all from basically having a conversation with an AI agent, as opposed to having to go through all that training and understanding. It's also been helpful with dependencies. We have a lot of different products, so as you span into different products and have to understand how different teams have instrumented their systems, again, Assistant is able to bridge that gap a lot more easily for people not familiar with those teams. One of the things I really enjoy about it is that it solves the blank sheet of paper problem.

(00:49:52):

This is the problem where sometimes it can be a little difficult to get started. But with Grafana Assistant, we found that some people could go straight from idea to dashboard and then iterate, and that iteration was a lot easier for people to handle than facing that blank sheet of paper. So this has been a huge boost to our participation in observability, as well as the speed to value and how quickly we can get to value for our customers. One of my favorite examples of this is our client environment team. This is a team of network support specialists. Their goal is to make sure that client networks are operating successfully and not impacting the business in any way, shape, or form. Now, these are not Grafana experts; they're not even familiar with the tooling. But they were able to use Grafana Assistant to get started and create dashboards: dashboards that could look at one restaurant and show all the network devices and all the possible problems we might be seeing, pull it together in one place, and allow them to triage those issues a lot faster.

(00:50:52):

But they didn't stop there. They kept moving, even though they weren't familiar with the tool, and started to build additional dashboards that were more proactive in nature: ones where we could identify groups of customers that were experiencing issues, so that our support team could reach out to the customers and solve the problem before the customer called us asking for help. This is a great example of where our team can focus on being subject matter experts and on solving customer problems, not on the tooling. Sometimes, though, we end up generating so many signals that it can overwhelm some of our development teams and cause more reactive work than we expected. An example of this, for anyone familiar with Kubernetes, is the KubePodCrashLooping alert. This is a common alert that has a hundred different reasons why it could fire.

(00:51:38):

And what we found is we can use Grafana Assistant to decorate that alert and provide additional context on why it happened, because, again, not all engineers are used to troubleshooting this particular issue. We had one case where it even found a null pointer reference during the transformation process in one of our services. It noticed right when it happened, during the release. And so it made it a lot easier and a lot less time-consuming for us to troubleshoot that issue and ultimately get to a solution. So while it doesn't always pinpoint the exact problem and the exact solution, it does help build confidence on what to look into, and sometimes what not to look into, because it rules some things out. This has really helped us get to resolution a lot faster. Overall, we're really excited about the capabilities of Grafana Assistant. It's really lowered the barrier to entry, allowing us to get more participation and engagement across our teams in observability. It's sped up time to value by solving that blank sheet of paper problem. And ultimately, it's allowed us to really focus on solving customer problems and getting to customer outcomes, rather than on the underlying tooling. So if anyone else is interested in accelerating their observability adoption, I'd encourage you to take a look at Grafana Assistant. Thank you, everyone.

Mat Ryer (00:52:56):

Brilliant. Thank you so much, Jeremy. That's brilliant. And Jeremy is a trustworthy man, in case you wondered; you can believe that. So underneath, we've built this agentic observability platform, and this is what allows us to operate and run these agents at scale. And we're very excited about this, because we are going to use it to build a lot of other very exciting new things that we are kind of inventing, which is some of the most fun that we have here. And we're thrilled to unveil our first one now. To do that, please welcome Dmitry Filimonov.

Dmitry Filimonov (00:53:39):

Thanks, Mat. Alright, hi everyone. So my favorite thing about Grafana Assistant is that it saves developers a ton of time, and this is particularly important during things like incidents, when every second matters. Assistant is already pretty good at incidents, and a lot of people use it for that, but we thought that we could push that idea even further. We thought: what if we could run multiple assistants all in parallel and make them autonomous, meaning that you no longer have to control them, and they would go and explore all the different theories about what might be going on with a particular incident that you're dealing with? Wouldn't that be cool? We did think it would be cool. We also set it up to work with alerts, so that you don't even have to start it; it starts automatically. And so today we're announcing Grafana Assistant Investigations. This is the first major addition to Grafana Assistant, and I will now show you a demo of it. Alright, so this is our demo environment. You can clearly see that something is going on with the shopping cart service; it's in red, duh. So we can use investigations to try to figure out what exactly is going on. I can ask it something like, can you help me figure out.. if you can't see the lower part of the screen here, I'm just typing into Assistant: "Can you help me figure out what's going on with the shopping carts?" Oh, I knew it was going to be that.

(00:55:14):

We need AI to help me type faster.

Mat Ryer (00:55:16):

Yeah, but this is cool. So it's just natural language, right? It is not really mentioning any telemetry here. Just talking generally.

Dmitry Filimonov (00:55:24):

Yeah, make sure to click this deep investigation button, and hit send. And instead of continuing the Assistant conversation like you normally would, it actually starts one of these investigations. Obviously, the more context you add to this, the faster it's going to go and the more likely it is to actually find the root cause. But you could keep the prompt pretty vague. You could also ask it, "Hey, what is this red box? Can you help me with that?" Because it is on this page, it knows the context of where it is, it can read dashboards, and it will generally be able to figure all of that stuff out. For this example, we actually know what's going on. This is our demo environment; we intentionally broke it. The thing that's broken is this connection between the shopping cart service and Redis. So let's see if it will figure it out.

(00:56:13):

So we can go to the investigation page, and this is what the UI looks like. I'll talk a little bit about the key elements. We have some general information about when it started and what it is about. An interesting one here is called confidence. The idea is that as the investigation progresses and it finds more credible evidence, the confidence goes up, and this is one good way to assess whether this investigation is even worth paying attention to. Right now it is low, because the investigation just started, but as it continues, it will go higher. And this is the agent activity view. We show all of those different agents that I talked about; we run multiple of them in parallel. You can see, for example, here we have a group of Prometheus specialists. Right now they're looking at basic things like error rate, request rate, your general RED metrics, but also starting to look into other things. And I see it's already checking Redis connection health. It knows to do that because of those infrastructure memories that Mat talked about earlier. The way those work is that we also run another agent on a schedule; it goes to all of your data sources and tries to make sense of all of the components, and also the connections between them. And if you have Asserts, we have access to the Knowledge Graph, and then it becomes really easy.
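The "multiple agents in parallel, confidence rising with evidence" mechanics described here can be sketched as follows. The hypotheses, the canned evidence, and the confidence formula are all invented for illustration; this is not how Grafana computes its confidence score.

```python
from concurrent.futures import ThreadPoolExecutor

# Canned "evidence" for the demo scenario; a real agent would query
# telemetry (metrics, logs, traces) to test each theory.
EVIDENCE = {
    "redis connection failures": True,
    "high error rate in cart service": True,
    "node memory pressure": False,
}

def check_hypothesis(name):
    """Stand-in for one agent testing a single theory against telemetry."""
    return name, EVIDENCE.get(name, False)

def investigate(hypotheses):
    # Run all hypothesis checks in parallel, like the agent groups above.
    with ThreadPoolExecutor(max_workers=len(hypotheses)) as pool:
        results = dict(pool.map(check_hypothesis, hypotheses))
    credible = [h for h, ok in results.items() if ok]
    # A naive confidence score: the share of theories backed by evidence.
    return {"credible": credible, "confidence": len(credible) / len(hypotheses)}
```

Dead-end theories simply fail their checks and drop out, which is the same shape as the tree view shown later, where disproven branches are grayed out.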

(00:57:45):

Alright, so this is the interesting part, the key findings. This is the report that you get and as it progresses, the report is going to be updated.

Mat Ryer (00:58:01):

Yeah, one of the cool little things here is you see this reviewer agent at the bottom. It's looking at what all the other agents are doing and then that's how it's able to generate this report. So it's kind of reporting on what all this other activity is. So we've found that works really well and it gives you these insights quite quickly.

Dmitry Filimonov (00:58:20):

Yeah, you can see it actually already figured out that it has something to do with Redis and it was able to figure out that it's unable to connect. Yeah, good job Assistant.

Mat Ryer (00:58:29):

Woohoo. Yeah, we triggered that manually, but of course, in practice you'd probably wire this up to an alert or something. This is all done by the time you've even reached the computer..

(00:58:40):

After you've woken up.

Dmitry Filimonov (00:58:42):

Yeah and you can see the main thing is this executive summary, just a summary for you to kind of see what's going on generally. But then in addition to that, we provide more and more details and the idea is we progressively make it more and more detailed so that if you do want to go into details you can, but if you don't, you can just look at the summary.

Mat Ryer (00:59:03):

Yeah. One of the big concerns with LLMs always is how do we know it's not hallucinating? And so in Assistant, we always show all the working. Same thing here. So you can always see why did you come to that conclusion? Let me go and look at the real data underneath that's driving that. That's what lets you trust it.

Dmitry Filimonov (00:59:21):

Yeah, and so these are the evidence panels that we have. We also have this timeline, which just shows the timeline of what happened: when the incident started, when we noticed it, whether there were any alerts. It will show how it found the issues or disproved some theories, things like that. We also have this detailed report, which usually has a bunch of charts of the kind you would typically put on a dashboard once the investigation is finished. Speaking of dashboards, you can create a dashboard out of this. I'm not going to show it right now, but please join us at our session at 4:20 today, at the end of the day. We'll talk a lot more about it, give more of a live demo, and talk about more features and things of that nature.

Mat Ryer (01:00:14):

Yeah, that's a very cool feature, because it was basically prompted: we just prompt the Assistant to do something for us, and then it can go off and do it. So it's really cool that we can now add quite complex features really quite quickly, because what we're doing is essentially providing good context and prompting to the LLMs. This also has access to the rules and the MCP servers that I talked about earlier, so they all plug into this as well. The way that it investigates will also be influenced by your best practices and by your guidance.

Dmitry Filimonov (01:00:43):

And this is our tree view. So the idea with this one is as it checks all the different hypotheses, it will highlight the ones that turned out to be credible where it found matching evidence and things like that. And it will gray out the ones that turned out to be dead ends. So it's very useful to understand how the investigation progressed.

Mat Ryer (01:01:05):

And at all points throughout Assistant, and throughout using Assistant Investigations, you can give feedback, and that feedback is very important. We are looking at it manually, but we're also using LLMs to process it as well. So more than ever, I think your feedback really is important in making these systems better and improving them. And people are giving us great feedback, so we really do appreciate it.

Dmitry Filimonov (01:01:32):

Yeah, to Mat's earlier point, we've made a lot of progress, and a lot of it is due to the feedback that we're getting. We really take that very seriously. And the other thing I wanted to talk about is the fact that you can help the investigation. If you already have some extra context, or you want it to do something or not do something, maybe say, "Hey, don't look at infrastructure metrics, I know my Kubernetes cluster is doing great." You can say it here, and it will tell all the agents that you want them to go in a certain direction. And I guess speaking of Assistant, Mat, do we have anything else to announce there?

Mat Ryer (01:02:08):

Yeah, we do. Good question, Dmitry. I'm glad you asked that.

Dmitry Filimonov (01:02:11):

I think we're done with the demo.

Mat Ryer (01:02:12):

Back to slides please, because I am very pleased to announce that Grafana Assistant is now generally available. And we're going to announce the pricing also: $20 per active user per month, starting January 26. If you're using it already, please keep using it for the rest of the year as a thank you. And yeah, we really hope you're going to get as much value from this as we do, and as our other users do. Investigations is now in public preview, and we've got much more exciting things to come as well. So yes, thank you very much. Give it up one more time for Dmitry please, everybody.

(01:02:58):

Okay. Yes, the Grafana Assistant lab is outside, so you can't miss it. There are people there; you can go and play with it, you can go and try Assistant. Ask it questions, anything, and see what it does. I think you're going to be impressed, and you should be able to earn some swag too. We just heard from, well, a great lineup. We had SaaS economics reimagined with Tom and Sean, who loves traces; complexity simplified by Myrle and Manoj, and how to use the Knowledge Graph to make complex observability simpler; and then we talked about open, actually useful AI with Grafana Assistant and Investigations. So thank you so much for watching. Thank you.
