
From dashboards to decisions: How agentic AI is reshaping observability
AI is transforming observability from a manual, expert-driven process into a collaborative, intelligent experience. In this talk, Grafana Labs Principal Software Engineers Dmitry Filimonov and Cyril Tovena will show how Grafana’s agentic AI systems help users write queries, run investigations, manage dashboards, and resolve incidents while only using natural language. They will also share the lessons they learned building context-aware agents, addressing coordination challenges, and measuring quality with benchmarks. Finally, they’ll discuss why a single-pane-of-glass foundation is crucial for a future of truly intelligent agentic systems.
Cyril Tovena (00:00):
Alright. Hello everyone. I'm Cyril.
Dmitry Filimonov (00:03):
I'm Dmitry.
Cyril Tovena (00:04):
So today we're going to talk about how Grafana Labs is reshaping observability using agentic AI. So before we start, I think this is the last session, so I want to make sure everyone is listening. So quick show of hands, who is using AI, whether it's a coding assistant or ChatGPT, at work? Yeah, so pretty much everyone. That's super cool. Alright, let me get a clicker. Alright, AI is definitely reshaping our industry. The time it takes to take an idea to production has been shortened dramatically. It's kind of crazy. We actually deploy way more than before, and deploying more than before probably means breaking more than ever before. And this is why we actually built Grafana Assistant. We want to make sure that it also matches the speed at which you are developing your applications.
(01:11):
And the Grafana Assistant is basically here everywhere you are in Grafana to help you create dashboards, investigate an issue, create an alert, or just learn something new that you didn't know before. So it works for onboarding, but not just onboarding. If you're a pro, you can also use rules or MCP servers to get the maximum value out of it. This is the graph of contributions within the Grafana Assistant repo. I'm definitely bragging here, but this is the result of a world-class team that we have at Grafana working on this. It also shows the commitment that we're making to this product, and we are working as a startup within a startup. We are definitely iterating super fast and we're using a lot of AI, to be fair. And we are actually inspired by a lot of the tools that we are using.
(02:15):
What's interesting is that we release on average every fourth working day. So something you were probably annoyed wasn't working in the Assistant last week might actually be working next week. So we really want to make sure that you keep using it with us and provide feedback. I want to do a quick run of features that we've been adding since the launch six months ago. So you've seen the graph before, it was six months of work and it's been a lot. So first I'm going to start with MCP, sorry, rules. So rules are a great way of customizing your experience. You can set some best practices or special behavior, and you can maybe tune down the amount of text that it returns. You can provide extra knowledge, because maybe some companies have a peculiar way of instrumenting the application or maybe they use special labels in their Kubernetes clusters. So this is a great way to do that. We've listened to the feedback, and that's why we added rules. MCP is another one that allows you to connect to external services like GitHub, or maybe Linear. Say you are doing something and suddenly you want to create an issue: if the Assistant is telling you there's a gap in your metrics and maybe you should have a new metric there, it's going to help you create an issue directly from there.
(03:50):
Data sources is another one. We're like Big Tent, so there's a lot of data sources that we want to add. It takes time to add each data source because we need to engineer around the format that it returns. So we started with profiling and tracing because those two are quite niche. We realized there's a lot of people that don't necessarily know how to read a profile or write TraceQL queries. So we want to make sure that you can leverage those signals that you're sending to Grafana Cloud. And then we also added SQL. So SQL actually supports many different SQL databases, not just one. I'm thinking about ClickHouse, MySQL, Postgres. I think we also have BigTable and Microsoft MSSQL. So it's nice to be able to leverage those tools within the platform.
(04:36):
Integrations is also a big one that we've been working on a lot. We actually built a whole SDK internally so that each team can just contribute to the Assistant and add more integrations, like Asserts did. We've seen it today. There's a lot of integrations within the Drilldown apps too, and k6, and obviously we're not going to stop there, we're going to keep adding new things. This is definitely not over. We added many things. I'm not going to show them now. We're going to do a long demo, so we're going to talk about this after. And the reception so far has been amazing. The amount of feedback that we're getting and the amount of users that we are getting is awesome, but this is not the end. I think we are not going to stop there, right Dima?
Dmitry Filimonov (05:22):
Absolutely not. In addition to that, we are doing more. Our vision is to go from just providing you with copilots to providing experience that feels more like having more teammates on your team. So for example, earlier today we talked about this idea of running multiple assistants, having them be autonomous and having them attack the problem from different angles. This is what good teams already do during incidents and so we want to provide more of these use cases. And so earlier we announced Grafana Assistant Investigations, and we're going to do more. And now let's go into the demo. Shall we?
Cyril Tovena (06:05):
Yeah. Can we switch to the, alright, let's get started. This is the Assistant on the page. You can access it from the nav menu here. If I can click right so it's right here and you can open the Assistant using this button here. So the Assistant is always there to help you on every page and it knows about the page you are on and it has the context of where you are. So we're going to quickly start the first prompt, describe my infrastructure and provide me with a diagram using metrics. So it's going to use tools to look at what I have in my infrastructure. Notably like metrics, it's going to search metrics, it's basically doing what anyone would be doing, searching for metrics and then querying metrics.
Dmitry Filimonov (06:57):
Can you maybe show us how rules work?
Cyril Tovena (07:00):
Yeah, so this is going to take some time. So we're going to look at rules. You can access the rules from the menu here at the top. So let me show you the rules. So we have a bunch of rules here. What's your favorite rule, Dima?
Dmitry Filimonov (07:13):
Probably runbooks. For our team, we just put everything we had on GitHub in one of these rules. Although to be honest, the infrastructure memory sometimes generates even better stuff. And Knowledge Graph, if you have that then it gets really, really good.
Cyril Tovena (07:32):
And those rules work for everybody. So you can set rules just for yourself because you like to have a different tone, but you can also set it for your whole team if you have best practices and you can create as many rules as you want. So it's done looking at our metrics and it gave me a diagram so I can open it in big. So this is a lot of microservices. Whoops. So we have a lot of microservices. Shows me a nice diagram and what's nice is that it gives me also a quick description that I can look at. So 20 services are running, it's giving me how many AWS regions I'm using and if I like, I can just go and share this with my colleagues, so I can just generate a link and share that link with everyone and it's going to go as a full page view and I can share that with everyone in the company. Right? Let's go back.
Dmitry Filimonov (08:34):
What are those suggestions at the bottom?
Cyril Tovena (08:36):
Right. So interestingly, because I asked about the infrastructure, and every time, based on the context, it's going to generate some follow-ups or links within Grafana that are nice to go and check. So for instance, someone created a system overview dashboard. I asked about the infrastructure, so right from here I can just click and look at the dashboard that someone created. And it is also suggesting another follow-up, comparing the checkout CPU. As you can see, the checkout service seems to have a problem with latency. So I can just click on it and then off you go. And then again it's just going to use tools to answer the question about how the CPU is doing for the checkout service.
Dmitry Filimonov (09:24):
Maybe we could show MCP servers?
Cyril Tovena (09:26):
Right? So MCP servers, so as you can see I have a couple of MCP servers. From there I can click on it and I can see the MCP servers that I'm using. So I've set up GitHub. So we have pull requests and repositories, but you can add as many MCPs as you want. And as Mat said today, if you want to create your own MCP servers, that's also fine, and then that's going to allow you to connect it to maybe your Kubernetes cluster or another service that you're using internally. So already we can see the graph. There is one region that is a bit of an outlier, there's more CPU usage. It's not that much to be fair, but still, something is going on and the Assistant here is telling us, well, there's one that is significantly higher. So I could go on and just check maybe the request rate to verify if one region is maybe receiving more. But I'm just going to go and verify the CPU. Because I was asking about CPU, it is suggesting that I look at profiling. So I go in profiling and to be fair, I don't know how to read a profile. This is definitely a new signal for me. I think it is for you too, Dima?
Dmitry Filimonov (10:33):
Yeah, I usually go to the profiling team if I need help with that.
Cyril Tovena (10:38):
Didn't you cofound Pyroscope?
Dmitry Filimonov (10:40):
Oh maybe.
Cyril Tovena (10:42):
Okay. So as I was saying, we built an SDK and every team is basically adding integrations to send you to the Assistant when you need help. And that's what we're seeing here with this "Analyze flame graph" button. So I'm going to click on it so we have it here. We have it also for tracing, we have it for alerts, data source connection issues. You'll see more coming up and we're going to keep adding those integrations. So what I just did basically is ask the Assistant to look at that profile for me and tell me what's going on.
Dmitry Filimonov (11:14):
Can you tell us more about those rectangles? Is it context?
Cyril Tovena (11:18):
Yeah, so the context here, you should think about this the same way as if you're using Claude Desktop or another LLM tool: the more context, the better. So those pills here are the most used data sources. So every time you ask something, if you select the data source then it's going to go faster. But you can also create context using the @ mention and then you can select the type of context you want. If you don't like keyboards, you can also just use the menu here. So the Assistant came back and is telling us, as you can read here, that regexp.Compile seems to be the biggest problem here, and compiling regexes is like a classic Go issue.
Dmitry Filimonov (11:58):
Oh yeah.
Cyril Tovena (11:59):
You should not do that in the hot path. We probably have an issue here. So I think we're just going to ask the Assistant, right? I have MCP we should give.
Dmitry Filimonov (12:09):
Yeah, can I just fix it somehow?
Cyril Tovena (12:11):
Yeah, so create a PR to fix this issue and give us a link to the PR. Right, so now it's going to try to use MCP to figure it out. What's going on? Alright, or what's going on?
Dmitry Filimonov (12:29):
What are those orange buttons?
Cyril Tovena (12:30):
Yeah, there's a warning here. So yeah, MCP can be destructive, right? We don't know, it depends on where it's going to connect to. So what we do is actually ask for permission, whether you want to allow this one or not. I'm just going to YOLO it and just accept everything. This is like vibe troubleshooting. I think Raj coined that one by the way. So now it's looking at my code. I have rules to tell it where my code is, so it knows which repository to use and all of this. Now it's creating a branch and then next it's definitely going to try to fix this issue. It's probably time to go for a coffee, Dima.
Dmitry Filimonov (13:11):
Yeah. Do you drink coffee?
Cyril Tovena (13:14):
Yeah, I'm French. Yeah.
Dmitry Filimonov (13:16):
Alright, yeah, I'm more of a matcha guy myself these days.
Cyril Tovena (13:22):
Matcha — crazy. In the meantime, while it's doing that, if you want to find your previous conversation, you can go in there and just find conversation. So I was looking for conversation earlier about how to collect Kubernetes logs and as you can see it gave me some sort of snippets that explain how to set up your Alloy to send those events to Grafana Cloud. Alright, let me go back. If it did it, that would be awesome. Alright. Right, we have a PR — woo, that's pretty great. I think we should give a shout out to the Assistant.
Dmitry Filimonov (14:00):
Well let's check to make sure it's legit.
Cyril Tovena (14:02):
Let's check. You're right, it might actually be a hallucination so let's go for it.
Dmitry Filimonov (14:06):
Maybe you just made a rule to always give you a PR.
Cyril Tovena (14:10):
Alright, so first it gave me a PR with such a good description. For sure, if Mat is looking at this PR, he's going to say you didn't write that, and I didn't. Yeah. What do you think about the fix, though? Let's have a look.
Dmitry Filimonov (14:22):
Let's see. Yeah, this is a classic Go issue — instead of initializing these regular expressions once at init time, it looks like it is doing it in a loop and so it's probably doing it like millions of times and that's probably what's using all the CPU and giving us that chart that we saw earlier. Yeah, we should merge it. Oh yeah.
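Editor's note: the fix Dmitry describes, compiling a regular expression once instead of inside a loop, can be sketched in Go roughly as follows. This is a minimal illustration, not the code from the demo PR; the pattern and the function names `countValidSlow` and `countValidFast` are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// orderIDPattern is compiled once, when the package is initialized.
// The pattern is a hypothetical example, not the one from the demo.
var orderIDPattern = regexp.MustCompile(`^order-[0-9]+$`)

// countValidSlow recompiles the pattern for every input. This is the
// anti-pattern: compilation cost is paid on each loop iteration.
func countValidSlow(ids []string) int {
	n := 0
	for _, id := range ids {
		re := regexp.MustCompile(`^order-[0-9]+$`) // compiled on every iteration
		if re.MatchString(id) {
			n++
		}
	}
	return n
}

// countValidFast reuses the package-level compiled pattern, so the loop
// only pays for matching.
func countValidFast(ids []string) int {
	n := 0
	for _, id := range ids {
		if orderIDPattern.MatchString(id) {
			n++
		}
	}
	return n
}

func main() {
	ids := []string{"order-1", "order-42", "cart-7"}
	// Both versions return the same count; only the CPU cost differs.
	fmt.Println(countValidSlow(ids), countValidFast(ids)) // prints "2 2"
}
```

Both functions give the same answer; the difference only shows up in a CPU profile, where the slow version spends most of its time under `regexp.Compile`.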
Cyril Tovena (14:43):
So this is probably going to give us a big gain in performance and also save costs, and this is quite insane. In a couple of minutes with you guys, I was able to fix an issue based on profiling. If I go back to the dashboard, the overview dashboard here, I think there was another issue, Dima.
Dmitry Filimonov (15:02):
Oh yeah, the shopping cart looks red.
Cyril Tovena (15:05):
Looks red.
Dmitry Filimonov (15:06):
I don't know what that means though. I am not very familiar with this environment. I'm new on this team.
Cyril Tovena (15:12):
Yeah, I don't know either. And I want to show you how the Assistant can actually take a screenshot. So that's super fun. This is a new addition. So when you ask the Assistant about what you see visually, sometimes it's just going to get the dashboard, but it can also take a screenshot and try to infer what's going on. So this time I'm going to type, so what is this red panel that I see here? So again, it's always the same, it's going to use some tools. This time it's a visual tool to get a screenshot and it's going to tell us what that panel is about.
Dmitry Filimonov (15:51):
Can you maybe talk about the full screen view?
Cyril Tovena (15:53):
Yes, that's a good idea. So in the menu again, you have this full page view. When you click on it you actually get the chat. But as a full view, I really like this because I use it, you can actually resize the window and have just the Assistant next to your cursor or on another page.
Dmitry Filimonov (16:11):
Somewhere, maybe on a phone...
Cyril Tovena (16:12):
On your phone. It's pretty cool on the phone. So you have a full page and you can just ask questions directly there. So it's telling us the red dashboard is about a shopping cart SLO; something is going on and we should probably take a look at it. So we're going to use an investigation for that. Oh yeah, let's do it, right? Investigate this issue for us. So I'm going to click on the Deep Investigation just to trigger it in the background. Something that I don't think we've been clear about: everything that has happened in the conversation so far has been added to the investigation. So it knows about the SLO. So that's why I wasn't really prescriptive about what I'm trying to investigate, because it's already part of the conversation, and that's already inside the context.
Dmitry Filimonov (16:58):
Yeah, it also knows which page it's on.
Cyril Tovena (17:03):
So this is running in the background. I can go and do something else or I can just go and check out what's going on with.
Dmitry Filimonov (17:08):
Yeah go for coffee again.
Cyril Tovena (17:09):
I'll go for a coffee again. But you're going to be excited after.
Dmitry Filimonov (17:14):
Going to be very jittery.
Cyril Tovena (17:15):
Yeah, you want to tell us about what's going on on this page Dima? Let me make it bigger.
Dmitry Filimonov (17:20):
So at the top we have the description of the investigation, a name, there's a date. We also have this confidence number, and that one is pretty interesting. As the investigation progresses, the number will likely increase. Once the agents find evidence that supports some sort of a theory, it will go up. If they didn't find anything interesting, the confidence number will stay low and this is kind of a good way to tell if this investigation is even worth looking into. And so below that we have agent activity and this shows you in real time how the agents are exploring the problem. So, right now we see Loki specialists and they're looking into shopping cart error logs. Okay, that makes sense.
Cyril Tovena (18:11):
It's looking at Redis log too.
Dmitry Filimonov (18:13):
Oh yeah. And there's a lot of Prometheus ones and they're doing all of this in parallel and as they are finding evidence, they are starting more and more of these agents so that they kind of follow up on the previous findings.
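Editor's note: the pattern described here, specialists working in parallel and new rounds being started from earlier findings, can be sketched in Go as below. This is only an illustration of the fan-out idea under assumed types; `Finding`, `Specialist`, `runRound`, and `investigate` are hypothetical names and do not describe how Grafana Assistant Investigations is actually implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// Finding is a hypothetical piece of evidence. FollowUp, when set, is a
// new question that the evidence suggests looking into.
type Finding struct {
	Source   string
	Note     string
	FollowUp string
}

// Specialist is a hypothetical agent focused on one signal (logs, metrics...).
type Specialist struct {
	Name string
	Run  func(question string) []Finding
}

// runRound asks every specialist the same question concurrently.
// Each goroutine writes to its own slot, so the result order is stable.
func runRound(question string, specialists []Specialist) []Finding {
	results := make([][]Finding, len(specialists))
	var wg sync.WaitGroup
	for i, s := range specialists {
		wg.Add(1)
		go func(i int, s Specialist) {
			defer wg.Done()
			results[i] = s.Run(question)
		}(i, s)
	}
	wg.Wait()

	var all []Finding
	for _, r := range results {
		all = append(all, r...)
	}
	return all
}

// investigate keeps starting new rounds from follow-up questions until
// nothing new comes up or maxRounds is reached.
func investigate(question string, specialists []Specialist, maxRounds int) []Finding {
	var all []Finding
	questions := []string{question}
	for round := 0; round < maxRounds && len(questions) > 0; round++ {
		var next []string
		for _, q := range questions {
			for _, f := range runRound(q, specialists) {
				all = append(all, f)
				if f.FollowUp != "" {
					next = append(next, f.FollowUp)
				}
			}
		}
		questions = next
	}
	return all
}

func main() {
	// Stub specialists with canned answers, standing in for real agents.
	specialists := []Specialist{
		{Name: "logs", Run: func(q string) []Finding {
			if q == "shopping cart errors" {
				return []Finding{{Source: "logs", Note: "timeouts talking to redis", FollowUp: "redis latency"}}
			}
			return nil
		}},
		{Name: "metrics", Run: func(q string) []Finding {
			if q == "redis latency" {
				return []Finding{{Source: "metrics", Note: "p99 latency elevated"}}
			}
			return nil
		}},
	}
	for _, f := range investigate("shopping cart errors", specialists, 3) {
		fmt.Println(f.Source+":", f.Note)
	}
}
```

The `maxRounds` cap is a design choice of this sketch: without some bound, follow-up questions could keep spawning new rounds indefinitely.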
Cyril Tovena (18:26):
Yeah, actually I wanted to tell a story, Dima. Yesterday I was talking with Tom and then suddenly Sven was next to us and he got a page. And I'm not supposed to be actually showing that, but we have a culture of asking for forgiveness, doing and then asking for forgiveness. So Tom, I'm sorry, I'm going to show a real investigation. So this is a real investigation that happened yesterday — Sven actually got the page seven minutes after which the page I think triggered five minutes after. So almost instantly, he got a report and was able to figure out that this was just a flake. As you can see here, it's talking about false negatives. So I found that super helpful as you are receiving a page going into the incident and already getting some sort of report. I think this is super useful.
Dmitry Filimonov (19:10):
Yeah, we use this a lot. I use this in bed a lot. I get alerts, I can just check them. I don't have to go to my computer or anything. It's great.
Cyril Tovena (19:20):
Can we maybe talk about the tree view, Dima?
Dmitry Filimonov (19:24):
Yeah, so the tree view shows you all the different hypotheses and it's hierarchical, so if one follows from another, there's going to be a tree, and it will highlight the theories that ended up being right and it will dim the ones that are dead ends. And so that's a nice way to see what's working, what's not.
Cyril Tovena (19:48):
Yeah, let's look at maybe one that was done yesterday. We haven't talked about this yet, but basically this one was triggered by an alert. As you can see here, the source is a link. I'm going to click on it and it's going to take me to an alert page. So this one was triggered automatically. So far we've been showing how we trigger it manually, but you can also do it automatically: we're using webhooks to trigger that investigation. Alright, so this one is done. What's the Assistant here? What's the create dashboard?
Dmitry Filimonov (20:20):
Yeah, you can create a dashboard — that's very useful. Next time you have the same incident, you'll already have a dashboard. Assistant will be able to use it as well. Yeah, that's a good one. Let's do that.
Cyril Tovena (20:31):
Right? So we're going to create a dashboard from this and it's going to again use the context from the investigation. So that's going to be way faster than what you've seen at the lab because it doesn't need to search that much data. It already knows about all the metrics involved in that investigation. And so we're going to create that dashboard. We should talk about feedback. We have a way to give feedback, we love feedback.
Dmitry Filimonov (20:59):
All the changes we've had in Assistant and Investigations in the past months came from the feedback. I encourage everyone to leave as much feedback as you can. Except for you Jimmy, we know your IP address — stop sending us these mean emojis (starts with 127 by the way).
Cyril Tovena (21:21):
Alright, well that's it. We have a dashboard ready to go. Actually one shot that one. Sometimes it's going to make mistakes, you just keep following up. You can see the dashboard, remember. So you can ask more to refine that dashboard. I think that's it for the demo Dima.
Dmitry Filimonov (21:38):
Yeah, sounds good.
Cyril Tovena (21:43):
Do you have anything else?
Dmitry Filimonov (21:45):
I don't know.
Cyril Tovena (21:46):
Yeah...
Dmitry Filimonov (21:47):
We still have some time. Oh wait, what's that? Oh, Cyril, I think there's something happening with the Assistant.
Cyril Tovena (21:56):
Yeah, I can see it from here. It sounds like Sven is basically...
(21:59):
Got a Slack notification.
(22:01):
Yeah, Sven is saying that there's something wrong. Yes, he's not available, he's on the train.
Dmitry Filimonov (22:07):
Yeah, I'll just ask it here.
Cyril Tovena (22:09):
Yeah, can you ask the Assistant?
Dmitry Filimonov (22:11):
Yeah, yeah, yeah.
Cyril Tovena (22:13):
Right. This is the Slack integration that we are unveiling today that allows you to have the same experience but in Slack with the Assistant. So you can just talk to the Assistant and it's going to use the same kind of tools, give you a bit of the same answer, and everyone can use it in the thread as you are investigating together and trying to find issues. So now it's actually looking at the Assistant namespace and checking if all the pods are healthy. So we're going to wait for that. Hopefully everything is fine, by the way. This is like inception: the Assistant is investigating its own infrastructure, right? Everything is fine. This is good. Let me verify something. We're not going to send a message to Tom now. Do we have a dashboard for the Assistant? Yeah, I'm making mistakes but it's fine. It can even understand French. I should have asked in French, by the way. Searching dashboards. So it's going to give me links to the dashboards that are available for my team that I can use. This is great, but the cool thing is we can also ask it to show me that dashboard...
Dmitry Filimonov (23:26):
And it uses all the context from this thread. So if there's already a conversation and people already found some leads, it will use them as well.
Cyril Tovena (23:39):
Yeah, right? As you are discussing in this thread and talking about leads, all of this goes into the context and the Assistant will be able to take them into account and investigate. So this is a great integration to work together without actually leaving Slack when you have an incident, and it allows you to react faster. Alright, this one takes a bit of time. Here we go. We have a dashboard, we can look at it. Really cool. Alright, I think that's it. Yeah, we should go back to the slides.
Dmitry Filimonov (24:19):
Alright. Alright. And so our big vision is to build this agentic observability platform where you don't just monitor your services, but you can understand, predict what's going to happen, and act on issues proactively and autonomously. That's why we're building all these autonomous agents, not just one, but a lot of them that will help you with all of your observability needs. These are just some of the things that are coming in the Assistant in the near future. A couple of my favorite ones are things like automatically finding blind spots in your observability setup. We already do that in the investigations, but we could do more. It could suggest a new Alloy config that you could apply to start getting logs from services where you don't have logs. The other one is deeper integration with your code base. We have the GitHub MCP server, but we want to do more on that so that it runs faster and gives you better results. Customers tell us that they already see a lot of value from this and it's changing the way they work, and they're very excited about this vision. And Grafana Assistant is generally available. We announced that earlier today and it's going to cost $20 per user per month. Just to be clear, it's active users and...
Cyril Tovena (25:59):
Yeah, I wanted to say something on the $20 per month, if you think about it, if the Assistant saves you just one hour, it's probably already good enough for the $20 because I don't know a single SRE that costs $20 an hour.
Dmitry Filimonov (26:18):
Yeah, totally makes sense. And Investigations are going public preview, so please check that out.
Speakers

Cyril Tovena
Principal Software Engineer — Grafana Labs

Dmitry Filimonov
Principal Software Engineer — Grafana Labs