Create a better reliability culture with AI + Grafana Cloud IRM, SLO, and Alerting

How do you build a reliability culture that scales and becomes one that’s proactive, repeatable, and resilient? At Grafana Labs, we believe the answer lies in cultural maturity, enabled by a unified, observability-native IRM platform that connects people, processes, and tools.

In this session Grafana Labs Product Director Devin Cheevers and Senior Software Engineer Sonia Aguilar will share how Grafana Cloud is helping teams move beyond reactive incident response with a unified platform for IRM, SLOs, alerting, and post-incident reviews. To reduce noise, accelerate resolution, and turn incidents into continuous learning opportunities, you’ll see how we’re embedding AI across the incident lifecycle in Grafana Cloud.

Detect: Alerts now include more observability context via alert enrichment, automatically delivered to Slack and mobile. We’ll show how this – combined with new IRM webhook and workflow features – enables custom, context-rich workflows tailored to your tools and teams.
Respond: See how the latest AI functionality and deep integration with Grafana transform incident response – surfacing related incidents, graphs, logs, and even recommended actions, keeping humans in control while reducing cognitive load.
Learn: Post-incident reviews are no longer a chore with Service Center, which automates tagging, connects themes, and turns scattered incident notes into operational excellence reviews, all in one place.

This is IRM reimagined for an AI-native world and built directly into your observability stack, not bolted on. Join us to learn how the right systems and cultural mindset can transform how your teams build and operate reliable software.

Devin Cheevers (00:00):

Hi everyone, I'm Devin, one of the product directors here at Grafana.

Sonia Aguilar (00:04):

I'm Sonia, one of the software engineers here at Grafana.

Devin Cheevers (00:08):

I know it's the last session, so hopefully we're going to bring it home strong. We're focused today in this session on talking about culture and specifically how to improve your reliability culture. We're going to focus on some of the improvements we've made to our IRM, SLO, and Alerting products and also show some of the AI features in a different light, in the context of an alert. I know you've heard this a lot this week, you hear it a lot in the industry, but reliability is of course paramount for many people in this room. I know you use Grafana in part to build reliable systems, and when an outage happens it can have major financial consequences. I was doing some prep for this talk and obviously there's a wide range of estimates, but some estimates put the cost of an outage at up to $10,000 a minute.

(01:03):

So a 30-minute outage, that's a significant amount of money, and I know we're all dealing with a lot of competing forces in terms of trying to build more reliability. On the right side there, you've got things like how to improve your communication during an incident. How do you do that externally? Internally you're obviously trying to manage performance issues just generally and you want to minimize downtime at the same time. Obviously you want to keep your team productive, ship more often, avoid data loss. And a new wrinkle, I know you've heard a lot about it this week, but we are all experimenting I think, or many of us are experimenting, with code gen tools, and that in some cases is helping us move faster, but it's also potentially increasing the risk. There's code that's being written not by humans. Does that increase the risk of not having it easily debugged when an outage happens?
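To make the estimate above concrete, here is a minimal sketch of the arithmetic, assuming a flat per-minute rate. The $10,000 figure is the upper-end estimate quoted in the talk, not a measured value, and `outage_cost` is a hypothetical helper name.

```python
def outage_cost(cost_per_minute: float, minutes: float) -> float:
    """Estimated cost of an outage, assuming a flat per-minute rate."""
    return cost_per_minute * minutes


# Upper-end estimate from the talk: $10,000 per minute, for a 30-minute outage.
estimate = outage_cost(10_000, 30)
print(f"${estimate:,.0f}")  # $300,000
```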

(02:03):

And so we definitely think this move to AI Code Gen is increasing the importance of observability and then better and improved incident response. And it's not just something that you can solve with money. Money helps, don't get me wrong, but it's more than that. So it's really about culture and one of the ways that we think about culture. I know it can mean many different things, but in this context we're breaking it into three components. People, processes and tools and all of these three things obviously play together. They intermix and we really believe that tools can help you hire great people. Obviously they can help you keep great people. You want to have the right processes and processes influence tools and tools influence processes obviously. And ultimately we are a tool builder as a company and we want to build obviously great tools for you to help improve your culture. So I'm going to do a quick, I think four questions for the room. So hang with me. See how everyone's dealing with some of these challenges that come around incident response. So starting with alerts. So who here would raise their hand and say, really we have no alert gaps, alert noise is a non-issue, complete non-issue.

(03:26):

Okay, zero. Incident process. Maybe this one's a little easier. So when you're thinking about incident process, let's say I went to a random engineer on your team and asked them about creating an incident. I mean, it's not going to be quite as easy as creating a PR, including emotionally, but is it pretty close, in terms of a pull request versus an incident creation? Who feels like their organization is at a place where creating an incident is pretty routine, mundane, in terms of just getting it started? Who here would raise your hand and say yes? Alright, I got 1, 2, 3, 4. Okay, four people. What about learning? If I again went to an engineer on your team and asked them, would they say: we're doing pretty good, for most incidents we do some follow-up items, we have a PIR, a post-incident review type process, and we feel like we're actually learning from most of our incidents?

(04:23):

Who here feels like that's happening? Alright, that's probably the best in terms of count response, maybe 10. And then AI of course. Who here is using AI to do any sort of remediation automation currently? Okay, about two or three. So this matches pretty closely what we saw in our observability survey. We do an observability survey every year, and it's a fairly large survey. About a third of folks said the number one obstacle to incident response is alert fatigue. So that matched; no one raised their hand in terms of that being a solved problem. And then the next three follow pretty closely. About 17% said just not having an incident response process. 16% said coordination was the number one obstacle. And then about 15% pointed to the process or culture around learning and improving from incidents.

(05:20):

So we think, and a lens that we bring to this is, that fragmentation, in particular tool fragmentation, can amplify these problems. So during incidents, if your engineers are using multiple tools, they are switching context, they are using different UIs, different ergonomics, and that can amplify some of this pain. The post-incident review, actually doing it, can sometimes be harder if you're jumping between different tools, trying to move data between different tools. And then sometimes if you're using certain point solutions in this space, you may be paying for users that don't even use the product and maybe ultimately funding features that your teams barely touch. So we believe that observability-native IRM is uniquely positioned to help tackle some of these problems, and we're one of the few vendors that actually offer this today in the market. We have an end-to-end platform, so we have alerts, SLOs, and some features that Sonia is going to show in a minute that help you better detect issues. On the response side, with Grafana Cloud IRM, you can route to the correct team and you have strong scheduling and escalation tooling. And then we're going to show off some enrichment features that we're delivering this week. On the learning side and sort of the actual response side, we have Assistant Investigations, which we've seen a lot of. We have some workflow tools, and then we obviously have help around post-incident review processes and using data to improve your culture with things like Service Center. So with that, I'm going to hand it over to Sonia, who's going to show you a demo of these features and some of the highlights from the last six months of work we've done.

Sonia Aguilar (07:07):

Okay, so let's imagine a scenario. I'm the on-call engineer today and of course, right when I least expected it, we've got an outage. So, it's 4:00 AM and I just received a notification on my phone. Now I can show you, here we have the IRM app and here we have the alert that just came in. We are holding off on declaring the incident because we want to investigate a bit how serious the problem is. Let's take a look at the labels. Okay, it's the payment service, that's not good. Wait, here we have something new. We have some logs. This is a new feature. Now we have the ability to enrich notifications. We can enrich notifications by adding new labels, new annotations, or by querying metrics, logs, or external services. For this demo, I created an enrichment for this particular rule, adding these logs. This means that once the alert starts firing, before the notification is sent, we enrich this notification with this additional context, these logs.
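The enrichment step described above (add context after the alert fires, before the notification goes out) can be sketched roughly as follows. This is an illustrative model only, not Grafana's implementation or API: `Alert`, `add_annotation`, `attach_logs`, `notify`, and the fake log source are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Alert:
    labels: dict[str, str]
    annotations: dict[str, str] = field(default_factory=dict)
    log_lines: list[str] = field(default_factory=list)


# An enricher takes an alert and returns it with extra context attached.
Enricher = Callable[[Alert], Alert]


def add_annotation(key: str, value: str) -> Enricher:
    """Build an enricher that sets one annotation on the alert."""
    def enrich(alert: Alert) -> Alert:
        alert.annotations[key] = value
        return alert
    return enrich


def attach_logs(fetch_logs: Callable[[dict[str, str]], list[str]], limit: int = 5) -> Enricher:
    """Build an enricher that queries a log source by label and keeps `limit` lines."""
    def enrich(alert: Alert) -> Alert:
        alert.log_lines.extend(fetch_logs(alert.labels)[:limit])
        return alert
    return enrich


def notify(alert: Alert, enrichers: list[Enricher], send: Callable[[Alert], None]) -> None:
    """Run every enricher first, then send the notification."""
    for enrich in enrichers:
        alert = enrich(alert)
    send(alert)


# Usage with a stand-in log source (hypothetical data, not real log output).
def fake_logs(labels: dict[str, str]) -> list[str]:
    return [f"{labels['service']}: upstream timeout #{i}" for i in range(10)]


sent: list[Alert] = []
notify(
    Alert(labels={"service": "payment"}),
    [add_annotation("team", "payments"), attach_logs(fake_logs, limit=3)],
    sent.append,
)
```

Keeping each enricher a small function makes the order explicit: the notification is only sent once all context has been attached.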

(08:26):

So let's take a look at these logs. Okay. As you can see here, these logs are telling me that the payment service is experiencing several failures with Stripe. And this is impacting multiple regions. This information is critical for me to determine that I want to declare an incident. But before that, let me show you something else. If we scroll down here, here we have a link. This is an enrichment and this links to an investigation that already started in the background once the alert started firing. What's really powerful here is that by the time I'm officially declaring the incident, I already have a head start on the investigation and this can save us critical minutes when customers are impacted. So I'm going to go ahead and declare the incident.

(09:33):

Let me copy the description from my notes, okay? And we are going to add a customers-affected label, and it's critical, and... incident created. Okay, I'm going to grab a coffee and open my laptop. So let's switch to the web browser. Okay, here we have the incident that we just declared. There's a bunch of things that we automatically create for you in order to help out with your workflows. For example, we create a post-incident review document that at this point is just a skeleton, but it's going to be filled out as we progress through the resolution. We also create a Google Meet room in order to have a virtual room. And lastly, we create a Slack channel for responders to share their findings. One important thing when running incidents is defining the roles, because roles help clarify responsibilities. The whole team knows who is responsible for what. So I'm going to assign myself as commander and I'm going to assign Brandon, who I know is very familiar with this service. And you can page people directly. I'm going to page Dimitry, and you can also page teams. I'm going to page this team. When you are paging a team, you're actually paging the person who is on call in this team, and these paged people are going to start investigating the problem.
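The point that paging a team really means paging whoever is on call for it can be illustrated with a small sketch. It assumes a plain round-robin rotation with fixed-length shifts; real schedules typically add overrides and escalation chains, which are left out here. The function name and the people in the rotation are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def on_call(rotation: list[str], start: datetime, shift: timedelta, now: datetime) -> str:
    """Return who is on call at `now` for a simple round-robin rotation."""
    if now < start:
        raise ValueError("rotation has not started yet")
    # Dividing two timedeltas with // gives the whole number of shifts elapsed.
    shifts_elapsed = (now - start) // shift
    return rotation[shifts_elapsed % len(rotation)]


# Hypothetical weekly rotation that started on Monday 2025-01-06 (UTC).
rotation = ["alice", "bob", "carol"]
start = datetime(2025, 1, 6, tzinfo=timezone.utc)
page_time = datetime(2025, 1, 14, 4, 0, tzinfo=timezone.utc)  # 4:00 AM in week two

print(on_call(rotation, start, timedelta(weeks=1), page_time))  # bob
```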

(11:35):

Let's go to the timeline. The timeline is super useful for new joiners to get a picture of what's going on in the incident, and this timeline is connected with the Slack channel that we already created. So all the messages that you put here in the timeline are pulled into the Slack channel, and at the same time, when someone reacts to a message in the Slack channel with the robot emoji, that message is pulled into the timeline. We also have the ability to track status updates. Status updates are designed for stakeholders to be informed about the important milestones, important decisions, and important changes during the incident. So let's make sure we add our important findings so far, and copy from the notes.
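The Slack-to-timeline direction of that sync can be sketched as a simple filter on reactions. This is only a model of the behavior described in the demo, not how Grafana Cloud IRM implements it: the classes are hypothetical, and `robot_face` is assumed to be the short name of the robot emoji.

```python
from dataclasses import dataclass, field

ROBOT = "robot_face"  # assumed short name for the robot emoji


@dataclass
class SlackMessage:
    user: str
    text: str
    reactions: set[str] = field(default_factory=set)


@dataclass
class Timeline:
    entries: list[str] = field(default_factory=list)

    def add(self, author: str, text: str) -> None:
        entry = f"{author}: {text}"
        # Skip duplicates so processing the same reaction twice is harmless.
        if entry not in self.entries:
            self.entries.append(entry)


def sync_from_slack(messages: list[SlackMessage], timeline: Timeline) -> None:
    """Copy only the messages someone marked with the robot emoji."""
    for message in messages:
        if ROBOT in message.reactions:
            timeline.add(message.user, message.text)


# Usage with made-up channel messages.
channel = [
    SlackMessage("sonia", "Stripe timeouts across multiple regions", {ROBOT}),
    SlackMessage("brandon", "brb, coffee"),
]
timeline = Timeline()
sync_from_slack(channel, timeline)
sync_from_slack(channel, timeline)  # running it again adds nothing new
```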

(12:43):

Okay, I'm not going to change the status or the severity. Nice. Now do you remember that I mentioned a link to an investigation that started once the alert started firing? Let's go to the links and context tab. Here we have the link to the investigation. Let's take a look at this investigation. Nice. As you can see here, the investigation is completed. This is good because we are very nervous and we don't have to wait for it. And this investigation is going to help us understand the root causes, and it's giving us some recommendations. In this case, as you can see, it confirmed a Stripe payment provider timeout cascade, and it's giving us some recommendations like implementing a circuit breaker. You can also click here on the Assistant button if you want to follow up with more questions regarding this investigation, and you can take a look at the details of the investigation.

(13:56):

Okay, so let's imagine that we have fixed the issue, we understand the root cause, we have some recommendations, but before coming back to our incident, let me show you a new feature. Now if you press command plus I, you can see my cursor now is a cross. We can do a screenshot of this because we think it's important and now we select our incident. This one and this screenshot is going to be attached to our timeline. This is a new feature. Nice. Let's come back to our incident. And here you can see the screenshot that we just took. Nice. So time to resolve the incident, right? But we believe that it's not just about solving the incident but also making sure that things improve.

(15:01):

We can create tasks. Tasks allow us to convert this incident into something actionable. So let's create a task, and you can convert this task to a GitHub issue if you want to track it afterwards. Alright, time to resolve the incident. Okay, so here we have a cool feature: an OpenAI integration that takes the timeline as context and generates a summary for us. As I mentioned before, it's 4:00 AM, I'm tired, so I'm going to use this feature. Live demo, we have errors sometimes, but it's going to work. Nice. Let's append. You can edit of course, but it looks pretty good, so I'm going to resolve it. Incident resolved. Nice. Now let's change. Let's imagine another scenario. It's a week later and we are on the team that owns this payment service and we are in the operational review.

(16:27):

Let me show you Service Center. As I mentioned before, we believe that it's not just about solving incidents but also learning from them, understanding trends and iterating every week. Here you can see the list of all the services that we have in this system. Let's take a look at the payment service. On this page you can see at a glance how this service is performing. You can also see who is on call and how many incidents happened within the last seven days. You can also see how the SLOs are performing. If something is deteriorating, we have some links to some useful dashboards regarding the SLOs. You can also see here the alerting activity for the alerts that are related to the service and also the activity of the alert groups related to the service. Speaking of SLOs, let me show you a new feature, because now we have the ability to create SLO reports.

(17:43):

You can select the SLOs, then the time window. I'm going to select weekly, and here we have the SLO report. So we believe that Service Center and SLO reports are great tools for operational excellence reviews with your team. Now let's talk about alert fatigue, because we know that alert fatigue is one of the main reasons for missed incidents and delayed responses. We highly recommend moving most of your alerts to SLOs in order to reduce this alert fatigue and also focus on what really matters. But we also know that one of the reasons for alert fatigue is flapping alerts. This is the central alert state history, and here you can see at a glance the activity of your alerts. For this demo, I created this flapping alert. As you can see, it is changing state constantly, and we added this AI button. This AI button is an integration with Assistant. It's going to take this list of events. We are doing some prompt engineering and we are injecting this into Assistant, and Assistant is going to analyze it, detect patterns, and give us some recommendations.
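A flapping alert, as described above, is one whose state keeps changing. In the demo the pattern is found by Assistant from the state history; the sketch below is a much simpler stand-in that just counts state changes in a recent window. The function names, the window, and the threshold are assumptions chosen for illustration, and the events are made up.

```python
from datetime import datetime, timedelta

Event = tuple[datetime, str]  # (timestamp, state), for example "Normal" or "Alerting"


def count_transitions(events: list[Event]) -> int:
    """Count how many times the state changes between consecutive events."""
    return sum(1 for (_, prev), (_, cur) in zip(events, events[1:]) if prev != cur)


def is_flapping(events: list[Event], window: timedelta, threshold: int) -> bool:
    """True if the state changed at least `threshold` times within the last `window`.

    `events` must be sorted by timestamp, oldest first.
    """
    if not events:
        return False
    cutoff = events[-1][0] - window
    recent = [event for event in events if event[0] >= cutoff]
    return count_transitions(recent) >= threshold


# Made-up history: one alert flips state every five minutes, another fires once.
base = datetime(2025, 1, 1, 4, 0)
flappy = [(base + timedelta(minutes=5 * i), "Alerting" if i % 2 == 0 else "Normal")
          for i in range(12)]
steady = [(base, "Normal"), (base + timedelta(minutes=30), "Alerting")]

print(is_flapping(flappy, timedelta(hours=1), threshold=6))  # True
print(is_flapping(steady, timedelta(hours=1), threshold=6))  # False
```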

(19:22):

Let's see what happens. As you can see here, Assistant has detected this flapping alert behavior and it's giving us some investigation priorities, some insights, and some next steps. So in a few seconds we can detect these kinds of patterns and we can also get actionable guidance to reduce this alert fatigue and focus on what really matters. Speaking of AI buttons, let me show you the last feature. We know that one of the things our users struggle with the most is creating notification templates, because it requires Go templating knowledge and sometimes this can be very hard. So we decided to add this AI button that, when you click on it, lets you create a notification template by being declarative instead of imperative. So you can ask, I want a message like this. You can type in here or you can pick one of the provided examples. I'm going to pick this one, a template that produces a message like this, and you click generate template. And here we have the definition and the preview. We believe that this AI button is going to help our users reduce time and complexity when creating notification templates. That's the demo. Back to you, Devin.

Devin Cheevers (21:00):

Nice job, Sonia. Alright, so I'm just going to do a quick recap of what we saw. It's the alert lifecycle through Grafana Cloud. So the alert fired, and you saw it get enriched with log lines and also the Assistant Investigation. We think that's an obvious way that observability-native IRM is helping you. The alert was routed to the right person, in this case Sonia, and showed up on her phone. IRM, our offering, has really rich, powerful routing logic, and then the AI investigation was triggered automatically. Again, an example of something that's better with observability-native IRM. Then we debugged it, or Sonia debugged it, from her phone, declared the incident, and then did a bunch of coordination. As we talked about earlier, coordination is a real pain point for many teams. It's really beneficial if that coordination is happening in the tool where your engineers are already debugging.

(21:56):

So we showed you the status update feature. We showed you debugging with AI, and we also showed you using roles to help coordinate. Then, on resolve, that PIR doc was created automatically, and then we showed you using tasks for follow-up items that you can link with GitHub. Then, switching from firefighting to operational reviews, or improving from what you've seen, we showed off Service Center. That's a GA feature. We showed you how you can use Service Center to review trends. We also used Grafana Assistant to run an analysis of your alerts to tune them and improve them. Both of those things, we feel, work really well when you're doing them on top of your observability platform. So that was the demo. That was a bit of the setup and framing. I'm going to move a little bit into roadmap and some action items for everyone in the room.

(22:45):

So just some highlights, not everything that's coming. But on the IRM side, we're really focused in the near term on workflows. So today you can use webhooks within Grafana IRM, we're ready to integrate it with third party tools. We want to make a far richer offering around workflows that's more flexible. Obviously that could benefit also any agentic workflows. The Grafana AI investigations feature, we're moving rapidly hoping to GA it soon and continue to invest there and integrate it into Grafana Cloud IRM. And then on alerting. So alerting, we know many of you use Grafana alerting. Maybe you're not using IRM yet. Maybe you're not using AI. We continue to invest in Grafana alerting, including coming soon improvements to how we're helping you triage your alerts. So yeah, sort of three takeaways. We really do believe that building a great engineering culture includes, I'm sorry, building effective IRM is a cultural problem.

(23:43):

Your tooling, we believe, if it's as close to your observability data as possible, can really help. And a design philosophy around AI is obviously, well, not obviously, it's keeping humans in the loop. That's a fundamental principle that we're using within Grafana development. What to do next? So we do have an observability maturity framework that you can see on our website, and you can take a questionnaire and understand where you're at. If you're a little bit more on the earlier stage of your observability journey, what we would call the reactive stage, I highly encourage you to check out our SLO product. It's included in Grafana Cloud. Potentially get a beachhead team to try that out. Grafana Assistant is fantastic in terms of helping teams adopt observability best practices and use Grafana. If you're a little bit more towards the proactive stage, maybe you've got SLOs set up, maybe you're doing some of the early stages of incident processes, consider operational reviews, consider using Service Center. And then yeah, the PIR process. We've found a lot of value internally at Grafana and with our customers in improving that. Finally, on the systematic stage, we do have an MCP server out. We do have Grafana Assistant and we have Assistant Investigations. So we'd love to work with you on using that as you move to more agentic workflows. And I don't know if we have time for Q&A, but with that, thanks so much.

Speakers

Devin Cheevers
Director of Product — Grafana Labs
Sonia Aguilar Peiron
Senior Software Engineer — Grafana Labs
