Grafana Incident: new tools for faster, simpler incident response

6 May, 2024 5 min

At Grafana Labs, we’re committed to helping teams dramatically improve how they manage and respond to incidents. Through Grafana Incident Response & Management (IRM), we provide tools to empower teams, streamline processes, and enhance the effectiveness of incident management strategies—and we’re constantly looking for ways to make our solution even better.

A central component of this strategy is Grafana Incident, which removes the toilsome tasks of incident management so you can focus on actually fixing the issue faster. We introduced Grafana Incident two years ago, and we’ve made a ton of improvements since then. In this blog post, we want to highlight some of the recent innovations that have reshaped our approach to incident management and provide a quick preview of what’s next so your team is well-equipped for the challenges ahead.

Rapid response with Sift investigations

Your initial response to an incident can play a huge role in how quickly you’re able to resolve the problem. Sift, a diagnostic assistant in Grafana Cloud, uses Grafana Machine Learning to quickly filter through data, identifying and prioritizing issues in real time. For instance, during a major system disruption, Sift can isolate and diagnose error patterns or Kubernetes container failures, so teams can start remediation without delay. This not only speeds up response times but also reduces system downtime. Explore Sift investigations.

Streamlined incident collaboration via Grafana OnCall

Grafana IRM is a suite of services (Grafana Alerting, Grafana Incident, Grafana OnCall, Grafana SLOs), and we’ve been working hard to better integrate the tooling. For example, we’re improving team coordination during incidents by pairing Grafana Incident with Grafana OnCall.

This pairing allows relevant team members to be automatically notified based on their availability and preferred contact methods. When a critical issue arises, the right personnel, such as the on-call database engineers, are quickly brought into the loop, which makes for a more efficient resolution process. Learn more about adding participants.

AI-driven summaries with OpenAI integration

The OpenAI integration automatically generates concise, actionable summaries of incidents. These summaries not only capture the essence of the incident quickly but also help teams ensure no critical details are overlooked when documenting and communicating incident impacts. Discover the OpenAI integration.

Communication and reporting enhancements

Clear, concise communication is key during incidents. The last thing you need when you’re trying to get your service back up and running is to have an important message or handoff fall through the cracks. That’s why we’ve made enhancements to improve communication flows during incidents. 

Features like Slack Attachment Uploads allow teams to append critical messages and files directly to the incident timeline by simply reacting with a robot emoji. This ensures all pertinent information is centralized and accessible, streamlining the documentation process. Streamline documentation with Slack.

Moreover, the ability to declare incidents directly from Grafana OnCall alerts or any Grafana panel has been a game-changer, enabling teams to initiate responses instantly upon detection of anomalies. This rapid declaration capability ensures that potential issues are addressed before they escalate, maintaining operational continuity and minimizing impact.

Anticipated innovations on the horizon

As Grafana Incident continues to evolve, our roadmap includes several exciting features designed to enhance how teams manage and respond to incidents. Here are just a few of the updates we’re currently working on:

  • Incident types and private incidents. We are expanding our capabilities to include private incidents, which enhance security and confidentiality. This will allow teams to handle sensitive data more securely and support incidents that require privacy, such as security breaches. These private incidents can be tailored with specific visibility settings to ensure that sensitive information remains confidential.
  • Incident phases. Customizable incident phases enable teams to define and manage every stage of an incident according to their specific operational procedures. This flexibility ensures that our tool aligns with your internal processes, making incident management more intuitive and integrated with your existing workflows.
  • Single Slack App for Grafana OnCall and Grafana Incident. To streamline the user experience, we’re merging our Slack applications into a single Slack app for both tools. This integration will simplify the setup and maintenance of our tools within your Slack environment, reducing complexity and enhancing the flow of information between teams.
  • Flow Labels. We plan to add automatic label synchronization from Grafana OnCall alert groups to incidents. This feature will help maintain continuity and context throughout the incident lifecycle, ensuring that all relevant data is carried over automatically when an alert escalates into an incident.
  • Incident features in Grafana IRM mobile app. Our mobile capabilities are expanding to include comprehensive incident management features, allowing responders to manage incidents directly from their mobile devices. This ensures that team members can respond promptly to incidents, even when away from their desks, which is crucial for maintaining high availability and quick responses.
  • Automatically declare an incident as part of a Grafana OnCall escalation chain. We plan to implement the ability to automatically declare incidents as part of the escalation process within Grafana OnCall. This will provide a seamless transition from alert to incident management, helping teams mobilize more quickly and with greater context, reducing response times and potential damage.

We’re excited for what’s to come, but in the meantime, we invite you to explore these new features that are currently available and integrate them into your incident management strategy. Stay tuned for more updates and insights as we continue to evolve our platform amid an ever-changing technology landscape. 
