Incident Response Management (IRM)

Grafana Cloud Incident Response Management (IRM) is a fully managed solution that streamlines your incident workflow. It integrates on-call scheduling, notifications, and incident management, empowering your team to detect issues, escalate alerts, automate responses, and gain actionable insights to resolve incidents quickly and effectively.

Products

Grafana IRM is built on two core products: Grafana OnCall and Grafana Incident.

Note

Grafana Cloud IRM is a paid add-on, billed based on monthly active IRM users. For more details, refer to Understand your Grafana Cloud IRM invoice.

Grafana OnCall

Grafana OnCall is a developer-friendly on-call management tool designed to automate escalations, define alert rules, and integrate with your existing alerting sources and third-party tools. With OnCall, you can create schedules, notify the right teams, and declare incidents directly from alerts.

To learn more, refer to the Grafana OnCall documentation.

Grafana Incident

Grafana Incident simplifies incident response by helping you define roles, automate task assignments, and create collaboration spaces. It integrates with popular tools like GitHub, Slack, and Google Workspace to streamline your incident response processes.

To learn more, refer to the Grafana Incident documentation.

Incident response management in Grafana Cloud

Grafana is at the heart of your incident response management. With Grafana Alerting, SLOs, and Machine Learning in Grafana Cloud, you can further enhance your incident response by integrating additional Grafana Cloud products and features designed to improve reliability and streamline operations.

Grafana SLO (Service Level Objectives)

SLOs help you measure service quality, improve system reliability, and make data-driven decisions. Use SLOs to collect data on the reliability of your systems over time and provide better service to your customers.

To learn more, refer to the Grafana SLO documentation.

Grafana Alerting

Grafana Alerting brings Grafana-managed alerts and alerts from Mimir- and Loki-compatible data sources together in one place. You can set up Alerting to integrate with Grafana OnCall and Grafana Incident so your team can identify and resolve issues more quickly.

To learn more, refer to the Grafana Alerting documentation.

Grafana Machine Learning

Grafana IRM incorporates AI and machine learning capabilities to enhance decision-making and automate proactive incident responses. These tools help you predict issues, improve incident workflows, and reduce time to resolution.

To learn more, refer to the Grafana Machine Learning documentation.

How do they work together?

When things go wrong, Grafana dashboards are the go-to place for teams to find answers in metrics, logs, and traces, and the place they return to at the end when putting together a postmortem. Grafana sits at the heart of incident response management. With Grafana SLO, Grafana Alerting, Grafana Incident, Grafana OnCall, and Grafana Machine Learning on Grafana Cloud, you can integrate IRM into the workflows your team already uses.

Use Grafana IRM to proactively detect issues, keep your services healthy, and respond to incidents. Apply machine learning features throughout your IRM workflows to create alerts, sift through metadata during an incident, and capture every detail in your post-incident review with Incident Auto-Summary.

Figure: Detect, respond, learn diagram that illustrates how Grafana IRM products work together.

Detect

SLOs define how reliable your service should be and let you measure whether it meets that target. By providing key reliability targets in the form of SLIs and SLOs, you set stakeholder expectations and ensure transparency. Combine SLOs with Grafana Alerting to generate alerts and send notifications, giving engineers an efficient way to monitor, triage, and respond to issues within their services.
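To make the relationship between an SLI, an SLO target, and an alert concrete, here is a minimal Python sketch of the general error-budget arithmetic commonly used with SLOs. It is not Grafana SLO's implementation: the `SloWindow` class, the function names, and the alert threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Event counts observed over one evaluation window (hypothetical shape)."""
    good_events: int
    total_events: int


def sli(window: SloWindow) -> float:
    """Service level indicator: fraction of events that met the success criterion."""
    if window.total_events == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return window.good_events / window.total_events


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - target).

    A value of 1.0 means the budget would be used up exactly over the SLO
    period; values above 1.0 mean it would run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    observed_error_rate = 1.0 - sli(window)
    return observed_error_rate / error_budget


def should_alert(window: SloWindow, slo_target: float, threshold: float) -> bool:
    """Fire when the budget is burning at or above the chosen threshold."""
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    # 20 failed requests out of 10,000 against a 99.9% target.
    window = SloWindow(good_events=9_980, total_events=10_000)
    print(f"SLI: {sli(window):.4f}")
    print(f"Burn rate: {burn_rate(window, slo_target=0.999):.2f}")
    print(f"Alert: {should_alert(window, slo_target=0.999, threshold=1.5)}")
```

In this sketch the alert condition depends on how fast the error budget is being consumed rather than on a raw error count, which is one common way to tie alert severity to the reliability target.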

Standard alerts and alert notifications are valuable indicators of issues during the triage process, giving engineers the information they need to understand what is going on in their system or service. Paired with SLOs, an SLO alert notifies teams of an issue and provides context on runtime behavior to aid in the triage process.

Dashboards and insights help you monitor the status of your SLOs and alerts and quickly identify crucial operational details.

Respond

When an alert is generated, leverage Grafana OnCall to ensure a swift and effective response. Let your on-call rotations and automated escalations route alerts to the right teams and notify on-call engineers using their preferred notification methods. With an intuitive API and versatile integration capabilities, the developer-first workflow allows for highly customizable configurations tailored to any use case.
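The routing idea described above (a rotation decides who is on call, and an escalation chain decides who gets notified as time passes) can be sketched in a few lines of Python. This is a simplified, hypothetical model for illustration only: it does not use the Grafana OnCall API, and `Rotation`, `EscalationStep`, `who_to_notify`, and the `"on_call"` marker are invented names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Rotation:
    """A fixed-length rotation that cycles through its members (hypothetical)."""
    members: tuple[str, ...]
    start: datetime
    shift_length: timedelta

    def on_call_at(self, when: datetime) -> str:
        # Number of whole shifts elapsed since the rotation started.
        shift_index = (when - self.start) // self.shift_length
        return self.members[shift_index % len(self.members)]


@dataclass(frozen=True)
class EscalationStep:
    """Notify `target` once the alert has been unacknowledged for `wait`."""
    wait: timedelta
    target: str  # "on_call" means: whoever the rotation says is on call


def who_to_notify(
    steps: list[EscalationStep],
    rotation: Rotation,
    fired_at: datetime,
    now: datetime,
    acknowledged: bool,
) -> list[str]:
    """Return everyone who should have been notified by `now`."""
    if acknowledged:
        return []  # an acknowledged alert stops escalating
    notified = []
    for step in steps:
        if now - fired_at >= step.wait:
            if step.target == "on_call":
                notified.append(rotation.on_call_at(now))
            else:
                notified.append(step.target)
    return notified


if __name__ == "__main__":
    rotation = Rotation(
        members=("alice", "bob"),
        start=datetime(2024, 1, 1, tzinfo=timezone.utc),
        shift_length=timedelta(days=7),
    )
    steps = [
        EscalationStep(wait=timedelta(0), target="on_call"),
        EscalationStep(wait=timedelta(minutes=15), target="team-lead"),
    ]
    fired_at = datetime(2024, 1, 9, 12, 0, tzinfo=timezone.utc)
    later = fired_at + timedelta(minutes=20)
    print(who_to_notify(steps, rotation, fired_at, later, acknowledged=False))
```

Keeping the rotation and the escalation chain as separate pieces of data means either can change (a new team member, a longer wait before escalating) without touching the other.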

After an issue is identified, Grafana Incident makes it easy to create incidents from alerts and immediately begin your response process. Grafana Incident provides a centralized platform for managing incidents so you can promptly assign roles, use built-in task management, and automate routine tasks such as creating an incident channel or a virtual meeting space for collaboration. Integrations with familiar tools like GitHub, Slack, and Google Workspace enhance communication and coordination throughout the incident resolution process.

A structured approach to incident management alleviates the stress and burden of incident response, so responders can focus on resolving the issue without additional distractions, such as communicating with stakeholders.

Learn

After an incident is resolved, Grafana Incident provides a centralized platform for a thorough review of incident details. Pull in relevant information from GitHub issues, Slack messages, and other integrated tools to extract actionable insights. Then, leverage ML-powered auto-summary generation to alleviate the post-incident review burden, ensuring no core findings are overlooked.

Grafana’s analytics capabilities give teams deeper insight into incident data, so you can extract lessons from each incident and refine your overall IRM strategy. Dashboards and insights remain essential for monitoring the status of SLOs and alerts, enabling teams to quickly identify operational details and make informed decisions during the post-incident review process. Applying what you learn from each incident improves the overall resilience of your systems and services over time.