Best practices for incidents

Incidents represent significant events that require coordinated response. Effective incident management improves response times and enables learning from past events.

Understand incidents

Before creating and managing incidents in IRM, understand what they are and when to use them.

What is an incident

An incident is a formal record of a significant event affecting your services. Incidents provide a coordination point for response activities and a record for post-incident review.

Incidents differ from alert groups:

Aspect	Alert group	Incident
Creation	Automatic from alerts	Manual or via escalation step
Purpose	Group related alerts	Coordinate response to significant events
Scope	Technical signals	Business impact and coordination
Lifecycle	Firing → Acknowledged → Resolved	Customizable status progression
Communication	Notifications to responders	Stakeholder updates, dedicated channels

When to use incidents

Create incidents for events that need:

Coordinated response across multiple people or teams.
Formal tracking for compliance or reporting.
Stakeholder communication beyond the on-call team.
Post-incident review and learning.

Not every alert needs an incident. Use alert groups for routine alerts that on-call can handle independently.

Relationship to alert groups

Incidents can have alert groups attached to them:

Alert groups provide technical context for the incident.
Up to 5 alert groups can be attached per incident.
When an incident is declared automatically, labels carry over from the alert group to the incident.

For large-scale incidents, consolidate related alerts into fewer alert groups before attaching them, so the most relevant groups fit within the attachment limit.
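The attachment limit can be illustrated with a minimal sketch. It assumes a hypothetical AlertGroup record carrying an alert count and chooses which groups to attach, largest first. The names AlertGroup, alert_count, and pick_groups_to_attach are illustrative only and are not part of IRM.

```python
from dataclasses import dataclass

MAX_ATTACHED_GROUPS = 5  # limit stated in the list above


@dataclass(frozen=True)
class AlertGroup:
    """Hypothetical stand-in for an alert group; not an IRM type."""
    name: str
    alert_count: int


def pick_groups_to_attach(groups: list[AlertGroup]) -> list[AlertGroup]:
    """Return at most MAX_ATTACHED_GROUPS groups, those with the most alerts first."""
    ranked = sorted(groups, key=lambda g: g.alert_count, reverse=True)
    return ranked[:MAX_ATTACHED_GROUPS]


# Seven candidate groups; only five can be attached.
groups = [AlertGroup(f"group-{i}", alert_count=i) for i in range(1, 8)]
selected = pick_groups_to_attach(groups)
print([g.name for g in selected])
```

Ranking by alert count is only one possible heuristic. Ranking by severity or by affected service may suit your process better.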

Creating incidents

Incidents can be created automatically through escalation chains or manually by responders.

Automatic creation

Use the Declare incident escalation step for automatic creation:

1. Notify on-call
2. Wait 10 minutes
3. Declare incident (severity: major)
4. Notify incident commander

Best for:

Critical alerts that always warrant incidents.
Alerts matching specific patterns (high severity, production impact).
Standardizing incident creation across teams.

Limitation: Incident declaration only works on non-default routes. Configure specific routes for alerts that should create incidents.
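The limitation above lends itself to a simple pre-flight check. The following sketch models routes as plain Python data and flags any default route that contains a declare-incident step. The Route class, the step strings, and declare_step_problems are hypothetical; they do not mirror the IRM API or its configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Hypothetical model of a route and its escalation steps."""
    name: str
    is_default: bool
    steps: list[str] = field(default_factory=list)


def declare_step_problems(routes: list[Route]) -> list[str]:
    """List default routes that contain a declare-incident step."""
    return [
        f"route '{r.name}': declare-incident step on the default route"
        for r in routes
        if r.is_default and any(s.startswith("declare_incident") for s in r.steps)
    ]


routes = [
    # Mirrors the four-step example chain above, on a specific route.
    Route("critical-prod", is_default=False, steps=[
        "notify_on_call", "wait_10m", "declare_incident:major",
        "notify_incident_commander",
    ]),
    # Misconfigured: the declare step sits on the default route.
    Route("default", is_default=True,
          steps=["notify_on_call", "declare_incident:major"]),
]
problems = declare_step_problems(routes)
print(problems)
```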

Manual creation

Create incidents manually when:

Multiple related alert groups need coordination.
Customers report issues that monitoring didn’t detect.
Security incidents require formal tracking.
Business events affect operations.

Attaching alert groups

When you attach alert groups to incidents:

Related alerts are correlated in one place.
Technical context is preserved with the incident.
You can track which alerts contributed to the incident.

Attach alert groups during incident creation or add them later as you identify related alerts.

Configure your incident workflow

Customize severity levels, statuses, and labels to match your organization’s process.

Severity levels

IRM lets you define custom severity levels. Design them based on your SLAs and team capacity.

Refer to the following configuration example:

Severity	Response time	Examples
Critical	Immediate	Complete outage, data loss
Major	Within 15 min	Significant degradation
Minor	Within 1 hour	Limited impact
Warning	Next business day	Potential issues

This is just an example. Create severity levels that reflect your operational requirements.
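To show how response targets like these translate into deadlines, here is a minimal sketch that maps each example severity to a time budget. RESPONSE_TARGETS and respond_by are hypothetical names, and "next business day" is approximated as 24 hours, which is an assumption made for simplicity.

```python
from datetime import datetime, timedelta, timezone

# Targets copied from the example table. "Warning" is approximated as
# 24 hours; the table says "next business day", so adjust for your calendar.
RESPONSE_TARGETS = {
    "critical": timedelta(0),
    "major": timedelta(minutes=15),
    "minor": timedelta(hours=1),
    "warning": timedelta(hours=24),
}


def respond_by(severity: str, declared_at: datetime) -> datetime:
    """Return the time by which a response is expected for this severity."""
    key = severity.lower()
    if key not in RESPONSE_TARGETS:
        raise ValueError(f"unknown severity: {severity!r}")
    return declared_at + RESPONSE_TARGETS[key]


declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deadline = respond_by("Major", declared)
print(deadline.isoformat())
```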

Status progression

Define statuses that reflect your incident management process.

For example:

Declared: Incident created, initial response starting.
Acknowledged: Responders engaged, investigation underway.
Mitigated: Impact reduced, full resolution pending.
Resolved: Incident fully resolved.
Closed: Post-incident activities complete.

Design statuses based on your team’s workflow and reporting needs.
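A status progression can be reasoned about as a small state machine. The sketch below uses the example statuses above; which transitions are allowed (for instance, reopening a resolved incident) is an assumption made for illustration, not IRM behavior.

```python
# Allowed moves between the example statuses. The permitted transitions
# are assumptions for this sketch; define your own to match your process.
TRANSITIONS = {
    "declared": {"acknowledged", "mitigated", "resolved"},
    "acknowledged": {"mitigated", "resolved"},
    "mitigated": {"acknowledged", "resolved"},
    "resolved": {"acknowledged", "closed"},  # "acknowledged" models a reopen
    "closed": set(),  # terminal status
}


def advance(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


status = "declared"
for step in ("acknowledged", "mitigated", "resolved", "closed"):
    status = advance(status, step)
print(status)
```

Writing the allowed moves down explicitly makes it easier to spot statuses that nothing leads to, or statuses that can never be left.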

Incident labels

Labels enable filtering, routing, and analytics for incidents.

Label sources:

Static labels: Set at the integration level, applied to all incidents from that source.
Dynamic labels: Transferred from alert groups when declaring an incident.
Manual labels: Added during the incident lifecycle as new information emerges.

Essential labels:

service_name: The affected service (required for Service Center).
severity: Incident severity level.
team: Responsible team.
environment: Production, staging, and so on.

Label flow:

Alert Rule Labels → Alert Group Labels → Incident Labels
      ↓                    ↓                   ↓
 (automatic)          (templates)          (manual)

Labels flow through the lifecycle, with each stage able to add or modify labels.
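The flow above can be pictured as merging dictionaries stage by stage. In this sketch, later stages override earlier ones when keys collide; that precedence rule and the merge_labels helper are assumptions for illustration and may not match how IRM resolves conflicts.

```python
def merge_labels(*stages: dict[str, str]) -> dict[str, str]:
    """Merge label sets in order; later stages win on conflicting keys."""
    merged: dict[str, str] = {}
    for stage in stages:
        merged.update(stage)
    return merged


alert_rule_labels = {"service_name": "checkout", "severity": "major"}
alert_group_labels = {"team": "payments", "environment": "production"}
manual_labels = {"severity": "critical"}  # responder raises severity by hand

incident_labels = merge_labels(alert_rule_labels, alert_group_labels, manual_labels)
print(incident_labels)
```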

Service Center integration

Service Center provides a unified view of operational health by connecting alerts, alert groups, incidents, and SLOs.

The service_name label

The service_name label ties everything together in Service Center:

Alerts with service_name appear in that service’s view.
Alert groups inherit service_name from alerts.
Incidents inherit service_name from alert groups.
SLOs are associated with services.

Best practice: Ensure service_name is consistently applied across all alerts.

Benefits

Unified view: See all operational activity for a service in one place.
On-call handoffs: Review recent incidents during shift changes.
Operational reviews: Analyze trends and patterns per service.
SLO correlation: Connect incidents to SLO impact.

Enabling Service Center

1. Define services in Service Center.
2. Ensure alerts include service_name labels.
3. Configure label templates to preserve service_name.
4. Verify incidents appear in Service Center views.
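Before relying on Service Center views, it can help to audit alert definitions for a missing service_name label. The sketch below assumes alerts are available as plain dictionaries with "name" and "labels" keys; that shape and the missing_service_name helper are hypothetical and not an IRM export format.

```python
def missing_service_name(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that lack a non-empty service_name label."""
    return [
        alert["name"]
        for alert in alerts
        if not alert.get("labels", {}).get("service_name", "").strip()
    ]


alerts = [
    {"name": "HighErrorRate", "labels": {"service_name": "checkout"}},
    {"name": "DiskAlmostFull", "labels": {"team": "platform"}},
    {"name": "SlowQueries", "labels": {"service_name": "  "}},
]
unlabeled = missing_service_name(alerts)
print(unlabeled)
```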

During an incident

Keep stakeholders informed and coordinate response throughout the incident lifecycle.

Manage incidents from Slack

The Slack integration helps your teams coordinate incident response. Benefits include:

Dedicated channels: Create incident-specific channels for coordination.
Channel naming: Use consistent prefixes like #inc- for easy identification.
Automated updates: Post status changes to incident channels.
Timeline sync: Activity in Slack appears in the incident timeline.
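As an illustration of the naming convention above, the following sketch builds a channel name from an incident ID and title. The incident_channel_name helper and the ID-plus-title format are assumptions. The output is restricted to lowercase letters, digits, and hyphens and capped at 80 characters, which is Slack's limit for channel names. IRM may generate channel names for you, so treat this only as a way to reason about a consistent scheme.

```python
import re

MAX_CHANNEL_LENGTH = 80  # Slack caps channel names at 80 characters


def incident_channel_name(incident_id: int, title: str, prefix: str = "inc-") -> str:
    """Build a lowercase, hyphen-separated channel name with a fixed prefix."""
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"{prefix}{incident_id}-{slug}" if slug else f"{prefix}{incident_id}"
    # Truncate to the limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_LENGTH].rstrip("-")


channel = incident_channel_name(42, "Checkout API: 5xx spike!")
print(channel)
```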

Communication and announcements

Configure incident announcements to:

Notify stakeholders when incidents are declared.
Provide status updates during response.
Communicate resolution to affected parties.

Best practice: Define announcement templates for consistency across incidents.
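One way to think about an announcement template is as a fixed message skeleton with named fields. The sketch below uses Python's standard string.Template; the field names (severity, title, status, impact, next_update) are hypothetical and do not correspond to IRM template variables.

```python
from string import Template

# Hypothetical announcement skeleton; adapt the fields to your process.
ANNOUNCEMENT = Template(
    "[$severity] $title\n"
    "Status: $status\n"
    "Impact: $impact\n"
    "Next update: $next_update"
)


def render_announcement(**fields: str) -> str:
    """Fill in the template; raises KeyError if a field is missing."""
    return ANNOUNCEMENT.substitute(**fields)


message = render_announcement(
    severity="Major",
    title="Checkout latency elevated",
    status="Mitigated",
    impact="Some customers see slow page loads",
    next_update="30 minutes",
)
print(message)
```

Because substitute fails on a missing field, an incomplete announcement is caught before it is sent rather than posted with gaps.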

Status updates

Update incident status as the situation evolves:

Change severity if impact assessment changes.
Progress through statuses as you move from investigation to mitigation to resolution.
Add timeline entries to document key decisions and actions.

After resolution

Complete post-incident activities to improve future response.

Resolution notes

Add resolution notes to document:

Root cause of the incident.
Steps taken to resolve.
Lessons learned.

Resolution notes build institutional knowledge and improve future response.

Incident review

After resolution, complete the incident record:

Finalize the incident timeline.
Add resolution notes.
Attach all relevant alert groups.
Update labels for accurate analytics.

Analytics and reporting

Use incident data for operational insights:

Trend analysis: Identify recurring issues.
Response metrics: Track MTTR (Mean Time to Resolve) and MTTA (Mean Time to Acknowledge).
Service health: Correlate incidents with SLO performance.
Capacity planning: Understand incident frequency and impact.
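The response metrics above reduce to simple averages over timestamps. This sketch computes MTTA and MTTR in minutes from hypothetical incident records, measuring both from the time the incident was declared. The record shape and that choice of starting point are assumptions; your reporting may instead measure from when the alert first fired.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(pairs: list[tuple[datetime, datetime]]) -> float:
    """Average duration in minutes across (start, end) pairs."""
    return mean((end - start).total_seconds() for start, end in pairs) / 60


# Hypothetical incident records with three timestamps each.
incidents = [
    {"declared": datetime(2024, 1, 1, 10, 0),
     "acknowledged": datetime(2024, 1, 1, 10, 5),
     "resolved": datetime(2024, 1, 1, 11, 0)},
    {"declared": datetime(2024, 1, 2, 9, 0),
     "acknowledged": datetime(2024, 1, 2, 9, 15),
     "resolved": datetime(2024, 1, 2, 9, 40)},
]

mtta = mean_minutes([(i["declared"], i["acknowledged"]) for i in incidents])
mttr = mean_minutes([(i["declared"], i["resolved"]) for i in incidents])
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```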

Continuous improvement

Leverage incident insights to improve your systems:

Alert quality: Reduce noise by tuning thresholds and grouping.
Escalation chains: Speed response with better notification paths.
Runbooks: Improve documentation based on resolution patterns.
Monitoring: Enable earlier detection of similar issues.

Best practices summary

Understand the difference: Use incidents for coordination, alert groups for routine alerts.
Automate when appropriate: Use escalation steps for critical alerts that always need incidents.
Apply consistent labels: Especially service_name for Service Center integration.
Configure your workflow: Design severity levels and statuses for your organization.
Communicate proactively: Keep stakeholders informed throughout the lifecycle.
Document resolutions: Add resolution notes for future learning.
Review and improve: Use incident data to drive continuous improvement.