---
title: "Best practices for incidents | Grafana Cloud documentation"
description: "Best practices for incident management and Service Center integration in Grafana IRM."
---

# Best practices for incidents

Incidents represent significant events that require coordinated response. Effective incident management improves response times and enables learning from past events.

## Understand incidents

Before creating and managing incidents in IRM, understand what they are and when to use them.

### What is an incident

An incident is a formal record of a significant event affecting your services. Incidents provide a coordination point for response activities and a record for post-incident review.

**Incidents differ from alert groups:**

| Aspect        | Alert group                      | Incident                                  |
|---------------|----------------------------------|-------------------------------------------|
| Creation      | Automatic from alerts            | Manual or via escalation step             |
| Purpose       | Group related alerts             | Coordinate response to significant events |
| Scope         | Technical signals                | Business impact and coordination          |
| Lifecycle     | Firing → Acknowledged → Resolved | Customizable status progression           |
| Communication | Notifications to responders      | Stakeholder updates, dedicated channels   |

### When to use incidents

Create incidents for events that need:

- Coordinated response across multiple people or teams.
- Formal tracking for compliance or reporting.
- Stakeholder communication beyond the on-call team.
- Post-incident review and learning.

Not every alert needs an incident. Use alert groups for routine alerts that on-call can handle independently.

### Relationship to alert groups

Incidents can have alert groups attached to them:

- Alert groups provide technical context for the incident.
- Up to 5 alert groups can be attached per incident.
- Labels flow from alert groups to incidents when an incident is declared automatically.

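For large-scale incidents, consolidate related alerts into as few alert groups as practical before attaching them, so that the attachment limit still covers the relevant context.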

## Creating incidents

Incidents can be created automatically through escalation chains or manually by responders.

### Automatic creation

Use the **Declare incident** escalation step for automatic creation:

```text
1. Notify on-call
2. Wait 10 minutes
3. Declare incident (severity: major)
4. Notify incident commander
```

**Best for:**

- Critical alerts that always warrant incidents.
- Alerts matching specific patterns (high severity, production impact).
- Standardizing incident creation across teams.

**Limitation:** Incident declaration only works on non-default routes. Configure specific routes for alerts that should create incidents.
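
To reason about timing, the example chain above can be modelled as plain data. The following Python sketch is illustrative only: `Step`, `CHAIN`, and `minutes_until` are hypothetical names, not part of any IRM API, and the sketch assumes that only wait steps add delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    wait_minutes: int = 0  # delay this step adds before the next one runs


# Hypothetical in-memory model of the example escalation chain.
CHAIN = [
    Step("notify_on_call"),
    Step("wait", wait_minutes=10),
    Step("declare_incident"),
    Step("notify_incident_commander"),
]


def minutes_until(chain, action):
    """Return the minutes elapsed before `action` runs, or None if it is absent."""
    elapsed = 0
    for step in chain:
        if step.action == action:
            return elapsed
        elapsed += step.wait_minutes
    return None
```

With this model, `minutes_until(CHAIN, "declare_incident")` returns `10`, which makes it easy to check whether the delay before declaration fits your response targets.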

### Manual creation

Create incidents manually when:

- Multiple related alert groups need coordination.
- Customers report issues that monitoring didn’t detect.
- Security incidents require formal tracking.
- Business events affect operations.

### Attaching alert groups

When you attach alert groups to incidents:

- Related alerts are correlated together.
- Technical context is preserved with the incident.
- You can track which alerts contributed to the incident.

Attach alert groups during incident creation or add them later as you identify related alerts.

## Configure your incident workflow

Customize severity levels, statuses, and labels to match your organization’s process.

### Severity levels

IRM lets you define custom severity levels. Design them based on your SLAs and team capacity.

Refer to the following configuration example:

| Severity | Response time     | Examples                   |
|----------|-------------------|----------------------------|
| Critical | Immediate         | Complete outage, data loss |
| Major    | Within 15 min     | Significant degradation    |
| Minor    | Within 1 hour     | Limited impact             |
| Warning  | Next business day | Potential issues           |

This is just an example. Create severity levels that reflect your operational requirements.
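
If you track response targets outside IRM, the example table can be expressed as a small lookup. This is a minimal Python sketch, not an IRM feature: the function names are hypothetical, "Immediate" is modelled as zero delay, and "Next business day" is assumed to mean 09:00 on the next weekday.

```python
from datetime import datetime, timedelta

# Targets mirror the example table above. "Immediate" is modelled as zero delay.
RESPONSE_TARGETS = {
    "critical": timedelta(0),
    "major": timedelta(minutes=15),
    "minor": timedelta(hours=1),
}


def next_business_day(ts, start_hour=9):
    """Return the start of the next weekday after `ts` (assumed 09:00, Mon-Fri)."""
    day = ts + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=start_hour, minute=0, second=0, microsecond=0)


def response_deadline(severity, declared_at):
    """Return the latest time a first response is expected for `severity`."""
    key = severity.lower()
    if key == "warning":
        return next_business_day(declared_at)
    return declared_at + RESPONSE_TARGETS[key]
```

Public holidays and time zones are deliberately ignored here. Adjust the targets to match the severity levels you actually configure.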

### Status progression

Define statuses that reflect your incident management process.

For example:

1. **Declared:** Incident created, initial response starting.
2. **Acknowledged:** Responders engaged, investigation underway.
3. **Mitigated:** Impact reduced, full resolution pending.
4. **Resolved:** Incident fully resolved.
5. **Closed:** Post-incident activities complete.

Design statuses based on your team’s workflow and reporting needs.
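
When you design a progression, it helps to decide which transitions are allowed. The sketch below models the example statuses in Python under one assumed policy: incidents only move forward, and statuses can be skipped. Both the policy and the name `can_transition` are assumptions for illustration, not IRM behavior.

```python
# Example statuses from the list above, in order.
STATUSES = ["Declared", "Acknowledged", "Mitigated", "Resolved", "Closed"]


def can_transition(current, target):
    """Return True if `target` comes later than `current` in the progression.

    Raises ValueError if either status is not in STATUSES.
    """
    return STATUSES.index(target) > STATUSES.index(current)
```

For example, `can_transition("Declared", "Mitigated")` is `True`, while `can_transition("Resolved", "Declared")` is `False`. If your process allows reopening incidents, the policy needs to permit backward moves as well.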

### Incident labels

Labels enable filtering, routing, and analytics for incidents.

**Label sources:**

- **Static labels:** Set at the integration level, applied to all incidents from that source.
- **Dynamic labels:** Transferred from alert groups when declaring an incident.
- **Manual labels:** Added during the incident lifecycle as new information emerges.

**Essential labels:**

- `service_name`: The affected service (required for Service Center).
- `severity`: Incident severity level.
- `team`: Responsible team.
- `environment`: Production, staging, and so on.

**Label flow:**

```text
Alert Rule Labels → Alert Group Labels → Incident Labels
      ↓                    ↓                   ↓
 (automatic)          (templates)          (manual)
```

Labels flow through the lifecycle, with each stage able to add or modify labels.
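
The effect of each stage adding or modifying labels can be pictured as a layered merge. The following Python sketch assumes that a later stage overrides an earlier one when both set the same key. That precedence and the function name `merge_labels` are illustrative assumptions, so verify the actual behavior in your own setup.

```python
def merge_labels(alert_rule, alert_group, manual):
    """Combine labels from each stage; later stages override earlier ones."""
    merged = {}
    for stage in (alert_rule, alert_group, manual):
        merged.update(stage)
    return merged


# Hypothetical labels at each stage of the flow.
incident_labels = merge_labels(
    {"service_name": "checkout", "severity": "warning"},  # from the alert rule
    {"team": "payments"},                                 # added by a template
    {"severity": "critical"},                             # set manually
)
```

Here `incident_labels` keeps `service_name` and `team` from the earlier stages, while the manually set `severity` replaces the original value.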

## Service Center integration

[Service Center](/docs/grafana-cloud/alerting-and-irm/service-center/) provides a unified view of operational health by connecting alerts, alert groups, incidents, and SLOs.

### The service\_name label

The `service_name` label ties everything together in Service Center:

- Alerts with `service_name` appear in that service’s view.
- Alert groups inherit `service_name` from alerts.
- Incidents inherit `service_name` from alert groups.
- SLOs are associated with services.

**Best practice:** Ensure `service_name` is consistently applied across all alerts.

### Benefits

- **Unified view:** See all operational activity for a service in one place.
- **On-call handoffs:** Review recent incidents during shift changes.
- **Operational reviews:** Analyze trends and patterns per service.
- **SLO correlation:** Connect incidents to SLO impact.

### Enabling Service Center

1. Define services in Service Center.
2. Ensure alerts include `service_name` labels.
3. Configure label templates to preserve `service_name`.
4. Verify incidents appear in Service Center views.
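
Step 2 is easier to enforce with an automated check. As a rough sketch, the Python function below scans a list of alert definitions and reports those without a `service_name` label. The input shape (a `name` plus a `labels` dictionary) is an assumption for illustration and doesn't correspond to a specific Grafana export format.

```python
def missing_service_name(alerts):
    """Return the names of alerts that lack a non-empty `service_name` label."""
    return [
        alert["name"]
        for alert in alerts
        if not alert.get("labels", {}).get("service_name")
    ]


# Hypothetical alert definitions.
alerts = [
    {"name": "HighErrorRate", "labels": {"service_name": "checkout"}},
    {"name": "DiskAlmostFull", "labels": {"team": "platform"}},
    {"name": "SlowQueries"},
]
```

Calling `missing_service_name(alerts)` on this sample returns `["DiskAlmostFull", "SlowQueries"]`, the two alerts that would not appear under any service.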

## During an incident

Keep stakeholders informed and coordinate response throughout the incident lifecycle.

### Manage incidents from Slack

The Slack integration helps your teams coordinate incident response:

- **Dedicated channels:** Create incident-specific channels for coordination.
- **Channel naming:** Use consistent prefixes like `#inc-` for easy identification.
- **Automated updates:** Post status changes to incident channels.
- **Timeline sync:** Activity in Slack appears in the incident timeline.
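
A consistent naming convention is simple to generate programmatically. The following Python sketch builds a channel name from an incident ID and title, using the `inc-` prefix mentioned above. The function name and the ID-plus-slug format are hypothetical, and the 80-character cap and allowed characters reflect Slack's channel naming rules as commonly documented, so check them against your workspace.

```python
import re


def incident_channel_name(incident_id, title, prefix="inc-", max_len=80):
    """Build a lowercase, hyphen-separated channel name for an incident."""
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"{prefix}{incident_id}-{slug}"
    # Truncate to the length cap and drop any trailing hyphen.
    return name[:max_len].rstrip("-")
```

For example, `incident_channel_name(42, "Checkout API: 5xx spike!")` returns `inc-42-checkout-api-5xx-spike`.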

### Communication and announcements

Configure incident announcements to:

- Notify stakeholders when incidents are declared.
- Provide status updates during response.
- Communicate resolution to affected parties.

**Best practice:** Define announcement templates for consistency across incidents.

### Status updates

Update incident status as the situation evolves:

- Change severity if impact assessment changes.
- Progress through statuses as you move from investigation to mitigation to resolution.
- Add timeline entries to document key decisions and actions.

## After resolution

Complete post-incident activities to improve future response.

### Resolution notes

Add resolution notes to document:

- Root cause of the incident.
- Steps taken to resolve.
- Lessons learned.

Resolution notes build institutional knowledge and improve future response.

### Incident review

After resolution, complete the incident record:

1. Finalize the incident timeline.
2. Add resolution notes.
3. Attach all relevant alert groups.
4. Update labels for accurate analytics.

### Analytics and reporting

Use incident data for operational insights:

- **Trend analysis:** Identify recurring issues.
- **Response metrics:** Track MTTR (Mean Time to Resolve) and MTTA (Mean Time to Acknowledge).
- **Service health:** Correlate incidents with SLO performance.
- **Capacity planning:** Understand incident frequency and impact.
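
MTTA and MTTR are averages over per-incident durations. The Python sketch below shows the arithmetic on exported incident records. The field names `declared_at`, `acknowledged_at`, and `resolved_at` are assumptions for illustration, not a documented export schema, and incidents missing the end timestamp are skipped.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average minutes between two timestamps, ignoring incidents without an end."""
    durations = [
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
        if incident.get(end_key) is not None
    ]
    return mean(durations) if durations else None


def mtta(incidents):
    """Mean Time to Acknowledge, in minutes."""
    return mean_minutes(incidents, "declared_at", "acknowledged_at")


def mttr(incidents):
    """Mean Time to Resolve, in minutes."""
    return mean_minutes(incidents, "declared_at", "resolved_at")
```

Measuring both metrics from the declaration time is one common convention. Some teams measure from the first alert instead, so state which starting point you use when you report these numbers.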

### Continuous improvement

Use incident insights to improve your systems:

- **Alert quality:** Reduce noise by tuning thresholds and grouping.
- **Escalation chains:** Speed response with better notification paths.
- **Runbooks:** Improve documentation based on resolution patterns.
- **Monitoring:** Enable earlier detection of similar issues.

## Best practices summary

- **Understand the difference:** Use incidents for coordination, alert groups for routine alerts.
- **Automate when appropriate:** Use escalation steps for critical alerts that always need incidents.
- **Apply consistent labels:** Especially `service_name` for Service Center integration.
- **Configure your workflow:** Design severity levels and statuses for your organization.
- **Communicate proactively:** Keep stakeholders informed throughout the lifecycle.
- **Document resolutions:** Add resolution notes for future learning.
- **Review and improve:** Use incident data to drive continuous improvement.

## Next steps

- [Configure incident settings](/docs/grafana-cloud/alerting-and-irm/irm/manage-incidents/customize-incident-response) for your organization
- [Incident management workflows](/docs/grafana-cloud/alerting-and-irm/irm/manage-incidents)
- [Slack integration](/docs/grafana-cloud/alerting-and-irm/irm/integrations/chat-and-collaboration/slack) for chat-based response
- [Configure labels](/docs/grafana-cloud/alerting-and-irm/irm/escalation-and-routing/labels) for incident tracking
