Best practices for escalation chains

Escalation chains define how IRM notifies your team when alerts arrive. Well-designed chains ensure timely response while preventing notification fatigue.

Understand escalation chains

Before building escalation chains, understand their role in the alert flow and how IRM executes them.

What is an escalation chain

An escalation chain is a sequence of steps that IRM executes when an alert group is created. Each step can notify users, wait for a response, or perform actions like declaring an incident.

Escalation chains connect routes to responders:

Alert → Route → Escalation Chain → Schedule → Responder

How chains fit into the alert flow

Routes determine which escalation chain handles an alert group. The chain then executes its steps in order until someone acknowledges or resolves the alert group.

For more information about routing, refer to the Alert routing best practices.

Build basic chains

Start with simple escalation patterns before adding complexity.

The following example shows the basic pattern that most escalation chains follow:

1. Notify on-call from schedule
2. Wait 5 minutes
3. Notify on-call from schedule (important)
4. Wait 10 minutes
5. Notify backup schedule

This pattern:

Starts with a standard notification.
Waits for acknowledgment.
Escalates to important notification if no response.
Eventually reaches a backup.
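To see when each notification in this pattern would fire, the chain can be modeled as a list of steps. The following is a minimal sketch for reasoning about timing only. The `Step` class, the schedule names `primary` and `backup`, and `notification_timeline` are hypothetical and not part of any IRM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical model of one escalation step (not an IRM type)."""
    kind: str                # "notify" or "wait"
    target: str = ""         # schedule name, for notify steps
    important: bool = False  # use the important notification rules
    minutes: int = 0         # duration, for wait steps


# The five-step pattern from the example above.
BASIC_CHAIN = [
    Step("notify", target="primary"),
    Step("wait", minutes=5),
    Step("notify", target="primary", important=True),
    Step("wait", minutes=10),
    Step("notify", target="backup"),
]


def notification_timeline(chain):
    """Return (minutes_after_alert, target, important) for each notify step."""
    elapsed = 0
    timeline = []
    for step in chain:
        if step.kind == "wait":
            elapsed += step.minutes
        elif step.kind == "notify":
            timeline.append((elapsed, step.target, step.important))
    return timeline


for minute, target, important in notification_timeline(BASIC_CHAIN):
    rules = "important" if important else "default"
    print(f"t+{minute} min: notify {target} ({rules})")
```

Under these assumptions, the backup schedule is reached 15 minutes after the alert group is created, provided no one acknowledges earlier.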

Wait steps

Wait steps space out notifications to prevent fatigue and give responders time to act.

Wait duration	Use case
1 minute	Urgent alerts, quick acknowledgment expected
5 minutes	Standard alerts, reasonable response time
15 minutes	Lower priority, allows investigation time
30-60 minutes	Informational alerts, batch processing

Tip
Start with longer waits and shorten based on actual response times.

Terminal steps

Every chain should end definitively. Without a terminal step, escalation can continue indefinitely.

Options for ending a chain:

Resolve: Automatically resolve if no response is needed.
Notify all: Escalate to an entire channel as a last resort.
Repeat: Restart the chain a limited number of times.
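If you keep chain definitions in a reviewable form, a small check can flag chains that do not end with one of these options. This is only a sketch: the step kind strings and the `ends_definitively` helper are hypothetical names chosen for illustration, not IRM identifiers.

```python
# Hypothetical labels for the three terminal options listed above.
TERMINAL_KINDS = {"resolve", "notify_all", "repeat"}


def ends_definitively(step_kinds):
    """Return True if the last step in the list is a terminal step."""
    return bool(step_kinds) and step_kinds[-1] in TERMINAL_KINDS


print(ends_definitively(["notify", "wait", "notify", "repeat"]))
print(ends_definitively(["notify", "wait", "notify"]))
```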

Notification step types

Choose the right notification type for each stage of your escalation. Some common notification steps include:

Schedule-based notifications

Use Notify users from on-call schedule when:

You have a defined on-call rotation.
You want automatic rotation without updating chains.
You need follow-the-sun coverage.

IRM evaluates the schedule when the step executes, not when the alert was created. This means schedule changes take effect immediately for pending escalation steps.

User queue notifications

Use Notify users from queue when:

You want round-robin distribution across a fixed set of users.
Multiple users should share the alert load.
Your team doesn’t have a formal on-call rotation.

Round-robin behavior: Each escalation notifies the next user in the queue. IRM tracks the position per alert group, cycling through all users.
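The round-robin behavior can be pictured as a queue with one position counter per alert group. The sketch below is an illustrative model, not IRM's implementation. The `UserQueue` class and user names are hypothetical, and it assumes that every new alert group starts at the first user in the queue.

```python
from collections import defaultdict


class UserQueue:
    """Hypothetical round-robin queue that tracks a position per alert group."""

    def __init__(self, users):
        if not users:
            raise ValueError("queue needs at least one user")
        self.users = list(users)
        # alert group id -> index of the next user to notify
        self._position = defaultdict(int)

    def next_user(self, alert_group_id):
        """Return the next user for this alert group and advance its position."""
        index = self._position[alert_group_id]
        self._position[alert_group_id] = (index + 1) % len(self.users)
        return self.users[index]


queue = UserQueue(["alice", "bob", "carol"])
first_group = [queue.next_user("group-1") for _ in range(4)]
second_group = queue.next_user("group-2")
print(first_group)
print(second_group)
```

In this model, four escalations for the same alert group cycle through all three users and wrap back to the first, while a different alert group keeps its own position.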

For all available escalation steps, refer to Configure escalation chains.

Default vs. Important notifications

For each notification step, you also need to specify whether to use Default or Important notification.

These options correspond to the two sets of notification rules configured in each user’s IRM profile.

To learn more about default and important notifications, refer to Types of notification rules.

When to use important notifications:

After initial notification attempts fail to get a response.
For truly critical alerts that need immediate attention.

Pattern:

1. Notify on-call (default)
2. Wait 5 minutes
3. Notify on-call (important)  ← Escalate to important

Caution
Overusing important notifications reduces their effectiveness. Reserve them for genuine escalation within a chain.

Advanced patterns

Use these patterns for more sophisticated escalation logic.

Time-based routing

Use Continue if current UTC time is in range to route differently by time of day:

1. Check if 9am-6pm UTC
   → Yes: Notify business hours team
   → No: Continue to next step
2. Notify after-hours team

This enables:

Business hours versus after-hours escalation.
Weekend-specific routing.
Holiday coverage.
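The time-range check in the example above comes down to comparing the current UTC time of day against a window. The following sketch shows one way to evaluate such a check, including windows that wrap past midnight. The function names and team labels are hypothetical, and the branching mirrors the example rather than IRM's internal logic.

```python
from datetime import datetime, time, timedelta, timezone

BUSINESS_START = time(9, 0)   # 9am UTC, as in the example above
BUSINESS_END = time(18, 0)    # 6pm UTC


def in_utc_range(now, start, end):
    """Return True if now, converted to UTC, falls in [start, end)."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    current = now.astimezone(timezone.utc).time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, for example 22:00 to 06:00.
    return current >= start or current < end


def team_for(now):
    """Pick a hypothetical team label based on the business-hours window."""
    if in_utc_range(now, BUSINESS_START, BUSINESS_END):
        return "business-hours-team"
    return "after-hours-team"


print(team_for(datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)))
print(team_for(datetime(2024, 1, 15, 20, 0, tzinfo=timezone.utc)))
```

Because the check uses UTC, a responder's local 10am can fall outside the window. Convert your team's working hours to UTC before configuring the range.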

Alert volume throttling

Use Notify if number of alerts in time window to throttle low-priority escalations:

1. Check if >5 alerts in 30 minutes
   → Yes: Continue escalation
   → No: Pause escalation

This prevents paging for sporadic low-priority alerts while still escalating patterns that indicate a real problem.
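The threshold check can be understood as a sliding-window count. The sketch below is a simplified model, not how IRM counts alerts internally. The `AlertWindow` class is hypothetical, and it assumes alerts are recorded in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class AlertWindow:
    """Hypothetical sliding-window counter for alerts in one alert group."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # escalate only when the count exceeds this
        self.window = window        # length of the time window
        self._timestamps = deque()

    def record(self, when):
        """Record an alert. Assumes calls arrive in chronological order."""
        self._timestamps.append(when)

    def should_escalate(self, now):
        """Return True if more than `threshold` alerts fall inside the window."""
        cutoff = now - self.window
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# The ">5 alerts in 30 minutes" rule from the example above.
window = AlertWindow(threshold=5, window=timedelta(minutes=30))
start = datetime(2024, 1, 15, 12, 0)
for i in range(5):
    window.record(start + timedelta(minutes=i))
print(window.should_escalate(start + timedelta(minutes=5)))  # five alerts so far
window.record(start + timedelta(minutes=5))
print(window.should_escalate(start + timedelta(minutes=5)))  # six alerts now
```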

Repeat escalation

Use Repeat escalation N times to restart the chain if no one responds:

1. Notify primary on-call
2. Wait 5 minutes
3. Notify secondary on-call
4. Wait 10 minutes
5. Repeat escalation (max 3 times)

Note
IRM allows a maximum of 5 repeats to prevent infinite loops.
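To estimate how many steps a repeating chain can run in the worst case, you can expand the repeats by hand or with a short helper. This sketch assumes that "repeat N times" means the chain runs once and then N more times. The step labels and the `expand_repeats` function are hypothetical.

```python
MAX_REPEATS = 5  # the limit of 5 repeats stated in the note above

# The four steps that precede the repeat step in the example above.
STEPS = [
    "notify primary on-call",
    "wait 5 minutes",
    "notify secondary on-call",
    "wait 10 minutes",
]


def expand_repeats(steps, repeats):
    """Return the full step sequence: one initial pass plus `repeats` more."""
    if not 0 <= repeats <= MAX_REPEATS:
        raise ValueError(f"repeats must be between 0 and {MAX_REPEATS}")
    return list(steps) * (repeats + 1)


expanded = expand_repeats(STEPS, repeats=3)
print(len(expanded))
```

With 15 minutes of waiting per pass, three repeats under this assumption mean up to 60 minutes of escalation before the chain stops.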

Declare incident

Use Declare incident to automatically create an incident from an alert group:

1. Notify on-call
2. Wait 5 minutes
3. Declare incident (severity: major)
4. Notify incident commander schedule

Note
Incident declaration only works on non-default routes. Configure specific routes for alerts that should trigger automatic incidents.

Organize your escalation chains

Good organization makes chains easier to maintain and debug during incidents.

One chain per escalation path

Create separate chains for different escalation needs:

payments-critical: Fast escalation for payment issues.
payments-warning: Slower escalation for warnings.
platform-business-hours: Business hours only.
platform-24x7: Round-the-clock coverage.

Naming conventions

Use clear, descriptive names that help responders understand the chain’s purpose.

Include in the name:

The team or service name.
The severity or priority level.
Time-based behavior, if applicable.

For example:

auth-team-critical-24x7
data-pipeline-business-hours
infrastructure-p1-immediate
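If you create chains through automation, a small helper can keep names consistent. The following sketch builds names in the style of the examples above. The `chain_name` function and its lowercase, hyphen-separated pattern are assumptions for illustration, not a naming rule that IRM enforces.

```python
import re

# Assumed convention: lowercase words and digits separated by single hyphens.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def chain_name(team, severity=None, coverage=None):
    """Join the non-empty parts into a lowercase, hyphen-separated name."""
    parts = [part for part in (team, severity, coverage) if part]
    name = "-".join(part.strip().lower().replace(" ", "-") for part in parts)
    if not NAME_PATTERN.match(name):
        raise ValueError(f"invalid chain name: {name!r}")
    return name


print(chain_name("auth-team", "critical", "24x7"))
print(chain_name("data pipeline", coverage="business hours"))
```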

Reuse chains across routes

Chains can be used by multiple routes. Design reusable chains for common patterns:

Create generic severity-based chains that multiple teams can use.
Create team-specific chains shared across services.
Create standard escalation patterns for common scenarios.

Snapshot behavior

IRM snapshots escalation chains when it creates an alert group. This is important to understand before building chains.

What gets snapshotted:

Chain configuration and steps.
User queue positions.
Schedule references (but not schedule contents).

What this means:

Changes to a chain don’t affect alert groups already using it.
To test chain changes, you need to create new alerts.
Active alert groups continue using the original chain configuration.

Schedules are different: While the chain is snapshotted, schedules are evaluated dynamically. When a step runs, IRM checks who is currently on-call at that moment.
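The difference between the snapshotted chain and the dynamically evaluated schedule can be illustrated with a small model. This is a conceptual sketch only. The `AlertGroup` class, the dictionary layout, and the user names are hypothetical and do not reflect how IRM stores data.

```python
import copy

# Hypothetical state: schedule name -> user who is on call right now.
schedules = {"primary": "alice"}
chain = {"name": "payments-critical", "steps": [{"notify_schedule": "primary"}]}


class AlertGroup:
    """Hypothetical alert group that snapshots its chain at creation time."""

    def __init__(self, chain):
        # Deep copy: later edits to the chain do not reach this alert group.
        self.chain = copy.deepcopy(chain)

    def resolve_step(self, index, schedules):
        """Look up who is on call when the step runs, not at creation time."""
        schedule_name = self.chain["steps"][index]["notify_schedule"]
        return schedules[schedule_name]


group = AlertGroup(chain)

# After creation, someone edits the chain and the on-call shift changes.
chain["steps"][0]["notify_schedule"] = "backup"
schedules["backup"] = "carol"
schedules["primary"] = "bob"

notified = group.resolve_step(0, schedules)
print(notified)
```

In this model, the alert group still references the `primary` schedule from its snapshot, but it notifies whoever is on call for that schedule when the step runs.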

Testing and tuning

Test chains thoroughly before deploying to production, and tune based on real-world performance.

Testing with non-production alerts

Before deploying chain changes:

Create a test integration.
Send test alerts through the chain.
Verify notifications reach the right people at the right times.

Remember that changes don’t affect existing alert groups due to snapshot behavior. Always test with new alerts.

Metrics to monitor

Track these metrics to understand chain effectiveness:

Time to first acknowledgment: How quickly do responders engage?
Escalation depth: How many steps run before someone responds?
False escalations: How often do alerts that didn’t need a human response escalate?
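If you export alert group history, the first two metrics are straightforward to compute. The sketch below assumes a hypothetical record shape with a creation time, an optional acknowledgment time, and a count of executed steps. The field names and sample values are illustrative and do not come from an IRM export format.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class AlertGroupRecord:
    """Hypothetical export row for one alert group."""
    created_at: datetime
    acknowledged_at: Optional[datetime]  # None if never acknowledged
    steps_executed: int                  # escalation depth


def median_minutes_to_ack(records):
    """Median time to first acknowledgment, ignoring unacknowledged groups."""
    minutes = [
        (r.acknowledged_at - r.created_at).total_seconds() / 60
        for r in records
        if r.acknowledged_at is not None
    ]
    return median(minutes) if minutes else None


def mean_escalation_depth(records):
    """Average number of steps that ran per alert group."""
    return mean(r.steps_executed for r in records) if records else None


# Made-up sample data, for illustration only.
records = [
    AlertGroupRecord(datetime(2024, 1, 15, 12, 0), datetime(2024, 1, 15, 12, 3), 1),
    AlertGroupRecord(datetime(2024, 1, 15, 13, 0), datetime(2024, 1, 15, 13, 12), 3),
    AlertGroupRecord(datetime(2024, 1, 15, 14, 0), None, 5),
]
print(median_minutes_to_ack(records))
print(mean_escalation_depth(records))
```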

Tuning based on response patterns

Use metrics to improve your chains:

High escalation depth: Shorten wait times or add more notification channels.
Frequent false escalations: Review alert quality or add throttling.
Slow acknowledgment: Consider adding important notification steps earlier.

Best practices summary

Start simple: Begin with notify → wait → escalate patterns.
End definitively: Always include a terminal step.
Space notifications: Use wait steps to prevent fatigue.
Use important sparingly: Reserve for genuine escalation.
Remember snapshots: Changes don’t affect active alert groups.
Name clearly: Descriptive names help during incidents.
Test with new alerts: Snapshot behavior means existing alerts use old chains.
Monitor and tune: Adjust based on actual response patterns.