Best practices for on-call schedules

On-call schedules determine who receives pages when escalation chains execute. Well-designed schedules ensure reliable coverage while respecting team members’ time.

Understand on-call schedules

Before building schedules, understand their role in IRM and how they’re evaluated.

What is a schedule

A schedule defines who is on-call at any given time. Schedules contain shifts that assign users or user groups to specific time periods.

Schedules connect escalation chains to responders:

Alert → Route → Escalation Chain → Schedule → On-call Responder

How schedules connect to escalation chains

Escalation chains reference schedules through notification steps. When a step like “Notify users from on-call schedule” executes, IRM checks who is currently on-call.

Example chain using multiple schedules:

1. Notify on-call from "Primary Schedule"
2. Wait 5 minutes
3. Notify on-call from "Secondary Schedule"
4. Wait 10 minutes
5. Notify on-call from "Management Schedule"

This pattern enables:

  • Primary and secondary on-call.
  • Follow-the-sun coverage.
  • Escalation to management after hours.
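As an illustration of the example chain above, the following minimal sketch walks a list of steps and records who would be notified and when. `NotifyStep`, `WaitStep`, `run_chain`, and the `roster` lookup are hypothetical names invented for this sketch; they are not part of the IRM API.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class NotifyStep:
    schedule_name: str  # schedule to look up when the step runs


@dataclass
class WaitStep:
    minutes: int


Step = Union[NotifyStep, WaitStep]


def run_chain(
    steps: list[Step], on_call_lookup: Callable[[str, int], str]
) -> list[tuple[int, str]]:
    """Walk the chain and return (minute_offset, responder) per notification.

    on_call_lookup(schedule_name, minute_offset) is a stand-in for asking a
    schedule who is on-call at that moment.
    """
    elapsed = 0
    notified = []
    for step in steps:
        if isinstance(step, WaitStep):
            elapsed += step.minutes
        else:
            notified.append((elapsed, on_call_lookup(step.schedule_name, elapsed)))
    return notified


chain = [
    NotifyStep("Primary Schedule"),
    WaitStep(5),
    NotifyStep("Secondary Schedule"),
    WaitStep(10),
    NotifyStep("Management Schedule"),
]

# Hypothetical roster: one fixed responder per schedule.
roster = {
    "Primary Schedule": "alice",
    "Secondary Schedule": "bob",
    "Management Schedule": "carol",
}
timeline = run_chain(chain, lambda name, _minute: roster[name])
print(timeline)  # [(0, 'alice'), (5, 'bob'), (15, 'carol')]
```

The waits accumulate, so the third notification goes out 15 minutes after the first, not 10.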

Dynamic evaluation

IRM evaluates schedules at execution time, not when the alert was created.

What this means:

  • Schedule changes take effect immediately for pending escalation steps.
  • If someone swaps shifts while an alert is escalating, the new on-call person receives the next notification.
  • You don’t need to worry about in-flight alerts using outdated schedule data.

This is different from escalation chains, which are snapshotted when an alert group is created.
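The difference between looking up a schedule at execution time and reusing an earlier answer can be sketched with a small in-memory model. The `Schedule` class and its `on_call_at` and `swap` methods are hypothetical and only illustrate the behavior described above; they do not reflect how IRM is implemented.

```python
from datetime import datetime, timedelta, timezone


class Schedule:
    """Hypothetical in-memory schedule: a list of (start, end, user) shifts."""

    def __init__(self, shifts):
        self.shifts = list(shifts)

    def on_call_at(self, moment):
        for start, end, user in self.shifts:
            if start <= moment < end:
                return user
        return None

    def swap(self, old_user, new_user):
        # Reassign every shift held by old_user to new_user.
        self.shifts = [
            (start, end, new_user if user == old_user else user)
            for start, end, user in self.shifts
        ]


t0 = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
primary = Schedule([(t0, t0 + timedelta(hours=12), "alice")])

# The first step runs at alert time and looks up the schedule then.
first_notified = primary.on_call_at(t0)

# A swap happens while the alert is still escalating.
primary.swap("alice", "bob")

# A later step looks the schedule up again instead of reusing the old answer.
second_notified = primary.on_call_at(t0 + timedelta(minutes=15))
print(first_notified, second_notified)  # alice bob
```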

Choose a schedule type

IRM supports three ways to manage schedules. Choose based on your team’s workflow.

Web schedules

The built-in schedule editor in the IRM UI.

Best for:

  • Teams new to on-call management.
  • Simple rotation patterns.
  • Teams who prefer visual schedule management.

Features:

  • Drag-and-drop shift editing.
  • Visual rotation preview.
  • Override management in UI.

iCal schedules

Import schedules from external calendar systems.

Best for:

  • Teams with existing calendar-based schedules.
  • Migration from PagerDuty, Opsgenie, or other tools.
  • Organizations using shared calendar systems.

Considerations:

  • Schedule changes require updating the external calendar.
  • IRM periodically syncs from the iCal URL.
  • Limited editing capabilities within IRM.

API/Terraform schedules

Manage schedules through the API or Infrastructure as Code.

Best for:

  • Teams using Terraform or other IaC tools.
  • Automated schedule management.
  • Version-controlled schedule configurations.

Features:

  • Full API control over shifts and rotations.
  • You can enable web-based overrides while managing the primary schedule through the API.
  • Integrates with existing deployment pipelines.

Comparison

| Type | Management | Best for | Flexibility |
|---|---|---|---|
| Web | UI | Simple rotations, visual editing | High |
| iCal | External calendar | Migration, existing calendars | Low |
| API/TF | Code | Automation, version control | High |

Design on-call rotations

Design rotations that provide reliable coverage while distributing work fairly.

Rotation patterns

Choose a pattern based on your team size and alert frequency:

| Pattern | Duration | Use case |
|---|---|---|
| Daily | 24 hours | High-frequency alerts, distributed teams |
| Weekly | 7 days | Most common, good work-life balance |
| Bi-weekly | 14 days | Smaller teams, less frequent alerts |

Combine patterns with different shift lengths:

  • 12-hour shifts with weekly rotation: Two rotations covering day and night.
  • Business hours shifts (9:00-18:00): Aligned with standard work hours.
  • Extended shifts (36 hours): Bi-daily rotation for longer coverage periods.

For visual examples, refer to On-call schedule examples.

Set rotation start explicitly

Always set the Rotation start (called rotation_start in the API/Terraform) explicitly.

Why this matters: shift start and rotation start are separate settings that control different things:

  • Shift start: When the shift pattern begins each day/week.
  • Rotation start: When the rotation through users begins.

These can differ when you want the rotation to align with a specific date, like the start of a sprint, while shifts cover different hours.
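To see why the rotation anchor matters, consider this minimal sketch of a weekly rotation. The `user_for` function and its index arithmetic are assumptions made for illustration, not IRM's documented algorithm. The point is only that the same users and the same moment can yield a different on-call person when the anchor date changes.

```python
from datetime import datetime, timedelta, timezone


def user_for(moment, rotation_start, users, interval=timedelta(weeks=1)):
    """Return the user whose turn covers `moment`, counting from rotation_start."""
    if moment < rotation_start:
        return None  # the rotation has not begun yet
    turns_elapsed = (moment - rotation_start) // interval
    return users[turns_elapsed % len(users)]


users = ["alice", "bob", "carol"]
moment = datetime(2024, 1, 17, 12, 0, tzinfo=timezone.utc)  # a Wednesday

# Anchored to Monday 1 January: two full weeks have passed, so the third user is up.
anchored_jan_1 = user_for(
    moment, datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), users
)

# Same users, anchored one week later (for example, to a sprint start).
anchored_jan_8 = user_for(
    moment, datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc), users
)
print(anchored_jan_1, anchored_jan_8)  # carol bob
```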

User group rotation

For teams with varying availability, use user groups instead of individual users:

  1. Create user groups (arrays of users).
  2. Each rotation moves to the next group.
  3. All users in a group are on-call simultaneously.

Use cases:

  • Primary and secondary on-call: Two users on-call at once.
  • Follow-the-sun: Groups in different time zones.
  • Graduated response: Junior and senior pairing.
  • Shadow coverage: Onboarding new teammates.
  • Backup coverage: Extra support during critical periods.
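A rotation over user groups can be modeled as a list of lists, where each turn selects a whole group. This is a minimal sketch with a hypothetical `group_for_turn` helper and made-up user names; it only illustrates the idea that everyone in the selected group is on-call together.

```python
def group_for_turn(groups, turn):
    """Return every user on-call for the given turn (0-based), wrapping around."""
    return groups[turn % len(groups)]


# Each inner list is one user group; all of its members are on-call together.
groups = [
    ["alice", "bob"],   # for example, a senior and junior pairing
    ["carol"],
    ["dave", "erin"],
]

first_four_turns = [group_for_turn(groups, turn) for turn in range(4)]
print(first_four_turns)
# [['alice', 'bob'], ['carol'], ['dave', 'erin'], ['alice', 'bob']]
```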

Handle schedule changes

Schedules need to accommodate vacations, sick days, and unexpected changes.

Shift swaps vs overrides

Shift swaps exchange shifts between two users:

  • Maintains rotation continuity.
  • Each swapped user returns to their normal position in the rotation afterward.
  • Easier to track and audit.

Overrides replace the scheduled user with someone else:

  • Creates a one-time exception.
  • Doesn’t affect the underlying rotation.
  • Use for coverage when swapping isn’t possible.

Best practice: Prefer shift swaps for planned changes. Use overrides for last-minute coverage.
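The difference in outcome between the two can be sketched on a simple list of weekly slots. The `swap` and `apply_override` helpers are hypothetical and exist only to illustrate the distinction: a swap keeps everyone's total load the same, while an override shifts load onto the covering user.

```python
from collections import Counter


def swap(rotation, i, j):
    """Exchange the users in slots i and j; everyone keeps the same shift count."""
    swapped = list(rotation)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def apply_override(rotation, overrides):
    """Replace the user in specific slots without touching the rotation itself."""
    return [overrides.get(slot, user) for slot, user in enumerate(rotation)]


rotation = ["alice", "bob", "carol", "dave"]  # one user per weekly slot

after_swap = swap(rotation, 0, 2)
after_override = apply_override(rotation, {0: "dave"})

print(after_swap)               # ['carol', 'bob', 'alice', 'dave']
print(after_override)           # ['dave', 'bob', 'carol', 'dave']
print(Counter(after_override))  # dave now covers two slots, alice none
print(rotation)                 # the underlying rotation is unchanged
```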

Override priority

When shifts overlap, priority determines which takes precedence.

  • Higher Priority (called priority_level in the API/Terraform) wins.
  • Overrides typically use priority 99 (highest).
  • Primary shifts use lower priorities (0-10).

Set priorities intentionally to ensure overrides work as expected.
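Priority resolution for overlapping shifts can be sketched as follows. The `Shift` dataclass and `on_call_at` function are hypothetical, and hours of a single day stand in for real timestamps. The sketch assumes, as described above, that the highest priority among active shifts wins.

```python
from dataclasses import dataclass


@dataclass
class Shift:
    user: str
    start_hour: int  # inclusive
    end_hour: int    # exclusive
    priority: int


def on_call_at(shifts, hour):
    """Return the user of the highest-priority shift active at `hour`, if any."""
    active = [s for s in shifts if s.start_hour <= hour < s.end_hour]
    if not active:
        return None
    return max(active, key=lambda s: s.priority).user


shifts = [
    Shift("alice", 0, 24, priority=1),   # primary shift covering the whole day
    Shift("bob", 9, 17, priority=99),    # override covering 09:00-17:00
]

print(on_call_at(shifts, 10))  # bob: the override outranks the primary shift
print(on_call_at(shifts, 20))  # alice: only the primary shift is active
```

If the override were given a priority lower than the primary shift, `on_call_at(shifts, 10)` would return `alice`, which is the kind of silent misconfiguration that intentional priorities prevent.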

Timezone considerations

Configure timezone settings to avoid confusion for distributed teams.

Enable timezone support (called use_tz in the API/Terraform) for web schedules:

  • Shifts respect the schedule’s timezone.
  • Daylight saving time is handled automatically.
  • Schedules are clearer for distributed teams.

Without timezone support (legacy):

  • Shifts are stored as UTC.
  • Manual adjustment is needed for DST.
  • This can cause confusion across time zones.

Test timezone changes carefully:

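A rotation starting “Monday 9am” in US/Pacific begins at Monday 17:00 UTC in winter and Monday 16:00 UTC during daylight saving time, and a start time late in the local day can land on a different UTC day depending on DST. Changing timezones can therefore shift the rotation day unexpectedly.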
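One way to check how a local start time maps to UTC across DST is Python's standard `zoneinfo` module. This sketch uses `America/Los_Angeles`, the canonical name for the US/Pacific zone. The dates and the 16:30 start time are arbitrary examples chosen to show the day boundary effect.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

pacific = ZoneInfo("America/Los_Angeles")


def to_utc(local):
    """Convert a timezone-aware local datetime to UTC."""
    return local.astimezone(timezone.utc)


# Monday 09:00 local time, once in winter (PST) and once in summer (PDT).
winter_morning = to_utc(datetime(2024, 1, 8, 9, 0, tzinfo=pacific))
summer_morning = to_utc(datetime(2024, 7, 8, 9, 0, tzinfo=pacific))
print(winter_morning.strftime("%A %H:%M"))  # Monday 17:00
print(summer_morning.strftime("%A %H:%M"))  # Monday 16:00

# Monday 16:30 local time crosses midnight UTC only in winter.
winter_late = to_utc(datetime(2024, 1, 8, 16, 30, tzinfo=pacific))
summer_late = to_utc(datetime(2024, 7, 8, 16, 30, tzinfo=pacific))
print(winter_late.strftime("%A %H:%M"))  # Tuesday 00:30
print(summer_late.strftime("%A %H:%M"))  # Monday 23:30
```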

Ensure coverage quality

Monitor schedules to identify gaps and ensure fair distribution.

Gap and empty shift reports

Enable schedule quality reports (called enabled_reports in the API/Terraform) to detect issues:

  • Gaps: Time periods with no on-call coverage.
  • Empty shifts: Shifts with no users assigned.

Both indicate coverage problems that should be addressed before they cause missed alerts.
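The two checks can be sketched on simplified data. `find_gaps` and `empty_shifts` are hypothetical helpers, not IRM functions, and hours within a single day stand in for real timestamps.

```python
def find_gaps(shifts, window_start, window_end):
    """Return uncovered (start, end) intervals within the window.

    `shifts` is a list of (start, end) pairs; overlapping shifts are allowed.
    """
    gaps = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(shifts):
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


def empty_shifts(users_by_shift):
    """Return the names of shifts that have no users assigned."""
    return sorted(name for name, users in users_by_shift.items() if not users)


gaps = find_gaps([(0, 8), (10, 18), (17, 22)], window_start=0, window_end=24)
print(gaps)  # [(8, 10), (22, 24)]

empty = empty_shifts({"weekday-day": ["alice"], "weekend": []})
print(empty)  # ['weekend']
```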

Quality metrics

IRM calculates schedule quality metrics:

  • Coverage percentage: Time with on-call coverage versus total time.
  • Balance score: How evenly work is distributed among team members.
  • Overloaded users: Team members with significantly more on-call time.

Review these metrics regularly to ensure fair and complete coverage.
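The source does not state how IRM computes these metrics, so the following is only a rough sketch of comparable calculations. The function names, the assumption that shifts do not overlap, and the 1.5x-of-mean threshold for "overloaded" are all choices made for this example, not IRM's formulas.

```python
from collections import Counter


def coverage_percentage(covered_hours, total_hours):
    """Share of the period that has someone on-call, as a percentage."""
    return 100.0 * covered_hours / total_hours


def hours_per_user(shifts):
    """Sum on-call hours per user from (user, hours) pairs."""
    totals = Counter()
    for user, hours in shifts:
        totals[user] += hours
    return dict(totals)


def overloaded_users(totals, threshold=1.5):
    """Users whose hours exceed `threshold` times the team mean (assumed rule)."""
    mean = sum(totals.values()) / len(totals)
    return sorted(user for user, hours in totals.items() if hours > threshold * mean)


# One 168-hour week of non-overlapping shifts, as (user, hours) pairs.
week = [("alice", 48), ("bob", 36), ("alice", 48), ("carol", 24)]

totals = hours_per_user(week)
coverage = coverage_percentage(sum(totals.values()), total_hours=168)
overloaded = overloaded_users(totals)

print(totals)              # {'alice': 96, 'bob': 36, 'carol': 24}
print(round(coverage, 1))  # 92.9
print(overloaded)          # ['alice']
```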

Fair distribution

On-call work should be distributed fairly across the team:

  • Monitor balance scores to identify overloaded team members.
  • Adjust rotations if some users consistently carry more load.
  • Consider timezone distribution for follow-the-sun schedules.
  • Account for holidays and time off when calculating fairness.

Best practices summary

  • Understand dynamic evaluation: Schedules are checked at execution time, not alert creation.
  • Choose the right type: Web for simplicity, iCal for integration, API for automation.
  • Set rotation start explicitly: Don’t rely on default behavior.
  • Use shift swaps: Prefer swaps over overrides for planned changes.
  • Enable timezone support: For new schedules, use timezone-aware shifts.
  • Monitor quality: Enable gap and empty shift reports.
  • Distribute fairly: Review balance scores regularly.
  • Test changes: Verify timezone and rotation changes before production.

Next steps