Grafana Cloud

Create SLOs

Create SLOs to measure the quality of service you provide to your users.

Each SLO contains an SLI, a target, and an error budget. You can also choose to add SLO alert rules to alert on your error budget burn rate.

In the following sections, we’ll guide you through the process of creating your SLOs.

To create an SLO, use the in-product setup wizard or follow these steps:

  1. Define a service level indicator
  2. Set a target
  3. Add a name and description
  4. Add SLO alert rules
  5. Review and save your SLO

Before you begin

Before creating an SLO, complete the steps to Set up Grafana SLO.

Set up Alert Notifications [Optional]

To use a custom notification policy for your SLO alert rules, complete the following steps:

  1. Configure a notification policy and add SLO labels as label matchers in the Alerting app.

    Add the following SLO labels to make severity-based routing decisions. You can also use custom labels that were added to the SLO or its alerting configuration.

    • grafana_slo_severity # “warning” or “critical”
  2. Configure contact points for the notification policy in the Alerting application.

Note

For fast-burn alert rules (for example, grafana_slo_severity="critical"), we suggest using a paging service, such as Grafana OnCall.

For slow-burn alert rules (for example, grafana_slo_severity="warning"), we suggest opening a ticket in Jira, ServiceNow, GitHub, or another ticketing system.

Define a service level indicator

Start with your service level indicator (SLI), the metric you want to measure for your SLO.

SLIs (Service Level Indicators) are the metrics you measure over time that inform about the health or performance of a service.

  1. Click Alerts & IRM -> SLO -> + Create SLO.

  2. Enter a time window.

    Select a time frame over which to evaluate the SLO.

    Note

    The default time window for Grafana SLOs is 28 days because it always contains the same number of weekends, no matter which day of the week it starts on. This accounts for weekend traffic variation better than a 30-day SLO.

  3. Choose a data source.

    Select a data source from the data source drop-down picker.

    Note

    Grafana SLOs can be created for Grafana Cloud Metrics, Grafana Enterprise Metrics, Grafana Mimir, and select other metrics data sources.

    To use a Grafana Mimir data source, the Ruler API must be enabled, and the user creating an SLO must be allowed to create recording and alerting rules for the data source.

  4. Choose query type.

    Ratio query

    The Ratio Query Builder builds a properly formatted SLI ratio query without the need to write PromQL. Syntax, required variables, and expression format are generated based on the metrics entered. Use the Ratio Query Builder if you are new to PromQL or if your SLO doesn’t require data manipulation at the query level. Ratio queries require Prometheus counter metrics.

    a. Enter a Prometheus counter metric for the success metric.

    b. Enter a Prometheus counter metric for the totals metric.

    c. [Optional]: Enter grouping labels to generate an SLO with multiple series, breaking the SLO down into logical groups.

    Grouping labels are useful, for example, when an SLO tracks the performance of a service that you want to break down by region. You can then see how much error budget you have left per region.

    Grafana creates multidimensional alerts, so if you group by cluster, each cluster will have its own associated alerts.

    Note

    Grouping labels can change the number of Data Points per Minute (DPM) series that are produced. You can see the number of series created in the SLI ratio, where each series is represented by its own line on the graph displayed in the wizard. For example, grouping by “cluster” creates one series per cluster for the SLO. This means that if you have 5 different clusters, the SLO creates 5 time series, each with the same namespace on the Define SLI wizard.

    The generated expression and a preview of the query are displayed in the Computed SLI ratio window.

    Advanced query

    The advanced query builder allows users to write a freeform query and can be found throughout the Grafana product. You can choose between a graphical query building experience or writing PromQL in a code textbox.

    All advanced queries must use a ratio format.

    To create an advanced query, enter a query that returns a number between zero and one.

    For example, divide the rate of successful requests to a web service by the rate of total requests to the web service.

    Example:

    “Successful” events could be a count of requests to your web service that received a response in less than 300ms, and “total” events could be the count of all requests to your web service.

    See this example from the Prometheus documentation about successful events divided by total events:

    sum(rate(
      http_request_duration_seconds_bucket{le="0.3"}[$__rate_interval]
    )) by (job)
    /
    sum(rate(
      http_request_duration_seconds_count[$__rate_interval]
    )) by (job)

    Grafana Queries - Any Supported Data Source

    When you select a non-Mimir data source from the drop-down, the UI replaces the query type selection with query editors for your selected data source. Two empty queries are available, named “Success” and “Total,” along with a math expression that calculates the ratio $Success / $Total. This is meant to encourage you to build a count-based ratio of successful events over total events but, as with the “Advanced Query” above, you can enter any query that returns a value between zero and one and get basic functionality.

    Example Graphite “Success” Query:

    groupByNode(perSecond(web.*.http.2xx_success.*.*), 1, 'avg')

    Example Graphite “Total” Query:

    groupByNode(perSecond(web.*.http.*.*.*), 1, 'avg')

    See additional data sources for a list of currently supported data sources.
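Whichever query type you choose, the SLI comes down to successful events divided by total events, optionally split by a grouping label. The following Python sketch illustrates that arithmetic only. The function name, the sample data, and the input shape are hypothetical and are not part of Grafana SLO.

```python
from collections import defaultdict


def sli_by_group(samples, group_label):
    """Return success/total for each value of group_label.

    samples is an iterable of (labels, success_count, total_count) tuples,
    where both counts are increases measured over the same time window.
    """
    success = defaultdict(float)
    total = defaultdict(float)
    for labels, ok_events, all_events in samples:
        key = labels.get(group_label, "")
        success[key] += ok_events
        total[key] += all_events
    # Skip groups with no traffic to avoid dividing by zero.
    return {key: success[key] / total[key] for key in total if total[key] > 0}


# Hypothetical per-instance counter increases over one window.
samples = [
    ({"cluster": "eu", "instance": "a"}, 990, 1000),
    ({"cluster": "eu", "instance": "b"}, 995, 1000),
    ({"cluster": "us", "instance": "c"}, 1800, 2000),
]
ratios = sli_by_group(samples, "cluster")
print(ratios)
```

Grouping by "cluster" here yields one ratio per cluster, which mirrors how a grouping label produces one SLO series per label value.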

Set a target

Set the target to a value that indicates “good performance” for the system.

For example, if you define a 99% target for requests with a status code of 200, then as long as at least 99% of requests have a 200 status code over your time window, you are meeting your SLO.

The error budget is the amount of error that your service can accumulate over a certain period of time before your users start being unhappy.

For example, a service with an SLO of 99.5% availability has an error budget of 0.5% downtime, which amounts to about 3.4 hours (201.6 minutes) of downtime over a 28-day period.

To set a target, enter a percentage greater than 0% and less than 100%.

The error budget is automatically calculated as 100% - target.
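To make the calculation concrete, here is a minimal Python sketch that converts a target into an error budget expressed as minutes of downtime. It assumes a time-based availability SLI. The function name and the default 28-day window parameter are illustrative choices, not part of the product.

```python
def error_budget_minutes(target_percent, window_days=28):
    """Return the error budget as minutes of downtime over the window."""
    if not 0 < target_percent < 100:
        raise ValueError("target must be greater than 0 and less than 100")
    # Error budget as a fraction: 100% minus the target.
    budget_fraction = (100 - target_percent) / 100
    return budget_fraction * window_days * 24 * 60


for target in (99.0, 99.9):
    print(target, round(error_budget_minutes(target), 2))
```

For a request-based SLI, the same fraction applies to the number of requests in the window rather than to minutes.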

Statistical Predictions

Once you have entered your target, the SLO app will query 90 days of history from the raw metrics used to define the SLI. It then simulates many scenarios and provides a distribution of likely outcomes over the objective window. Hover your mouse over the histogram chart to see the probability of meeting your SLO for various values.

Select different target probabilities on the presented graph to adjust your SLO target and the likelihood to hit that target.

Note

At times, predictions cannot be generated. In those instances, the SLO app displays an Error budget panel based on the provided query.

Currently, only ratio queries are supported. Freeform queries that can be parsed as a ratio type will be supported in a future release.
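The idea of simulating many scenarios from history can be illustrated with a simple resampling sketch in Python. This is not the SLO app's actual algorithm, which is not described here. It assumes one SLI value per past day, equal traffic each day, and a hypothetical function name and parameters.

```python
import random


def probability_of_meeting_target(daily_sli, target, window_days=28,
                                  simulations=2000, seed=0):
    """Estimate P(average SLI over the window >= target) by resampling history."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    met = 0
    for _ in range(simulations):
        # Build one simulated window by sampling past days with replacement.
        window = rng.choices(daily_sli, k=window_days)
        if sum(window) / window_days >= target:
            met += 1
    return met / simulations


# Hypothetical 90-day history: mostly healthy days with a few bad ones.
history = [0.9995] * 85 + [0.97] * 5
for target in (0.99, 0.995, 0.999):
    print(target, probability_of_meeting_target(history, target))
```

Stricter targets give lower estimated probabilities, which is the trade-off the histogram in the wizard lets you explore.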

Name the SLO

Give your SLO a name. You can also add an optional description or labels to the SLO to give it more context for searches and management.

Good SLO names, descriptions, and labeling practices are a critical part of SLO maintenance and management. A single sentence that is understandable to stakeholders clarifies meaning and adds value to the SLO. Consistent naming conventions make communication about your SLOs and searches easier.

SLO names identify your SLO in dashboards and alert rules.

  1. Add a name for your SLO.

    Make sure the name is short and meaningful, so that anyone can tell what it’s measuring at a glance.

  2. Add a description for your SLO.

    Make sure the description clearly explains what the SLO measures and why it’s important.

  3. Add SLO labels.

    Add team or service labels to your SLO, or add your own custom key-value pair.

Add SLO alert rules

SLO-based alerting prevents noisy alerting while making sure you’re alerted when your SLO is in danger. Add predefined alert rules to alert on your error budget burn rate, so you can respond to problems before consuming too much of your error budget and violating your SLO.

By default, Grafana SLOs are configured to use Grafana-managed alert rules. You can also choose to switch to data source-managed alert rules if you prefer.

SLO alerting rules create alerts based on the burn rate over different time windows. Burn rate is the rate at which your SLO is consuming its error budget. A burn rate of 1 would consume all of your error budget (like 1%) over your entire time window (like 28 days). In this scenario, you would exactly meet your SLO. A burn rate of 10 would consume all of your error budget in one-tenth of the time window (like 2.8 days), breaking your SLO.
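The burn rate arithmetic above can be sketched in a few lines of Python. The function names are illustrative, and the target is passed as a fraction (0.99 for 99%).

```python
def burn_rate(error_ratio, target):
    """Observed error ratio divided by the error budget (1 - target)."""
    return error_ratio / (1 - target)


def days_until_budget_exhausted(rate, window_days=28):
    """Days until the whole budget is spent if the burn rate stays constant."""
    return window_days / rate


# A 99% target leaves a 1% error budget.
print(round(burn_rate(0.01, 0.99), 2))   # errors arrive exactly at budget pace
print(round(burn_rate(0.10, 0.99), 2))   # errors arrive ten times too fast
print(days_until_budget_exhausted(10))   # budget gone in a tenth of the window
```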

Fast-burn alert rules:

Over short time scales, Grafana sends alerts when the burn rate is very high. This alerts you for very serious conditions, such as outages or hardware failures. The fast burn alert triggers when the error budget will be used up in a matter of minutes or hours.

Slow-burn alert rules:

Over longer time scales, Grafana alerts when the burn rate is lower. This alerts you to ongoing issues that require attention. The slow burn alert triggers when the error budget will be used up in a matter of hours or days.

If you decide to generate alert rules for your SLO to notify you when an event (like an outage) consumes a large portion of your error budget, SLO alert rules are automatically added with predefined conditions and are routed to a notification policy.

Note

When you add SLO alert rules, they are installed in Grafana Cloud Alerting in the stack where you create the SLO. The unmodified SLO name is included in the alert name.

To automatically add SLO alert rules, select the Add alert rules checkbox.

SLO alert rules are added once you save your SLO in the Review SLO step.

SLO alert rules are generated with default alert rule conditions.

Advanced Options

The options and features under the Advanced Options header are not required to build SLOs with either the UI or Terraform. They give advanced users control over specific conditions for their SLO.

Minimum Failures

Minimum Failures defines the minimum number of failure events (defined as total events - success events) that must occur in a window before an alert fires. To use this feature, you must have a defined SLI that parses as a ratio from the SLO app. It adds a new term to the PromQL that restricts alerting until the number of failure events has been exceeded. It is applied to all alert rules, so the minimum window that MinFailures is applied to is 1 hour.

Set MinFailures to 0 in the UI or in Terraform to reset to default behavior. Alternatively, you can remove MinFailures from Terraform to reset to default behavior. To learn more about MinFailures, view the best practices here.
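As a rough illustration of how a minimum-failures gate behaves, the following Python sketch counts failures as total events minus success events and suppresses the alert until enough failures have occurred. The function and parameter names are hypothetical, and treating the threshold as inclusive is an assumption. The real check is a term in the generated PromQL.

```python
def should_alert(success_events, total_events, burn_rate_exceeded, min_failures=0):
    """Return True only if the burn-rate condition holds and enough failures occurred."""
    failures = total_events - success_events
    # Assumption: the threshold is inclusive (failures >= min_failures).
    return burn_rate_exceeded and failures >= min_failures


# Low-traffic service: 2 failures out of 10 requests is a high error ratio,
# but a minimum of 5 failures keeps the alert quiet.
print(should_alert(8, 10, burn_rate_exceeded=True, min_failures=5))
# With the default of 0, the gate has no effect.
print(should_alert(8, 10, burn_rate_exceeded=True))
```

This shows why the option is mostly useful for low-traffic services, where a handful of failed requests can otherwise produce a large burn rate.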

View alert rules

View the conditions, name, description, and labels for fast-burn and slow-burn alert rules.

Once you have saved your SLO, you can view your SLO alert rules in the Alert list view in the Alerting application. Here, you can view the status of your alert rules and if there are any firing alert instances associated with them.

For more information on the state and health of alert rules and instances, refer to View the state and health of alert rules.

Conditions

The fast-burn alert rule creates alerts under two conditions:

  • The burn rate is at least 14.4 x the error budget when averaged over the last 5 minutes AND over the last hour.
  • The burn rate is at least 6 x the error budget when averaged over the last 30 minutes AND over the last 6 hours.

The slow-burn alert rule creates alerts under two conditions:

  • The burn rate is at least 3 x the error budget when averaged over the last 2 hours AND over the last 24 hours.
  • The burn rate is at least 1 x the error budget when averaged over the last 6 hours AND over the last 72 hours.

These alert rules are designed so that alerts are created in response to either severe or sustained burn rate, without alerting for minor, transient burn rate.
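The conditions listed above can be expressed as a small Python sketch. Each condition requires the burn rate to reach its threshold over both a short and a long window, and a rule fires if either of its two conditions holds. The data structures and function name are illustrative. The actual rules are generated as alert rule queries.

```python
# (threshold, short window, long window), taken from the conditions listed above.
FAST_BURN = [(14.4, "5m", "1h"), (6.0, "30m", "6h")]
SLOW_BURN = [(3.0, "2h", "24h"), (1.0, "6h", "72h")]


def rule_fires(conditions, burn_rates):
    """burn_rates maps a window name such as "5m" to its average burn rate."""
    return any(
        burn_rates[short_w] >= threshold and burn_rates[long_w] >= threshold
        for threshold, short_w, long_w in conditions
    )


# Hypothetical brief spike: high over 5 minutes, but not over the longer windows.
spike = {"5m": 20.0, "1h": 2.0, "30m": 4.0, "6h": 0.5}
# Hypothetical outage: high over both 5 minutes and 1 hour.
outage = {"5m": 20.0, "1h": 15.0, "30m": 18.0, "6h": 3.0}
print(rule_fires(FAST_BURN, spike))
print(rule_fires(FAST_BURN, outage))
```

Requiring both windows is what filters out the transient spike while still catching the outage.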

Learn more about alerting on SLOs in the Google SRE workbook.

Name and Description

View the Name and Description fields.

Alert rule labels and annotations

When your SLO is created, a set of labels is automatically created to uniquely identify the alert rules for your SLO. Alerts will always include a grafana_slo_uuid, grafana_slo_window, and grafana_slo_severity label.

Custom labels can be added to SLO alert rules in the following ways:

  1. Any label added to the SLO will also be added to the alerting rule.
  2. Labels added to the fast-burn or slow-burn alerting configuration will override default label values or labels added to the SLO.

For example, setting the label team="frontend" on the SLO will mean that label is also added to the generated alerting rules. If you then set team="frontend-oncall" on the fast-burn alerting configuration, the label team="frontend-oncall" will be used instead of team="frontend". If you set a label grafana_slo_severity="high" on the fast-burn alerting configuration, it would replace the default grafana_slo_severity="critical" label.
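The precedence in this example can be pictured as a layered dictionary merge, sketched below in Python. The function name is hypothetical, the grafana_slo_uuid and grafana_slo_window values are placeholders, and the relative order of default labels and SLO labels is an assumption. The text above only states that labels on the alerting configuration override both.

```python
def alert_rule_labels(default_labels, slo_labels, alerting_labels):
    """Merge labels so that later layers override earlier ones."""
    merged = dict(default_labels)
    merged.update(slo_labels)        # labels set on the SLO
    merged.update(alerting_labels)   # labels set on the fast-burn or slow-burn config
    return merged


defaults = {
    "grafana_slo_uuid": "example-uuid",   # placeholder value
    "grafana_slo_window": "28d",          # placeholder value
    "grafana_slo_severity": "critical",
}
labels = alert_rule_labels(
    defaults,
    slo_labels={"team": "frontend"},
    alerting_labels={"team": "frontend-oncall", "grafana_slo_severity": "high"},
)
print(labels["team"], labels["grafana_slo_severity"])
```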

Note that if you create a custom notification policy, you can use these labels as label matchers in the notification policy to control alert routing.

Custom annotations can also be added to both the slow-burn and fast-burn alert rules. Annotations will be attached to any alerts generated by these rules. For more information on alerting annotations, refer to the annotations documentation for alerts.

For more information on labels, refer to label matchers.

For more information on rule groups and namespaces, refer to Namespaces and groups.

View notification policies

When alerts fire, they are routed to a default notification policy or to a custom notification policy with matching labels.

Note

If you have custom notification policies defined, the labels in the alert rule must match the labels in the notification policy for notifications to be sent out.
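The label-matching requirement can be sketched in Python as follows. This is a simplified model that assumes a flat list of policies with equality matchers only. Real notification policies form a tree and support other matcher operators, and the policy names used here are hypothetical.

```python
def matches(policy_matchers, alert_labels):
    """True if every matcher label is present on the alert with the same value."""
    return all(alert_labels.get(key) == value
               for key, value in policy_matchers.items())


def route(policies, alert_labels, default="default-policy"):
    """Return the first policy whose matchers all match, else the default policy."""
    for name, matchers in policies:
        if matches(matchers, alert_labels):
            return name
    return default


policies = [
    ("page-oncall", {"grafana_slo_severity": "critical"}),
    ("open-ticket", {"grafana_slo_severity": "warning"}),
]
print(route(policies, {"grafana_slo_severity": "critical", "team": "frontend"}))
print(route(policies, {"grafana_slo_severity": "high"}))
```

The second call falls through to the default policy, which is what happens when a severity label is overridden on the alert rule without a matching custom policy.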

Notifications are sent to the contact points defined in the notification policy, for example, Slack or email. Email is the contact point in the default notification policy, but you can add contact points as required.

For more information on notification policies, refer to Manage notification policies.

Review and save your SLO

Review each section of your SLO and once you are happy, save it to generate dashboards, recording rules, and alert rules.

Note

If you selected the option of adding SLO alert rules, they are displayed here. They are not created until you save your SLO.