Best practices for Grafana SLOs

Because SLOs are still a relatively new practice, creating them for the first time can feel overwhelming. To help simplify things, this page provides best practices for SLOs and SLO queries.

What is a good SLO?

A Service Level Objective (SLO) defines specific, measurable targets that represent the quality of service provided by a service provider to its users. The best place to start is with the level of service your customers expect. Sometimes this level of service is written into formal service level agreements (SLAs) with customers, and sometimes it is implicit in customers' expectations for a service.

Good SLOs are simple. Don't use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it becomes hard to pay attention to the ones that do.

A good SLO is attainable, not aspirational

Start with a realistic target. Unrealistic goals create unnecessary frustration, which can eclipse useful feedback from the SLO. Remember, an SLO is meant to be achievable and to reflect the user experience; it is not an OKR.

It’s also important to make your SLO simple and understandable. The most effective SLOs are the ones that are readable for all stakeholders.

Target services with good traffic

With too little traffic, there isn't enough signal to monitor trends: alerts become noisy, and irregularities are reflected disproportionately in low-traffic environments. Conversely, too much traffic can mask customer-specific issues.

Team alignment

Teams should create SLOs and SLIs, not managers. SLOs provide feedback about your services and your customers' experience with them, so it's good for the team to work together to create them.

Embed SLO review in team rituals

As you work with SLOs, the information they provide can help guide decision-making because they add context and correlate patterns. This can help when there’s a need to balance reliability and feature velocity. Early on, it’s good practice for teams to review SLOs at regular intervals.

Iterate and adjust

Once SLO review is part of your team rituals, iterate on the information you gather so you can make increasingly informed decisions.

As you learn more from your SLOs, you may learn your assumptions don’t reflect practical reality. In the early period of SLO implementation, you may find there are a number of factors you hadn’t previously considered. If you have a lot of error budget left over, you can adjust your objectives accordingly.

Keep SLI queries simple

There are many ways to configure your SLI queries. Ultimately, it all depends on your needs.

If your metrics aren't suited for event-based SLIs or don't reflect the user experience, update your service instrumentation rather than working around the gap with complex SLI queries.

Tips for creating effective SLIs:

  • Start with availability and latency. These are the most common SLO types for request-driven services. See the availability examples and latency examples.

  • Test from the user’s perspective. Consider using Synthetic Monitoring probes from one or more geographic locations where your users are.

  • Decide between aggregated and multidimensional SLIs, or both. Multidimensional SLIs let you filter consumption and set alerts for each dimension (for example, every cluster or probe). See the multidimensional example.

  • Use event-based SLIs over time-based SLIs. Event-based SLIs are generally more accurate, and Grafana SLO doesn't support all dashboard views for time-based SLIs. Learn more in the SLI calculation types section.

  • Avoid percentiles in SLIs. SLIs should count all requests meeting the target, not just a percentile (for example, “p95 latency under 2s”). Percentiles from histograms can also be inaccurate. See the latency examples.
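
For example, an event-based availability SLI is often a single ratio of good events to total events. The following sketch assumes a hypothetical http_requests_total counter with a status_code label; adapt the metric and label names to your own instrumentation:

sum(rate(http_requests_total{status_code!~"5.."}[$__rate_interval])) / sum(rate(http_requests_total[$__rate_interval]))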

Alerts and labels

SLO alerts are different from typical data source alerts. Because alerts for SLOs let you know there is a trend in your burn rate that needs attention, it’s important to understand how to set up and balance fast-burn and slow-burn alerts to keep you informed without inducing alerting fatigue.

SLO alerting system configuration and migration guide

By default, Grafana SLOs are configured to use Grafana-managed alert rules. You can also choose to use data source-managed alert rules if you prefer.

To change your alerting system preferences for SLOs:

  1. Go to Administration > Plugins and data > Plugins and click on the SLO card.
  2. Click Configuration and select your desired alerting system from the dropdown menu.
  3. Click Save to apply the changes.

When you change your alerting system, the alert rules transfer from one alerting system to the other. All SLOs use the alerting system you select. When switching from one alerting system to another, make sure that you have configured Notification Policies and Contact Points in the new system's Alertmanager to avoid missing alert notifications. Alerts are routed by matching their labels to Notification Policies, which send notifications to Contact Points. Learn more about the different Alertmanagers and about how to configure alert notifications.

Note that only the alert rules transfer; the recording rules remain data source-managed recording rules.

Note

SLO data source-managed alert rules are stored alongside the recording rules in a Grafana Mimir namespace called grafana_slo_<STACK ID> where the stack ID refers to the stack in which the SLO plugin is running. This enables you to quickly search for and uniquely identify SLO alert rules.

Prioritize your alerts

Have your alerts routed first to designated individuals to validate your SLI. Send notifications to designated engineers through OnCall or your main escalation channel when fast-burn alerts fire so that the appropriate people can quickly respond to possible pressing issues. Send group notifications for slow-burn alerts to analyze and respond to as a team during normal working hours.

Use SLO alerts to trigger Grafana OnCall

If you've configured OnCall for Grafana Alerting, you may want to forward SLO alerts to OnCall as well. Forward alerts to OnCall by adding a contact point with a webhook on the "ngalertmanager" Alertmanager.

Use labels

Set up good label practices. Keep the number of labels small so they remain navigable and easy to consume during triage.

Grafana SLOs use two label types: SLO labels and Alert labels. SLO labels are for grouping and filtering SLOs. Alert labels are added to slow and fast burn alerts and are used to route notifications and add metadata to alerts.

Minimum Failures

To reduce alert fatigue, a team may want to set a minimum number of failures (defined as total events minus success events) before an alert is triggered. This is most common for SLIs built on processes with heavily periodic or spiky traffic, where low traffic rates make alerting rules unreliable.

The ideal solution for low traffic is to supplement your traffic flow with synthetics to ensure you always get a clear signal on whether your failure events represent an issue.

If you are unable to use synthetics, you can change the Minimum Failures number, an advanced feature. This number is applied to all of the SLO's alerting time windows, the smallest of which is 1 hour. This means that if your service never gets enough traffic to exceed the Minimum Failures number, it won't trigger an alert.
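
To choose a sensible Minimum Failures value, it can help to look at how many failures your service typically accumulates within the smallest alerting window. As a sketch, assuming a hypothetical http_requests_total counter with a status label, the following query counts failures over the past hour:

sum(increase(http_requests_total{status=~"5.."}[1h]))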

Troubleshoot SLOs

Troubleshoot SLO data source issues

Misconfigured data source permissions can cause SLO calculation errors or no-data results. To troubleshoot SLO data source permission issues:

  1. Verify that the Cloud Access Policy has the correct permissions.
  2. Verify that RBAC is configured appropriately.

Troubleshoot SLI above 100%, no data, or other inconsistencies

Some common SLO source query patterns can result in SLI calculation issues, such as SLIs above 100%, “no data”, or other SLI inconsistencies. The following sections describe common SLI calculation issues and how to resolve or remediate them.

Source query returns above 100%

SLO source queries should generate values within the range of 0-1.

Redefine the source query if it produces values outside of the range of 0-1. Refer to the query tips for more information about correctly configuring SLO source queries.

Source query compares different metrics

Ratio queries that use different metrics in the numerator and the denominator can experience synchronization issues when the numerator and denominator are incremented at different intervals.

For example, this query uses the http_success_total metric in the numerator and the http_requests_total metric in the denominator, which is susceptible to reporting > 100% when the metrics are ingested at different times.

sum(rate(http_success_total[$__rate_interval])) / sum(rate(http_requests_total[$__rate_interval]))

Use a single metric with series identified by labels to avoid synchronization issues if possible. The metrics ingester synchronizes series belonging to the same metric within a single scrape endpoint during ingestion.

sum(rate(http_requests_total{code="200"}[$__rate_interval])) / sum(rate(http_requests_total[$__rate_interval]))

Note

The metrics ingester also synchronizes data types that natively generate multiple metrics, such as histograms. For example, the Prometheus client library synchronizes these metrics:

  • http_request_duration_seconds_bucket
  • http_request_duration_seconds_count
  • http_request_duration_seconds_sum
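
Because these histogram series are ingested together, they are a good basis for a latency SLI that counts events under a threshold instead of relying on percentiles. As a sketch, assuming a bucket boundary at 2 seconds on a hypothetical http_request_duration_seconds histogram:

sum(rate(http_request_duration_seconds_bucket{le="2"}[$__rate_interval])) / sum(rate(http_request_duration_seconds_count[$__rate_interval]))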

Series aggregated by Adaptive Telemetry

Adaptive Telemetry can significantly affect recording rule values, which can lead to unexpected irregularities in recorded SLI data.

Exclude metrics used in SLI queries from Adaptive Telemetry aggregation.

Series created by recording rules

The ruler executes recording rules with varying latency. SLI series may be inaccurate or report over 100% when the numerator and denominator of the ratio query come from different recording rules. The ruler can cause synchronization issues when it evaluates and writes series in the numerator and denominator at different latencies or intervals.

Query the source series instead of the recorded series if possible to generate accurate SLI calculations.
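
For example, assuming a hypothetical recording rule job:http_requests:rate5m that aggregates http_requests_total by code, a ratio built from the recorded series is prone to synchronization issues:

sum(job:http_requests:rate5m{code="200"}) / sum(job:http_requests:rate5m)

Querying the source counter directly avoids the issue:

sum(rate(http_requests_total{code="200"}[$__rate_interval])) / sum(rate(http_requests_total[$__rate_interval]))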

Sampled series

Downsampled series can have inconsistent SLI values when the success and total series are not evenly sampled by the metrics ingester. For example, if more success events are sampled than total events, the SLI can measure over 100%. One common scenario where this happens is when using series derived from sampled traces.

Avoid using sampled series in the SLO query if possible.

Newly initialized counter series

Counter series that transition from being absent to being present cause discrepancies in the SLI calculation when the series’ initial value is not zero. The rate function requires two data points. The increase recorded by the initial transition from the series being absent to present is lost when the initial value is not zero. This causes the SLI metric to be incorrect.

This source of error is exacerbated when many short-lived series are included in the SLO. Ensure counter series used by the SLO query are initialized with a 0 value, or avoid including short-lived series in the SLO query if possible. It may not be practical to initialize each labeled series to zero, such as series identified by HTTP status codes, but initializing the most common value from each "success" and "total" class may be enough:

  • status=200
  • status=500

Late-arriving series

Series samples might not arrive and become queryable until well after the timestamps in the samples themselves when network issues or problems in the collection environment introduce latency to the ingestion pipeline. The SLO creation wizard configures SLO queries with a metrics offset to handle this scenario. However, metrics written outside of the offset can cause synchronization issues. For example, synchronization issues can occur if Mimir ingests metrics from more than two minutes ago when the SLO query is configured to use offset 2m.

In certain situations, such as SLOs based on late-arriving Amazon CloudWatch metrics, you may need to configure a greater offset. The advanced SLO type parses a user-provided offset and adds two minutes to it. If you wanted to achieve a five-minute offset, you could use the advanced query below. Verify the resulting offset by clicking "View Alerts" on the Manage SLOs page and viewing the recording rules.

sum (rate(request_total{status="success"}[$__rate_interval] offset 3m)) / sum (rate(request_total[$__rate_interval] offset 3m))

Note: It's important to track metrics for ingestion latency when using a self-hosted data source, because metrics ingestion latency can increase the number of late-arriving metrics and impact SLO evaluation.

Out-of-order series

Mimir has an out_of_order_time_window feature that can cause synchronization issues for SLO recording rules. Refer to Recording rules for ingesting out-of-order samples for information about out_of_order_time_window configuration.

Set the SLO offset equal to or greater than the out_of_order_time_window when this feature is enabled.
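
For example, if out_of_order_time_window is set to 10m, the total SLO offset should be at least 10 minutes. Because the advanced SLO type adds two minutes to the user-provided offset, a query like the following (reusing the hypothetical request_total metric from the earlier example) results in a 10-minute effective offset:

sum(rate(request_total{status="success"}[$__rate_interval] offset 8m)) / sum(rate(request_total[$__rate_interval] offset 8m))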

Additional reference materials

Google provides good introductory documentation on SLOs in their SRE Book. They also provide useful guides on SLO implementation and alerting on SLOs.