Grafana Cloud

Best practices for Grafana SLOs

Because SLOs are still a relatively new practice, it can feel overwhelming when you start to create SLOs for the first time. To help simplify things, some best practices for SLOs and SLO queries are provided on this page.

What is a good SLO?

A Service Level Objective (SLO) is meant to define specific, measurable targets which represent the quality of service provided by a service provider to its users. The best place to start is with the level of service your customers expect. Sometimes these are written into formal service level agreements (SLAs) with customers, and sometimes they are implicit in customers expectations for a service.

Good SLOs are simple. Don’t use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it’ll make it hard to pay attention to the ones that matter.

A good SLO is attainable, not aspirational

Start with a realistic target. Unrealistic goals create unnecessary frustration which can then eclipse useful feedback from the SLO. Remember, this is meant to be achievable and it is meant to reflect the user experience. An SLO is not an OKR.

It’s also important to make your SLO simple and understandable. The most effective SLOs are the ones that are readable for all stakeholders.

Target services with good traffic

Too little traffic is insufficient for monitoring trends and can cause noisy alerts and irregularities can be reflected disproportionately with low-traffic environments. Conversely, too much traffic can mask customer-specific issues.

Team alignment

Teams should be the ones to create SLOs and SLIs, not managers. Your SLOs should communicate to you feedback for your services and the customer experience with them, so it’s good for the team to work together to create the SLOs.

Embed SLO review in team rituals

As you work with SLOs, the information they provide can help guide decision-making because they add context and correlate patterns. This can help when there’s a need to balance reliability and feature velocity. Early on, it’s good practice for teams to review SLOs at regular intervals.

Iterate and adjust

Once SLO review is a part of your team rituals it’s important to iterate on the information you have to be able to make continuously more informed decisions.

As you learn more from your SLOs, you may learn your assumptions don’t reflect practical reality. In the early period of SLO implementation, you may find there are a number of factors you hadn’t previously considered. If you have a lot of error budget left over, you can adjust your objectives accordingly.

Alerts and labels

SLO alerts are different from typical data source alerts. Because alerts for SLOs let you know there is a trend in your burn rate that needs attention, it’s important to understand how to set up and balance fast-burn and slow-burn alerts to keep you informed without inducing alerting fatigue.

Prioritize your alerts

Have your alerts routed first to designated individuals to validate your SLI. Send notifications to designated engineers through OnCall or your main escalation channel when fast-burn alerts fire so that the appropriate people can quickly respond to possible pressing issues. Send group notifications for slow-burn alerts to analyze and respond to as a team during normal working hours.

Use labels

Set up good label practices. Keep them limited to make them navigable and consumable for triage.

Grafana SLOs use two label types: SLO labels and Alert labels. SLO labels are for grouping and filtering SLOs. Alert labels are added to slow and fast burn alerts and are used to route notifications and add metadata to alerts.

Minimum Failures

To reduce alert fatigue a team may want to set a minimum number of failures (as defined by (success events - total events)) before an alert is triggered. This is most common for SLIs built on processes that have heavily periodic or spiky traffic where the low traffic rates make alerting rules unreliable.

The ideal solution for low-traffic is to supplement your traffic flow with synthetics to ensure you always get a clear signal on whether your failure events represent an issue.

If you are unable to use synthetics, you can choose to change the Minimum Failures advanced feature number. This number is applied to all your alerting time windows for the SLO, the smallest of which is 1 hour. This means that, if your service never gets enough traffic to exceed the Minimum Failures number, it won’t trigger an alert.

Query tips and pitfalls

There are many approaches to how you configure your SLO queries. Ultimately, it all depends on your needs. Ultimately, just remember: if you don’t have metrics that represent your user’s experience then you need new metrics.

Keep queries simple

The best SLIs are based on Prometheus counter metrics (such as monotonically increasing series) and use labels to encode the counted event as either a success or failure (for example: requests_total{code=”200”}). If your metrics don’t look like this, it’s probably better to reinstrument your service with well-suited metrics than to try and work around the issue with complex SLI query definitions.

Availability and Latency are the most common SLOs to start with for request driven services. For example:

Availability (non-5xx responses): requests_total{code!~”5..”} / requests_total
Latency (less than 1 second): requests_duration_seconds_bucket{code!~”5..”, le=”1.0”} / requests_duration_seconds_count{code!~”5..”}

Freshness is a common SLO for message queues or batch processes where you want to ensure that each item (perhaps after several retries) gets completed before the work request grows too stale.

Freshness (work spent less than 120 sec in queue): completed_duration_seconds_bucket{le=”120”} / completed_duration_seconds_count

Advanced SLIs

Define advanced SLIs as a “success/total” ratio for best dashboards. The “Ratio” SLO type enforces this success/total style, but you’ll get more dashboard features if you follow the same approach with your advanced SLOs.

Do <success rate> / <total rate>
Avoid: 1 - (<failure rate> / <total rate>)

If you can’t reinstrument your metrics to encode success/failure with labels and you must work with failure_total and all_total counters, you can do (total - fail) / total. For example: `( sum by (…) (rate(all_total[$__rate_interval]))

sum by (…) (rate(failure_total[$__rate_interval])) ) / sum by (…) (rate(all_total[$__rate_interval]))`

Know your SLIs

There are many SLI types. A brief explanation of Multidimensional and Rollup SLIs follows below.

Multidimensional SLI

A Multidimensional SLI reports a ratio for each value of a given label. for example: sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate<total>[5m])). When you specify “group by” labels on the ratio SLO type, it makes it a multidimensional SLI. A common use is to specify cluster and/or namespace in the grouping. Multidimensional SLIs enables per-cluster alerting and supports more flexible dashboards where you can include or exclude values for the chosen dimension labels (see rollup SLI below).

Rollup SLI

A rollup SLI (or aggregated SLI) is a calculation of a multidimensional SLI where the numerator and denominator is further aggregated before the final ratio calculation. When you select cluster=all on the dashboard of a multidimensional SLO that defined cluster as a group label, the dashboard calculates the aggregate ratio of the sum of all successes/over sum of all requests. This provides alerting on each cluster and reporting on the rollup overall results.

Additional reference materials

Google provides documentation on SLOs in their SRE Book. They also provide useful guides on SLO implementation and alerting on SLOs.

Feedback

Best practices for Grafana SLOs

What is a good SLO?