Introduction to Grafana SLO
With Grafana SLO, you can create, manage, monitor, and alert on your SLOs.
- Learn about SLOs interactively with in-app guidance and this documentation
- Set up your SLOs in Grafana using the UI or API
  - From templates (via Integrations, starting with K8s)
  - From your own queries (first Prometheus, then any data source using Recorded Queries)
- Iterate on the target and details to make the SLOs work. Update your SLI, evaluation windows, and targets before deploying alerts.
- Publish your SLO to:
  - Add it to a central dashboard that shows the status of all services
  - Create a service-specific dashboard
  - Create a set of multi-window, multi-burn alert rules
There are two key components to the Grafana SLO framework.
Service Level Indicators (SLIs)
An SLI is a key performance metric, like availability. SLIs are the metrics you measure over time that inform you about the health or performance of a service. An SLI expresses actual results as a percentage, for example, 99.99% system availability last month. Grafana SLO expresses the SLI percentage as “good” events divided by “all” events. Choosing the definition of “good” is an art that we will help with.
Service Level Objectives (SLOs)
The target an SLI ought to achieve. You define what an acceptable level of service is. For example, this could be the percentage of error responses that is acceptable within a given time frame, so that a customer doesn’t notice a degradation in the service you are providing: 99.9% of requests to a web service return without server errors over a month.
When defining your SLOs, it is important to remember that you are not aiming for 100%. The cost and complexity of availability get higher the closer you get to 100%, so it’s important to factor a margin of error into your target, known as the error budget.
To highlight how this works, let’s use an example of a credit card processing app. The company has 99.99% availability written into contracts, but that doesn’t give a clear picture of what their customers actually expect from their service.
Using the SLO framework, this company can be more specific about their availability goals. The SLO in this case would be that they want 99.99% of requests to validate a credit card to return without a 500 error in less than 100ms. Validation should be instantaneous, because e-commerce websites need to show customers if they mistyped a number before they hit the “buy” button. The SLI in this case would be the percentage of requests to validate a credit card that return without errors in less than 100ms.
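As a rough illustration of the SLI in this example, the following Python sketch counts a request as “good” when it returns without a server error in under 100ms, then divides good events by all events. The `Request` type, its field names, and the sample data are hypothetical and are not part of Grafana SLO.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field)
    latency_ms: float  # response time in milliseconds (hypothetical field)


def is_good(req: Request, max_latency_ms: float = 100.0) -> bool:
    """A request is 'good' if it has no server error and is fast enough."""
    return req.status < 500 and req.latency_ms < max_latency_ms


def sli(requests: list[Request]) -> float:
    """SLI as a percentage: good events divided by all events."""
    if not requests:
        raise ValueError("SLI is undefined with no events")
    good = sum(1 for r in requests if is_good(r))
    return 100.0 * good / len(requests)


sample = [
    Request(200, 42.0),   # good
    Request(200, 180.0),  # too slow
    Request(500, 30.0),   # server error
    Request(200, 95.0),   # good
]
print(sli(sample))  # 2 good out of 4 requests -> 50.0
```

Changing `is_good` changes the SLI without touching any other code, which is why the definition of “good” deserves the most attention.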
Some other key concepts to be familiar with when implementing your SLO strategy are:
Error budgets allow for a certain amount of failure within an SLO. It is a measurement of the difference between actual and desired performance. Using the web service example from above, the error budget is the difference between perfect service (100%) and the service level objective (99.9%).
In this case, the error budget can be measured as a percentage (like 0.1% failures) or an amount of time (about 40 minutes of non-compliance per 28 days).
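The arithmetic behind an error budget can be sketched in a few lines of Python. The function names here are illustrative only. For a 99.9% target, the budget is roughly 40 minutes over a 28-day window and roughly 43 minutes over a 30-day window.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of events (or time) allowed to fail under the SLO."""
    return (100.0 - slo_percent) / 100.0


def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Error budget expressed as minutes of non-compliance in the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_percent) * window_minutes


# Compare the time budget for the same target over two common windows.
for days in (28, 30):
    print(days, round(error_budget_minutes(99.9, days), 1))
```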
Burn rate is the rate at which your service is running out of error budget, which is the amount of imperfection you’re okay with in your service.
By setting an SLO of 99.5%, you’re saying it’s okay for 0.5% of requests to return errors or take longer than 500ms. If you have a constant error rate of 0.5%, your service will completely run out of error budget in 28 days. That’s a burn rate of 1. Slower burn (like 0.75) is good! That means you’re beating your SLO. Faster burn is bad: it implies you’re providing lower-quality service than your users expect, and you should do something about it.
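One way to make the burn rate concrete is to divide the observed error rate by the error rate the SLO allows. The sketch below assumes a 28-day window and a constant error rate; the function names are hypothetical.

```python
def burn_rate(error_rate_percent: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 100.0 - slo_percent
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate_percent / allowed


def days_until_exhausted(rate: float, window_days: float = 28.0) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


rate = burn_rate(0.5, 99.5)             # 0.5% errors against a 99.5% SLO
print(rate, days_until_exhausted(rate))  # burn rate 1.0, budget lasts 28.0 days
```

A burn rate of 2 under the same assumptions would exhaust the budget in 14 days, halfway through the window.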
Alert on your burn rate
SLO alert rules trigger alerts when you’re in danger of using up the error budget in your SLO timeframe. This ensures support teams are only notified when an issue impacts their business objectives and not each and every time a monitored resource or process breaches a set threshold.
Grafana generates both fast and slow burn rate alerts, because you will probably want to react differently if your service is slowly burning the error budget (e.g. just open a ticket if a bug has increased your error rate) vs. quickly burning the error budget (e.g. notify the on-call engineer for a regional outage).
For example, if you’re burning error budget at a rate of 2% per hour (in our case, that would be like an error rate of X%), Grafana triggers an alert, which should page an on-call engineer using a tool like Grafana OnCall. This catches urgent events, like outages or hardware failures.
If you’re burning 0.8% of your error budget per hour, Grafana sends a less-critical alert, intended for you to open a ticket in Jira, ServiceNow, GitHub, or another ticketing system. This catches less-urgent events, like bugs or network slowdowns.
Fast-burn alert rule:
Over short time scales, Grafana sends alerts when the burn rate is very high. This alerts you for very serious conditions, such as outages or hardware failures.
Slow-burn alert rule:
Over longer time scales, Grafana alerts when the burn rate is lower. This alerts you to ongoing issues that require attention.
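The relationship between a burn rate and the "percent of error budget per hour" figures above can be sketched as follows. This is a simplified illustration, not the rules Grafana generates: it assumes a 28-day SLO window, uses the 2% and 0.8% per hour thresholds from the text, and ignores the multiple evaluation windows that real alert rules use. All names are hypothetical.

```python
WINDOW_HOURS = 28 * 24  # assumed 28-day SLO window

# Thresholds from the text, as percent of error budget consumed per hour.
FAST_BURN_PERCENT_PER_HOUR = 2.0
SLOW_BURN_PERCENT_PER_HOUR = 0.8


def budget_percent_per_hour(burn_rate: float) -> float:
    """Percent of the total error budget consumed each hour at this burn rate."""
    return 100.0 * burn_rate / WINDOW_HOURS


def classify(burn_rate: float) -> str:
    """Map a burn rate to the kind of response described in the text."""
    pct = budget_percent_per_hour(burn_rate)
    if pct >= FAST_BURN_PERCENT_PER_HOUR:
        return "page"    # fast burn: notify the on-call engineer
    if pct >= SLOW_BURN_PERCENT_PER_HOUR:
        return "ticket"  # slow burn: open a ticket
    return "ok"


for rate in (1, 6, 14):
    print(rate, classify(rate))
```

Under these assumptions, 2% of the budget per hour corresponds to a burn rate of about 13.4, and 0.8% per hour to a burn rate of about 5.4.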
Design your SLOs
When designing your SLOs, think about:
What level of service do your customers expect?
The best place to start is with the level of service your customers expect. Sometimes these are written into formal service level agreements with customers, and sometimes they are implicit in customers’ behavior. How quickly do your customers expect responses to their queries? Is it acceptable for them to retry a query if it fails? Do they care specifically about response accuracy?
Which metrics best reflect user behavior and service delivery?
Don’t use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it’ll make it hard to pay attention to the ones that matter.
Which values do you want these metrics to signify good versus bad?
Allow for an error budget (a rate at which the SLO can be missed) to track over time.
How will you react if you can’t meet the expected service target?
Iterate and evaluate over time whether your targets are being met. If you have a lot of error budget left over, you can adjust your objectives accordingly.