Introduction to Grafana SLO
With Grafana SLO, you can create, manage, monitor, and alert on your Service Level Objectives.
- Learn about SLOs interactively with in-app guidance and this documentation.
- Set up your SLOs in Grafana using the UI, Terraform, or the API.
- Use our query builder to easily make ratio-based SLIs.
- Create an SLI from any PromQL query.
- Iterate on the query and objective to build a high-quality SLO. Update your SLI, evaluation windows, and objectives before deploying alerts.
- Use multi-dimension SLIs.
- Look back at your SLO’s performance over time, including before the SLO was created.
- Generate multi-window, multi-burn rate alerting rules to ensure you’re notified at the right time.
- Automatically generate a dashboard for each SLO that can drill down into the metrics to help investigate alerts.
There are two key components to the Grafana SLO framework.
Service Level Indicators (SLIs)
A key performance metric, like availability. SLIs are the metrics you measure over time that inform you about the health or performance of a service. An SLI expresses actual results as a fraction, for example 99.9% system availability, or 0.999. Grafana SLO can help you create a high-quality SLI using the ratio query builder, and it also supports any custom PromQL query for an SLI.
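As a rough illustration of a ratio-based SLI, the sketch below divides good events by total events. The function name and the event counts are hypothetical and are not part of Grafana SLO; in practice the ratio would come from a PromQL query.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Return the SLI as the fraction of good events over total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


# Hypothetical counts: 99,900 successful requests out of 100,000.
sli = sli_ratio(good_events=99_900, total_events=100_000)
print(f"{sli:.3f}")  # 0.999
```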
Service Level Objectives (SLOs)
The target an SLI ought to achieve. You define what an acceptable level of service is. In our example, this would be the share of requests that must succeed within a given time frame so that a customer doesn’t notice a degradation in the service you are providing: 99.9% of requests to a web service return without server errors over 28 days.
When defining your SLOs, it is important to remember that you are not aiming for 100%. The cost and complexity of availability get higher the closer you get to 100%, so it’s important to factor a margin of error into your target, known as the error budget.
To highlight how this works, let’s use the example of a credit card processing app. The company has 99.95% availability written into its contracts, but that figure doesn’t give a clear picture of what customers actually expect from the service.
Using the SLO framework, this company can be more specific about its availability goals. The SLO in this case would be that 99.97% of requests to validate a credit card return without a 500 error in less than 100ms. Validation should be instantaneous, because e-commerce websites need to show customers if they mistyped a number before they hit the “buy” button. The SLI in this case would be the percentage of requests to validate a credit card that return without errors in less than 100ms.
Some other key concepts to be familiar with when implementing your SLO strategy are:
Error budgets allow for a certain amount of failure when measuring the performance of a service. An error budget is a measurement of the difference between perfect and desired performance. Using the example from above, it is the difference between perfect service (100%) and the service level objective (99.97%).
In this case, the error budget can be measured as a percentage (like 0.03% failures) or an amount of time (12 minutes of non-compliance per 28 days).
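The arithmetic behind those two figures can be sketched as follows. The function names are illustrative only; the objective (99.97%) and window (28 days) come from the example above.

```python
def error_budget_fraction(objective: float) -> float:
    """Fraction of events (or time) allowed to fail, e.g. 0.0003 for 99.97%."""
    return 1.0 - objective


def error_budget_minutes(objective: float, window_days: int) -> float:
    """Error budget expressed as minutes of non-compliance over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(objective) * window_minutes


budget_minutes = error_budget_minutes(objective=0.9997, window_days=28)
print(round(budget_minutes, 1))  # 12.1
```

A 28-day window has 40,320 minutes, and 0.03% of that is about 12.1 minutes, which the text rounds to 12.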
Burn rate is the rate at which your service is running out of error budget, which is the amount of imperfection you’re okay with in your service.
By setting an SLO of 99.5%, you’re saying it’s okay for 0.5% of requests to return errors or take longer than 500ms. If you have a constant error rate of 0.5%, your service will completely run out of error budget in 28 days. That’s a burn rate of 1. Slower burn (like 0.75) is good! That means you’re beating your SLO. Faster burn is bad: it implies you’re providing lower-quality service than your users expect, and you should do something about it.
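To make the burn rate definition concrete, here is a minimal sketch that treats burn rate as the observed error rate divided by the error budget. The helper names are hypothetical, and the 28-day window is taken from the example above.

```python
def burn_rate(error_rate: float, objective: float) -> float:
    """Observed error rate divided by the error budget (1 - objective)."""
    budget = 1.0 - objective
    if budget <= 0:
        raise ValueError("objective must be below 1.0")
    return error_rate / budget


def days_until_exhausted(rate: float, window_days: int = 28) -> float:
    """Days until the whole budget is spent at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days / rate


steady = burn_rate(error_rate=0.005, objective=0.995)    # about 1.0
gentle = burn_rate(error_rate=0.00375, objective=0.995)  # about 0.75
print(round(steady, 2), round(gentle, 2))
print(round(days_until_exhausted(steady), 1))
```

At a burn rate of 1 the budget lasts exactly one window; at 0.75 some budget is left over when the window ends.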
Alert on your burn rate
SLO alert rules trigger alerts when you’re in danger of using up the error budget in your SLO timeframe. This ensures support teams are only notified when an issue impacts their business objectives and not each and every time a monitored resource or process breaches a set threshold.
Grafana generates both fast and slow burn rate alerts, because you will probably want to react differently if your service is slowly burning the error budget (e.g. just open a ticket if a bug has increased your error rate) vs. quickly burning the error budget (e.g. notify the on-call engineer for a regional outage).
For example, if you’re burning error budget at a rate of 2% per hour (in our case, that would be like an error rate of X%), Grafana triggers an alert, which should page an on-call engineer using a tool like Grafana OnCall. This catches urgent events, like outages or hardware failures.
If you’re burning 0.8% of your error budget per hour, Grafana sends a less-critical alert, intended for you to open a ticket in Jira, ServiceNow, GitHub, or another ticketing system. This catches less-urgent events, like bugs or network slowdowns.
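The "percent of budget per hour" figures can be converted into burn rates and error rates. The sketch below assumes a 28-day window and, purely for illustration, the 99.5% objective used in the burn rate example; neither the helper names nor the resulting error rates come from Grafana's documentation.

```python
WINDOW_DAYS = 28  # assumed SLO window


def burn_rate_from_hourly_budget(budget_fraction_per_hour: float,
                                 window_days: int = WINDOW_DAYS) -> float:
    """Burn rate implied by spending a fraction of the budget each hour."""
    window_hours = window_days * 24
    return budget_fraction_per_hour * window_hours


def error_rate_for_burn(burn: float, objective: float) -> float:
    """Error rate that produces the given burn rate for an objective."""
    return burn * (1.0 - objective)


fast = burn_rate_from_hourly_budget(0.02)   # 2% of the budget per hour
slow = burn_rate_from_hourly_budget(0.008)  # 0.8% of the budget per hour
print(round(fast, 2), round(slow, 3))  # 13.44 5.376

# Assumed 99.5% objective, for illustration only.
print(round(error_rate_for_burn(fast, 0.995), 4))
print(round(error_rate_for_burn(slow, 0.995), 4))
```

Under these assumptions, the fast-burn condition corresponds to an error rate of roughly 6.7% and the slow-burn condition to roughly 2.7%.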
Fast-burn alert rule:
Over short time scales, Grafana sends alerts when the burn rate is very high. This alerts you for very serious conditions, such as outages or hardware failures.
Slow-burn alert rule:
Over longer time scales, Grafana alerts when the burn rate is lower. This alerts you to ongoing issues that require attention.
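A common way to evaluate multi-window, multi-burn rate rules is to require the burn rate to exceed a threshold over both a long and a short window, so that an alert stops firing soon after the problem is fixed. The sketch below illustrates that idea only. The thresholds are derived from the 2% and 0.8% per hour figures above with an assumed 28-day window, and the rule names and structure are assumptions rather than the rules Grafana actually generates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    threshold: float  # burn rate that both windows must exceed


def should_alert(rule: BurnRule, long_window_burn: float,
                 short_window_burn: float) -> bool:
    """Fire only if the burn is high over the long AND the short window."""
    return (long_window_burn > rule.threshold
            and short_window_burn > rule.threshold)


# Assumed thresholds: 2%/hour and 0.8%/hour of a 28-day (672-hour) budget.
FAST = BurnRule(name="fast-burn", threshold=0.02 * 672)
SLOW = BurnRule(name="slow-burn", threshold=0.008 * 672)

# Ongoing outage: both windows show a high burn rate.
print(should_alert(FAST, long_window_burn=20.0, short_window_burn=18.0))
# Outage already resolved: the short window has recovered.
print(should_alert(FAST, long_window_burn=20.0, short_window_burn=1.0))
```

Checking the short window as well keeps a resolved incident from continuing to page while the long window still reflects the earlier errors.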
Design your SLOs
When designing your SLOs, think about:
What level of service do your customers expect?
The best place to start is with the level of service your customers expect. Sometimes these are written into formal service level agreements with customers, and sometimes they are implicit in customers’ behavior. How quickly do your customers expect responses to their queries? Is it acceptable for them to retry a query if it fails? Do they care specifically about response accuracy?
Which metrics best reflect user behavior and service delivery?
Don’t use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it’ll make it hard to pay attention to the ones that matter.
Which values of these metrics signify good versus bad?
Allow for an error budget (a rate at which the SLO can be missed) to track over time.
How will you react if you can’t meet the expected service target?
Iterate and evaluate over time whether your targets are being met. If you have a lot of error budget left over, you can adjust your objectives accordingly.
SLO usage and billing
SLOs create new data points every 60 seconds, or 1 Data Point per Minute (DPM) of Prometheus metrics. Each SLO creates 10-12 Prometheus recording rules, where each recording rule will create one or more series depending on the provided grouping labels. If the output of your SLI query has very high cardinality, an SLO will create many new series.
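For a back-of-the-envelope estimate of the series an SLO might add, the sketch below multiplies the number of recording rules by the number of distinct grouping-label combinations. This assumes every rule emits exactly one series per combination, which is a simplification; real counts depend on your SLI query and labels, and the function names are hypothetical.

```python
def estimate_series(recording_rules: int, label_combinations: int) -> int:
    """Rough series count, assuming one series per rule per combination."""
    return recording_rules * label_combinations


def estimate_dpm(series: int, dpm_per_series: int = 1) -> int:
    """Data points per minute at 1 DPM per series (one point every 60s)."""
    return series * dpm_per_series


# Hypothetical SLO grouped by labels that yield 50 distinct combinations.
low = estimate_series(recording_rules=10, label_combinations=50)
high = estimate_series(recording_rules=12, label_combinations=50)
print(low, high)  # 500 600
print(estimate_dpm(low), estimate_dpm(high))  # 500 600
```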
For more information about metrics and how to manage DPM, refer to the Grafana Cloud metrics optimization docs.