Grafana SLO: Easily predict the likelihood that you'll hit your target
Service-level objectives (SLOs) can be a great way to ensure you’re hitting your goals, but many software teams struggle to set realistic targets when they first set up the service-level indicators (SLIs) that underpin those efforts.
Sometimes management has a decree that all services will operate with “three 9s” of availability; other times engineers pick a number out of thin air. But you shouldn’t be selecting arbitrary figures or suggesting unrealistic expectations based on what others are doing. The whole point is to converge on a set of targets that serve as a quantifiable representation of reliable services and satisfied customers—customers who aren’t churning and leaving your service for your competitors'.
The classic guidance is to pick something and then iterate, but we think we (and you!) can do better. By leaning on the Grafana Machine Learning team’s statistical analysis experience and by analyzing your historical data, we can now give you a distilled view in Grafana SLO that helps identify the risk associated with a given target. Grafana SLO is our application that makes it easy to create, manage, and scale SLOs, SLO dashboards, and error budget alerts in Grafana Cloud.
How we solve the guessing game
Given some statistical know-how and a set of historical data, we can estimate the risk associated with a given target.
Your boss may want you to achieve three 9s, but it’s hard to assess the probability that you’d hit that target or what the impact of an SLA breach might be, financial or otherwise. To quantify that risk, the statistical predictions feature (currently in beta) in Grafana SLO takes 90 days of historical data, queried from the metrics you specify when creating the SLO, and runs simulations on that data. The result is a histogram showing how many simulations achieved each level of performance over a given month.
Note: We went with 90 days because it offers a nice balance of providing enough data to run simulations on and learn from while still being able to respond fairly quickly. While more data would be better, it would also slow down the predictions.
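To make the simulation step more concrete, here is a minimal sketch in Python. It is not the Grafana SLO implementation, which this post does not spell out. It assumes a simple bootstrap: each simulated month is built by resampling days, with replacement, from the 90-day history. The function names, the (bad events, total events) data shape, and the sample numbers are all made up for illustration.

```python
import random


def simulate_monthly_error_ratios(history, n_sims=2000, days=30, seed=0):
    """Bootstrap simulated months from daily (bad_events, total_events) pairs.

    Assumption: days are resampled independently with replacement. The real
    feature may use a different simulation method.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_sims):
        month = rng.choices(history, k=days)  # sample days with replacement
        bad = sum(b for b, _ in month)
        total = sum(t for _, t in month)
        ratios.append(bad / total if total else 0.0)
    return ratios


def histogram(values, bin_width):
    """Count how many simulated months fall into each error-ratio bin."""
    counts = {}
    for v in values:
        bin_index = int(v // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical 90-day history: 85 quiet days and 5 bad days.
history = [(50, 100_000)] * 85 + [(1_000, 100_000)] * 5
simulated = simulate_monthly_error_ratios(history)

for bin_index, count in histogram(simulated, bin_width=0.0005).items():
    print(f"error ratio >= {bin_index * 0.0005:.4f}: {count} simulated months")
```

Each histogram bin answers the question "in how many simulated months did the service land at this error ratio?", which is the shape of data the feature visualizes.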
We calculate the risk of breaching your error budget by fitting a probability distribution to the simulation results. To begin with, we are using a Weibull distribution, which fits our data fairly well and is traditionally used in failure analysis.
Once the distribution is fitted, we can use its cumulative distribution function to easily calculate the probability of meeting a target and display a meaningful visualization to the user. As this feature matures, we will continue to iterate on the models used to provide the most accurate predictions possible.
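As an illustration of the fitting step, the sketch below fits a two-parameter Weibull distribution to a list of simulated monthly error ratios and reads the probability of meeting a target off its cumulative distribution function, F(x) = 1 - exp(-(x / scale)^shape). Two assumptions are ours, not the post's: the fitted variable is the monthly error ratio, and the fit uses median-rank regression (a common textbook method). The fitting procedure Grafana SLO actually uses is not described here, and all function names are hypothetical.

```python
import math


def fit_weibull(samples):
    """Estimate (shape, scale) by median-rank regression.

    Linearizes the CDF: ln(-ln(1 - F)) = shape * ln(x) - shape * ln(scale),
    then fits a straight line by least squares.
    """
    xs = sorted(s for s in samples if s > 0)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        f = (i - 0.3) / (n + 0.4)  # Bernard's median-rank approximation
        points.append((math.log(x), math.log(-math.log(1.0 - f))))
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in points)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in points)
    shape = sxy / sxx
    intercept = mean_v - shape * mean_u
    scale = math.exp(-intercept / shape)
    return shape, scale


def weibull_cdf(x, shape, scale):
    """P(X <= x) for a Weibull distribution with the given parameters."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def probability_of_meeting(target, shape, scale):
    """Probability that the monthly error ratio stays within the error budget."""
    error_budget = 1.0 - target
    return weibull_cdf(error_budget, shape, scale)
```

For example, passing the output of a simulation to `fit_weibull` and then calling `probability_of_meeting(0.999, shape, scale)` gives the estimated chance of achieving three 9s in a month under the fitted model.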
How to check your chances of meeting your objective
To get started, simply go to Grafana SLO and click on Manage SLOs > Edit SLO. Next, go to Set target and error budget, where you’ll see a graph that allows you to drag the vertical line along the x-axis with your cursor or click to drop the cursor on a given point along the x-axis. As you move your cursor, the text above the graph will update to show the probability of meeting that target.
From this view, you can eyeball where your desired performance might lie on the histogram and get a feel for whether your historical data supports your target. But the real goal is to empower service owners to make better decisions with the data—and we believe that quantifying the risk associated with a given target is a powerful way to think about it.
When you’re operating as a startup and rapidly trying to enter an emerging market, you might be OK with a 50% chance of missing a target service level. But as your product matures, you might assume a more risk-averse target and only accept a 5% chance of breaching an SLO each month.
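The same reasoning can be run in reverse: start from the breach risk you are willing to accept and find the tightest target the data supports. The sketch below does this directly on a list of simulated monthly error ratios using an empirical quantile, rather than a fitted distribution. The function names and the evenly spaced sample data are assumptions for illustration only.

```python
import math


def breach_probability(simulated_error_ratios, error_budget):
    """Fraction of simulated months whose error ratio exceeds the budget."""
    breaches = sum(1 for r in simulated_error_ratios if r > error_budget)
    return breaches / len(simulated_error_ratios)


def budget_for_risk(simulated_error_ratios, max_breach_risk):
    """Smallest error budget whose breach probability is <= max_breach_risk."""
    ordered = sorted(simulated_error_ratios)
    n = len(ordered)
    # Number of simulations that must fit inside the budget. The small
    # epsilon guards against floating-point rounding pushing ceil() up by one.
    needed = math.ceil((1.0 - max_breach_risk) * n - 1e-9)
    index = min(max(needed - 1, 0), n - 1)
    return ordered[index]


# Hypothetical simulation output: 100 months with error ratios 0.01% to 1%.
simulated = [i / 10_000 for i in range(1, 101)]

for risk in (0.50, 0.05):
    budget = budget_for_risk(simulated, risk)
    print(f"breach risk <= {risk:.0%}: error budget {budget:.4%}, target {1 - budget:.2%}")
```

A startup comfortable with a 50% risk ends up with a much tighter target than a mature product that only tolerates 5%, even though both are looking at the same historical data.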
Perhaps you have certain high-value customers who require a more reliable service. Being able to assess the risk associated with a given service-level objective allows you to make more informed business decisions in your SLA contracts, or to determine when special customers need to be split out of multi-tenant infrastructure so you can provide a guaranteed level of service.
How to get the most out of your SLOs
SLOs can be incredibly powerful, but they’re still a relatively new endeavor for most teams. Now that we’ve shown you how to use this feature, let’s briefly discuss some additional best practices to keep in mind as you build out your goals and how Grafana SLO can help you get there.
Empowering your teams with the knowledge and confidence to believe they can actually achieve their targets is a noble endeavor, but it isn’t the end goal if it doesn’t align with your customers’ expectations. Customer satisfaction is the real target of any SLO: you want a proxy measurement that correlates with customer satisfaction.
ML can help you pick realistic targets on Day One, but you still need to iterate and bring user surveys and information from your customer support teams into the mix. It’s a chicken-and-egg kind of problem; most customers can’t tell you how many 9s they’ll be satisfied with, and most greenfield products can’t guarantee a minimum level of service. We can help you pick a realistic target, based on historical data, but you still have to iterate with your users and validate that your targets yield satisfied customers.
If your achievable target isn’t good enough and your customers are leaving for other offerings in the market, then you should raise your target and prioritize stability work to keep them from churning. If your target is too high, you might be spending too much time on reliability and missing the opportunity to operate at a higher velocity with continuous deployment pipelines and canary deployment models. High-velocity teams make for satisfied engineers, and that impacts your bottom line, too. It’s all about having the data to balance the tradeoffs for your business.
Another hidden pitfall of high reliability systems is that most customers will begin to assume that the performance they are getting is the standard, even if it’s better than what they’ve actually been promised. There are published studies from Google about the benefits of intentionally degrading your service level to ensure that your customers don’t come to expect near-perfect performance.
You won’t often hear praise from customers when your service performs better than they need, but you will hear about it when things go wrong. If customer support escalations aren’t clearly reflected by error budget burn in your existing SLOs, you need to review whether you’re measuring the right things. If you’ve been over-delivering, you run the risk of angering your users with an outage, even if you’re within your error budget. You might schedule some end-of-month downtime before your users grow too accustomed to your too-high availability.
SLOs should impact software planning and inform future priorities. But you need to make sure they represent your customers’ needs.
What could come next for Grafana SLO
We’re excited about this new functionality, but we think this can be taken further to help users select an SLO target that produces an acceptable volume of alerts.
When using multidimensional SLIs (for example, SLIs that “group by” cluster), the alerts fire for individual dimensions, not for the combined SLO that is used for the simulation and target selection. Our current simulations don’t account for cases where one cluster consistently underperforms the others, but this is something we’d like to look into in the future.
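A small worked example shows why this gap matters. In the sketch below, the combined error ratio across three clusters sits comfortably inside a 99.9% target, yet one low-traffic cluster breaches it on its own and would trigger a per-dimension alert. The cluster names, event counts, and function name are invented for illustration.

```python
def evaluate_by_dimension(per_cluster, target):
    """Compare the combined error ratio with each cluster's own error ratio.

    per_cluster maps a cluster name to (bad_events, total_events).
    Returns (combined_ratio, sorted list of clusters exceeding the budget).
    """
    error_budget = 1.0 - target
    total_bad = sum(bad for bad, _ in per_cluster.values())
    total_events = sum(total for _, total in per_cluster.values())
    combined_ratio = total_bad / total_events if total_events else 0.0
    breaching = sorted(
        name
        for name, (bad, total) in per_cluster.items()
        if total and bad / total > error_budget
    )
    return combined_ratio, breaching


# Hypothetical monthly counts per cluster.
per_cluster = {
    "cluster-a": (10, 100_000),
    "cluster-b": (20, 100_000),
    "cluster-c": (150, 50_000),  # low traffic, high error ratio
}

combined_ratio, breaching = evaluate_by_dimension(per_cluster, target=0.999)
print(f"combined error ratio: {combined_ratio:.5f}")
print(f"clusters over budget: {breaching}")
```

Because the busy clusters dominate the combined ratio, a target chosen from the combined simulation can look safe while still producing alerts for the weaker dimension.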
And that’s not all! Accurately modeling the risk of breaching an SLO allows us to rethink how we use SLOs. For example, after an incident it is common to want to know if you are now at risk of exhausting your error budget or not. Using this simulation technique we can quantitatively answer this question and allow you to quickly pivot to prioritizing more reliability work if necessary, or perhaps delay risky features until the next period. There are also use cases for improving how we alert on SLOs to reduce noise and only alert when there is significant risk of error budget exhaustion.
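As a sketch of the post-incident question, the code below takes the events already recorded in the current period, simulates the remaining days by resampling from history, and reports the fraction of simulations that end the period over budget. This is an assumed bootstrap approach for illustration, not a description of how Grafana SLO does or will do it. The function name, parameters, and sample numbers are hypothetical.

```python
import random


def risk_of_exhausting_budget(history, bad_so_far, total_so_far, days_left,
                              target, n_sims=5000, seed=0):
    """Estimate the probability of ending the period over the error budget.

    history holds daily (bad_events, total_events) pairs. The remaining days
    are simulated by resampling history with replacement (an assumption).
    """
    rng = random.Random(seed)
    error_budget = 1.0 - target
    breaches = 0
    for _ in range(n_sims):
        rest = rng.choices(history, k=days_left)
        bad = bad_so_far + sum(b for b, _ in rest)
        total = total_so_far + sum(t for _, t in rest)
        if total and bad / total > error_budget:
            breaches += 1
    return breaches / n_sims


# Hypothetical history: 90 steady days at a 0.05% error ratio.
history = [(50, 100_000)] * 90

# Ten days into a 30-day period, with and without a large incident.
quiet = risk_of_exhausting_budget(history, 500, 1_000_000, 20, target=0.999)
incident = risk_of_exhausting_budget(history, 5_000, 1_000_000, 20, target=0.999)
print(f"risk without incident: {quiet:.0%}")
print(f"risk after incident:   {incident:.0%}")
```

A high value after an incident is a signal to prioritize reliability work or hold back risky changes until the next period, while a low value suggests the budget can absorb what happened.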