Service level objectives: How SLOs have changed the business of observability
Forget the latest tech gadgets and the newest products. One of the most talked-about trends in observability right now?
“SLOs have really become a buzzword, and everyone wants them,” said Grafana Labs principal software engineer Björn “Beorn” Rabenstein on a recent episode of “Grafana’s Big Tent,” our new podcast about people, community, tech, and tools around observability. “I see that we have conferences by that name and everything, so it’s really popular.”
So much so that we dedicated an entire episode of “Grafana’s Big Tent” to “Why SLOs are MVPs in observability.” For this deep-dive discussion, Parca software engineer Matthias Loibl, who co-created the open source SLO software Pyrra with Nadine Vehling, and Grafana Labs’ Rabenstein joined our hosts Mat Ryer and Tom Wilkie, who credits SLOs with changing the business of Grafana Labs.
“I would admit that I was relatively naive about the whole error budget SLO a few years ago. Then Björn joined and showed me the way,” Wilkie said. “It’s just been so transformational for the culture of how we measure our performance at Grafana Labs. And I’m just so excited to see projects like Pyrra to make this available to more people.”
As long as the new adopters are equally eager but reasonable with their expectations. “You won’t be a hundred percent perfect. And that applies here as well,” said Loibl.
“It’s fairly easy to convince the organization to adopt [SLOs], but then they think it’s a magic wand and will fix everything,” added Rabenstein. “The alerts are suddenly two orders of magnitude less noisy. You can talk about so many things in a more meaningful way. Your resource planning might be more informed. I would really see it as a tool that will give you a lot of earlier returns, but then you have to iterate on it.”
Note: This transcript has been edited for length and clarity.
Service level objectives (SLOs): The basics
Tom Wilkie: What are SLOs? What does it stand for? Why are these things important?
Björn “Beorn” Rabenstein: SLO stands for service level objective, which you could claim explains everything. But very generally, this is where you have an idea of which kind of quality of service you will provide. That’s often linked to uptime, but as we will see, in modern systems uptime is not as easy to define; it goes into error rates, error budget.
So it’s the idea that you are not just trying your best to run your service and make it available always. That’s actually an impossibility. You start to talk about what’s your service level? How much of your service are you able to provide, on average, in a certain timeframe?
Matthias Loibl: The “certain timeframe” aspect is really important to me, because every other alerting style out there kind of looks at, “Hey, what’s the error rate in the past 5 or 10 minutes?” and then if it’s above a certain threshold, it will alert. But with service level objectives you kind of look at the broader scheme of “What is my service going to be like?” or “What is it supposed to be like over four weeks?” Then you start working towards that objective.
Tom: So what’s an example of an SLO? How would you phrase an SLO?
Matthias: The example I like to use always is a simple one: You have a website that has a landing page, and you want that landing page to be available always. That’s the goal. It’s not a real-world goal, so you need to define how many errors you can have when serving that landing page. Maybe 1% of the times users go to your landing page, it’s fine to show a 500 error, and it’s still good enough for 99% of the other people.
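The landing-page example can be made concrete with a little arithmetic. The sketch below is illustrative only: the request counts are invented, and the function name `availability` is hypothetical rather than part of any tool mentioned in the episode.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


OBJECTIVE = 0.99  # the 99% target from the landing-page example

# Invented numbers for one four-week window.
total = 200_000
failed = 1_500  # responses that returned a 500 error

sli = availability(total, failed)
# Number of failed requests the objective tolerates in this window.
allowed_failures = round(total * (1 - OBJECTIVE))

print(f"SLI: {sli:.4%}, objective met: {sli >= OBJECTIVE}")
print(f"Failures allowed: {allowed_failures}, observed: {failed}")
```

With these made-up numbers, 1,500 failures out of 200,000 requests stays within the 2,000 failures that a 99% objective allows.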
Mat Ryer: So if a non-technical CEO comes in and says “No, I don’t want there to ever be any errors,” what would you say to somebody in that position?
Matthias: Yeah, good luck with that … I think there are so many variables at play. Obviously, technical people know about DNS being the problem always, but there are so many other things. And then there are aspects in the request-and-response type of systems we have, that you don’t even control. We might not even be able to manage it. Maybe there’s an error on the client-side as well, and you have to take this into account as well. So there will be errors, and it’s just how many are acceptable.
Tom: One of the things I’ve noticed in this industry is there’s a big push for related topics, like continuous delivery. “We should be deploying new versions of our software daily. Hourly.” I’ve found SLOs are a great way of trading off the ability to move quickly — to move fast and break things, the famous Facebook saying — versus the ability to provide a reliable service.
So once you admit, “Okay, 0.1% of requests are allowed to fail in a month,” suddenly that’s the other side of the, “Do I release this feature today with the risk that it might break things?” You can measure both of these; you can measure your release philosophy, and you can measure your SLO performance. If you’re doing worse in your SLO, maybe slow down releasing.
Beorn: SLOs are what Tom just said: They’re a very good way of objectively finding out, “Are we moving too fast and breaking too much? Or are we actually moving too slow, and nothing really happens, but we’re also not innovating fast enough?” Now you have a way of talking between all those different stakeholders to find a good middle ground and how quickly to move.
Request vs. uptime service level objectives
Tom: When I’m selling a piece of software to someone in procurement, and they say, “I want this to be available 99.9% of the time,” and I’m like “I’ll make 99.9% of your requests succeed,” it’s like we’re talking two different languages. How do you convince them that you are trying to give them something that’s more friendly to them?
Beorn: It really depends on the kind of service. There are services that are consumed by machines; there are services that are consumed by humans; and then there are services where you have a contractual obligation to fulfill a certain service level objective, which is then called an SLA (service level agreement). Then there are services where you don’t have that — like it’s a free mail service or something, and you make all your money from serving ads so then you don’t have a contractual obligation to your users, but you still want your users to be happy and not run away.
This is the key to how to design an SLO: You have to look at how the service is consumed, and how the users are perceiving it; either intuitively, or really sometimes it’s legally. The reality is that sometimes what you have in your contract is more important than what your users think. But that, again, depends on the context.
There are situations where an uptime-based SLO is exactly the right thing, there are times where a request-based SLO is the right thing. Or sometimes you have a mix of things or you have to come up with something completely new.
“[SLOs] are a very good way of objectively finding out, ‘Are we moving too fast and breaking too much? Or are we actually moving too slow, and nothing really happens, but we’re also not innovating fast enough?’”
— Björn “Beorn” Rabenstein
Setting up the “right” SLOs
Tom: I’d really like to dive into more detail about how you build these SLOs. What techniques and technology you can use to make this whole process easier. But before we do, how do I know what the right number is? Is it 80%? Is it 90%? Is it 99.999%? How do I know when I’ve got the right SLO?
Matthias: That comes up a lot when people first start out. What I always recommend is if you have the data available for your current system, use the data, measure what the current uptime was over the last month or so, and then use that as the foundation to base your objective on for the coming months. Objectives are never set in stone, so you can always adjust them and refine. That should give you a really nice foundation to continue.
Beorn: I think in the real world, that’s totally valid, what Matthias just said. But to set a contrast here, in the ideal world you would never just look at your last month’s performance and adjust your SLOs. That makes sense in practice, but ideally you do product research and know exactly which service level your product should deliver to have the best outcome for your customers. Then you have an idea of how expensive it would be to make it more reliable, and how much happier your customers would be.
Tom: The way we came up with our query latency SLO for our Grafana Cloud Metrics service was we effectively did what you said, Matthias: We looked at historical performance. We were like “Yeah, that’s kind of what it is. We’ll stick to that.” But then realistically, when we tried to sell it to book our first large six-figure deal on our Grafana Cloud Metrics service, the customer wouldn’t accept that SLO. And that was really the process of improving query performance until they were like, “Yes, it’s good enough. Now we’ll sign the contract.” That was where our current SLO got locked in, got set in stone. It was the first six-figure deal.
Mat: That reminds me of building products in general. Ideally, there’s some of it you’re doing yourself and some guesswork and some assumptions. But the best information you get is from real users, from people that are actually going to end up either buying it or using it.
Error budgets: What are they and why do they matter
Mat: Can we dig into this a little bit? What is an error budget?
Beorn: An error budget is, if you want, an inverted SLO. Or let’s say you have a specific kind of SLO, which is based on success rates. You’ve promised the customer you will serve 99.9% of the requests correctly and in time. Then the inverse of that, the 0.1% you have left, is your error budget. Now you need a billing period, which, if you have an SLA, is nicely formulated in your contract. If you just have users that come and go, because you serve a free product that just makes money with ads, it’s not that formalized. But you might still want to have a billing period, which is often a month.
Then you get into this idea that you burn your error budget. If you have an outage one week into the month, and a certain number of requests have failed, then you know you’ve burned 20% of your error budget, but you are also already 25% into the month, so that’s fine. You burned your error budget at the right rate. Then if you burn it too quickly, you can start to say “Okay, let’s act a bit more cautiously. Let’s not do this risky new feature launch this month.”
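To put numbers on the idea of burning an error budget, here is a minimal sketch. The traffic figures are hypothetical, the function names `budget_consumed` and `burn_rate` are made up for illustration, and the calculation assumes you can estimate the total number of requests for the billing period up front.

```python
def budget_consumed(objective: float, expected_requests: int, failed_so_far: int) -> float:
    """Fraction of the billing period's error budget already used."""
    if not 0 < objective < 1:
        raise ValueError("objective must be between 0 and 1")
    # Failures the objective tolerates over the whole billing period.
    allowed_failures = (1 - objective) * expected_requests
    return failed_so_far / allowed_failures


def burn_rate(consumed: float, elapsed: float) -> float:
    """Budget used relative to time elapsed; 1.0 means exactly on pace."""
    if not 0 < elapsed <= 1:
        raise ValueError("elapsed must be a fraction in (0, 1]")
    return consumed / elapsed


# Hypothetical numbers: a 99.9% objective, 10 million requests expected
# over a four-week billing period, and 2,000 failures after the first week.
consumed = budget_consumed(0.999, 10_000_000, 2_000)
elapsed = 7 / 28
rate = burn_rate(consumed, elapsed)

print(f"Budget consumed: {consumed:.0%}")
print(f"Period elapsed:  {elapsed:.0%}")
print(f"Burn rate:       {rate:.2f}")
```

With these numbers, 20% of the budget is gone a quarter of the way through the period, so the burn rate is 0.8. A value above 1.0 would mean the budget runs out before the period ends, which is the signal to act more cautiously.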
Tom: So to make this a bit more real — and I will admit to not fully understanding the expressions and the maths behind the error budget alerting that Björn has implemented or that Matthias’s tool implements — but when we started offering this SLA on our Grafana Cloud Metrics service at Grafana Labs, we agreed 99.5% of requests complete within a couple of seconds. So we built an alert that said, “In a 5-minute moving window, if more than 0.5% of requests are slower than a couple of seconds, page us.” That seems like the obvious thing to do, right?
So we built that alert, and it fired — not all the time, but multiple times a week, multiple times a day sometimes. We would scramble to scale the service up, to diagnose whatever issue it was, and generally put a lot of effort into optimizing that. And yet, at the end of every month, when we went and ran a report to say what was our 99.5th percentile latency over the last month, it always came back 200 milliseconds. I’m like, “How can these two things be true?” Like, we are way below the SLO we agreed with the customer, we’re well inside our SLA, and yet we’re getting paged multiple times a day.
“The adoption of SLOs and the high-quality alerts that come with them have been one of the more profound things that have come out of this work.”
— Tom Wilkie
Beorn: Yeah, that’s the core of the issue. If your billing period is five minutes, then your 5-minute sliding window is precisely right. This is when you promise the customer, “In every 5-minute window, we’ll never have more than 0.5% errors.” But if you have a billing period of a month, which for the systems we have and usually for the requirements we have makes much more sense, you can say it’s an average over a month. That allows me to have a 5-minute or 10-minute blip, or even a complete outage, which is rare with our systems but could happen, and that’s still okay if the rest of the month is totally fine. But that, again, depends on your product, on your users, and on your contracts.
But that’s the important thing: having an SLO that isn’t noisy. “Every 5-minute window has less than 0.5% errors” is usually very expensive to achieve, and it’s also not what users require in most cases. And that’s where you get into averaging, and that’s where the alerting also gets more complicated in order to not make it noisy.
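The gap between a short-window alert and a monthly objective can be shown with a small, deterministic simulation. Everything here is an assumption made for illustration: the traffic volume, the number and size of the spikes, and the use of fixed, non-overlapping 5-minute windows instead of a true sliding window.

```python
THRESHOLD = 0.005                 # at most 0.5% of requests may be slow
WINDOWS_PER_MONTH = 30 * 24 * 12  # 5-minute windows in a 30-day month
REQUESTS_PER_WINDOW = 1_000       # assumed steady traffic

# Slow requests per window: normally 1 (0.1%), with 40 short spikes of 50 (5%).
slow_per_window = [1] * WINDOWS_PER_MONTH
for i in range(40):
    slow_per_window[i * 200] = 50

# A naive alert pages whenever a single window exceeds the threshold.
pages = sum(
    1 for slow in slow_per_window
    if slow / REQUESTS_PER_WINDOW > THRESHOLD
)

# The monthly objective only cares about the average over the whole period.
monthly_slow_fraction = sum(slow_per_window) / (
    WINDOWS_PER_MONTH * REQUESTS_PER_WINDOW
)

print(f"Pages from the 5-minute alert: {pages}")
print(f"Slow requests over the month:  {monthly_slow_fraction:.3%}")
print(f"Monthly objective met:         {monthly_slow_fraction <= THRESHOLD}")
```

In this made-up month the naive alert pages 40 times, yet only about 0.12% of requests are slow overall, roughly a quarter of the 0.5% budget.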
Tom: It was night and day when Björn put this in place for our metrics service. We went from paging every few days, sometimes multiple times a day, to only paging every few weeks for this particular alert. And when it did page, there was an actual issue that we could solve.
Every other team [at Grafana Labs] very quickly was like, “Oh, we want that,” and they copy-pasted the alert rules that Björn built and implemented them for their service. And we saw a huge reduction in our on-call load.
Mat: I was gonna ask, how should teams do this if you don’t have a Björn to come in and just do this for you?
Tom: Does everyone not have a Björn?
Mat: I don’t think everyone has a Björn yet. There have been scaling issues in the rollout. We’ve not been able to roll out enough Björns. So unfortunately no, not everyone has one. So how do they find this out?
Tom: If everyone doesn’t have a Björn, and we haven’t perfected human cloning yet, I think that’s a good segue for Matthias’s project, Pyrra. Matthias, can I use Pyrra to do this for me?
Matthias: That’s exactly what we’ve built Pyrra for. We started out looking at who are the personas that use SLOs, and we identified that most often it’s SREs, no big surprise. Then we actually started interviewing a handful of people and had a long conversation with them, trying to figure out, “Have you used SLOs? If not, why haven’t you used them?” and so on and so forth. So we tried to really figure out what the barriers to entry are to using SLOs. Out of that came the Pyrra project, which is now able to give you a high-level configuration file where you can put in your objective and your time window. Then you add an indicator, or service level indicator (SLI), that uses a Prometheus metric, and Pyrra does all the PromQL queries for you and creates the multiple burn rate alerts for you. So all of that is taken care of, and you don’t really have to think about anything else but the high-level things we were talking about, like “What is my objective? What is the time window we’re talking about?”
It is a custom resource definition or CRD-based configuration, but it also works outside of Kubernetes. Every SLO is a configuration file; once you’ve written that one, it gets loaded into either a Kubernetes controller or just a file system runtime that reads this configuration and generates the output for Prometheus to then read and do all the heavy lifting of alerting and ingesting … You can just use your normal workflow, and it just ties into your existing Prometheus instances.
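For readers wondering what a burn rate alert checks, here is a rough sketch of the multiwindow, multi-burn-rate idea described in Google's SRE Workbook. It is not Pyrra's generated output: real tools express this as Prometheus rules, and the window pairs and factors below (14.4 over 1h/5m, 6 over 6h/30m, for a 30-day period) are the workbook's commonly cited values, used here as an assumption. The names `BurnRateRule` and `should_page` are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # confirms the problem is significant
    short_window: str  # confirms the problem is still happening
    factor: float      # burn rate that both windows must exceed


# Assumed values for a 30-day period (see lead-in).
RULES = [
    BurnRateRule("1h", "5m", 14.4),
    BurnRateRule("6h", "30m", 6.0),
]


def should_page(objective: float, error_ratios: Mapping[str, float]) -> bool:
    """Page only if a long window AND its short window both burn too fast."""
    budget = 1 - objective  # allowed error ratio
    for rule in RULES:
        long_burn = error_ratios[rule.long_window] / budget
        short_burn = error_ratios[rule.short_window] / budget
        if long_burn > rule.factor and short_burn > rule.factor:
            return True
    return False


# Hypothetical error ratios per lookback window for a 99.5% objective.
brief_spike = {"5m": 0.10, "30m": 0.02, "1h": 0.01, "6h": 0.003}
sustained = {"5m": 0.10, "30m": 0.09, "1h": 0.08, "6h": 0.02}

print("Brief spike pages:", should_page(0.995, brief_spike))
print("Sustained issue pages:", should_page(0.995, sustained))
```

The brief spike looks alarming over five minutes but barely dents the hourly error ratio, so it does not page. The sustained problem exceeds the factor in both the short and long window, so it does.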
Tom: I’ve noticed that once you start giving people an interface where they can see performance of their SLOs, then within the organization, teams want to be in that interface. They want to see themselves in there.
Using Björn’s rules that he built, we started generating dashboards and uploading them, and then we started emailing a PDF of those dashboards to everyone in the Cloud team every Monday morning. Over time, I’ve found teams submitted PRs to our config repo to add themselves to that dashboard. Now there are like 30-40 SLOs defined, and it’s become a service directory of all the services inside Grafana Cloud and their performance. And whenever this email goes out on Monday, and SLO performance isn’t green, I get an email from the team asking why. And it’s great; I never asked for this, but they feel that responsibility. And culturally, SLOs internally have been a great driver of “Oh, we should all report our SLO performance in the same way. We should all have it visible in the same way.”
I think these cultural changes, the adoption of SLOs and the high-quality alerts that come with them, have actually probably been one of the more profound things that have come out of this piece of work. I think something like Pyrra can drive that in other organizations. This is not a question; this is just a statement at this point. It’s just something I’m really excited about.