Sometimes the simplest questions prompt the most spirited discussion. Questions like: What is the airspeed velocity of an unladen swallow? What should we have for dinner tonight? Or, as we find out in this episode of “Grafana’s Big Tent,” what even is observability?
In this episode of “Grafana’s Big Tent” dedicated to discussing — and debating — the idea of observability, hosts Mat Ryer, Matt Toback, and Tom Wilkie talk about traditional and non-traditional definitions of observability, evaluate observability methodologies, and share best practices for monitoring and alerting on SLOs.
Oh, and they threw in a quick cooking lesson on the mother sauces for good measure.
Note: This transcript has been edited for length and clarity.
What is observability?
Tom Wilkie: So what can we say about observability that hasn’t already been said? I think a lot of people in the modern observability market will talk about metrics, logs, and traces, but it is definitely not metrics, logs and traces. I don’t think observability is any one technology or tool.
Matt Toback: Is it different than what came before it? Is observability as a new term different than monitoring?
Tom: I think so, because monitoring was almost always about time series, about metrics, about numbers. And observability is not necessarily about that. I think monitoring is part of it.
Mat Ryer: But you still want to find out what’s going on in your system, right? When you have a simple program, it’s easy, because it’s not doing that much, and it is quite obvious to see what’s going on. But as systems get bigger, and they’re used by more people and they’re more complicated, it creates these new problems where we’re suddenly hidden a little bit sometimes from what’s really happening. So observability has to be something around shedding some light on what’s going on inside as well.
Tom: Well, there’s the systems definition of what observability is, which I don’t particularly find helpful. It’s something along the lines of like the ability to infer the internal state of a system from its external outputs. I think that maybe is a useful definition of what makes a system observable, but doesn’t necessarily describe what observability itself is.
Matt: I think I really struggle with the idea, too, of when do you know that it’s happened? When has observability occurred?
Tom: The phrase I have been using now for 3-4 years to describe what I do — and I think some of what I do is observability — but it’s helping people understand the behavior of their applications and infrastructure. So my application does things, often things I didn’t expect it to do, and I would like to understand what those things were and why they happened. And I’m going to need tools, technologies, practices, teams, people to help me do that. For me, that’s what observability is.
Matt: To me, it’s the ease with which you can bounce between these different things. You say metrics, logs and traces; I guess I just look at it as different granularity of data, or different data presented in different ways, that’s all kind of alongside each other.
Tom: I think there’s a cooking analogy here. Maybe metrics, logs, and traces are the mother sauces.
Mat: What’s a mother sauce?
Matt: There’s four mother sauces, is that right?
Tom: I thought there were five…
Matt: Is there five?
Tom: Quick, Google “mother sauces”!
[Editor’s note: There are, in fact, five mother sauces: béchamel, velouté, espagnole, hollandaise, and tomato.]
Four golden signals of observability and USE vs. RED
Tom: At Google, they had this thing called the Four Golden Signals, which is very much like for every microservice you should monitor four things. You should monitor the request rate, the error rate, the latency (the distributions of times it takes to process requests), and the saturation of that service.
I spent a couple of years at Google, and this was part of the stuff they teach you. Then when I left, I just forgot about saturation. I just forgot. So we coined this phrase as a kind of play on words against the USE method (utilization, saturation, errors) to be the RED method: rate, errors, and duration. For every microservice, make sure you export these three metrics.
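[Editor’s note: As a rough illustration of the RED method, here is a minimal Python sketch that reduces one service’s requests in one time window to rate, errors, and duration. The `Request` type, the `red_summary` function, and the sample data are hypothetical names invented for this note, not anything from the episode. In practice these numbers would usually come from a metrics system rather than from a raw list of requests.]

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request ended in an error


def percentile(sorted_values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceiling of p% of n
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize one service's requests over one window as RED metrics."""
    if not requests:
        return None  # nothing observed in this window
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_seconds,  # R: request rate
        "error_ratio": errors / len(requests),         # E: share of failed requests
        "p50_ms": percentile(durations, 50),           # D: typical duration
        "p99_ms": percentile(durations, 99),           # D: long-tail duration
    }


# Hypothetical sample: 100 requests in a 10-second window, 5 of them failed.
sample = [Request(duration_ms=float(i), ok=(i > 5)) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=10)
```

Duration is reported as percentiles rather than a single average because, as Tom notes, latency is a distribution.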
There’s a particular style of dashboard where you plot request rate and error rate on one graph, and latency on another graph, and then you do a breadth-first traversal of the microservice architecture to lay out each row. One of the things I love is all of the services within the company, all of the dashboards look like that. So I can dive into any service and start to get an idea for both its architecture and where the errors are being introduced, where the latency is being introduced. And then, of course, someone reminded me about saturation and I’m like, “It doesn’t fit into this model.”
“We’re still learning what the best methodologies are and how best to monitor things. Of course, as [these systems] get more mature, this also gets more automated.”
— Tom Wilkie
Mat: Well, that’s an interesting point then. So if you have a common method that you use all the time, you can then present that data in a common way, and there’s obviously advantages to that. But the systems themselves can be so different; and as you said, it doesn’t always fit. It depends.
Tom: I mean, a great example — the RED method would be absolutely useless with something that’s like an enterprise message bus style architecture. Because what is the thing whose requests and errors you’re measuring? And what does it mean to have a duration of a message? You need different philosophies for different systems, I feel.
Mat: Does this make it bespoke for every project then, essentially?
Tom: I hope not! Otherwise, this gets really hard. I hope there are some common methodologies that can be applied to multiple systems, that maybe share a common architecture. But yeah, I think we’re still learning what the best methodologies are and how best to monitor things. Of course, as this gets more mature, this also gets more automated.
Monitoring on SLOs and the problem with traces
Tom: There is a school of thought that says the only things you should alert on are your SLAs or SLOs. Because an SLA is actually an agreement to hit an SLO, with penalties if you don’t. So really, the thing we care about as engineers is what’s the SLO; what is our objective, what are we gonna measure, how is that a proxy for the user’s experience, and how do we enforce that? There is a big school of thought that says that should be the only thing you alert on. And my practical addition here would always be: that and disk space. You should always monitor disk space, because it’s really easy to monitor, and the consequences of filling your disks up are really bad, so why would you not?
Matt: I want to talk about traces though, because I feel like two years ago you sat in Stockholm and you were like, “Traces are this, traces are that, traces are amazing. Nobody uses them. And if you don’t do it all, it’s kind of useless.” So plus two years, are we anywhere better, or do we just kind of nudge each other and be like, “Yeah, yeah, traces…”
Tom: This is a really interesting point, Matt, because as I said earlier, the great thing about observability is the switching cost is super low, and that means you can innovate loads. And I think that argument falls down when you start to talk about tracing, because the cost of adopting tracing is still too high, in my opinion.
Matt: Does Copilot do tracing integrations?
Mat: I don’t think so, but you know, I bet if you start writing something – I bet if I started instrumenting my code, Copilot would help me do that.
Tom: But on the point of distributed tracing, I feel like we’re not quite living up to the promise of observability there, at the moment. That being said, the value of tracing is still very, very high. We’ve been on a journey with the systems that we run to get them to be very performant, and people have very high expectations of how quickly their dashboards load, and how quickly their queries succeed. And we would not have been able to achieve the latencies that we’ve achieved without distributed tracing. Because it’s all in the long tail. It is still unfortunately in my opinion too much effort to get these high-quality traces. But when you do, that’s how you control your long tail.
And then once you’ve got high-quality tracing, so many things become possible. You can start to do things like check that your SLOs nest nicely. If you have interdependent systems with different SLOs, you know the dependencies between them, and then you can check that you don’t have a tighter SLO than one of the systems you’re depending on. You can do all of these great things, so it is super valuable. But yeah, it’s not as easy to adopt as I’d like it to be.
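[Editor’s note: A minimal sketch of the “SLOs nest nicely” check, assuming each service has a single success-ratio target and every dependency is a hard one, with no retries or fallbacks that could mask a failure. The service names, targets, and the hand-written dependency map are hypothetical; in the setup Tom describes, the dependencies would be derived from traces.]

```python
def find_slo_violations(slo_targets, depends_on):
    """Return (service, dependency) pairs where the service promises
    a tighter SLO than a system it directly depends on."""
    violations = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if slo_targets[service] > slo_targets[dependency]:
                violations.append((service, dependency))
    return violations


# Hypothetical targets: fraction of requests that must succeed.
slo_targets = {"frontend": 0.999, "query": 0.9995, "storage": 0.999}

# Hypothetical call graph: frontend calls query, query calls storage.
depends_on = {"frontend": ["query"], "query": ["storage"]}

violations = find_slo_violations(slo_targets, depends_on)
# "query" promises 99.95% but relies on "storage", which only promises 99.9%.
```

Checking only direct dependencies is enough to detect a problem somewhere in a chain: if every edge nests correctly, every transitive dependency does too.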
Alerting pitfalls and best practices
Mat: So when you talk about all this data, Grafana Loki making it really affordable to keep all this data around, then you’ve got loads of data. This is another area where AI could definitely start helping us as well – looking at that data for us and trying to give us insights. What other things can be done around that?
“You should always alert on symptoms, not on causes. Except for disks filling up. That’s the exception that proves the rule. You should always alert on symptoms — and get those symptoms as close to the user as possible.”
Tom: Contentious opinion: you should not be staring at dashboards in Grafana. You know, you go into offices and they’ve got big screens on the wall, and more often than not, it’s a Grafana instance on there. I mean, it’s very pretty, it looks good, but it’s a distraction. I don’t want to pay engineers and I don’t want to spend my time staring at a dashboard, trying to figure out if something’s broken, when I could have written a piece of code that will do that for me. And that piece of code is called an alert.
I think you should use alerts. And I think that’s a big, common mistake, especially with things like Grafana dashboards, which are so pretty and so easy to use. It’s so easy to build these great things, and you want to kind of share what you’ve achieved, that you can sometimes over-index on that.
Mat: Can you also over-index on alerts? Can you end up just with too many alerts?
Tom: All the time. A number of my colleagues and friends have pager overload. And they’ve done the right thing, but then they build an alert for absolutely everything.
You should always alert on symptoms, not on causes. Except for disks filling up. That’s the exception that proves the rule. You should always alert on symptoms. Get those symptoms as close to the user as possible. Use SLOs. And then you get to the really interesting space. Use error budgets. So allow a system to fail a certain amount.
Here’s an example I always give here that kind of highlights it. We agreed on an SLO with the customer very early on: 99.9% of writes should succeed within 100 milliseconds. And the system hit that all the time, except for when it didn’t. So we built an alert that said “If 99.9% of writes don’t succeed in less than 100 milliseconds, page me.” We made the window five minutes and we got paged loads. Loads, like probably five or six times a month. At the end of the month, I go and run a month-long query to say how many requests succeeded in less than 100 milliseconds, and the answer was always like eight nines of requests. All the requests, basically, succeeded. So effectively, I was getting paged when I was within my SLO.
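[Editor’s note: The mismatch Tom describes can be reproduced with a little arithmetic. The sketch below assumes a 30-day month, 10,000 writes in every five-minute window, and six bad windows in which 50 writes each miss the 100 ms target. All of those numbers are invented for illustration, and every breaching window is counted as one page.]

```python
SLO_TARGET = 0.999  # 99.9% of writes within 100 ms, as in Tom's example


def window_breaches_slo(good, total, target=SLO_TARGET):
    """True if this single window, taken on its own, falls below the target."""
    return total > 0 and good / total < target


def evaluate(windows, target=SLO_TARGET):
    """windows is a list of (good, total) counts, one per five-minute window.
    Returns (number of pages, success ratio over the whole period)."""
    pages = sum(1 for good, total in windows if window_breaches_slo(good, total, target))
    good_sum = sum(good for good, _ in windows)
    total_sum = sum(total for _, total in windows)
    return pages, good_sum / total_sum


WINDOWS_PER_MONTH = 30 * 24 * 12  # 8,640 five-minute windows in 30 days
healthy = (10_000, 10_000)        # every write was fast enough
bad = (9_950, 10_000)             # 99.5% fast enough: breaches 99.9%

windows = [bad] * 6 + [healthy] * (WINDOWS_PER_MONTH - 6)
pages, monthly_ratio = evaluate(windows)
```

Under these assumptions the rule pages six times, yet only 300 of 86.4 million writes were slow, so the month as a whole is comfortably inside the SLO.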
Björn [Rabenstein], of SoundCloud, Prometheus, and Google fame, explained to me that we needed an error budget. Instead of effectively alerting on breaching your SLO within a small window, you want to alert on breaching your SLO in increasingly larger windows using some kind of multiple. What you’re actually alerting on is the rate at which you’re using your error budget. And I was mind-blown when he explained this. We implemented this in our services, and our pager load and the pager fatigue that went with that just disappeared.
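[Editor’s note: A minimal sketch of alerting on error-budget burn rate over several windows. A burn rate of 1 means errors are arriving exactly fast enough to use up the whole budget over the SLO period. The window lengths and multiples below are illustrative values of the kind suggested in Google’s SRE Workbook, not numbers from this conversation, and the function names and request counts are hypothetical.]

```python
def burn_rate(bad, total, target):
    """How many times faster than 'exactly on budget' errors are arriving."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - target)


# (window length in hours, burn-rate multiple that triggers an alert).
# Shorter windows need a much higher multiple before anyone is paged.
RULES = [(1, 14.4), (6, 6.0), (72, 1.0)]


def firing_rules(counts_by_window, target, rules=RULES):
    """counts_by_window maps window length in hours to (bad, total) counts.
    Returns the window lengths whose burn rate reaches their threshold."""
    fired = []
    for hours, threshold in rules:
        bad, total = counts_by_window[hours]
        if burn_rate(bad, total, target) >= threshold:
            fired.append(hours)
    return fired


TARGET = 0.999

# A brief blip: 50 slow writes, at 120,000 writes per hour.
blip = {1: (50, 120_000), 6: (50, 720_000), 72: (50, 8_640_000)}

# A sustained problem: 2% of the last hour's writes were slow.
sustained = {1: (2_400, 120_000), 6: (2_400, 720_000), 72: (2_400, 8_640_000)}

blip_alerts = firing_rules(blip, TARGET)            # no rule fires
sustained_alerts = firing_rules(sustained, TARGET)  # the 1-hour rule fires
```

The blip would have tripped a five-minute rule, but it barely dents the budget, so nothing fires. The sustained problem burns the budget about 20 times too fast over the last hour, which is what gets someone paged.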