Observability strategies that work — and some that don't
Creating an observability strategy is a lot like playing with Legos: It takes small building blocks to create a bigger picture, but the slightest mistake can throw off an entire build — and often you realize it very late in the process and have to rip apart and repair the Hogwarts castle you spent many days creating.
In the latest episode of “Grafana’s Big Tent” — our new podcast about people, community, tech, and tools around observability — hosts Mat Ryer and Matt Toback talk to an expert who knows a lot about both observability and Legos: Nayana Shetty, principal engineer at the Lego Group.
Listen to our episode “Observability in the wild” to learn why organizations can no longer afford to introduce observability as an afterthought, and to hear some effective (and not-so-effective) strategies that will help set you up for observability success.
Note: This transcript has been edited for length and clarity.
Observability strategy #1: The carrot-and-stick approach
Nayana Shetty: Over the years I have been on teams where they build microservices. And when you scale up and have hundreds of microservices, how do you then make them reliable and keep them reliable? That’s what I’m interested in. I was working at the Financial Times, where we had hundreds of microservices, and now I’ve moved to the Lego Group, where we’re going through a massive digital transformation. We want to build hundreds of microservices, so should we care about reliability now? Or can we think about it in 10 years’ time, when we have those microservices?
Mat Ryer: It used to be a kind of afterthought really, didn’t it? Which is why SRE, I think, is short for “sorry,” right?
Nayana: [laughs] That’s one way of looking at it! Sorry, I don’t understand why people don’t think about site reliability in the first place. Or sorry, I don’t understand why people would build this in such a way that it’s half-broken, or why they didn’t think about the future of the product. You’d be very close to reinventing the wheel every few months if you went in that direction. So yeah, “sorry” is probably one way of looking at it.
Matt Toback: On a personal note, I’m excited that Nayana is here and joining us, because we met for the first time in 2018 in an attic in Amsterdam — which, when said that way, doesn’t feel weird at all, right?
Nayana: We were talking loads about monitoring: Grafana, Graphite, all of those things.
Matt: I do remember! You stood out to me, being up in that breakout room talking about what you were trying to do at the Financial Times. And it does feel like you’ve continued on a natural progression in your journey. When you think back to then, how did you see the observability world?
“Be kind to your future self. How can you make your life easy in the future? Think about that today when you’re building whatever you’re building.”
— Nayana Shetty
Nayana: At that point, we were investigating. We had quite a lot of monitoring tools at the Financial Times, and I was working in the team that provided monitoring as a service to other teams. My head was going mad thinking, how will a team of four or five engineers be able to support these 20 to 30 engineering teams who all want monitoring? And they were using tools from Nagios to Zabbix to some Graphite; I think there were very few installations of Prometheus at that point. How do we get all of these different use cases together, and how do we get them onto a platform that could work together? I was worried at that point, and three or four years later, looking back…
Matt: You’re still worried now? [laughs]
Nayana: I’ve moved on, so I’m less worried about the Financial Times’ monitoring systems! But I still worry about the same [issues] at the Lego Group, where there are different monitoring tools across the organization. How do we get them all together? How do we tell a single story that everyone can understand, rather than every single team trying to solve the same problem? So it’s still very similar.
Mat: Something you said earlier stood out: this idea of “Why did you build it like this? If only you’d built it differently, we’d be in a much better position now.” So it’s kind of like…
Matt: If you only did it right. Is that what you’re saying?
Mat: Yeah, but that’s the question: When should we start caring about observability? When should we start worrying about how we are going to operate this?
Nayana: I think this relates to the journey in my career. I started off as a test engineer, just doing some manual testing, then moved on to do more QA. And over the years, I’ve seen the transition in a lot of organizations where they’ve moved to shift left [on observability], test early, release as small as possible, and iterate continuously.
One of the quotes I have often used is: Be kind to your future self. How can you make your life easy in the future? Think about that today when you’re building whatever you’re building. If you’re building a new product, think about whether you even have to build it. Can you just look at what’s in the market and reuse it? If it’s a non-differentiating thing, why build it? If it is a differentiating thing, yes, put your heart and soul into it. But when you do, make sure you think about the sustainability aspects of your product, and not just what the customer gets today.
I’ve often used this carrot-and-stick approach in teams to show the benefits of thinking about monitoring and observability upfront. The carrot is usually: If you build it in the right way, then you can practically forget about your systems, because they will take care of themselves. And the stick is often: If you don’t do it now, then later you have to go through [everything] that comes with making your systems more observable, and with keeping that sustained once it’s up and running.
Mat: Yeah, you know, I would be kind to my future self, but I’m too busy dealing with all the stuff that my past self has left me to do… That’s the thing: if you think about how it’s going to be run and where it’s going to be running, the earlier the better, isn’t it?
Nayana: Yes, you’re fixing things from yesterday. And if you don’t fix them and leave some goodies along with them, then tomorrow you’re fixing today’s problems. So you’re in that vicious cycle. To get out of that vicious cycle, you sometimes need to step back and put in that extra effort.
Observability strategy #2: Design for failure
Mat: Thinking about [your observability strategy] upfront is a bit like designing for failure. In a perfect world, all the messages flow perfectly through your system and there are no problems. But in reality, it’s way more messy. Things fail, so you should design expecting that things are going to fail. I write Go code, and Go has error handling as an explicit feature: Errors are values, returned as the second return value from functions. That frustrates a lot of people who are used to exceptions or something automatic, but it forces you to think about what’s going to happen if this thing fails. That’s a great discipline to get into.
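To illustrate the point Mat is making, here’s a minimal Go sketch (the `divide` function is a made-up example, not something from the episode) showing how errors come back as an ordinary second return value, so the caller has to handle the failure path explicitly:

```go
package main

import (
	"errors"
	"fmt"
)

// divide returns its result plus an error as the second return value,
// so the caller must decide what happens when the operation fails.
func divide(a, b float64) (float64, error) {
	if b == 0 {
		return 0, errors.New("division by zero")
	}
	return a / b, nil
}

func main() {
	// Happy path: err is nil, so we use the result.
	if result, err := divide(10, 2); err != nil {
		fmt.Println("error:", err)
	} else {
		fmt.Println("result:", result) // prints "result: 5"
	}

	// Failure path: the error is a value we inspect, not an exception.
	if _, err := divide(1, 0); err != nil {
		fmt.Println("error:", err) // prints "error: division by zero"
	}
}
```

Ignoring the error is still possible, but it’s a visible choice in the code (the `_`), which is the discipline Mat describes: the failure case is in front of you at every call site.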
Nayana: I think it’s a myth to think that your system won’t fail. Always build your system in such a way that it will fail. If it doesn’t, then you have a problem. So make sure you add those checks in place. So when it fails, it can smoothly recover.
Mat: I know some companies that have that as part of their proper testing approach. They’ll literally break things on purpose, and it’s a first-class concern for them. And I don’t know, is it just ego that makes people think, “I’m so good, I’ll write this, it’s going to be great”?
Matt: Honestly, it can’t be, right? Like we’ve all known and experienced [failure] enough.
Mat: I don’t know. When I’m writing code and it doesn’t work, it’s shocking how quickly I’m like, “The processor is not working, or physics has changed.” I’ll go to “physics has changed” before “it’s my fault.” But it turns out I just used a capital letter where I shouldn’t have.
Nayana: I’ve been on teams that do pairing and mobbing sessions and such. Those have helped in checking people’s egos, to be like, “I’m not the best.” And when two people talk it through, I think it does help them accept, okay, that is the reality we live in, and [failure] is what you need to consider.
“I think it’s a myth to think that your system won’t fail. Always build your system in such a way that it will fail. If it doesn’t, then you have a problem.”
— Nayana Shetty
Observability strategy #3: Run drills (but not at 3 a.m.)
Mat: We’re trying to do this with the best intentions in the world, but are there any things you see that people misunderstand? Or common mistakes, common “gotchas” that you’ve seen?
Nayana: The thing I’ve seen and struggled with a lot is that it’s very hard to get your network-related monitoring right. I’ve had the wrong set of dashboards and alerting and wondered why an alert was going off every time something happened when it shouldn’t have. So being okay with experimenting, and continuously tinkering with your monitoring and alerting as you go along, is something teams should be conscious of. It’s not that you build it once, and then it’s there forever. But there is a continuous evolution that should happen with your monitoring. Just like how your feature sets go through a cycle, you have to do the same with the observability side of things as well.
Matt: Mat, can I answer too?
Mat: Um, lemme just check … No.
Matt: Oh, come on!
Mat: Please, I’d love to hear what you think!
Matt: I think the common gotcha is forgetting that you need to deliver something that someone can adopt easily. I was thinking of car parts — or Legos — but it’s like dropping off a collection of car parts and saying, “There you go!” And the person’s like, “All I want to do is drive. You haven’t really helped me at all.” And, you know, you can call a Lyft, and that’s where the metaphor breaks down… But there’s a tendency to just drop off a collection of pieces that could work and expect the user to do the last mile.
Mat: What helps with that is this idea: You build it, you run it. We’re not throwing this thing over the wall for someone else to operate, which I know lots of people still do. There’s a disconnect otherwise. When you’re running it yourselves, you are the customer of that data. So it’s a bit like dogfooding software if you’re building dev tools. We do that at Grafana; we dogfood a lot. We use our tools a lot internally, and that’s why they’re so good, frankly, because we’re not imagining the user of the product. We are the user, and I think that makes a big difference.
“There is a continuous evolution that should happen with your monitoring. Just like how your feature sets go through a cycle, you have to do the same with the observability side of things as well.”
— Nayana Shetty
Nayana: One of the comments I’ve heard a few people make is: Build your code in such a way that you can debug it at 3 in the morning. It doesn’t mean you have to do it every day, but if it breaks at a time when you’re not fully focused, you can still get to it easily.
Mat: That’s such a great point. And that leads me to my next question, which is around drills. Should we be doing drills at 3 a.m. and living that experience to see what it’s like?!
Nayana: Three o’clock is probably taking the mickey out of people….
Matt: Do people do drills?
Nayana: I have seen it done, though I think drills happen in a very artificial environment. One of the things we did when I was at the FT was incident drills. Basically, you emulate an incident, and then the team works through how to actually figure out where the problem is. You start with which alert fired, then look at the traces, then look at the logs. You go through the whole cycle of it. A lot of people were not very keen on this because it’s an artificial environment; people felt it wasn’t reality, so why do it?
Mat: It’s because you didn’t do it at 3 a.m.