SLOs, shift left observability, logs: Highlights from Grafana's Big Tent podcast
Since “Grafana’s Big Tent” podcast launched in April, each episode has delivered insights into the people, community, tools, and tech of observability. From the saddest log line in the multiverse to how metrics, logs, and traces are like a mother sauce, our intrepid hosts, Mat Ryer, Matt Toback, and Tom Wilkie, have been joined by their distinguished guests to tackle a wide range of discussions (not to mention a few conversational detours).
For a special bonus episode to cap off the first season, Mat, Matt, and Tom got together to share their favorite episodes from the past season and plot their plans for the next one.
Listen to their top takeaways from season 1, get schooled on logging with JSON, and hear Matt Toback rant about SLOs with the best stand-in expletives he could muster. Oh, and be sure to send us your suggestions for next season — we’d love to hear them.
Note: This transcript has been edited for length and clarity.
Logs should be human readable — and it’s still fun to say ‘logfmt’
Tom Wilkie: So, Mat, which was your favorite episode?
Mat Ryer: I really liked the one with Ed Welch when we were talking about logs, because it was packed full of actionable knowledge. I used to think logging in JSON was really clever because it’s structured and you can query it properly later. But Ed made the point that it’s not so human readable and also encourages you to do more complicated things than you should in the logs. I think he’s right about this. In fact, I would go so far as to say I probably wouldn’t use JSON.
Tom: What would you use instead?
Mat: Probably something like logfmt or log F-M-T, depending on how you want to say it.
Tom: I asked because I wanted to hear how you pronounced “logfmt.”
Mat: I’ll just say fumt. I don’t even look embarrassed when I say it.
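To make the readability point concrete, here is a minimal sketch that renders the same log event as JSON and as logfmt-style `key=value` pairs. The `to_logfmt` helper and the example event are hypothetical and written for illustration only; the quoting rules are simplified and this is not the output of any particular logging library.

```python
import json


def to_logfmt(fields: dict) -> str:
    """Render a flat dict as space-separated key=value pairs (simplified logfmt)."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        # Quote empty values and values containing spaces, quotes, or '='
        # so the line can still be split back into pairs.
        if text == "" or any(c in text for c in ' "='):
            text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


# Hypothetical event, invented for this example.
event = {"level": "error", "msg": "upstream timed out", "status": 504, "duration_ms": 1200}

json_line = json.dumps(event)
logfmt_line = to_logfmt(event)

print(json_line)
print(logfmt_line)
# level=error msg="upstream timed out" status=504 duration_ms=1200
```

Both lines carry the same fields. The logfmt-style line drops the braces and most of the quotes, and its flat shape leaves no room for nested objects, which lines up with Mat's point about JSON encouraging more complicated logs than you need.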
SLOs can be nonsense — and it is possible to avoid swearing on a podcast
Tom: For me, it’s the SLOs episode.
Matt Toback: I’ve got a big old bone to pick about this.
If you listen to this episode, there’s so much good stuff. Google wrote the book on SRE and widened the world to actually understand what you need to be tracking and how you need to track it — how software has gotten so much more complex and that it’s not a single up or down. It’s not a power switch.
And then you have contract folks and the business side, which, despite all the progress on this, continues to sign multi-million dollar deals on SLAs. And they’re sitting across the table from each other countering, like someone goes, “99?” and someone else goes, “99.1,” right?
But it’s bull . . . um . . . it’s nonsense. It’s nonsense! Because if anything happens, both people feel like they have a leg to stand on, to argue with each other. Meanwhile, SREs are in the background saying, “I told you not to do this. This is nonsense.”
Tom: I won’t agree to a particularly high SLA in a contract because, to your point, we can’t meet arbitrarily high SLAs. There are things we can do, like move customers to the end of our deployment rollout schedule so the chance of them getting a new bug is lower. So we can do some things to mitigate outages, but when customers ask for 99.9999% uptime, we can’t deliver that in a single region.
It is a useful conversation because once you tease that out as an actual business need for them and they’re willing to pay for it, then we design an architecture that is deployed in multiple regions and it costs twice or three times as much.
So there are some uses for talking about SLAs, but when customers say they’ve got suppliers giving them a better SLA than they give anyone else, I’m going to call BS on that . . . Can I say BS?
Mat: That’s why that exists.
Tom: I’m going to call nonsense on that . . . If you want to see me get riled up, talk about uptime SLAs.
Matt: Tom, what do you think about uptime SLAs?
Tom: How do you measure uptime? We operate a SaaS service and it responds to requests. If you don’t send a request in a given unit of time, was it up or not? Does it even matter? It’s a tree-falls-in-the-woods type of thing.
So we try to offer customers a request-based SLA, agreeing that we will respond to a certain proportion of requests successfully. We feel like we’re doing the customer-friendly thing and giving them something that’s measurable and impactful and meaningful. But sometimes they come back and ask for an uptime SLA. And I’m like, “But it doesn’t mean anything!” Still, you can’t tell people what they should care about; you have to meet them where they are.
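As a rough illustration of the request-based approach Tom describes, the sketch below computes the proportion of successful requests over a period and compares it with a target. The `Window` type, the function names, the numbers, and the choice to treat a period with no traffic as compliant are all assumptions made for this example, not a description of how Grafana Labs measures its SLAs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def request_success_ratio(windows: list[Window]) -> Optional[float]:
    """Return the fraction of requests that succeeded, or None if there were none."""
    total = sum(w.total_requests for w in windows)
    failed = sum(w.failed_requests for w in windows)
    if total == 0:
        # No requests were sent, so there is nothing to measure.
        return None
    return (total - failed) / total


def meets_target(windows: list[Window], target: float) -> bool:
    """Assumption: a period with no traffic counts as meeting the target."""
    ratio = request_success_ratio(windows)
    return True if ratio is None else ratio >= target


# Invented numbers: the middle window saw no traffic at all.
windows = [Window(1000, 1), Window(0, 0), Window(4000, 3)]

print(request_success_ratio(windows))  # 4996 of 5000 requests succeeded
print(meets_target(windows, 0.999))
```

The idle window adds nothing to either count, so the "was it up or not?" question never has to be answered: only requests that were actually sent affect the result.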
Iterate continuously — and be kind to your future self
Matt: An episode that I really liked was the one with Nayana Shetty from the Lego Group. I really enjoyed when she said this:
“Shift left and test early. Release as small as possible and continuous iterations and stuff. All of this leads to this point: How do you make your future better? One of the quotes I have often used is being kind to your future self. How can you make your life easy in the future? Think about that today when you’re building whatever you’re building.”
This is such a great quote. It’s something you can repeat and remember when you’re making any decision. Any good software advice is probably good life advice, too.
This episode was unique because Nayana is a practitioner. She’s the one taking all these things we’re talking about and figuring out how to adopt them, reconcile it with all the other stuff that exists, and push everything forward. I loved how she distilled it down to being kind to your future self.
Tom: We talk about it in terms of technical debt as software engineers. Debt is the key word — you’re incurring a cost in the future, which is not being kind to yourself. And internally we have a term called organizational debt when we do something repeatedly the manual, hard way and we probably should have a process for it.
But I would rather have a little technical debt and a little organizational debt than over-design and over-engineer everything up front and waste that time and effort. So it’s kind of a balance.
Matt: I like that. It’s kind of like taking out a technical mortgage.
Tom: I think making a conscious choice is the key here. Problems arise when you accrue technical and organizational debt without knowing it.
Mat: Big up-front designs where you’re making too many decisions can mean you aren’t giving your future self the options they might need. It feels good because it feels like everything is solved, but you just painted yourself into a corner.
Tom: So I think that brings us to what’s coming next for Big Tent.
Mat: What do we think we can improve on that we didn’t get right?
Matt: I think there weren’t enough tent jokes.
Tom: I pitched you a few tent puns before, but you didn’t go for them.
Matt: Save that for Season 2.
Tom: Oh my.
You need an error budget for your podcast — and your life
Tom: We’ve definitely experimented with different things. One guest, two guests — there was an episode where we had three guests and it worked pretty well but it was difficult to know who should talk when with that many people. I think I’ve learned that one guest, two interviewers works really well.
Matt: I did an internal podcast episode with the different leads of the community calls — Tempo, Loki, Grafana, the whole thing. And it was too many. There was no banter and people were a bit more nervous about what they were saying. It just didn’t work quite as well.
Tom: We also tried to introduce a “what’s your favorite dashboard” segment where one has to describe their favorite dashboard. But that only lasted one episode, I think.
Mat: That’s a shame because I loved that idea, but you’ve got to be able to fail. And make people comfortable enough so you can try things and take risks. You need an error budget. You need an error budget for podcast ideas, too.
Matt: An error budget for life!
Mat: For life! Yes, you do. Genuinely. If you’re too cautious, you’ll tie yourself in knots. Allowing things to fail, taking it on the chin, and owning it – that’s very healthy.