'Grafana’s Big Tent' podcast: Season 2 is here!

2024-08-23 17 min

Your favorite podcast on all things observability is officially back and better than ever.

Today, we’re excited to kick off season 2 of “Grafana’s Big Tent,” our award-winning podcast about the people, community, technology, and tools shaping observability.

We launched “Grafana’s Big Tent” in 2022, with the goal of engaging in fun, thoughtful, and open conversations with members of the observability community. The name of the podcast originates from our “big tent” philosophy — the belief that you should be able to access your data, no matter where it lives, and choose the observability tools and strategies that best suit your needs.

“At Grafana Labs, it’s way more than just about the technology we build,” said Grafana Labs CTO Tom Wilkie in the introductory episode. “Grafana is one of the few pieces of software in the world that touches hundreds of different projects. And we want to use this platform we’ve got as a way of shining a spotlight on some of those projects.”

We loved recording that first season — and, more importantly, a lot of you loved tuning in. In 2022, “Grafana’s Big Tent” exceeded 10,000 downloads across more than 100 countries. It was also recognized as the Best DevOps Related Podcast (audio only) in the DevOps Dozen 2022 awards.

Now, we’re picking up where we left off, rolling out a second season of episodes that spotlight some of the incredible people and innovative projects driving the observability space today.

In the first episode, “Cache Rules Everything Around Me,” Grafana Labs Engineering Director Mat Ryer returns as a host. He’s joined by Ed Welch, Principal Engineer at Grafana Labs; Danny Kopping, Staff Software Engineer at Coder and former Grafanista; and Alan “dormando” Kasindorf, maintainer of Memcached, the open source high-performance, distributed memory object caching system.

The group gets candid about caching, exploring everything from CPU- to application-level caching and sharing strategies to supercharge the performance of high-traffic, distributed systems.

Note: The following are highlights from episode 1, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.

Mat Ryer: Hello and welcome to Grafana’s Big Tent, the podcast all about the people, community, tools, and tech around observability. I’m Mat Ryer, and today we’re digging into caching.

I’m sure we’ve all at least encountered some form of cache in our computering lives so far, but what’s really going on there? We’ll dig a bit deeper and look at some technical considerations around that, and maybe even explore some common gotchas. And we’ll look at this in the context of a real case study from within the Loki project here at Grafana Labs.

Welcome, Danny, and welcome back, Ed. We’re also joined by Alan Kasindorf. Alan, you’re Dormando online, and you’re the full-time maintainer of Memcached. Welcome to the show.

How do you define caching, exactly?

Mat: So, as the famous joke goes, the three hardest things in tech are naming things, invalidating the cache, and invalidating the cache. Let’s start by actually looking at what caching is. Ed, if you had to define caching, how would you do it?

Ed Welch: So, when I think about how I use cache — which is how I’m going to define it — it’s because I want to save myself from doing some work twice. So, I do something that takes some amount of time, it’s computationally expensive, it’s IO expensive, or any of those things, and I don’t want to do it again. Caching is sort of a faster path to getting the same result. I did not look up any actual definitions beforehand so I’m curious to see if others have some variations.

Mat: Yeah, I mean, that’s certainly how I’ve used caches before. Danny, what do you think?

Danny Kopping: Roughly along those same lines. I think where you see cache in other places is sort of preemptive caching. So, instead of saving the results of a previous computation, you can preempt what you’ll need in the future. And this is done a lot in CPUs and in web applications. In the past, I worked on an e-commerce application, and when we would boot up our memcached instance, we would pre-warm the cache with all of the data about the store. So that would be done ahead of time, rather than save what you’ve done in the past… which is kind of another awesome feature of cache. It doesn’t really care about where the data came from. It just helps you speed things up and lower latency.
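To make those two patterns concrete, here is a minimal sketch in Go of a compute-on-miss cache with a pre-warming step, along the lines Ed and Danny describe. The types, keys, and functions are purely illustrative; they are not taken from Loki or from the e-commerce system Danny mentions.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a minimal in-memory key/value cache. A real cache would add TTLs,
// eviction, and bounded memory; this sketch only shows the two patterns from
// the discussion above: compute-on-miss and pre-warming.
type Cache struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewCache() *Cache {
	return &Cache{data: make(map[string]string)}
}

// GetOrCompute returns the cached value if present; otherwise it runs the
// expensive function and remembers the result ("don't do the work twice").
func (c *Cache) GetOrCompute(key string, compute func() string) string {
	c.mu.RLock()
	v, ok := c.data[key]
	c.mu.RUnlock()
	if ok {
		return v
	}
	v = compute() // the slow path: DB query, API call, rendering, etc.
	c.mu.Lock()
	c.data[key] = v
	c.mu.Unlock()
	return v
}

// Warm pre-populates the cache ahead of time, like pre-warming store data at boot.
func (c *Cache) Warm(entries map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, v := range entries {
		c.data[k] = v
	}
}

func main() {
	c := NewCache()
	c.Warm(map[string]string{"store:123:name": "Acme Widgets"}) // preemptive
	fmt.Println(c.GetOrCompute("store:123:name", func() string { return "expensive lookup" }))
}
```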

Mat: So caches are quicker to read from and write to compared to other things?

Danny: Yeah, typically caching is done in memory, and that has a very, very low latency, compared to, for example, looking it up in the database, or retrieving it from disk. Not all caching is done in main memory, though, as we’ll chat about later, but typically it’s done that way so you have the fastest access possible to that information.

Mat: So, these kinds of decisions about where we should cache things — should they be data-driven? How do we know what would be a good thing to cache? Surely we could just cache everything, and then it'd be really quick, wouldn't it?

Alan Kasindorf: So, the definition of not doing something twice is pretty good for cache. It's probably the best way to really think about it if people are just starting to think about it. But the other main thing with caching is just getting data closer to you — usually physically. So, for CPU caches, L1 is really in there next to your execution units, which are actually doing the math. And then you have L2, L3, and then main memory, and it's a matter of distance. The closer something is, the more you tend to refer to it as being cached.

Peeling back the “layers” of caching

Mat: Could you explain a bit about these different layers of cache, because you hear this a lot — L1 cache, L2 cache, etc. What do these levels mean, and what’s different about them?

Alan: So, when we talk about caching, it's mostly application-level caching. So very high-level, L7 in the parlance, the applications talking to each other over the network. But when you talk about caching in the general sense, related to computers, the first thing people usually come up with is the CPU cache, which holds data and code a lot closer to where the execution units are, so the CPU doesn't have to wait as long. So you have these CPUs that are running at four gigahertz, five gigahertz… they're doing things like 5 billion times per second, and having to wait for data completely kills them.

Accessing something from main memory, or a disk, leaves your CPU just sitting and doing nothing for a very long period of time. So if you really want to get the most out of your computer, you need to make very good use of caches at the various layers. And you can pull that up to a higher level, where maybe you have a whole server sitting there in the cloud somewhere, and it's waiting on a database response, and you can't do anything else while you're waiting for that database. You might have other processes on the box doing something, but in general, you're paying for this thing to sit there and you want it to actually be active and doing things, instead of just waiting for data.

Mat: And what does a typical caching API look like? In application-level caching, I've used things like just simple get and put. And sometimes you can set an expiry. Is there a common pattern? Are there different flavors of this?

Ed: I know, for example, in our application-level caching, we tend to use pretty simple interfaces. Really, one question would be whether there's some mutability in your cache or not, or whether it's just get and put. I think we actually might do both with Loki. We might mutate, and if I'm wrong about that, Alan is going to tell me whether Memcached supports mutating keys within the cache or not… or if we just append multiple entries and it sort of appears like we're mutating it based on how we access it. But for the most part, our APIs are pretty much CRUD. Although I suppose we don't really delete cache entries; we just rely on expiration for that. So we use Least Recently Used (LRU) style eviction to expire things when the cache is out of space.

Alan: The common APIs are really just gets and sets. The joke, especially at larger companies, is "What's so hard about cache? It's just gets and sets." It's really gets, sets, deletes, and sometimes you want to update the TTL (the time to live)… there's lots of little things that come up, but really, it's just gets and sets. And then you have more advanced stuff that maybe we can get into later, but in general, the application cache doesn't usually operate the same way as a CPU cache, which is usually read-through and write-through.
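For a feel of what that get/set/delete/TTL surface looks like in practice, here is a small example using one widely used Go memcached client (github.com/bradfitz/gomemcache). The server address, keys, and TTL values are assumptions for illustration; this is a sketch of the API shape Alan describes, not code from Loki.

```go
package main

import (
	"fmt"
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

func main() {
	// Illustrative only: assumes a local memcached on the default port.
	mc := memcache.New("127.0.0.1:11211")

	// Set with a TTL (Expiration is in seconds for values under 30 days).
	err := mc.Set(&memcache.Item{Key: "user:42:name", Value: []byte("Ada"), Expiration: 300})
	if err != nil {
		log.Fatal(err)
	}

	// Get: a miss comes back as memcache.ErrCacheMiss rather than an empty value.
	item, err := mc.Get("user:42:name")
	switch err {
	case nil:
		fmt.Println(string(item.Value))
	case memcache.ErrCacheMiss:
		fmt.Println("cache miss; fall back to the database")
	default:
		log.Fatal(err)
	}

	// Refresh the TTL without rewriting the value, then delete.
	if err := mc.Touch("user:42:name", 600); err != nil && err != memcache.ErrCacheMiss {
		log.Fatal(err)
	}
	if err := mc.Delete("user:42:name"); err != nil && err != memcache.ErrCacheMiss {
		log.Fatal(err)
	}
}
```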

Danny: If folks want to hear more about CPU caches, there’s an amazing talk by a chap called Scott Meyers. He does a talk called “CPU caches, and why you care.” Working in very high-level languages, I’ve never really considered how things work at a lower level like this, but when you start working on very performance-sensitive code, writing software in a way that is mechanically sympathetic to how the CPUs work can really improve the performance of your code in quite surprising ways.

High performance at scale and caching best practices

Mat: Yeah, again, this comes back to that point of when does this become a problem for you? Because for a lot of people, depending on what you're building, if there are any bottlenecks, you're really going to feel them at scale.

Danny: Yeah, absolutely.

Ed: I’ve been thinking about Alan’s extension of my original definition, and it will tie into a big part of what we want to talk about: some caching changes that we made. For a system like Loki, a distributed system, we’re largely just trying to get stuff within the same rack, or set of racks within an availability zone, or two, or three. And there’s sort of two ways that we use cache.

And maybe one of the things we can talk about is when you shouldn't cache, or when you know that you need to cache. It is going to be a bit subjective based on what you're doing. But like a lot of things, these problems will present themselves; I wouldn't necessarily suggest you prematurely cache your results for whatever your app does. But one of the things we find is that the access patterns for a time series database that is backing a dashboarding tool mean you get a lot of repetitive access to the same data. The dashboard will be refreshed, and you will be essentially drawing the same data repeatedly. Most of that data hasn't changed; only the most recent data since the last refresh interval has. So that was a use case where caching results becomes incredibly valuable for fast reload times. We don't want to have to fetch the same data again and again.

But we also cache the data that we fetch from object storage. Loki is a database built on object storage, like S3, GCS, and Azure Blob. These object stores are very fast; I’m sure that they have a tremendous amount of caching inside of them based on the behaviors that we see. But it’s still the case that when we’re using stuff multiple times, it makes sense for us to cache it closer to where we’re accessing it. So we just cache the data chunks themselves.
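As a rough illustration of that chunk-caching idea, here is a cache-aside sketch in Go: check a local cache first, and only go to object storage on a miss. The ObjectStore and ChunkCache interfaces and the FetchChunk function are hypothetical stand-ins, not Loki's actual types.

```go
package chunks

import "context"

// ObjectStore is a hypothetical stand-in for an S3/GCS/Azure Blob client.
type ObjectStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// ChunkCache is a hypothetical stand-in for a memcached-style cache client.
type ChunkCache interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

// FetchChunk applies the cache-aside pattern: serve from cache on a hit,
// otherwise fetch from object storage and populate the cache for next time.
func FetchChunk(ctx context.Context, cache ChunkCache, store ObjectStore, key string) ([]byte, error) {
	if data, ok := cache.Get(key); ok {
		return data, nil // hit: no object-store round trip
	}
	data, err := store.Get(ctx, key) // miss: pay the latency and cost of object storage
	if err != nil {
		return nil, err
	}
	cache.Set(key, data) // subsequent dashboard refreshes reuse the chunk
	return data, nil
}
```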

Alan: Keeping it simple is easier said than done, I think. Usually, the common refrain with DBAs is, "Oh, somebody added a little cache when they really just needed to add an index." So, usually, you probably want to cache database-level stuff after you've kind of run out of other options. You've done all the good things: you actually understand where your database is, your database itself has more than like 10 megs of cache configured, the machine is sized properly… You really want to try not to do that, because it is adding complexity. And you only really want to use cache if it factors into your cost metrics: you want time to be better, and you want it to cost less overall.

Mat: That's really sound advice. I've definitely reached for the cache too soon in the past. And actually, as soon as there are a few of these interacting systems, the number of possible combinations goes through the roof, and it can be really quite difficult to debug caching issues. Any web developer knows that the JavaScript in the page is cached, and sometimes refreshing that doesn't actually get the latest version of the code. Things like that still happen, which is kind of outrageous.

Danny: And you opened up with one of the most pernicious problems, which is cache invalidation. Like, how do you actually know when the data that’s in your cache is actually the right data? And if you reach for the cache too soon, and you don’t realize that you can start running into this class of problem, you can start developing a whole new problem for your application. So I guess the first law of caching would be “don’t reach for it too soon”, and then the second one would be “know your workload.” Understand what’s going in there, and when it can go stale, how to invalidate it.
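One common way to act on "know your workload" is to invalidate explicitly at the moment the underlying data changes. The sketch below shows that for a cache-aside setup in Go; UserStore, Cache, and UpdateEmail are hypothetical names used only for illustration.

```go
package users

import "context"

// UserStore and Cache are hypothetical stand-ins for the real database client
// and cache client; only the calls used below are sketched.
type UserStore interface {
	SetEmail(ctx context.Context, userID, email string) error
}

type Cache interface {
	Delete(key string)
}

// UpdateEmail writes to the source of truth first, then deletes the cached
// copy so the next read repopulates it with fresh data.
func UpdateEmail(ctx context.Context, db UserStore, cache Cache, userID, email string) error {
	if err := db.SetEmail(ctx, userID, email); err != nil {
		return err
	}
	// Deleting (rather than overwriting) avoids racing with a concurrent reader
	// that might otherwise write a stale value back after this update.
	cache.Delete("user:" + userID)
	return nil
}
```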

Mat: When you read data from the database, you might then change it or do something to it before putting it in the cache. Sometimes, you will just take the same data and have a copy of it. But depending on how that’s going to be displayed, it’s very common to actually prepare the view almost, in advance, so that it’s very easy and very quick to read straight from the cache. Are there good practices around this? Does this just depend on each case?

Alan: I think start simple. So you start with “I’m going to cache some data from my database,” or a picture from somewhere else. And then the next time you look at your profiling data, or you look at, say, your web server load, and you say, “Well, we’re spending 80% of our CPU applying templates to HTML, and then serving that.” That’s how it used to be. We go back and forth between single-page services, and backend rendering, but sometimes that backend rendering is a huge part of your cost and your time. And so you say, “Well, actually, if I am not re-rendering these templates all the time, I can cut my web server load in half, and save half my money, and serve these pages faster.” But again, that’s just more complicated.

Now you have to worry about, "Am I accidentally caching something that's been templated in the wrong language?" Or you have this combinatorial explosion of different ways you have to cache it, so you're kind of like, "Well, maybe I've got this sub-piece of it that takes a while to pull in from three different sources," and you put it all together, and then you cache that. And you kind of just have to look at what your highest pain point is, and only really focus on that. If you think, "I'm going to go holistically from the bottom up and look at every single access we do," you're gonna get lost and just give yourself lots of places where you might end up caching the wrong data, or serving the wrong data to your users, or something they didn't ask for anyway.
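A minimal sketch of that template-caching idea in Go, assuming a standard html/template setup: the important detail is that the cache key carries every dimension that changes the output (page and language here), which is exactly how the "wrong language" bug creeps in when one is forgotten. Function and variable names are illustrative.

```go
package render

import (
	"bytes"
	"fmt"
	"html/template"
	"sync"
)

// renderCache memoizes rendered HTML. The key must encode every input that
// can change the output; here that is the page name and the language.
var (
	mu          sync.Mutex
	renderCache = map[string][]byte{}
)

func RenderPage(tmpl *template.Template, page, lang string, data any) ([]byte, error) {
	key := fmt.Sprintf("%s|%s", page, lang)

	mu.Lock()
	if html, ok := renderCache[key]; ok {
		mu.Unlock()
		return html, nil // skip re-rendering the template entirely
	}
	mu.Unlock()

	var buf bytes.Buffer
	if err := tmpl.ExecuteTemplate(&buf, page, data); err != nil {
		return nil, err
	}
	out := buf.Bytes()

	mu.Lock()
	renderCache[key] = out
	mu.Unlock()
	return out, nil
}
```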

Danny: Yeah, the way that I've used cache in the past is almost like a materialized view. So you've got a very complex database structure, and, to take the e-commerce store example again, you want to find out how many orders a particular user has placed. That becomes a bit complicated, because you have the users table and the orders table, and some orders failed and didn't go through, or there was another problem. So to actually count the number of orders that were successfully placed, you have to run a select statement that selects from multiple different tables. In the SQL space, you could have something like a materialized view, which is a query that gets run on a particular cadence and denormalizes that data: instead of storing it in this perfect form, where everything is related to each other neatly and there are these very atomic fields that each represent one thing, you combine them to represent some more abstract, composite idea. And with cache, you would do that, too. You would change the shape of the data by doing a computation on it and storing the result in cache, so it wouldn't be exactly what's represented in your database.
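Here is a hedged Go sketch of that materialized-view-style use of cache: the expensive multi-table count is computed once and the resulting integer is cached, rather than the raw rows it was derived from. Counter, Cache, and SuccessfulOrderCount are hypothetical names for illustration, not code from the system Danny describes.

```go
package orders

import (
	"context"
	"strconv"
	"time"
)

// Counter and Cache are hypothetical interfaces standing in for a SQL client
// and a memcached-style cache.
type Counter interface {
	// CountSuccessfulOrders runs the multi-table aggregate in the database.
	CountSuccessfulOrders(ctx context.Context, userID string) (int, error)
}

type Cache interface {
	Get(key string) (string, bool)
	Set(key string, value string, ttl time.Duration)
}

// SuccessfulOrderCount caches the answer (a single integer), not the users
// and orders rows it came from, much like a materialized view.
func SuccessfulOrderCount(ctx context.Context, db Counter, cache Cache, userID string) (int, error) {
	key := "orders:success-count:" + userID
	if v, ok := cache.Get(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n, nil
		}
	}
	n, err := db.CountSuccessfulOrders(ctx, userID)
	if err != nil {
		return 0, err
	}
	// A TTL bounds staleness; explicit invalidation on new orders would be tighter.
	cache.Set(key, strconv.Itoa(n), 5*time.Minute)
	return n, nil
}
```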

File size considerations — and recent Memcached updates

Ed: I have a question, Alan, on file size, because that's kind of relative. I was saying how Loki has small files, and you're saying Loki has large files. From the perspective of something like object storage, we generally target making files that are a couple of megabytes in size, which is on the smaller side for large-scale data storage. But for cache, even historically, we were very familiar with memcached giving us out-of-memory errors, because we were largely allocating all the slabs in the 512 KB-and-up range. Is there an ideal size, and/or is there a limit to the size of object you would want to try to store this way?

Alan: So with Extstore, there are some technical trade-offs. The maximum item size has to be half the size of your write buffer or less, or something like that, just based on how it handles the data. As far as large objects go, it doesn't really matter otherwise. Years ago, if you were storing a lot of very large items, you did tend to get out-of-memory errors if you were writing very aggressively. That algorithm has actually been improved in recent versions of Memcached.

So Memcached used to get very inefficient when you stored large data. It was kind of designed for much smaller data, in the realm of a couple dozen kilobytes, or down to a kilobyte or less. It was very good at very small data. But once you started putting in half a megabyte or larger, you’d start losing like 10%, 20%, 30% of your memory to this overhead based on its memory management. And a long time ago now, I edited that, so when you store a large item, it actually splits that up internally into smaller items.

And there were some edge cases in this algorithm that were fixed in the last two years, I think. So if you try that now, it'll probably work a little bit better. Extstore should be more or less fine at handling these larger items. I wouldn't put like 20-, 30-, 40-megabyte items in here, just because it's used to doing stuff on the smaller side.
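For reference, enabling Extstore and raising the item size ceiling is done with memcached startup flags; a minimal invocation might look like the sketch below. The memory size, item limit, and path are placeholder values, and option defaults vary by Memcached version, so check `memcached -h` for your release.

```
# Illustrative flags only; values are placeholders.
# -m: megabytes of RAM for the in-memory item cache
# -I: maximum item size (the default is 1 MB)
# -o ext_path: enable Extstore, backing cold items with a 64 GB file on disk
memcached -m 2048 -I 8m -o ext_path=/var/lib/memcached/extstore:64G
```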

Ed: I should say that, because Memcached has been so reliable for us, we didn't upgrade it for a while. That class of errors represented a small percentage of our write traffic, so we just kind of tolerated it. We have since updated, and those errors have all gone away. So it all works very well now, thank you very much.

Alan: Yeah, the not-upgrading thing is pretty common. It's pretty hard to convince people to upgrade, despite the fact that I usually do monthly releases of the software; it's usually a surprise to people to find out that it's still in active development.

Ed: Yeah… we were super excited to see those changes. And like I said, that class of errors has gone away, and now we're finding that Extstore works really, really well for what we're doing here. There's a part of it I didn't quite talk about earlier, which is that one of our goals wasn't to take all of the traffic away from object storage. Loki is a highly parallel machine… and what we were looking to do was take some portion of that traffic off object storage. The total amount of data will always be more than we'll probably want to put on disk, but we can take a controllable amount now, by changing the size and the number of nodes we run, and say, "20%, 25%, maybe 50% of our data will come from local disk and the rest from object store." That gave us a really nice knob to turn. So I'm very excited. If anybody has not seen or checked out Memcached in the last few years, definitely do it.

“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com.