What’s the saddest log line in the multiverse? A log line without context.
That’s according to Grafana Labs software engineer and Grafana Loki tech lead Ed Welch, who joined Grafana Labs VP of Culture Matt Toback and Engineering Director Mat Ryer for the latest episode of “Grafana’s Big Tent,” our new podcast about people, community, tech, and tools around observability.
Listen to the latest “All things logs” episode to learn how to approach your logging strategy with the endgame in mind and why structured logging will improve your quality of life, and to hear the logfmt chants you didn’t know existed in the Go community.
Also for more on logging and Grafana Loki, join us for our free webinar “Scaling and securing your logs with Grafana Loki” on May 19.
Note: This transcript has been edited for length and clarity.
An introduction to log strategy and log levels
Mat Ryer: How early in the process should we start to think about the logging strategy?
Ed Welch: Logging is probably one of the things that we’re most familiar with and also something that we really don’t think a lot about at all. It is an after-the-fact thing, where you go back and realize you have way too many logs or you can’t understand them. So thinking about it up front and trying to build up some ideas for how to approach them is definitely going to help everybody.
Whenever you log something, you really should consider – What is this log line going to be used for? Who’s going to use it? What level should it be logged at? This is a hard one. But it really helps to think about what the life cycle of that log line should be. How can it be useful in the future? And if it’s not, do you really need it?
Mat: What’s a log level?
Ed: Good question. Every programming language has a different definition of levels. The range runs from something like trace, debug, info, warn, and error. Most loggers let you choose a minimum severity to log at. If you selected info, only info, warn, and error logs would be displayed. If you chose trace, everything would be displayed.
Mat: Is everyone using the same levels?
Ed: No. Nobody uses the same levels and nobody can agree on how to use those levels.
Mat: I tend to use two levels. Each log line gets a level, so you’re tagging a log line to say either “this is an error, that’s very useful,” or it’s just information. Warnings are in there, too, but it’s difficult to separate out those different levels, and I try to go for as few as possible. I usually get away with two: error or info. And then sometimes you want extra debug output for when things are really going wrong.
“Whenever you log something, you really should consider – What is this log line going to be used for? Who’s going to use it? What level should it be logged at?”
– Ed Welch
Logging best practices
Mat: So how do you make a log line useful to somebody? Can we just put a runbook link in for certain errors?
Ed: That was a pattern that I was very, very familiar with. And it’s something that I think log solutions should probably try to adopt and give that familiarity with a command line to be able to say, I’m looking at garbage. I want to start picking off things that don’t matter and reducing the result to something that’s hopefully going to help me understand what I’m looking for.
Mat: Knowing that upfront can help. So now we’re thinking when we write out logs, we are going to deliberately think about the kinds of things we want to do later. And maybe you could give us a little intro into what structured logging is. It’s quite new. What is structured logging and do you recommend it?
Ed: It’s actually changed a bit over time, but structured logging is basically when you log something in a format that’s machine readable in a relatively easy way. I want to make it easy to either match a specific pattern or log in a serialized format like JSON, because logs also need to be human-friendly. We have to go through from time to time and actually look through our logs. And JSON is not human readable. It’s not. It’s fine to log JSON, it’s ubiquitous, and all of the logging tools work with it well. But keep the objects really simple. If you need to log complex data, use a database or store it someplace else. There’s a hybrid that’s common in the Go world: logfmt. It’s key-value pairs, but it’s still structured.
Mat: I like the fact that you used “fmt.” Anytime someone says “fmt” outside the Go community, it elicits an interesting reaction. So I’m really happy you did that before I did. So what is logfmt then?
Matt: Is this real? Do people say this out loud? When a bunch of Go people get together, do they just go “fmt”?
Ed: I’ve only ever heard two people say it out loud — and Mat is one of them.
Mat: We don’t chant it, though. It’s not like FMT! FMT! FMT! It’s not like the order of Go. If anything, it’s science-based. But we do say “fmt,” though. Well, I do. It’s weird. Ed said it first.
Ed: So log format, logfmt, is a key-value pair that’s space delimited. The values are wrapped in quotes if they have spaces in them, and then the quotes are escaped inside of that. It’s structured, but what I like about it is that it removes the curly braces and a lot of the commas and the quotes, so it’s more condensed and easier to scan as a human trying to read log lines.
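A toy encoder in Go sketches the quoting rule Ed describes. Real logfmt libraries handle more edge cases (equals signs, newlines, non-string values), so treat this as illustration only:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// logfmtLine renders key-value pairs in logfmt style: pairs are space
// delimited, and values containing spaces or quotes are wrapped in
// quotes with the inner quotes escaped.
func logfmtLine(pairs ...[2]string) string {
	parts := make([]string, 0, len(pairs))
	for _, kv := range pairs {
		v := kv[1]
		if strings.ContainsAny(v, " \"") {
			v = strconv.Quote(v)
		}
		parts = append(parts, kv[0]+"="+v)
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(logfmtLine(
		[2]string{"level", "info"},
		[2]string{"msg", "request done"},
		[2]string{"duration", "25ms"},
	))
	// level=info msg="request done" duration=25ms
}
```

Compared with the JSON equivalent, the same record drops the braces and most of the quotes, which is what makes it easier to scan.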
The one thing every log line needs to have
Mat: Are there any common things that you think we should just always put in a log line? Is there anything that’s just sort of low hanging fruit that we should all be doing? Tell us them, please. Now.
Ed: The one thing that every log line needs to have is some context. Good examples of context are something like a traceId or an orderId. Something that’s very unique. You want to be able to search for all of the events across your huge distributed systems that correspond to the flow of a thing through your system. If you don’t log an orderId or requestId or traceId, all of the log lines that were part of that don’t have context, and they are useless. You’ll never find them, and they really shouldn’t be logged at that point.
Matt: That makes a lot of sense. And think about a troubleshooting workflow. Without a clue of where to go next, it would be frustrating at 3 a.m.
Mat: A great man once said, “A log line without context is sad and useless.” That was Ed. Pop that on a t-shirt straight away.
I’ve actually deliberately made the errors quite funny and silly. It’s kind of unnecessary, but I did start to say “try this, try this, try this” and link to things. It actually turned out to be quite useful. Do people do that? Is that good practice?
Ed: I think this should be a practice – consider who is going to read this message in the future and in what context? The more that you can give useful information in your error logs, that means the less someone has to copy that error log, open up their editor, and search the code base to figure out what they’re supposed to do. So write that error message for your future self.
“A great man once said, ‘A log line without context is sad and useless.’ That was Ed. Pop that on a t-shirt straight away.”
– Mat Ryer
How to use labels for logging at scale with Grafana Loki
Mat: When you start running systems at scale, suddenly all these logs really do start to add up. Like you said earlier, there’s a price to a log line, and I’d never really thought of that before. So when it comes to storing logs, what do we do? What’s the approach?
Ed: Logging does present a scale problem. And obviously I’m a bit biased, but I think Grafana Loki is a great approach to solving this problem. With Loki, we’re thinking about it as a scale problem first and a search problem second. That’s where the indexing of your logs becomes important. If you build a big index, you have to store it and load it. And if your scale gets to a certain point, that index becomes its own scale problem. You have a thing that can search really fast if you can give it enough memory and resources.
With Loki, the index is intentionally kept small, and it’s just some metadata around where your logs came from to make it easy to find them. Instead of having a quick lookup like an index in the back of the book that says this word was on page 23, we have a table of contents that tells us the word is somewhere between pages 100-300. And then we tell 100 computers to go read all those pages at the same time. That has a resource cost, but the cost is transient, and you can control it. You can set how much parallelism you want to read at the same time and store a really small index.
Mat: So why are the indexes in Loki so small then?
Ed: Largely because it’s just some key-value pairs. The key-value pairs identify a log stream, which is the unique thing that we store. Logs come in as a stream, we bundle up those streams into blocks of a certain size and then flush them and store them in an object store. That key-value pair is how you find your logs.
Labeling in Loki is probably one of the most important things to understand and do correctly, because if you choose really high cardinality labels, you end up effectively building a reverse index. You end up having a label that will tell you exactly where your log lines are, but now the index gets really big and it defeats the purpose.
Mat: So what are examples of good labels that you’ve seen in Loki then?
Ed: The environment, the cluster, the application. In the Kubernetes world, we use `pod` as a label. Ideally you don’t want to churn more than 300,000 streams in a day. The goal should always be to think of labels as an intuitive way to find your logs. I like to approach labels by considering how I would go about finding the thing I’m looking for. And that usually maps pretty well to a good label set.
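As a sketch of how this plays out in a Loki query (the label names and values here are illustrative), you select streams with a small set of low-cardinality labels and then filter for the high-cardinality value inside the log lines, rather than making it a label:

```logql
{cluster="prod", namespace="checkout", app="api"} |= "orderId=42"
```

The stream selector keeps the index small; the `|=` line filter is the “100 computers reading pages 100 to 300 at the same time” part, and `orderId` stays out of the label set where it would explode cardinality.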