Observability in Go: Where to start and what matters most

2026-04-06 | 8 min

Sometimes the hardest part of debugging a system isn’t fixing the problem—it’s figuring out what’s actually happening in the first place.

In this episode of “Grafana’s Big Tent” podcast, host Mat Ryer, Principal Software Engineer at Grafana Labs, is joined by Donia Chaiehloudj, Senior Software Engineer at Isovalent (Cisco) and co-author of “Learn Go with Pocket-Sized Projects,” along with Charles Korn, Principal Software Engineer at Grafana Labs, and Bryan Boreham, Distinguished Engineer at Grafana Labs, to talk about observability in Go.

They dig into where to start (hint: logs are often the first step) and how context, metrics, traces, and profiling fit together as systems grow more complex. Along the way, they share practical lessons on turning logs into metrics, avoiding common pitfalls with context and tracing, using pprof effectively, and what eBPF unlocks when you need visibility beyond your application.

You can watch the full episode on YouTube, or listen on Spotify or Apple Podcasts.

(Note: The following are highlights from episode 8, season 3 of “Grafana’s Big Tent” podcast. This transcript has been edited for length and clarity.)

Starting a Go project: Where observability begins

Donia Chaiehloudj: I would go simple to start. We know that we are always refactoring along the way and that priorities change, like real life. But I would try to go for the Go standard library as much as possible, because we know that it’s stable and not going to be archived tomorrow.

I would also go for well-known libraries that are not standard, but are used by a lot of people and are well-maintained, even though we know that contributors are less and less in the open source world. I would also think about some standardization from the beginning—for your data, your context, the way you want to trace, that kind of thing. 

Mat Ryer: Yeah, I think that makes sense. I like your point that you're going to refactor. Things are going to change. That kind of takes a bit of pressure off. It doesn't have to be perfect straight away.

Starting with logs—and turning them into metrics

Charles Korn: The thing I use most often, at least in the stuff that I'm working on at the moment, is logs. From logs, you can derive metrics if you really need to, so that's probably where I'd start. They're really easy to get started with. You can dump them into a file, you can dump them to the console, and you can start shipping them off to a system like Loki.

Mat: Yeah, I think starting with logs is quite natural.

Charles: We've got a bunch of Go services at Grafana Labs, and unfortunately, occasionally they panic. When they do, they dump the stack trace to standard error, and it gets picked up by our logging system.

And it's really useful to be able to show that on a graph—how often a thing's panicking. We actually have a system where it'll look at the logs, count the number of things that look like a panic, and turn that into a metric. And then we can alert on that metric just like any other metric. That's really helpful.

Mat: Yeah, so you literally then get a graph that shows you how many panics you're having.

Charles: Exactly. And you can have alerts on that.

Tracing and why context matters

Bryan Boreham: Tracing adds that explicit parent-child relationship, and everything's always got a beginning and an end. So I think tracing is kind of the superpower to figure out, with any complicated program, what happened. 

Mat: So the idea being it's spending more resources in that bit, and therefore, if you're going to optimize something, go for the big-hanging fruit, would you say?

Bryan: I suppose so.

Mat: So you mentioned that you would do that only in advanced projects or complicated projects. How do you know when it's time to reach for tracing?

Bryan: Well, for myself, I said 20 or 30 lines, but it gets complicated. So my bar is very low. My ability to concentrate on things is quite poor.

It also depends, because tracing is quite complicated, or people find it complicated to set up. With logs, you just stick it in a file and then read it, so it's orders of magnitude different.

But tracing really comes into its own when you have multiple bits in what you call a distributed system, multiple frontend and backends, or multiple bits of backend, or something like that. You pass the same ID around to everything because it's related, and then they're all logging or they're all reporting using that same ID, and that allows you to then tie the whole process together across all these multiple systems.

Mat: Yeah, it's very cool, and of course makes sense at scale.

Errors, tradeoffs, and observability in Go

Charles: One thing I do miss sometimes coming from other languages is that you've got an exception type, and each of those exceptions is a particular type. It's a file-not-found error or a network error or whatever it is.

Whereas with Go, most of those things are just strings. So if you're going to do any kind of analysis, like how many file-not-found errors did I get, that could be quite tricky in Go because they're just a whole bunch of strings.

But at the same time, it makes it really simple to create these really rich errors. They're really easy, as an engineer trying to solve a problem, to get that context of what's going on. It's a bit of good and bad.

Donia: So I started my career with Go. And for me, it was very natural to have error types. That was like, "no, I want to create new types of errors" if I had something specific. And it was a reflex to just check if there was the type of error that I wanted already in the library.

I did one year of Java in a company and I was playing with exceptions, and I was confused, actually. I want to define my own error type, because it's something very specific. And I want to type it for that type of library that I'm dealing with.

So what Charles is talking about is really interesting—the way exceptions can be, in general, easier maybe. But I find that error types and being more granular is easier to read in the code and to understand when you're debugging, too.

Profiling with pprof

Mat: So then profiles. I know Go has pprof. What is pprof?

Charles: It's a tool that allows you to measure the performance of your Go application. And it can show you a bunch of different profiles. The ones that I use most often are CPU time. It's literally just how much time is spent in different functions. And the other one I spend a lot of time looking at is memory consumption, like peak in-use memory consumption.

Donia: I was very intimidated at the beginning of my career by pprof, actually. Do you have any advice for someone getting started with it?

Bryan: I was just thinking to myself, actually, that there were one or two gotchas. The big one that catches some people is that they dive into CPU profiling when they don't actually have a CPU problem.

They're not running out of CPU. They've got a program that's slow, and so they think, "Oh, profiling." Then it turns out that this program is slow because it's waiting on some other program, like a database, and a profile will not show you that.

The simplest way to watch out for it is to watch your CPU meter. If it's ticking along at sort of 0.1 CPU usage or something like that, then it's very unlikely that profiling is going to get you anywhere. Whereas if the fans are all running, 18 CPUs going in parallel, then that's probably a good one to point the CPU profiler at.

The next thing is that it's almost always memory allocation in Go that is causing issues. If you do have a CPU problem, look at the memory profile, is my next top tip.

eBPF and observing the “dark side” of systems

Donia: eBPF, for people who maybe don't know what it is, is a way to write C programs, BPF programs, in the Linux kernel to dynamically observe or secure your kernel.

That's very powerful. But it can be very daunting and costly to write BPF programs. So having Go wrappers on top of that is very interesting.

Something I personally like about eBPF is that you can actually access dark sides of your kernel that you can't access from user space.

What Go could improve

Charles: One thing that would be really helpful is if, when you put out a stack trace, it could say, “give me the pointer address of this pointer, and print out this value from that struct.” The other thing that's kind of related is errors. I'd love to be able to get a stack trace reliably for an error.

Bryan: I would love more flexibility, and it's probably more in the debugging tool than in Go itself.

“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com.
