OpenTelemetry: past, present, and future
Here’s something most people probably didn’t have on their 2024 bingo cards: the terms “gossip” and “scandalous details” popping up on an episode of “Grafana’s Big Tent.” But if you did, congratulations!
It happened during a conversation about OpenTelemetry, in which co-hosts Mat Ryer, Grafana Labs Engineering Director, and Matt Toback, Grafana Labs VP of Culture, playfully tried to get some scoop about the inner workings of the OpenTelemetry Governance Committee from two of its members, podcast guests Juraci Paixão Kröhling, Grafana Principal Engineer, and Daniel Gomez Blanco, Principal Engineer at Skyscanner. (Truth be told, the most scandalous detail to come out of the episode might be that Matt still owns a Zip drive cable . . .)
You can read some of the show’s highlights below, but listen to the full episode to hear more about instrumentation and distributed tracing, as well as OTel history (including its early names), TBT (trace-based testing), ODD (observability-driven development), and Mat’s Broadway moment.
Note: The following are highlights from episode 9, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.
The basics of OpenTelemetry
Mat Ryer: For anyone not familiar with OTel, what is it?
Daniel Gomez Blanco: OpenTelemetry is a framework that allows you to instrument, to collect, to process, and then to transport telemetry data. It’s focused on cloud native systems, and one of the major goals is that you can produce all that telemetry from any application, any library, any language, in a standard format and using a standard set of naming conventions as well, and then push that to any backend. It’s vendor-neutral, so you can do whatever you want with your telemetry. You’re not forced to use a particular solution or backend.
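To make that concrete (this is an editorial sketch, not from the episode), here is roughly what a minimal setup looks like in Go: the application talks only to the vendor-neutral OpenTelemetry API, the SDK batches spans, and an OTLP exporter ships them to whichever backend you point it at. The service and span names, and the default localhost:4317 endpoint, are assumptions for the example.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP is the vendor-neutral wire protocol: any compatible backend can receive it.
	// By default the gRPC exporter targets localhost:4317 (an assumption for this sketch).
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	// The SDK batches finished spans and hands them to the exporter.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)

	// Application code only ever talks to the vendor-neutral API.
	tracer := otel.Tracer("checkout-service") // illustrative instrumentation scope name
	_, span := tracer.Start(ctx, "process-order")
	// ... do the actual work here ...
	span.End()
}
```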
Mat: Before that, we had all these different systems, and they had different traditions, different names, different techniques. So the goal is to unify it all and get everyone behind one set of standards.
Daniel: Yeah, it’s one set of standards as well as something that is really, really important for observability, which is context. We’re no longer thinking about your metrics here, and your logs here, and your traces here. You have everything as part of one set of telemetry data that is correlated in order to allow you to get better insights from your applications.
Juraci Paixão Kröhling: One of the nice things about OpenTelemetry is that it came to be when existing tools were very successful already, so we tried to make OpenTelemetry play nicely with the things that came before us. Of course, not everything is 100% philosophically compatible, but we tried to make it so that people could use OpenTelemetry even if they don’t like every part of it, or even if they were already happy with other parts.
Learning the rules
Mat: You mentioned logs, metrics, and traces. As an everyday developer, how do you know which of these telemetry types to use? Is there a set of rules, or is this intuition and experience that you have to learn over time?
Daniel: It depends so much on the system. The first question to ask is whether you should be adding that telemetry yourself, or relying on telemetry emitted by instrumentation libraries or by the libraries themselves. One of the things that OpenTelemetry enables is for instrumentation authors, as well as library authors, to write that telemetry for you. So let’s say you pick an open source library that already comes instrumented with the metrics that you need, or with the spans and the tracing instrumentation that you need — you don’t need to add that yourself.
When we’re talking about instrumenting our own applications, I think it’s important to choose the telemetry that is right for your use case. If you’re looking to drive, for example, alerts or dashboards or something that you look at long-term, you’re probably looking at time series data, so you need metrics. If you’re looking for really high-granularity insights, and you’re interested in how it all fits together in a complex, distributed system, and you’re interested in context, then you want traces. And if you want to integrate with legacy systems that, perhaps, didn’t produce any spans or any metrics, then you want to rely on logs.
But the important thing is that you get all those parts of the context, the same set of correlated data. And that’s something OTel helps with — to bridge that gap.
Another way of thinking about it is going into multiple different levels of granularity. You may start from metrics, and then you want to correlate that to your traces, and then maybe keep going down in your level of abstraction and then going down into profiles.
The story that we’re trying to tell is that you need all that data and you need it to be correlated.
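As an editorial aside, here is a rough Go sketch of that correlation, with invented tracer, meter, and instrument names: the span and the metric increment share the same context, which is what lets a backend link a point on a dashboard back to the trace behind it.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// handleCheckout is illustrative; the tracer, meter, and instrument names are made up.
func handleCheckout(ctx context.Context) error {
	tracer := otel.Tracer("checkout")
	meter := otel.Meter("checkout")

	// Trace: the high-granularity view of this one transaction.
	ctx, span := tracer.Start(ctx, "handle-checkout")
	defer span.End()

	// Metric: the long-term, aggregated view for dashboards and alerts.
	requests, err := meter.Int64Counter("checkout.requests",
		metric.WithDescription("Number of checkout requests handled"))
	if err != nil {
		return err
	}

	// Recording against the same ctx ties this data point to the active span,
	// so the two signals stay correlated instead of living in separate silos.
	requests.Add(ctx, 1)
	return nil
}

func main() {
	// Without a configured SDK this uses no-op providers; wiring up exporters
	// works the same way as in the earlier sketch.
	_ = handleCheckout(context.Background())
}
```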
Juraci: The mental model I have is a little bit different. When I’m developing an application, I try to put myself in the future — like at 2 a.m. during an outage, looking at my code and thinking, What do I need to know to understand what’s going on? It’s very likely that I don’t need the individual log entries. When I need aggregate information, then I need metrics and I can think, What are the metrics that I need? Do I need a gauge for the size of the queue that I’m handling right now? Do I need a latency for the outgoing HTTP requests? Things like that.
The second mode is, What do I need to understand the context of the transaction specifically? When I’m thinking about the transaction as a whole — microservices — then I know I need a span. I know I need distributed tracing. And when I’m thinking about the neighboring services, the one thing that I know that I need is one span representing my incoming HTTP request or my incoming RPC, and one span representing my outgoing RPC. So that’s all that I really need when it comes to distributed tracing, because then I see the whole chain in a trace. I feel like logs are mostly for the life cycle events of my application.
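In Go, for instance, the contrib otelhttp instrumentation produces exactly those two spans, one for the incoming request and one for the outgoing call, without writing them by hand. A rough sketch, with an invented downstream URL:

```go
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Outgoing RPCs: wrapping the transport creates one client span per request
	// and propagates the trace context on the wire.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reusing the request context makes the outgoing call a child span
		// of the incoming one, so the whole chain shows up in a single trace.
		req, err := http.NewRequestWithContext(r.Context(), http.MethodGet,
			"http://downstream.example/api", nil) // illustrative downstream service
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		resp.Body.Close()
		w.WriteHeader(http.StatusOK)
	})

	// Incoming requests: wrapping the handler creates one server span per request.
	log.Fatal(http.ListenAndServe(":8080", otelhttp.NewHandler(handler, "incoming-request")))
}
```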
Making a game out of learning
Matt: A lot of times, when we talk about how to instrument your application, there’s this feeling that you’re choosing instrumentation for your future self to troubleshoot. How often do you see differences in team approaches on the same application?
Daniel: It depends on what teams are used to. One of the things that I’ve been doing recently with some of the teams at Skyscanner is going through the OTel demo and turning it into a bit of a game. We put them in front of the demo, we inject some failures into the system, and then ask them to go and debug it. We try to have different teams compete against each other and see who gets to the root cause first.
You get to the root cause a lot faster if you use context than if you start from, perhaps, an alert that we set up that’s driven by metrics. Just by looking at the metrics, you won’t be able to answer the questions “What is the root cause here? What made it fail?”
Matt: Have you seen eureka moments when someone discovers that they’re able to see more than they were before?
Daniel: There was an engineer who was adamant that they were able to debug this through logs alone, because that’s what they were used to. And they didn’t win. People in another room were using tracing and were a lot further advanced in finding the root cause, so that was fun.
Why some people still don’t understand observability
Juraci: I also think about telemetry when writing code, but I think there is some barrier to entry there. Out of curiosity, I was looking at the curriculum for some computer science classes in Brazil — but I think this applies elsewhere — and we teach people how to write operating systems, but we don’t tell people how to monitor them. We don’t teach them how to do observability, like how to understand what’s going on, and this is a real gap.
People have a hard time understanding distributed tracing. It is a hard topic. It is difficult to imagine that parts of that data are going directly to a collector somewhere, and parts of the data are being propagated down as part of the RPC. So which one is which? What is where? How does that actually work? Why do I need a trace context there with two types of IDs and flags? Those are things that we just assume that people know, and if they don’t, we provide the tools to them. But then they use them without knowing what’s in there, and it makes it very hard to debug when things go wrong with the telemetry systems they have.
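For readers wondering what those IDs and flags look like: they travel in the W3C traceparent header as a 16-byte trace ID, an 8-byte parent span ID, and a flags byte carrying the sampling decision, while the span data itself is exported separately. A small Go sketch (the hex IDs are the W3C spec’s own example values, not real data):

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	// Use the W3C Trace Context propagator (traceparent / tracestate headers).
	otel.SetTextMapPropagator(propagation.TraceContext{})
	prop := otel.GetTextMapPropagator()

	// Pretend there is an active span; these IDs come from the W3C example.
	traceID, _ := trace.TraceIDFromHex("4bf92f3577b34da6a3ce929d0e0e4736")
	spanID, _ := trace.SpanIDFromHex("00f067aa0ba902b7")
	sc := trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    traceID,
		SpanID:     spanID,
		TraceFlags: trace.FlagsSampled, // the trailing "01" sampled flag
	})
	ctx := trace.ContextWithSpanContext(context.Background(), sc)

	// Client side: the context is injected into the outgoing request headers,
	// while the span data itself goes directly to a collector or backend.
	req, _ := http.NewRequest(http.MethodGet, "http://downstream.example/api", nil)
	prop.Inject(ctx, propagation.HeaderCarrier(req.Header))
	fmt.Println("traceparent:", req.Header.Get("traceparent"))
	// Prints: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

	// Server side: the next service extracts the same context so its spans
	// join the same trace.
	serverCtx := prop.Extract(context.Background(), propagation.HeaderCarrier(req.Header))
	fmt.Println("extracted trace ID:", trace.SpanContextFromContext(serverCtx).TraceID())
}
```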
On one hand, I see that the tooling to develop instrumentation is necessary. On the other hand, I think we end up with a huge cost problem, which is already a reality in observability, because these tools generate a lot of telemetry data that we don’t actually need. When we look at the traces we’ve collected, we don’t use most of them. A few years ago, someone mentioned that close to 90% of traces are created, transmitted, and stored, and never seen — all of that work for pretty much nothing.
If we had only manual instrumentation, then we would only store high-value data, and way less data than we have right now. But because it is so difficult to understand and to manually instrument, we end up using those very powerful tools that generate a lot of telemetry we might not need, in the hopes that it becomes useful sometime in the future.
Matt: How often will you prune it?
Juraci: We never do. I think it’s in our nature not to delete something, out of fear that it might be useful in the future. I think this is where academia is way more advanced than we are. There is research going on in this area of turning instrumentation points on and off based on the current state of the system. I think the future is not that we instrument less than we do today; the future is deciding whether we collect, store, and transmit that data in the first place.
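One knob that already exists for this, shown here as a hedged illustration rather than a recommendation, is the SDK’s sampler, which decides whether a trace is recorded and exported in the first place. The 10% ratio below is arbitrary.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	// ParentBased keeps the decision consistent across services: if the caller
	// sampled the trace, follow suit; for new root spans, keep roughly 10% of
	// trace IDs, so the other ~90% are never transmitted or stored at all.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sampler),
		sdktrace.WithBatcher(exporter),
	)
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)
}
```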
The need for a cultural shift
Mat: What is the biggest challenge to adoption? Is it that people just don’t really know it, or are there other points of resistance?
Daniel: One of the challenges is that people don’t know it. You need to change years of culture in which people operate the parts of a distributed system as if they were monoliths — like isolated entities — without thinking of the overall system. And I think that’s something that needs to change from an engineering perspective, in how we operate systems.
There is also another challenge for adoption, which tends to be cost-related, because we generate a lot of telemetry data from distributed systems. Where before you perhaps had one big, massive replica, now you’ve got lots of little components that all produce telemetry. I think it’s important as well to understand that distributed tracing and the context it provides allow you to make better decisions about what to keep and what not to keep.
Cost is a challenge, but it is not an insurmountable one.
Juraci: I think there’s also the situation where people only see the value once they have experience. It takes a while for people to realize that they need observability, and some might not realize it even after years of experience.
Weighing the trade-offs of using OTel
Mat: Are there downsides to using OTel, or trade-offs that we have to be aware of?
Daniel: It depends where you are in your observability journey as an organization. If you’ve got really stable processes, there may be parts of OTel that are not quite stable yet, and there are other parts that are very stable that you could adopt to simplify and consolidate. Some parts are less stable, for example, things related to profiles. So if you’ve got something that is currently working, it’s probably a case where you wait until things get a bit more stable.
If you’re completely greenfield and you’re starting from scratch, your balance will probably tip towards innovation.
Juraci: I think it’s all relative as well. The collector has not reached a v1 yet, so in theory, it is not stable. But it doesn’t stop people like Vijay [Samuel] from eBay from implementing a highly scalable collector pipeline there and giving talks at KubeCon.
We think it’s fine for some things, but not so fine for other things.
Daniel: It does require a little bit of due diligence from a platform engineering team. At Skyscanner, for example, we run collectors at scale as well, handling more than a million spans per second and hundreds of thousands of data points per second. The collector is not v1, but the components that we rely on — the OTLP receiver and exporter, and the processors that we use — are stable.
Governance Committee scoop
Mat: Tell us a little more about what you do on the Governance Committee, in particular if you’re working on anything interesting right now. Or I’d love to hear any gossip about big disagreements, or anything from behind the scenes.
Daniel: I don’t think we disagree much. We’re all quite aligned.
Juraci: All our calls are recorded, so the meeting notes are public and the recordings are available — it is an open source project, after all. We do have private sessions for things related to the code of conduct or complaints from people in the community. It is a big community, so disagreement is bound to happen. Our group is there to balance that, and to make sure that we are moving forward as a community and that we are not ignoring problems that may cause even bigger problems in the future.
But I think our biggest challenge is that even though OpenTelemetry is full of people from vendors — most people there work for vendors — and even though we have that commercial interest behind the project, we’re all, at the end of the day, volunteers. Most of us end up doing things during our working hours, and we are being paid for that, but this is not something that we put on our OKRs.
The difficult part of being on the GC, the Governance Committee, is agreeing on a roadmap and convincing other people that they have to pay attention to what we are saying: “This is what we believe to be important.” Not only listening to us, but also following the path that we are trying to take.
We are software engineers, so we find new toys every single day, and we want to play with those toys. At this point, my candid opinion is that we have too many toys and we have to do some spring cleaning and see what is worth keeping and what is not, and what we should declare as something that is a good idea, but for the future.
What’s coming up for OTel
Mat: Speaking of new toys, what is new? What’s the exciting thing that’s next for OTel?
Daniel: One of the things that’s got everyone excited is profiling as a new signal, and how you can get profiles correlated with your traces. That is just going to open up a whole new avenue to explore in your debugging practices.
Juraci: I’m looking forward to so many things. One of them is the Entities SIG. I like to think of it as our first big refactoring in OpenTelemetry: refactoring the idea of resource attributes. Resource attributes are the context that Dan mentioned before; they tie all of the signals together.
It turns out that not everything we thought would make a good resource attribute actually does. Some attributes are problematic for systems that don’t deal very well with high cardinality, like Prometheus, so we probably do not want to store the process ID as a resource attribute. It’s nice metadata to have about the process, but we don’t actually need it there.
The Entities SIG would break the resource context down into the identity of my resource and the metadata for my resource, so that the linking context between the signals is very thin. I could use that identity in Prometheus and Loki or in different data stores, and the metadata could be stored elsewhere. It doesn’t have to be indexed. It doesn’t have to be part of the identity of the object.
It’s not as exciting as profiling, perhaps, but I think it is very necessary, and it’s a sign of maturity that we are doing a refactoring.
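To make the resource-attribute part concrete, here is a hedged Go sketch with invented values: today the SDK attaches all of these attributes to every signal the service emits, and the Entities work is about keeping a thin identity as the link between signals while descriptive metadata, like the process ID, can live elsewhere. The semconv import path below is just one recent version.

```go
package main

import (
	"fmt"
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func main() {
	// A resource describes the entity producing telemetry; the SDK attaches it
	// to every span, metric, and log record that this service emits.
	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		// Identity-like attributes: the thin link you want across signals and stores.
		semconv.ServiceNameKey.String("checkout"),         // invented service name
		semconv.ServiceInstanceIDKey.String("pod-abc123"), // invented instance ID
		// Descriptive metadata: nice to know, but high-cardinality for stores
		// like Prometheus, so it does not need to ride along as identity.
		attribute.Int("process.pid", os.Getpid()),
	)
	fmt.Println(res.Attributes())
}
```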
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.