The evolution of OpenTelemetry: A deep dive with co-founder Ted Young

2026-02-13 · 8 min

Sometimes the biggest challenges in software aren’t about code — they’re about consensus. What do we call things? What do we standardize? And how do you evolve a system that thousands of companies depend on without breaking everything along the way?

In this episode of “Grafana’s Big Tent” podcast, hosts Mat Ryer, Principal Software Engineer at Grafana Labs, and Matt Toback, VP of Culture at Grafana Labs, sit down with Ted Young, co-founder of OpenTelemetry and Developer Programs Director at Grafana Labs, to discuss the evolution of observability standards, the realities of tracing adoption, and why OpenTelemetry can feel both painstakingly deliberate and relentlessly fast.

They also unpack the “three pillars” myth, the hidden complexity of logging, language-level constraints, and the push towards instrumentation that is truly zero-touch.

You can watch the full episode in the YouTube video below, or listen on Spotify or Apple Podcasts.

(Note: The following are highlights from episode 6, season 3 of “Grafana’s Big Tent” podcast. This transcript has been edited for length and clarity.)

Setting the stage: OpenTelemetry and standardization

Ted Young: OpenTelemetry is a telemetry system. You can think about it as a new way of dividing up the observability stack. We used to divide the stack up by signal. So you would take one style of observability, like tracing, logging, metrics, or profiling, and then you'd be like, “I'm going to make a logging system.” You would make a logging format, and then make a logging API and logging client to generate those logs and process them. And then you’d make a database to store those logs and a UI for looking at the logs in that database. And that would be your logging system. But then someone would come along and want a metric system. So, they would redo all of that work, but just for that metric system.

The new way of dividing it up is to say there's telemetry, there's applications and systems, and computer resources. And they're all self-describing. They're all just describing what they are doing. And when they're describing what they're doing, they are all trying to speak the same language. Because when you're trying to understand what they're doing, you're never looking at any of these pieces in isolation. You're looking at them together. So if they can all speak the same language to describe what they're doing, then it'll be a lot easier to form a comprehensive picture.

And since we're talking about describing what computers are doing, and most of what computers do is already standardized (talking over network protocols, for example), it stands to reason that it should be possible to standardize the language that systems use to describe themselves. If you wanted to do that, you would want to decouple it from things like analysis, storage, and the user experience. All of that remains greenfield.

And so that's kind of the new way of slicing up observability: saying we have telemetry, which we want to standardize, and we all want to share. 
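To make the "shared language" idea concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK, and the function names are illustrative; only the attribute keys are modeled on OpenTelemetry's semantic conventions. The point is that logs, metrics, and spans all carry the same self-describing resource attributes, so a backend can correlate them.

```python
# Illustrative sketch (plain Python, not the OpenTelemetry SDK): three
# different signals describing the same service in the same language.

# Shared, self-describing resource attributes; keys modeled on
# OpenTelemetry semantic conventions.
RESOURCE = {
    "service.name": "checkout",
    "service.version": "1.4.2",
    "host.name": "web-01",
}

def make_log(body: str) -> dict:
    return {"signal": "log", "resource": RESOURCE, "body": body}

def make_metric(name: str, value: float) -> dict:
    return {"signal": "metric", "resource": RESOURCE, "name": name, "value": value}

def make_span(name: str, duration_ms: float) -> dict:
    return {"signal": "span", "resource": RESOURCE, "name": name, "duration_ms": duration_ms}

# Every signal speaks the same vocabulary, so they can be joined by service.
signals = [
    make_log("payment accepted"),
    make_metric("http.server.request.count", 1),
    make_span("POST /checkout", 42.0),
]
assert all(s["resource"]["service.name"] == "checkout" for s in signals)
```

Because the resource attributes are shared rather than per-signal, nothing about storage or analysis is constrained: any backend that understands the vocabulary can form the comprehensive picture Ted describes.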

Tracing, logs, and the myth of the “three pillars”

Ted: Traditionally, logging and distributed tracing get talked about like they're two totally separate things, but that's really kind of a human historical accident... We're discovering that tracing is just logging with the context that you always wanted for your logs. Like, “What transaction are these logs a part of?” It's crazy to me how many decades went by without us being able to just look up the other logs in the same transaction.

Mat Ryer: So you’re like that guy at the party who, you know, comes in with the more existential questions, like, “What if logs and events were really just traces?” Everyone's like, “Go home.”

Ted: It's kind of the opposite. I hate being existential about this stuff. I feel like I struggle uphill against ideology because people have already pre-bucketed what these tools are useful for… Tracing is seen as a latency analysis tool. And it can be really expensive, which means you have to do a lot of sampling. And because you have to do all that sampling up front, tracing's really not useful for, like, firefighting or root cause analysis, because you just don't have the data, unlike in your logging system, where you keep everything.

The only reason tracing is heavily sampled is because people were trying to add it on top of a really expensive logging system that stored everything, but had no trace context. But they couldn't touch that system and add trace context to it, so it’s almost like they built a second logging system called tracing. And since that thing had only a tiny amount of resources left over from the logging system, it was heavily sampled to do some latency analysis. So that's just a historical accident that we built it that way. That's not because tracing and logging are really different from each other.
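The "tracing is logging with transaction context" idea can be sketched with Python's standard library alone. This is not OpenTelemetry's API; `TraceContextFilter`, `ListHandler`, and `handle_checkout` are hypothetical names used only for illustration. Every log record gets stamped with the active transaction's `trace_id`, which is exactly what lets you "look up the other logs in the same transaction."

```python
# Stdlib-only sketch: stamp each log record with a trace_id so all logs
# from one transaction can be queried together. Names are illustrative,
# not OpenTelemetry APIs.
import contextvars
import logging
import uuid

current_trace_id = contextvars.ContextVar("trace_id", default="none")

class TraceContextFilter(logging.Filter):
    """Attach the active trace_id to every record passing through."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

class ListHandler(logging.Handler):
    """Collect records in memory so we can query them afterwards."""
    def __init__(self):
        super().__init__()
        self.records = []
    def emit(self, record):
        self.records.append(record)

handler = ListHandler()
logger = logging.getLogger("checkout")
logger.addFilter(TraceContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_checkout():
    current_trace_id.set(uuid.uuid4().hex)  # one id per transaction
    logger.info("validating cart")
    logger.info("charging card")

handle_checkout()
# "Look up the other logs in the same transaction":
tid = handler.records[0].trace_id
same_txn = [r.getMessage() for r in handler.records if r.trace_id == tid]
assert same_txn == ["validating cart", "charging card"]
```

Nothing here required a second telemetry system: the "trace" is just a shared id carried alongside ordinary logs, which is the point Ted is making about the historical accident.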

High latency, high throughput

Ted: OpenTelemetry is this curious creature. I consider OpenTelemetry's unofficial mascot to be the racing snail from “The NeverEnding Story,” if you remember that thing. OpenTelemetry is high latency, but high throughput, which means when you go and engage with it on any one particular thing, you're like, “This is so slow, designed by committee.” It can really feel that way, because unlike a lot of other open source projects, OpenTelemetry doesn't really get any take-backs. Generally speaking, if we put it out there and it gets adopted and we break it, people hate us forever. That's much stronger than with most projects, where if something's new and we break it, we're just like, “Well, it's new open source.”

Mat: Doesn't everyone feel like that? If someone loves it and you take it away?

Ted: No, no, no. I feel more so than other projects, OpenTelemetry gets punished hard for breaking backwards compatibility. At certain levels or layers, we have to be very strict about it. With the API or data layer, in particular, if we create any kind of dependency conflict where, say, this library depends on OpenTelemetry 1.0 and this other library depends on OpenTelemetry 2.0, now these libraries won't compose because of OpenTelemetry. So we really have to care about it.

Why didn’t tracing take off sooner?

Matt Toback: Do you think that if the fundamental architecture was different, distributed tracing would have gotten traction sooner?

Ted: I think the hardest thing with distributed tracing is installing it because the context that you're getting from it is the execution flow that you're following. But language runtimes don't really give you a way to follow that execution flow effectively from an observability standpoint. If you just rely on thread-locals in languages that have threads, you might get some of it. But work often switches railroad tracks, right? Like, work will move from one thread to another, or a set of threads will get organized into some kind of scatter-gather thing or something like that. 

The trickiest part of OpenTelemetry in every language is this context propagation mechanism that we have to come up with. And then every single piece of instrumentation has to use the same cross-cutting mechanism. That's harder to roll out and get value out of than logging or metrics, where I can just roll out metrics in this corner just for me.
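The "work switches railroad tracks" problem can be shown in a few lines of standard-library Python. This is a simplified sketch, not OpenTelemetry's actual propagation machinery: when a request hops from the event loop to a worker thread, thread-local state is left behind, while a `contextvars` context is copied along with the logical flow of work (`asyncio.to_thread` does this copy).

```python
# Sketch of the context-propagation problem: thread-locals lose the trace
# context when work moves to another thread; contextvars follow the work.
import asyncio
import contextvars
import threading

thread_local = threading.local()
trace_ctx = contextvars.ContextVar("trace_id", default=None)

def do_work():
    # Runs on a worker thread: the caller's thread-local state is gone,
    # but the contextvar was copied along with the logical flow of work.
    return trace_ctx.get(), getattr(thread_local, "trace_id", None)

async def handle_request():
    thread_local.trace_id = "abc123"  # thread-local "context"
    trace_ctx.set("abc123")           # properly propagated context
    # asyncio.to_thread copies the current context to the worker thread.
    return await asyncio.to_thread(do_work)

ctx_value, tls_value = asyncio.run(handle_request())
assert ctx_value == "abc123"  # context followed the work
assert tls_value is None      # thread-local did not
```

Real OpenTelemetry SDKs have to solve this in every language runtime, including ones with no `contextvars` equivalent, and every piece of instrumentation has to agree on the one mechanism, which is why this is the trickiest part.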

If we're saying the value of tracing is this distributed context, that means you have to roll tracing out across all of those services before you get that value. So that was kind of the problem. When you combine the amount of work with people thinking, “I'm going to have to rip all this out and replace it if I ever switch vendors,” it's just kind of dead on arrival outside of an organization like Google or Microsoft or Xerox PARC, where there's an enormous internal engineering culture that can make it worthwhile to go ahead and do that. So that was the real blocker.

Moving towards zero-touch observability

Ted: The future I would like to see is that no one has to install any OpenTelemetry instrumentation anywhere because when people write software, they're thinking about how their software is going to be run. And they're instrumenting it and they're shipping a playbook for the people running their software, letting them know how to actually tune the configuration parameters that they gave them and all of that. So that's the world I want to see. A lot of people right now don't think about observability and I think, in part, it’s because they don't have tools to instrument their own stuff. Without OpenTelemetry, it's actually hard to do native instrumentation. So long term, I would like to see that and I would like to see everyone become more of an observability expert as a side effect of that.

Matt: Do you think AI is going to help us get there?

Ted: I feel like with AI, there's a 90% chance nothing changes except our tools get more complicated, a 10% chance the world ends up in a Terminator disaster, and very little in the middle. I'm very bimodal in my AI predictions.

“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com.
