
The rise of agentic AI in production: Can observability systems run themselves?
Sometimes the biggest shifts in technology aren’t about collecting more data — they’re about who (or what) gets to act on it.
In this episode of “Grafana’s Big Tent” podcast, host Tom Wilkie, Grafana Labs CTO, is joined by Spiros Xanthos, Founder & CEO of Resolve AI, Manoj Acharya, VP of Engineering for Observability at Grafana Labs, and Cyril Tovena, Principal Engineer on the Grafana Assistant team, to discuss agentic AI in observability.
They talk about automated root cause analysis (RCA), knowledge graphs, trust and skepticism among SREs, pricing challenges in an agent-first world, and the controversial question: how soon will agents run production systems on their own?
You can watch the full episode in the YouTube video below, or listen on Spotify or Apple Podcasts.

(Note: The following are highlights from episode 7, season 3 of “Grafana’s Big Tent” podcast. This transcript has been edited for length and clarity.)
From tools to operators: The Resolve AI thesis
Spiros Xanthos: My realization was that, despite having unlimited access to our own tools, the majority of our time at scale was actually spent running, debugging, and maintaining the system. So Yuri Shkuro and I decided to think through: what is the way to do this in an AI-first way?
Essentially, trying to build the agents that are not trying to replace the existing tools, but rather work alongside humans in using these tools to improve reliability of production systems, starting with being on call and being able to troubleshoot when something goes wrong. And you know, that's kind of the origin story. I do believe it's a very hard problem, but it's also something that is essential, especially in the era of AI coding, where we're going to be producing a lot more software. And we'll definitely need the assistance of AI in running it.
Tom Wilkie: Yeah, that's a really interesting point. I get the question a lot from our customers at Grafana Labs. Is agentic AI going to mean I can hire fewer engineers? I definitely feel like the answer is “no.”
Knowledge graphs, context graphs, and reasoning
Manoj Acharya: There are very powerful graph algorithms, like centrality. If you know where everybody's calling that service or that database, and then that database or that service is having trouble, then that's likely the root cause, you know? So that was kind of the genesis of the knowledge graph product itself.
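To make the centrality idea concrete, here's a minimal hypothetical sketch (the call graph, service names, and alert data are invented for illustration): rank services by how many callers depend on them, then flag the most-depended-on unhealthy service as the strongest root-cause candidate.

```python
# Hypothetical sketch of centrality-based root-cause ranking.
# Edges point from caller to callee: (caller, callee).
from collections import Counter

call_graph = [
    ("web", "checkout"), ("web", "search"),
    ("checkout", "payments-db"), ("search", "payments-db"),
    ("billing", "payments-db"),
]
unhealthy = {"checkout", "payments-db"}  # e.g. services with firing alerts

# In-degree centrality: how many callers each service has.
in_degree = Counter(callee for _, callee in call_graph)

# Among unhealthy services, the one most others depend on is the
# likeliest culprit: a problem there explains the downstream alerts.
candidate = max(unhealthy, key=lambda s: in_degree.get(s, 0))
print(candidate)  # payments-db: three callers depend on it
```

A real system would build the graph from traces or service discovery, but the ranking heuristic is the same.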
Tom: Do you take any of these concepts in Resolve? Like, do you have the concept of a knowledge graph? How are you handling that?
Spiros: Yes. What Manoj was describing, in my opinion, applies in the era of agents, models, and tool calling. Because when you're dealing with a production system, first of all, production systems are not documented. Or rather, documentation lives in tools, human minds, and documents, and is often outdated. So it does help to have a context graph, as it's called these days, or a knowledge graph, that tries to tie these things together. It doesn't have to be perfect, in my opinion, but it gives the agents something they can use to reason about dependencies, changes, and what might affect a problem that you see. A user might be experiencing something, but it might derive from a database that is maybe three or four steps down. So Resolve tries to essentially reconstruct that context graph, let's call it, and maintain it offline, but the agents also have the ability to navigate the production system in real time and have these paths created as they reason about problems or tasks.
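A minimal sketch of that kind of graph navigation, assuming an invented dependency graph and health data: starting from the user-facing symptom, walk downstream dependencies and collect the unhealthy ones, even when the real problem sits several hops down.

```python
# Hypothetical sketch: BFS from a symptomatic service through its
# dependency chain. The graph and health map are invented; a real
# context graph would come from traces, configs, and deploy metadata.
from collections import deque

depends_on = {
    "frontend": ["api-gateway"],
    "api-gateway": ["orders"],
    "orders": ["inventory"],
    "inventory": ["orders-db"],
    "orders-db": [],
}
healthy = {"frontend": False, "api-gateway": True, "orders": True,
           "inventory": True, "orders-db": False}

def trace_symptom(start):
    """Walk downstream from the symptomatic service and collect every
    unhealthy dependency along the way, however deep it sits."""
    seen, queue, suspects = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        if not healthy[node]:
            suspects.append(node)
        for dep in depends_on[node]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects

print(trace_symptom("frontend"))  # ['frontend', 'orders-db']
```

Here the user-facing symptom is on `frontend`, but the other unhealthy node, `orders-db`, is four hops down, exactly the "three or four steps" case described above.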
A deadlock bug — and how AI found it
Spiros: I think the most interesting problems are things that I've seen many times. And I've seen some even in our own environment, where you usually have a latent problem. Maybe there is a code change. Maybe there is something that has changed in the environment. Maybe a load test breaks something slowly. And I think that's usually very, very hard for humans to dissect.
Oftentimes, it's even a concurrency problem, right? Which, by itself, is very hard. And I think I've seen now, over the last few months, our agent becoming better and better at removing the constant noise that exists in the background and actually getting to these causality chains that are very, very hard for a human to troubleshoot. One example in our environment: there was this bug where somebody introduced a function that was calling a database to update something without deterministic lock ordering. And then a few days later, we received a lot of events that made us automatically spin up a bunch of pods to process all these events. They all started hitting the database and created a deadlock.
And our team was trying to figure out what was going on in the moment, in the panic moment. Somebody started debugging by themselves, and Resolve was running in parallel. And it ended up finding this latent bug, introduced three days earlier, that was now manifesting as a deadlock issue. And even more impressively: that code was actually changed multiple times after the bug was introduced, but none of those changes actually changed the behavior, right? So it was able to dissect that as well.
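For readers unfamiliar with this bug class, here's a minimal hypothetical sketch (not Resolve's or anyone's actual code): two workers that grab the same pair of locks in opposite orders can deadlock, and forcing a single canonical acquisition order removes the cycle.

```python
# Hypothetical sketch of non-deterministic lock ordering, and its fix.
# If one worker takes (A, B) while another takes (B, A), each can end
# up holding one lock and waiting forever for the other. Sorting the
# locks into a canonical order before acquiring breaks the cycle.
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()

def update_rows(first, second, results, tag):
    # Deterministic ordering: always acquire the locks in the same
    # order, regardless of the order the caller passed them in.
    first, second = sorted((first, second), key=id)
    with first:
        with second:
            results.append(tag)

results = []
t1 = threading.Thread(target=update_rows, args=(lock_a, lock_b, results, "t1"))
t2 = threading.Thread(target=update_rows, args=(lock_b, lock_a, results, "t2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # both workers finish; no deadlock
```

Databases hit the same pattern at the row-lock level, which is why the incident above only surfaced once many pods were hammering the same tables concurrently.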
I think it's very, very important to provide evidence, and if the agent doesn't have high confidence, not to try to say it does. Because in the same way that this is impressive, it can also lead you down the wrong path very easily.
Trust, grounding, and skeptical engineers
Cyril Tovena: I feel like it's a bit the same as any other tech, where you have early adopters and laggards, and obviously laggards are much more skeptical and early adopters are hyped. And I think where it makes a big difference is the accuracy and the trust that the engineer has.
I found this technique to be quite nice: always ask the LLM to cite its findings, because if you don't, then it's definitely going to hallucinate an answer. Because that's the thing with LLMs, they always want to answer.
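A hedged sketch of how that citation requirement might be checked programmatically (the answer format and evidence IDs are invented): a finding only counts as grounded if every log line it cites actually exists in the evidence the LLM was given.

```python
# Hypothetical sketch: validate an LLM finding against the evidence set
# it was shown. Evidence IDs and the finding schema are made up.
evidence = {
    "log-101": "ERROR: connection pool exhausted",
    "log-102": "WARN: retrying payment write",
}

def grounded(finding: dict) -> bool:
    """A finding is grounded only if it cites at least one evidence ID
    and every cited ID refers to real evidence."""
    cited = finding.get("citations", [])
    return bool(cited) and all(c in evidence for c in cited)

good = {"claim": "DB pool exhausted", "citations": ["log-101"]}
bad = {"claim": "Disk is full", "citations": ["log-999"]}  # cite doesn't exist
print(grounded(good), grounded(bad))  # True False
```

This doesn't prove the claim is right, but it cheaply rejects answers whose "evidence" was never in the context at all, which is the failure mode Cyril describes.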
We had a lot of skeptics at the beginning of AI in the company. And I think the Claude Code moment was probably the moment where everyone was like, yeah, I think I get it now.
Spiros: I agree. It's happening a lot faster than other similar kinds of transformations. My thesis at the moment — and I see this with Resolve — is that agents for observability and production are going to work as well as agents for coding. And in fact, it is a harder problem, right? So potentially the value here is greater, considering all the software that already exists out there, plus all the new software we're creating now with AI, right? And I do think it's going to happen, that breakthrough moment you were describing as the Claude Code moment, sometime this year for AI that is working in production and observability.
What’s next?
Tom: How far out is it that software agents run software for you, fully autonomously?
Spiros: I'll give you my view of how the future might look — a version of it, at least. And I would ask: how do you hold a human accountable, to some extent, for mistakes? When we started two years ago, people were very skeptical of an agent in production at all, or an agent working on the problem. I would say over the past year, that skepticism has been removed. Coding agents, especially, paved the way. I think right now, people are very comfortable with agents doing their work and maybe a human having the final say. Over the next few quarters, this next year, I think we're going to remove the need for a human to be in the loop for the agent to take the final action, potentially for debugging a problem or performing a task in production. I do expect maybe the majority of incidents to have an automated resolution by an agent by the end of the year. So in a way, maybe by the end of the year, agents are going to be on par, in terms of ability for this particular problem, with maybe a senior software engineer.
Of course, we have to be thoughtful on the way there and the consequences. But I am an optimist in terms of what happens next.
Cyril: I'm also quite optimistic about this. And I think it's already happening right now. We can see it in the coding space. Again, I like to relate to the coding space because it feels like it's just a bit ahead of AI for observability. And you can see now, with those feedback loops, that the agent is actually autonomously writing tests and then iterating until the tests pass.
The end goal is definitely to try to automatically resolve incidents. But I do think there is a path to that. You need to first show customers that they can trust the software. And then once you're there, it will become clear that you can also give it access to maybe rolling back or changing your code directly. It's the same story we had internally. I remember when we started using Cursor, we had a lot of people who were against using background agents. But then, over time, with trust, it just became natural.
“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com.


