
A tale of two incident responses: How our AI assistant found the root cause 3.5x faster
About two months ago, an incident at Grafana Labs kicked off in typical fashion: a series of alerts fired, our on-call engineer acknowledged them on Slack, and the rest of the team quickly began hypothesizing about the potential culprit.

But the way the incident was resolved was anything but typical.
Yes, our internal team followed best practices to resolve the incident as quickly as possible. But at the same time, Grafana Assistant Investigations, an AI-powered tool we’d been developing internally to accelerate multi-step incident investigations in Grafana Cloud, quickly got to work doing the same task.
So while our on-call engineers were digging through dashboards and logs, the AI assistant quietly spun up its own background investigation. Eight minutes later, it found the root cause, 20 minutes before our on-call team came to the same conclusion.
Fast forward to today, and Assistant Investigations is now available to all Grafana Cloud users in public preview. It gives you the same functionality to assist, not replace, your engineering team, so it can focus on delivering great products instead of putting out fires.
Two paths, one root cause
Before we take a closer look at how this all played out, let’s quickly get you up to speed on Assistant Investigations.
It coordinates a swarm of specialized AI agents to analyze your observability stack, diving into your metrics, logs, traces, and profiles to find anomalies and build up a picture of your system.
Assistant Investigations collects evidence in parallel, generating findings and hypotheses so it can provide actionable recommendations for mitigation and remediation. And since it’s embedded directly into Grafana Assistant, our AI chatbot designed specifically for Grafana, you get a seamless, guided workflow for resolving complex incidents.
OK, now let’s get into the results.

In the end, it turned out that a small part of one of my PRs, an AI-generated SQL query, was indeed the culprit. I’d written it to analyze Assistant usage. It worked fine in staging, passed CI, and even got reviewed thoroughly, but in production, it broke the database.
And to be fair, most of the time was spent determining if it was indeed our deploy that was causing the alerts to go off. But still, this highlights a major benefit of Assistant Investigations: It can start an investigation (and, in this case, finish it) before the people on-call even get fully up-to-speed with what’s going on.
The incident: Some AI-generated code makes its way to production databases
That “harmless” SQL query for cohort analysis created an unbounded join that ballooned under load, saturating database connections and throttling CPU.
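To make that failure mode concrete, here's a minimal, hypothetical sketch (the post doesn't show the actual query) of how a join missing its predicate balloons. SQLite, like most databases, treats a `JOIN` without an `ON` clause as a cross join, so the intermediate result grows as |users| x |events| instead of |events|, which is exactly the kind of blowup that stays invisible on a small staging dataset:

```python
import sqlite3

# Illustrative only: tiny in-memory tables standing in for production data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users(id INTEGER PRIMARY KEY, signup_week INTEGER);
    CREATE TABLE events(user_id INTEGER, week INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, i % 4) for i in range(200)])
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i % 200, i % 8) for i in range(1000)])

# Unbounded: no ON predicate, so every user pairs with every event.
unbounded = conn.execute(
    "SELECT COUNT(*) FROM users u JOIN events e"
).fetchone()[0]

# Bounded: the join predicate keeps rows proportional to |events|.
bounded = conn.execute(
    "SELECT COUNT(*) FROM users u JOIN events e ON e.user_id = u.id"
).fetchone()[0]

print(unbounded, bounded)  # 200000 1000
```

With 200 users and 1,000 events, the unbounded version already materializes 200,000 rows; at production scale, that same shape saturates connections and CPU long before anyone notices the missing predicate.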
Our on-call engineers reverted deploys, checked SQL logs, and restarted affected clusters.
At the same time, Assistant Investigations ran in the background, testing multiple hypotheses in parallel. It looked at logs, metrics, traces, and profiles to compare query volume, latency spikes, and recent deploy diffs. Ultimately, it accurately flagged the /cohort-analysis endpoint as the likely culprit.
Just like an SRE team would during an incident, the AI system deployed specialized agents to investigate different hypotheses simultaneously.

What you’re seeing in the image above:
- Multiple investigation branches running in parallel (not sequential)
- Specialized agents for different data sources (Prometheus, Loki, Tempo)
- Hypothesis testing with confidence scores for each theory
- Cross-correlation between metrics, logs, and traces
We use specialized agents that work together to get to the root cause: metrics agents checked MySQL connection pool metrics, logging agents validated deployment timing, and tracing agents analyzed database transaction patterns. Working in parallel, they built a complete picture of what went wrong, 3.5x faster than our on-call team.
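The fan-out/fan-in pattern described above, where independent agents each test one hypothesis and the results are ranked by confidence, can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Grafana's actual implementation, and the agent names, hypotheses, and confidence values are invented for the example:

```python
import asyncio

# Each "agent" checks one hypothesis against one data source.
# In a real system these would query Prometheus, Loki, or Tempo;
# here they just return canned findings after a short delay.

async def check_metrics():
    await asyncio.sleep(0.01)  # stand-in for a Prometheus query
    return {"source": "metrics",
            "hypothesis": "connection pool exhausted",
            "confidence": 0.91}

async def check_logs():
    await asyncio.sleep(0.01)  # stand-in for a Loki query
    return {"source": "logs",
            "hypothesis": "regression in latest deploy",
            "confidence": 0.78}

async def check_traces():
    await asyncio.sleep(0.01)  # stand-in for a Tempo query
    return {"source": "traces",
            "hypothesis": "slow tenant-limit queries",
            "confidence": 0.83}

async def investigate():
    # Fan out: run all hypothesis checks concurrently.
    results = await asyncio.gather(check_metrics(), check_logs(), check_traces())
    # Fan in: rank findings so the strongest hypothesis surfaces first.
    return sorted(results, key=lambda f: f["confidence"], reverse=True)

findings = asyncio.run(investigate())
print(findings[0]["hypothesis"])  # connection pool exhausted
```

The key property is that total wall-clock time is bounded by the slowest single check rather than the sum of all of them, which is where the "3.5x faster" kind of speedup comes from.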

From investigation to action: the AI delivers results

Key elements from this Assistant Investigations report:
- Root cause identification with specific technical details
- Confidence scoring (0.91 confidence, in this case)
- Contributing factors breakdown (slow queries, connection holds)
- Actionable remediation steps with monitoring guidance
- Evidence trail linking metrics to the actual problem
The investigation didn’t just identify the problem—it provided actionable recommendations, confidence scores, and supporting evidence. This was all included in one central location (as shown in the image above), but I do want to call out two of the more important lines of information it shared:
- Root cause confirmed: “Cohort Analysis API exhausted database connection pool across all clusters.”
- Contributing factor: “Slow tenant limit queries taking 3-4 seconds creating sustained connection hold times.” Plus specific remediation steps and monitoring guidance.
AI-generated code is here to stay. How can it be managed properly?
Almost every team now has some kind of AI-authored code running in production.
Sometimes it’s a quick SQL query, sometimes an auto-generated handler or config file. It’s faster to ship, but easier to misunderstand.
The result is that companies today are building systems and products faster than they can truly understand them at their core.
- Velocity without comprehension: More code, less context
- Automation bias: AI output looks right, so we assume it is
- Fewer domain experts: Everyone’s moving faster, but knowing less deeply
This AI-driven productivity is a net positive. And in an ideal world, every AI-generated line would get the same meticulous review as human code. In practice, that’s not sustainable—and problematic code will slip through, whether written by AI or humans.
We built Assistant Investigations to scale your understanding, not just your monitoring. It takes the noisy, chaotic first half of an incident and turns it into an evidence trail, complete with metrics, logs, traces, profiles, MCP integrations, query plans, and confidence scores, all before you’ve even joined the call.
AI SRE tools don’t replace engineers; they amplify them. Humans still make the calls, but now we make them faster, with better data and fewer assumptions.
Try Assistant Investigations today
Incidents aren’t slowing down, and neither is AI. Assistant Investigations helps your teams keep pace, turning alerts into data- and context-backed investigations before your Slack messages start popping off. Try your first investigation in Grafana Cloud today.
FAQ: Grafana Cloud AI & Assistant
What is Grafana Assistant?
Grafana Assistant is an AI-powered agent in Grafana Cloud that helps you query, build, and troubleshoot faster using natural language. It simplifies common workflows like writing PromQL, LogQL, or TraceQL queries, creating dashboards, and performing guided root cause analysis — all while keeping you in control. Learn more in our blog post.
What is Grafana Assistant Investigations?
Assistant Investigations is an SRE agent built directly into Grafana Assistant. It helps you find root causes faster by analyzing your observability stack, uncovering anomalies, and connecting signals across your system. You get clear, guided recommendations for remediation — and because it’s embedded in Assistant, it provides a seamless, end-to-end workflow for resolving complex incidents.
How does Grafana Cloud use AI in observability?
Grafana Cloud’s AI features support engineers and operators throughout the observability lifecycle — from detection and triage to explanation and resolution. We focus on explainable, assistive AI that enhances your workflow.
What problems does Grafana Assistant solve?
Grafana Assistant helps reduce toil and improve productivity by enabling you to:
- Write and debug queries faster
- Build and optimize dashboards
- Investigate issues and anomalies with Assistant Investigations
- Understand telemetry trends and patterns
- Navigate Grafana more intuitively
What is Grafana Labs’ approach to building AI into observability?
We build around:
- Human-in-the-loop interaction for trust and transparency
- Outcome-first experiences that focus on real user value
- Multi-signal support, including correlating data across metrics, logs, traces, and profiles
Does Grafana OSS have AI capabilities?
By default, Grafana OSS doesn’t include the built-in AI features found in Grafana Cloud, but you can enable AI-powered workflows using the LLM app plugin. This open source plugin connects securely to providers like OpenAI or Azure OpenAI, allowing you to generate queries, explore dashboards, and interact with Grafana using natural language. It also provides an MCP (Model Context Protocol) server, which allows you to grant your favorite AI application access to your Grafana instance.
Why isn’t Assistant open source?
Grafana Assistant runs in Grafana Cloud to support enterprise needs and manage infrastructure at scale. We’re committed to OSS and continue to invest heavily in it — including open sourcing tools like the LLM plugin and MCP server, so the community can build their own AI-powered experiences into Grafana OSS.
Do Grafana Cloud’s AI capabilities take actions on their own?
Today, we focus on human-in-the-loop workflows that keep engineers in control while reducing toil. But as AI systems mature and prove more reliable, some tasks may require less oversight. We’re building a foundation that supports both: transparent, assistive AI now, with the flexibility to evolve into more autonomous capabilities where it makes sense.
Where can I learn more about Grafana’s AI strategy?
Check out our blog post to hear directly from our engineers.
What’s the difference between AI in observability and AI observability?
AI in observability applies AI to operate your systems better and refers to the use of AI as part of a larger observability strategy. This could include agents baked into a platform (e.g., Grafana Assistant in Grafana Cloud) or other integrations that help automate and accelerate the ways teams observe their systems.
AI observability is the use of observability to track the state of an AI system, such as an LLM-based application. It’s a subset of observability focused on a specific use case, similar to database observability for databases or application observability for applications.
Grafana Cloud does both: AI that helps you operate, and observability for your AI.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!



