A tale of two incident responses: How our AI assistant found the root cause 3.5x faster

Ryan Perry

2026-01-29•6 min

About two months ago, an incident at Grafana Labs kicked off in typical fashion: A series of alerts fired, our on-call engineer acknowledged them in Slack, and the rest of the team quickly began hypothesizing about the potential culprit.

Slack notification showing a critical alert for an Assistant API latency spike. Includes details like SLO type, owner, and Grafana link.

But the way the incident was resolved was anything but typical.

Yes, our internal team followed best practices to resolve the incident as quickly as possible. But at the same time, Grafana Assistant Investigations, an AI-powered tool we'd been developing internally to accelerate multi-step incident investigations in Grafana Cloud, quickly got to work doing the same task.

So while our on-call engineers were digging through dashboards and logs, the AI assistant quietly spun up its own background investigation. Eight minutes later, it found the root cause, 20 minutes before our on-call team came to the same conclusion.

Fast forward to today, and Assistant Investigations is now available to all Grafana Cloud users in public preview, giving you the same functionality to assist, not replace, your engineering team so it can focus on delivering great products instead of putting out fires.

Two paths, one root cause

Before we take a closer look at how this all played out, let's quickly get you up to speed on Assistant Investigations.

It coordinates a swarm of specialized AI agents to analyze your observability stack, diving into your metrics, logs, traces, and profiles to find anomalies and build up a picture of your system.

Assistant Investigations collects evidence in parallel, generating findings and hypotheses so it can provide actionable recommendations for mitigation and remediation. And since it’s embedded directly into Grafana Assistant, our AI chatbot designed specifically for Grafana, you get a seamless, guided workflow for resolving complex incidents.

OK, now let's get into the results.

Bar chart comparing incident response times: AI assistant at 8 minutes and on-call engineer at 28 minutes, with AI being 71% faster.

In the end, it turned out that a small part of one of my PRs, an AI-generated SQL query, was indeed the culprit. I'd written it to analyze Assistant usage. It worked fine in staging, passed CI, and even got reviewed thoroughly, but in production, it broke the database.

And to be fair, most of the on-call team's time was spent determining whether it was indeed our deploy that was causing the alerts to go off. But still, this highlights a major benefit of Assistant Investigations: It can start an investigation (and, in this case, finish it) before the people on call even get fully up to speed with what's going on.
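The headline numbers in this post all come from the same two timings, and a few lines of arithmetic show how they relate. This is just a sanity check on the figures quoted here (8 minutes, 20 minutes earlier, 3.5x, 71%); the variable names are ours.

```python
# Timings quoted in this post, in minutes.
ai_minutes = 8                                # Assistant Investigations found the root cause
lead_minutes = 20                             # how far ahead of the on-call team it finished
oncall_minutes = ai_minutes + lead_minutes    # 28

speedup = oncall_minutes / ai_minutes                          # ratio of the two times
time_saved = (oncall_minutes - ai_minutes) / oncall_minutes    # fraction of time saved

print(f"{speedup:.1f}x faster, {time_saved:.0%} less time")
# prints "3.5x faster, 71% less time"
```

So "3.5x faster" and "71% faster" describe the same gap: the first is the ratio of the two times, the second is the share of the on-call time that was saved.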

The incident: Some AI-generated code makes its way to production databases

That "harmless" SQL query for cohort analysis created an unbounded join that ballooned under load, saturating database connections and throttling CPU.
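To make "fine in staging, broken in production" concrete, here is a minimal sketch of how a join with a missing condition can grow much faster than the data behind it. The actual query isn't shown in this post, so the tables (`signups`, `events`), columns, and row counts below are invented for illustration, and SQLite stands in for the real database.

```python
import sqlite3

def join_row_counts(users: int, events_per_user: int) -> tuple[int, int]:
    """Return (bounded, unbounded) join sizes for one tenant's data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signups (tenant_id INTEGER, user_id INTEGER)")
    conn.execute("CREATE TABLE events (tenant_id INTEGER, user_id INTEGER)")
    conn.executemany("INSERT INTO signups VALUES (1, ?)", [(u,) for u in range(users)])
    conn.executemany(
        "INSERT INTO events VALUES (1, ?)",
        [(u,) for u in range(users) for _ in range(events_per_user)],
    )
    # Bounded: each event matches exactly one signup row.
    bounded = conn.execute(
        "SELECT COUNT(*) FROM signups s JOIN events e "
        "ON s.tenant_id = e.tenant_id AND s.user_id = e.user_id"
    ).fetchone()[0]
    # Unbounded: the user_id condition is missing, so every signup in the
    # tenant pairs with every event in the tenant.
    unbounded = conn.execute(
        "SELECT COUNT(*) FROM signups s JOIN events e ON s.tenant_id = e.tenant_id"
    ).fetchone()[0]
    conn.close()
    return bounded, unbounded

print(join_row_counts(users=10, events_per_user=4))   # staging-sized: (40, 400)
print(join_row_counts(users=500, events_per_user=4))  # production-sized: (2000, 1000000)
```

With 10 users the unbounded join is only 10x larger than it should be, which is easy to miss in a test run. With 500 users it is 500x larger, because the result grows with the square of the user count.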

Our on-call engineers reverted deploys, checked SQL logs, and restarted affected clusters.

At the same time, Assistant Investigations ran in the background, testing multiple hypotheses in parallel. It looked at logs, metrics, traces, and profiles to compare query volume, latency spikes, and recent deploy diffs. Ultimately, it accurately flagged the /cohort-analysis endpoint as the likely culprit.

Just like an SRE team would during an incident, the AI system deployed specialized agents to investigate different hypotheses simultaneously.

A dashboard showing agent activity with a timeline and a flowchart illustrating multiple specialized agents investigating different hypotheses.

What you're seeing in the image above:

Multiple investigation branches running in parallel (not sequential)
Specialized agents for different data sources (Prometheus, Loki, Tempo)
Hypothesis testing with confidence scores for each theory
Cross-correlation between metrics, logs, and traces

The specialized agents worked together to get to the root cause: metrics agents checked MySQL connection pool metrics, logging agents validated deployment timing, and tracing agents analyzed database transaction patterns. Working in parallel, they built a complete picture of what went wrong 3.5x faster than our on-call team.
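The general shape of "test several hypotheses at once, then rank them" can be sketched with plain `asyncio`. This is not how Assistant Investigations is implemented; the `Finding` class, the `check` stand-in, the hypothesis strings, the delays, and the confidence values are all hypothetical and exist only to illustrate parallel branches with a ranked result.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Finding:
    hypothesis: str
    source: str
    confidence: float  # 0.0 to 1.0, invented for this sketch

async def check(hypothesis: str, source: str, confidence: float, delay: float) -> Finding:
    # Stand-in for a real query against a metrics, logs, or traces backend.
    await asyncio.sleep(delay)
    return Finding(hypothesis, source, confidence)

async def investigate() -> list[Finding]:
    # All branches start at once, so total time is roughly the slowest
    # branch rather than the sum of all branches.
    findings = await asyncio.gather(
        check("connection pool exhausted", "metrics", 0.91, 0.03),
        check("latency began at the deploy", "logs", 0.80, 0.02),
        check("slow database transactions", "traces", 0.85, 0.01),
    )
    # Highest-confidence hypothesis first.
    return sorted(findings, key=lambda f: f.confidence, reverse=True)

ranked = asyncio.run(investigate())
for f in ranked:
    print(f"{f.confidence:.2f}  {f.source:<8}{f.hypothesis}")
```

A sequential version would wait for each check in turn; running them concurrently is what lets the evidence from metrics, logs, and traces arrive together for cross-correlation.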

Chat exchange about using investigations for an incident, with a link to the investigation results and a comment on finding the issue.

From investigation to action: the AI delivers results

Grafana dashboard showing AI analytics with sections on root cause, recommendations, and evidence. Graph displays performance metrics over time.

Key elements from this Assistant Investigations report:

Root cause identification with specific technical details
Confidence scoring (0.91 confidence, in this case)
Contributing factors breakdown (slow queries, connection holds)
Actionable remediation steps with monitoring guidance
Evidence trail linking metrics to the actual problem

The investigation didn't just identify the problem; it provided actionable recommendations, confidence scores, and supporting evidence. All of this appeared in one central location (as shown in the image above), but I do want to call out two of the more important lines it shared:

Root cause confirmed: "Cohort Analysis API exhausted database connection pool across all clusters."
Contributing factor: "Slow tenant limit queries taking 3-4 seconds creating sustained connection hold times."

The report also included specific remediation steps and monitoring guidance.
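Why do 3-4 second hold times exhaust a connection pool? One way to reason about it is Little's law: the average number of connections in use is roughly the query arrival rate multiplied by how long each query holds its connection. The sketch below uses that relationship with a hypothetical pool size and request rate; neither number comes from the incident, only the 3-4 second range does.

```python
def connections_in_use(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average connections held = arrival rate x hold time."""
    return requests_per_second * hold_seconds

POOL_SIZE = 50            # hypothetical pool size, not from the incident
REQUESTS_PER_SECOND = 20  # hypothetical query rate, not from the incident

# A fast query versus the 3-4 second range quoted in the report.
for hold in (0.05, 3.0, 4.0):
    needed = connections_in_use(REQUESTS_PER_SECOND, hold)
    status = "exhausted" if needed > POOL_SIZE else "ok"
    print(f"hold={hold}s -> ~{needed:.0f} connections needed ({status})")
```

At the same request rate, moving from tens of milliseconds to several seconds per query multiplies the connections needed by the same factor, which is how a single slow endpoint can starve everything else that shares the pool.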

AI-generated code is here to stay. How can it be managed properly?

Almost every team now has some kind of AI-authored code running in production.

Sometimes it's a quick SQL query, sometimes an auto-generated handler or config file. It's faster to ship, but easier to misunderstand.

The result is that companies today are building systems and products faster than they can truly understand them at their core.

Velocity without comprehension: More code, less context
Automation bias: AI output looks right, so we assume it is
Fewer domain experts: Everyone's moving faster, but knowing less deeply

This AI-driven productivity is a net positive. And in an ideal world, every AI-generated line would get the same meticulous review as human-written code. In practice, that's not sustainable, and problematic code will slip through, whether written by AI or humans.

We built Assistant Investigations to scale your understanding, not just your monitoring. It takes the noisy, chaotic first half of an incident and turns it into an evidence trail, complete with metrics, logs, traces, profiles, MCP integrations, query plans, and confidence scores, before you've even joined the call.

AI SRE tools don't replace engineers; they amplify them. Humans still make the calls, but now we make them faster, with better data and fewer assumptions.

Try Assistant Investigations today

Incidents aren’t slowing down, and neither is AI. Assistant Investigations helps your teams keep pace, turning alerts into data- and context-backed investigations before your Slack messages start popping off. Try your first investigation in Grafana Cloud today.

And for more information on Grafana Cloud AI, including FAQs about Assistant and our other AI capabilities, check out our AI observability page.

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!
