
A tale of two incident responses: How our AI assistant found the root cause 3.5x faster
About two months ago, an incident at Grafana Labs kicked off in typical fashion: A series of alerts fired, our on-call engineer acknowledged them on Slack, and the rest of the team quickly began hypothesizing about the potential culprit.

But the way the incident was resolved was anything but typical.
Yes, our internal team followed best practices to resolve the incident as quickly as possible. But at the same time, Grafana Assistant Investigations, an AI-powered tool we'd been developing internally to accelerate multi-step incident investigations in Grafana Cloud, quickly got to work doing the same task.
So while our on-call engineers were digging through dashboards and logs, the AI assistant quietly spun up its own background investigation. Eight minutes later, it found the root cause, 20 minutes before our on-call team came to the same conclusion. That works out to roughly 8 minutes versus 28, or 3.5x faster.
Fast forward to today, and Assistant Investigations is now available to all Grafana Cloud users in public preview, giving you the same functionality to assist, not replace, your engineering team so it can focus on delivering great products instead of putting out fires.
Two paths, one root cause
Before we take a closer look at how this all played out, let's quickly get you up to speed on Assistant Investigations.
It coordinates a swarm of specialized AI agents to analyze your observability stack, diving into your metrics, logs, traces, and profiles to find anomalies and build up a picture of your system.
Assistant Investigations collects evidence in parallel, generating findings and hypotheses so it can provide actionable recommendations for mitigation and remediation. And since it’s embedded directly into Grafana Assistant, our AI chatbot designed specifically for Grafana, you get a seamless, guided workflow for resolving complex incidents.
OK, now let's get into the results.

In the end, it turned out that a small part of one of my PRs, an AI-generated SQL query, was indeed the culprit. I'd written it to analyze Assistant usage. It worked fine in staging, passed CI, and even got reviewed thoroughly, but in production, it broke the database.
And to be fair, most of the time was spent determining whether it was indeed our deploy that was causing the alerts to go off. But still, this highlights a major benefit of Assistant Investigations: It can start an investigation (and, in this case, finish it) before the people on call even get fully up to speed with what’s going on.
The incident: Some AI-generated code makes its way to production databases
That "harmless" SQL query for cohort analysis created an unbounded join that ballooned under load, saturating database connections and throttling CPU.
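To make "unbounded join" concrete, here is a minimal sketch of how this kind of query can balloon. It is not the query from the incident: the `users` and `events` tables, their columns, and the data volumes are all hypothetical, and it runs against an in-memory SQLite database rather than a production MySQL cluster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, tenant_id INTEGER);
    CREATE TABLE events (user_id INTEGER, tenant_id INTEGER, day INTEGER);
""")

# Hypothetical data: 3 tenants x 20 users, each user active on 30 days.
users = [(t * 100 + u, t) for t in range(3) for u in range(20)]
events = [(uid, t, d) for (uid, t) in users for d in range(30)]
conn.executemany("INSERT INTO users VALUES (?, ?)", users)
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)


def joined_rows(sql: str) -> int:
    """Return the number of rows the join produces."""
    return conn.execute(sql).fetchone()[0]


# Under-constrained join: matching on tenant only pairs every user
# with every event in that tenant, so rows grow multiplicatively.
unbounded = joined_rows("""
    SELECT COUNT(*) FROM users u
    JOIN events e ON e.tenant_id = u.tenant_id
""")

# Bounded join: match on the user key and restrict the time window.
bounded = joined_rows("""
    SELECT COUNT(*) FROM users u
    JOIN events e ON e.user_id = u.user_id AND e.tenant_id = u.tenant_id
    WHERE e.day >= 23
""")

print(f"unbounded join rows: {unbounded}")
print(f"bounded join rows:   {bounded}")
```

With only 60 users and 1,800 events, the under-constrained version already produces 36,000 joined rows against 420 for the bounded one. That gap widens as tables grow, which is one plausible reason a query like this can look fine in staging and still hurt in production.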
Our on-call engineers reverted deploys, checked SQL logs, and restarted affected clusters.
At the same time, Assistant Investigations ran in the background, testing multiple hypotheses in parallel. It looked at logs, metrics, traces, and profiles to compare query volume, latency spikes, and recent deploy diffs. Ultimately, it accurately flagged the /cohort-analysis endpoint as the likely culprit.
Just like an SRE team would during an incident, the AI system deployed specialized agents to investigate different hypotheses simultaneously.

What you're seeing in the image above:
- Multiple investigation branches running in parallel (not sequential)
- Specialized agents for different data sources (Prometheus, Loki, Tempo)
- Hypothesis testing with confidence scores for each theory
- Cross-correlation between metrics, logs, and traces
We use specialized agents that work together to get to the root cause: metrics agents checked MySQL connection pool metrics, logging agents validated deployment timing, and tracing agents analyzed database transaction patterns. Working in parallel, they built a complete picture of what went wrong, 3.5x faster than our on-call team.
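The parallel pattern described above can be sketched in a few lines. This is an illustration of the general idea only, not how Assistant Investigations is implemented: the `Finding` type, the three check functions, and the confidence values are all hypothetical stand-ins for agents that would query real metrics, logs, and traces.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    hypothesis: str
    source: str        # which kind of telemetry produced the evidence
    confidence: float  # 0.0 to 1.0, made-up values for illustration


def check_connection_pool() -> Finding:
    # Stand-in for a metrics agent inspecting connection pool usage.
    return Finding("connection pool exhausted", "metrics", 0.9)


def check_deploy_timing() -> Finding:
    # Stand-in for a logging agent comparing deploy time to alert onset.
    return Finding("recent deploy matches alert onset", "logs", 0.8)


def check_transaction_patterns() -> Finding:
    # Stand-in for a tracing agent looking at database transactions.
    return Finding("long-running transactions on one endpoint", "traces", 0.7)


def investigate(checks: list[Callable[[], Finding]]) -> list[Finding]:
    """Run every hypothesis check concurrently, then rank by confidence."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check) for check in checks]
        findings = [future.result() for future in futures]
    return sorted(findings, key=lambda f: f.confidence, reverse=True)


ranked = investigate(
    [check_connection_pool, check_deploy_timing, check_transaction_patterns]
)
for finding in ranked:
    print(f"{finding.confidence:.2f}  [{finding.source}] {finding.hypothesis}")
```

The point of the sketch is the shape of the work: no hypothesis waits on another, and the results come back as a ranked list rather than a single guess.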


Key elements from this Assistant Investigations report:
- Root cause identification with specific technical details
- Confidence scoring (0.91 confidence, in this case)
- Contributing factors breakdown (slow queries, connection holds)
- Actionable remediation steps with monitoring guidance
- Evidence trail linking metrics to the actual problem
The investigation didn't just identify the problem—it provided actionable recommendations, confidence scores, and supporting evidence. This was all included in one central location (as shown in the image above), but I do want to call out two of the more important lines of information it shared:
- Root cause confirmed: "Cohort Analysis API exhausted database connection pool across all clusters."
- Contributing factor: "Slow tenant limit queries taking 3-4 seconds creating sustained connection hold times."
The report also included specific remediation steps and monitoring guidance.
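Why do 3-4 second queries exhaust a connection pool? A rough way to reason about it is Little's law: the average number of connections in use equals the request rate multiplied by how long each request holds a connection. The sketch below applies that to the 3-4 second hold times quoted in the report. The pool size, the request rate, and the 50 ms "fast query" baseline are hypothetical numbers chosen for illustration, not figures from the incident.

```python
def connections_in_use(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x hold time."""
    return requests_per_second * hold_seconds


POOL_SIZE = 100           # hypothetical pool size
REQUESTS_PER_SECOND = 40  # hypothetical request rate

scenarios = [
    ("fast query (assumed 50 ms)", 0.05),
    ("slow query, low end (3 s)", 3.0),
    ("slow query, high end (4 s)", 4.0),
]

for label, hold in scenarios:
    needed = connections_in_use(REQUESTS_PER_SECOND, hold)
    status = "pool exhausted" if needed > POOL_SIZE else "ok"
    print(f"{label}: ~{needed:.0f} connections needed -> {status}")
```

Under these assumed numbers, the same traffic that needs about 2 connections with fast queries needs 120 to 160 once each query holds its connection for 3-4 seconds, which is more than the pool can supply.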
AI-generated code is here to stay. How can it be managed properly?
Almost every team now has some kind of AI-authored code running in production.
Sometimes it's a quick SQL query, sometimes an auto-generated handler or config file. It's faster to ship, but easier to misunderstand.
The result is that companies today are building systems and products faster than they can truly understand them at their core.
- Velocity without comprehension: More code, less context
- Automation bias: AI output looks right, so we assume it is
- Fewer domain experts: Everyone's moving faster, but knowing less deeply
This AI-driven productivity is a net positive. And in an ideal world, every AI-generated line would get the same meticulous review as human code. In practice, that's not sustainable—and problematic code will slip through, whether written by AI or humans.
We built Assistant Investigations to scale your understanding, not just your monitoring. It takes the noisy, chaotic first half of an incident and turns it into an evidence trail, complete with metrics, logs, traces, profiles, MCP integrations, query plans, and confidence scores, before you've even joined the call.
AI SRE tools don't replace engineers; they amplify them. Humans still make the calls, but now we make them faster, with better data and fewer assumptions.
Try Assistant Investigations today
Incidents aren’t slowing down, and neither is AI. Assistant Investigations helps your teams keep pace, turning alerts into data- and context-backed investigations before your Slack messages start popping off. Try your first investigation in Grafana Cloud today.
And for more information on Grafana Cloud AI, including FAQs about Assistant and our other AI capabilities, check out our AI observability page.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!


