Run investigations
Note
Grafana Assistant Investigations are currently in public preview. Grafana Labs offers support on a best-effort basis, and breaking changes might occur prior to the feature being made generally available.
Grafana Assistant helps you investigate incidents by answering quick questions about your telemetry or launching multi-agent workflows that explore metrics, logs, traces, and profiles in the background.
Investigation solutions in Grafana
The term investigation can refer to several features in Grafana Cloud. This page covers Grafana Assistant Investigations. Here is how they differ from other investigation features:
- Grafana Assistant Investigations: Prompt-driven, multi-agent analysis that queries metrics, logs, traces, profiles, and SQL across your Grafana Cloud data. Lives inside the Assistant and produces a structured report.
- Grafana Sift Investigation: ML-powered, automatic analysis of Kubernetes infrastructure. Runs curated detectors over cluster signals without a prompt. Free in Grafana Cloud. Not part of Grafana Assistant.
Choose the right tool for the job:
- Use Assistant Investigations for cross-signal, service-level analysis guided by your prompt.
- Use Sift Investigation for Kubernetes-only issues where you want quick, automatic triage.
- Use the Investigations app when you need a manual notebook to curate and share evidence.
Before you begin
- Investigations entitlement: Enable Grafana Assistant Investigations in Grafana Cloud.
- Investigation access: Assign the Assistant Investigation User role if your environment uses RBAC.
- Incident context: Summarize the symptom, impact, and affected services before launching an investigation.
Build infrastructure context
To provide accurate answers, the Assistant needs to understand your specific environment. You can scan your Prometheus data sources to create “infrastructure memories” of your service names, namespaces, and dependencies.
- Navigate to Grafana Assistant > Settings > Infrastructure memory.
- Select the Prometheus data sources you want to scan.
- Start the scan.
Once complete, the Assistant can map natural language queries like “How is the checkout service?” to the correct metrics and labels in your system.
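For example, suppose a scan recorded that your checkout workloads expose `http_requests_total` with a `service="checkout-svc"` label. With that memory in place, a question about the checkout service can resolve to a concrete query along these lines. The metric and label names here are assumptions for illustration, not a fixed Grafana Assistant schema:

```promql
# Illustrative sketch only: metric and label names come from the assumed
# infrastructure memory, not from your actual environment.
sum(rate(http_requests_total{namespace="checkout", service="checkout-svc"}[5m]))
```

The value of the memory is the mapping from everyday service names to the labels your metrics actually carry.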
Maintain memory quality
Review the generated memories to ensure service names match your team’s terminology. Re-run scans periodically or after major infrastructure changes to keep the context current. You can also schedule infrastructure memory scans to re-run automatically every week.
Ask quick questions
Use the chat interface to get immediate answers about system health without writing queries manually. Context is key: mention specific resources to get the best results.
Ask about specific signals
Query metrics directly by mentioning the service or data source.
Show the error rate for @checkout-service over the last hour.
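Under the hood, a prompt like this typically becomes a ratio of failed to total requests. The sketch below assumes a conventional `http_requests_total` metric with a `status` label; your instrumentation may differ, and the one-hour window is applied through the time range rather than in the query itself:

```promql
# Assumed metric and labels; error rate = 5xx requests / all requests.
sum(rate(http_requests_total{service="checkout-service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="checkout-service"}[5m]))
```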
Filter logs
Search for specific patterns in your logs within a time range.
Find logs mentioning ‘timeout’ in @loki-prod from 10:00 to 10:15.
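Against a Loki data source, this kind of prompt maps to a simple line filter, with the 10:00 to 10:15 window applied as the query time range. The stream selector below is an assumption; in practice the Assistant picks labels from your environment:

```logql
# Assumed stream selector; adjust labels to match your log streams.
{env="prod"} |= "timeout"
```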
Refine the answer
Iterate on the results to group or sort the data.
Group the results by pod_name.
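Grouping the results turns the line filter into a metric query. A minimal sketch, again with assumed labels:

```logql
# Counts "timeout" lines per pod over the selected range; labels are assumed.
sum by (pod_name) (count_over_time({env="prod"} |= "timeout" [15m]))
```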
Correlate multiple signals
The Assistant can help you verify hypotheses by checking different types of data across the same timeframe.
Establish a baseline
Start with a metric or dashboard panel to set the context.
Look at the CPU usage on this panel.
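If you want to see roughly what such a panel runs, a typical Kubernetes CPU panel is backed by a cAdvisor-style query like the following. The metric and label names are assumptions and depend on how your cluster is scraped:

```promql
# Assumed cAdvisor metric; per-pod CPU usage for one namespace.
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="checkout"}[5m]))
```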
Pivot to other data
Ask the Assistant to find related logs or traces for the same timeframe.
Are there any error logs for the same service during that CPU spike?
Synthesize
Ask for a summary that connects the findings across signals.
Explain how the CPU spike relates to the error logs.
Start an investigation
For complex incidents, use Investigation mode. This launches specialized AI agents that run in the background to analyze multiple data sources, correlate findings, and generate a structured report.
When to launch an investigation
Use Investigation when issues span multiple services, require parallel analysis of metrics and logs, or when you need a structured audit trail for an incident.
You can also automate investigations using IRM webhooks. When configured, the Assistant automatically starts investigations when alert groups are created or incidents are updated.
Investigation lifecycle
- Launch: Provide a detailed prompt that captures the incident summary, timeframe, and focus areas.
- Agent execution: Investigation agents fan out across domains, including Prometheus metrics, Loki logs, Tempo traces, Pyroscope profiles, and SQL data sources, to gather evidence.
- Updates: Chat messages summarize progress. You can steer agents with feedback or additional hints.
- Report: Once agents finish, the investigation workspace compiles a structured summary, timeline, and recommended follow-ups.
Investigations track token usage separately from chat and respect the monthly tenant limits defined in Grafana Cloud.
Run an Assistant investigation
- Switch the chat mode to Investigation.
- Provide a detailed problem statement. For example:
  High latency in the payment service. Investigate the @payment-cluster and check for database locks.
- Monitor agent progress in the investigation workspace.
- Review the final report, which includes a timeline of events, key findings, and recommended next steps.
Work with investigation reports
The Summary section delivers quick status updates and recommended next steps so you can brief stakeholders fast.
Expand the Report to read the full findings along with the queries and evidence each agent collected.
The Timeline records agent tasks in order, which helps during post-incident reviews, while the Activity log provides the raw events you need to reproduce a step.
When you are ready to communicate, ask the Assistant to convert the findings into incident updates, backlog items, or dashboard follow-ups.



