Investigate incidents across multiple entities

Use RCA workbench to investigate incidents that span multiple services or infrastructure components. Correlate insights across entities on a timeline to understand what failed, when, and in what order.

When to use this workflow

Use this workflow when:

  • Multiple services are experiencing errors at the same time
  • You need to understand the sequence of events during an incident
  • The root cause isn’t immediately obvious from a single service
  • You want to correlate infrastructure issues with service degradation
  • An alert fired but you need to understand the full scope of impact

This is your primary tool for incident response and root cause analysis.

Before you begin

Identify the services or infrastructure involved in the incident:

  • Check the entity catalog for entities with critical insights
  • Note the approximate time the incident started
  • Have an initial hypothesis about which entities might be related

Open RCA workbench

From Grafana Cloud, navigate to Observability > RCA workbench.

Add entities to investigate

Gather all services and infrastructure involved in the incident into RCA workbench.

Add entities from the entity catalog

  1. Navigate to Observability > Entity catalog.
  2. Filter to entities with insights during the incident window.
  3. Click each relevant entity.
  4. Click Add to RCA workbench in the entity details.

Remove irrelevant entities

If you’ve added too many entities:

  1. Switch to Timeline view.
  2. Hover over an entity and click Delete entity from board (X icon).

Focus on entities directly involved in the incident.

Set the time range

Adjust the time range to focus on the incident window:

  1. Use the time picker to select the incident period.
  2. Start slightly before the first symptoms appeared.
  3. Extend past when the incident resolved (or to the current time if it's still ongoing).

A narrower time range makes patterns easier to spot.

Analyze the timeline

The Timeline view shows all insights chronologically across your selected entities.

Identify the first failure

  1. Expand all entities on the left to show individual insights.
  2. Scan the timeline from left to right.
  3. Find the earliest insight that fired.

The first failure is often the root cause or trigger:

  • Amend insight (blue) - Deployment or configuration change
  • Failure insight (red) - Service or infrastructure failure
  • Saturation insight (yellow/red) - Resource limit approached

Look for cascading failures

After identifying the first failure, trace its impact:

  1. Note the time of the first insight.
  2. Look for insights on other entities shortly after.
  3. Identify the propagation pattern:
    • Downstream services start showing errors
    • Infrastructure saturation leads to Pod restarts
    • Database slowness causes service latency spikes

Use zoom to focus

Click and drag on the timeline to zoom into a specific time window:

  1. Drag across the period of interest.
  2. The view zooms to show more detail.
  3. You may need to zoom multiple times for precise analysis.

To zoom out, use the time picker or click the reset zoom button.

Correlate insights with metrics

  1. Click an insight on the timeline.
  2. View the associated metric for that insight.
  3. See how the metric crossed thresholds over time.
  4. Compare with other entity metrics in the same window.

Example: Click an error rate breach insight to see the error rate spike visualized.
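
If you want to go deeper than the insight panel, you can reproduce the same signal as a query. A minimal PromQL sketch for an error-rate ratio looks like this; the http_requests_total metric, the service label, and the checkout value are placeholders for whatever your instrumentation actually exposes.

  # Share of requests returning 5xx for the affected service over the incident window.
  # Metric and label names are placeholders; substitute your own.
  sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
    /
  sum(rate(http_requests_total{service="checkout"}[5m]))

Plotting this next to the insight's threshold confirms exactly when the breach started.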

Investigate individual entities

From RCA workbench, drill into entities for more detail:

View entity details

  1. Hover over an entity name in the timeline.
  2. Click KPI.
  3. The entity details drawer opens.
  4. Switch between tabs:
    • Service overview - RED metrics and thresholds
    • Logs - Pre-filtered to the incident time range
    • Traces - Request traces during the incident
    • Kubernetes - Infrastructure health

Check logs for errors

  1. Open the Logs tab for an entity.
  2. Filter by severity: Error or Warning.
  3. Look for stack traces or error messages at incident start time.
  4. Search for specific error patterns, as in the example query below.
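
To run the same filter as an ad hoc query (for example in Explore), a LogQL sketch looks roughly like this; the service_name label, the logfmt parser, and the "connection refused" string are assumptions, so adapt them to your log labels and the error pattern you're chasing.

  # Error- and warning-level logs for the affected service, narrowed to one error pattern.
  # Labels, parser, and search string are placeholders.
  {service_name="checkout"} |= "connection refused" | logfmt | level=~"error|warn"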

Analyze slow traces

  1. Open the Traces tab.
  2. Look at the duration heatmap.
  3. Identify traces that are slower than usual (see the example query after this list).
  4. Click a slow trace to see:
    • Which service calls took longest
    • Database query times
    • External API call latency
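
If you prefer to search for these traces directly, a TraceQL sketch along the following lines works; the service name and the 2s cutoff are assumptions, so use your own service and a threshold just above its normal latency.

  { resource.service.name = "checkout" && duration > 2s }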

Use the entity graph

Visualize relationships between affected entities:

  1. From RCA workbench, click Graph view.
  2. See all entities and their connections.
  3. Identify:
    • Which service calls which (arrows show direction)
    • Upstream callers vs downstream dependencies
    • Infrastructure hosting the services
  4. Click an entity in the graph.
  5. View connected entities.
  6. Add problematic connections to the timeline.
  7. See if upstream or downstream services also have issues.

Common incident patterns

Recognize these common failure patterns to accelerate root cause identification.

Deployment triggered errors

Pattern: Amend insight (blue) immediately before failure insights (red)

  1. Find the Amend insight (deployment, scale event).
  2. Note which service was deployed.
  3. Check if error insights on that service or downstream started immediately after.
  4. Review logs from the deployed service for startup errors.

Action: Likely a bad deployment. Roll back the deployment or investigate the new code.
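
One way to confirm that the errors really started with the rollout is to compare the error rate against the same window before the deployment. The PromQL sketch below assumes the deployment landed roughly an hour ago and reuses the placeholder metric and label names from earlier; adjust the offset to the actual deployment time.

  # Returns series only while the current 5xx rate is more than double the pre-deployment rate.
  sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
    > 2 *
  sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m] offset 1h))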

Resource saturation cascade

Pattern: Saturation insight followed by performance degradation across multiple services

  1. Identify saturation insight (CPU, memory, connections).
  2. See latency increases on the saturated service.
  3. Upstream services show error rate increases (timeouts).
  4. More services are affected as the incident progresses.

Action: Scale resources, increase limits, or optimize the saturated service.
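
To quantify how close the saturated workload is to its limits, a PromQL sketch against the standard cAdvisor and kube-state-metrics metrics looks like this; the prod namespace is a placeholder.

  # CPU usage per Pod as a fraction of its configured CPU limit (1.0 means fully saturated).
  sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="prod", container!=""}[5m]))
    /
  sum by (pod) (kube_pod_container_resource_limits{namespace="prod", resource="cpu"})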

Database or dependency failure

Pattern: Multiple services show errors simultaneously, all calling the same dependency

  1. Add multiple affected services to RCA workbench.
  2. View the entity graph.
  3. Identify the shared downstream dependency.
  4. Check if that dependency has failure insights.

Action: Investigate and restore the shared dependency.
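
To spot the simultaneous rise across callers, chart the error rate broken down by service in a single query. The PromQL sketch below reuses the placeholder http_requests_total metric and service label from earlier; a synchronized spike across several services usually points at the dependency they all share.

  # 5xx rate per calling service; simultaneous spikes suggest a shared dependency.
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))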

Infrastructure failure impact

Pattern: Node or Pod failure followed by service errors

  1. Add nodes and Pods to RCA workbench.
  2. Find Pod restart or node NotReady insights.
  3. Correlate with service error rate spikes.
  4. Check if Pods running the service restarted.

Action: Fix the infrastructure issue and ensure Pod rescheduling succeeds.
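
The Pod restart and node NotReady signals above are straightforward to confirm with kube-state-metrics, roughly as in the sketch below (run the two queries separately); the prod namespace and the 30m window are placeholders.

  # Containers that restarted during the incident window.
  increase(kube_pod_container_status_restarts_total{namespace="prod"}[30m]) > 0

  # Nodes reporting a Ready condition other than "true" (NotReady or Unknown).
  kube_node_status_condition{condition="Ready", status!="true"} == 1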

Use Grafana Assistant

Accelerate your analysis with Grafana Assistant:

  1. In RCA workbench, click Analyze RCA workbench.
  2. Grafana Assistant analyzes entities and timeline.
  3. Ask questions like:
    • “What was the first failure?”
    • “Which entities are most affected?”
    • “What changed before the incident?”
    • “@payment-service what caused the error spike?”

Grafana Assistant can identify patterns and suggest root causes based on the timeline.

Document findings

Capture key information during your investigation to support post-incident analysis.

Key information to note

  • Time of first failure - When did the incident actually start?
  • Root cause entity - Which service or infrastructure component failed first?
  • Triggering event - Deployment, configuration change, external dependency?
  • Scope of impact - How many services/customers affected?
  • Propagation pattern - How did the failure spread?

Create a summary

Document:

  1. Root cause (what failed)
  2. Trigger (why it failed)
  3. Impact (scope and severity)
  4. Timeline (sequence of events)
  5. Resolution steps (what fixed it)

Next steps during incidents

Take appropriate action based on whether the incident is ongoing or resolved.

Incident is ongoing

  • Mitigate: Roll back the deployment, scale resources, or fail over
  • Communicate: Share impact scope with stakeholders
  • Monitor: Keep RCA workbench open to watch for new failures

Incident is resolved

  • Post-mortem: Document root cause and timeline
  • Prevent recurrence: Add monitoring, improve limits, or fix underlying issues
  • Share learnings: Update runbooks and team knowledge

Additional resources