Perform root cause analysis in RCA Workbench

Troubleshooting typically involves accessing dashboards, opening a metrics browser, adding metrics, and checking logs across multiple browser windows. Along the way, you might find that the dashboards are outdated, that you’re not sure which metrics to use, or that it takes significant time just to narrow in on a specific time range.

Asserts simplifies this process by automatically highlighting all system assertions and providing additional context. By quickly checking the Top Insights or searching within graphs, you can easily identify the root cause of issues. For more in-depth troubleshooting, the RCA Workbench is available for manual analysis of assertions.

The RCA Workbench provides the following benefits:

  • Conveniently add or remove entities as needed
  • Visualize assertions on a timeline for causal investigation
  • Access logs, traces, and other dashboards seamlessly
  • Examine an entity graph to assess the impact of current issues and spatially correlate them
  • Use a mind map to navigate assertions based on category and type

Open RCA Workbench

Complete the following steps to open RCA Workbench.

  1. Sign in to Grafana and select Observability > Asserts > RCA Workbench.

  2. From the Frequently used menu, select Show all Services or Show all Nodes.

The RCA Workbench opens.

Remove entities from the Timeline

Depending on your query, the RCA Workbench might include more entities than you need for troubleshooting. For instance, if you choose Show all Services, every service is added, which can be too much information to troubleshoot effectively. To focus on the specific entities you want to investigate, remove the others from the Timeline.

Removing entities from the Timeline is beneficial when you notice a recurring pattern with a particular subset of entities and want to eliminate other irrelevant entities from your investigation.

To remove entities from the Timeline, perform the following steps:

  1. Click Timeline.

  2. Hover over an entity you want to remove and select the checkbox.

  3. Click the ‘x’ (Delete entity from board).

    The following example shows the ride-service-payment entity being removed while the ingress-nginx-controller, payment, and cart entities remain because they have assertions.

    Remove entities

Add problematic entities to the Timeline

When refining the list of entities for investigation, you might find that there are specific entities you want to analyze that weren’t automatically included in the query. In such cases, you can manually add related problematic entities connected to the one you are focusing on for troubleshooting.

To add problematic entities to the Timeline, perform the following steps:

  1. Hover over an entity.

  2. Click Add problematic connections.

    The following example shows adding all problematic entities connected to the payment entity.

    Add entities
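
The effect of Add problematic connections can be pictured as a small set operation: take the entities connected to the one you are focused on, and add those that currently have assertions. The following Python sketch illustrates that idea only. It is not the Asserts implementation or API. The Board class, the add_problematic_connections function, and the connection data are hypothetical; the entity names are borrowed from the examples on this page, and the connections between them are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    """Hypothetical stand-in for the set of entities shown on the Timeline."""
    entities: set[str] = field(default_factory=set)


def add_problematic_connections(board, focus, connections, firing):
    """Add entities connected to `focus` that currently have assertions.

    connections: dict mapping an entity name to the names it is connected to.
    firing: set of entity names that currently have assertions.
    Returns the set of entity names that were newly added to the board.
    """
    neighbors = connections.get(focus, set())
    added = {name for name in neighbors
             if name in firing and name not in board.entities}
    board.entities |= added
    return added


# Invented sample data: which entities connect to payment is an assumption.
connections = {
    "payment": {"cart", "ingress-nginx-controller", "ride-service-payment"},
}
firing = {"payment", "cart", "ingress-nginx-controller"}

board = Board(entities={"payment"})
added = add_problematic_connections(board, "payment", connections, firing)
print(sorted(added))  # ['cart', 'ingress-nginx-controller']
```

In this sketch, ride-service-payment is connected to payment but is not added, because it has no assertions.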

Use the Timeline to perform root cause analysis

The Timeline equips you with the necessary tools for conducting root cause analysis. After you have adjusted the list of entities by adding or removing them, you can use the Timeline to focus on a particular time period, determine the sequence in which assertions were triggered, and easily access logs, traces, and other dashboards.

To use the Timeline view to perform root cause analysis, complete the following steps:

  1. Use the time picker to select the time range you want to investigate.

  2. On the graph, click and drag on assertions to zoom in.

    Zooming in provides a clearer view of the assertions that have fired.

    Zoom in on assertions

    You might need to zoom in more than once. The following image shows a view of the assertions that have fired on the Shipping service between 11:02 PM and 11:28 PM.

    Zoom in on assertions

  3. Expand the entities on the left to show each assertion in the timeline view.

    With each assertion shown in the timeline view, you can more easily investigate assertion patterns and sequencing.

    The following image shows that the Shipping service experienced an amend and an error assertion at approximately the same time, indicating that a service update might have triggered the errors experienced by the service.

    Add assertions

  4. To investigate further, click an assertion in the left panel.

    This shows associated metrics in the Timeline.

    The following image shows that the error log rate steadily increased above the threshold after the amend assertion fired on the Shipping service.

    Error log rate breach

  5. To navigate to logs, click a point in time on the Timeline and click Logs.

    Navigate to logs

    A drawer opens showing the logs associated with the point in time you selected. You can expand any log row to understand more.

    View log detail

  6. To navigate to traces, click a point in time on the Timeline and click Traces.

    Navigate to traces

    A drawer opens showing the traces associated with the point in time you selected. You can click any trace to understand more or open Explore.

    View trace details
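
The reasoning behind the steps above, narrowing to a time range and then reading assertions in the order they fired, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the RCA Workbench is implemented. The Assertion record, the assertions_in_window function, and the timestamps are hypothetical; only the Shipping service, the amend and error assertions, and the 11:02 PM to 11:28 PM window come from the example on this page.

```python
from datetime import datetime
from typing import NamedTuple


class Assertion(NamedTuple):
    """Hypothetical record of one assertion firing on an entity."""
    entity: str
    name: str
    fired_at: datetime


def assertions_in_window(assertions, start, end):
    """Return the assertions that fired within [start, end], earliest first."""
    in_window = [a for a in assertions if start <= a.fired_at <= end]
    return sorted(in_window, key=lambda a: a.fired_at)


# Invented sample data; the date and exact times are assumptions.
events = [
    Assertion("shipping", "error", datetime(2024, 1, 1, 23, 10)),
    Assertion("shipping", "amend", datetime(2024, 1, 1, 23, 5)),
    Assertion("shipping", "error", datetime(2024, 1, 1, 22, 0)),
]

window = assertions_in_window(
    events,
    start=datetime(2024, 1, 1, 23, 2),
    end=datetime(2024, 1, 1, 23, 28),
)
for a in window:
    print(a.fired_at.strftime("%H:%M"), a.entity, a.name)
```

With this sample data, the amend assertion is listed before the error assertion, which is the kind of ordering that suggests a service update might have preceded the errors. The earlier error at 22:00 falls outside the window and is excluded.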

Timeline options

The following table describes when to use each timeline option:

Number  Option                       Description
1       Open chart in metrics        Opens a metrics dashboard.
2       Rules                        Shows the rule associated with the assertion. You can copy the rule and use it in other areas of the Grafana user interface.
3       Nulls as zero                Shows missing values as zero on the graph.
4       Assertion details            Provides detailed information about an assertion.
5       Add problematic connections  Adds related problematic entities connected to the entity you are focusing on for troubleshooting.
6       Entity                       Enables you to navigate to connected entities.
7       KPI                          Opens a KPI dashboard. From here, you can navigate to other related dashboards.
8       Update threshold             Navigates you to the Threshold page where you can modify the threshold associated with the assertion.
9       Notify                       Navigates you to the Notify page where you can configure notifications related to the assertion.
10      Suppress                     Navigates you to the Suppress page where you can suppress the associated assertion from firing.

Timeline options
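
To make the Nulls as zero option concrete, the following Python sketch shows the general idea of displaying missing values as zero. It is only an illustration: how Asserts represents missing samples internally is not described here, so the use of None for a missing value and the nulls_as_zero function name are assumptions.

```python
def nulls_as_zero(samples):
    """Return a copy of `samples` with missing values (None) replaced by 0."""
    return [0 if value is None else value for value in samples]


# Invented sample series with two gaps.
series = [3, None, 5, None, 0]
print(nulls_as_zero(series))  # [3, 0, 5, 0, 0]
```

Filling gaps with zero keeps the line on the graph continuous, but it can also make a missing measurement look like a real drop to zero, so it is worth keeping that in mind when you read the graph.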

View an entity graph

From within RCA Workbench, you can view an entity graph to assess the impact of current issues and spatially correlate them. An entity graph enables you to understand more about the assertions, navigate to connected entities, and navigate to relevant dashboards.

To view an entity graph, click Graph.

Entity graph

Use a mind map

Instead of using entities to navigate to assertions, you can use a mind map to navigate from assertions to entities. This view helps you identify common problems across many different entities.

To use a mind map, complete the following steps:

  1. Within RCA Workbench, click Mind map.

  2. Expand the nodes of the mind map to view entities with the same assertions.

Assertion mind map
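
The mind map inverts the usual lookup: instead of going from an entity to its assertions, you go from an assertion category and type to the entities that share it. The following Python sketch illustrates that grouping. It is not the Asserts data model. The build_mind_map function and the sample data are hypothetical; the error and amend categories appear on this page, while the type labels (critical, warning, info) are placeholders chosen for illustration.

```python
from collections import defaultdict


def build_mind_map(assertions):
    """Group entity names by assertion category, then by assertion type.

    assertions: iterable of (entity, category, assertion_type) tuples.
    Returns a nested dict: category -> type -> sorted list of entity names.
    """
    tree = defaultdict(lambda: defaultdict(set))
    for entity, category, assertion_type in assertions:
        tree[category][assertion_type].add(entity)
    return {
        category: {t: sorted(entities) for t, entities in types.items()}
        for category, types in tree.items()
    }


# Invented sample data; the type labels are placeholders.
firing = [
    ("shipping", "error", "critical"),
    ("payment", "error", "critical"),
    ("cart", "error", "warning"),
    ("shipping", "amend", "info"),
]

mind_map = build_mind_map(firing)
print(mind_map["error"]["critical"])  # ['payment', 'shipping']
```

Expanding a node in the mind map corresponds to reading one branch of this nested structure, such as every entity with a critical error assertion.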

Summary view

The Summary view summarizes all assertions at the service or node level, which enables you to quickly scan through all assertions. Unlike the Timeline, you don’t need to expand each entity to see which assertions fired.

On the Summary view, you can:

  • View an entity graph together with a timeline view
  • Explore relevant metrics in the timeline
  • Navigate to metrics, logs, and traces dashboards
  • Get a concise view of all the assertions firing on problematic entities

Summary view