Resolve issues faster with contextualized root cause analysis workflows in Grafana Cloud

• 2024-09-24 • 9 min

One of the biggest challenges in troubleshooting complex microservice-based applications is manually correlating anomalies across both application and infrastructure layers. Teams must connect these anomalies over time, understand cause-and-effect relationships, and detect patterns to diagnose issues. This process is often costly, time-consuming, and error-prone, and it increases mean time to resolution (MTTR).

We took a big step forward in bridging the gap in this area last year when we announced the acquisition of Asserts.ai during ObservabilityCON 2023. Asserts leverages AI/ML to offer an automated approach to correlating anomalies across your application and infrastructure signals for faster issue resolution.

Since then, our team has been focused on integrating Asserts with Grafana Cloud solutions. Today, at ObservabilityCON 2024, we are excited to share the results of our efforts: a suite of unified workflows connecting Asserts and Grafana Cloud solutions to further simplify troubleshooting and reduce MTTR.

By automating the correlation of anomalies across infrastructure and application layers and providing a more cohesive troubleshooting experience in Grafana Cloud, these AI-driven inferences enable even junior engineers to more effectively understand and diagnose issues in complex systems.

We’ve started rolling out these new capabilities, and by Oct. 2, all Grafana Cloud Advanced customers will have access to Asserts and these integrated workflows from the Grafana Cloud navigation menu. (Note: Kubernetes Monitoring in Grafana Cloud will be required to run Asserts. Our system will automatically analyze your setup and guide you through the onboarding process.)

In this blog post, we’ll explore:

- How Asserts works
- How to conduct domain-specific investigations in Grafana Cloud with Asserts
- A detailed example of the unified workflows in action

What is Asserts?

Built for Prometheus and OpenTelemetry instrumentation, Asserts is a tool that automates anomaly detection and correlation, providing a unique contextual layer to your application and infrastructure telemetry. This makes it an ideal starting point for efficient and effective troubleshooting.

How Asserts works

Real-time monitoring of applications and infrastructure

For users who prefer starting with a high-level overview of their system components and their health, Asserts features the Entity Explorer. It provides a real-time map of your application and infrastructure architecture, displays the health status of each entity, and indexes the graph for easy searching.

Entity Explorer example in Asserts in Grafana Cloud — The Entity Explorer correlates application and infrastructure telemetry data through an intuitive map view.

SLO-based alerts

For those who rely on alerts, Asserts helps prevent alert fatigue by allowing you to set up SLO-based alerts. Creating SLOs directly in Asserts is now simpler than ever, thanks to a curated, query-less creation flow where users select the service and its endpoints to define SLOs.
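To make the idea behind an SLO-based alert concrete, here is a minimal sketch of the error-budget burn-rate arithmetic such alerts are commonly built on. The function name, the 99.5% target, and the request counts are hypothetical, and this is not a description of how Asserts evaluates SLOs internally.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget is spent exactly at the allowed rate; values
    above 1.0 mean the SLO will be missed if the rate persists.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target               # allowed fraction of bad events
    observed_ratio = bad_events / total_events    # actual fraction of bad events
    return observed_ratio / error_budget


# Hypothetical numbers: 99.5% target, 120 slow requests out of 10,000.
rate = burn_rate(bad_events=120, total_events=10_000, slo_target=0.995)
print(f"burn rate: {rate:.1f}x")
```

Alerting on burn rate rather than on raw thresholds is one reason SLO-based alerts tend to be less noisy: a brief blip consumes little budget, while a sustained problem pushes the rate well above 1.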

Automatic anomaly detection and correlation

When an alert is triggered, Asserts directs you to a curated RCA (Root Cause Analysis) Workbench. This workbench provides a timeline of relevant system assertions—automated checks on system anomalies—that aid in analyzing the data, isolating problem areas, and forming hypotheses about the causality between events.

RCA workbench in Asserts in Grafana Cloud — The RCA workbench automatically correlates and prioritizes system anomalies.

How to perform domain-specific analysis with Asserts and Grafana Cloud

Once users have developed a hypothesis about the potential cause of an issue, they often need to dive deeper into specific entities and their telemetry to validate their findings and pinpoint the root cause. To enable this, we’ve integrated Asserts with Grafana Cloud’s best-in-class observability solutions:

- Application Observability
- Kubernetes Monitoring
- Infrastructure solutions for MongoDB, Jenkins, Apache Tomcat, Docker, MySQL, SNMP, PostgreSQL, ClickHouse, and Caddy. (We are continuously expanding the list of supported solutions.)

After isolating the problem domain, users can access a performance summary along with the metrics, logs, and traces of the affected entity—whether it’s an application service, Kubernetes workload, or other infrastructure component. This was made possible by integrating prebuilt performance overview dashboards from various Grafana Cloud solutions directly into Asserts. Users can also easily cross-launch into individual Grafana Cloud solutions for more detailed troubleshooting.

Moreover, these workflows are bidirectional. Asserts provides a curated library of alerting rules to analyze metrics, and it detects anomalies in the form of assertions. These assertions are categorized according to the SAAFE Model—Saturation, Amend, Anomaly, Failure, and Error—to clarify their impact on the system. With the new Asserts and Grafana Cloud workflows, users can view these assertions within individual observability solutions and navigate to the Asserts RCA Workbench to get broader context on the causality chain. This seamless experience helps users efficiently navigate application and infrastructure layers during their investigation without losing context.
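The SAAFE categories can be pictured as a simple tag attached to each assertion. The sketch below is only an illustration of that idea: the `Assertion` class, the `group_by_category` helper, and the `FeatureFlagChanged` and `ListProductsErrorRatio` names are hypothetical, and the category assigned to each example is our own guess rather than Asserts output.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class SaafeCategory(Enum):
    SATURATION = "saturation"
    AMEND = "amend"
    ANOMALY = "anomaly"
    FAILURE = "failure"
    ERROR = "error"


@dataclass(frozen=True)
class Assertion:
    entity: str
    name: str
    category: SaafeCategory


def group_by_category(assertions):
    """Bucket assertions by SAAFE category, preserving input order."""
    grouped = defaultdict(list)
    for assertion in assertions:
        grouped[assertion.category].append(assertion)
    return dict(grouped)


# Category assignments below are illustrative guesses, not Asserts output.
example = [
    Assertion("product-catalog", "FeatureFlagChanged", SaafeCategory.AMEND),
    Assertion("postgresql", "PostgresSQLHighConnections", SaafeCategory.ANOMALY),
    Assertion("product-catalog", "KubePodCrashLooping", SaafeCategory.FAILURE),
    Assertion("product-catalog", "ListProductsErrorRatio", SaafeCategory.ERROR),
]

for category, items in group_by_category(example).items():
    print(category.value, [a.name for a in items])
```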

Example: How Asserts and Grafana Cloud work together

Let’s illustrate how Asserts and Grafana Cloud work together through an example. Imagine you’re running a business-critical e-commerce application composed of various components, including a frontend service, product catalog service, recommendation service, PostgreSQL database, and more (see detailed architecture below).

E-commerce architecture diagram with Asserts — Architecture of the example e-commerce application.

An SLO alert triggers on Search Products Latency with a direct link to the Asserts RCA Workbench. In the timeline view of the RCA Workbench, you get a consolidated overview of all assertions triggered across the relevant components over time, providing a quick and comprehensive understanding of the situation.

Timeline view for RCA Workbench in Grafana Cloud — The RCA Workbench timeline view provides a consolidated view of system assertions.

The summary view of RCA Workbench offers an expanded view of these assertions, highlighting their evolution over time and across dependencies. By sorting them by time, we can quickly establish a chronological order of events:

1. A feature flag was toggled (upon expanding the assertion, we see it’s named productCatalogReadFromPostgres), and that’s when issues began to surface. This is clearly indicated by the blue bar in the UI, which denotes amends or changes in the system.
2. The PostgreSQL database experienced a surge in connections, with an anomalous number of connections being recorded.
3. Simultaneously, the Frontend, Recommendation, and Product Catalog services all started misbehaving.
   - The Product Catalog service started crash-looping, followed by a spike in latency for the Get Product endpoint and a breach of the error threshold for the List Products endpoint.
   - Both the Frontend and Recommendation services also started showing anomalous request rates and latency.

RCA Workbench summary view in Grafana Cloud — The RCA Workbench summary view helps quickly trace the sequence of assertions.

Opening the Graph Preview in the right corner of the page adds a spatial dimension to our investigation and immediately shows how these services are connected.

Graph preview in Asserts in Grafana Cloud — A small entity graph is easily accessible to add a spatial dimension to investigation.

With this big-picture overview, we can start piecing together what might be happening. Could the feature flag have triggered a cascade of issues across the PostgreSQL database and its dependencies? Let’s dive deeper into the individual entities.

From the Graph Preview, we can see that the PostgreSQL database is the most downstream component and is showing signs of trouble, so let’s begin our investigation there. Opening the KPI drawer for the entity takes us to the PostgreSQL overview dashboard, powered by the PostgreSQL solution for Grafana Cloud. Asserts automatically identifies this entity as PostgreSQL and surfaces the relevant dashboard. Here, we observe that the number of active connections is spiking and dropping periodically, providing a critical insight into the issue.
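The reasoning of starting at the most downstream troubled component can be sketched as a small graph query. The dependency map below is a simplified, hypothetical reading of the example architecture (only the edges mentioned in this post), and `deepest_unhealthy` is an illustrative helper, not an Asserts API.

```python
# Hypothetical call graph: each service maps to the services it depends on.
DEPENDENCIES = {
    "frontend": ["product-catalog"],
    "recommendation": ["product-catalog"],
    "product-catalog": ["postgresql"],
    "postgresql": [],
}


def deepest_unhealthy(dependencies, unhealthy):
    """Return unhealthy entities whose direct dependencies are all healthy.

    These are reasonable first places to look, because their trouble
    cannot be explained by something further downstream.
    """
    return sorted(
        entity
        for entity in unhealthy
        if not any(dep in unhealthy for dep in dependencies.get(entity, []))
    )


# In the example, every entity in the chain shows assertions.
unhealthy = {"frontend", "recommendation", "product-catalog", "postgresql"}
print(deepest_unhealthy(DEPENDENCIES, unhealthy))
```

Note that this heuristic only suggests where to start. As the rest of the investigation shows, the most downstream entity is not necessarily the root cause.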

Grafana dashboard with PostgreSQL solution in Grafana Cloud — The PostgreSQL overview dashboard, powered by the PostgreSQL solution for Grafana Cloud, shows periodic spikes and drops in active connections.

Next, let’s examine the data for the Product Catalog service, which is upstream from the PostgreSQL database. We already know from the assertions that this service is crash-looping and that the List Products endpoint is returning errors, but what’s the underlying cause? Opening the KPI drawer for the Product Catalog service reveals a set of relevant dashboards for this entity. The Application Observability dashboard, powered by Grafana Cloud Application Observability, confirms that both errors and response durations are spiking for this service.

Application Observability and Asserts dashboard in Grafana Cloud — The Application Observability dashboard, powered by Grafana Cloud Application Observability, confirms errors and durations are spiking for the product catalog service.

The Kubernetes dashboard, powered by Grafana Cloud Kubernetes Monitoring, provides a comprehensive overview of the performance of the Kubernetes workload. It quickly rules out Out of Memory (OOM) issues, confirming that the workload is not exceeding its memory limits.

Kubernetes Monitoring dashboard in Grafana Cloud with Asserts — The Kubernetes dashboard, powered by Grafana Cloud Kubernetes Monitoring, confirms the workload is not exceeding its memory limits.

Next, let’s examine the service logs. The log message clearly indicates a null pointer exception, which is causing the service to crash-loop. This, in turn, is leading to the periodic spiking and dropping of connections to the downstream PostgreSQL database.

Log lines for example product service using Asserts and Grafana Cloud — The logs for the product catalog service indicate a null pointer exception.

We can further validate this relationship by comparing the PostgresSQLHighConnections and KubePodCrashLooping assertions within the RCA Workbench.

RCA Workbench in Grafana Cloud — You can expand the assertions within the RCA workbench to track their changes over time.

Now that we’ve identified the root cause of the application issue, we can outline a clear sequence of events for our incident report. The feature flag, productCatalogReadFromPostgres, was intended to enable the Product Catalog service to read data from the PostgreSQL database. However, there appears to be an issue with this feature—possibly related to the database connection, query logic, or data handling code—that leads to null pointer exceptions.

These exceptions cause the Product Catalog service to crash repeatedly, entering a CrashLoopBackOff state in Kubernetes. Each time the service restarts, it attempts to connect to PostgreSQL, leading to spikes in active connections. Since the service crashes soon after, these connections drop, resulting in the periodic pattern observed in PostgreSQL. This instability also impacts other services that rely on the Product Catalog service, such as the Frontend and Recommendation services.
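To see why a crash-looping client produces a periodic connection pattern, consider the toy simulation below. All parameters (pool size, uptime, tick length) are hypothetical, and the doubling back-off is only loosely modeled on how Kubernetes delays container restarts; it is a sketch of the mechanism, not a reproduction of the incident.

```python
def simulate_connections(pool_size, uptime_ticks, restarts,
                         base_backoff_ticks=1, max_backoff_ticks=30):
    """Return per-tick active connection counts for a crash-looping client."""
    series = []
    backoff = base_backoff_ticks
    for _ in range(restarts):
        # Service is up: it holds its full connection pool until it crashes.
        series.extend([pool_size] * uptime_ticks)
        # Service is down: its connections are gone while it waits to restart.
        series.extend([0] * backoff)
        # The restart delay doubles after each crash, up to a cap.
        backoff = min(backoff * 2, max_backoff_ticks)
    return series


# Hypothetical values: 20 pooled connections, 2 ticks of uptime per start.
series = simulate_connections(pool_size=20, uptime_ticks=2, restarts=3)
print(series)
```

The alternating runs of `20` and `0` mirror the spike-and-drop shape seen on the PostgreSQL dashboard, with the gaps widening as the back-off grows.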

With this understanding, we can promptly work with the team to roll back the feature flag, reverting the application to a stable state while we investigate the underlying code issue. After turning off the feature flag, the RCA workbench shows that the error and failure assertions—previously indicated by red bars—have all disappeared.

Root cause identified with the RCA Workbench in Grafana Cloud — The RCA Workbench clearly illustrates the impact of turning off the problematic feature flag.

Get started with Asserts and streamlined troubleshooting workflows

Starting Oct. 2, all Grafana Cloud Advanced customers can access Asserts and its unified workflows with Grafana Cloud observability solutions. To activate Asserts, navigate to the Asserts section in the Grafana Cloud menu and follow the instructions provided.

Asserts provides the best experience when used with infrastructure metrics, RED (Rate, Errors, Duration) metrics, and service graph metrics. To learn more about what’s needed to set up Asserts, please refer to our Asserts documentation. Our system will automatically analyze your setup and may prompt you to submit a support ticket if additional assistance is needed.
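For readers less familiar with RED metrics, the sketch below derives rate, error ratio, and a 95th-percentile duration from a list of raw request records. In practice these values usually come from Prometheus or OpenTelemetry instrumentation rather than from code like this; the `Request` class, the `red_metrics` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_s: float
    is_error: bool


def red_metrics(requests, window_s):
    """Compute Rate, Errors, and Duration over one observation window."""
    total = len(requests)
    if total == 0 or window_s <= 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile, using integer math to avoid float rounding.
    rank = max(1, (95 * total + 99) // 100)
    return {
        "rate": total / window_s,                                    # requests per second
        "error_ratio": sum(r.is_error for r in requests) / total,    # failed fraction
        "p95_duration_s": durations[rank - 1],                       # slow-tail latency
    }


# Hypothetical sample: 18 fast successes and 2 slow failures in 10 seconds.
sample = [Request(0.1, False)] * 18 + [Request(2.0, True)] * 2
print(red_metrics(sample, window_s=10))
```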

For comprehensive onboarding details and best practices, please refer to our getting started documentation.

Sign up for a Grafana Cloud Advanced account to get started with Asserts or contact us for volume discounts.
