Determine your use case
Before you start investigating, identify your use case to choose the right approach and metric type.
Your use case determines which RED metric you start with and how you navigate through your tracing data. You might know exactly what’s wrong, or you might need to explore to find issues.
Why this concept matters
Identifying your use case up front guides you to the right RED metric and workflow, so you spend less time exploring and reach root causes faster.
Grafana Traces Drilldown supports three main types of investigations: error investigation, performance analysis, and activity monitoring. Each use case has a different starting point and workflow.
How it works
Each use case maps to a specific RED metric and investigation workflow. Your investigation goal determines which metric you start with and which tabs and views are most useful.
Error investigation uses the Errors metric to find failed requests and their root causes. Performance analysis uses the Duration metric to identify slow operations and latency bottlenecks. Activity monitoring uses the Rate metric to understand service communication patterns and request flows.
Traces Drilldown adapts its interface based on your selected metric. When you choose Errors, you see error-specific tabs like Exceptions and Root cause errors. When you choose Duration, you see latency-focused tabs like Root cause latency and Slow traces. When you choose Rate, you see Service structure to visualize service communication.
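Under the hood, these views are built on TraceQL metrics queries against your Tempo data source. As a rough sketch only (not the exact queries the app generates), the three metric types correspond to queries like the following, each restricted to root spans:

```
{ nestedSetParent < 0 } | rate()
{ nestedSetParent < 0 && status = error } | rate()
{ nestedSetParent < 0 } | quantile_over_time(duration, .9, .95, .99)
```

The first query approximates Rate, the second Errors, and the third the Duration percentiles; `nestedSetParent < 0` is one way to limit a TraceQL query to root spans.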
Use case 1: Investigate errors
Use this when you know requests are failing or you’ve seen error alerts.
You might have noticed:
- Error alerts from your monitoring system
- Failed requests in your application logs
- User reports of errors or failed operations
- Spikes in error rates on dashboards
How to start
- Select Errors as your metric type
- Start with Root spans to see service-level error patterns
- Use the Comparison tab to identify which attributes correlate with errors
- Use the Breakdown tab to see which services or operations have the most errors
- Use the Exceptions tab to find common error messages
- Use Root cause errors to see the error chain structure
When to switch to All spans: If you need to find errors deeper in the call chain, like database errors or downstream service failures that don’t appear at the root level, switch to All spans.
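One way to confirm that errors are hiding below the root is to run a structural TraceQL query yourself, for example in Explore. The following sketch looks for error spans that are descendants of root spans that did not error:

```
{ nestedSetParent < 0 && status != error } >> { status = error }
```

Traces matched by a query like this don't surface as errors in a Root spans view, but do in All spans.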
Example scenarios
You know a service is failing but not why:
- Select Errors metric and Root spans
- Filter by the service name
- Use Comparison to see which attributes differ between successful and failed requests
- Use Root cause errors to see the error chain structure
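Conceptually, the Comparison tab contrasts the failing population against the successful one. If the failing service were named `checkout` (a hypothetical name), the two populations correspond roughly to these manual TraceQL filters:

```
{ resource.service.name = "checkout" && status = error }
{ resource.service.name = "checkout" && status != error }
```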
You see error alerts but don’t know the source:
- Select Errors metric and Root spans
- Use Breakdown to see which services have the most errors
- Drill into the problematic service using filters
- Use Comparison to identify what’s different about the failing requests
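A manual equivalent of this breakdown, sketched as a TraceQL metrics query that groups the root-span error rate by service:

```
{ nestedSetParent < 0 && status = error } | rate() by (resource.service.name)
```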
You need to find internal errors:
- Start with Errors metric and Root spans to see service-level patterns
- If errors don’t appear at the root level, switch to All spans
- This reveals database errors, downstream service failures, or internal operation errors
- Use Exceptions to find common error messages
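For example, if your instrumentation follows the OpenTelemetry `db.system` semantic convention, a query like this (the `postgresql` value is only illustrative) surfaces failing database spans anywhere in the call chain:

```
{ status = error && span.db.system = "postgresql" }
```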
Use case 2: Analyze performance
Use this when you want to identify slow operations, latency bottlenecks, or optimize response times.
You might be investigating:
- Slow response times reported by users
- High latency alerts
- Performance degradation over time
- A need to optimize specific operations
How to start
- Select Duration as your metric type
- Start with Root spans for end-to-end request latency
- Use the duration heatmap to identify latency patterns
- Select percentiles (p90, p95, p99) based on your SLA requirements
- Use Root cause latency to see which operations are slowest
- Use Slow traces to examine individual slow requests
- Use Breakdown to see duration by different attributes like service, environment, or region
When to switch to All spans: If you need to find slow internal operations like database queries or background jobs that don’t appear at the root level, switch to All spans.
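The duration heatmap corresponds roughly to a TraceQL metrics histogram over root-span duration; as a sketch you could run manually:

```
{ nestedSetParent < 0 } | histogram_over_time(duration)
```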
Example scenarios
Users report slow responses:
- Select Duration metric and Root spans
- Look at the heatmap for latency spikes
- Use Root cause latency to see which service operations are causing delays
- Use Slow traces to examine individual slow requests
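To pull individual slow requests yourself, a filter like this returns traces whose root span exceeded a threshold (the 2s value is only an example; pick one that matches your SLA):

```
{ nestedSetParent < 0 && duration > 2s }
```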
You want to optimize a specific endpoint:
- Select Duration metric and Root spans
- Add filters for the endpoint
- Use Breakdown to see duration by different attributes like service, environment, or region
- Select appropriate percentiles (p90, p95, p99) based on your optimization goals
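As a sketch, filtering to a single endpoint and computing its latency percentiles might look like this in TraceQL, assuming your spans carry the OpenTelemetry `http.route` attribute and using a hypothetical route:

```
{ nestedSetParent < 0 && span.http.route = "/api/checkout" } | quantile_over_time(duration, .9, .95, .99)
```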
You need to find slow database queries:
- Select Duration metric and All spans (database queries appear as child spans)
- Filter by database-related attributes
- Use Breakdown to see which queries are slowest
- Examine the slowest spans in Slow traces to identify problematic queries
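A rough manual equivalent, again assuming the OpenTelemetry `db.system` attribute and using the span name as a stand-in for the query or operation:

```
{ span.db.system = "postgresql" } | quantile_over_time(duration, .9) by (name)
```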
Use case 3: Monitor activity
Use this when you want to understand service communication patterns, request flows, or overall system activity.
You might want to:
- Understand how services communicate
- Monitor request rates and patterns
- Identify unusual activity spikes
- Map service dependencies
How to start
- Select Rate as your metric type
- Start with Root spans for service-level request rates
- Use Service structure to visualize service-to-service communication
- Use Breakdown to see request rates by different attributes
- Use Comparison to identify unusual patterns compared to baseline
- Use the Traces tab to examine individual requests
When to switch to All spans: If you need to see internal operations or child spans within traces, switch to All spans. Most activity monitoring use cases work well with Root spans.
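As a sketch, the request-rate breakdown by service corresponds to a TraceQL metrics query like:

```
{ nestedSetParent < 0 } | rate() by (resource.service.name)
```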
Example scenarios
You want to understand service dependencies:
- Select Rate metric and Root spans
- Use Service structure to see how services call each other
- Identify the communication patterns and dependencies
- Use Traces to examine individual request flows
You notice unusual activity spikes:
- Select Rate metric and Root spans
- Use Breakdown to see which services or operations have increased rates
- Use Comparison to compare against normal baseline behavior
- Switch to Errors or Duration if the spike indicates problems
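To drill into a single suspect service (the `checkout` name is hypothetical), a per-operation rate sketch looks like:

```
{ resource.service.name = "checkout" && nestedSetParent < 0 } | rate() by (name)
```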
You’re doing capacity planning:
- Select Rate metric and Root spans
- Use Breakdown by service, environment, or region
- Understand request distribution patterns
- Use Service structure to see communication volumes between services
Choose your starting point
Your starting point depends on what you already know:
You know what’s wrong:
- Errors present → Start with Errors metric and Root spans
- Performance issues → Start with Duration metric and Root spans
- Specific service affected → Add a filter for that service first, then select the appropriate metric
You need to explore:
- Start with Rate metric and Root spans to get an overview
- Look for unusual patterns in the graphs
- Switch to Errors or Duration based on what you find
You’re doing proactive analysis:
- Start with Rate metric and Root spans to understand normal patterns
- Use Comparison to identify deviations from baseline
- Switch to Errors or Duration when you find areas of concern
Related concepts
- RED metrics - Understanding Rate, Errors, and Duration metrics
- Traces and spans - How traces and spans work in distributed systems
Related tasks
After you’ve determined your use case:
- Choose a RED metric to match your investigation goal
- Choose root or full span data based on the depth you need
- Analyze tracing data using the appropriate tabs for your metric type
- Add filters to refine your investigation as you discover patterns



