---
title: "Investigate slow performance | Grafana Cloud documentation"
description: "A step-by-step workflow for investigating latency issues using correlated telemetry signals."
---

# Investigate slow performance

When users report slowness, you need to pinpoint exactly where time is being spent. Metrics quantify the problem—which endpoints are slow, by how much, and since when. Traces show the request path and reveal which service or span is the bottleneck. Profiles identify the exact functions consuming CPU or memory. By correlating these signals, you can move from “the app is slow” to “this specific database query is taking 600ms instead of 50ms” without guessing.

This workflow shows how to investigate latency issues using metrics, traces, and profiles together.

You can try this workflow on [play.grafana.org](https://play.grafana.org) or on your own Grafana Cloud instance. Refer to [Before you begin](/docs/grafana-cloud/telemetry-signals/workflows/#before-you-begin) for more information.

## What you’ll achieve

After completing this workflow, you’ll be able to:

- Quantify latency issues using metrics
- Find slow requests using traces
- Compare slow and fast traces to identify differences
- Profile code to find bottlenecks

## Scenario: Slow database queries

This scenario continues the investigation from [Troubleshoot an error](/docs/grafana-cloud/telemetry-signals/workflows/troubleshoot-error/).

After restarting `api_server` to clear the exhausted connection pool, the service recovers but users still report slowness. Metrics show p99 latency for the `/users` endpoint is 800ms—normal is 150ms. The latency increase started at 8:30 PM, 15 minutes before the errors began.

## Example: Investigate latency

Here’s the investigation flow using different signals:

1. **Metrics** show p99 latency for `/users` went from 150ms to 800ms at 8:30 PM
2. **Traces** reveal the `user-db` span is taking 600ms instead of the normal 50ms
3. **Trace comparison** shows slow traces execute `SELECT * FROM users WHERE email LIKE '%...'` (full scan), while fast traces use indexed lookups
4. **Profiles** identify the `UserRepository.findByEmail()` function as the bottleneck

This investigation reveals that a recent code change introduced a full table scan query pattern. The underlying fix requires optimizing the query or adding an index. Refer to [Find slow code from a trace](/docs/grafana-cloud/telemetry-signals/workflows/find-slow-code-from-trace/) for more information.

To investigate the scenario, you can use the Grafana Drilldown apps. For detailed guidance on using Drilldown apps, refer to [Simplified exploration](/docs/grafana-cloud/visualizations/simplified-exploration/).

### Check latency metrics

1. Navigate to **Drilldown** &gt; **Metrics**.
2. Search for latency metrics, for example, `http_request_duration` or `latency`.
3. Filter by service and look at p99 or p95 percentiles.
4. Note when latency increased and which endpoints are affected.
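
The p99 and p95 values in step 3 are typically estimated from histogram buckets rather than from raw samples. The following is a minimal sketch of that estimation, using linear interpolation inside the bucket that contains the target rank. The function name and the bucket counts are hypothetical and only illustrate the arithmetic; they are not output from a real data source.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative histogram buckets.

    buckets holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound. The first bucket is assumed to start at 0 seconds.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][1] == 0:
        raise ValueError("no observations")

    rank = q * buckets[-1][1]  # position of the target observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Hypothetical cumulative counts for request durations, in seconds.
buckets = [(0.1, 600), (0.25, 900), (0.5, 960), (1.0, 1000)]
p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)
print(f"estimated p50: {p50 * 1000:.0f}ms, p99: {p99 * 1000:.0f}ms")
```

Because the estimate depends on bucket boundaries, a p99 that lands in a wide bucket is only approximate. Treat it as a signal of where to look, then confirm with traces.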

### Find slow traces

1. Navigate to **Drilldown** &gt; **Traces**.
2. Look at the duration histogram on the right—traces with higher durations appear on the right side.
3. Click on slow traces (high duration values) to examine them.
4. In the trace view, look at the span timeline to see where time is spent.

### Identify the bottleneck

In the trace timeline:

- Look for the widest spans—these consume the most time.
- Check if slowness is in your service or a downstream call.
- Note the span name and service for further investigation.
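
The "widest span" heuristic can be made more precise by looking at self time: a span's duration minus the time spent in its direct children. The sketch below assumes a simplified, hypothetical `Span` record and assumes child spans do not overlap in time. Only the 600ms `user-db` duration comes from the scenario; the other durations are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Return each span's duration minus the total duration of its direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.span_id: max(span.duration_ms - child_total.get(span.span_id, 0.0), 0.0)
        for span in spans
    }


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span.span_id])


# Hypothetical slow trace for the /users endpoint.
trace = [
    Span("a", None, "api_server", "GET /users", 780.0),
    Span("b", "a", "api_server", "UserRepository.findByEmail", 650.0),
    Span("c", "b", "api_server", "user-db", 600.0),
]
slowest = bottleneck(trace)
print(f"{slowest.service} / {slowest.name}: {self_times(trace)[slowest.span_id]:.0f}ms self time")
```

Ranking by self time avoids blaming a parent span that is wide only because it waits on a slow child.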

### Compare slow and fast traces

1. In **Drilldown** &gt; **Traces**, click on a fast trace (low duration).
2. Open the slow trace and the fast trace in separate tabs and compare:
   
   - Are the same services called?
   - Which spans are different?
   - Are there different attributes, for example, query type, user, or region?
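
When two spans carry many attributes, a small diff makes the differences easier to spot. The sketch below compares the attributes of a matching span from a slow trace and a fast trace. The attribute keys and values are hypothetical examples, not the attributes your instrumentation necessarily emits.

```python
from typing import Any, Dict, Tuple


def diff_attributes(
    slow: Dict[str, Any], fast: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (slow_value, fast_value)} for keys that differ or are missing on one side."""
    keys = sorted(set(slow) | set(fast))
    return {
        key: (slow.get(key), fast.get(key))
        for key in keys
        if slow.get(key) != fast.get(key)
    }


# Hypothetical attributes copied from the database span of each trace.
slow_span = {
    "db.statement": "SELECT * FROM users WHERE email LIKE ?",
    "region": "eu-west",
    "http.route": "/users",
}
fast_span = {
    "db.statement": "SELECT * FROM users WHERE email = ?",
    "region": "eu-west",
    "http.route": "/users",
}

diff = diff_attributes(slow_span, fast_span)
for key, (slow_value, fast_value) in diff.items():
    print(f"{key}:\n  slow: {slow_value}\n  fast: {fast_value}")
```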

### Check resource metrics

1. Navigate to **Drilldown** &gt; **Metrics**.
2. Search for resource metrics (CPU, memory) for the slow service.
3. Look for spikes or saturation that correlate with the latency.
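
One simple way to check whether a resource spike lines up with the latency increase is to find the first point at which each series exceeds its baseline. The sketch below assumes both series have been exported as ordered `(timestamp, value)` pairs. The function name, the threshold factor, and all sample values other than the 150ms and 800ms latency figures are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Series = List[Tuple[str, float]]


def first_breach(series: Series, baseline: float, factor: float = 2.0) -> Optional[str]:
    """Return the first timestamp whose value exceeds baseline * factor, or None."""
    threshold = baseline * factor
    for timestamp, value in series:
        if value > threshold:
            return timestamp
    return None


# Hypothetical samples at 15-minute intervals (24-hour clock).
p99_latency_ms: Series = [("20:00", 150), ("20:15", 155), ("20:30", 800), ("20:45", 810)]
db_cpu_percent: Series = [("20:00", 35), ("20:15", 38), ("20:30", 90), ("20:45", 92)]

latency_onset = first_breach(p99_latency_ms, baseline=150)
cpu_onset = first_breach(db_cpu_percent, baseline=35)
print(f"latency onset: {latency_onset}, CPU onset: {cpu_onset}")
```

Matching onsets suggest the two are related but do not prove which one caused the other. Use traces and profiles to confirm the direction.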

### Profile the code (if available)

If the slow span is in your application code (not an external call):

1. Navigate to **Drilldown** &gt; **Profiles**.
2. Select CPU profile for the service.
3. Set the time range to when slowness occurred.
4. Look for wide bars in the flame graph—these are hot functions.
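
The width of a bar in a flame graph reflects the share of samples in which that function was on the call stack. The sketch below computes that share from collapsed stack samples, a common text form in which frames are separated by semicolons. The stacks and sample counts are hypothetical and were chosen so that `UserRepository.findByEmail` accounts for the 65% used in this scenario.

```python
from collections import Counter
from typing import Dict


def inclusive_share(samples: Dict[str, int]) -> Dict[str, float]:
    """Return, per function, the fraction of samples with that function on the stack."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("no samples")
    on_stack: Counter = Counter()
    for stack, count in samples.items():
        # Count each function once per stack so recursion is not double counted.
        for function in set(stack.split(";")):
            on_stack[function] += count
    return {function: count / total for function, count in on_stack.items()}


# Hypothetical collapsed CPU stacks: "root;...;leaf" mapped to a sample count.
samples = {
    "main;handleUsers;UserRepository.findByEmail;matchPattern": 40,
    "main;handleUsers;UserRepository.findByEmail": 25,
    "main;handleUsers;serializeResponse": 20,
    "main;backgroundJob": 15,
}

shares = inclusive_share(samples)
for function, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{share:6.1%}  {function}")
```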

## Try the workflow

Want to try the workflow yourself? Use the public demo environment on [play.grafana.org](https://play.grafana.org) or Grafana Assistant in your own Grafana Cloud instance.

### Quick investigation with Grafana Assistant

If you have Grafana Cloud with [Grafana Assistant](/docs/grafana-cloud/machine-learning/assistant/), you can investigate latency quickly with natural language:

1. Click the **sparkle icon** in the top navigation bar to open **Grafana Assistant**.
2. Ask about latency:
   
   > “Show p99 latency for `api_server`”
   > 
   > “Which endpoints have the highest latency?”
   > 
   > “Find slow traces for `checkoutservice`”

Assistant queries the right data sources and helps you identify bottleneck spans.

### Practice on play.grafana.org

Use the public demo environment to practice latency investigation with Drilldown apps.

> Note
> 
> Data in play.grafana.org fluctuates based on demo environment activity. The demo uses services like `frontend` and `checkoutservice` rather than the `api_server` scenario.

1. Open [play.grafana.org](https://play.grafana.org) and navigate to **Drilldown** &gt; **Metrics**.
2. Search for latency-related metrics and note which services show elevated p99 values.
3. Navigate to **Drilldown** &gt; **Traces**.
4. Look at the **duration histogram** (blue panel on the right). Traces on the right side have higher latency.
5. Click on a slow trace (high duration) to open the trace view.
6. Examine the span timeline to see which span took the most time.
7. Compare with a fast trace to identify what’s different.
8. Navigate to **Drilldown** &gt; **Profiles** to see if CPU profiling data is available for the slow service.

## Analyze your findings

Combine findings from all signals:

| Signal           | Example finding                                                     |
|------------------|---------------------------------------------------------------------|
| Metrics          | p99 for `/users` went from 150ms to 800ms at 8:30 PM                |
| Traces           | Time spent in `user-db` span (600ms instead of 50ms)                |
| Trace comparison | Slow traces use `LIKE '%...'` query, fast traces use indexed lookup |
| Resources        | Database CPU spiked at 8:30 PM                                      |
| Profiles         | `UserRepository.findByEmail()` consuming 65% of CPU                 |

*Root cause*: A recent code change introduced a query pattern that triggers full table scans instead of using the email index.
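
To see why a leading-wildcard `LIKE` behaves differently from an exact match, you can ask a database for its query plan. The sketch below uses an in-memory SQLite database as a stand-in, because the scenario does not name the database engine. The table layout, index name, and email values are hypothetical, and other engines report plans in different formats.

```python
import sqlite3
from typing import List


def query_plan(conn: sqlite3.Connection, sql: str) -> List[str]:
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# A leading wildcard gives the index no prefix to search on, so every row is read.
slow_plan = query_plan(conn, "SELECT * FROM users WHERE email LIKE '%@example.com'")

# An exact match can go straight to the matching index entries.
fast_plan = query_plan(conn, "SELECT * FROM users WHERE email = 'ada@example.com'")

print("LIKE '%...':", slow_plan)
print("exact match:", fast_plan)
conn.close()
```

In SQLite, the first plan reports a scan of the `users` table and the second reports a search using `idx_users_email`. The exact wording varies between SQLite versions.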

## Tips

- Start with metrics. Quantify the problem first to understand scope.
- Check downstream services. Investigate dependencies before assuming the issue is in your code.
- Use traces before profiles. Identify what to profile from traces.

## Next steps

- [Respond to an alert](/docs/grafana-cloud/telemetry-signals/workflows/respond-to-alert/) - Triage alerts and route to the right workflow
- [Troubleshoot an error](/docs/grafana-cloud/telemetry-signals/workflows/troubleshoot-error/) - Investigate errors using metrics, logs, and traces
- [Find slow code from a trace](/docs/grafana-cloud/telemetry-signals/workflows/find-slow-code-from-trace/) - Navigate from traces to profiles
