---
title: "Troubleshoot an error | Grafana Cloud documentation"
description: "A step-by-step workflow for investigating errors using correlated telemetry signals."
---


# Troubleshoot an error

Using metrics, logs, traces, and profiles together creates a clear investigation path. Metrics show when the error rate increased and which services are affected. Logs reveal the actual error messages. Traces show the full request path and which endpoints are failing. Profiles identify the exact code causing the problem. These signals share the same time ranges and labels. You can pivot between them without manually aligning timestamps. This turns investigation from scattered tool-hopping into a guided workflow.

This workflow shows how to investigate an error using metrics, logs, and traces together.

You can try this workflow on [play.grafana.org](https://play.grafana.org) or on your own Grafana Cloud instance. Refer to [Before you begin](/docs/grafana-cloud/telemetry-signals/workflows/#before-you-begin) for more information.

## What you’ll achieve

After completing this workflow, you’ll be able to:

- Use metrics to understand error scope and timing
- Find specific error messages in logs
- Trace failing requests to understand the request flow
- Identify root causes by correlating findings across signals

## Scenario: Database connection errors

This scenario continues the investigation from [Respond to an alert](/docs/grafana-cloud/telemetry-signals/workflows/respond-to-alert/).

You’ve triaged the “High Error Rate - API Server” alert and confirmed the `/users` endpoint is throwing errors. The error rate jumped from 0.2% to 6.3% at 8:45 PM. Now you need to find the root cause.

## Example: Investigate errors

Here’s the investigation flow using different signals:

1. Metrics confirm errors started at 8:45 PM, concentrated on the `/users` endpoint
2. Logs reveal “database connection timeout” and “connection pool exhausted” messages
3. Traces show requests to `/users` calling the `user-db` service, which is timing out after 30 seconds
4. Profiles show the database connection pool is saturated

This investigation reveals that the database connection pool is exhausted because slow queries aren’t releasing connections fast enough. The immediate fix is to restart the service to clear the pool, but the underlying cause requires investigating the slow queries.

To investigate the scenario, you can use the Grafana Drilldown apps. For detailed guidance on using Drilldown apps, refer to [Simplified exploration](/docs/grafana-cloud/visualizations/simplified-exploration/).

### Check error metrics

1. Navigate to **Drilldown** > **Metrics**.
2. Search for error-related metrics, for example, `http_requests` or `errors`.
3. Filter by your service label and look for 5xx status codes.
4. Note when errors started and which endpoints are affected.
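As a concrete example, the steps above might translate into PromQL queries like the following. The metric and label names (`http_requests_total`, `service`, `status`, `endpoint`) are illustrative; substitute whatever your instrumentation actually exports.

```promql
# Overall 5xx error ratio for the service over the last 5 minutes.
sum(rate(http_requests_total{service="api-server", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api-server"}[5m]))

# Break the 5xx rate down by endpoint to see which one is affected.
sum by (endpoint) (
  rate(http_requests_total{service="api-server", status=~"5.."}[5m])
)
```

Graphing the first query over a wide time range shows when the error ratio jumped; the second narrows the problem to specific endpoints.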

### Find error logs

1. Navigate to **Drilldown** > **Logs**.
2. Filter by your service name.
3. Look for error-level messages or search for “error” in the log content.
4. Expand log lines to see full error messages and stack traces.
5. Look for trace IDs in the logs. You can use these to jump directly to the matching traces.
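The filtering in these steps corresponds to LogQL queries like the following. The `service_name` label and the logfmt log format are assumptions; adjust them to match your own labels and log format.

```logql
# All error-level lines for the service, parsed as logfmt.
{service_name="api-server"} | logfmt | level="error"

# Or a simple substring search over the raw log content.
{service_name="api-server"} |= "connection pool exhausted"
```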

### Trace failing requests

1. If you find a trace ID in the logs, click it to jump directly to the trace.
2. Alternatively, navigate to **Drilldown** > **Traces**, filter by service, and set the status to **error**.
3. Select an error trace and examine:
   
   - Which span failed (marked in red)
   - Error messages in span attributes
   - The request path that led to the error
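If you prefer to query traces directly, the same filter can be expressed in TraceQL. The service name is illustrative, and the 30-second threshold matches the timeout observed in this scenario.

```traceql
// Error-status spans for the service.
{ resource.service.name = "api-server" && status = error }

// Narrow to spans that also hit the ~30-second timeout.
{ resource.service.name = "api-server" && status = error && duration > 29s }
```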

### Check downstream services

When errors point to a downstream service, investigate it:

1. Check its metrics for availability issues.
2. Check its logs for error messages.
3. If resource issues appear (high memory or CPU usage), use **Drilldown** > **Profiles** to see which code is consuming resources.
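If the downstream service exports connection pool metrics, a quick saturation check might look like this in PromQL. The metric names here are hypothetical placeholders; check what your database client or exporter actually exposes.

```promql
# Fraction of the connection pool in use (1.0 means exhausted).
# Metric names are hypothetical placeholders.
db_connection_pool_active{service="user-db"}
/
db_connection_pool_max{service="user-db"}
```

A value pinned at or near 1.0 during the incident window supports the pool-exhaustion finding from the logs and traces.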

### Analyze your findings

This table summarizes the findings for the example scenario.


| Signal  | Example finding                                               |
|---------|---------------------------------------------------------------|
| Metrics | Errors started at 8:45 PM on the `/users` endpoint            |
| Logs    | “database connection timeout” and “connection pool exhausted” |
| Traces  | Calls to `user-db` service timing out after 30 seconds        |

The root cause is that the database connection pool is exhausted because slow queries aren’t releasing connections.

## Try the workflow

Want to try the workflow yourself? Use the public demo environment on [play.grafana.org](https://play.grafana.org) or Grafana Assistant in your own Grafana Cloud instance.

### Quick investigation with Grafana Assistant

If you have Grafana Cloud with [Grafana Assistant](/docs/grafana-cloud/machine-learning/assistant/), you can investigate errors quickly with natural language:

1. Click the **sparkle icon** in the top navigation bar to open **Grafana Assistant**.
2. Ask about the error:
   
   > “Show error logs for `api_server`”
   > 
   > “What traces have errors in the last hour?”
   > 
   > “Which services have the highest error rate?”

Assistant queries the right data sources and helps you correlate findings across signals.

### Practice on play.grafana.org

Use the public demo environment to practice error investigation with Drilldown apps.

> Note
> 
> Data in play.grafana.org fluctuates based on demo environment activity. The demo uses services like `frontend` and `nginx-json` rather than the `api_server` scenario.

1. Open [play.grafana.org](https://play.grafana.org) and navigate to **Drilldown** > **Metrics**.
2. Search for error-related metrics like `http_requests` or filter for 5xx status codes.
3. Note when errors started and which services are affected.
4. Navigate to **Drilldown** > **Logs**.
5. Look at the service breakdown. Services like `nginx-json` show error counts (you may see 500+ errors).
6. Click a service with errors to filter the logs.
7. Look for error-level messages and expand log lines to see details.
8. Navigate to **Drilldown** > **Traces** and select the **Errors rate** panel to see error patterns.
9. Click the **Traces** tab to see individual error traces and examine the failing spans.

## Tips

- Don’t query all logs first: Start with metrics to narrow scope.
- Don’t assume trace sampling is broken: Sampling is designed to drop some traces.
- Don’t ignore label standardization: Mismatched labels break correlation.
- Don’t use logs for counters: Metrics are cheaper and faster.

## Label mapping reference


| Concept     | Metrics (Prometheus) | Logs (Loki)     | Traces (Tempo)           |
|-------------|----------------------|-----------------|--------------------------|
| Service     | `service` or `job`   | `service_name`  | `resource.service.name`  |
| Instance    | `instance`           | `host` or `pod` | `resource.host.name`     |
| Environment | `env`                | `environment`   | `deployment.environment` |
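For example, the same service can be selected in each signal as long as the label values match across systems. The service name and metric name below are illustrative.

```
Metrics (PromQL):  sum(rate(http_requests_total{service="checkout"}[5m]))
Logs (LogQL):      {service_name="checkout"} |= "error"
Traces (TraceQL):  { resource.service.name = "checkout" }
```

If the values differ (for example, `checkout` in metrics but `checkout-svc` in logs), pivoting between signals breaks, which is why label standardization matters.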

## Next steps

- [Respond to an alert](/docs/grafana-cloud/telemetry-signals/workflows/respond-to-alert/) - Triage alerts and route to the right workflow
- [Investigate slow performance](/docs/grafana-cloud/telemetry-signals/workflows/investigate-slow-performance/) - Investigate latency issues
- [Find slow code from a trace](/docs/grafana-cloud/telemetry-signals/workflows/find-slow-code-from-trace/) - Navigate from traces to profiles
