Troubleshoot an error

Using metrics, logs, traces, and profiles together creates a clear investigation path. Metrics show when the error rate increased and which services are affected. Logs reveal the actual error messages. Traces show the full request path and which endpoints are failing. Profiles identify the exact code causing the problem. These signals share the same time ranges and labels. You can pivot between them without manually aligning timestamps. This turns investigation from scattered tool-hopping into a guided workflow.

This workflow shows how to investigate an error using metrics, logs, and traces together.

You can try this workflow on play.grafana.org or on your own Grafana Cloud instance. Refer to Before you begin for more information.

What you’ll achieve

After completing this workflow, you’ll be able to:

  • Use metrics to understand error scope and timing
  • Find specific error messages in logs
  • Trace failing requests to understand the request flow
  • Identify root causes by correlating findings across signals

Scenario: Database connection errors

This scenario continues the investigation from Respond to an alert.

You’ve triaged the “High Error Rate - API Server” alert and confirmed the /users endpoint is throwing errors. The error rate jumped from 0.2% to 6.3% at 8:45 PM. Now you need to find the root cause.

Example: Investigate errors

Here’s the investigation flow using different signals:

  1. Metrics confirm errors started at 8:45 PM, concentrated on the /users endpoint
  2. Logs reveal “database connection timeout” and “connection pool exhausted” messages
  3. Traces show requests to /users calling the user-db service, which is timing out after 30 seconds
  4. Profiles show the database connection pool is saturated

This investigation reveals that the database connection pool is exhausted because slow queries aren’t releasing connections fast enough. The immediate fix is to restart the service to clear the pool, but the underlying cause requires investigating the slow queries.

To investigate the scenario, you can use the Grafana Drilldown apps. For detailed guidance on using Drilldown apps, refer to Simplified exploration.

Check error metrics

  1. Navigate to Drilldown > Metrics.
  2. Search for error-related metrics, for example, http_requests or errors.
  3. Filter by your service label and look for 5xx status codes.
  4. Note when errors started and which endpoints are affected.
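
If you prefer to write the query yourself (for example, in Explore), a PromQL sketch like the following breaks the error rate down by endpoint. The metric and label names (http_requests_total, service, endpoint, status_code) are assumptions; substitute whatever your instrumentation exports.

    # Percentage of 5xx responses per endpoint for one service over the last 5 minutes.
    # Metric and label names are assumptions; adjust them to your instrumentation.
    100 *
      sum by (endpoint) (rate(http_requests_total{service="api_server", status_code=~"5.."}[5m]))
    /
      sum by (endpoint) (rate(http_requests_total{service="api_server"}[5m]))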

Find error logs

  1. Navigate to Drilldown > Logs.
  2. Filter by your service name.
  3. Look for error-level messages or search for “error” in the log content.
  4. Expand log lines to see full error messages and stack traces.
  5. Note any trace IDs in the logs. You’ll use these to find the related traces in the next step.
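
You can also jump straight to error lines with a LogQL query. Here is a minimal sketch, assuming the service is labeled service_name="api_server" in Loki:

    {service_name="api_server"} |~ `(?i)error`

If the service emits logfmt-structured logs, filtering on the parsed level is more precise:

    {service_name="api_server"} | logfmt | level="error"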

Trace failing requests

  1. If you found a trace ID in the logs, click it to jump directly to the trace.
  2. Otherwise, navigate to Drilldown > Traces, filter by your service, and set the status to error.
  3. Select an error trace and examine:
    • Which span failed (marked in red)
    • Error messages in span attributes
    • The request path that led to the error
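
If you’d rather query Tempo directly, a TraceQL sketch for this scenario looks like the following. The service names come from the example, so replace them with your own:

    { resource.service.name = "api_server" && status = error }

To surface the slow downstream calls directly, you can filter on span duration instead:

    { resource.service.name = "user-db" && duration > 25s }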

Check downstream services

When errors point to a downstream service, investigate it:

  1. Check its metrics for availability issues.
  2. Check its logs for error messages.
  3. If resource issues appear (high memory, CPU), use Drilldown > Profiles to see what code is consuming resources.
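
For a connection pool problem like the one in this scenario, a PromQL saturation check might look like the following. Both metric names are hypothetical placeholders; database clients and exporters each use their own naming:

    # Fraction of the connection pool in use for the downstream database service.
    # Metric names are hypothetical; use whatever your database client exports.
    sum(db_connection_pool_active{service="user-db"})
    /
    sum(db_connection_pool_max{service="user-db"})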

Analyze your findings

This table summarizes the findings for the example scenario.

Signal  | Example finding
Metrics | Errors started at 8:45 PM on the /users endpoint
Logs    | “database connection timeout” and “connection pool exhausted”
Traces  | Calls to user-db service timing out after 30 seconds

The root cause is that the database connection pool is exhausted because slow queries aren’t releasing connections.

Try the workflow

Want to try the workflow yourself? Use the public demo environment on play.grafana.org or Grafana Assistant in your own Grafana Cloud instance.

Quick investigation with Grafana Assistant

If you have Grafana Cloud with Grafana Assistant, you can investigate errors quickly with natural language:

  1. Open Grafana Assistant (Ctrl+I or Cmd+I).

  2. Ask about the error:

    “Show error logs for api_server”

    “What traces have errors in the last hour?”

    “Which services have the highest error rate?”

Assistant queries the right data sources and helps you correlate findings across signals.

Practice on play.grafana.org

Use the public demo environment to practice error investigation with Drilldown apps.

Note

Data in play.grafana.org fluctuates based on demo environment activity. The demo uses services like frontend and nginx-json rather than the api_server scenario.

  1. Open play.grafana.org and navigate to Drilldown > Metrics.

  2. Search for error-related metrics like http_requests or filter for 5xx status codes.

  3. Note when errors started and which services are affected.

  4. Navigate to Drilldown > Logs.

  5. Look at the service breakdown—services like nginx-json show error counts (you may see 500+ errors).

    (Screenshot: Logs Drilldown showing services with error counts)

  6. Click on a service with errors to filter the logs.

  7. Look for error-level messages and expand log lines to see details.

  8. Navigate to Drilldown > Traces and select the Errors rate panel to see error patterns.

  9. Click the Traces tab to see individual error traces and examine the failing spans.
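
If you want to query the demo data directly, a LogQL sketch such as the following surfaces the nginx-json error responses. The json parser and the status field name are assumptions about the demo’s log format, so adjust them if the fields differ:

    {service_name="nginx-json"} | json | status >= 500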

Tips

  • Don’t query all logs first: Start with metrics to narrow scope.
  • Don’t assume trace sampling is broken: Sampling is designed to drop some traces.
  • Don’t ignore label standardization: Mismatched labels break correlation.
  • Don’t use logs for counters: Metrics are cheaper and faster.
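
To illustrate the last tip, counting errors by scanning logs works but touches every log line in the range, while a counter metric is already aggregated. Both sketches reuse the assumed names from earlier in this workflow.

Counting from logs (LogQL, expensive at scale):

    sum(count_over_time({service_name="api_server"} |~ `(?i)error` [5m]))

Reading an existing counter (PromQL, cheap and fast):

    sum(rate(http_requests_total{service="api_server", status_code=~"5.."}[5m]))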

Label mapping reference

Concept     | Metrics (Prometheus) | Logs (Loki)  | Traces (Tempo)
Service     | service or job       | service_name | resource.service.name
Instance    | instance             | host or pod  | resource.host.name
Environment | env                  | environment  | deployment.environment
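
To show how the mapping works in practice, here is the same service filter expressed in each query language, using the scenario’s api_server service and the label names from the table. The metric name http_requests_total is an assumption.

Metrics (PromQL):

    rate(http_requests_total{service="api_server"}[5m])

Logs (LogQL):

    {service_name="api_server"}

Traces (TraceQL):

    { resource.service.name = "api_server" }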

Next steps