Troubleshoot Grafana Cloud Traces

Having trouble? Try the suggestions on this page to help resolve issues.

Quick checks

  • Confirm the correct Tempo endpoint for your stack and region.
    • OTLP/HTTP: https://<stack>.grafana.net/tempo
    • OTLP/gRPC: <stack>.grafana.net:443 (no path)
  • For collectors, use the numeric instance ID as the username and a Cloud Access Policy token with the traces:write scope (a configuration sketch follows this list).
  • Align exporter protocol and endpoint (OTLP gRPC vs HTTP); don’t send gRPC to the HTTP path.
  • If behind a proxy, ensure TLS interception is handled (install CA or bypass Grafana domains).
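
A minimal Alloy sketch that combines these checks for the OTLP/gRPC path. The instance ID, token, and stack name below are placeholders; replace them with your own values.

alloy
// Basic auth: numeric instance ID as username, Cloud Access Policy token as password.
otelcol.auth.basic "grafana_cloud" {
  username = "123456"                      // placeholder instance ID
  password = "<cloud-access-policy-token>" // placeholder token with the traces:write scope
}

// OTLP/gRPC exporter: host:port endpoint, no path.
otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}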

Ingestion issues

Error: 401/403 Unauthorized or Forbidden

  • Use a token with traces:write. Regenerate if expired.
  • Ensure the username is your instance ID (numeric).

Error: 404/415 or connection failures

  • Verify the endpoint includes /tempo for HTTP OTLP.
  • Match the exporter type to the protocol (HTTP vs gRPC); see the sketch after this list.
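
For OTLP/HTTP, the exporter type and endpoint change together. A hedged sketch, reusing the auth handler from the example above; the stack name is a placeholder.

alloy
// OTLP/HTTP exporter: URL endpoint that includes the /tempo path.
otelcol.exporter.otlphttp "traces_http" {
  client {
    endpoint = "https://<stack>.grafana.net/tempo"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}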

Validate from Grafana Alloy:

  • Open the Grafana Alloy UI at http://localhost:12345 and confirm that the otelcol.receiver.otlp, otelcol.processor.batch, and otelcol.exporter.otlp components are healthy (a wiring sketch follows this list).
  • Format the configuration and catch syntax errors with alloy fmt /path/to/config.alloy.
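
If a component is missing from the UI, confirm the pipeline is wired end to end. A minimal wiring sketch, assuming the exporter label from the earlier example:

alloy
// Receive OTLP over gRPC and HTTP, batch, then export; each stage forwards to the next.
otelcol.receiver.otlp "default" {
  grpc {}
  http {}
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input]
  }
}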

Refer to the TraceQL documentation for the query language reference.

Quick diagnostics

  • Select the correct Tempo data source/tenant and widen the time range (for example, Last 6h).

  • Start from Search builder: add resource.service.name first, run, then add one filter at a time.

  • Sanity‑check ingestion with a known trace ID:

    traceql
    { trace:id = "0123456789abcdef" }
  • Prefer trace‑level intrinsic fields for speed:

    traceql
    { trace:rootService = "api-gateway" && trace:rootName = "GET /health" }
    { trace:duration > 2s }
  • Check obvious errors first:

    traceql
    { span:status = error }
  • Verify attribute scopes/names (resource versus span), aligned to the OpenTelemetry semantic conventions:

    • resource.service.name, resource.deployment.environment
    • span.http.request.method, span.http.response.status_code, span.http.route
  • Remove pipes/group‑bys; re‑add incrementally:

    traceql
    { span:status = error } | by(resource.service.name) | count() > 1
  • Confirm quoted strings and duration units (ms, s, ns).

  • Avoid broad regular expressions until basic filters return results.

Syntax errors

  • Quote string values and separate multi‑value group‑bys with commas.
  • Prefer exact attribute names and scopes (case as emitted).
  • Examples:

    traceql
    { span.http.request.method = "GET" }
    { trace:duration > 2s }

Troubleshoot Service Graph and RED metrics

Some common issues with Service Graph and RED metrics are:

  • Service Graph is empty
  • Monitor generator health
  • Late spans and slack period

Service Graph is empty

The metrics-generator is responsible for generating the service graph and RED metrics. Metrics-generation is disabled by default. You can enable it with the default settings in Application Observability, or contact Grafana Support to enable metrics-generation for your organization with custom settings.

Verify that the span kinds are correct. By default, metrics are generated only from SERVER and CONSUMER spans; you can extend this to CLIENT and PRODUCER spans if needed.

Verify the aggregation settings. Aggregation can hide certain labels, so confirm that the dimensions you need are present.

Verify the time range and make sure there is enough recent traffic to generate metrics.

Monitor generator health

  • Use the grafanacloud-usage data source and query the grafanacloud_traces_instance_metrics_generator_* metrics:
    • grafanacloud_traces_instance_metrics_generator_active_series{}
    • grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}
    • grafanacloud_traces_instance_metrics_generator_discarded_spans_per_second{}

Exemplars

  • Use Time series panels and toggle Exemplars on.
  • Ensure your application exposes OpenMetrics output with exemplars that include trace IDs, and set send_exemplars = true in the Alloy remote_write endpoint (see the sketch after this list).
  • Verify with: curl -H "Accept: application/openmetrics-text" http://<app>/metrics | grep -i traceid.
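
A sketch of the corresponding Alloy remote_write configuration; the URL, instance ID, and token are placeholders to replace with your own values.

alloy
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url            = "https://<prometheus-hostname>/api/prom/push" // placeholder metrics endpoint
    send_exemplars = true // forward exemplars with trace IDs alongside samples

    basic_auth {
      username = "<metrics-instance-id>"
      password = "<cloud-access-policy-token>"
    }
  }
}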

Rate limiting and retry

  • Treat errors such as RESOURCE_EXHAUSTED as retryable.
  • Configure sending_queue and retry_on_failure in exporters to control memory use and retries (see the sketch after this list).
  • For details, refer to Retry on RESOURCE_EXHAUSTED failure.
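
A sketch of those blocks on the OTLP exporter shown earlier; the values are illustrative, not recommendations, so tune them for your own memory and throughput budget.

alloy
otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443"
  }

  // Bound the memory used by batches waiting to be sent.
  sending_queue {
    enabled    = true
    queue_size = 5000
  }

  // Retry retryable failures such as RESOURCE_EXHAUSTED with backoff.
  retry_on_failure {
    enabled          = true
    initial_interval = "5s"
    max_elapsed_time = "300s" // give up on a batch after retrying this long
  }
}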

Late spans and slack period

  • If you see “Spans arrive too late,” the spans ended before the metrics-generator’s slack period and are discarded from metrics generation.
  • Possible solutions:
    • Reduce tail sampling decision wait and batch timeouts, as shown in the sketch after this list. Refer to Sampling for more information.
    • Request increased metrics-generator slack (reduces metrics granularity).
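
A sketch of where those timeouts live in Alloy; the values and policy label are illustrative assumptions, not defaults.

alloy
// A shorter decision wait releases spans from tail sampling sooner.
otelcol.processor.tail_sampling "default" {
  decision_wait = "10s"

  policy {
    name = "keep-errors" // placeholder policy
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

// A shorter batch timeout flushes spans to the exporter more quickly.
otelcol.processor.batch "default" {
  timeout = "2s"
  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input]
  }
}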