Troubleshoot Grafana Cloud Traces
Having trouble? Try the suggestions on this page to help resolve issues.
Quick checks
- Confirm the correct Tempo endpoint for your stack and region:
  - OTLP/HTTP: `https://<stack>.grafana.net/tempo`
  - OTLP/gRPC: `<stack>.grafana.net:443` (no path)
- Use the instance ID (numeric) as the username and a Cloud Access Policy token with the `traces:write` scope for collectors.
- Align the exporter protocol and endpoint (OTLP gRPC vs HTTP); don't send gRPC to the HTTP path.
- If behind a proxy, ensure TLS interception is handled (install CA or bypass Grafana domains).
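The checks above can be sketched as a minimal Alloy exporter configuration; the stack name, instance ID, and token environment variable are placeholders you must replace with your own values:

```alloy
// Basic auth: numeric instance ID as username, Cloud Access Policy token as password.
otelcol.auth.basic "grafana_cloud" {
  username = "<instance-id>"
  password = sys.env("GRAFANA_CLOUD_TRACES_TOKEN")
}

// OTLP/HTTP exporter pointing at the stack's Tempo endpoint (note the /tempo path).
otelcol.exporter.otlphttp "grafana_cloud" {
  client {
    endpoint = "https://<stack>.grafana.net/tempo"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
```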
Ingestion issues
Error: 401/403 Unauthorized or Forbidden
- Use a token with the `traces:write` scope. Regenerate it if expired.
- Ensure the username is your instance ID (numeric).
Error: 404/415 or connection failures
- Verify the endpoint includes `/tempo` for HTTP OTLP.
- Match the exporter to the protocol (HTTP vs gRPC).
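As a quick connectivity check, you can POST an empty OTLP payload with curl; `<stack>` and `<instance-id>` are placeholders, and the `/v1/traces` suffix assumes the standard OTLP/HTTP path that exporters append to the endpoint:

```shell
# Expect a 2xx status code if the endpoint and credentials are valid.
curl -s -o /dev/null -w "%{http_code}\n" \
  -u "<instance-id>:$GRAFANA_CLOUD_TRACES_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}' \
  "https://<stack>.grafana.net/tempo/v1/traces"
```

A 401 or 403 here points at credentials; a 404 or 415 points at the path or protocol mismatch described above.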
Validate from Grafana Alloy:
- Open the Grafana Alloy UI at `http://localhost:12345` and confirm that `otelcol.receiver.otlp`, `otelcol.processor.batch`, and `otelcol.exporter.otlp` are healthy.
- Format and validate the configuration: `alloy fmt /path/to/config.alloy`.
TraceQL and search
Refer to TraceQL documentation for the TraceQL language reference.
Quick diagnostics
- Select the correct Tempo data source/tenant and widen the time range (for example, Last 6h).
- Start from the Search builder: add `resource.service.name` first, run, then add one filter at a time.
- Sanity-check ingestion with a known trace ID: `{ trace:id = "0123456789abcdef" }`
- Prefer trace-level intrinsic fields for speed: `{ trace:rootService = "api-gateway" && trace:rootName = "GET /health" }`, `{ trace:duration > 2s }`
- Check obvious errors first: `{ span:status = error }`
- Verify attribute scopes/names (resource versus span), aligned to the OpenTelemetry semantic conventions: `resource.service.name`, `resource.deployment.environment`, `span.http.request.method`, `span.http.response.status_code`, `span.http.route`
- Remove pipes/group-bys; re-add incrementally: `{ span:status = error } | by(resource.service.name) | count() > 1`
- Confirm quoted strings and duration units (`ms`, `s`, `ns`).
- Avoid broad regular expressions until basic filters return results.
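The one-filter-at-a-time approach above can be sketched as a narrowing sequence of queries; the `checkout` service name is a hypothetical example:

```traceql
{ resource.service.name = "checkout" }
{ resource.service.name = "checkout" && span:status = error }
{ resource.service.name = "checkout" && span:status = error && trace:duration > 2s }
```

Run each query before adding the next condition, so the first one that returns nothing tells you which filter is wrong.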
Syntax errors
- Quote string values and separate multi‑value group‑bys with commas.
- Prefer exact attribute names and scopes (case as emitted).
- Examples:
  - `{ span.http.request.method = "GET" }`
  - `{ trace:duration > 2s }`
Troubleshoot Service Graph and RED metrics
Some common issues with Service Graph and RED metrics are:
- Nothing shows in Service Graph
- Monitor generator health
- Late spans and slack period
Service Graph is empty
The metrics-generator produces the service graph and RED metrics, and it is disabled by default. You can enable it with the Application Observability defaults, or contact Grafana Support to enable metrics-generation for your organization with custom settings.
- Verify the span kinds: only SERVER and CONSUMER spans are used by default; extend to CLIENT and PRODUCER if needed.
- Verify the aggregation: aggregation may hide certain labels, so confirm the required dimensions are kept.
- Verify the time range: ensure there is sufficient recent traffic to generate metrics.
Monitor generator health
- Use the `grafanacloud-usage` data source to query the `grafanacloud_traces_instance_metrics_generator_*` metrics, for example:
  - `grafanacloud_traces_instance_metrics_generator_active_series{}`
  - `grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}`
  - `grafanacloud_traces_instance_metrics_generator_discarded_spans_per_second{}`
Exemplars
- Use Time series panels and toggle Exemplars on.
- Ensure OpenMetrics output and exemplars with trace IDs; set `send_exemplars = true` in the Alloy `remote_write` endpoint.
- Verify with: `curl -H "Accept: application/openmetrics-text" http://<app>/metrics | grep -i traceid`
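A minimal `prometheus.remote_write` sketch with exemplars enabled; the endpoint URL, instance ID, and token environment variable are placeholders for your own values:

```alloy
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url            = "https://<prometheus-stack>.grafana.net/api/prom/push"
    send_exemplars = true  // forward exemplars (with trace IDs) alongside samples

    basic_auth {
      username = "<instance-id>"
      password = sys.env("GRAFANA_CLOUD_METRICS_TOKEN")
    }
  }
}
```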
Rate limiting and retry
- Treat errors such as `RESOURCE_EXHAUSTED` as retryable.
- Configure `sending_queue` and `retry_on_failure` in exporters to control memory use and retries.
- For details, refer to Retry on `RESOURCE_EXHAUSTED` failure.
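A sketch of those exporter settings in Alloy; the endpoint is a placeholder, and the specific queue and backoff values are illustrative, not recommendations:

```alloy
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "<stack>.grafana.net:443"
  }

  sending_queue {
    enabled    = true
    queue_size = 5000   // bounds memory used for buffered batches
  }

  retry_on_failure {
    enabled          = true
    initial_interval = "5s"   // first backoff delay
    max_interval     = "30s"  // cap on the backoff delay
    max_elapsed_time = "2m"   // give up after this total retry window
  }
}
```

A bounded queue keeps the collector's memory predictable during rate limiting, at the cost of dropping data once the queue is full.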
Late spans and slack period
- If you see "Spans arrive too late," the spans ended before the slack period.
- Possible solutions:
- Reduce tail sampling decision wait and batch timeouts. Refer to Sampling for more information.
- Request increased metrics-generator slack (reduces metrics granularity).