Troubleshoot Grafana Cloud Traces
Having trouble? Try the suggestions on this page to help resolve issues.
Quick checks
- Confirm the correct Tempo endpoint for your stack and region.
  - OTLP/HTTP: https://<stack>.grafana.net/tempo
  - OTLP/gRPC: <stack>.grafana.net:443 (no path)
- Use the instance ID (numeric) as the username and a Cloud Access Policy token with the traces:write scope for collectors (see the sketch after this list).
- Align exporter protocol and endpoint (OTLP gRPC vs HTTP); don’t send gRPC to the HTTP path.
- If behind a proxy, ensure TLS interception is handled (install CA or bypass Grafana domains).
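If you send traces through Grafana Alloy, a minimal sketch of the endpoint and authentication pieces looks like the following. The hostname, instance ID, and token are placeholders you replace with your own values.

```alloy
// Minimal sketch: export traces to Grafana Cloud over OTLP/gRPC.
// Replace the placeholder endpoint and credentials with your stack's values.
otelcol.exporter.otlp "grafana_cloud" {
  client {
    // OTLP/gRPC endpoint: host and port only, no path.
    endpoint = "<stack>.grafana.net:443"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

// Numeric instance ID as the username, Cloud Access Policy token with the
// traces:write scope as the password.
otelcol.auth.basic "grafana_cloud" {
  username = "<instance-id>"
  password = "<cloud-access-policy-token>"
}
```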
Ingestion issues
Error: 401/403 Unauthorized or Forbidden
- Use a token with the traces:write scope. Regenerate it if expired.
- Ensure the username is your instance ID (numeric).
Error: 404/415 or connection failures
- Verify the endpoint includes the /tempo path for OTLP/HTTP (see the sketch below).
- Match the exporter to the protocol (HTTP vs gRPC).
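If you use OTLP/HTTP, a minimal Alloy exporter sketch follows. Note the /tempo path, which the gRPC endpoint must not include; the hostname is a placeholder, and the auth reference assumes the otelcol.auth.basic component from the earlier sketch.

```alloy
// Minimal sketch: export traces to Grafana Cloud over OTLP/HTTP.
otelcol.exporter.otlphttp "grafana_cloud_http" {
  client {
    // OTLP/HTTP endpoint: includes the /tempo path.
    endpoint = "https://<stack>.grafana.net/tempo"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
```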
Validate from Grafana Alloy:
- Open the Grafana Alloy UI at http://localhost:12345 and confirm that otelcol.receiver.otlp, otelcol.processor.batch, and otelcol.exporter.otlp are healthy.
- Format and validate the configuration: alloy fmt /path/to/config.alloy
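For reference, a minimal pipeline that wires up those three components might look like this. It assumes the otelcol.exporter.otlp.grafana_cloud exporter from the sketch in the quick checks.

```alloy
// Minimal sketch: receive OTLP, batch, and forward to the exporter defined earlier.
otelcol.receiver.otlp "default" {
  grpc {} // listen on the default OTLP/gRPC port (4317)
  http {} // listen on the default OTLP/HTTP port (4318)

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input]
  }
}
```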
Discarded traces
Grafana Cloud Traces enforces ingestion limits to protect shared infrastructure. Spans that exceed these limits are discarded, which can cause missing traces or gaps in your data.
Check if spans are being discarded
Open the Billing dashboard in your Grafana Cloud stack and look at the Discarded Spans panel. This panel shows discard rates broken down by reason. For more information about the panel, refer to Discarded Spans panel.
You can also query the grafanacloud-usage data source directly to see discard rates by reason using these PromQL queries:
sum by (reason) (grafanacloud_traces_instance_discarded_spans_total:rate5m)
To narrow to a specific reason, for example, traces that exceeded the size limit:
grafanacloud_traces_instance_discarded_spans_total:rate5m{reason="trace_too_large"}
Common discard reasons
The following table lists the discard reasons, what each one means, and what you can do. Cloud users can’t change ingestion limits directly. Contact Grafana Support to request a limit increase.
For details on how RESOURCE_EXHAUSTED errors interact with collector retry behavior, refer to Retry on RESOURCE_EXHAUSTED failure.
Identify which services are causing discards
The grafanacloud_traces_instance_discarded_spans_total:rate5m metric shows that spans are being discarded and why, but not which services are responsible.
Find oversized trace IDs in usage insights
For trace_too_large discards, Cloud Traces logs the trace ID of each oversized trace to the Usage Insights data source.
To find these trace IDs:
Open Explore and select the grafanacloud-<YOUR-STACK-NAME>-usage-insights data source, then run the following LogQL query:
{instance_type="traces"} |= "TRACE_TOO_LARGE"
Each matching log line contains the trace ID and size information:
level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941
- trace is the trace ID.
- totalSize is the cumulative size of the trace in bytes.
- max is the configured max_bytes_per_trace limit for your tenant.
Copy the trace ID from the log line and query it in your Tempo data source to identify which services contributed spans:
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
Set the time range to match the period when the trace was active.
For more information about the Usage Insights data source, refer to Usage Insights dashboards.
Find traces with many spans
If you don’t have a specific trace ID, you can use TraceQL to find traces that are likely candidates for trace_too_large discards based on span count.
TraceQL queries are performed in Explore with your tracing data source selected.
Query for traces with a high span count:
{} | count() > 10000 | select(name, resource.service.name)
- {} selects all traces.
- count() > 10000 filters to traces with more than 10,000 spans (adjust the threshold for your workload).
- select(name, resource.service.name) returns the root span name and service name.
When root span information is missing
Some results may show <root span not yet received> instead of a service name.
This can happen for several reasons:
- Instrumentation is broken or incomplete (disconnected spans).
- The root span was filtered out before reaching Cloud Traces.
- The trace was too large and some spans were discarded.
- The trace is long-running. Root spans tend to arrive last, so the root span may not have been received yet.
- Spans are stored across different backend blocks.
To identify the service for a specific trace, take its trace ID and run:
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
This shows which services contributed spans to the trace, even when the root span is missing. Set the time range to match the period when the trace was active.
To monitor trace completeness at scale, query the following metrics from the grafanacloud-usage data source:
- grafanacloud_traces_instance_percentage_traces_with_root_spans_flushed shows the percentage of traces that include a root span when flushed to storage. A low value indicates that root spans are frequently missing.
- grafanacloud_traces_instance_percentage_complete_traces_flushed shows the percentage of traces flushed without orphaned spans. A low value suggests broken instrumentation or partial discards.
For the full list of available traces metrics, refer to Cloud Traces usage.
Reduce the discard rate
After you identify the services and reasons for discards:
- Use Adaptive Traces for managed tail sampling without operational overhead. Refer to Adaptive Traces.
- Configure sampling in your collector (head or tail sampling) to reduce volume from noisy services; see the sketch after this list. Refer to Sampling strategies.
- Fix noisy instrumentation. Common causes of oversized traces include retry loops, unbounded fan-out, and debug-level instrumentation left enabled in production.
- Contact Grafana Support to raise limits if the ingestion volume is legitimate and you need higher throughput or larger trace sizes.
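As one illustration of tail sampling in the collector, the following sketch of Alloy's otelcol.processor.tail_sampling keeps traces that contain errors plus a 10% sample of the rest. The policy names, values, and exporter reference are illustrative, not recommendations.

```alloy
// Sketch: keep error traces and a 10% probabilistic sample of everything else.
otelcol.processor.tail_sampling "default" {
  decision_wait = "10s" // how long to buffer spans before deciding

  policy {
    name = "keep-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  policy {
    name = "sample-10-percent"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input]
  }
}
```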
If you run self-managed Tempo, refer to Manage trace ingestion in the Tempo documentation for configuration options you can change directly.
TraceQL and search
Refer to TraceQL documentation for the TraceQL language reference.
Quick diagnostics
- Select the correct Tempo data source/tenant and widen the time range (for example, Last 6h).
- Start from Search builder: add resource.service.name first, run, then add one filter at a time.
- Sanity-check ingestion with a known trace ID:
  { trace:id = "0123456789abcdef" }
- Prefer trace-level intrinsic fields for speed:
  { trace:rootService = "api-gateway" && trace:rootName = "GET /health" }
  { trace:duration > 2s }
- Check obvious errors first:
  { span:status = error }
- Verify attribute scopes/names (resource versus span), aligned to the OpenTelemetry semantic conventions:
  resource.service.name, resource.deployment.environment
  span.http.request.method, span.http.response.status_code, span.http.route
- Remove pipes/group-bys; re-add incrementally:
  { span:status = error } | by(resource.service.name) | count() > 1
- Confirm quoted strings and duration units (ms, s, ns).
- Avoid broad regular expressions until basic filters return results.
Syntax errors
- Quote string values and separate multi‑value group‑bys with commas.
- Prefer exact attribute names and scopes (case as emitted).
- Examples:
{ span.http.request.method = "GET" }
{ trace:duration > 2s }
Troubleshoot Service Graph and RED metrics
Some common issues with Service Graph and RED metrics are:
- Nothing shows in Service Graph
- Monitor generator health
- Late spans and slack period
Service Graph is empty
The metrics-generator is responsible for generating the service graph and RED metrics. Metrics-generation is disabled by default. You can enable it with the default settings in Application Observability, or contact Grafana Support to enable metrics-generation for your organization with custom settings.
Verify that the span kinds are set to the correct values. Span kinds are limited to SERVER/CONSUMER by default; you can extend this to CLIENT/PRODUCER if needed.
Verify that the aggregation is set correctly. Aggregation may hide certain labels, so check that the required dimensions are included.
Verify that the time range is correct and that there is sufficient recent traffic to generate metrics.
Monitor generator health
- Query the grafanacloud_traces_instance_metrics_generator_* metrics in the grafanacloud-usage data source:
  grafanacloud_traces_instance_metrics_generator_active_series{}
  grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}
  grafanacloud_traces_instance_metrics_generator_discarded_spans_per_second{}
  grafanacloud_traces_instance_metrics_generator_label_cardinality_demand_estimate{} - estimated distinct values per label
  grafanacloud_traces_instance_metrics_generator_label_values_limited_per_second{} - rate of label values capped by per-label limiting
Exemplars
- Use Time series panels and toggle Exemplars on.
- Ensure OpenMetrics output and exemplars with trace IDs; set send_exemplars = true in the Alloy remote_write endpoint (see the sketch after this list).
- Verify with: curl -H "Accept: application/openmetrics-text" http://<app>/metrics | grep -i traceid
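For the send_exemplars setting, a minimal Alloy remote_write sketch follows; the URL and credentials are placeholders.

```alloy
// Sketch: write metrics with exemplars to Grafana Cloud.
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url            = "https://<prometheus-stack>.grafana.net/api/prom/push"
    send_exemplars = true // forward exemplars (trace IDs) with samples

    basic_auth {
      username = "<instance-id>"
      password = "<cloud-access-policy-token>"
    }
  }
}
```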
Rate limiting and retry
- Treat RESOURCE_EXHAUSTED errors as retryable.
- Configure sending_queue and retry_on_failure in exporters to control memory use and retries (see the sketch after this list).
- For details, refer to Retry on RESOURCE_EXHAUSTED failure.
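A sketch of those two blocks on the Alloy OTLP exporter, extending the exporter from the quick checks; the values are illustrative, not recommendations.

```alloy
// Sketch: bound memory with the sending queue and retry transient failures
// such as RESOURCE_EXHAUSTED.
otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }

  sending_queue {
    enabled    = true
    queue_size = 5000 // batches buffered in memory before new data is dropped
  }

  retry_on_failure {
    enabled          = true
    initial_interval = "5s"
    max_interval     = "30s"
    max_elapsed_time = "5m" // stop retrying after this long
  }
}
```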
Late spans and slack period
- If you see “Spans arrive too late,” the spans ended before the start of the metrics-generator slack period, so they are not included in generated metrics.
- Possible solutions:
- Reduce tail sampling decision wait and batch timeouts. Refer to Sampling for more information.
- Request increased metrics-generator slack (reduces metrics granularity).


