Troubleshoot Grafana Cloud Traces

Having trouble? Try the suggestions on this page to help resolve issues.

Quick checks

  • Confirm the correct Tempo endpoint for your stack and region.
    • OTLP/HTTP: https://<stack>.grafana.net/tempo
    • OTLP/gRPC: <stack>.grafana.net:443 (no path)
  • Use the instance ID (numeric) as the username and a Cloud Access Policy token with the traces:write scope for collectors.
  • Align exporter protocol and endpoint (OTLP gRPC vs HTTP); don’t send gRPC to the HTTP path.
  • If behind a proxy, ensure TLS interception is handled (install CA or bypass Grafana domains).
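
The endpoint and credential checks above translate into a short Grafana Alloy sketch. The stack name, instance ID, token, and component labels are placeholders, not values from your environment:

```alloy
// Sketch: OTLP/gRPC export to Grafana Cloud Traces.
// Replace <stack>, <instance-id>, and <token> with your values.
otelcol.auth.basic "grafana_cloud" {
  username = "<instance-id>" // numeric instance ID
  password = "<token>"       // Cloud Access Policy token with traces:write
}

otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443" // gRPC: host:port, no /tempo path
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
```

For OTLP/HTTP, use otelcol.exporter.otlphttp with the /tempo path instead.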

Ingestion issues

Error: 401/403 Unauthorized or Forbidden

  • Use a token with traces:write. Regenerate if expired.
  • Ensure the username is your instance ID (numeric).

Error: 404/415 or connection failures

  • Verify the endpoint includes /tempo for HTTP OTLP.
  • Match exporter to protocol (HTTP vs gRPC).

Validate from Grafana Alloy:

  • Open the Grafana Alloy UI at http://localhost:12345 and confirm otelcol.receiver.otlp, otelcol.processor.batch, and otelcol.exporter.otlp are healthy.
  • Format the configuration with alloy fmt /path/to/config.alloy; the command fails on syntax errors, so it doubles as a basic validity check.

Discarded traces

Grafana Cloud Traces enforces ingestion limits to protect shared infrastructure. Spans that exceed these limits are discarded, which can cause missing traces or gaps in your data.

Check if spans are being discarded

Open the Billing dashboard in your Grafana Cloud stack and look at the Discarded Spans panel. This panel shows discard rates broken down by reason. For more information about the panel, refer to Discarded Spans panel.

You can also query the grafanacloud-usage data source directly to see discard rates by reason using these PromQL queries:

```promql
sum by (reason) (grafanacloud_traces_instance_discarded_spans_total:rate5m)
```

To narrow to a specific reason, for example, traces that exceeded the size limit:

```promql
grafanacloud_traces_instance_discarded_spans_total:rate5m{reason="trace_too_large"}
```

Common discard reasons

The following table lists the discard reasons, what each one means, and what you can do. Cloud users can’t change ingestion limits directly. Contact Grafana Support to request a limit increase.

| Reason | Meaning | What you can do |
| --- | --- | --- |
| rate_limited | Tenant byte rate exceeded the ingestion rate limit (default ~0.5 MB/s). | Reduce volume with sampling or Adaptive Traces. Contact Support to raise the limit. |
| trace_too_large | A single trace exceeded the maximum trace size limit (default 5 MB). | Investigate why the trace is large. Refer to Identify which services are causing discards. Contact Support to raise the limit. |
| live_traces_exceeded | Too many concurrent active traces for the tenant. The limit scales with cluster size. | Reduce trace cardinality or batch size. Contact Support to raise the limit. |
| trace_too_large_to_compact | A trace exceeded the size limit during compaction. | Same as trace_too_large. |
| internal_error | Spans rejected during an infrastructure rollout. Alloy and OpenTelemetry Collector retry these automatically, so data loss is unlikely. | Monitor for recurrence. Contact Support if persistent. |

For details on how RESOURCE_EXHAUSTED errors interact with collector retry behavior, refer to Retry on RESOURCE_EXHAUSTED failure.

Identify which services are causing discards

The grafanacloud_traces_instance_discarded_spans_total:rate5m metric shows that spans are being discarded and why, but not which services are responsible.

Find oversized trace IDs in usage insights

For trace_too_large discards, Cloud Traces logs the trace ID of each oversized trace to the Usage Insights data source. To find these trace IDs:

  1. Open Explore and select the grafanacloud-<YOUR-STACK-NAME>-usage-insights data source.

  2. Run the following LogQL query:

    ```logql
    {instance_type="traces"} |= "TRACE_TOO_LARGE"
    ```

Each matching log line contains the trace ID and size information:

```
level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941
```
  • trace is the trace ID.
  • totalSize is the cumulative size of the trace in bytes.
  • max is the configured max_bytes_per_trace limit for your tenant.
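
If you're pulling trace IDs out of many of these lines, a one-line sed extraction works; a sketch in shell, assuming the key=value log format shown above (the example line is the one from this page):

```shell
# Extract the trace ID from a TRACE_TOO_LARGE log line.
line='level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941'

# Capture the hex value after the last "trace=" field.
printf '%s\n' "$line" | sed -n 's/.*trace=\([0-9a-f]*\).*/\1/p'
```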

Copy the trace ID from the log line and query it in your Tempo data source to identify which services contributed spans:

```traceql
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
```

Set the time range to match the period when the trace was active.

For more information about the Usage Insights data source, refer to Usage Insights dashboards.

Find traces with many spans

If you don’t have a specific trace ID, you can use TraceQL to find traces that are likely candidates for trace_too_large discards based on span count. TraceQL queries are performed in Explore with your tracing data source selected.

Query for traces with a high span count:

```traceql
{} | count() > 10000 | select(name, resource.service.name)
```
  • {} selects all traces
  • count() > 10000 filters to traces with more than 10,000 spans (adjust the threshold for your workload)
  • select(name, resource.service.name) returns the root span name and service name

When root span information is missing

Some results may show <root span not yet received> instead of a service name. This can happen for several reasons:

  • Instrumentation is broken or incomplete (disconnected spans).
  • The root span was filtered out before reaching Cloud Traces.
  • The trace was too large and some spans were discarded.
  • The trace is long-running. Root spans tend to arrive last, so the root span may not have been received yet.
  • Spans are stored across different backend blocks.

To identify the service for a specific trace, take its trace ID and run:

```traceql
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
```

This shows which services contributed spans to the trace, even when the root span is missing. Set the time range to match the period when the trace was active.

To monitor trace completeness at scale, query the following metrics from the grafanacloud-usage data source:

  • grafanacloud_traces_instance_percentage_traces_with_root_spans_flushed shows the percentage of traces that include a root span when flushed to storage. A low value indicates that root spans are frequently missing.
  • grafanacloud_traces_instance_percentage_complete_traces_flushed shows the percentage of traces flushed without orphaned spans. A low value suggests broken instrumentation or partial discards.
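
Either metric can be watched with a simple threshold query. The 90 below is an illustrative value for this sketch, not a Grafana recommendation:

```promql
grafanacloud_traces_instance_percentage_complete_traces_flushed < 90
```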

For the full list of available traces metrics, refer to Cloud Traces usage.

Reduce the discard rate

After you identify the services and reasons for discards:

  • Use Adaptive Traces for managed tail sampling without operational overhead. Refer to Adaptive Traces.
  • Configure sampling in your collector (head or tail sampling) to reduce volume from noisy services. Refer to Sampling strategies.
  • Fix noisy instrumentation. Common causes of oversized traces include retry loops, unbounded fan-out, and debug-level instrumentation left enabled in production.
  • Contact Grafana Support to raise limits if the ingestion volume is legitimate and you need higher throughput or larger trace sizes.
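
One common collector-side shape is tail sampling that keeps all error traces and probabilistically samples the rest. A minimal Alloy sketch, with illustrative percentages, timeouts, and component labels (the exporter reference is a placeholder for your own pipeline):

```alloy
// Sketch: tail sampling in Alloy to cut volume before export.
// decision_wait and sampling_percentage are illustrative values.
otelcol.processor.tail_sampling "default" {
  decision_wait = "10s"

  // Always keep traces that contain an error.
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  // Sample a fraction of everything else.
  policy {
    name = "sample-the-rest"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input] // your exporter here
  }
}
```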

If you run self-managed Tempo, refer to Manage trace ingestion in the Tempo documentation for configuration options you can change directly.

Refer to TraceQL documentation for the TraceQL language reference.

Quick diagnostics

  • Select the correct Tempo data source/tenant and widen the time range (for example, Last 6h).

  • Start from Search builder: add resource.service.name first, run, then add one filter at a time.

  • Sanity‑check ingestion with a known trace ID:

    ```traceql
    { trace:id = "0123456789abcdef" }
    ```
  • Prefer trace‑level intrinsic fields for speed:

    ```traceql
    { trace:rootService = "api-gateway" && trace:rootName = "GET /health" }
    { trace:duration > 2s }
    ```
  • Check obvious errors first:

    ```traceql
    { span:status = error }
    ```
  • Verify attribute scopes/names (resource versus span), aligned to the OpenTelemetry semantic conventions:

    • resource.service.name, resource.deployment.environment
    • span.http.request.method, span.http.response.status_code, span.http.route
  • Remove pipes/group‑bys; re‑add incrementally:

    ```traceql
    { span:status = error } | by(resource.service.name) | count() > 1
    ```
  • Confirm quoted strings and duration units (ms, s, ns).

  • Avoid broad regular expressions until basic filters return results.

Syntax errors

  • Quote string values and separate multi‑value group‑bys with commas.
  • Prefer exact attribute names and scopes (case as emitted).
  • Examples:
```traceql
{ span.http.request.method = "GET" }
{ trace:duration > 2s }
```

Troubleshoot Service Graph and RED metrics

Some common issues with Service Graph and RED metrics are:

  • Nothing shows in Service Graph
  • Monitor generator health
  • Late spans and slack period

Service Graph is empty

The metrics-generator produces the service graph and RED metrics, and it is disabled by default. You can enable it with the default settings used by Application Observability, or contact Grafana Support to enable metrics-generation for your organization with custom settings.

If metrics-generation is enabled but the graph is still empty:

  • Verify the span kinds. Processing is limited to SERVER and CONSUMER spans by default; you can extend it to CLIENT and PRODUCER if needed.
  • Verify the aggregation settings. Aggregation can hide certain labels; confirm the dimensions you need are included.
  • Verify the time range, and make sure there is enough recent traffic to generate metrics.

Monitor generator health

  • Use the grafanacloud-usage data source and query the grafanacloud_traces_instance_metrics_generator_* metrics:
    • grafanacloud_traces_instance_metrics_generator_active_series{}
    • grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}
    • grafanacloud_traces_instance_metrics_generator_discarded_spans_per_second{}
    • grafanacloud_traces_instance_metrics_generator_label_cardinality_demand_estimate{} - estimated distinct values per label
    • grafanacloud_traces_instance_metrics_generator_label_values_limited_per_second{} - rate of label values capped by per-label limiting
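
As a starting point, a threshold query over the drop-rate metric surfaces generator problems quickly. This sketch simply fires whenever any series are being dropped; tune it to your own tolerance:

```promql
sum(grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}) > 0
```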

Exemplars

  • Use Time series panels and toggle Exemplars on.
  • Ensure OpenMetrics output and exemplars with trace IDs; set send_exemplars=true in Alloy remote_write.
  • Verify with: curl -H "Accept: application/openmetrics-text" http://<app>/metrics | grep -i traceid.
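
In Alloy, exemplar forwarding is controlled on the remote_write endpoint. A minimal sketch, with the URL and component label as placeholders:

```alloy
prometheus.remote_write "metrics" {
  endpoint {
    url = "<your-prometheus-remote-write-url>" // placeholder

    // Forward exemplars (with trace IDs) alongside samples.
    send_exemplars = true
  }
}
```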

Rate limiting and retry

  • Treat RESOURCE_EXHAUSTED responses as retryable; the collector should back off and resend rather than drop spans.
  • Configure sending_queue and retry_on_failure in exporters to control memory and retries.
  • For details, refer to Retry on RESOURCE_EXHAUSTED failure.
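
The queue and retry settings sit on the exporter itself. A sketch for Alloy's otelcol.exporter.otlp, with illustrative values and placeholder endpoint (not recommendations):

```alloy
otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443" // placeholder
  }

  // Bound how many batches buffer in memory while the backend pushes back.
  sending_queue {
    queue_size = 5000
  }

  // Back off between retries instead of dropping on transient errors.
  retry_on_failure {
    initial_interval = "5s"
    max_interval     = "30s"
    max_elapsed_time = "5m"
  }
}
```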

Late spans and slack period

  • If you see “Spans arrive too late,” the spans ended further in the past than the slack period allows, so the metrics-generator excludes them.
  • Possible solutions:
    • Reduce tail sampling decision wait and batch timeouts. Refer to Sampling for more information.
    • Request increased metrics-generator slack (reduces metrics granularity).