Troubleshoot Grafana Cloud Traces

Having trouble? Try the suggestions on this page to help resolve issues.

Quick checks

  • Confirm the correct Tempo endpoint for your stack and region.
    • OTLP/HTTP: https://<stack>.grafana.net/tempo
    • OTLP/gRPC: <stack>.grafana.net:443 (no path)
  • Use the instance ID (numeric) as the username and a Cloud Access Policy token with the traces:write scope for collectors.
  • Align exporter protocol and endpoint (OTLP gRPC vs HTTP); don’t send gRPC to the HTTP path.
  • If behind a proxy, ensure TLS interception is handled (install CA or bypass Grafana domains).
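
The endpoint and credential checks above translate into a short Grafana Alloy sketch. The stack name, instance ID, token, and component labels are placeholders, not values from your environment:

```alloy
// Sketch: OTLP/gRPC export to Grafana Cloud Traces.
// Replace <stack>, <instance-id>, and <token> with your values.
otelcol.auth.basic "grafana_cloud" {
  username = "<instance-id>" // numeric instance ID
  password = "<token>"       // Cloud Access Policy token with traces:write
}

otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443" // gRPC: host:port, no /tempo path
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
```

For OTLP/HTTP, use otelcol.exporter.otlphttp with the /tempo path instead.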

Ingestion issues

Error: 401/403 Unauthorized or Forbidden

  • Use a token with traces:write. Regenerate if expired.
  • Ensure the username is your instance ID (numeric).

Error: 404/415 or connection failures

  • Verify the endpoint includes /tempo for HTTP OTLP.
  • Match exporter to protocol (HTTP vs gRPC).

Validate from Grafana Alloy:

  • Open the Grafana Alloy UI at http://localhost:12345 and confirm otelcol.receiver.otlp, otelcol.processor.batch, and otelcol.exporter.otlp are healthy.
  • Format the configuration with alloy fmt /path/to/config.alloy; the command fails on syntax errors, so it doubles as a basic validity check.

Discarded traces

Grafana Cloud Traces enforces ingestion limits to protect shared infrastructure. Spans that exceed these limits are discarded, which can cause missing traces or gaps in your data.

Check if spans are being discarded

Open the Billing dashboard in your Grafana Cloud stack and look at the Discarded Spans panel. This panel shows discard rates broken down by reason. For more information about the panel, refer to Discarded Spans panel.

You can also query the grafanacloud-usage data source directly to see discard rates by reason using these PromQL queries:

```promql
sum by (reason) (grafanacloud_traces_instance_discarded_spans_total:rate5m)
```

To narrow to a specific reason, for example, traces that exceeded the size limit:

```promql
grafanacloud_traces_instance_discarded_spans_total:rate5m{reason="trace_too_large"}
```

Common discard reasons

The following table lists the discard reasons, what each one means, and what you can do. Cloud users can’t change ingestion limits directly. Contact Grafana Support to request a limit increase.

| Reason | Meaning | What you can do |
| --- | --- | --- |
| rate_limited | Tenant byte rate exceeded the ingestion rate limit (default ~0.5 MB/s). | Reduce volume with sampling or Adaptive Traces. Contact Support to raise the limit. |
| trace_too_large | A single trace exceeded the maximum trace size limit (default 5 MB). | Investigate why the trace is large. Refer to Identify which services are causing discards. Contact Support to raise the limit. |
| live_traces_exceeded | Too many concurrent active traces for the tenant. The limit scales with cluster size. | Reduce trace cardinality or batch size. Contact Support to raise the limit. |
| trace_too_large_to_compact | A trace exceeded the size limit during compaction. | Same as trace_too_large. |
| internal_error | Spans rejected during an infrastructure rollout. Alloy and OpenTelemetry Collector retry these automatically, so data loss is unlikely. | Monitor for recurrence. Contact Support if persistent. |

For details on how RESOURCE_EXHAUSTED errors interact with collector retry behavior, refer to Retry on RESOURCE_EXHAUSTED failure.

Identify which services are causing discards

The grafanacloud_traces_instance_discarded_spans_total:rate5m metric shows that spans are being discarded and why, but not which services are responsible.

Find oversized trace IDs in usage insights

For trace_too_large discards, Cloud Traces logs the trace ID of each oversized trace to the Usage Insights data source. To find these trace IDs:

  1. Open Explore and select the grafanacloud-<YOUR-STACK-NAME>-usage-insights data source.

  2. Run the following LogQL query:

    ```logql
    {instance_type="traces"} |= "TRACE_TOO_LARGE"
    ```

Each matching log line contains the trace ID and size information:

```
level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941
```
  • trace is the trace ID.
  • totalSize is the cumulative size of the trace in bytes.
  • max is the configured max_bytes_per_trace limit for your tenant.
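
If you're pulling trace IDs out of many of these lines, a one-line sed extraction works; a sketch in shell, assuming the key=value log format shown above (the example line is the one from this page):

```shell
# Extract the trace ID from a TRACE_TOO_LARGE log line.
line='level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941'

# Capture the hex value after the last "trace=" field.
printf '%s\n' "$line" | sed -n 's/.*trace=\([0-9a-f]*\).*/\1/p'
```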

Copy the trace ID from the log line and query it in your Tempo data source to identify which services contributed spans:

```traceql
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
```

Set the time range to match the period when the trace was active.

For more information about the Usage Insights data source, refer to Usage Insights dashboards.

Find traces with many spans

If you don’t have a specific trace ID, you can use TraceQL to find traces that are likely candidates for trace_too_large discards based on span count. TraceQL queries are performed in Explore with your tracing data source selected.

Query for traces with a high span count:

```traceql
{} | count() > 10000 | select(name, resource.service.name)
```
  • {} selects all traces
  • count() > 10000 filters to traces with more than 10,000 spans (adjust the threshold for your workload)
  • select(name, resource.service.name) returns the root span name and service name

When root span information is missing

Some results may show <root span not yet received> instead of a service name. This can happen for several reasons:

  • Instrumentation is broken or incomplete (disconnected spans).
  • The root span was filtered out before reaching Cloud Traces.
  • The trace was too large and some spans were discarded.
  • The trace is long-running. Root spans tend to arrive last, so the root span may not have been received yet.
  • Spans are stored across different backend blocks.

To identify the service for a specific trace, take its trace ID and run:

```traceql
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
```

This shows which services contributed spans to the trace, even when the root span is missing. Set the time range to match the period when the trace was active.

To monitor trace completeness at scale, query the following metrics from the grafanacloud-usage data source:

  • grafanacloud_traces_instance_percentage_traces_with_root_spans_flushed shows the percentage of traces that include a root span when flushed to storage. A low value indicates that root spans are frequently missing.
  • grafanacloud_traces_instance_percentage_complete_traces_flushed shows the percentage of traces flushed without orphaned spans. A low value suggests broken instrumentation or partial discards.
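
Either metric can be watched with a simple threshold query. The 90 below is an illustrative value for this sketch, not a Grafana recommendation:

```promql
grafanacloud_traces_instance_percentage_complete_traces_flushed < 90
```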

For the full list of available traces metrics, refer to Cloud Traces usage.

Reduce the discard rate

After you identify the services and reasons for discards:

  • Use Adaptive Traces for managed tail sampling without operational overhead. Refer to Adaptive Traces.
  • Configure sampling in your collector (head or tail sampling) to reduce volume from noisy services. Refer to Sampling strategies.
  • Fix noisy instrumentation. Common causes of oversized traces include retry loops, unbounded fan-out, and debug-level instrumentation left enabled in production.
  • Contact Grafana Support to raise limits if the ingestion volume is legitimate and you need higher throughput or larger trace sizes.
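
One common collector-side shape is tail sampling that keeps all error traces and probabilistically samples the rest. A minimal Alloy sketch, with illustrative percentages, timeouts, and component labels (the exporter reference is a placeholder for your own pipeline):

```alloy
// Sketch: tail sampling in Alloy to cut volume before export.
// decision_wait and sampling_percentage are illustrative values.
otelcol.processor.tail_sampling "default" {
  decision_wait = "10s"

  // Always keep traces that contain an error.
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  // Sample a fraction of everything else.
  policy {
    name = "sample-the-rest"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input] // your exporter here
  }
}
```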

If you run self-managed Tempo, refer to Manage trace ingestion in the Tempo documentation for configuration options you can change directly.

Refer to TraceQL documentation for the TraceQL language reference.

Quick diagnostics

  • Select the correct Tempo data source/tenant and widen the time range (for example, Last 6h).

  • Start from Search builder: add resource.service.name first, run, then add one filter at a time.

  • Sanity‑check ingestion with a known trace ID:

    ```traceql
    { trace:id = "0123456789abcdef" }
    ```
  • Prefer trace‑level intrinsic fields for speed:

    ```traceql
    { trace:rootService = "api-gateway" && trace:rootName = "GET /health" }
    { trace:duration > 2s }
    ```
  • Check obvious errors first:

    ```traceql
    { span:status = error }
    ```
  • Verify attribute scopes/names (resource versus span), aligned to the OpenTelemetry semantic conventions:

    • resource.service.name, resource.deployment.environment
    • span.http.request.method, span.http.response.status_code, span.http.route
  • Remove pipes/group‑bys; re‑add incrementally:

    ```traceql
    { span:status = error } | by(resource.service.name) | count() > 1
    ```
  • Confirm quoted strings and duration units (ms, s, ns).

  • Avoid broad regular expressions until basic filters return results.

Syntax errors

  • Quote string values and separate multi‑value group‑bys with commas.
  • Prefer exact attribute names and scopes (case as emitted).
  • Examples:
```traceql
{ span.http.request.method = "GET" }
{ trace:duration > 2s }
```

Troubleshoot Service Graph and RED metrics

Some common issues with Service Graph and RED metrics are:

  • Nothing shows in Service Graph
  • Monitor generator health
  • Late spans and slack period

Service Graph is empty

The metrics-generator produces the service graph and RED metrics, and it is disabled by default. You can enable it with the default settings used by Application Observability, or contact Grafana Support to enable metrics-generation for your organization with custom settings.

If metrics-generation is enabled but the graph is still empty:

  • Verify the span kinds. Processing is limited to SERVER and CONSUMER spans by default; you can extend it to CLIENT and PRODUCER if needed.
  • Verify the aggregation settings. Aggregation can hide certain labels; confirm the dimensions you need are included.
  • Verify the time range, and make sure there is enough recent traffic to generate metrics.

Monitor generator health

  • Use the grafanacloud-usage data source and query the grafanacloud_traces_instance_metrics_generator_* metrics:
    • grafanacloud_traces_instance_metrics_generator_active_series{}
    • grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}
    • grafanacloud_traces_instance_metrics_generator_discarded_spans_per_second{}
    • grafanacloud_traces_instance_metrics_generator_label_cardinality_demand_estimate{} - estimated distinct values per label
    • grafanacloud_traces_instance_metrics_generator_label_values_limited_per_second{} - rate of label values capped by per-label limiting
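
As a starting point, a threshold query over the drop-rate metric surfaces generator problems quickly. This sketch simply fires whenever any series are being dropped; tune it to your own tolerance:

```promql
sum(grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}) > 0
```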

Exemplars

  • Use Time series panels and toggle Exemplars on.
  • Ensure OpenMetrics output and exemplars with trace IDs; set send_exemplars=true in Alloy remote_write.
  • Verify with: curl -H "Accept: application/openmetrics-text" http://<app>/metrics | grep -i traceid.
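
In Alloy, exemplar forwarding is controlled on the remote_write endpoint. A minimal sketch, with the URL and component label as placeholders:

```alloy
prometheus.remote_write "metrics" {
  endpoint {
    url = "<your-prometheus-remote-write-url>" // placeholder

    // Forward exemplars (with trace IDs) alongside samples.
    send_exemplars = true
  }
}
```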

Rate limiting and retry

  • Treat RESOURCE_EXHAUSTED responses as retryable; the collector should back off and resend rather than drop spans.
  • Configure sending_queue and retry_on_failure in exporters to control memory and retries.
  • For details, refer to Retry on RESOURCE_EXHAUSTED failure.
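
The queue and retry settings sit on the exporter itself. A sketch for Alloy's otelcol.exporter.otlp, with illustrative values and placeholder endpoint (not recommendations):

```alloy
otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443" // placeholder
  }

  // Bound how many batches buffer in memory while the backend pushes back.
  sending_queue {
    queue_size = 5000
  }

  // Back off between retries instead of dropping on transient errors.
  retry_on_failure {
    initial_interval = "5s"
    max_interval     = "30s"
    max_elapsed_time = "5m"
  }
}
```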

Late spans and slack period

  • If you see “Spans arrive too late,” the spans ended further in the past than the slack period allows, so the metrics-generator excludes them.
  • Possible solutions:
    • Reduce tail sampling decision wait and batch timeouts. Refer to Sampling for more information.
    • Request increased metrics-generator slack (reduces metrics granularity).