---
title: "Troubleshoot Grafana Cloud Traces"
description: "Troubleshoot issues with Grafana Cloud Traces"
---

# Troubleshoot Grafana Cloud Traces

Having trouble? Try the suggestions on this page to help resolve issues.

## Quick checks

- Confirm the correct Tempo endpoint for your stack and region.
  
  - OTLP/HTTP: `https://<stack>.grafana.net/tempo`
  - OTLP/gRPC: `<stack>.grafana.net:443` (no path)
- Use the instance ID (numeric) as the username and a Cloud Access Policy token with the `traces:write` scope for collectors.
- Align exporter protocol and endpoint (OTLP gRPC vs HTTP); don’t send gRPC to the HTTP path.
- If behind a proxy, ensure TLS interception is handled (install CA or bypass Grafana domains).
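
The checks above can be sketched as a minimal Grafana Alloy pipeline. This is a sketch, not a complete configuration; replace `<stack>`, `<instance-id>`, and `<token>` with your stack's values:

```alloy
// Minimal sketch: receive OTLP, batch, and export to Grafana Cloud Traces
// over OTLP/gRPC with basic auth (instance ID + Cloud Access Policy token).
otelcol.receiver.otlp "default" {
  grpc {}
  http {}

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    traces = [otelcol.exporter.otlp.grafana_cloud.input]
  }
}

otelcol.auth.basic "grafana_cloud" {
  username = "<instance-id>"
  password = "<token>"
}

otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
```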

## Ingestion issues

Error: 401/403 Unauthorized or Forbidden

- Use a token with `traces:write`. Regenerate if expired.
- Ensure the username is your instance ID (numeric).

Error: 404/415 or connection failures

- Verify the endpoint includes `/tempo` for HTTP OTLP.
- Match exporter to protocol (HTTP vs gRPC).
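
The protocol/endpoint pairing can also be expressed with the standard OpenTelemetry SDK environment variables. The hostnames below are placeholders; with `http/protobuf`, SDKs append the `/v1/traces` signal path to the base endpoint:

```sh
# OTLP over HTTP: the endpoint includes the /tempo path prefix.
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="https://<stack>.grafana.net/tempo"

# OTLP over gRPC: host and port only, no path.
# export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# export OTEL_EXPORTER_OTLP_ENDPOINT="https://<stack>.grafana.net:443"
```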

Validate from Grafana Alloy:

- Open the Grafana Alloy UI at `http://localhost:12345` and confirm `otelcol.receiver.otlp`, `otelcol.processor.batch`, and `otelcol.exporter.otlp` are healthy.
- Format and validate configuration: `alloy fmt /path/to/config.alloy`.

## Discarded traces

Grafana Cloud Traces enforces ingestion limits to protect shared infrastructure. Spans that exceed these limits are discarded, which can cause missing traces or gaps in your data.

### Check if spans are being discarded

Open the **Billing** dashboard in your Grafana Cloud stack and look at the **Discarded Spans** panel. This panel shows discard rates broken down by reason. For more information about the panel, refer to [Discarded Spans panel](/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/usage-limits/#discarded-spans-panel-in-the-billing-dashboard).

You can also query the `grafanacloud-usage` data source directly to see discard rates by reason using these PromQL queries:

```promql
sum by (reason) (grafanacloud_traces_instance_discarded_spans_total:rate5m)
```

To narrow to a specific reason, for example, traces that exceeded the size limit:

```promql
grafanacloud_traces_instance_discarded_spans_total:rate5m{reason="trace_too_large"}
```

### Common discard reasons

The following table lists the discard reasons, what each one means, and what you can do. Cloud users can’t change ingestion limits directly. Contact [Grafana Support](/contact/) to request a limit increase.

| Reason                       | Meaning                                                                                                                                 | What you can do                                                                                                                                                                                            |
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `rate_limited`               | Tenant byte rate exceeded the ingestion rate limit (default ~0.5 MB/s).                                                                 | Reduce volume with [sampling](/docs/grafana-cloud/send-data/traces/configure/sampling/) or [Adaptive Traces](/docs/grafana-cloud/adaptive-telemetry/adaptive-traces/). Contact Support to raise the limit. |
| `trace_too_large`            | A single trace exceeded the maximum trace size limit (default 5 MB).                                                                    | Investigate why the trace is large. Refer to [Identify which services are causing discards](#identify-which-services-are-causing-discards). Contact Support to raise the limit.                            |
| `live_traces_exceeded`       | Too many concurrent active traces for the tenant. The limit scales with cluster size.                                                   | Reduce trace cardinality or batch size. Contact Support to raise the limit.                                                                                                                                |
| `trace_too_large_to_compact` | A trace exceeded the size limit during compaction.                                                                                      | Same as `trace_too_large`.                                                                                                                                                                                 |
| `internal_error`             | Spans rejected during an infrastructure rollout. Alloy and OpenTelemetry Collector retry these automatically, so data loss is unlikely. | Monitor for recurrence. Contact Support if persistent.                                                                                                                                                     |

For details on how `RESOURCE_EXHAUSTED` errors interact with collector retry behavior, refer to [Retry on RESOURCE\_EXHAUSTED failure](/docs/grafana-cloud/send-data/traces/set-up/troubleshoot/).

### Identify which services are causing discards

The `grafanacloud_traces_instance_discarded_spans_total:rate5m` metric shows that spans are being discarded and why, but not which services are responsible.

#### Find oversized trace IDs in usage insights

For `trace_too_large` discards, Cloud Traces logs the trace ID of each oversized trace to the Usage Insights data source. To find these trace IDs:

1. Open **Explore** and select the `grafanacloud-<YOUR-STACK-NAME>-usage-insights` data source.
2. Run the following LogQL query:
   
   ```logql
   {instance_type="traces"} |= "TRACE_TOO_LARGE"
   ```

Each matching log line contains the trace ID and size information:

```none
level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941
```

- `trace` is the trace ID.
- `totalSize` is the cumulative size of the trace in bytes.
- `max` is the configured `max_bytes_per_trace` limit for your tenant.
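
If you export many of these log lines, a small script can pull out the fields. This is an illustrative helper, not part of any Grafana tooling; it assumes only the log format shown above:

```python
import re

# Matches the TRACE_TOO_LARGE log format shown above.
PATTERN = re.compile(
    r"msg=TRACE_TOO_LARGE\s+max=(?P<max>\d+)\s+.*?"
    r"totalSize=(?P<total>\d+)\s+trace=(?P<trace>[0-9a-f]+)"
)

def parse_trace_too_large(line):
    """Return (trace_id, total_size_bytes, max_bytes_per_trace), or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group("trace"), int(m.group("total")), int(m.group("max"))

line = ("level=warn msg=TRACE_TOO_LARGE max=5000000 traceSz=4972 "
        "totalSize=6230419 trace=5a1df9e5ab59d63e3c0c3a000a83c941")
print(parse_trace_too_large(line))
# ('5a1df9e5ab59d63e3c0c3a000a83c941', 6230419, 5000000)
```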

Copy the trace ID from the log line and query it in your Tempo data source to identify which services contributed spans:

```traceql
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
```

Set the time range to match the period when the trace was active.

For more information about the Usage Insights data source, refer to [Usage Insights dashboards](/docs/grafana-cloud/security-and-account-management/usage-insights/).

#### Find traces with many spans

If you don’t have a specific trace ID, you can use TraceQL to find traces that are likely candidates for `trace_too_large` discards based on span count. Run TraceQL queries in **Explore** with your tracing data source selected.

Query for traces with a high span count:

```traceql
{} | count() > 10000 | select(name, resource.service.name)
```

- `{}` selects all traces
- `count() > 10000` filters to traces with more than 10,000 spans (adjust the threshold for your workload)
- `select(name, resource.service.name)` returns the root span name and service name

#### When root span information is missing

Some results may show `<root span not yet received>` instead of a service name. This can happen for several reasons:

- Instrumentation is broken or incomplete (disconnected spans).
- The root span was filtered out before reaching Cloud Traces.
- The trace was too large and some spans were discarded.
- The trace is long-running. Root spans tend to arrive last, so the root span may not have been received yet.
- Spans are stored across different backend blocks.

To identify the service for a specific trace, take its trace ID and run:

```traceql
{ trace:id = "<TRACE_ID>" } | rate() by (resource.service.name)
```

This shows which services contributed spans to the trace, even when the root span is missing. Set the time range to match the period when the trace was active.

To monitor trace completeness at scale, query the following metrics from the `grafanacloud-usage` data source:

- `grafanacloud_traces_instance_percentage_traces_with_root_spans_flushed` shows the percentage of traces that include a root span when flushed to storage. A low value indicates that root spans are frequently missing.
- `grafanacloud_traces_instance_percentage_complete_traces_flushed` shows the percentage of traces flushed without orphaned spans. A low value suggests broken instrumentation or partial discards.
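
For example, an alerting-style query for root-span completeness. The `95` threshold is an illustrative assumption, not a Grafana default, and this assumes the metric reports on a 0–100 scale:

```promql
grafanacloud_traces_instance_percentage_traces_with_root_spans_flushed < 95
```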

For the full list of available traces metrics, refer to [Cloud Traces usage](/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/usage-limits/#cloud-traces-usage).

### Reduce the discard rate

After you identify the services and reasons for discards:

- Use Adaptive Traces for managed tail sampling without operational overhead. Refer to [Adaptive Traces](/docs/grafana-cloud/adaptive-telemetry/adaptive-traces/).
- Configure sampling in your collector (head or tail sampling) to reduce volume from noisy services. Refer to [Sampling strategies](/docs/grafana-cloud/send-data/traces/configure/sampling/).
- Fix noisy instrumentation. Common causes of oversized traces include retry loops, unbounded fan-out, and debug-level instrumentation left enabled in production.
- Contact Grafana Support to raise limits if the ingestion volume is legitimate and you need higher throughput or larger trace sizes.

If you run self-managed Tempo, refer to [Manage trace ingestion](/docs/tempo/next/operations/manage-trace-ingestion/) in the Tempo documentation for configuration options you can change directly.

## TraceQL and search

Refer to [TraceQL documentation](/docs/tempo/next/traceql/) for the TraceQL language reference.

### Quick diagnostics

- Select the correct Tempo data source/tenant and widen the time range (for example, Last 6h).
- Start from Search builder: add `resource.service.name` first, run, then add one filter at a time.
- Sanity‑check ingestion with a known trace ID:
  
  ```traceql
  { trace:id = "0123456789abcdef" }
  ```
- Prefer trace‑level intrinsic fields for speed:
  
  ```traceql
  { trace:rootService = "api-gateway" && trace:rootName = "GET /health" }
  { trace:duration > 2s }
  ```
- Check obvious errors first:
  
  ```traceql
  { span:status = error }
  ```
- Verify attribute scopes/names (resource versus span), aligned to the OpenTelemetry semantic conventions:
  
  - `resource.service.name`, `resource.deployment.environment`
  - `span.http.request.method`, `span.http.response.status_code`, `span.http.route`
- Remove pipes/group‑bys; re‑add incrementally:
  
  ```traceql
  { span:status = error } | by(resource.service.name) | count() > 1
  ```
- Confirm quoted strings and duration units (`ms`, `s`, `ns`).
- Avoid broad regular expressions until basic filters return results.

### Syntax errors

- Quote string values and separate multi‑value group‑bys with commas.
- Prefer exact attribute names and scopes (case as emitted).
- Examples:

```traceql
{ span.http.request.method = "GET" }
{ trace:duration > 2s }
```

## Troubleshoot Service Graph and RED metrics

Some common issues with Service Graph and RED metrics are:

- The Service Graph is empty
- Metrics-generator health problems
- Late spans arriving outside the slack period

### Service Graph is empty

The metrics-generator is responsible for generating the service graph and RED metrics, and it's disabled by default. You can enable it with default settings through Application Observability, or contact Grafana Support to enable metrics-generation for your organization with custom settings.

Then check the following:

- Span kinds: only `SERVER` and `CONSUMER` spans generate metrics by default. Extend to `CLIENT` and `PRODUCER` if your topology needs them.
- Aggregation: the configured aggregation can hide labels you expect to see. Confirm the dimensions you need are included.
- Time range: make sure the selected range covers recent traffic, because the generator only produces metrics while spans are arriving.
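
To confirm the generator is writing anything at all, query your stack's Prometheus data source for the generated series. The metric names below are the standard outputs of Tempo's metrics-generator (service graph and span metrics):

```promql
sum by (client, server) (rate(traces_service_graph_request_total[5m]))
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))
```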

### Monitor generator health

- Use the `grafanacloud-usage` data source and query the `grafanacloud_traces_instance_metrics_generator_*` metrics:
  
  - `grafanacloud_traces_instance_metrics_generator_active_series{}`
  - `grafanacloud_traces_instance_metrics_generator_series_dropped_per_second{}`
  - `grafanacloud_traces_instance_metrics_generator_discarded_spans_per_second{}`
  - `grafanacloud_traces_instance_metrics_generator_label_cardinality_demand_estimate{}` - estimated distinct values per label
  - `grafanacloud_traces_instance_metrics_generator_label_values_limited_per_second{}` - rate of label values capped by per-label limiting

## Exemplars

- Use Time series panels and toggle Exemplars on.
- Ensure your application exposes OpenMetrics output with exemplars that include trace IDs, and set `send_exemplars=true` in Alloy `remote_write`.
- Verify with: `curl -H "Accept: application/openmetrics-text" http://<app>/metrics | grep -i traceid`.
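
In Alloy, exemplar forwarding is a per-endpoint setting on `prometheus.remote_write`. A sketch with placeholder values:

```alloy
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url            = "https://<prometheus-host>/api/prom/push"
    send_exemplars = true

    basic_auth {
      username = "<instance-id>"
      password = "<token>"
    }
  }
}
```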

## Rate limiting and retry

- Treat `RESOURCE_EXHAUSTED` errors as retryable rather than fatal: they indicate rate limiting, and collectors should back off and resend.
- Configure `sending_queue` and `retry_on_failure` in exporters to control memory and retries.
- For details, refer to [Retry on `RESOURCE_EXHAUSTED` failure](/docs/grafana-cloud/send-data/traces/set-up/troubleshoot/).
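
In Alloy's `otelcol.exporter.otlp` component, these map to the `sending_queue` and `retry_on_failure` blocks. The values below are illustrative starting points, not recommendations:

```alloy
otelcol.exporter.otlp "grafana_cloud" {
  client {
    endpoint = "<stack>.grafana.net:443"
  }

  // Bounds memory used for spans waiting to be sent.
  sending_queue {
    enabled    = true
    queue_size = 5000
  }

  // Backoff behavior for retryable errors such as RESOURCE_EXHAUSTED.
  retry_on_failure {
    enabled          = true
    initial_interval = "5s"
    max_interval     = "30s"
    max_elapsed_time = "5m"
  }
}
```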

### Late spans and slack period

- The “spans arrive too late” warning means spans ended before the start of the metrics-generator’s slack period, so the generator discarded them. This usually indicates delays in your collection pipeline.
- Possible solutions:
  
  - Reduce tail sampling decision wait and batch timeouts. Refer to [Sampling](/docs/tempo/latest/set-up-for-tracing/instrument-send/set-up-collector/tail-sampling/) for more information.
  - Request increased metrics-generator slack (reduces metrics granularity).
