<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Issues with querying on Grafana Labs</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/</link><description>Recent content in Issues with querying on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/tempo/v2.10.x/troubleshooting/querying/index.xml" rel="self" type="application/rss+xml"/><item><title>Unable to find traces</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/unable-to-see-trace/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/unable-to-see-trace/</guid><content><![CDATA[&lt;h1 id=&#34;unable-to-find-traces&#34;&gt;Unable to find traces&lt;/h1&gt;
&lt;p&gt;The two main causes of missing traces are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Issues ingesting the data into Tempo. Spans are either not sent to Tempo correctly or they aren&amp;rsquo;t being sampled.&lt;/li&gt;
&lt;li&gt;Issues querying for traces that have been received by Tempo.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;section-1-diagnose-and-fix-ingestion-issues&#34;&gt;Section 1: Diagnose and fix ingestion issues&lt;/h2&gt;
&lt;p&gt;The first step is to check whether the application spans are actually reaching Tempo.&lt;/p&gt;
&lt;p&gt;Add the following flag to the distributor container: &lt;a href=&#34;https://github.com/grafana/tempo/blob/57da4f3fd5d2966e13a39d27dbed4342af6a857a/modules/distributor/config.go#L55&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;&lt;code&gt;distributor.log_received_spans.enabled&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This flag enables debug logging of all the traces received by the distributor. These logs can help check if Tempo is receiving any traces at all.&lt;/p&gt;
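&lt;p&gt;For example, the equivalent setting in the Tempo configuration file looks like the following sketch (field names follow the distributor configuration linked above):&lt;/p&gt;

```yaml
# Sketch: enables debug logging of spans received by the distributor.
distributor:
  log_received_spans:
    enabled: true
```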
&lt;p&gt;You can also check the following metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tempo_distributor_spans_received_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tempo_ingester_traces_created_total&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The value of both metrics should be greater than &lt;code&gt;0&lt;/code&gt; within a few minutes of the application spinning up.
You can check both metrics using either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The metrics page exposed from Tempo at &lt;code&gt;http://&amp;lt;tempo-address&amp;gt;:&amp;lt;tempo-http-port&amp;gt;/metrics&lt;/code&gt; or&lt;/li&gt;
&lt;li&gt;In Prometheus, if it&amp;rsquo;s used to scrape metrics.&lt;/li&gt;
&lt;/ul&gt;
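&lt;p&gt;If Prometheus is scraping Tempo, queries such as the following sketch show the per-second ingest rates (any extra aggregation labels depend on your scrape configuration):&lt;/p&gt;

```promql
sum(rate(tempo_distributor_spans_received_total[1m]))
sum(rate(tempo_ingester_traces_created_total[1m]))
```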
&lt;h3 id=&#34;case-1---tempo_distributor_spans_received_total-is-0&#34;&gt;Case 1 - &lt;code&gt;tempo_distributor_spans_received_total&lt;/code&gt; is 0&lt;/h3&gt;
&lt;p&gt;If the value of &lt;code&gt;tempo_distributor_spans_received_total&lt;/code&gt; is 0, possible reasons are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An incorrect protocol/port combination was used when initializing the tracer in the application.&lt;/li&gt;
&lt;li&gt;The SDK&amp;rsquo;s internal sampler isn&amp;rsquo;t selecting trace records to send to Tempo.&lt;/li&gt;
&lt;li&gt;The application is running inside Docker and sending traces to an incorrect endpoint.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Receiver-specific traffic information is also available through &lt;code&gt;tempo_receiver_accepted_spans&lt;/code&gt;, which has a label identifying the receiver (the protocol used for ingestion, for example &lt;code&gt;jaeger-thrift&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&#34;solutions&#34;&gt;Solutions&lt;/h3&gt;
&lt;p&gt;The fixes fall into three groups, matching the causes above: protocol or port problems, sampling issues, and incorrect endpoints.&lt;/p&gt;
&lt;p&gt;To fix protocol or port problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Find out which communication protocol is being used by the application to emit traces. This is unique to every client SDK. For instance: Jaeger Golang Client uses &lt;code&gt;Thrift Compact over UDP&lt;/code&gt; by default.&lt;/li&gt;
&lt;li&gt;Check the list of supported protocols and their ports and ensure that the correct combination is being used.&lt;/li&gt;
&lt;/ul&gt;
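&lt;p&gt;As a sketch, the distributor&amp;rsquo;s receiver block controls which protocol/port combinations Tempo listens on. The ports in the comments are the conventional defaults and may differ in your deployment:&lt;/p&gt;

```yaml
distributor:
  receivers:
    otlp:
      protocols:
        grpc:            # conventionally port 4317
        http:            # conventionally port 4318
    jaeger:
      protocols:
        thrift_compact:  # conventionally port 6831/udp
```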
&lt;p&gt;To fix sampling issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These issues can be tricky to diagnose because most SDKs use a probabilistic sampler by default, which may result in only one in 1,000 records being sampled.&lt;/li&gt;
&lt;li&gt;Check the sampling configuration of the tracer being initialized in the application and make sure it has a high sampling rate.&lt;/li&gt;
&lt;li&gt;Some clients also provide metrics on the number of spans reported from the application, for example &lt;code&gt;jaeger_tracer_reporter_spans_total&lt;/code&gt;. Check the value of that metric if available and make sure it&amp;rsquo;s greater than zero.&lt;/li&gt;
&lt;li&gt;Another way to diagnose this problem is to generate a large volume of traces and check whether any records make their way to Tempo.&lt;/li&gt;
&lt;/ul&gt;
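&lt;p&gt;For OpenTelemetry SDKs, the sampler can usually be controlled with standard environment variables. The following sketch samples every trace, which is useful while diagnosing (revert to a lower rate afterwards):&lt;/p&gt;

```
# Variable names from the OpenTelemetry SDK environment variable specification.
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=1.0
```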
&lt;p&gt;To fix an incorrect endpoint issue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the application is also running inside docker, make sure the application is sending traces to the correct endpoint (&lt;code&gt;tempo:&amp;lt;receiver-port&amp;gt;&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
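&lt;p&gt;A minimal Docker Compose sketch (the application service and image names are hypothetical): because both containers share the default network, the application reaches Tempo at &lt;code&gt;tempo:4317&lt;/code&gt;:&lt;/p&gt;

```yaml
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "4317:4317"          # OTLP gRPC receiver
  app:
    image: my-app:latest     # hypothetical application image
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
```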
&lt;h3 id=&#34;case-2---tempo_ingester_traces_created_total-is-0&#34;&gt;Case 2 - &lt;code&gt;tempo_ingester_traces_created_total&lt;/code&gt; is 0&lt;/h3&gt;
&lt;p&gt;If the value of &lt;code&gt;tempo_ingester_traces_created_total&lt;/code&gt; is 0, this can indicate network issues between distributors and ingesters.&lt;/p&gt;
&lt;p&gt;Check the metric &lt;code&gt;tempo_request_duration_seconds_count{route=&#39;/tempopb.Pusher/Push&#39;}&lt;/code&gt; exposed from the ingester, which indicates whether it&amp;rsquo;s receiving ingestion requests from the distributor.&lt;/p&gt;
&lt;h3 id=&#34;solution&#34;&gt;Solution&lt;/h3&gt;
&lt;p&gt;Check the distributor logs for a message like &lt;code&gt;msg=&amp;quot;pusher failed to consume trace data&amp;quot; err=&amp;quot;DoBatch: IngesterCount &amp;lt;= 0&amp;quot;&lt;/code&gt;.
This usually means that no ingester has joined the gossip ring. Make sure the same gossip ring address is supplied to the distributors and ingesters.&lt;/p&gt;
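&lt;p&gt;For example, both components should point at the same memberlist address. This is a sketch; the service name is hypothetical:&lt;/p&gt;

```yaml
# Supply the same memberlist configuration to distributors and ingesters.
memberlist:
  join_members:
    - gossip-ring.tempo.svc.cluster.local:7946
```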
&lt;h2 id=&#34;diagnose-and-fix-sampling-and-limits-issues&#34;&gt;Section 2: Diagnose and fix sampling and limits issues&lt;/h2&gt;
&lt;p&gt;If you are able to query some traces in Tempo but not others, you have come to the right section.&lt;/p&gt;
&lt;p&gt;This can happen for a number of reasons, some of which are detailed in this blog post:
&lt;a href=&#34;/blog/2020/07/09/where-did-all-my-spans-go-a-guide-to-diagnosing-dropped-spans-in-jaeger-distributed-tracing/&#34;&gt;Where did all my spans go? A guide to diagnosing dropped spans in Jaeger distributed tracing&lt;/a&gt;.
The post is most useful if you are using the Jaeger Agent.&lt;/p&gt;
&lt;p&gt;If you are using Grafana Alloy, continue reading the following section for metrics to monitor.&lt;/p&gt;
&lt;h3 id=&#34;diagnose-the-issue&#34;&gt;Diagnose the issue&lt;/h3&gt;
&lt;p&gt;Check if the pipeline is dropping spans. The following metrics on Grafana Alloy help determine this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;exporter_send_failed_spans_ratio_total&lt;/code&gt;. The value of this metric should be &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;receiver_refused_spans_ratio_total&lt;/code&gt;. The value of this metric should be &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the pipeline isn&amp;rsquo;t reporting any dropped spans, check whether application spans are being dropped by Tempo. The following metrics help determine this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;tempo_receiver_refused_spans&lt;/code&gt;. The value of &lt;code&gt;tempo_receiver_refused_spans&lt;/code&gt; should be &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If the value of &lt;code&gt;tempo_receiver_refused_spans&lt;/code&gt; is greater than 0, application spans are likely being dropped due to rate limiting.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;solutions-1&#34;&gt;Solutions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If the pipeline (Grafana Alloy) drops spans, the deployment may need to be scaled up.&lt;/li&gt;
&lt;li&gt;There might also be connectivity issues with the Tempo backend. Check the Alloy logs and make sure the Tempo endpoint and credentials are correctly configured.&lt;/li&gt;
&lt;li&gt;If Tempo drops spans, this may be due to rate limiting.
Rate limiting may be appropriate and therefore not an issue. The metric simply explains the cause of the missing spans.&lt;/li&gt;
&lt;li&gt;If you require a higher ingest volume, increase the configuration for the rate limiting by adjusting the &lt;code&gt;max_traces_per_user&lt;/code&gt; property in the 
    &lt;a href=&#34;/docs/tempo/v2.10.x/configuration/#standard-overrides&#34;&gt;configured override limits&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
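&lt;p&gt;For example, in the standard overrides format this might look like the following sketch (the value is illustrative; pick one that matches your ingest volume):&lt;/p&gt;

```yaml
overrides:
  defaults:
    ingestion:
      max_traces_per_user: 20000   # illustrative value
```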


&lt;div class=&#34;admonition admonition-note&#34;&gt;&lt;blockquote&gt;&lt;p class=&#34;title text-uppercase&#34;&gt;Note&lt;/p&gt;&lt;p&gt;Check the 
    &lt;a href=&#34;/docs/tempo/v2.10.x/configuration/#overrides&#34;&gt;ingestion limits page&lt;/a&gt; for further information on limits.&lt;/p&gt;&lt;/blockquote&gt;&lt;/div&gt;

&lt;h2 id=&#34;section-3-diagnose-and-fix-issues-with-querying-traces&#34;&gt;Section 3: Diagnose and fix issues with querying traces&lt;/h2&gt;
&lt;p&gt;If Tempo is correctly ingesting trace spans, then it&amp;rsquo;s time to investigate possible issues with querying the data.&lt;/p&gt;
&lt;p&gt;Check the logs of the query-frontend. The query-frontend pod runs with two containers, &lt;code&gt;query-frontend&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt;.
Use the following command to view query-frontend logs:&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-bash&#34;&gt;kubectl logs -f pod/query-frontend-xxxxx -c query-frontend&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The presence of the following errors in the log may explain issues with querying traces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;level=info ts=XXXXXXX caller=frontend.go:63 method=GET traceID=XXXXXXXXX url=/api/traces/XXXXXXXXX duration=5m41.729449877s status=500&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;no org id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;could not dial 10.X.X.X:3200 connection refused&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tenant-id not found&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Possible reasons for these errors are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The querier isn&amp;rsquo;t connected to the query-frontend. Check the value of the metric &lt;code&gt;cortex_query_frontend_connected_clients&lt;/code&gt; exposed by the query-frontend.
It should be &amp;gt; &lt;code&gt;0&lt;/code&gt;, indicating querier connections with the query-frontend.&lt;/li&gt;
&lt;li&gt;Grafana Tempo data source isn&amp;rsquo;t configured to pass &lt;code&gt;tenant-id&lt;/code&gt; in the &lt;code&gt;Authorization&lt;/code&gt; header (multi-tenant deployments only).&lt;/li&gt;
&lt;li&gt;The connection to the Tempo querier isn&amp;rsquo;t configured correctly.&lt;/li&gt;
&lt;li&gt;Insufficient permissions.&lt;/li&gt;
&lt;/ul&gt;
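&lt;p&gt;For multi-tenant deployments, a provisioned Tempo data source can pass the tenant ID as an HTTP header. The following sketch uses Grafana&amp;rsquo;s data source provisioning format with the &lt;code&gt;X-Scope-OrgID&lt;/code&gt; header commonly used by Tempo; the URL and tenant ID are hypothetical:&lt;/p&gt;

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://query-frontend.tempo.svc.cluster.local:3200  # hypothetical address
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: my-tenant  # hypothetical tenant ID
```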
&lt;h3 id=&#34;solutions-2&#34;&gt;Solutions&lt;/h3&gt;
&lt;p&gt;To fix connection issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If the queriers aren&amp;rsquo;t connected to the query-frontend, check the following section in the querier configuration and verify the query-frontend address.&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;querier:
  frontend_worker:
    frontend_address: query-frontend-discovery.default.svc.cluster.local:9095&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Validate the Grafana data source configuration and debug network issues between Grafana and Tempo.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To fix an insufficient permissions issue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Verify that the querier has the &lt;code&gt;LIST&lt;/code&gt; and &lt;code&gt;GET&lt;/code&gt; permissions on the bucket.&lt;/li&gt;
&lt;/ul&gt;
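&lt;p&gt;For an S3 backend, a minimal IAM policy sketch granting the querier read access might look like the following (the bucket name is hypothetical):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::my-tempo-bucket",
        "arn:aws:s3:::my-tempo-bucket/*"
      ]
    }
  ]
}
```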
]]></content><description>&lt;h1 id="unable-to-find-traces">Unable to find traces&lt;/h1>
&lt;p>The two main causes of missing traces are:&lt;/p>
&lt;ul>
&lt;li>Issues in ingestion of the data into Tempo. Spans are either not sent correctly to Tempo or they aren&amp;rsquo;t getting sampled.&lt;/li>
&lt;li>Issues querying for traces that have been received by Tempo.&lt;/li>
&lt;/ul>
&lt;h2 id="section-1-diagnose-and-fix-ingestion-issues">Section 1: Diagnose and fix ingestion issues&lt;/h2>
&lt;p>The first step is to check whether the application spans are actually reaching Tempo.&lt;/p></description></item><item><title>Too many jobs in the queue</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/too-many-jobs-in-queue/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/too-many-jobs-in-queue/</guid><content><![CDATA[&lt;h1 id=&#34;too-many-jobs-in-the-queue&#34;&gt;Too many jobs in the queue&lt;/h1&gt;
&lt;p&gt;The error message might also be one of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;queue doesn&#39;t have room for 100 jobs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;failed to add a job to work queue&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You may see this error if the compactor isn’t running and the blocklist size has exploded.
Possible reasons why the compactor may not be running are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Insufficient permissions.&lt;/li&gt;
&lt;li&gt;Compactor sitting idle because no block is hashing to it.&lt;/li&gt;
&lt;li&gt;Incorrect configuration settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;diagnose-the-issue&#34;&gt;Diagnose the issue&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Check the metric &lt;code&gt;tempodb_compaction_bytes_written_total&lt;/code&gt;.
If this is greater than zero (0), the compactor is running and writing to the backend.&lt;/li&gt;
&lt;li&gt;Check the metric &lt;code&gt;tempodb_compaction_errors_total&lt;/code&gt;.
If this metric is greater than zero (0), check the logs of the compactor for an error message.&lt;/li&gt;
&lt;/ul&gt;
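&lt;p&gt;If Prometheus is scraping the compactor, the following query sketch surfaces both signals at once:&lt;/p&gt;

```promql
rate(tempodb_compaction_bytes_written_total[5m])
rate(tempodb_compaction_errors_total[5m])
```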
&lt;h2 id=&#34;solutions&#34;&gt;Solutions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Verify that the compactor has the &lt;code&gt;LIST&lt;/code&gt;, &lt;code&gt;GET&lt;/code&gt;, &lt;code&gt;PUT&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt; permissions on the bucket objects.
&lt;ul&gt;
&lt;li&gt;If these permissions are missing, assign them to the compactor container.&lt;/li&gt;
&lt;li&gt;For detailed information, refer to the 
    &lt;a href=&#34;/docs/tempo/v2.10.x/configuration/hosted-storage/s3/&#34;&gt;Amazon S3 permissions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If there’s a compactor sitting idle while others are running, port-forward to the compactor’s http endpoint. Then go to &lt;code&gt;/compactor/ring&lt;/code&gt; and click &lt;strong&gt;Forget&lt;/strong&gt; on the inactive compactor.&lt;/li&gt;
&lt;li&gt;Check the following configuration parameters to ensure that there are correct settings:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;max_block_bytes&lt;/code&gt; to determine when the ingester cuts blocks. A good number is anywhere from 100MB to 2GB depending on the workload.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_compaction_objects&lt;/code&gt; to determine the max number of objects in a compacted block. This should be relatively high, generally in the millions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;retention_duration&lt;/code&gt; for how long traces should be retained in the backend.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Check the storage section of the configuration and increase &lt;code&gt;queue_depth&lt;/code&gt;. Do bear in mind that a deeper queue could mean longer
waiting times for query responses. Adjust &lt;code&gt;max_workers&lt;/code&gt; accordingly, which configures the number of parallel workers
that query backend blocks.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;storage:
  trace:
    pool:
      max_workers: 100   # worker pool determines the number of parallel requests to the object store backend
      queue_depth: 10000&lt;/code&gt;&lt;/pre&gt;
]]></content><description>&lt;h1 id="too-many-jobs-in-the-queue">Too many jobs in the queue&lt;/h1>
&lt;p>The error message might also be&lt;/p>
&lt;ul>
&lt;li>&lt;code>queue doesn't have room for 100 jobs&lt;/code>&lt;/li>
&lt;li>&lt;code>failed to add a job to work queue&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>You may see this error if the compactor isn’t running and the blocklist size has exploded.
Possible reasons why the compactor may not be running are:&lt;/p></description></item><item><title>Bad blocks</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/bad-blocks/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/bad-blocks/</guid><content><![CDATA[&lt;h1 id=&#34;bad-blocks&#34;&gt;Bad blocks&lt;/h1&gt;
&lt;p&gt;Queries fail with an error message containing:&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;error querying store in Querier.FindTraceByID: error using pageFinder (1, 5927cbfb-aabe-48b2-9df5-f4c3302d915f): ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This might indicate that there is a bad (corrupted) block in the backend.&lt;/p&gt;
&lt;p&gt;A block can get corrupted if the ingester crashed while flushing the block to the backend.&lt;/p&gt;
&lt;h2 id=&#34;fixing-bad-blocks&#34;&gt;Fixing bad blocks&lt;/h2&gt;
&lt;p&gt;At the moment, a backend block can be fixed if either the index or bloom-filter is corrupt/deleted.&lt;/p&gt;
&lt;p&gt;To fix such a block, first download it onto a machine where you can run the &lt;code&gt;tempo-cli&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Next, run the &lt;code&gt;tempo-cli&lt;/code&gt;&amp;rsquo;s &lt;code&gt;gen index&lt;/code&gt; or &lt;code&gt;gen bloom&lt;/code&gt; command, depending on which file is corrupt or deleted.
The command creates a fresh index or bloom-filter from the data file and writes it to the required location (in the block folder).
To view all of the options for this command, see the 
    &lt;a href=&#34;/docs/tempo/v2.10.x/operations/tempo_cli/&#34;&gt;CLI docs&lt;/a&gt;.&lt;/p&gt;
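&lt;p&gt;An illustrative invocation only; the exact flags vary by version and backend, so consult the CLI docs before running:&lt;/p&gt;

```
# Illustrative only; flag names may differ across Tempo versions.
tempo-cli gen index TENANT_ID BLOCK_ID --backend=local
tempo-cli gen bloom TENANT_ID BLOCK_ID --backend=local
```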
&lt;p&gt;Finally, upload the generated index or bloom-filter onto the object store backend under the folder for the block.&lt;/p&gt;
&lt;h2 id=&#34;removing-bad-blocks&#34;&gt;Removing bad blocks&lt;/h2&gt;
&lt;p&gt;If the procedure above reveals that the data file itself is corrupt, the only remaining option is to delete
the block, which results in some loss of data.&lt;/p&gt;
&lt;p&gt;The mechanism to remove a block from the backend is backend-specific, but the block to remove is located at:&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;&amp;lt;tenant ID&amp;gt;/&amp;lt;block ID&amp;gt;&lt;/code&gt;&lt;/pre&gt;
]]></content><description>&lt;h1 id="bad-blocks">Bad blocks&lt;/h1>
&lt;p>Queries fail with an error message containing:&lt;/p>
&lt;div class="code-snippet code-snippet__mini">&lt;div class="lang-toolbar__mini">
&lt;span class="code-clipboard">
&lt;button x-data="app_code_snippet()" x-init="init()" @click="copy()">
&lt;img class="code-clipboard__icon" src="/media/images/icons/icon-copy-small-2.svg" alt="Copy code to clipboard" width="14" height="13">
&lt;span>Copy&lt;/span>
&lt;/button>
&lt;/span>
&lt;/div>&lt;div class="code-snippet code-snippet__border">
&lt;pre data-expanded="false">&lt;code class="language-none">error querying store in Querier.FindTraceByID: error using pageFinder (1, 5927cbfb-aabe-48b2-9df5-f4c3302d915f): ...&lt;/code>&lt;/pre>
&lt;/div>
&lt;/div>
&lt;p>This might indicate that there is a bad (corrupted) block in the backend.&lt;/p></description></item><item><title>Tag search</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/search-tag/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/search-tag/</guid><content><![CDATA[&lt;h1 id=&#34;tag-search&#34;&gt;Tag search&lt;/h1&gt;
&lt;p&gt;An issue occurs while searching for traces in Grafana Explore. The &lt;strong&gt;Service Name&lt;/strong&gt; and &lt;strong&gt;Span Name&lt;/strong&gt; drop down lists are empty, and there is a &lt;code&gt;No options found&lt;/code&gt; message.&lt;/p&gt;
&lt;p&gt;HTTP requests to the Tempo query-frontend endpoint at &lt;code&gt;/api/search/tag/service.name/values&lt;/code&gt; return an empty result.&lt;/p&gt;
&lt;h2 id=&#34;root-cause&#34;&gt;Root cause&lt;/h2&gt;
&lt;p&gt;This issue is caused by a cap on the size of tag-value query responses.&lt;/p&gt;
&lt;p&gt;The configuration parameter &lt;code&gt;max_bytes_per_tag_values_query&lt;/code&gt; causes Tempo to return an empty result
when a query response exceeds the configured value.&lt;/p&gt;
&lt;h2 id=&#34;solutions&#34;&gt;Solutions&lt;/h2&gt;
&lt;p&gt;There are two main solutions to this issue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduce the cardinality of tags pushed to Tempo. Fewer unique tag values reduce the size of the result returned by a tag search query.&lt;/li&gt;
&lt;li&gt;Increase the &lt;code&gt;max_bytes_per_tag_values_query&lt;/code&gt; parameter in the 
    &lt;a href=&#34;/docs/tempo/v2.10.x/configuration/#overrides&#34;&gt;overrides&lt;/a&gt; block of your Tempo configuration to a value as high as 50MB.&lt;/li&gt;
&lt;/ul&gt;
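&lt;p&gt;As a sketch, in the flat overrides format this looks like the following; newer Tempo versions may nest the parameter under the standard overrides structure:&lt;/p&gt;

```yaml
overrides:
  max_bytes_per_tag_values_query: 52428800  # ~50MB; illustrative upper bound
```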
]]></content><description>&lt;h1 id="tag-search">Tag search&lt;/h1>
&lt;p>An issue occurs while searching for traces in Grafana Explore. The &lt;strong>Service Name&lt;/strong> and &lt;strong>Span Name&lt;/strong> drop down lists are empty, and there is a &lt;code>No options found&lt;/code> message.&lt;/p></description></item><item><title>Response larger than the max</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/response-too-large/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/response-too-large/</guid><content><![CDATA[&lt;h1 id=&#34;response-larger-than-the-max&#34;&gt;Response larger than the max&lt;/h1&gt;
&lt;p&gt;The error message is similar to the following:&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;500 Internal Server Error Body: response larger than the max (&amp;lt;size&amp;gt; vs &amp;lt;limit&amp;gt;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This error indicates that the response received or sent is too large.
This can happen in multiple places, but it&amp;rsquo;s most commonly seen in the query path,
with messages between the querier and the query frontend.&lt;/p&gt;
&lt;h2 id=&#34;solutions&#34;&gt;Solutions&lt;/h2&gt;
&lt;h3 id=&#34;tempo-server-general&#34;&gt;Tempo server (general)&lt;/h3&gt;
&lt;p&gt;Tempo components communicate with each other via gRPC requests.
To increase the maximum message size, you can increase the gRPC message size limit in the server block.&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;server:
  grpc_server_max_recv_msg_size: &amp;lt;size&amp;gt;
  grpc_server_max_send_msg_size: &amp;lt;size&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server configuration block isn&amp;rsquo;t synchronized across components, so
you most likely need to increase the message size limit in multiple components.&lt;/p&gt;
&lt;h3 id=&#34;querier&#34;&gt;Querier&lt;/h3&gt;
&lt;p&gt;Additionally, querier workers can be configured to use a larger message size limit.&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;querier:
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: &amp;lt;size&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;ingestion&#34;&gt;Ingestion&lt;/h3&gt;
&lt;p&gt;Lastly, message size is also limited in ingestion and can be modified in the distributor block.&lt;/p&gt;

&lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          max_recv_msg_size_mib: &amp;lt;size&amp;gt;&lt;/code&gt;&lt;/pre&gt;
]]></content><description>&lt;h1 id="response-larger-than-the-max">Response larger than the max&lt;/h1>
&lt;p>The error message is similar to the following:&lt;/p>
&lt;div class="code-snippet code-snippet__mini">&lt;div class="lang-toolbar__mini">
&lt;span class="code-clipboard">
&lt;button x-data="app_code_snippet()" x-init="init()" @click="copy()">
&lt;img class="code-clipboard__icon" src="/media/images/icons/icon-copy-small-2.svg" alt="Copy code to clipboard" width="14" height="13">
&lt;span>Copy&lt;/span>
&lt;/button>
&lt;/span>
&lt;/div>&lt;div class="code-snippet code-snippet__border">
&lt;pre data-expanded="false">&lt;code class="language-none">500 Internal Server Error Body: response larger than the max (&amp;lt;size&amp;gt; vs &amp;lt;limit&amp;gt;)&lt;/code>&lt;/pre>
&lt;/div>
&lt;/div>
&lt;p>This error indicates that the response received or sent is too large.
This can happen in multiple places, but it&amp;rsquo;s most commonly seen in the query path,
with messages between the querier and the query frontend.&lt;/p></description></item><item><title>Long-running traces</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/long-running-traces/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/long-running-traces/</guid><content><![CDATA[&lt;h1 id=&#34;long-running-traces&#34;&gt;Long-running traces&lt;/h1&gt;
&lt;p&gt;Long-running traces are created when Tempo receives spans for a trace,
followed by a delay, and then Tempo receives additional spans for the same
trace. If the delay between spans is great enough, the spans end up in
different blocks, which can lead to inconsistency in a few ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;When using TraceQL search, the duration information pertains only to a
subset of the blocks that contain a trace. This happens because Tempo
consults only enough blocks to determine the TraceID of the matching spans. When
performing a TraceID lookup, Tempo searches for all parts of a trace in all
matching blocks, which yields greater accuracy when the parts are combined.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When using 
    &lt;a href=&#34;/docs/tempo/v2.10.x/traceql/construct-traceql-queries/#combine-spansets&#34;&gt;&lt;code&gt;spanset&lt;/code&gt;
operators&lt;/a&gt;,
Tempo evaluates only the contiguous part of the trace within the current block. This means
that the conditions may evaluate to false for a single block, but would
evaluate to true if all parts of the trace from all blocks were considered.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You can tune the &lt;code&gt;ingester.trace_idle_period&lt;/code&gt; configuration for
greater control over when traces are written to a block.
Extending this beyond the default &lt;code&gt;10s&lt;/code&gt; allows long-running traces to be co-located in the same
block, but take into account the increased memory consumption on
the ingesters.
Currently, this setting isn&amp;rsquo;t per-tenant, so adjusting it
affects all ingester instances.&lt;/p&gt;
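&lt;p&gt;For example, a minimal configuration sketch that extends the idle period. The &lt;code&gt;30s&lt;/code&gt; value is illustrative only; tune it against your ingester memory budget:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;YAML&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;ingester:
  # Flush a trace to a block after this much time without new spans.
  # Illustrative value; the default is 10s.
  trace_idle_period: 30s&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;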
&lt;h3 id=&#34;data-quality-metrics&#34;&gt;Data quality metrics&lt;/h3&gt;
&lt;p&gt;Tempo publishes a &lt;code&gt;tempo_warnings_total&lt;/code&gt; metric from several components, which
can aid in understanding when this situation arises.&lt;/p&gt;
&lt;p&gt;When a trace is flushed to the WAL, it&amp;rsquo;s marshalled into the Parquet format, which makes it available for TraceQL metrics and search.
The more complete a trace is at this moment, the more accurate complex queries are.
The &lt;code&gt;disconnected_trace_flushed_to_wal&lt;/code&gt; and &lt;code&gt;rootless_trace_flushed_to_wal&lt;/code&gt; warning reasons help operators measure how reliable their trace data pipeline is.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;disconnected_trace_flushed_to_wal&lt;/code&gt;: Incremented when a trace is flushed that has a span with a parent ID that cannot be found.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rootless_trace_flushed_to_wal&lt;/code&gt;: Incremented when a trace is flushed that doesn&amp;rsquo;t have a root span. A root span is a span whose parent ID is all &lt;code&gt;0&lt;/code&gt;s.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You might see these data quality metrics if you use a Prometheus query like this to explore Tempo warnings:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;sum(rate(tempo_warnings_total{}[5m])) by (reason)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This example helps determine the percentage of complete traces flushed. This metric can help you optimize your instrumentation and traces pipeline and understand the impact they have on Tempo data quality.&lt;/p&gt;
&lt;p&gt;In particular, you can use the following query to determine what percentage of traces flushed to the WAL are connected.&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;1 - sum(rate(tempo_warnings_total{reason=&amp;#34;disconnected_trace_flushed_to_wal&amp;#34;}[5m])) / sum(rate(tempo_ingester_traces_created_total{}[5m]))&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If you have long-running traces, you may also be interested in the
&lt;code&gt;rootless_trace_flushed_to_wal&lt;/code&gt; reason, which indicates when a trace is flushed to the
WAL without a root span.&lt;/p&gt;
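&lt;p&gt;By analogy with the connected-trace query, a sketch of a query for the percentage of flushed traces that have a root span (assuming the same metric names):&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;1 - sum(rate(tempo_warnings_total{reason=&amp;#34;rootless_trace_flushed_to_wal&amp;#34;}[5m])) / sum(rate(tempo_ingester_traces_created_total{}[5m]))&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;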
&lt;p&gt;You can use &lt;code&gt;reason&lt;/code&gt; fields for discovery with this query:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;sum(rate(tempo_warnings_total{}[5m])) by (reason)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In general, Tempo functions at its peak when all parts of a trace are stored
within as few blocks as possible. There is a wide variety of tracing patterns
in the wild, which makes it impossible to optimize for all of them.&lt;/p&gt;
&lt;p&gt;While the preceding information can help determine what Tempo is doing, it may
be worth modifying the usage pattern slightly. For example, you may want to use
&lt;a href=&#34;https://opentelemetry.io/docs/concepts/signals/traces/#span-links&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;span
links&lt;/a&gt;, so
that traces are split up, allowing one trace to complete while pointing to the
next trace in the causal chain. This allows both traces to finish in a
shorter duration and increases the chances of each trace ending up in the same block.&lt;/p&gt;
]]></content><description>&lt;h1 id="long-running-traces">Long-running traces&lt;/h1>
&lt;p>Long-running traces are created when Tempo receives spans for a trace,
followed by a delay, and then Tempo receives additional spans for the same
trace. If the delay between spans is great enough, the spans end up in
different blocks, which can lead to inconsistency in a few ways:&lt;/p></description></item><item><title>Too many requests error</title><link>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/too-many-requests-error/</link><pubDate>Thu, 09 Apr 2026 14:59:14 +0000</pubDate><guid>https://grafana.com/docs/tempo/v2.10.x/troubleshooting/querying/too-many-requests-error/</guid><content><![CDATA[&lt;h1 id=&#34;too-many-requests-429-error-code&#34;&gt;Too many requests (429 error code)&lt;/h1&gt;
&lt;p&gt;If an issue occurs during a Tempo query, the error response may look like the following:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;429 failed to execute TraceQL query: {resource.service.name != nil} | rate() by(resource.service.name) Status: 429 Too Many Requests Body: job queue full&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&#34;root-cause&#34;&gt;Root cause&lt;/h2&gt;
&lt;p&gt;Tempo parallelizes work by breaking a single query into multiple requests (jobs) that are distributed to the queriers.
Increasing the time range results in more jobs being created.
To ensure fair resource usage and to prevent the &amp;ldquo;noisy neighbor&amp;rdquo; problem in multi-tenant environments, Tempo limits the number of jobs a tenant can run concurrently. The maximum number of jobs per tenant is controlled by the query-frontend setting &lt;code&gt;max_outstanding_per_tenant&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;solutions&#34;&gt;Solutions&lt;/h2&gt;
&lt;p&gt;There are two main solutions to this issue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduce the time range of the query.&lt;/li&gt;
&lt;li&gt;Increase the &lt;code&gt;max_outstanding_per_tenant&lt;/code&gt; parameter in the query-frontend configuration from the default of 2000 jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;YAML&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;query-frontend:
  max_outstanding_per_tenant: &amp;lt;max number of jobs&amp;gt;&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
]]></content><description>&lt;h1 id="too-many-requests-429-error-code">Too many requests (429 error code)&lt;/h1>
&lt;p>If an issue occurs during a Tempo query, the error response may look like the following:&lt;/p>
&lt;div class="code-snippet code-snippet__mini">&lt;div class="lang-toolbar__mini">
&lt;span class="code-clipboard">
&lt;button x-data="app_code_snippet()" x-init="init()" @click="copy()">
&lt;img class="code-clipboard__icon" src="/media/images/icons/icon-copy-small-2.svg" alt="Copy code to clipboard" width="14" height="13">
&lt;span>Copy&lt;/span>
&lt;/button>
&lt;/span>
&lt;/div>&lt;div class="code-snippet code-snippet__border">
&lt;pre data-expanded="false">&lt;code class="language-none">429 failed to execute TraceQL query: {resource.service.name != nil} | rate() by(resource.service.name) Status: 429 Too Many Requests Body: job queue full&lt;/code>&lt;/pre>
&lt;/div>
&lt;/div>
&lt;h2 id="root-cause">Root cause&lt;/h2>
&lt;p>Tempo parallelizes work by breaking a single query into multiple requests (jobs) that are distributed to the queriers.
Increasing the time range results in more jobs being created.
To ensure fair resource usage and to prevent the &amp;ldquo;noisy neighbor&amp;rdquo; problem in multi-tenant environments, Tempo limits the number of jobs a tenant can run concurrently. The maximum number of jobs per tenant is controlled by the query-frontend setting &lt;code>max_outstanding_per_tenant&lt;/code>.&lt;/p></description></item></channel></rss>