Issues with querying on Grafana Labs

Unable to find traces

Thu, 28 May 2026 17:50:33 +0100

The two main causes of missing traces are:

Issues in ingestion of the data into Tempo. Spans are either not sent correctly to Tempo or they aren’t getting sampled.
Issues querying for traces that have been received by Tempo.

Section 1: Diagnose and fix ingestion issues

The first step is to check whether the application spans are actually reaching Tempo.

Add the following flag to the distributor container: distributor.log_received_spans.enabled.

This flag enables debug logging of all the traces received by the distributor. These logs can help check if Tempo is receiving any traces at all.

You can also check the following metrics:

tempo_distributor_spans_received_total
tempo_live_store_traces_created_total

The value of both metrics should be greater than 0 within a few minutes of the application spinning up. You can check both metrics using either:

The metrics page exposed from Tempo at http://<tempo-address>:<tempo-http-port>/metrics or
In Prometheus, if it’s used to scrape metrics.
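The metrics page check can be scripted. The following Python sketch sums every sample of a counter found in Prometheus text-format output, such as the page Tempo exposes at /metrics. The function names sum_metric and fetch_metrics are hypothetical helpers written for this example, and the parser assumes well-formed exposition text.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download a metrics page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def sum_metric(exposition_text, metric_name):
    """Sum all samples of metric_name in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Labeled sample: name{label="x"} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            # Unlabeled sample: name value [timestamp]
            parts = line.split()
            name, fields = parts[0], parts[1:]
        if name == metric_name and fields:
            total += float(fields[0])
    return total
```

For example, sum_metric(fetch_metrics("http://<tempo-address>:<tempo-http-port>/metrics"), "tempo_distributor_spans_received_total") should return a value greater than 0 once spans are arriving.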

Case 1 - tempo_distributor_spans_received_total is 0

If the value of tempo_distributor_spans_received_total is 0, possible reasons are:

Use of incorrect protocol/port combination while initializing the tracer in the application.
Tracing records not getting picked up to send to Tempo by the internal sampler.
Application is running inside Docker and sending traces to an incorrect endpoint.

You can also get receiver-specific traffic information from tempo_receiver_accepted_spans, which has a label for the receiver (the protocol used for ingestion, for example jaeger-thrift).

Solutions

The solution depends on which of three problems applies: protocol or port problems, sampling issues, or incorrect endpoints.

To fix protocol or port problems:

Find out which communication protocol the application uses to emit traces. This is unique to every client SDK. For instance, the Jaeger Golang client uses Thrift Compact over UDP by default.
Check the list of supported protocols and their ports and ensure that the correct combination is being used.

To fix sampling issues:

These issues can be tricky to determine because most SDKs use a probabilistic sampler by default. This may lead to just one in 1,000 records being picked up.
Check the sampling configuration of the tracer being initialized in the application and make sure it has a high sampling rate.
Some clients also provide metrics on the number of spans reported from the application, for example jaeger_tracer_reporter_spans_total. Check the value of that metric if available and make sure it’s greater than zero.
Another way to diagnose this problem is to generate a large number of traces and see whether some records make their way to Tempo.
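To judge how many traces are enough for that test, the arithmetic can be sketched in Python. This assumes each trace is sampled independently at a fixed probability, which is a simplification of how real samplers behave. The function names are hypothetical.

```python
import math


def chance_any_sampled(sampling_rate, trace_count):
    """Probability that at least one of trace_count traces is sampled."""
    return 1.0 - (1.0 - sampling_rate) ** trace_count


def traces_needed(sampling_rate, confidence=0.99):
    """Smallest trace count giving at least one sampled trace with the given confidence."""
    if sampling_rate <= 0.0:
        raise ValueError("sampling_rate must be greater than 0")
    if sampling_rate >= 1.0:
        return 1
    # Solve 1 - (1 - p)^n >= confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - sampling_rate))
```

At a sampling rate of one in 1,000, sending 1,000 traces still leaves roughly a one in three chance that none is sampled, so an empty result after a small test does not prove that ingestion is broken.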

To fix an incorrect endpoint issue:

If the application is also running inside Docker, make sure the application is sending traces to the correct endpoint (tempo:<receiver-port>).
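A quick way to test an endpoint from inside the application container is a plain TCP connection attempt. The sketch below is a hypothetical helper, not a Tempo tool. It only covers TCP-based receivers, so it cannot validate UDP receivers such as Thrift Compact.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Call it with the hostname and receiver port that the tracer is configured to use, for example can_connect("tempo", receiver_port), where receiver_port is whatever port your deployment exposes. A False result points to a wrong hostname, a wrong port, or a network that does not connect the two containers.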

Case 2 - tempo_live_store_traces_created_total is 0

If the value of tempo_live_store_traces_created_total is 0, this can indicate issues between the distributors and Kafka, or between Kafka and the live-stores.

Solution

Check distributor logs for Kafka write errors such as msg="failed to write to kafka".
Verify that Kafka is healthy and that the distributors can reach it.
Check live-store logs to ensure they are consuming from Kafka successfully. Look for consumer lag metrics to confirm data is flowing.

Case 3 - Live-store Kafka lag

If the live-store is lagging behind its Kafka partition, queries for recent data may return incomplete results.

To check whether lag is affecting queries, run the following PromQL query in Grafana or Prometheus:

rate(tempo_live_store_lagged_requests_total[5m])

A non-zero rate means that query time ranges are overlapping with the live-store’s Kafka lag, and some recently ingested traces may be missing from results. The metric is labeled by route, so you can see which query type is affected (/tempopb.Querier/SearchRecent for search queries or /tempopb.Metrics/QueryRange for TraceQL metrics queries).

Solution

Check the raw consumer lag per partition using your live-store consumer group label:
promql
```
tempo_ingest_group_partition_lag{group="<CONSUMER_GROUP>"}
```
The group label is derived from the live-store ring instance ID. For example, in a zone-aware deployment the group might be live-store-zone-a.
If lag is persistent, the live-store may need more resources or partitions may need to be redistributed.
To make incomplete results explicit, set fail_on_high_lag: true in the live-store configuration. When enabled, the live-store returns an error instead of silently incomplete results.
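If you run the lag query through the Prometheus HTTP API (the /api/v1/query endpoint), the decoded JSON response can be filtered for lagging partitions. The following Python sketch assumes the series carry a partition label and uses an arbitrary threshold. Both are assumptions to adapt to your deployment, and the sample response is made up for illustration.

```python
def lagging_partitions(query_response, threshold):
    """Return {partition: lag} for series whose lag exceeds threshold."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    lagging = {}
    for series in query_response["data"]["result"]:
        labels = series["metric"]
        # Instant-query samples are [timestamp, "value-as-string"] pairs.
        _timestamp, raw_value = series["value"]
        lag = float(raw_value)
        if lag > threshold:
            lagging[labels.get("partition", "unknown")] = lag
    return lagging


# Illustrative response shape for an instant vector query.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"group": "live-store-zone-a", "partition": "0"},
             "value": [1700000000, "12"]},
            {"metric": {"group": "live-store-zone-a", "partition": "1"},
             "value": [1700000000, "4500"]},
        ],
    },
}
```

Partitions that stay in the result across several checks are the ones to investigate for persistent lag.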

Case 4 - Trace is not recent

Live-stores only serve recent data. Older traces are stored in blocks built by the block-builder. If a trace was ingested but can’t be found, the block-builder may not be flushing blocks to the backend correctly.

Solution

Check block-builder logs for errors during block creation or flushing to object storage.
Verify the block-builder is consuming from Kafka by checking consumer lag metrics.
Check the tempo_block_builder_flushed_blocks metric to confirm blocks are being written to the backend.
Check the tempo_block_builder_fetch_errors_total metric for Kafka fetch issues.

Section 2: Diagnose and fix sampling and limits issues

If you are able to query some traces in Tempo but not others, you have come to the right section.

This can happen for several reasons, some of which are detailed in this blog post: Where did all my spans go? A guide to diagnosing dropped spans in Jaeger distributed tracing. The post is most useful if you are using the Jaeger Agent.

If you are using Grafana Alloy, continue reading the following section for metrics to monitor.

Diagnose the issue

Check if the pipeline is dropping spans. The following metrics on Grafana Alloy help determine this:

exporter_send_failed_spans_ratio_total. The value of this metric should be 0.
receiver_refused_spans_ratio_total. The value of this metric should be 0.

If the pipeline isn’t reporting any dropped spans, check whether application spans are being dropped by Tempo. The following metrics help determine this:

tempo_receiver_refused_spans. The value of tempo_receiver_refused_spans should be 0.

If the value of tempo_receiver_refused_spans is greater than 0, the application spans are probably being dropped due to rate limiting.

Solutions

If the pipeline (Grafana Alloy) drops spans, the deployment may need to be scaled up.
There might also be issues with connectivity to the Tempo backend. Check the Alloy logs and make sure the Tempo endpoint and credentials are correctly configured.
If Tempo drops spans, this may be due to rate limiting. Rate limiting may be appropriate and therefore not an issue. The metric simply explains the cause of the missing spans.
If you require a higher ingest volume, increase the configuration for the rate limiting by adjusting the max_traces_per_user property in the configured override limits.

Note
Check the ingestion limits page for further information on limits.

Section 3: Diagnose and fix issues with querying traces

If Tempo is correctly ingesting trace spans, then it’s time to investigate possible issues with querying the data.

Check the logs of the query-frontend. The query-frontend pod runs with two containers, query-frontend and query. Use the following command to view query-frontend logs:

kubectl logs -f pod/query-frontend-xxxxx -c query-frontend

The presence of the following errors in the log may explain issues with querying traces:

level=info ts=XXXXXXX caller=frontend.go:63 method=GET traceID=XXXXXXXXX url=/api/traces/XXXXXXXXX duration=5m41.729449877s status=500
no org id
could not dial 10.X.X.X:3200 connection refused
tenant-id not found

Possible reasons for these errors are:

The querier isn’t connected to the query-frontend. Check the value of the metric cortex_query_frontend_connected_clients exposed by the query-frontend. It should be > 0, indicating querier connections with the query-frontend.
Grafana Tempo data source isn’t configured to pass tenant-id in the Authorization header (multi-tenant deployments only).
Grafana isn’t connected to the Tempo querier correctly.
Insufficient permissions.
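When the logs are long, a small filter can surface the error strings listed above. This Python sketch is a hypothetical helper. The hints it attaches are a reading of the reasons in this section, not messages produced by Tempo.

```python
# (substring to look for, hint about the likely cause)
KNOWN_ERRORS = [
    ("no org id", "tenant ID missing from the request"),
    ("tenant-id not found", "tenant ID missing from the request"),
    ("connection refused", "a component could not be reached"),
    ("status=500", "the query failed inside Tempo"),
]


def classify_log_lines(lines):
    """Return (line, hint) pairs for lines containing a known error string."""
    findings = []
    for line in lines:
        for needle, hint in KNOWN_ERRORS:
            if needle in line:
                findings.append((line.strip(), hint))
                break  # Report each line once, using the first match.
    return findings
```

Save the output of the kubectl logs command to a file and pass the file's lines to classify_log_lines to get a short list of lines worth reading first.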

Solutions

To fix connection issues:

If the queriers aren’t connected to the query-frontend, check the following section in the querier configuration and verify the query-frontend address.
YAML
```
querier:
  frontend_worker:
    frontend_address: query-frontend-discovery.default.svc.cluster.local:9095
```
Validate the Grafana data source configuration and debug network issues between Grafana and Tempo.

To fix an insufficient permissions issue:

Verify that the querier has the LIST and GET permissions on the bucket.

Too many jobs in the queue

The error message might also be

queue doesn't have room for 100 jobs
failed to add a job to work queue

You may see this error if the scheduler or worker isn’t running and the blocklist size has exploded.

Possible reasons why the scheduler or worker may not be running are:

Insufficient permissions.
Worker sitting idle because no block is hashing to it.
Incorrect configuration settings.

Diagnose the issue

Check the metric tempodb_compaction_bytes_written_total. If this is greater than zero (0), it means the worker is running and writing to the backend.
Check the metric tempodb_compaction_errors_total. If this metric is greater than zero (0), check the logs of the worker for an error message.

Solutions

Verify that the worker has the LIST, GET, PUT, and DELETE permissions on the bucket objects.
- If these permissions are missing, assign them to the worker container.
- For detailed information, refer to the Amazon S3 permissions.
If there’s a worker sitting idle while others are running, check the scheduler logs and worker metrics to diagnose the issue.
Check the following configuration parameters to ensure that they are set correctly:
- max_block_bytes to determine the maximum size of a block. A good number is anywhere from 100MB to 2GB depending on the workload.
- max_compaction_objects to determine the max number of objects in a compacted block. This should be relatively high, generally in the millions.
- retention_duration for how long traces should be retained in the backend.
Check the storage section of the configuration and increase queue_depth. Do bear in mind that a deeper queue could mean longer waiting times for query responses. Adjust max_workers accordingly, which configures the number of parallel workers that query backend blocks.

storage:
  trace:
    pool:
      max_workers: 100 # worker pool determines the number of parallel requests to the object store backend
      queue_depth: 10000

Bad blocks

Queries fail with an error message containing:

error querying store in Querier.FindTraceByID: error using pageFinder (1, 5927cbfb-aabe-48b2-9df5-f4c3302d915f): ...

This might indicate that there is a bad (corrupted) block in the backend.

How blocks can get corrupted

Blocks are created by the block-builder, which consumes data from Kafka and flushes blocks to object storage. The block-builder is designed to be recoverable at every stage. The block-builder rewinds to the last Kafka commit on each cycle, clears its scratch disk, and uses deterministic block IDs so that partial flushes can be safely overwritten.

A block becomes live only once its meta.json is written to object storage. Before that point, any crash is fully recoverable. In rare cases, corruption can still occur, for example if object storage acknowledges a write that is not fully persisted, or if the data files are corrupted during upload.

Removing bad blocks

If you encounter corrupted blocks, delete the affected blocks, which may result in some loss of data. The block-builder will replay from Kafka and rebuild any data that hasn’t been committed yet. Alternatively, you can restore the blocks from a backup, if available.

The mechanism to remove a block from the backend is backend-specific, but the block to remove will be at:

<tenant ID>/<block ID>
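The block ID appears in the query error message, so the path to remove can be derived from it. The following Python sketch assumes the first UUID in the message is the block ID, as in the example error above. The function name and the tenant ID used in the usage note are hypothetical.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def block_prefix(error_message, tenant_id):
    """Build the <tenant ID>/<block ID> prefix from a query error message."""
    match = UUID_PATTERN.search(error_message)
    if match is None:
        return None  # No block ID found in the message.
    return f"{tenant_id}/{match.group(0)}"
```

For example, calling block_prefix with the error message shown earlier and a tenant ID of "my-tenant" yields my-tenant/5927cbfb-aabe-48b2-9df5-f4c3302d915f. Confirm the prefix in your backend before deleting anything.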

Tag search

An issue occurs while searching for traces in Grafana Explore. The Service Name and Span Name drop-down lists are empty, and there is a No options found message.

HTTP requests to the Tempo query-frontend endpoint at /api/search/tag/service.name/values respond with an empty set.

Root cause

This issue is caused by a cap on the size of tag search responses.

When the response to a tag values query exceeds the limit set by the max_bytes_per_tag_values_query configuration parameter, Tempo returns an empty result.

Solutions

There are two main solutions to this issue:

Reduce the cardinality of tags pushed to Tempo. Reducing the number of unique tag values will reduce the size returned by a tag search query.
Increase the max_bytes_per_tag_values_query parameter in the overrides block of your Tempo configuration to a value as high as 50MB.

Response larger than the max

The error message is similar to the following:

500 Internal Server Error Body: response larger than the max (<size> vs <limit>)

This error indicates that the response received or sent is too large. This can happen in multiple places, but it’s most commonly seen in the query path, with messages between the querier and the query frontend.

Solutions

Tempo server (general)

Tempo components communicate with each other via gRPC requests. To increase the maximum message size, you can increase the gRPC message size limit in the server block.

server:
  grpc_server_max_recv_msg_size: <size>
  grpc_server_max_send_msg_size: <size>

The server config block is not synchronized across components. Most likely you will need to increase the message size limit in multiple components.
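To choose a new limit, you can read the observed size out of the error message and add headroom. This Python sketch assumes that <size> and <limit> in the message are integers in the same unit as the server settings. Verify this against your own error output. The function name and the 1.5 headroom factor are arbitrary choices for illustration.

```python
import math
import re

PATTERN = re.compile(r"response larger than the max \((\d+) vs (\d+)\)")


def suggest_limit(error_message, headroom=1.5):
    """Suggest a message size limit that fits the response reported in the error."""
    match = PATTERN.search(error_message)
    if match is None:
        return None  # Not a "response larger than the max" error.
    size, limit = int(match.group(1)), int(match.group(2))
    # Never suggest less than the current limit.
    return max(limit, math.ceil(size * headroom))
```

Apply the suggested value to every component that takes part in the failing request path, because the server block is configured per component.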

Querier

Additionally, querier workers can be configured to use a larger message size limit.

querier:
    frontend_worker:
        grpc_client_config:
            max_send_msg_size: <size>

Ingestion

Lastly, message size is also limited in ingestion and can be modified in the distributor block.

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          max_recv_msg_size_mib: <size>

Long-running traces

Long-running traces are created when Tempo receives spans for a trace, followed by a delay, and then Tempo receives additional spans for the same trace. If the delay between spans is great enough, the spans end up in different blocks, which can lead to inconsistency in a few ways:

When using TraceQL search, the duration information only pertains to a subset of the blocks that contain a trace. This happens because Tempo consults only enough blocks to know the TraceID of the matching spans. When performing a TraceID lookup, Tempo searches for all parts of a trace in all matching blocks, which yields greater accuracy when combined.
When using spanset operators, Tempo only evaluates the contiguous trace of the current block. This means that the conditions may evaluate to false for a single block, even though they would evaluate to true if all parts of the trace from all blocks were considered.

In Tempo 3.0, two components handle trace data independently:

Live-stores serve recent data. They hold traces in memory and can keep spans for the same trace together as long as the trace remains active. You can tune the live_store.max_trace_idle configuration to control when a trace is considered idle. Extending this beyond the default 5s can allow for long-running traces to be co-located, but take into account other considerations around memory consumption on the live-stores.
Block-builders consume from Kafka and build blocks for long-term storage. They do a hard cut at a certain record on each consumption cycle. All spans consumed in a cycle are flushed into blocks regardless of whether the trace is complete. This means a trace’s spans can be split across block-builder cycles with no way to keep them together.
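The effect of an idle cutoff can be illustrated with a simplified model. The Python sketch below groups span arrival times into segments whenever the gap between consecutive spans exceeds the idle limit. It is a toy model of the idea behind live_store.max_trace_idle, not Tempo's implementation, and the function name is hypothetical.

```python
def split_by_idle(arrival_times, max_idle):
    """Group span arrival times into segments separated by gaps longer than max_idle."""
    segments = []
    for arrival in sorted(arrival_times):
        if segments and arrival - segments[-1][-1] <= max_idle:
            # Trace is still active: keep the span with the current segment.
            segments[-1].append(arrival)
        else:
            # Trace went idle (or this is the first span): start a new segment.
            segments.append([arrival])
    return segments
```

With arrivals at 0, 1, 2, 10, and 11 seconds, an idle limit of 5 seconds produces two segments, while a limit of 10 seconds keeps all five spans together. A longer limit keeps more of a trace together at the cost of holding it in memory longer.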

Data quality metrics

Tempo publishes a tempo_warnings_total metric from the live-store, which can aid in understanding when this situation arises.

When a trace is flushed to the WAL in the live-store, it’s marshalled in the Parquet format which makes it available for TraceQL metrics and search. The more complete a trace is at this moment, the more accurate complex queries are. The disconnected_trace_flushed_to_wal and rootless_trace_flushed_to_wal metrics help operators measure how reliable their trace data pipeline is.

disconnected_trace_flushed_to_wal: Incremented when a trace is flushed that has a span whose parent ID cannot be found.
rootless_trace_flushed_to_wal: Incremented when a trace is flushed that doesn’t have a root span. A root span is a span whose parent ID is all zeros.

You might see these data quality metrics if you use a Prometheus query like this to explore Tempo warnings:

sum(rate(tempo_warnings_total{}[5m])) by (reason)

This example helps determine the percentage of complete traces flushed. This metric can help you optimize your instrumentation and traces pipeline and understand the impact it has on Tempo data quality.

In particular, the following query can be used to know what percentage of traces flushed to the WAL are connected.

1 - sum(rate(tempo_warnings_total{reason="disconnected_trace_flushed_to_wal"}[5m])) / sum(rate(tempo_live_store_traces_created_total{}[5m]))
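The arithmetic behind this query is simple enough to check by hand. The following Python sketch applies the same formula to two rates. The function name is hypothetical, and the guard for a zero creation rate is an addition that the PromQL expression does not have.

```python
def connected_fraction(disconnected_rate, created_rate):
    """Fraction of flushed traces that are connected, given per-second rates."""
    if created_rate <= 0:
        return None  # No traces created, so the ratio is undefined.
    return 1.0 - disconnected_rate / created_rate
```

For example, 5 disconnected traces per second out of 100 created per second gives a connected fraction of about 0.95, which is 95%.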

If you have long-running traces, you may also be interested in the rootless_trace_flushed_to_wal reason to know when a trace is flushed to the WAL without a root span.

In general, Tempo functions at its peak when all parts of a trace are stored within as few blocks as possible. There is a wide variety of tracing patterns in the wild, which makes it impossible to optimize for all of them.

While the preceding information can help determine what Tempo is doing, it may be worth modifying the usage pattern slightly. For example, you may want to use span links, so that traces are split up, allowing one trace to complete while pointing to the next trace in the causal chain. This allows both traces to finish in a shorter duration, and increases the chances of each trace ending up in a single block.

Too many requests (429 error code)

If an issue occurs during a Tempo query, the error response may look like:

429 failed to execute TraceQL query: {resource.service.name != nil} | rate() by(resource.service.name) Status: 429 Too Many Requests Body: job queue full

Root cause

Tempo parallelizes work by breaking a single query into multiple requests (jobs) that are distributed to the queriers. Increasing the time range results in more jobs being created. To ensure fair resource usage and to prevent the “noisy neighbor” problem in multi-tenant environments, Tempo limits the number of jobs a tenant can run concurrently. The maximum number of jobs per tenant is controlled by the query-frontend setting max_outstanding_per_tenant.

Solutions

There are two main solutions to this issue:

Reduce the time range of the query.
Increase the max_outstanding_per_tenant parameter in the query-frontend configuration from the default of 2000 jobs.

query_frontend:
  max_outstanding_per_tenant: <max number of jobs>
