Troubleshoot Tempo on Grafana Labs

Issues with sending traces

Thu, 28 May 2026 17:50:33 +0100

Learn about issues related to sending traces.

Distributor refusing spans
Troubleshoot distributor refusing spans
Troubleshoot Grafana Alloy
Gain visibility on how many traces are being pushed to Grafana Alloy and if they are making it to the Tempo backend.

Issues with querying

Learn about issues related to querying.

Unable to find traces
Troubleshoot missing traces
Too many jobs in the queue
Troubleshoot too many jobs in the queue
Bad blocks
Troubleshoot queries failing with an error message indicating bad blocks.
Tag search
Troubleshoot No options found in Grafana tag search
Response larger than the max
Troubleshoot response larger than the max error message
Long-running traces
Troubleshoot search results when using long-running traces
Too many requests error
Troubleshoot Too many requests error for a Tempo query

Troubleshoot metrics-generator

If you’re concerned with data quality issues in the metrics-generator, consider:

Reviewing your telemetry pipeline to determine the number of dropped spans. You are only looking for major issues here.
Reviewing the service graph documentation to understand how they are built.

If everything seems acceptable from these two perspectives, consider the following topics to help resolve general issues with all metrics and span metrics specifically.

Kafka consumption

In Tempo 3.0 microservices mode, metrics-generators consume trace data directly from Kafka rather than receiving pushes from distributors. In monolithic mode, the distributor still pushes directly to the in-process metrics-generator. If the generator is not producing metrics in a microservices deployment, start by verifying that it’s consuming data from Kafka using the metrics below.

Consumer lag

Use the following metrics to monitor the generator’s Kafka consumer lag:

tempo_ingest_group_partition_lag{group="metrics-generator"}
tempo_ingest_group_partition_lag_seconds{group="metrics-generator"}

tempo_ingest_group_partition_lag tracks the lag in number of records per partition, while tempo_ingest_group_partition_lag_seconds tracks the lag in seconds. High or growing lag indicates the generator is falling behind.

Kafka client errors

The generator uses the tempo_ingest_storage_reader family of metrics (provided by the Kafka client library) to expose detailed information about fetch operations, errors, and throughput. Look for error and failure metrics in this family to diagnose connectivity or protocol issues with Kafka.

All metrics

This section covers additional metrics related to the metrics-generator.

Discarded spans in the generator

The metrics-generator rejects spans that fall outside a configurable ingestion slack time, as well as spans excluded by user-configurable filters. You can see the number of spans rejected, by reason, using this metric:

sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason)

If a lot of spans are dropped in the metrics-generator due to your filters, you will need to adjust them. If spans are dropped due to the ingestion slack time, consider adjusting this setting:

metrics_generator:
  metrics_ingestion_time_range_slack: 30s

If spans regularly exceed this value, consider reviewing your tracing pipeline for excessive buffering. Increasing this value lets the generator consume more spans, but it reduces the accuracy of metrics because spans farther away from “now” are included.
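
The slack check can be pictured with a minimal Python sketch. It assumes the generator compares each span's end time against its own clock and accepts spans within the slack on either side of “now”; the function name and the symmetric window are illustrative assumptions, not Tempo's actual code.

```python
from datetime import datetime, timedelta, timezone


def within_slack(span_end: datetime, now: datetime, slack: timedelta) -> bool:
    """Return True when span_end lies inside [now - slack, now + slack].

    Hypothetical helper: the symmetric window is an assumption made
    for illustration.
    """
    return now - slack <= span_end <= now + slack


now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
slack = timedelta(seconds=30)  # mirrors metrics_ingestion_time_range_slack: 30s

fresh_span_end = now - timedelta(seconds=10)  # buffered for 10s, accepted
late_span_end = now - timedelta(seconds=45)   # buffered for 45s, discarded

print(within_slack(fresh_span_end, now, slack))
print(within_slack(late_span_end, now, slack))
```

A span that sits in a pipeline buffer for longer than the slack is discarded, which is why excessive buffering shows up in the discarded spans metric.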

Spans can also be discarded if their attributes contain invalid UTF-8 when those attributes are converted to metric labels.

Max active series

The generator protects itself and your remote-write target by capping the number of series it produces. Use the sum below to determine if series are being dropped due to this limit:

sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m]))

Use the following setting to update the limit:

overrides:
  defaults:
    metrics_generator:
      max_active_series: 0

Note that this value applies to each metrics-generator instance. The maximum number of series remote-written across the deployment is <# of metrics generators> * <metrics_generator.max_active_series>.
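
As a worked example of that multiplication, the following Python sketch computes the deployment-wide ceiling. The instance count and the limit are hypothetical values chosen for illustration, and the function name is not part of Tempo.

```python
def max_remote_written_series(generator_count: int, max_active_series: int) -> int:
    """Upper bound on series written by all generators combined."""
    if generator_count < 1 or max_active_series < 1:
        raise ValueError("both arguments must be positive")
    return generator_count * max_active_series


# Hypothetical deployment: 3 generators, each limited to 50,000 series.
print(max_remote_written_series(3, 50_000))  # 150000
```

Size the remote-write target for this combined figure rather than for the per-instance limit.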

Overflow series

When the active series limit is reached, the metrics-generator produces overflow series instead of dropping new data. These series have the label metric_overflow="true" and capture all data that would otherwise be lost.

To identify overflow series in your metrics:

{metric_overflow="true"}

As existing series become stale and are removed, new series are split out from the overflow bucket until the limit is reached again. To reduce overflow, either increase max_active_series or reduce cardinality by adjusting dimensions or filters.
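
The overflow behavior described above can be sketched in Python as follows. This is a simplified model, not Tempo's registry: the class and method names are hypothetical, and it assumes the overflow series is tracked separately from the active series count.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]
OVERFLOW: Labels = (("metric_overflow", "true"),)


class LimitedRegistry:
    """Toy counter registry that redirects new series to an overflow bucket."""

    def __init__(self, max_active_series: int) -> None:
        self.max_active_series = max_active_series
        self.series: Dict[Labels, float] = {}
        self.overflow = 0.0

    def add(self, labels: Labels, value: float = 1.0) -> Labels:
        """Record a sample and return the labels it was stored under."""
        known = labels in self.series
        if known or len(self.series) < self.max_active_series:
            self.series[labels] = self.series.get(labels, 0.0) + value
            return labels
        # Limit reached: keep the data, but under the overflow series.
        self.overflow += value
        return OVERFLOW

    def remove_stale(self, labels: Labels) -> None:
        """Drop a stale series, freeing room for a new one."""
        self.series.pop(labels, None)


registry = LimitedRegistry(max_active_series=2)
a = (("service", "a"),)
b = (("service", "b"),)
c = (("service", "c"),)
registry.add(a)
registry.add(b)
print(registry.add(c))  # stored under the overflow series
registry.remove_stale(a)
print(registry.add(c))  # now split out as its own series
```

Existing series keep receiving samples after the limit is hit; only series that would be new are folded into the overflow bucket.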

Entity-based limiting

You can configure entity-based limiting as an alternative to series-based limiting. An entity is a unique label combination (excluding external labels) across multiple metrics. Entity-based limiting ensures the generator always produces the full set of metrics for a given entity, rather than limiting randomly once the series limit is triggered.

To enable entity-based limiting, set limiter_type to entity:

metrics_generator:
  limiter_type: entity

Use the following metric to determine if entities are being limited:

sum(rate(tempo_metrics_generator_registry_entities_limited_total{}[1m]))

Configure the entity limit with:

overrides:
  defaults:
    metrics_generator:
      max_active_entities: 0

Per-label cardinality limiting

The per-label cardinality limiter caps the number of distinct values any single label can have. When a label exceeds the configured threshold, its value is replaced with __cardinality_overflow__ while all other labels that are under the limit are preserved.

For example, if the url label exceeds the cardinality limit:

Before:

{service="foo", method="GET", url="/users/1"}
{service="foo", method="GET", url="/users/2"}
{service="foo", method="GET", url="/users/3"}
...

After:

{service="foo", method="GET", url="__cardinality_overflow__"}

Once the limiter kicks in, new url values are replaced with __cardinality_overflow__. Labels that remain under the limit, like method, are unaffected.
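
The before and after example can be reproduced with a small Python sketch. It is a simplification under stated assumptions: it tracks exact sets of values instead of the cardinality estimates a real limiter would use, it lets values seen before the limit pass through unchanged, and the class name is hypothetical.

```python
from typing import Dict, Set

OVERFLOW_VALUE = "__cardinality_overflow__"


class PerLabelLimiter:
    """Toy per-label limiter using exact value sets (illustration only)."""

    def __init__(self, max_cardinality_per_label: int) -> None:
        self.max = max_cardinality_per_label
        self.seen: Dict[str, Set[str]] = {}

    def apply(self, labels: Dict[str, str]) -> Dict[str, str]:
        """Return labels with over-limit values replaced by the overflow marker."""
        result: Dict[str, str] = {}
        for name, value in labels.items():
            values = self.seen.setdefault(name, set())
            over_limit = self.max > 0 and len(values) >= self.max
            if over_limit and value not in values:
                result[name] = OVERFLOW_VALUE
            else:
                values.add(value)
                result[name] = value
        return result


limiter = PerLabelLimiter(max_cardinality_per_label=2)
for user_id in (1, 2, 3):
    labels = {"service": "foo", "method": "GET", "url": f"/users/{user_id}"}
    print(limiter.apply(labels))
```

The third call returns the url label as `__cardinality_overflow__` while service and method are unchanged, because each label is limited independently.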

To detect if per-label cardinality limiting is active:

sum by (tenant, label_name) (rate(tempo_metrics_generator_registry_label_values_limited_total{}[5m]))

To view the estimated cardinality demand per label:

tempo_metrics_generator_registry_label_cardinality_demand_estimate{}

Use this metric to identify which labels have high cardinality, how far they exceed the configured limit, and to choose an appropriate max_cardinality_per_label value. To observe actual demand before enforcing a limit, deploy with a high max_cardinality_per_label value first.

Understand the `label_name` values in this metric

The label_name label values represent every label tracked by the per-label cardinality limiter. These include all labels that flow through the metrics-generator registry, not just user-configured dimensions.

Built-in labels:

| Label | Processor | When added | Description |
| --- | --- | --- | --- |
| `service` | span-metrics | Always | The service name |
| `span_name` | span-metrics | Always | The operation or span name |
| `span_kind` | span-metrics | Always | The span kind (SERVER, CLIENT, etc.) |
| `status_code` | span-metrics | Always | The span status code |
| `job` | span-metrics | `enable_target_info` is `true` | The job name, derived from resource attributes |
| `instance` | span-metrics | `enable_target_info` and `enable_instance_label` are both `true` | The instance ID, derived from resource attributes |
| `client` | service-graphs | Always | The client service name |
| `server` | service-graphs | Always | The server service name |
| `connection_type` | service-graphs | Always | The connection type (virtual, database, messaging_system) |

Configured labels include:

Span-metrics dimensions are added without a prefix, with dots replaced by underscores. For example, deployment.environment becomes deployment_environment.
Service-graphs dimensions are prefixed with client_ and server_ when enable_client_server_prefix is true. For example, deployment.environment becomes client_deployment_environment and server_deployment_environment.
A configured dimension only appears if the corresponding attribute exists on incoming spans.
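
The naming rules for configured dimensions can be sketched in Python. The sketch only covers what the examples above show: dots become underscores, and service-graphs dimensions gain client_ and server_ prefixes when enable_client_server_prefix is true. The function names are hypothetical, and other sanitization Tempo may apply is not modeled.

```python
from typing import List


def span_metrics_label(dimension: str) -> str:
    """Label name for a span-metrics dimension (dots to underscores only)."""
    return dimension.replace(".", "_")


def service_graph_labels(dimension: str) -> List[str]:
    """Label names for a service-graphs dimension.

    Models only the case where enable_client_server_prefix is true.
    """
    base = dimension.replace(".", "_")
    return [f"client_{base}", f"server_{base}"]


print(span_metrics_label("deployment.environment"))
print(service_graph_labels("deployment.environment"))
```

These are the names to look for in the label_name values of the cardinality metrics.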

Configure the per-label cardinality limit:

overrides:
  defaults:
    metrics_generator:
      max_cardinality_per_label: 0

A value of 0 (default) disables the limit.

This setting works alongside both active series limiting (max_active_series) and entity-based limiting (max_active_entities). The per-label limiter runs during label construction, preventing any single high-cardinality label from consuming the entire active series or entity budget.

The per-label limiter uses HyperLogLog sketches to estimate cardinality, so the limit is approximate with a 3.25% standard error. Estimates are re-evaluated every few seconds, which means there may be a brief delay between a label crossing the threshold and the limiter taking effect.
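
For context on the 3.25% figure, the standard error of a HyperLogLog estimate depends on the number of registers m in the sketch. The calculation below shows that 3.25% matches a sketch with 1024 registers. The register count is an inference from the stated error, not a value confirmed by this documentation, and the limit of 1000 is a hypothetical example.

```latex
% Standard error of a HyperLogLog estimate with m registers
\sigma \approx \frac{1.04}{\sqrt{m}}

% With m = 2^{10} = 1024 registers (inferred, not confirmed)
\sigma \approx \frac{1.04}{\sqrt{1024}} = \frac{1.04}{32} = 0.0325 = 3.25\%

% For a hypothetical limit of 1000 values, one standard error is
1000 \times 0.0325 = 32.5 \text{ values}
```

In practice this means a label can be limited slightly before or slightly after it reaches the configured threshold.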

If a high-cardinality label’s cardinality is later reduced (for example, by fixing instrumentation), the limiter automatically recovers and allows label values through again. No configuration changes are needed.

Recovery is not immediate. The limiter tracks cardinality over a sliding window (based on the registry’s stale_duration), so it takes at least that duration for existing high-cardinality label values to age out before new values are allowed through again.

Estimate active series demand

When the active series limit is reached, the tempo_metrics_generator_registry_active_series metric no longer reflects the true demand. Use the tempo_metrics_generator_registry_active_series_demand_estimate metric to estimate what the active series count would be without the limit:

tempo_metrics_generator_registry_active_series_demand_estimate{}

This metric uses HyperLogLog estimation and has approximately 3% deviation from the actual cardinality. Use this to determine if you need to increase limits or reduce cardinality.

Span name sanitization

If span_name is one of the highest-cardinality labels in your setup, the span_name_sanitization option can reduce it by grouping similar span names and replacing variable segments. For example, GET /users/123 and GET /users/456 are both mapped to GET /users/<_>.
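
To illustrate the effect on span names, here is a minimal Python sketch that replaces purely numeric path segments with `<_>`. It uses a fixed regular expression as a stand-in; it is not the pattern-learning model that the span_name_sanitization option uses, and the function name is hypothetical.

```python
import re

# Matches a run of digits that forms a whole path segment.
_NUMERIC_SEGMENT = re.compile(r"(?<=/)\d+(?=/|$)")


def sanitize_span_name(name: str) -> str:
    """Replace numeric path segments with the <_> placeholder."""
    return _NUMERIC_SEGMENT.sub("<_>", name)


for span_name in ("GET /users/123", "GET /users/456", "GET /health"):
    print(sanitize_span_name(span_name))
```

Both user lookups collapse into one span name, so they produce one set of series instead of one per user ID.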

To evaluate the potential impact without modifying metrics, set span_name_sanitization to dry_run:

overrides:
  defaults:
    metrics_generator:
      span_name_sanitization: "dry_run"

After a few minutes, compare the demand estimate against current active series:

tempo_metrics_generator_registry_post_sanitization_demand_estimate{}

If this value is significantly lower than tempo_metrics_generator_registry_active_series, switch to enabled to apply the reduction.

After you enable the option, use the following metric to confirm spans are being sanitized:

rate(tempo_metrics_generator_registry_spans_sanitized_total{}[5m])

If this rate is zero after enabling, the DRAIN model hasn’t found patterns yet. This is expected for workloads with already-consistent span naming. The model trains continuously and adapts as new span names arrive.

For more details on configuration and usage, refer to Reduce cardinality with span name sanitization.

Remote write failures

For any number of reasons, the generator may fail a write to the remote write target. Use the following metrics to determine if that’s happening:

sum(rate(prometheus_remote_storage_samples_failed_total{}[1m]))
sum(rate(prometheus_remote_storage_samples_dropped_total{}[1m]))
sum(rate(prometheus_remote_storage_exemplars_failed_total{}[1m]))
sum(rate(prometheus_remote_storage_exemplars_dropped_total{}[1m]))

Service graph metrics

Service graphs have additional configuration which can impact the quality of the output metrics.

Expired edges

Use the following metrics to determine how many edges fail to find a match. The expired edges metric counts only edges that expired without the matching information needed to generate a service graph edge.

Rate of edges that have expired without a match:

sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m]))

Rate of all edges:

sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m]))

If you are seeing a large number of edges expire without a match, consider adjusting the wait setting. This controls how long the metrics generator waits to find a match before it gives up.

metrics_generator:
  processor:
    service_graphs:
      wait: 10s
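
The role of the wait setting can be pictured with a small Python sketch of edge matching. It assumes an edge is completed once both a client span and a server span arrive under the same key, and expires if the second half does not arrive within the wait period. The class, the key format, and the bookkeeping are hypothetical simplifications, not Tempo's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Edge:
    expires_at: float
    client: Optional[str] = None
    server: Optional[str] = None


class EdgeStore:
    """Toy store that pairs client and server spans within a wait period."""

    def __init__(self, wait_seconds: float) -> None:
        self.wait = wait_seconds
        self.pending: Dict[str, Edge] = {}
        self.completed: List[Tuple[str, str]] = []
        self.expired = 0

    def upsert(self, key: str, now: float,
               client: Optional[str] = None,
               server: Optional[str] = None) -> None:
        """Record one half of an edge; complete it when both halves exist."""
        edge = self.pending.setdefault(key, Edge(expires_at=now + self.wait))
        edge.client = client or edge.client
        edge.server = server or edge.server
        if edge.client and edge.server:
            self.completed.append((edge.client, edge.server))
            del self.pending[key]

    def expire(self, now: float) -> None:
        """Drop unmatched edges whose wait period has elapsed."""
        stale = [k for k, e in self.pending.items() if e.expires_at <= now]
        for key in stale:
            del self.pending[key]
            self.expired += 1


store = EdgeStore(wait_seconds=10)
store.upsert("trace1/span1", now=0, client="frontend")
store.upsert("trace1/span1", now=5, server="api")   # matched within wait
store.upsert("trace2/span7", now=0, client="frontend")
store.expire(now=10)                                 # never matched
print(store.completed, store.expired)
```

A longer wait gives slow or heavily buffered spans more time to arrive, at the cost of holding unmatched edges in memory for longer.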

Service graph max items

The service graph processor has a maximum number of edges it tracks at once to limit the total amount of memory the processor uses. To determine if edges are being dropped due to this limit, check:

sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m]))

Use max_items to adjust the maximum number of edges tracked:

metrics_generator:
  processor:
    service_graphs:
      max_items: 10000

Troubleshoot out-of-memory errors

Learn about out-of-memory (OOM) issues and how to troubleshoot them.

Set the max attribute size to help control out of memory errors

Tempo queriers can run out of memory when fetching traces that have spans with very large attributes. This issue has been observed when trying to fetch a single trace using the tracebyID endpoint.

To avoid these out-of-memory crashes, use max_attribute_bytes to limit the maximum allowable size of any individual attribute. Any key or value that exceeds the configured limit is truncated before storing.

Use the tempo_distributor_attributes_truncated_total metric to track how many attributes are truncated. This metric includes tenant and scope labels, where scope is one of resource, scope, span, event, or link. Use the scope label to identify which part of your trace data produces the most oversized attributes.

When truncation occurs, the distributor also emits a rate-limited log line (at most one per second) with an example of the truncated attribute, including its scope, name, whether the key or value was truncated, and the original size in bytes.

# Optional
# Configures the max size an attribute can be. Any key or value that exceeds this limit will be truncated before storing
# Setting this parameter to '0' would disable this check against attribute size
[max_attribute_bytes: <int> | default = '2048']

Refer to the configuration for distributors documentation for more information.
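
To make the truncation rule concrete, the following Python sketch cuts a key or value to a byte limit. It assumes the cut is made on a UTF-8 character boundary so the result stays valid text; whether Tempo handles the boundary this way is not stated here, and the function name is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text to at most max_bytes of UTF-8; 0 disables the check."""
    if max_bytes < 0:
        raise ValueError("max_bytes must not be negative")
    if max_bytes == 0:
        return text
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


large_value = "SELECT * FROM orders WHERE id IN (" + "1," * 5000 + "1)"
stored = truncate_utf8(large_value, 2048)  # 2048 is the default limit
print(len(large_value.encode("utf-8")), len(stored.encode("utf-8")))
```

With the default limit, a 10KB query statement is stored as its first 2048 bytes, which bounds the memory a single attribute can consume on the read path.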

Max trace size

Long-running traces (minutes or hours) and large traces (100K to 1M spans) can spike the memory usage of each component that encounters them. Tempo treats traces as single units, and keeps all data for a trace together to enable features like structural queries and analysis.

Reading a large trace can spike the memory usage of the read components:

query-frontend
querier
live-store
metrics-generator

Writing a large trace can spike the memory usage of the write components:

live-store
block-builder
metrics-generator

Start with a smaller trace size limit of 15MB, and increase it as needed. With an average span size of 300 bytes, this allows for 50K spans per trace.

Verify that you’ve configured a limit in max_bytes_per_trace. The largest recommended limit is 60MB.

Configure the limit in the per-tenant overrides:

overrides:
    'tenant123':
        max_bytes_per_trace: 1.5e+07

Refer to the Standard overrides documentation for more information.
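
The span counts behind these limits follow from simple division. The Python sketch below reproduces the 50K figure for the 15MB starting point and applies the same average span size to the 60MB recommended maximum. It assumes decimal megabytes, consistent with the 1.5e+07 value in the override; the function name is hypothetical.

```python
def spans_per_trace(max_bytes_per_trace: int, avg_span_bytes: int) -> int:
    """Approximate number of spans that fit under the trace size limit."""
    if avg_span_bytes <= 0:
        raise ValueError("avg_span_bytes must be positive")
    return max_bytes_per_trace // avg_span_bytes


AVG_SPAN_BYTES = 300  # average span size used in the text above

print(spans_per_trace(15_000_000, AVG_SPAN_BYTES))  # 50000
print(spans_per_trace(60_000_000, AVG_SPAN_BYTES))  # 200000
```

If your spans carry large attributes, the average span size rises and the same byte limit allows proportionally fewer spans per trace.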

If you have long-running batch job traces, consider using span links to break them apart.

Large attributes

Very large attributes, 10KB or larger, can spike the memory usage of each component that encounters them. Tempo’s Parquet format uses dictionary-encoded columns, which work well for repeated values. However, very large, high-cardinality attributes can require a large amount of memory.

A common source of large attributes is auto-instrumentation in these areas:

HTTP
- Request or response bodies
- Large headers
  - http.request.header.<key>
- Large URLs
  - http.url
  - url.full
Databases
- Full query statements
  - db.statement
  - db.query.text
Queues
- Message bodies

When reading these attributes, they can spike the memory usage of the read components:

query-frontend
querier
live-store
metrics-generator

When writing these attributes, they can spike the memory usage of the write components:

live-store
block-builder
metrics-generator

You can automatically limit attribute sizes using max_attribute_bytes. You can also use these options:

Manually update application instrumentation to remove or limit these attributes
Drop the attributes in the tracing pipeline using an attribute processor
