
Manage trace ingestion

If you are seeing RATE_LIMITED, LIVE_TRACES_EXCEEDED, or TRACE_TOO_LARGE errors, or if your trace storage costs are rising unexpectedly, this page can help.

Grafana Tempo enforces ingestion limits at multiple points in the write path. The distributor checks rate limits before writing spans to Kafka. Downstream, live-stores enforce per-trace size and live trace count limits asynchronously, and block-builders enforce per-trace size limits. If limits are too low for your workload, spans are discarded and data is lost. If limits are unchecked, ingestion volume can grow beyond what you intended.

This page covers three tasks:

  • Size ingestion limits for your workload
  • Find and fix discarded spans
  • Identify what is driving ingestion volume

For an overview of how trace data flows through the write path, refer to Tempo architecture.

Size ingestion limits for your workload

Tempo enforces three ingestion limits. Understanding what the defaults mean for your traffic helps you set appropriate values at deploy time, rather than discovering them through production incidents.

Rate limit

rate_limit_bytes (default: 15,000,000) sets the sustained byte rate each distributor allows per tenant, measured in bytes per second. For a typical span size of around 500 bytes, the default accommodates roughly 30,000 spans per second per distributor.

How this scales depends on your rate_strategy:

  • local (default): each distributor enforces the limit independently. With three distributors, the effective cluster limit is approximately 90,000 spans per second.
  • global: the configured rate is shared across all distributors. The total cluster rate equals the configured value regardless of how many distributors you run.

burst_size_bytes (default: 20,000,000) allows temporary spikes above the sustained rate, for example during application deployments. The burst allowance is always applied locally, regardless of rate strategy.

Live trace limit

max_traces_per_user (default: 10,000) caps the number of concurrently active traces per tenant on each live-store. This limit is enforced asynchronously in the live-store, not at ingestion time in the distributor. Block-builders do not enforce this limit. If your services produce many short-lived traces in parallel, you may need to raise this.

max_global_traces_per_user (default: 0, disabled) sets a cluster-wide cap instead of a per-instance cap. This setting only takes effect when using the classic ingester write path, not the Kafka-based live-store path.

Per-trace size limit

max_bytes_per_trace (default: 5,000,000) caps the total size of a single trace. This limit is enforced asynchronously in live-stores and block-builders. Traces that exceed this limit are partially dropped. Unusually large traces often indicate a retry loop or misconfigured instrumentation rather than normal application behavior.

Example configuration

To estimate the rate limit you need, multiply your average span size by your peak spans-per-second across all services for a given tenant.
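As a sketch, the arithmetic looks like this; the 40,000 spans-per-second peak and the 20% headroom factor are illustrative assumptions, not measurements:

```python
def required_rate_limit(avg_span_bytes: int, peak_spans_per_sec: int,
                        headroom: float = 1.2) -> int:
    """Bytes-per-second limit with a safety margin above the observed peak."""
    return int(avg_span_bytes * peak_spans_per_sec * headroom)

# For example, 500-byte spans at a 40,000 spans/s peak:
limit = required_rate_limit(500, 40_000)
print(limit)  # 24000000 -> round up, e.g. rate_limit_bytes: 25000000
```

Remember that with rate_strategy: local this is a per-distributor figure, so divide the cluster-wide target by the number of distributors.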

The following example raises the defaults for a high-throughput workload:

```yaml
overrides:
  defaults:
    ingestion:
      rate_strategy: local
      rate_limit_bytes: 30000000
      burst_size_bytes: 40000000
      max_traces_per_user: 50000
    global:
      max_bytes_per_trace: 10000000
```

If you run a multi-tenant deployment, you can set different limits per tenant using runtime overrides instead of raising the global defaults. Refer to Enable multi-tenancy for per-tenant override examples.
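As a sketch, per-tenant limits live in a runtime overrides file referenced by per_tenant_override_config in the main configuration; the tenant IDs and values below are illustrative only:

```yaml
# Runtime overrides file (referenced via per_tenant_override_config).
# Tenant IDs and limits are examples, not recommendations.
overrides:
  "team-checkout":
    ingestion:
      rate_limit_bytes: 30000000
      max_traces_per_user: 50000
  "team-search":
    ingestion:
      rate_limit_bytes: 15000000
```

Tempo reloads this file periodically, so per-tenant changes don't require a restart.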

For the full list of available settings, refer to Ingestion limits in the configuration reference. You can also manage per-tenant limits through the API using user-configurable overrides.

Find and fix discarded spans

When a span exceeds an ingestion limit, Tempo discards it and increments the tempo_discarded_spans_total metric. The distributor discards rate-limited spans before they reach Kafka. Live-stores discard spans that exceed per-trace size or live trace count limits after consuming them from Kafka. Block-builders discard spans that exceed per-trace size limits.

Error reference

The following table lists the three error types, what each one means, and how to fix it.

| Error | Cause | Fix |
| --- | --- | --- |
| RATE_LIMITED | The tenant's byte rate exceeded rate_limit_bytes. | Raise rate_limit_bytes, or add distributors if using rate_strategy: local. If volume is genuinely higher than intended, reduce it upstream with sampling. |
| LIVE_TRACES_EXCEEDED | The number of concurrent active traces on a live-store exceeded max_traces_per_user. | Raise max_traces_per_user. If using the classic ingester path, you can also set max_global_traces_per_user to distribute the limit across the cluster. |
| TRACE_TOO_LARGE | A single trace exceeded max_bytes_per_trace (default 5 MB). | Raise max_bytes_per_trace in the global overrides. Also investigate why the trace is so large. Common causes include retry loops and misconfigured instrumentation. |

Check which limit is being hit

Query the tempo_discarded_spans_total metric. The reason label indicates which limit caused the refusal:

```promql
sum by (reason) (rate(tempo_discarded_spans_total[5m]))
```
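If you already scrape this metric with Prometheus, you can turn the same query into an alert. The rule below is a hedged sketch: the rule name, threshold, and durations are assumptions to adapt to your own tolerance for discarded data.

```yaml
groups:
  - name: tempo-ingestion
    rules:
      - alert: TempoDiscardingSpans
        # Fires when any reason shows a sustained discard rate.
        expr: sum by (reason) (rate(tempo_discarded_spans_total[5m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Tempo is discarding spans (reason: {{ $labels.reason }})"
```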

Log discarded spans for debugging

To log spans discarded by the distributor (rate-limited spans) with their trace IDs, enable log_discarded_spans in the distributor configuration:

```yaml
distributor:
  log_discarded_spans:
    enabled: true
```

Set include_all_attributes: true for more verbose output that includes span attributes.
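For example, assuming the same distributor block as above:

```yaml
distributor:
  log_discarded_spans:
    enabled: true
    include_all_attributes: true
```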

Spans discarded by live-stores for LIVE_TRACES_EXCEEDED or TRACE_TOO_LARGE are logged at debug level by the live-store. To see these entries, set the live-store log level to debug.

Refer to Distributor refusing spans for additional troubleshooting steps.

Traces missing from queries without errors

If the distributor is not refusing spans but traces are missing from query results, the issue may be downstream. Live-stores consume trace data from Kafka and serve recent queries. If a live-store falls behind its Kafka partition, query results may be incomplete.

The fail_on_high_lag setting (default false) controls this behavior:

  • When false, the live-store returns whatever data it has, which may be incomplete.
  • When true, the live-store returns an error when it cannot guarantee completeness.

Refer to Unable to find traces for query-side troubleshooting.

Identify what is driving ingestion volume

When your cluster is healthy but ingestion is growing, the first step is finding which services are responsible for the most volume.

Set up cost attribution

The usage tracker breaks down ingested bytes by configurable attributes, giving you a per-service view of who is consuming capacity.

Enable cost attribution in the distributor and configure which attributes to track in overrides:

```yaml
distributor:
  usage:
    cost_attribution:
      enabled: true

overrides:
  defaults:
    cost_attribution:
      dimensions:
        resource.service.name: "service"
```

Find the top contributors

After enabling cost attribution, the distributor exposes the tempo_usage_tracker_bytes_received_total metric on the /usage_metrics endpoint, labeled by the dimensions you configured.

You can query this endpoint directly:

```bash
curl http://<distributor-host>:3200/usage_metrics
```

If you scrape this endpoint with Prometheus, you can use the following query to find which services are sending the most data:

```promql
topk(10,
  sum by (service) (
    rate(tempo_usage_tracker_bytes_received_total[1h])
  )
)
```

For the full set of configuration options, including scoping dimensions by resource or span and customizing label names, refer to Usage tracker.

Reduce volume from noisy services

After you know which services are driving volume, the most effective way to reduce it is sampling at the collector layer. Tail sampling lets you keep traces with errors or high latency while dropping routine ones, reducing volume without losing visibility into the problems that matter.
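As a sketch, a tail-sampling policy in the OpenTelemetry Collector can keep error and slow traces while sampling the rest probabilistically. The policy names, latency threshold, and sampling percentage below are illustrative; tune decision_wait and the percentages to your workload:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # wait for late spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      processors: [tail_sampling]
```

Because decisions are made per trace after decision_wait elapses, all spans of a trace must reach the same collector instance, which usually means routing by trace ID when you run multiple collectors.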