Distributor refusing spans
The two most likely causes of refused spans are unhealthy ingesters or trace limits being exceeded.
To log spans that are discarded, add the --distributor.log_discarded_spans.enabled flag to the distributor or adjust the distributor config:
distributor:
  log_discarded_spans:
    enabled: true
    include_all_attributes: false # set to true for more verbose logs
Adding the flag logs all discarded spans, as shown below:
level=info ts=2024-08-19T16:06:25.880684385Z caller=distributor.go:767 msg=discarded spanid=c2ebe710d2e2ce7a traceid=bd63605778e3dbe935b05e6afd291006
level=info ts=2024-08-19T16:06:25.881169385Z caller=distributor.go:767 msg=discarded spanid=5352b0cb176679c8 traceid=ba41cae5089c9284e18bca08fbf10ca2
Unhealthy ingesters
Unhealthy ingesters can be caused by OOMs or scale-down events. If you have unhealthy ingesters, your log line will look something like this:
msg="pusher failed to consume trace data" err="at least 2 live replicas required, could only find 1"
In this case, you may need to visit the ingester ring page at /ingester/ring on the distributors and “Forget” the unhealthy ingesters. This works in the short term, but the long-term fix is to stabilize your ingesters.
Trace limits reached
In high-volume tracing environments, the default trace limits are sometimes not sufficient. These limits exist to protect Tempo from OOMing or crashing and to prevent tenants from DoSing each other. If you are refusing spans due to limits, you will see logs like this at the distributor:
msg="pusher failed to consume trace data" err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (52428800) exceeded while adding 15632 bytes to trace a0fbd6f9ac5e2077d90a19551dd67b6f for tenant single-tenant"
msg="pusher failed to consume trace data" err="rpc error: code = FailedPrecondition desc = LIVE_TRACES_EXCEEDED: max live traces per tenant exceeded: per-user traces limit (local: 60000 global: 0 actual local: 60000) exceeded"
msg="pusher failed to consume trace data" err="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (15000000 bytes) exceeded while adding 10 bytes"
You will also see the tempo_discarded_spans_total metric incremented. The reason label on this metric contains information about why the spans were refused.
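One way to keep an eye on this metric is a Prometheus alerting rule that breaks the discard rate down by the reason label. The following is a minimal sketch, not a recommended production rule: the group name, alert name, time windows, and severity are all illustrative assumptions, and only the metric name and the reason label come from the text above.

```yaml
groups:
  - name: tempo-discarded-spans        # hypothetical group name
    rules:
      - alert: TempoDiscardedSpans     # hypothetical alert name
        # Per-reason rate of discarded spans over the last 5 minutes.
        expr: sum by (reason) (rate(tempo_discarded_spans_total[5m])) > 0
        for: 10m                       # assumed window; tune to your tolerance
        labels:
          severity: warning            # assumed severity
        annotations:
          summary: 'Tempo is discarding spans (reason: {{ $labels.reason }})'
```

Grouping by reason makes it easier to tell at a glance whether the discards come from a size limit, a live-trace limit, or a rate limit.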
If spans are refused because a limit was reached, use the available configuration options to increase the relevant limit.
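As a rough illustration, the three errors above correspond to the maximum trace size, the live-trace count, and the ingestion rate limit. The sketch below assumes the newer overrides layout (overrides.defaults); key names and placement can differ between Tempo versions, so check the configuration reference for your release. All values are illustrative, not recommendations.

```yaml
overrides:
  defaults:
    ingestion:
      # RATE_LIMITED: ingestion rate and burst size, in bytes
      rate_limit_bytes: 30000000      # illustrative value
      burst_size_bytes: 40000000      # illustrative value
      # LIVE_TRACES_EXCEEDED: maximum live traces per tenant
      max_traces_per_user: 120000     # illustrative value
    global:
      # TRACE_TOO_LARGE: maximum size of a single trace, in bytes
      max_bytes_per_trace: 104857600  # illustrative value (100 MiB)
```

Raising limits increases the memory and storage pressure on Tempo, so it is usually safer to raise only the limit named in the error and to watch ingester memory afterwards.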
Client resets connection
When the client resets the connection before the distributor can consume the trace data, you see logs like this:
msg="pusher failed to consume trace data" err="context canceled"
This issue needs to be fixed on the client side. To inspect which clients are causing the issue, logging discarded spans with include_all_attributes: true may help.
Note that there may be other reasons for a closed context as well. Identifying the reason for a closed context is not straightforward and may require additional debugging.