Configure Tempo
You can use Grafana Cloud to avoid installing, maintaining, and scaling your own instance of Grafana Tempo. Create a free account to get started, which includes free forever access to 10k metrics, 50GB logs, 50GB traces, 500VUh k6 testing & more.
This document explains the configuration options for Tempo as well as the details of what they impact.
Tip
Instructions for configuring Tempo data sources are available in the Grafana Cloud and Grafana documentation.
The Tempo configuration options include:
- Server
- Distributor
- Ingester
- Metrics-generator
- Query-frontend
- Querier
- Compactor
- Storage
- Memberlist
- Configuration blocks
- Overrides
- Usage-report
- Cache
Additionally, you can review the TLS documentation to configure cluster components to communicate over TLS, or to receive traces over TLS.
Use environment variables in the configuration
You can use environment variable references in the configuration file to set values that need to be configurable during deployment. To do this, pass -config.expand-env=true and use ${VAR}, where VAR is the name of the environment variable.
Each variable reference is replaced at startup by the value of the environment variable. The replacement is case-sensitive and occurs before the YAML file is parsed. References to undefined variables are replaced by empty strings unless you specify a default value or custom error text.
To specify a default value, use ${VAR:-default_value}, where default_value is the value to use if the environment variable is undefined.
You can find more about other supported syntax here.
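For example, a minimal sketch that reads S3 credentials from the environment and falls back to a default endpoint (the backend, bucket name, endpoint, and variable names are placeholder assumptions):
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      # Falls back to the default value if S3_ENDPOINT is not set.
      endpoint: ${S3_ENDPOINT:-s3.us-east-1.amazonaws.com}
      access_key: ${S3_ACCESS_KEY}
      secret_key: ${S3_SECRET_KEY}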
Server
Tempo uses the server from dskit/server. For more information on configuration options, refer to this file.
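As a starting point, a minimal sketch of the server block with commonly used settings (the port values shown are typical Tempo defaults, but treat them as assumptions and confirm against your deployment):
server:
  # HTTP and gRPC listen ports for the Tempo API and inter-component traffic.
  http_listen_port: 3200
  grpc_listen_port: 9095
  log_level: info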
Distributor
For more information on configuration options, refer to this file.
Distributors receive spans and forward them to the appropriate ingesters.
You can enable all available receivers with their default configuration; a partial sketch is shown below. For a production deployment, enable only the receivers you need. Additional documentation and more advanced configuration options are available in the receiver README.
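A minimal sketch enabling the OTLP, Jaeger, and Zipkin receivers with their defaults (this is not the complete list of receivers; refer to the receiver README for all options):
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
        http:
    jaeger:
      protocols:
        thrift_http:
        grpc:
    zipkin: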
Set max attribute size to help control out of memory errors
Tempo queriers can run out of memory when fetching traces that have spans with very large attributes.
This issue has been observed when trying to fetch a single trace using the tracebyID endpoint.
While a trace might not have a lot of spans (roughly 500), it can have a large size (approximately 250KB) when some of its spans carry attributes with very large values.
To avoid these out-of-memory crashes, use max_span_attr_byte to limit the maximum allowable size of any individual attribute.
Any keys or values that exceed the configured limit are truncated before storing.
The default value is 2048.
Use the tempo_distributor_attributes_truncated_total metric to track how many attributes are truncated.
For additional information, refer to Troubleshoot out-of-memory errors.
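A minimal sketch, assuming the limit is set on the distributor block (the value shown is the documented default):
distributor:
  # Maximum allowable size, in bytes, of a single span attribute key or value.
  # Anything larger is truncated before the span is stored.
  max_span_attr_byte: 2048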
gRPC compression
Starting with Tempo 2.7.1, gRPC compression between all components defaults to snappy.
Using snappy provides a balanced approach to compression between components that works for most installations.
If you prefer a different balance of CPU/memory and bandwidth, consider disabling compression or using zstd.
For a discussion on alternatives, refer to the discussion thread (#4696).
Disabling compression may provide some performance boost. Benchmark testing suggested that without compression, queriers and distributors use less CPU and memory.
However, you may notice an increase in ingester data and network traffic, especially for larger clusters. This increased data can impact billing for Grafana Cloud.
You can configure gRPC compression in the querier, ingester, and metrics_generator clients of the distributor.
To disable compression, remove snappy from the grpc_compression lines.
To re-enable compression, use snappy with the following settings:
ingester_client:
  grpc_client_config:
    grpc_compression: "snappy"
metrics_generator_client:
  grpc_client_config:
    grpc_compression: "snappy"
querier:
  frontend_worker:
    grpc_client_config:
      grpc_compression: "snappy"
Ingester
For more information on configuration options, refer to this file.
The ingester is responsible for batching up traces and pushing them to TempoDB.
A live, or active, trace is a trace that has received a new batch of spans within a configured amount of time (default 10 seconds, set by ingester.trace_idle_period).
After a trace has gone 10 seconds (or the configured amount of time) without receiving new spans, it's flushed to disk and appended to the WAL.
When Tempo receives a new batch for that trace, a new live trace is created in memory.
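A minimal sketch of the related ingester settings (trace_idle_period is described above; max_block_duration is an additional, commonly tuned option, and the values are illustrative):
ingester:
  # Flush a trace to the WAL once it has gone this long without new spans.
  trace_idle_period: 10s
  # Cut a block after this much time, regardless of its size.
  max_block_duration: 30m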
Metrics-generator
For more information on configuration options, refer to this file.
The metrics-generator processes spans and writes metrics using the Prometheus remote write protocol. For more information on the metrics-generator, refer to the Metrics-generator documentation.
Metrics-generator processors are disabled by default. To enable them for a specific tenant, set metrics_generator.processors in the overrides section.
Note
If you want to enable metrics-generator for your Grafana Cloud account, refer to the Metrics-generator in Grafana Cloud documentation.
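For example, a minimal sketch that enables the service-graphs and span-metrics processors for all tenants through the overrides defaults block (a sketch, assuming the defaults-style overrides shown later in this document):
overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics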
You can use metrics_ingestion_time_range_slack to limit metrics generation to spans whose end times fall within the configured duration.
In Grafana Cloud, this value defaults to 30 seconds, so any spans sent to the metrics-generator more than 30 seconds in the past are discarded or rejected.
For more information about the local-blocks configuration option, refer to TraceQL metrics.
Query-frontend
For more information on configuration options, refer to this file.
The Query Frontend is responsible for sharding incoming requests for faster processing in parallel (by the queriers).
Limit query size to improve performance and stability
Querying large tracing data presents several challenges. Span sets with a large number of spans impact query performance and stability. Similarly, excessive query result sizes can also negatively impact query performance.
Limit the spans per spanset
You can set the maximum spans per spanset by setting max_spans_per_span_set for the query-frontend. The default value is 100.
In Grafana or Grafana Cloud, you can use the Span Limit field in the TraceQL query editor in Grafana Explore.
This field sets the maximum number of spans to return for each span set.
The maximum value that you can set for the Span Limit value (or the spss query parameter) is controlled by max_spans_per_span_set.
To disable the maximum spans per span set limit, set max_spans_per_span_set to 0.
When set to 0, there is no maximum and users can put any value in Span Limit.
However, this can only be set by a Tempo administrator, not by the user.
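A minimal sketch, assuming the option sits directly under the query_frontend block as shown below; confirm the exact placement against the configuration reference:
query_frontend:
  # Maximum number of spans returned per span set; 0 removes the limit.
  max_spans_per_span_set: 100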
Cap the maximum query length
You can set the maximum length of a query using the query_frontend.max_query_expression_size_bytes configuration parameter for the query-frontend. The default value is 128 KB.
This limit is used to protect the system's stability from potential abuse or mistakes when running a large, potentially expensive query.
You can set the value lower or higher in the query_frontend configuration section, for example:
query_frontend:
  max_query_expression_size_bytes: 10000
Querier
For more information on configuration options, refer to this file.
The Querier is responsible for querying the backends/cache for the traceID.
It also queries compacted blocks that fall within the (2 * BlocklistPoll) range where the value of Blocklist poll duration is defined in the storage section below.
Compactor
For more information on configuration options, refer to this file.
Compactors stream blocks from the storage backend, combine them, and write them back. A sketch of common compaction settings follows.
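A minimal sketch of the compaction block (the retention and window values are illustrative assumptions, not authoritative defaults):
compactor:
  compaction:
    # How long to keep blocks before they are deleted.
    block_retention: 336h
    # Blocks within this time window are compacted together.
    compaction_window: 1h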
Storage
Tempo supports Amazon S3, GCS, Azure, and local file system for storage. In addition, you can use Memcached or Redis for increased query performance.
For more information on configuration options, refer to this file.
Local storage recommendations
While you can use local storage, object storage is recommended for production workloads. A local backend won’t correctly retrieve traces with a distributed deployment unless all components have access to the same disk. Tempo is designed for object storage more than local storage.
At Grafana Labs, we’ve run Tempo with SSDs when using local storage. Hard drives haven’t been tested.
You can estimate how much storage space you need by considering the ingested bytes and retention. For example, ingested bytes per day times retention days = stored bytes.
You cannot use both local and object storage in the same Tempo deployment.
Storage block configuration example
The storage block configures TempoDB. The following example shows common options. For further platform-specific information, refer to the Amazon S3, GCS, and Azure documentation.
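A minimal sketch, assuming an S3 backend (the bucket name and endpoint are placeholders):
storage:
  trace:
    # Backend to use: s3, gcs, azure, or local.
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.us-east-1.amazonaws.com
    wal:
      # Where WAL files are written before blocks are flushed to the backend.
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks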
Memberlist
Memberlist is the default mechanism for all of the Tempo pieces to coordinate with each other.
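A minimal sketch of the memberlist block (the join address is a placeholder for your gossip ring service):
memberlist:
  # Addresses of other Tempo components to join the gossip ring.
  join_members:
    - tempo-gossip-ring
  bind_port: 7946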
Configuration blocks
Defines re-used configuration blocks.
Block config
Filter policy config
Span filter config block
Filter policy
# Include filters (positive matching)
[include: <policy match>]
# Exclude filters (negative matching)
[exclude: <policy match>]
Policy match
# How to match the value of attributes
# Options: "strict", "regex"
[match_type: <string>]
# List of attributes to match
attributes: <list of policy attributes>
  # Attribute key
  - [key: <string>]
    # Attribute value
    [value: <any>]
Examples
exclude:
  match_type: "regex"
  attributes:
    - key: "resource.service.name"
      value: "unknown_service:myservice"
include:
  match_type: "strict"
  attributes:
    - key: "foo.bar"
      value: "baz"
KVStore config
The kvstore configuration block
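A minimal sketch, assuming the memberlist store that the ring components commonly use (this block is referenced from ring configurations):
kvstore:
  # Backend store for the ring; memberlist is the usual choice.
  store: memberlist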
Search config
# Target number of bytes per GET request while scanning blocks. Default is 1MB. Reducing
# this value could positively impact trace search performance at the cost of more requests
# to object storage.
[chunk_size_bytes: <uint32> | default = 1000000]
# Number of traces to prefetch while scanning blocks. Default is 1000. Increasing this value
# can improve trace search performance at the cost of memory.
[prefetch_trace_count: <int> | default = 1000]
# Number of read buffers used when performing search on a vparquet block. This value times the read_buffer_size_bytes
# is the total amount of bytes used for buffering when performing search on a parquet block.
[read_buffer_count: <int> | default = 32]
# Size of read buffers used when performing search on a vparquet block. This value times the read_buffer_count
# is the total amount of bytes used for buffering when performing search on a parquet block.
[read_buffer_size_bytes: <int> | default = 1048576]
# Granular cache control settings for parquet metadata objects
# Deprecated. See [Cache](#cache) section.
cache_control:
  # Specifies if footer should be cached
  [footer: <bool> | default = false]
  # Specifies if column index should be cached
  [column_index: <bool> | default = false]
  # Specifies if offset index should be cached
  [offset_index: <bool> | default = false]
WAL config
The storage WAL configuration block.
# Where to store the wal files while they are being appended to.
# Must be set.
# Example: "/var/tempo/wal"
[path: <string> | default = ""]
# WAL encoding/compression.
# options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
[v2_encoding: <string> | default = "zstd" ]
# Defines the search data encoding/compression protocol.
# Options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
[search_encoding: <string> | default = "snappy"]
# When a span is written to the WAL it adjusts the start and end times of the block it is written to.
# This block start and end time range is then used when choosing blocks for search.
# This is also used for querying traces by ID when the start and end parameters are specified. To prevent spans too far
# in the past or future from impacting the block start and end times we use this configuration option.
# This option only allows spans that occur within the configured duration to adjust the block start and
# end times.
# This can result in trace not being found if the trace falls outside the slack configuration value as the
# start and end times of the block will not be updated in this case.
[ingestion_time_range_slack: <duration> | default = unset]
# WAL file format version
# Options: v2, vParquet, vParquet2, vParquet3
[version: <string> | default = "vParquet3"]
Overrides
Tempo provides an overrides module for users to set global or per-tenant override settings.
Ingestion limits
The default limits in Tempo may not be sufficient in high-volume tracing environments.
Errors including RATE_LIMITED, TRACE_TOO_LARGE, and LIVE_TRACES_EXCEEDED occur when these limits are exceeded.
See below for how to override these limits globally or per tenant.
Standard overrides
You can create an overrides section to configure ingestion limits that apply to all tenants of the cluster.
A snippet of a config.yaml file showing how to set the overrides section follows.
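A minimal sketch using the defaults block and the ingestion limit fields referenced in this section (the numeric values are illustrative placeholders):
overrides:
  defaults:
    ingestion:
      burst_size_bytes: 20000000
      rate_limit_bytes: 15000000
      max_traces_per_user: 10000
    global:
      max_bytes_per_trace: 5000000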
Tenant-specific overrides
There are two types of tenant-specific overrides:
- runtime overrides
- user-configurable overrides
Runtime overrides
You can set tenant-specific override settings in a separate file and point per_tenant_override_config to it.
This overrides file is dynamically loaded.
It can be changed at runtime and reloaded by Tempo without restarting the application.
These override settings can be set per tenant.
# /conf/tempo.yaml
# Overrides configuration block
overrides:
  per_tenant_override_config: /conf/overrides.yaml
---
# /conf/overrides.yaml
# Tenant-specific overrides configuration
overrides:
  "<tenant-id>":
    ingestion:
      [burst_size_bytes: <int>]
      [rate_limit_bytes: <int>]
      [max_traces_per_user: <int>]
    global:
      [max_bytes_per_trace: <int>]
  # A "wildcard" override can be used that will apply to all tenants if a match is not found otherwise.
  "*":
    ingestion:
      [burst_size_bytes: <int>]
      [rate_limit_bytes: <int>]
      [max_traces_per_user: <int>]
    global:
      [max_bytes_per_trace: <int>]
User-configurable overrides
These tenant-specific overrides are stored in an object store and can be modified using API requests. User-configurable overrides have priority over runtime overrides. Refer to user-configurable overrides for more details.
Override strategies
The trace limits specified by the various parameters are, by default, applied as per-distributor limits.
For example, a max_traces_per_user setting of 10000 means that each distributor within the cluster has a limit of 10000 traces per user.
This is known as a local strategy in that the specified trace limits are local to each distributor.
A local strategy ensures that each distributor can independently process traces up to the limit without affecting the limits on other distributors.
However, as a cluster grows, the aggregate number of traces accepted across all distributors grows with it.
An alternative strategy is to set a global trace limit that establishes a total budget for all traces across all distributors in the cluster.
The global limit is averaged across all distributors by using the distributor ring.
# /conf/tempo.yaml
overrides:
  defaults:
    ingestion:
      [rate_strategy: <global|local> | default = local]
For example, this configuration specifies that each instance of the distributor will apply a limit of 15MB/s.
overrides:
  defaults:
    ingestion:
      rate_strategy: local
      rate_limit_bytes: 15000000
This configuration specifies that, together, all distributor instances apply a limit of 15MB/s.
So if there are 5 instances, each instance applies a local limit of (15MB/s / 5) = 3MB/s.
overrides:
  defaults:
    ingestion:
      rate_strategy: global
      rate_limit_bytes: 15000000
Usage-report
By default, Tempo reports anonymous usage data about the shape of a deployment to Grafana Labs. This data is used to determine how commonly certain features are deployed, whether a feature flag has been enabled, and which replication factor or compression levels are used.
By providing information on how people use Tempo, usage reporting helps the Tempo team decide where to focus their development and documentation efforts. No private information is collected, and all reports are completely anonymous.
Reporting is controlled by a configuration option.
The following configuration values are used:
- Receivers enabled
- Frontend concurrency and version
- Storage cache, backend, WAL and block encodings
- Ring replication factor and kvstore
- Feature toggles enabled
No performance data is collected.
You can disable the automatic reporting of this generic information using the following configuration:
usage_report:
  reporting_enabled: false
If you are using a Helm chart, you can enable or disable usage reporting by changing the reportingEnabled value. This value is available in the tempo-distributed and the tempo Helm charts.
# -- If true, Tempo will report anonymous usage data about the shape of a deployment to Grafana Labs
reportingEnabled: true
Cache
Use this block to configure caches available throughout the application. Multiple caches can be created and assigned roles which determine how they are used by Tempo.
Example configuration:
cache:
  background:
    writeback_goroutines: 5
  caches:
    - roles:
        - parquet-footer
      memcached:
        host: memcached-instance
    - roles:
        - bloom
      redis:
        endpoint: redis-instance