Components on Grafana Labs

Distributor

Thu, 28 May 2026 17:50:33 +0100

The distributor is the entry point for all trace data into Tempo. It receives spans from instrumented applications and validates them against configured limits.

How the distributor forwards data depends on the deployment mode:

Microservices mode: The distributor shards traces by trace ID and writes them to Kafka. Downstream components including block-builders, live-stores, and metrics-generators each consume from Kafka independently.
Monolithic mode: The distributor pushes data in-process directly to the live-store and metrics-generator. No Kafka is required.

Receiving traces

The distributor uses the receiver layer from the OpenTelemetry Collector and accepts spans in multiple formats:

OpenTelemetry Protocol (OTLP) over gRPC and HTTP, which is the recommended format
Jaeger (Thrift and gRPC)
Zipkin
Kafka

We recommend using OTLP over gRPC when possible. Both Grafana Alloy and the OpenTelemetry Collector support OTLP export natively.

Validation and rate limiting

Before forwarding data, the distributor validates incoming data against configured ingestion limits. These are the only limits enforced synchronously at ingestion time.

The ingestion rate limit sets the maximum bytes per second per tenant. Exceeding this returns a RATE_LIMITED error to the client. The ingestion burst size controls the maximum burst allowed above the sustained rate. For details on which settings honor the global strategy and which are always local, refer to Ingestion rate strategy.
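The interaction between the sustained rate and the burst size can be pictured as a token bucket. The sketch below is only an illustration under that assumption; the class name, fields, and refill logic are hypothetical and are not taken from Tempo's implementation.

```python
from dataclasses import dataclass


@dataclass
class TenantRateLimiter:
    """Token bucket in bytes: refills at rate_bytes per second, capped at burst_bytes."""

    rate_bytes: float
    burst_bytes: float
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is admitted.
        self.tokens = self.burst_bytes

    def allow(self, size_bytes: int, now: float) -> bool:
        """Return True if a push of size_bytes at time `now` (seconds) is accepted."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes)
        self.last_refill = now
        if size_bytes > self.tokens:
            return False  # a distributor would answer RATE_LIMITED here
        self.tokens -= size_bytes
        return True
```

In this model a tenant can briefly send up to the burst size, after which pushes are accepted only as fast as the sustained rate refills the bucket.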

Other limits such as max_live_traces_bytes are enforced asynchronously downstream by live-stores, while max_bytes_per_trace is enforced downstream as well, including by block-builders in microservices mode.

When the distributor refuses spans due to rate limits, it increments the tempo_discarded_spans_total metric with a reason label indicating why.

Logging discarded spans

To log individual discarded spans for debugging:

distributor:
  log_discarded_spans:
    enabled: true
    include_all_attributes: false

Setting include_all_attributes: true produces more verbose logs that include span attributes, which can help identify misbehaving clients.

Writing to Kafka (microservices mode)

In microservices mode, after validation, the distributor shards traces by hashing the trace ID, looks up the partition ring to determine which Kafka partitions are active, and writes records to the appropriate partitions.

The distributor returns a response to the client only after Kafka acknowledges the write. A success response therefore means the data is durably stored.

Partitioning

The distributor shards traces by trace ID, meaning all spans for the same trace go to the same Kafka partition. This has two benefits:

Block-builders can build blocks where all spans for a trace are co-located within a single consumption cycle.
Live-stores can serve complete traces from a single partition without cross-partition coordination.

The distributor uses the partition ring, not Kafka’s partition routing, to determine target partitions. This allows Tempo to control the partition lifecycle independently of Kafka.
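As a rough sketch of trace-ID sharding, the function below maps a trace ID onto one of the partitions reported as active. The hash function and the modulo mapping are assumptions chosen for illustration; the source does not specify how the partition ring actually maps hashes to partitions.

```python
import hashlib


def pick_partition(trace_id: bytes, active_partitions: list[int]) -> int:
    """Map a trace ID onto one of the partitions the ring reports as active."""
    if not active_partitions:
        raise ValueError("no active partitions in the ring")
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(trace_id).digest()
    index = int.from_bytes(digest[:8], "big") % len(active_partitions)
    # Sort so the result does not depend on the order the ring lists partitions.
    return sorted(active_partitions)[index]
```

Because the result depends only on the trace ID and the set of active partitions, every span of a trace lands on the same partition for as long as that set is unchanged.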

In-process push (monolithic mode)

In monolithic mode, the distributor pushes trace data directly to the live-store and metrics-generator within the same process. No Kafka producer is initialized, and the distributor doesn’t use the partition ring for routing. The write is acknowledged to the client after the live-store accepts the data.

Key metrics

Metric	Description
`tempo_distributor_spans_received_total`	Total spans received by the distributor
`tempo_discarded_spans_total`	Spans discarded, labeled by `reason`
`tempo_distributor_bytes_received_total`	Total bytes received
`rate(tempo_distributor_spans_received_total[5m])`	Current ingestion rate in spans per second, derived in PromQL from the received spans counter

Refer to the distributor configuration for the full list of options.

Kafka

In microservices mode, Tempo uses a Kafka-compatible message queue as the backbone of its write path. Any Kafka-compatible system works.

Kafka isn’t used in monolithic mode. In monolithic mode, the distributor pushes data in-process directly to the live-store and metrics-generator.

Role in the architecture

Kafka serves as a durable write-ahead log (WAL) between distributors and downstream consumers (block-builders, live-stores, and metrics-generators).

With Kafka, durability is centralized. Once Kafka acknowledges a write, the data is safe regardless of what happens to any Tempo component. Consumers are stateless—block-builders and live-stores can crash and restart, replaying from their last committed Kafka offset to rebuild state without data loss. Because Kafka provides durability, Tempo doesn’t need to replicate data across multiple instances on the write path, enabling a replication factor of 1 that significantly reduces storage costs.

Partitioning

Kafka topics are divided into partitions. Distributors hash the trace ID to determine the target partition. Each Kafka partition is consumed by exactly one block-builder instance and one live-store instance (per availability zone).

Tempo maintains its own partition ring that maps Tempo partitions to Kafka partitions. While these are typically 1:1, the partition ring is logically independent from Kafka’s partition metadata. Refer to the partition ring documentation for details.

Scaling partitions

The number of Kafka partitions determines the maximum parallelism for block-builders and live-stores. Each partition is owned by exactly one instance of each consumer type.

To scale block-builders or live-stores horizontally, you need at least as many partitions as instances. Adding Kafka partitions is a Kafka-side operation. Block-builders and live-stores use static partition assignment based on their instance ordinal, so scaling them requires adding both Kafka partitions and StatefulSet replicas together.

Consumer groups

Tempo runs multiple independent consumer groups against the same Kafka topic:

Consumer group	Component	Purpose
`block-builder`	Block-builder	Builds blocks for long-term storage
`live-store`	Live-store	Serves recent data for queries
`metrics-generator`	Metrics-generator	Derives metrics from trace data

Each consumer group tracks its own offsets. Block-builders and live-stores consume the same data independently and at their own pace. A slow block-builder doesn’t affect live-store availability, and vice versa.

Retention and offset management

Kafka’s retention policy determines how far back consumers can replay. Set it high enough to cover the block-builder’s consumption cycle time (plus buffer for failures and restarts) and the live-store’s replay window on startup.

If a consumer falls behind Kafka’s retention window, it loses the ability to replay missed data. Monitor consumer lag to avoid this situation.

Key metrics for monitoring consumer lag

tempo_ingest_group_partition_lag{group="<consumer-group>"}
tempo_ingest_group_partition_lag_seconds{group="<consumer-group>"}

tempo_ingest_group_partition_lag tracks lag in number of records per partition. tempo_ingest_group_partition_lag_seconds tracks lag in wall-clock seconds.

Configuration

Kafka connection settings are configured under the ingest section:

ingest:
  kafka:
    address: kafka:9092
    topic: tempo-traces

Refer to the ingest configuration for the full list of Kafka connection settings.

Block-builder

The block-builder is the write-path component responsible for building Parquet blocks and flushing them to object storage. It consumes trace data from Kafka and organizes it into blocks suitable for long-term retention and efficient querying.

The block-builder only runs in microservices mode. In monolithic mode, the live-store handles flushing trace data to object storage directly.

For a configuration block example, refer to the block-builder section of the Configuration documentation.

Consumption cycle

The block-builder operates on a cyclical consumption model.

On each cycle, the block-builder rewinds to the last committed Kafka offset to ensure any partially processed data from a previous cycle is re-consumed. It reads records from Kafka up to a configured boundary (time-based), organizes the consumed spans by tenant, writes them into Parquet blocks on local disk, uploads the blocks to object storage, and commits the Kafka offset.
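The cycle can be sketched with an in-memory list standing in for a Kafka partition. The record layout, function name, and return shape below are illustrative assumptions; a real block-builder writes Parquet blocks and uploads them before committing.

```python
from collections import defaultdict


def run_cycle(log, committed_offset, cycle_end_ts):
    """Consume one cycle from `log`, a list of (timestamp, tenant, span) records.

    Returns (spans_by_tenant, new_committed_offset).
    """
    offset = committed_offset  # rewind: always restart from the committed offset
    spans_by_tenant = defaultdict(list)
    # Read up to the time-based boundary for this cycle.
    while offset < len(log) and log[offset][0] < cycle_end_ts:
        _, tenant, span = log[offset]
        spans_by_tenant[tenant].append(span)
        offset += 1
    # Block writing and upload would happen here; the returned offset is
    # only meant to be committed once the upload has succeeded.
    return dict(spans_by_tenant), offset
```

If the process dies before the commit, the next call starts again from the old offset and re-reads the same records, which is why the later sections cover deduplication and deterministic block IDs.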

Hard cuts

The block-builder performs a hard cut at the end of each consumption cycle. All spans consumed during that cycle are flushed into blocks, regardless of whether the traces they belong to are complete. If a trace has spans arriving across two consumption cycles, those spans end up in separate blocks.

This is by design. The block-builder has no concept of “live traces” or trace completion. Trace assembly is handled at query time by the querier, which merges spans from multiple blocks.

Block creation

Each consumption cycle produces one or more blocks per tenant per partition. Blocks are written in Apache Parquet format and contain the span data (data.parquet), block metadata (meta.json) including time range, tenant, and a replaces field for atomic block replacement, as well as bloom filters and indexes for efficient querying.

Span deduplication

During block creation, the block-builder deduplicates spans within each trace. Because the block-builder rewinds to the last committed Kafka offset on each cycle, replicated or re-consumed records can produce duplicate spans. The block-builder identifies duplicates using a combination of span ID and span kind, and removes them before writing the block.

Use the tempo_block_builder_spans_deduped_total metric (labeled by tenant) to track how many duplicate spans are removed.
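A minimal sketch of deduplication keyed on span ID and span kind follows. Representing spans as dictionaries and keeping the first occurrence are assumptions made for the example.

```python
def dedupe_spans(spans):
    """Keep the first span for each (span_id, kind) pair, preserving order.

    Returns (unique_spans, number_removed).
    """
    seen = set()
    unique = []
    for span in spans:
        key = (span["span_id"], span["kind"])
        if key in seen:
            continue  # duplicate from a re-consumed record
        seen.add(key)
        unique.append(span)
    return unique, len(spans) - len(unique)
```

The second return value corresponds to the kind of count a counter such as tempo_block_builder_spans_deduped_total would accumulate.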

Deterministic block IDs

The block-builder generates block IDs deterministically based on the partition, tenant, and Kafka offset range. This is critical for crash recovery: if a block-builder crashes mid-flush and restarts, it produces the same block IDs on retry, safely overwriting any partial data from the previous attempt.
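One way to picture a deterministic ID is a name-based UUID derived from the same inputs. This is a sketch only: the use of UUIDv5, the namespace, and the name layout are assumptions, not the derivation Tempo uses.

```python
import uuid


def deterministic_block_id(partition: int, tenant: str, start_offset: int, end_offset: int) -> uuid.UUID:
    """Derive a stable block ID from the partition, tenant, and Kafka offset range."""
    # Hypothetical name layout; any stable encoding of the inputs would do.
    name = f"{partition}/{tenant}/{start_offset}-{end_offset}"
    return uuid.uuid5(uuid.NAMESPACE_OID, name)
```

Calling the function twice with the same arguments yields the same ID, so a retried flush writes to the same object paths as the failed attempt.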

Flush and recovery

The flush process supports safe replay at every stage.

Flush order

The block-builder flushes blocks to object storage in a specific order:

1. Bloom filters and indexes
2. data.parquet
3. nocompact.flg (a flag file that prevents compaction during the flush)
4. meta.json (the block becomes “live” at this point)

A block isn’t visible to the read path until its meta.json is written. Before that point, any crash is fully recoverable—the block-builder rewinds and overwrites.
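The visibility rule can be sketched with a dictionary standing in for object storage. The object names follow the flush order above; the functions and the crash_before parameter are hypothetical helpers for illustration.

```python
FLUSH_ORDER = ["bloom-and-index", "data.parquet", "nocompact.flg", "meta.json"]


def flush(store, block_id, crash_before=None):
    """Write block objects in order; stop early to simulate a crash."""
    for name in FLUSH_ORDER:
        if name == crash_before:
            return False  # simulated crash: later objects are never written
        store[(block_id, name)] = b"..."  # overwrites any partial data
    return True


def is_visible(store, block_id):
    """A block is visible to readers only once its meta.json exists."""
    return (block_id, "meta.json") in store
```

A flush interrupted before meta.json leaves objects in the store but no visible block, and a retry with the same block ID overwrites them.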

Recovering from partial flushes

If the block-builder crashes before writing meta.json, the block is invisible to readers. On restart, it rewinds to the last committed offset, regenerates the same block ID, and overwrites the partial data.

If the crash happens after meta.json is written, the block is already live. On restart, the block-builder detects the existing block and advances to the next ID in sequence, using the replaces field to atomically replace the old block.

The `replaces` field

When a block-builder retries a flush and finds that a previous block already exists (its meta.json was written), the new block includes a replaces field in its meta.json listing the old block ID. This tells the read path to ignore the old block once the new one is visible, preventing duplicate data from appearing in query results.
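On the read side, honoring the replaces field amounts to filtering out any block that a visible block names as replaced. The dictionary keys below are illustrative and are not the actual meta.json schema.

```python
def visible_blocks(metas):
    """Return the block metas a reader should use, dropping replaced blocks."""
    replaced = {old_id for meta in metas for old_id in meta.get("replaces", [])}
    return [meta for meta in metas if meta["block_id"] not in replaced]
```

With both the old and the new block present in the blocklist, only the new one survives the filter, so the same spans are not returned twice.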

The `nocompact.flg` file

The nocompact.flg file is written before meta.json to prevent backend workers from touching the block while it’s still being built. After the block-builder finishes its cycle, it removes this flag. This prevents a race condition where a backend worker might try to compact a block that’s about to be replaced.

Scaling

Each block-builder instance consumes from one or more Kafka partitions. The maximum number of block-builder instances equals the number of Kafka partitions.

Block-builders use static partition assignment. Kafka does not move partitions between consumers in the consumer group for this component. There are two ways to assign partitions:

partitions_per_instance: Each instance computes which partitions it owns based on its ordinal ID. This is the default and works well with StatefulSets where the block-builder mirrors its replica count from the live-store, scaling in lockstep.
assigned_partitions: An explicit mapping of instance IDs to partition lists. This gives full manual control over which instance handles which partitions.
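As an illustration of ordinal-based assignment, the sketch below assumes that instance N owns a contiguous range of partitions starting at N times partitions_per_instance. The exact mapping Tempo uses is not stated here, so treat both the function and the range layout as assumptions.

```python
def owned_partitions(instance_id: str, partitions_per_instance: int) -> list[int]:
    """Compute the partitions owned by a StatefulSet-style instance such as 'block-builder-3'."""
    # The ordinal is the numeric suffix after the last dash.
    ordinal = int(instance_id.rsplit("-", 1)[1])
    start = ordinal * partitions_per_instance
    return list(range(start, start + partitions_per_instance))
```

Because each instance derives its partitions from its own name, no coordination between instances is needed, but the partition count and replica count have to be changed together.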

Size the scratch disk to hold at least one full consumption cycle’s worth of data across all assigned partitions and tenants.

Key metrics

Metric	Description
`tempo_block_builder_flushed_blocks`	Number of blocks flushed to object storage
`tempo_block_builder_spans_deduped_total`	Duplicate spans removed during block creation, by tenant
`tempo_block_builder_fetch_errors_total`	Kafka fetch errors encountered
`tempo_ingest_group_partition_lag{group="block-builder"}`	Consumer lag per partition

Refer to the block-builder section of the Tempo configuration for the full list of block-builder options.

Live-store

The live-store is the read-path component responsible for serving recent trace data. It holds traces in memory, making them available for queries during the window between ingestion and block availability in object storage.

How the live-store receives data depends on the deployment mode:

Microservices mode: The live-store consumes trace data from Kafka independently of block-builders.
Monolithic mode: The live-store receives trace data directly from the distributor in-process. No Kafka consumption is involved.

Why live-stores exist

In microservices mode, there’s a gap between when trace data is written to Kafka and when the block-builder flushes it to object storage. During this window, the only way to query that data is through the live-store.

In monolithic mode, the live-store serves the same role of providing immediate query access to recently ingested data, but it receives data directly from the distributor rather than from Kafka.

In both modes, the live-store holds traces in memory organized by trace ID, responds to queries from queriers for recent data, and periodically flushes traces to a local WAL in Parquet format for TraceQL search and metrics queries.

Trace lifecycle

When the live-store receives spans, it assembles them into traces in memory. Each trace goes through three stages.

First, the trace is active: it’s receiving spans, remains in memory, and is queryable. Then, when no new spans have arrived within the configured max_trace_idle, the trace becomes idle and is flushed to the local WAL. Once flushed, the trace data is written in Parquet format and becomes available for TraceQL search. Eventually, the WAL data is cut into complete blocks.

Trace idle period

The max_trace_idle setting controls how long the live-store waits after the last span arrives before considering a trace idle and flushing it to the WAL.

live_store:
  max_trace_idle: 10s

Increasing this value keeps traces in memory longer, which improves the chances that all spans for a trace are co-located when flushed. This is beneficial for long-running traces. However, it also increases memory usage.
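The idle check can be sketched as a comparison between the time of the last received span and the current time. The function name, the dictionary input, and the inclusive comparison are assumptions for illustration.

```python
def cut_idle_traces(last_span_at, now, max_trace_idle):
    """Split trace IDs into (idle, active) from each trace's last-span timestamp.

    All values are in seconds.
    """
    idle = [tid for tid, ts in last_span_at.items() if now - ts >= max_trace_idle]
    active = [tid for tid in last_span_at if tid not in idle]
    return idle, active
```

Raising max_trace_idle moves traces from the idle list to the active list for longer, which is the memory-for-completeness trade-off described above.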

Partition ownership

Live-stores own the partition lifecycle within Tempo. Each live-store instance consumes from one or more Tempo partitions, and each partition is owned by exactly one live-store per availability zone.

Partition ring

The live-store maintains a partition ring that tracks which Tempo partitions exist, which live-stores own each partition, and the state of each partition (pending, active, or inactive).

This ring is propagated via memberlist gossip. Refer to the partition ring documentation for details on partition states and transitions.

Startup

When a live-store starts, it checks the partition ring for its assigned partition. If the partition exists, the live-store joins as an owner. If it doesn’t exist, the live-store creates it in pending state and waits for enough owners to register before automatically promoting it to active. In microservices mode, the live-store then replays from its last committed Kafka offset to rebuild in-memory state.

Shutdown and scaling down

Scaling down live-stores requires marking the partition as inactive while the live-store is still running. This transitions the partition to read-only mode. After enough time passes for the data to be flushed to object storage, you can safely remove the partition and live-store.

Abruptly removing a live-store without marking its partition inactive makes that partition’s recent data temporarily unavailable until another live-store picks it up (in a zone-aware setup, the other zone’s live-store continues serving).

Zone-aware high availability

For production deployments, live-stores are typically deployed across multiple availability zones. Each Tempo partition is owned by one live-store per zone.

If a live-store in one zone becomes unavailable, the live-store in the other zone continues serving queries for the same partitions. Queriers only need a response from one live-store per partition (read quorum of 1), so queries succeed as long as at least one zone is healthy. This provides high availability without requiring data deduplication on the read path.

Refer to the zone-aware live-stores documentation for configuration details.

Local WAL

When traces are flushed from memory, they’re written to a local WAL in Parquet format. This serves two purposes.

First, it provides search availability—after data is in the WAL, trace data is available for TraceQL search queries, not just trace ID lookups. Second, it aids recovery on restart. In microservices mode, if the live-store restarts, it replays from Kafka, and the WAL provides a way to serve queries during replay.

The WAL is eventually cut into complete blocks that are also stored locally. These blocks are queryable until the data ages out of the live-store’s retention window.

Key metrics

Metric	Description
`tempo_live_store_traces_created_total`	Total number of traces created in the live-store
`tempo_live_store_lagged_requests_total`	Requests where the live-store could not guarantee complete results due to Kafka lag, labeled by `route`
`tempo_warnings_total`	Warnings during trace processing, labeled by `reason`
`tempo_ingest_group_partition_lag{group="live-store"}`	Consumer lag per partition

Refer to the live-store configuration for the full list of options.

Query frontend

The query frontend is the entry point for all queries in Tempo. It receives TraceQL queries and trace ID lookups, shards them into parallel jobs, and distributes those jobs to queriers for execution.

How it works

The query frontend handles the full lifecycle of a query. It shards a single query into many smaller jobs, each covering a subset of the data (for example, a subset of blocks or a time range). Jobs are placed in a per-tenant queue and dispatched to queriers in batches, reducing round-trip overhead. As queriers return partial results, the frontend merges and deduplicates them into a final response. If a querier fails to process a job, the frontend retries it on another querier. For search queries with a result limit, the frontend cancels remaining jobs as soon as enough results are collected.

Job sharding

The frontend uses target_bytes_per_job to estimate how large each job should be. Smaller values create more, smaller jobs (higher parallelism but more overhead). Larger values create fewer, bigger jobs (less overhead but lower parallelism).

The total number of jobs for a query depends on the time range, the volume of data in that range, and the target_bytes_per_job setting.
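To get a feel for how target_bytes_per_job affects job count, the sketch below assumes each block is split independently into jobs of roughly the target size. That per-block splitting is a simplifying assumption, not a description of the frontend's actual sharding logic.

```python
import math

MIB = 1024 * 1024


def estimate_jobs(block_sizes_bytes, target_bytes_per_job):
    """Estimate the number of jobs needed to cover the given blocks."""
    return sum(
        max(1, math.ceil(size / target_bytes_per_job))
        for size in block_sizes_bytes
    )
```

Under this model, two blocks of 250 MiB and 50 MiB produce 4 jobs at a 100 MiB target and 6 jobs at a 50 MiB target.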

Concurrent jobs

The concurrent_jobs setting controls how many jobs for a single query are dispatched to the queue at once. If a query produces 5,000 jobs and concurrent_jobs is 1,000, only 1,000 jobs are active at a time. As jobs complete, new ones are dispatched.

This limits the blast radius of a single large query. In shared clusters, keeping this value lower ensures fair scheduling across tenants.
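The windowing behavior can be simulated in a few lines. This is a toy model with hypothetical names: jobs complete one at a time, and a new job is dispatched whenever a slot frees up.

```python
from collections import deque


def dispatch(total_jobs, concurrent_jobs):
    """Simulate dispatch; return (peak in-flight jobs, completed jobs)."""
    pending = deque(range(total_jobs))
    in_flight = set()
    peak = 0
    completed = 0
    while pending or in_flight:
        # Top up the window until the concurrency limit is reached.
        while pending and len(in_flight) < concurrent_jobs:
            in_flight.add(pending.popleft())
        peak = max(peak, len(in_flight))
        in_flight.pop()  # one job completes, freeing a slot
        completed += 1
    return peak, completed
```

With 5,000 jobs and a limit of 1,000, the simulation never has more than 1,000 jobs in flight, yet all 5,000 eventually complete.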

Querier connections

Queriers connect to the query frontend over streaming gRPC. Each connection processes one batch at a time synchronously. The number of concurrent connections from a querier determines how many batches it can process in parallel.

This is controlled by either querier.max_concurrent_queries (maximum total concurrent jobs per querier) or querier.frontend_worker.parallelism (number of connections per query frontend).

Key configuration

query_frontend:
  max_outstanding_per_tenant: 2000  # Max jobs in queue per tenant
  max_batch_size: 7                 # Jobs per batch sent to querier
  max_retries: 2                    # Retry count for failed jobs
  search:
    concurrent_jobs: 2000           # Max concurrent jobs per query
    target_bytes_per_job: 104857600 # ~100MB per job

Refer to Tune search performance for detailed tuning guidance.

Key metrics

Metric	Description
`tempo_query_frontend_queries_total`	Total queries received
`tempo_query_frontend_queue_length`	Current queue depth per tenant

Refer to the query-frontend configuration for the full list of options.

Querier

The querier is the worker component that executes query jobs dispatched by the query frontend. It fetches trace data from both live-stores (for recent data) and object storage (for historical data), then returns results to the query frontend for merging.

Why the querier exists

Trace data in Tempo lives in two places: recent data in live-stores and historical data in object storage blocks. The querier bridges both sources, fetching and merging data so that the query frontend doesn’t need to know where data lives. This separation lets you scale query execution independently from query planning and result merging.

Query execution

When a querier receives a batch of jobs from the query frontend, it processes each job by determining where the relevant data lives.

For recent data, the querier contacts live-stores that own the partitions covering the query’s time range. Live-stores respond with any matching spans held in memory or their local WAL.

For historical data, the querier reads block metadata from the blocklist, identifies which blocks may contain matching data, and fetches the relevant portions from object storage. For trace ID lookups, bloom filters let the querier skip blocks that don’t contain the requested trace ID.

Results from both sources are combined and returned to the query frontend.

Live-store queries

The querier uses the partition ring to determine which live-stores to contact for a given query. For zone-aware deployments, the querier only needs a response from one live-store per partition (read quorum of 1).

If a live-store is unavailable, the querier falls back to the live-store in the other availability zone. If no live-store is available for a partition, recent data for that partition is temporarily unavailable, but historical queries still work.
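A read quorum of 1 with zone fallback can be sketched as trying each zone's live-store in turn and returning the first answer. The callables and the use of ConnectionError are stand-ins for real clients and are assumptions of this example.

```python
def query_partition(replicas_by_zone, query):
    """Return the first successful response from any zone's live-store, or None."""
    for live_store in replicas_by_zone.values():
        try:
            return live_store(query)
        except ConnectionError:
            continue  # fall back to the live-store in the next zone
    return None  # recent data for this partition is temporarily unavailable
```

A None result models the case where no live-store answers for the partition; the caller can still serve historical data from object storage.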

Backend queries

For historical data, the querier consults the blocklist (maintained by backend workers) to find blocks in the relevant time range. It uses bloom filters to quickly eliminate blocks that don’t contain the target trace ID, fetches matching block data from object storage (using caching where configured), reads the Parquet data, and applies any TraceQL filters.

Caching

Queriers benefit significantly from caching. Tempo supports multiple cache tiers.

Frontend search cache: caches query results at the frontend level. It has a low hit rate and is mainly useful for repeated queries.
Parquet page cache: caches individual Parquet pages. It has a high hit rate and is useful across many different queries.
Bloom filter cache: caches the bloom filters used for trace ID lookups, also with a high hit rate.

Lower-level caches (bloom, Parquet page) have higher hit rates and should be sized more generously than higher-level caches.

Concurrency

The number of jobs a querier processes concurrently is controlled by max_concurrent_queries (the maximum number of jobs processed at once) or frontend_worker.parallelism (the number of connections to each query frontend, which determines concurrent batch processing).

Increasing concurrency makes queriers process more jobs in parallel but increases memory usage. If queriers run out of memory, reduce concurrency and scale horizontally instead.

Memory sizing

Querier memory usage roughly scales with: job_size * querier_concurrency + buffer. You can tune this by adjusting target_bytes_per_job (at the frontend), max_concurrent_queries (at the querier), and frontend_worker.parallelism (which affects how many batches the querier processes at once).
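As a worked example of the sizing formula, the numbers below (100 MiB jobs, 20 concurrent jobs, 1 GiB of buffer) are hypothetical inputs, not recommendations.

```python
MIB = 1024 * 1024


def estimate_querier_memory(job_size_bytes, concurrency, buffer_bytes):
    """Rough estimate: job_size * querier_concurrency + buffer."""
    return job_size_bytes * concurrency + buffer_bytes


# 100 MiB * 20 + 1024 MiB = 3024 MiB
estimate_mib = estimate_querier_memory(100 * MIB, 20, 1024 * MIB) // MIB
```

Halving either the job size or the concurrency removes 1000 MiB from this estimate, which is why both settings are tuned together.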

Refer to the querier configuration for the full list of options.

Compaction

The backend scheduler and worker replace the legacy compactor. Together, they handle compaction, retention, and blocklist maintenance for data in object storage.

How it works

The backend scheduler creates jobs and assigns them to workers. Workers connect to the scheduler via gRPC, request jobs, execute them, and report results back. This split makes compaction horizontally scalable—you can add workers to increase throughput without changing the scheduler.

Job types

The scheduler produces three types of jobs:

Compaction: merges small blocks into larger ones to reduce the number of blocks queriers need to scan and improve query performance.
Retention: deletes blocks older than the configured retention period.
Redaction: rewrites blocks to remove matching trace data from object storage.

Job lifecycle

The scheduler uses providers to generate jobs. Each provider runs independently and feeds jobs into a shared channel.

The compaction provider periodically measures tenants and produces compaction jobs based on the blocklist.
The retention provider produces retention jobs on a schedule.
The redaction provider drains a persistent queue of pending redaction requests. The scheduler’s rescan logic handles waiting for any compaction jobs that were active at submission time to complete before the rewritten blocks become eligible for querying.

When a worker calls Next, the scheduler assigns an available job and persists the assignment to a local work cache. The worker executes the job and calls UpdateJob with a success or failure status. On success, the scheduler applies the results to the in-memory blocklist (for example, marking compacted blocks as removed). The work cache is periodically flushed to object storage for crash recovery.
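The hand-off between scheduler and worker can be modeled in memory as follows. The class, its fields, and the job dictionary layout are hypothetical; the real interaction happens over gRPC and the work cache is also persisted.

```python
class Scheduler:
    """In-memory stand-in for the scheduler's job hand-off."""

    def __init__(self, jobs, blocklist):
        self.available = list(jobs)      # jobs produced by the providers
        self.work_cache = {}             # job_id -> {"worker": ..., "status": ...}
        self.blocklist = set(blocklist)  # in-memory view of live blocks

    def next(self, worker_id):
        """Assign the next available job and record the assignment."""
        if not self.available:
            return None
        job = self.available.pop(0)
        self.work_cache[job["id"]] = {"worker": worker_id, "status": "running"}
        return job

    def update_job(self, job, succeeded):
        """Record the outcome; on success, mark the compacted blocks as removed."""
        entry = self.work_cache[job["id"]]
        entry["status"] = "succeeded" if succeeded else "failed"
        if succeeded:
            self.blocklist -= set(job["input_blocks"])
```

A failed job leaves the blocklist untouched, so its input blocks remain visible to queries.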

Backend scheduler

The scheduler is a singleton: only one instance should run at a time. It maintains the work cache, which tracks all active and completed jobs, and polls object storage to keep the blocklist up to date.

The scheduler exposes an HTTP status endpoint that lists all known jobs with their status, tenant, worker assignment, and timestamps.

backend_scheduler:
  maintenance_interval: 1m
  backend_flush_interval: 1m

Backend worker

Workers are stateless job executors. Each worker connects to the scheduler, requests a job, processes it, and reports back. Multiple workers can run in parallel.

Workers also maintain the blocklist for all tenants. By default the ring is disabled, so each worker polls all tenants without sharding.

When the ring is enabled, tenant polling is sharded: the ring determines which worker is responsible for polling each tenant’s blocklist, so each worker polls only a subset of tenants. This distributes the load of scanning object storage across all workers.

backend_worker:
  backend_scheduler_addr: backend-scheduler:9095
  finish_on_shutdown_timeout: 30s

Graceful shutdown

When a worker receives a shutdown signal, it has a configurable timeout (finish_on_shutdown_timeout) to complete the current job before being terminated. This prevents partially completed jobs from being left in an inconsistent state.

Scheduler status API

The backend scheduler exposes an HTTP endpoint that shows the current state of all jobs:

GET /status/backendscheduler

The response is a plain-text table with two sections:

Active Jobs: all jobs in the scheduler work cache, sorted by creation time. This includes jobs in any state – use the status column to interpret each row. A non-empty worker field indicates the job is currently assigned to a worker.
Pending Jobs: redaction jobs in the pending queue. Some may already be eligible to run; others may still be waiting for the rescan or compaction preconditions to clear.

This endpoint is useful for diagnosing stalled jobs, verifying that workers are consuming work, and checking whether a redaction request has been processed.

Key metrics

Metric	Description
`tempodb_compaction_blocks_total`	Blocks compacted
`tempodb_compaction_bytes_written_total`	Bytes written during compaction
`tempodb_retention_marked_for_deletion_total`	Blocks marked for deletion by retention
`tempodb_retention_deleted_total`	Blocks deleted by retention
`tempo_backend_scheduler_jobs_created_total`	Jobs created
`tempo_backend_scheduler_jobs_completed_total`	Jobs completed successfully
`tempo_backend_scheduler_jobs_failed_total`	Jobs that failed
`tempo_backend_scheduler_jobs_active`	Jobs currently assigned to a worker
`tempo_backend_scheduler_job_duration_seconds`	Job duration histogram, measured from job creation to completion
`tempodb_blocklist_length`	Number of live blocks per tenant; high values indicate compaction is falling behind
`tempodb_compaction_outstanding_blocks`	Outstanding blocks awaiting compaction per tenant; the primary autoscaling signal

Most scheduler job metrics carry tenant and job_type labels; tempo_backend_scheduler_job_duration_seconds carries only job_type. The job_type label uses protobuf enum string values: JOB_TYPE_COMPACTION, JOB_TYPE_RETENTION, and JOB_TYPE_REDACTION. The duration histogram measures elapsed time from job creation to completion, not execution time alone.

Monitoring

The Tempo mixin ships a pre-built Grafana dashboard, Tempo - Backend Work, that covers:

Blocklist length and poll duration
Active, completed, failed, and retried job counts
Compaction throughput (objects written, bytes written, blocks compacted)
Outstanding blocks per tenant
CPU and memory for both the backend scheduler and backend workers
A backend-worker autoscaling panel

To use the dashboard, install the Tempo mixin from operations/tempo-mixin/ and import the generated dashboard into your Grafana instance.

Refer to Compaction operations for timing requirements and block selection details, and to the Configuration reference for the full list of options.

Metrics-generator

The metrics-generator is an optional component that derives metrics from trace data, which are then remote-written to a metrics backend, for example, Prometheus or Grafana Mimir.

How the metrics-generator receives data depends on the deployment mode:

Microservices mode: The metrics-generator consumes trace data from Kafka as an independent consumer group.
Monolithic mode: The metrics-generator receives trace data directly from the distributor in-process. No Kafka consumption is involved.

Why it matters

Traces contain rich information about service interactions, latencies, and error rates. The metrics-generator extracts this information and produces time-series metrics, enabling alerting and Grafana dashboards without requiring separate instrumentation.

It supports two types of metric generation. Span metrics produce request rate, error rate, and duration (RED) metrics from individual spans. These can be broken down by service, operation, status code, and custom dimensions extracted from span attributes. Service graphs build a graph of service-to-service communication by matching client and server spans, producing metrics for request rates, error rates, and latencies between service pairs.

Kafka consumption

In microservices mode, the metrics-generator consumes trace data directly from Kafka, like live-stores and block-builders. It runs as an independent consumer group, tracking its own offsets separately.

Monitoring consumption

Use the following metrics to verify the generator is consuming data:

tempo_ingest_group_partition_lag{group="metrics-generator"}
tempo_ingest_group_partition_lag_seconds{group="metrics-generator"}

High or growing lag indicates the generator is falling behind. The tempo_ingest_storage_reader family of metrics exposes detailed information about fetch operations and errors from the Kafka client library.
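One way to turn "growing lag" into a concrete check is to compare consecutive samples of the lag metric. The helper below is a minimal sketch; lag_is_growing and min_points are hypothetical names, and the samples are assumed to have been read from tempo_ingest_group_partition_lag_seconds by whatever query tooling you already use.

```python
def lag_is_growing(samples, min_points=3):
    """Return True if the most recent `min_points` samples strictly increase.

    `samples` is a list of lag values ordered from oldest to newest.
    """
    if len(samples) < min_points:
        return False  # not enough data to call it a trend
    recent = samples[-min_points:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A strictly increasing window is a deliberately simple criterion; a production alert would more likely use a rate or threshold over a longer window.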

Active series limiting

The generator protects itself and downstream metrics storage with configurable limits.

Series-based limiting

You can cap the total number of active time series the generator produces:

overrides:
  defaults:
    metrics_generator:
      max_active_series: 0  # 0 = unlimited

This value is per metrics-generator instance. The actual maximum across the cluster is <instances> * max_active_series.
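As a quick worked example of that formula, the following sketch computes the cluster-wide ceiling. The function name is hypothetical, and it treats 0 as "no ceiling" to match the unlimited default shown above.

```python
def cluster_series_ceiling(instances, max_active_series):
    """Upper bound on active series across all metrics-generator instances."""
    if max_active_series == 0:
        return None  # 0 means unlimited, so there is no ceiling
    return instances * max_active_series
```

For instance, three instances with a per-instance limit of 50,000 allow up to 150,000 active series in total.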

When the limit is reached, the generator produces overflow series with the label metric_overflow="true" instead of dropping data entirely. As existing series become stale, new series split out from the overflow bucket.
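The overflow behavior can be pictured with a small simulation. This is only a sketch of the idea, not the generator's code: SeriesLimiter, resolve, and expire are hypothetical names, and the sketch assumes the overflow series carries only the metric_overflow label.

```python
class SeriesLimiter:
    """Toy model of a per-instance active series limit with an overflow bucket."""

    def __init__(self, max_active_series):
        self.max_active_series = max_active_series  # 0 = unlimited
        self.active = set()

    def resolve(self, labels):
        """Return the labels the sample is recorded under."""
        key = frozenset(labels.items())
        if key in self.active:
            return labels
        if self.max_active_series == 0 or len(self.active) < self.max_active_series:
            self.active.add(key)
            return labels
        # Limit reached: record under the overflow series instead of dropping.
        return {"metric_overflow": "true"}

    def expire(self, labels):
        """Mark a series as stale, freeing capacity for a new one."""
        self.active.discard(frozenset(labels.items()))
```

Once an existing series expires, the next sample for a previously overflowing label set is admitted under its own labels.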

Entity-based limiting

Entity-based limiting is an alternative to series-based limiting. An entity is a unique label combination (excluding external labels) across multiple metrics. Entity limiting ensures the generator always produces the full set of metrics for a given entity rather than limiting randomly.

metrics_generator:
  limiter_type: entity
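To illustrate the difference from series-based limiting, the sketch below admits or rejects whole entities, so an admitted entity always gets every metric. All names here are hypothetical, including max_entities, which is not taken from the Tempo configuration, and the rejection behavior is simplified to returning nothing.

```python
class EntityLimiter:
    """Toy limiter that counts unique label combinations, not individual series."""

    def __init__(self, max_entities):
        self.max_entities = max_entities  # hypothetical parameter name
        self.entities = set()

    def admit(self, labels):
        entity = frozenset(labels.items())
        if entity in self.entities:
            return True
        if len(self.entities) < self.max_entities:
            self.entities.add(entity)
            return True
        return False


def emit(limiter, labels, metric_names):
    """Emit every metric for an admitted entity, or none for a rejected one."""
    if not limiter.admit(labels):
        return []
    return [(name, labels) for name in metric_names]
```

Because the decision is made once per entity, the output never contains a partial set of metrics for a given label combination.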

Per-label cardinality limiting

You can cap the number of distinct values a single label can have. When exceeded, new values are replaced with __cardinality_overflow__ while other labels remain unaffected.

overrides:
  defaults:
    metrics_generator:
      max_cardinality_per_label: 0  # 0 = disabled
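The per-label replacement described above can be sketched as follows. The class and method names are hypothetical, and the sketch tracks seen values in memory without any expiry, which the real generator may handle differently.

```python
from collections import defaultdict

OVERFLOW = "__cardinality_overflow__"


class LabelCardinalityLimiter:
    """Toy model of a cap on distinct values per label name."""

    def __init__(self, max_cardinality_per_label):
        self.max = max_cardinality_per_label  # 0 = disabled
        self.seen = defaultdict(set)

    def apply(self, labels):
        """Return labels with over-limit values replaced by the overflow marker."""
        out = {}
        for name, value in labels.items():
            values = self.seen[name]
            if self.max == 0 or value in values:
                out[name] = value
            elif len(values) < self.max:
                values.add(value)
                out[name] = value
            else:
                out[name] = OVERFLOW  # only this label is rewritten
        return out
```

Note that only the label that exceeded its cap is rewritten; the other labels on the same series keep their original values.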

Remote write

The generator writes metrics to one or more remote-write endpoints. Monitor write health with:

prometheus_remote_storage_samples_failed_total
prometheus_remote_storage_samples_dropped_total

Refer to the metrics-generator documentation for configuration and usage details.
