Tempo architecture reference on Grafana Labs

About the Tempo architecture

Thu, 28 May 2026 17:50:33 +0100

Grafana Tempo is a distributed tracing backend designed for high-volume trace ingestion and querying at scale. Tempo 3.0 introduces a new architecture that decouples the write and read paths. In microservices mode, a Kafka-compatible message queue serves as a durable intermediary between the distributor and downstream consumers. In monolithic mode, the distributor pushes data directly to the live-store in-process without Kafka.

Design philosophy

The Tempo architecture is built around several key principles.

Separate components handle writing trace data to storage and serving queries. You can scale writes and reads independently, and a failure in one path doesn’t affect the other.

In microservices mode, Kafka serves as a durable write-ahead log (WAL) between distributors and downstream consumers. Once Kafka acknowledges a write, the data is safe. This replaces the previous in-process ingestion WAL that lived on local disks. Because Kafka provides durability on the write path, Tempo doesn’t need to replicate data across multiple instances. Running with a replication factor of 1 significantly reduces cost and query complexity.

In monolithic mode, no Kafka is required. The distributor pushes trace data in-process directly to the live-store and metrics-generator. Live-stores still use a local WAL for quickly available search.

Tempo uses Apache Parquet as the default columnar block format, storing trace data in a columnar layout that enables efficient querying of specific attributes without reading entire traces.

Refer to Apache Parquet block format configuration and Apache Parquet schema to learn more.

Write path

The write path gets trace data from instrumented applications into long-term object storage. How data flows through the write path depends on the deployment mode.

Microservices write path

1. Distributors receive trace data over OTLP or other supported protocols, validate it against rate limits, shard traces by trace ID, and write records to Kafka partitions.
2. Kafka durably stores the records. The write is acknowledged to the client as soon as Kafka confirms receipt.
3. Block-builders consume records from Kafka, organize spans into blocks in Apache Parquet format, and flush those blocks to object storage.

The block-builder operates on a consumption cycle: it reads a batch of records from Kafka, builds blocks from them, flushes the blocks to object storage, and commits the offset back to Kafka. Each cycle produces a clean cut of data. Traces that span multiple cycles have their spans split across blocks, which the query path handles at query time.
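
The consumption cycle described above can be sketched in a few lines. This is a minimal illustration, not the block-builder's implementation: `FakePartition` and `FakeObjectStore` are hypothetical in-memory stand-ins for a Kafka partition and object storage, and `batch_size` is an assumed parameter.

```python
from dataclasses import dataclass, field


@dataclass
class FakePartition:
    """Hypothetical in-memory stand-in for one Kafka partition."""
    records: list          # (trace_id, span_name) tuples in offset order
    committed: int = 0     # next offset to consume after a restart


@dataclass
class FakeObjectStore:
    """Hypothetical stand-in for object storage."""
    blocks: list = field(default_factory=list)


def run_cycle(partition, store, batch_size):
    """One cycle: read a batch, build a block, flush it, then commit the offset."""
    start = partition.committed
    batch = partition.records[start:start + batch_size]
    if not batch:
        return False
    # Build: group spans by trace ID, using only the records in this batch.
    block = {}
    for trace_id, span in batch:
        block.setdefault(trace_id, []).append(span)
    # Flush before committing, so a crash here replays the batch instead of losing it.
    store.blocks.append(block)
    partition.committed = start + len(batch)
    return True


partition = FakePartition(records=[("t1", "a"), ("t2", "b"), ("t1", "c"), ("t1", "d")])
store = FakeObjectStore()
while run_cycle(partition, store, batch_size=2):
    pass
```

With a batch size of 2, trace `t1` ends up with spans in both blocks, which is the split that the query path reassembles at query time.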

Monolithic write path

In monolithic mode, no Kafka or block-builder is involved. The distributor pushes trace data in-process directly to the live-store and metrics-generator. The live-store holds traces in memory, flushes them to a local WAL, cuts them into completed blocks, and flushes those completed blocks to the configured storage backend.

Read path

The read path serves queries by combining recent data from live-stores with historical data from object storage.

1. The query frontend receives a query, shards it into parallel jobs, and distributes them to queriers.
2. Queriers execute jobs by fetching data from two sources: live-stores for recent data, typically the last 30 minutes to 1 hour, and object storage for historical data, using bloom filters and indexes for efficient block lookups.
3. The query frontend merges results from all queriers and returns the response.
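
As a rough sketch of how results from the two sources combine for a trace lookup, the following function gathers spans from a live-store and from several blocks and merges them. The dictionary shapes and the choice to deduplicate by span ID are assumptions made for illustration, not Tempo's actual merge logic.

```python
def query_trace(trace_id, live_store, object_store_blocks):
    """Collect spans for one trace from recent and historical sources, then merge."""
    spans = {}
    # Recent data: spans still held by the live-store.
    for span in live_store.get(trace_id, []):
        spans[span["span_id"]] = span
    # Historical data: a trace may be split across several blocks.
    for block in object_store_blocks:
        for span in block.get(trace_id, []):
            # Keep the first copy when the same span appears in both sources.
            spans.setdefault(span["span_id"], span)
    return sorted(spans.values(), key=lambda s: s["start"])


live_store = {"t1": [{"span_id": "s3", "start": 30}]}
blocks = [
    {"t1": [{"span_id": "s1", "start": 10}]},
    {"t1": [{"span_id": "s2", "start": 20}, {"span_id": "s3", "start": 30}]},
]
result = query_trace("t1", live_store, blocks)
```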

Live-stores and the recent data window

Live-stores are the read-path component responsible for serving recent trace data. They hold traces in memory and write them to temporary on-disk blocks, making data available for queries within seconds of ingestion.

In microservices mode, live-stores consume from Kafka independently of block-builders. There’s a gap between when trace data is written to Kafka and when the block-builder flushes it to object storage. During this window, the only way to query that data is through the live-store.

In monolithic mode, live-stores receive trace data directly from the distributor in-process. There is no Kafka consumption or block-builder involved.

In microservices mode, live-stores own the partition lifecycle within Tempo. They manage a partition ring that tracks which partitions are active and which live-stores own them. This is separate from Kafka’s internal partition management. Refer to the partition ring documentation for details.

How the paths connect

The write and read paths connect through object storage. In microservices mode, block-builders write blocks there; in monolithic mode, live-stores write blocks there. Queriers read from object storage in both modes.

In microservices mode, Kafka also connects the paths. Both block-builders and live-stores consume from the same Kafka partitions, but they track their own consumer offsets independently. Even if a block-builder is down or slow, live-stores continue serving recent data. If a live-store restarts, it replays from Kafka to rebuild its in-memory state.
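
The independence of the two consumers comes from each one tracking its own position in the same log. The sketch below models that with a plain Python list. The `Consumer` class is hypothetical and is not a Kafka client. The restarted live-store replays from offset 0 here only to keep the example short.

```python
class Consumer:
    """Tracks its own offset into a shared, append-only log (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.offset = 0

    def poll(self, log, max_records):
        """Return up to max_records records and advance only this consumer's offset."""
        batch = log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


log = ["r0", "r1", "r2", "r3", "r4"]
live_store = Consumer("live-store")
block_builder = Consumer("block-builder")

live_store.poll(log, max_records=5)      # the live-store is caught up
block_builder.poll(log, max_records=2)   # the block-builder lags behind

# A restarted live-store replays from the log to rebuild its in-memory state.
restarted = Consumer("live-store")
replayed = restarted.poll(log, max_records=len(log))
```

The lagging block-builder has no effect on what the live-store has read, which is why recent data stays queryable while blocks are still being built.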

In monolithic mode, the connection is simpler. The distributor pushes data directly to the live-store, which flushes blocks to object storage. The querier reads from both the live-store and object storage within the same process.

Component summary

Component	Path	Microservices mode	Monolithic mode
Distributor	Write	Receives traces, validates limits, writes to Kafka	Receives traces, validates limits, pushes in-process to live-store
Kafka	Write	Durable message queue between distributor and consumers	Not used
Block-builder	Write	Consumes from Kafka, builds Parquet blocks, flushes to object storage	Not used
Live-store	Read	Consumes from Kafka, serves recent data to queriers	Receives data from distributor, serves recent data to queriers
Query frontend	Read	Shards queries into jobs, distributes to queriers, merges results	Same
Querier	Read	Executes query jobs against live-stores and object storage	Same
Backend scheduler/worker	Maintenance	Compacts and deduplicates blocks, enforces retention	Same
Metrics-generator	Optional	Consumes from Kafka, derives metrics from traces	Receives data from distributor, derives metrics from traces

Components

Tempo is composed of several components, each responsible for a specific part of the trace lifecycle. All components are compiled into the same binary, and the -target parameter controls which component runs.

This section documents each component in detail, including its responsibilities, configuration, failure modes, and relevant metrics.

Partition ring

The partition ring is the mechanism Tempo uses to track which partitions exist, their current state, and which components own them. By default, the partition ring propagates across the cluster via memberlist gossip and is central to how distributors, live-stores, and block-builders coordinate.

Tempo partitions vs Kafka partitions

Tempo maintains its own concept of partitions that are logically distinct from Kafka partitions. While there’s typically a 1:1 mapping, the partition ring gives Tempo independent control over partition states (pending, active, inactive), ownership (which live-store owns each partition), and lifecycle management (creating, activating, and deactivating partitions without modifying Kafka’s configuration).

Partition states

Each partition in the ring has one of three states.

Pending

Pending is the initial state when a new partition is created. No reads or writes occur.

A partition enters pending state when a new live-store starts and creates a partition that doesn’t yet exist in the ring. It stays in pending until enough owners have registered and a minimum waiting period has elapsed, at which point the owning live-store automatically promotes it to active.

Active

Active is the normal operating state. Distributors write data to active partitions, and queriers read from them.

A partition transitions from pending to active once enough owners have registered for that partition and a configurable waiting duration has elapsed. This ensures that all availability zones have had time to register their live-store instances before traffic starts flowing.

Inactive

Inactive is the read-only state. Distributors stop writing to inactive partitions, but queriers can still read from them.

A partition is marked inactive when scaling down. It must remain in this state long enough for the block-builder to flush all remaining data for this partition to object storage, and for queriers to stop relying on the live-store for this partition’s recent data.

After this grace period, you can safely remove the partition and its owning live-store.
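
The three states and their read/write rules can be summarized as a small state model. This is a sketch of the rules as described above. The parameter names `required_owners` and `min_wait_s` are hypothetical and don't correspond to documented configuration keys.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    INACTIVE = "inactive"


def can_write(state):
    """Distributors write only to active partitions."""
    return state is State.ACTIVE


def can_read(state):
    """Queriers read from active and inactive partitions, never pending ones."""
    return state in (State.ACTIVE, State.INACTIVE)


def maybe_promote(state, owners, required_owners, waited_s, min_wait_s):
    """Promote pending -> active once enough owners registered and the wait elapsed."""
    if state is State.PENDING and owners >= required_owners and waited_s >= min_wait_s:
        return State.ACTIVE
    return state
```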

Ownership model

Live-stores

Each Tempo partition is owned by one live-store per availability zone. In a zone-aware deployment with two zones, each partition has two owners—one per zone. Both consume the same Kafka partition independently.

When a live-store starts, it checks the ring for its assigned partition. If the partition exists, it joins as an owner. If not, it creates the partition in pending state and waits for enough owners to register.
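
The join-or-create behavior at startup might look like the following. The ring is modeled as a plain dictionary and the instance names are made up for the example. The real ring is a replicated data structure propagated by memberlist.

```python
def on_live_store_start(ring, partition_id, instance):
    """Join the partition as an owner, or create it in pending state if missing."""
    entry = ring.get(partition_id)
    if entry is None:
        # The partition doesn't exist yet: create it as pending with one owner.
        ring[partition_id] = {"state": "pending", "owners": [instance]}
    elif instance not in entry["owners"]:
        # The partition exists: register as an additional owner.
        entry["owners"].append(instance)
    return ring[partition_id]


ring = {}
on_live_store_start(ring, 0, "live-store-zone-a-0")
on_live_store_start(ring, 0, "live-store-zone-b-0")
```

After both calls, partition 0 is still pending with one owner per zone. Promotion to active is a separate step that also depends on the waiting period.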

Distributors

Distributors read the partition ring to determine which partitions are active. They only send data to active partitions. The ring tells distributors which Kafka partitions to write to.
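
To illustrate how a distributor could combine the ring with trace-ID sharding, the sketch below hashes the trace ID and picks one of the active partitions. The hash-and-modulo scheme is an assumption chosen for brevity and is not Tempo's actual selection algorithm. The point is only that every span of a trace maps to the same active partition.

```python
import hashlib


def pick_partition(trace_id: bytes, ring: dict) -> int:
    """Map a trace ID to one active partition so all of its spans land together."""
    active = sorted(p for p, state in ring.items() if state == "active")
    if not active:
        raise RuntimeError("no active partitions in the ring")
    digest = hashlib.sha256(trace_id).digest()
    return active[int.from_bytes(digest[:8], "big") % len(active)]


ring = {0: "active", 1: "pending", 2: "active", 3: "inactive"}
chosen = pick_partition(b"trace-1", ring)
```

A simple modulo reshuffles many traces whenever the set of active partitions changes, so treat this only as a picture of the idea.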

Block-builders

Each block-builder instance computes which Kafka partitions it owns based on its ordinal ID and the partitions_per_instance setting. The partition ring indirectly affects block-builders because it determines which partitions receive data from distributors.
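
One plausible way to derive ownership from an ordinal is to give each instance a contiguous range of partitions. The contiguous-range formula below is an assumption for illustration. The text above only states that ownership depends on the ordinal ID and `partitions_per_instance`, and `total_partitions` is a hypothetical parameter.

```python
def owned_partitions(ordinal: int, partitions_per_instance: int, total_partitions: int) -> list:
    """Return the partition IDs an instance would own under contiguous assignment."""
    start = ordinal * partitions_per_instance
    end = min(start + partitions_per_instance, total_partitions)
    return list(range(start, end))
```

For example, with 5 partitions and 2 partitions per instance, instance 0 would own `[0, 1]`, instance 2 would own `[4]`, and instance 3 would own nothing.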

Scaling

Scaling up

To scale up, deploy a new live-store instance. The live-store creates a new partition in the ring (pending state). After enough owners register and the waiting period elapses, the partition transitions to active, and distributors begin writing to the new partition.

A corresponding Kafka partition must exist. Add Kafka partitions first if needed.

Scaling down

To scale down, mark the target partition as inactive while the live-store is still running. Distributors stop writing to it. Wait for the block-builder to flush remaining data to object storage, then remove the live-store instance. The partition is eventually cleaned up from the ring.

Skipping the inactive step and abruptly removing a live-store causes recent data for that partition to become temporarily unavailable (unless a zone-aware replica exists).

Memberlist propagation

The partition ring state is propagated using memberlist, which uses a gossip protocol. Changes to the ring (new partitions, state transitions) propagate across the cluster within seconds under normal conditions.

During network partitions or high cluster churn, propagation may be delayed. This can cause brief inconsistencies where different components have different views of the ring. Tempo handles this gracefully. If a distributor writes to a partition that a live-store hasn’t yet seen, the live-store picks up the data after it catches up. If a querier contacts a live-store for a partition it doesn’t own yet, it gets an empty response, and the data is eventually available from another live-store or from object storage.

Refer to the memberlist configuration for ring propagation settings and the ingest configuration for partition-related settings.

Deployment modes

Tempo can be deployed in monolithic or microservices mode. Microservices mode requires a Kafka-compatible system. Monolithic mode doesn’t use Kafka.

Monolithic mode was previously called single binary mode.

Note
The previous scalable monolithic mode, also known as scalable single binary mode or SSB, has been removed in v3.0.

All components are compiled into the same binary. The -target command-line parameter, or target in configuration, determines which components run in a given process. The default target is all, which is the monolithic deployment mode.

tempo -target=all

Refer to the command line flags documentation for more information on the -target flag.

Monolithic mode

In monolithic mode, the required components run in a single process using -target=all, which is the default. Components that are only needed in microservices mode, such as the block-builder, are excluded.

No Kafka is required. The distributor pushes trace data in-process directly to the live-store and metrics-generator. Traces are flushed to the configured storage backend without an intermediate message queue. Object storage is recommended for production deployments.

When to use monolithic mode

Monolithic mode is suitable for getting started, development environments, and low to moderate trace volumes where operational simplicity matters more than independent scaling.

Limitations

Components share the same resource pool. A spike in query load can affect write throughput and vice versa. There is no independent scaling. You can run multiple monolithic instances, but each instance runs the same set of components. At higher volumes, memory pressure from collocated components, particularly the live-store and querier, can cause out-of-memory issues.

Resource considerations

Monolithic instances need enough memory to handle the live-store’s in-memory trace buffer, the querier’s concurrent job execution, and the backend worker’s memory for block merging. As volume increases, the instance is limited by whichever component is most resource-hungry.

Example

Refer to Docker Compose examples in the Tempo repository for sample deployments.

For an annotated example configuration for Tempo, refer to the Introduction to MLTP repository, which includes a sample tempo.yaml for a monolithic instance.

Microservices mode

In microservices mode, each component runs as a separate process with its own -target. This is the recommended mode for production.

The configuration associated with each component’s deployment specifies a target. For example, to deploy a querier, the configuration would contain target: querier. A command-line deployment may specify the -target=querier flag.

When to use microservices mode

Use microservices mode for production deployments, high trace volumes requiring independent scaling, and environments where high availability is important.

Advantages

Microservices mode provides independent scaling. You can scale block-builders for write throughput, queriers for query performance, and live-stores for recent data capacity, all independently. Failure domains are isolated: a querier OOM doesn’t affect data ingestion, and a block-builder restart doesn’t affect query availability. Live-stores can be deployed across availability zones for high availability. You can size each component for its own workload instead of over-provisioning a shared process, as monolithic mode requires.

Component scaling guidelines

Component	Scaling strategy	Notes
Distributor	Horizontal	Stateless. Scale based on ingestion rate.
Block-builder	Horizontal	Bounded by Kafka partition count. Scale based on data volume.
Live-store	Horizontal	Bounded by Kafka partition count. Scale based on recent data query volume and memory.
Query frontend	Vertical	Keep to 2 replicas. Scale up CPU/RAM rather than adding replicas.
Querier	Horizontal	Scale based on query concurrency and latency requirements.
Backend worker	Horizontal	Handles compaction and retention. Scale based on block count and compaction lag.
Metrics-generator	Horizontal	Scale based on trace volume and number of generated series.

Kafka as the connecting fabric

In microservices mode, Kafka is the primary communication channel for the write path. Write-path components don’t communicate directly for data transfer: distributors write to Kafka, and block-builders, live-stores, and metrics-generators each consume from Kafka independently. On the read path, queriers contact live-stores over gRPC for recent data. All components access object storage for block data.

Adding or removing instances of any component doesn’t require reconfiguring other components, aside from Kafka partition management.

Example

Refer to the distributed Docker Compose example in the Tempo repository.

Components by deployment mode

Not all components and configuration blocks apply to both modes. The following table summarizes which components run in each mode and how shared components differ.

Component	Config block	Monolithic	Microservices
Distributor	`distributor`	Pushes data in-process to the live-store and metrics-generator	Writes data to Kafka
Ingest	`ingest`	Not used	Kafka connection settings for the write path
Block-builder	`block_builder`	Not used	Consumes from Kafka, builds Parquet blocks, flushes to object storage
Live-store	`live_store`	Receives data directly from the distributor	Consumes from Kafka
Live-store client	`live_store_client`	Querier-to-live-store client (runs in-process)	gRPC client for querier-to-live-store communication
Query-frontend	`query_frontend`	Runs in-process	Runs as a separate process
Querier	`querier`	Runs in-process	Runs as a separate process
Backend scheduler	`backend_scheduler`	Runs in-process	Runs as a separate process
Backend worker	`backend_worker`	Runs in-process	Runs as a separate process
Metrics-generator	`metrics_generator`	Optional, runs in-process	Optional, runs as a separate process
Storage	`storage`	Storage backend for trace data (object storage recommended; local supported for dev/test)	Object storage for trace data
Memberlist	`memberlist`	Cluster membership	Cluster membership
Overrides	`overrides`	Per-tenant limits	Per-tenant limits

For full configuration details, refer to the configuration reference.

Migrating between modes

Moving from monolithic to microservices mode involves deploying individual components with appropriate -target flags, pointing all components at the same Kafka cluster, object storage, and memberlist, then scaling down the monolithic instances.

Since Kafka provides durability, no data is lost during the transition. Live-stores replay from Kafka on startup, and block-builders continue from their last committed offset.

Refer to Plan your Tempo deployment for deployment planning and sizing guidance.

Block format

Tempo stores trace data in Apache Parquet format. Parquet is a columnar storage format that enables efficient querying of specific attributes without reading entire traces.

Why Parquet

Columnar storage is a natural fit for trace data. A TraceQL query like { span.http.status_code = 500 } only needs to read the http.status_code column, not the entire trace. This dramatically reduces the amount of data read from storage. Columnar data also compresses well because values in a column tend to be similar (for example, all service names or all status codes).
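
A toy comparison makes the benefit concrete. The sketch below pivots a few rows into per-attribute lists and answers the filter by scanning a single list. It ignores everything that makes Parquet efficient in practice (row groups, pages, encodings, compression), and the sample data is invented.

```python
rows = [
    {"service.name": "checkout", "span.name": "POST /pay", "http.status_code": 200},
    {"service.name": "checkout", "span.name": "POST /pay", "http.status_code": 500},
    {"service.name": "cart", "span.name": "GET /cart", "http.status_code": 200},
]

# Columnar layout: one list per attribute instead of one record per span.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def matching_rows(columns, column, value):
    """Scan a single column and return the indexes of matching rows."""
    return [i for i, v in enumerate(columns[column]) if v == value]


hits = matching_rows(columns, "http.status_code", 500)
services = [columns["service.name"][i] for i in hits]
```

Only the `http.status_code` list is touched to evaluate the filter. Other columns are read afterwards, and only for the rows that matched.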

Block structure

A block is a directory in object storage containing several files.

File	Purpose
`meta.json`	Block metadata: time range, tenant, block ID, `replaces` field for atomic replacement
`data.parquet`	Trace data in columnar format
Bloom filters	Probabilistic data structures for efficient trace ID lookups
Index	Maps trace IDs to row groups within `data.parquet`
`nocompact.flg`	Temporary flag preventing compaction (present during block-builder flushes)

Blocks are stored under <tenant-id>/<block-id>/ in object storage.
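
For readers unfamiliar with bloom filters, the following is a minimal generic implementation showing why they suit trace ID lookups: a negative answer is definitive, so a querier can skip a block without reading its data. The sizes, hash construction, and class name are illustrative assumptions and don't reflect Tempo's bloom filter format.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits          # must be a multiple of 8
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means the key is definitely absent; True means it may be present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


block_filter = BloomFilter()
block_filter.add(b"trace-a")
```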

`meta.json`

The meta.json file makes a block visible to the read path. It contains the block ID, tenant ID, start and end timestamps, and total number of objects (traces).

A block is invisible to queriers until meta.json exists. The block-builder uses this property to ensure atomicity during flushes.
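
The visibility rule suggests a simple write ordering: upload the data files first and meta.json last. The sketch below demonstrates that ordering against a temporary local directory. The function names and the `block_id` field are hypothetical, and the local filesystem stands in for object storage.

```python
import json
import tempfile
from pathlib import Path


def flush_block(root: Path, tenant: str, block_id: str, data: bytes, meta: dict):
    """Write the data file first and meta.json last, so readers never see a partial block."""
    block_dir = root / tenant / block_id
    block_dir.mkdir(parents=True)
    (block_dir / "data.parquet").write_bytes(data)
    (block_dir / "meta.json").write_text(json.dumps(meta))


def visible_blocks(root: Path, tenant: str) -> list:
    """A block counts as visible only if its meta.json exists."""
    tenant_dir = root / tenant
    if not tenant_dir.is_dir():
        return []
    return sorted(p.parent.name for p in tenant_dir.glob("*/meta.json"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Simulate an interrupted flush: data uploaded, meta.json not yet written.
    partial = root / "tenant-a" / "block-2"
    partial.mkdir(parents=True)
    (partial / "data.parquet").write_bytes(b"partial")
    flush_block(root, "tenant-a", "block-1", b"complete", {"block_id": "block-1"})
    listed = visible_blocks(root, "tenant-a")
```

Here `block-2` exists on disk but is not listed, because its meta.json was never written.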

Schema

Tempo uses a trace-oriented, heavily nested Parquet schema. Each row represents a trace, with its spans nested inside. Columns cover intrinsic fields (trace ID, span ID, parent span ID, span name, span kind, span status, duration, start time, root service name, root span name), resource attributes (for example, service.name, deployment.environment), and span attributes (for example, http.method, http.status_code).

Intrinsic fields vs generic attributes

A small number of fields are stored as top-level columns in the schema. These include service.name on resources and status.code on spans. Querying these fields is always efficient because they have their own Parquet columns.

All other attributes (both resource and span) are stored in a generic Attrs column as key-value pairs. Querying these requires scanning the attribute map, which is slower.

Dedicated attribute columns

You can promote frequently queried attributes from the generic column to their own dedicated Parquet columns for better query performance. This is configured centrally in storage.trace.block and applies to all block-producing components (live-store and block-builder). Dedicated columns are assigned dynamically per block based on the configuration at the time the block is built.

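storage:
  trace:
    block:
      parquet_dedicated_columns:
        # Span-scoped attribute promoted to its own column
        - name: http.method
          type: string
          scope: span
        # Resource-scoped attribute promoted to its own column
        - name: deployment.environment
          type: string
          scope: resource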

Refer to Dedicated attribute columns for configuration details and recommendations.

Block format versions

Tempo uses versioned block formats.

Version	Status
vParquet3	Deprecated in 2.10, removed in 3.0
vParquet4	Default in Tempo 3.0
vParquet5	Production-ready, opt-in

The block format is configured in:

storage:
  trace:
    block:
      version: vParquet4

Existing blocks in older formats remain readable. New blocks are always written in the configured version.

Refer to the Apache Parquet schema documentation for the full schema details and the Apache Parquet block format configuration for configuration options.

Object storage

Object storage is the long-term storage backend for all trace data in Tempo. Block-builders write blocks to it (in monolithic mode, live-stores do), queriers read from it, and backend workers maintain it.

Supported backends

Tempo supports three major object storage APIs:

Amazon S3 (and S3-compatible systems, for example, MinIO)
Google Cloud Storage (GCS)
Microsoft Azure Blob Storage

A local filesystem backend is also available for development and testing.

Storage layout

Data in object storage is organized by tenant and block:

<bucket>/
  <tenant-id>/
    <block-id>/
      meta.json
      data.parquet
      bloom-0
      bloom-1
      ...
      index

Each tenant has its own directory. Within a tenant, each block is a directory containing the block’s files.

Blocklist

The blocklist is the set of all known blocks for a tenant. Backend workers maintain it by periodically scanning object storage for meta.json files and writing a per-tenant block index.

Queriers and query frontends read this tenant index to determine which blocks to search for a given query. They fall back to scanning object storage for meta.json files only when the tenant index is unavailable or too stale. The blocklist is distributed across the cluster so that not every component needs to poll storage directly.
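
Because each block's metadata records its start and end timestamps, a reader can narrow the tenant index to the blocks that overlap a query's time window. The sketch below shows that overlap test. The list-of-dictionaries index and its field names are assumptions for illustration, not the tenant index format.

```python
def blocks_for_query(tenant_index, query_start, query_end):
    """Return IDs of blocks whose [start, end] range overlaps the query window."""
    return [
        meta["block_id"]
        for meta in tenant_index
        if meta["start"] <= query_end and meta["end"] >= query_start
    ]


tenant_index = [
    {"block_id": "b1", "start": 0, "end": 100},
    {"block_id": "b2", "start": 100, "end": 200},
    {"block_id": "b3", "start": 300, "end": 400},
]
selected = blocks_for_query(tenant_index, 150, 250)
```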

Tenant isolation

Tenants are fully isolated at the storage level. Each tenant’s blocks are in a separate directory prefix. There’s no cross-tenant data sharing or block merging.

Durability model

With Tempo 3.0’s Kafka-based architecture, durability works in layers.

Kafka provides immediate durability. Once data is acknowledged by Kafka, it’s safe even if all Tempo components crash. Object storage provides long-term durability. Once the block-builder flushes a block, the data is durably stored and independent of Kafka. Kafka retention bridges the gap: Kafka retains data long enough for block-builders to consume and flush it. If a block-builder is slow or restarting, Kafka holds the data until it’s processed.

There’s no single point of failure for data durability. Kafka and object storage together provide end-to-end safety.

Performance considerations

Read path

Query performance depends on how efficiently queriers can access blocks. Larger blocks (from compaction) reduce the number of blocks to search but increase individual block read time. Caching bloom filters, Parquet pages, and footers at the querier level significantly reduces object storage reads. Promoting frequently queried attributes to dedicated Parquet columns reduces the amount of data read per query.

Write path

Block-builder write performance is generally bounded by Kafka consumption rate, local disk speed (blocks are built on scratch disk before upload), and upload bandwidth to object storage.

Cost

Object storage costs are primarily driven by storage volume (total data retained, controlled by retention period and compaction efficiency) and API operations (GET/PUT/LIST calls). Compaction reduces LIST costs by consolidating blocks. Caching reduces GET costs.

Configuration

storage:
  trace:
    backend: s3  # or gcs, azure, local
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com

Refer to the storage configuration for the full list of options.
