<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Pyroscope v2 components on Grafana Labs</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/</link><description>Recent content in Pyroscope v2 components on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/index.xml" rel="self" type="application/rss+xml"/><item><title>Pyroscope v2 distributor</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/distributor/</link><pubDate>Mon, 20 Apr 2026 09:02:32 +0000</pubDate><guid>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/distributor/</guid><content><![CDATA[&lt;h1 id=&#34;pyroscope-v2-distributor&#34;&gt;Pyroscope v2 distributor&lt;/h1&gt;
&lt;p&gt;The distributor is a stateless component that serves as the entry point for the ingestion path. It receives profiling data from agents and routes it to &lt;a href=&#34;../segment-writer/&#34;&gt;segment-writers&lt;/a&gt; for storage.&lt;/p&gt;
&lt;h2 id=&#34;profile-routing&#34;&gt;Profile routing&lt;/h2&gt;
&lt;p&gt;Unlike v1 where profiles are routed to ingesters based on hash ring token distribution, the v2 distributor routes profiles to segment-writers based on the profile&amp;rsquo;s &lt;code&gt;service_name&lt;/code&gt; label. This co-location strategy ensures that profiles from the same application are stored together, which is crucial for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query performance&lt;/strong&gt;: Profiles likely to be queried together are stored in the same blocks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction efficiency&lt;/strong&gt;: Related data can be compacted more effectively&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage optimization&lt;/strong&gt;: Reduces the number of objects needed to satisfy typical queries&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;distribution-algorithm&#34;&gt;Distribution algorithm&lt;/h2&gt;
&lt;p&gt;The distributor uses a three-step process to determine where to place a profile:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Tenant shards&lt;/strong&gt;: Identify the subset of the total shards available to the tenant, using the &lt;code&gt;tenant_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dataset shards&lt;/strong&gt;: Narrow down to locations suitable for the &lt;code&gt;service_name&lt;/code&gt; label.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Final placement&lt;/strong&gt;: Select the exact shard using consistent hashing or adaptive load balancing.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This algorithm balances data locality with even distribution across the cluster.&lt;/p&gt;
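&lt;p&gt;The three steps above can be sketched as follows. This is a simplified illustration: the hash function, shard counts, and offset arithmetic are assumptions made for the example, not Pyroscope&amp;rsquo;s actual placement code.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a string key to a bucket index in [0, n).
// A simplified stand-in for the real placement hashing.
func bucket(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % n
}

// placeProfile sketches the three-step narrowing:
// tenant shards, then dataset shards, then the exact shard.
func placeProfile(tenantID, serviceName string, totalShards, tenantShards, datasetShards int) int {
	tenantOffset := bucket(tenantID, totalShards)            // step 1: shards available to the tenant
	datasetOffset := bucket(serviceName, tenantShards)       // step 2: narrow to the dataset
	final := bucket(tenantID+"/"+serviceName, datasetShards) // step 3: exact placement
	return (tenantOffset + datasetOffset + final) % totalShards
}

func main() {
	fmt.Println(placeProfile("tenant-a", "checkout-service", 64, 8, 2))
}
```

&lt;p&gt;Because every step hashes stable inputs, the same tenant and service always map to the same shard while the topology is unchanged, which is what preserves data locality.&lt;/p&gt;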
&lt;p&gt;For detailed information about the distribution algorithm, refer to &lt;a href=&#34;../../data-distribution/&#34;&gt;Data distribution&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;validation&#34;&gt;Validation&lt;/h2&gt;
&lt;p&gt;The distributor cleans and validates data before sending it to segment-writers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ensures profiles have timestamps set (defaults to receive time if missing)&lt;/li&gt;
&lt;li&gt;Removes samples with zero values&lt;/li&gt;
&lt;li&gt;Sums samples that share the same stacktrace&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a request contains invalid data, the distributor returns a 400 HTTP status code with details in the response body.&lt;/p&gt;
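&lt;p&gt;A minimal sketch of these cleanup rules, assuming a simplified sample model (a stack trace identifier plus a value); the type and function names are illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// sample is a simplified stand-in for a profile sample.
type sample struct {
	stacktraceID uint64
	value        int64
}

// normalize applies the distributor's cleanup rules to a batch:
// drop zero-valued samples and sum samples sharing a stack trace.
func normalize(samples []sample) []sample {
	sums := map[uint64]int64{}
	order := []uint64{}
	for _, s := range samples {
		if s.value == 0 {
			continue // rule: remove samples with zero values
		}
		if _, seen := sums[s.stacktraceID]; !seen {
			order = append(order, s.stacktraceID)
		}
		sums[s.stacktraceID] += s.value // rule: sum samples with the same stack trace
	}
	out := make([]sample, 0, len(order))
	for _, id := range order {
		out = append(out, sample{stacktraceID: id, value: sums[id]})
	}
	return out
}

func main() {
	in := []sample{{1, 10}, {2, 0}, {1, 5}}
	fmt.Println(normalize(in)) // zero-value sample dropped, duplicates summed
}
```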
&lt;h2 id=&#34;load-balancing&#34;&gt;Load balancing&lt;/h2&gt;
&lt;p&gt;Randomly load balance write requests across distributor instances. If you&amp;rsquo;re running Pyroscope in a Kubernetes cluster, you can define a Kubernetes &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/service/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Service&lt;/a&gt; as ingress for the distributors.&lt;/p&gt;
&lt;p&gt;The distributor discovers segment-writers through memberlist-based ring discovery, which maintains the list of available segment-writer instances.&lt;/p&gt;
&lt;h2 id=&#34;stateless-design&#34;&gt;Stateless design&lt;/h2&gt;
&lt;p&gt;The distributor is completely stateless and disk-less:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires no local storage&lt;/li&gt;
&lt;li&gt;Scales horizontally by adding more instances&lt;/li&gt;
&lt;li&gt;Allows instances to be added or removed without data migration&lt;/li&gt;
&lt;li&gt;Supports deployment in ephemeral containers&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="pyroscope-v2-distributor">Pyroscope v2 distributor&lt;/h1>
&lt;p>The distributor is a stateless component that serves as the entry point for the ingestion path. It receives profiling data from agents and routes it to &lt;a href="../segment-writer/">segment-writers&lt;/a> for storage.&lt;/p></description></item><item><title>Pyroscope v2 segment-writer</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/segment-writer/</link><pubDate>Mon, 20 Apr 2026 09:02:32 +0000</pubDate><guid>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/segment-writer/</guid><content><![CDATA[&lt;h1 id=&#34;pyroscope-v2-segment-writer&#34;&gt;Pyroscope v2 segment-writer&lt;/h1&gt;
&lt;p&gt;The segment-writer is a stateless component that accumulates incoming profiles in memory and periodically writes them to object storage as segments. This is a new component in v2 that replaces the ingester&amp;rsquo;s role in the write path.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Profile accumulation&lt;/strong&gt;: The segment-writer receives profiles from &lt;a href=&#34;../distributor/&#34;&gt;distributors&lt;/a&gt; and accumulates them in memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Segment creation&lt;/strong&gt;: Profiles are batched into small blocks called segments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object storage write&lt;/strong&gt;: Segments are written directly to object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata update&lt;/strong&gt;: The segment-writer updates the &lt;a href=&#34;../metastore/&#34;&gt;metastore&lt;/a&gt; with metadata about newly created segments.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;key-features&#34;&gt;Key features&lt;/h2&gt;
&lt;h3 id=&#34;single-object-per-shard&#34;&gt;Single object per shard&lt;/h3&gt;
&lt;p&gt;Each segment-writer produces a single object per shard containing data from all tenant services assigned to that shard. This approach minimizes the number of write operations to object storage, significantly reducing costs compared to writing individual objects for each tenant or service.&lt;/p&gt;
&lt;h3 id=&#34;synchronous-ingestion&#34;&gt;Synchronous ingestion&lt;/h3&gt;
&lt;p&gt;Ingestion clients are blocked until data is durably stored in object storage and an entry for the object is created in the metadata index. This guarantees data durability without requiring local disk persistence.&lt;/p&gt;
&lt;p&gt;By default, ingestion is synchronous with median latency expected to be less than 500ms using default settings and popular object storage providers such as Amazon S3, Google Cloud Storage, and Azure Blob Storage.&lt;/p&gt;
&lt;h3 id=&#34;in-memory-accumulation&#34;&gt;In-memory accumulation&lt;/h3&gt;
&lt;p&gt;Profiles are accumulated in an in-memory database before being flushed to object storage. The in-memory structure includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Profile index&lt;/strong&gt;: Efficient indexing for accumulated profiles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inverted index&lt;/strong&gt;: For label-based lookups during segment creation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-co-location&#34;&gt;Data co-location&lt;/h3&gt;
&lt;p&gt;Profiles from the same application (identified by the &lt;code&gt;service_name&lt;/code&gt; label) are co-located in the same segments. This co-location is maintained by the distributor&amp;rsquo;s routing algorithm and is crucial for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved query performance&lt;/li&gt;
&lt;li&gt;Increased compaction efficiency&lt;/li&gt;
&lt;li&gt;Optimized storage usage&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;stateless-design&#34;&gt;Stateless design&lt;/h2&gt;
&lt;p&gt;Unlike the v1 ingester which required local disk storage, the segment-writer is completely stateless:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires no persistent local storage&lt;/li&gt;
&lt;li&gt;Writes all data directly to object storage&lt;/li&gt;
&lt;li&gt;Scales horizontally by adding more instances&lt;/li&gt;
&lt;li&gt;Allows instances to be added or removed without data migration&lt;/li&gt;
&lt;li&gt;Recovers immediately after failure (no WAL replay needed)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;segment-lifecycle&#34;&gt;Segment lifecycle&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Creation&lt;/strong&gt;: Profiles are accumulated in memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flush&lt;/strong&gt;: When conditions are met (time or size thresholds), a segment is written to object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Registration&lt;/strong&gt;: Segment metadata is registered in the metastore.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction&lt;/strong&gt;: Small segments are later merged into larger blocks by &lt;a href=&#34;../compaction-worker/&#34;&gt;compaction-workers&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;failure-handling&#34;&gt;Failure handling&lt;/h2&gt;
&lt;p&gt;The segment writer relies on at-least-once delivery semantics. If a write fails after the segment has been uploaded to object storage but before the metastore acknowledges the metadata, the client retries the request. This can result in the same profile appearing in multiple segments, which is resolved during compaction.&lt;/p&gt;
&lt;h3 id=&#34;dead-letter-queue&#34;&gt;Dead letter queue&lt;/h3&gt;
&lt;p&gt;If the segment writer cannot register metadata with the metastore (for example, during metastore unavailability), the metadata is written to a dead letter queue (DLQ) directory in object storage. The metastore recovers these entries in the background, ensuring that data is eventually made visible to queries.&lt;/p&gt;
&lt;h2 id=&#34;deployment&#34;&gt;Deployment&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The segment writer runs as a StatefulSet without persistent volumes.&lt;/li&gt;
&lt;li&gt;It participates in a hash ring used by the &lt;a href=&#34;../distributor/&#34;&gt;distributor&lt;/a&gt; to route profiles.&lt;/li&gt;
&lt;li&gt;Multiple segment writers can write to the same shard during topology changes (for example, scaling events or rollouts). This is expected and handled by compaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;configuration&#34;&gt;Configuration&lt;/h2&gt;
&lt;p&gt;The segment-writer flush behavior can be configured to balance between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: How quickly data becomes queryable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Number of write operations to object storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory usage&lt;/strong&gt;: Amount of data held in memory before flush&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="pyroscope-v2-segment-writer">Pyroscope v2 segment-writer&lt;/h1>
&lt;p>The segment-writer is a stateless component that accumulates incoming profiles in memory and periodically writes them to object storage as segments. This is a new component in v2 that replaces the ingester&amp;rsquo;s role in the write path.&lt;/p></description></item><item><title>Pyroscope v2 metastore</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/metastore/</link><pubDate>Mon, 20 Apr 2026 09:02:32 +0000</pubDate><guid>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/metastore/</guid><content><![CDATA[&lt;h1 id=&#34;pyroscope-v2-metastore&#34;&gt;Pyroscope v2 metastore&lt;/h1&gt;
&lt;p&gt;The metastore is the only stateful component in the Pyroscope v2 architecture. It maintains the metadata index for all data objects stored in object storage and coordinates the compaction process.&lt;/p&gt;
&lt;h2 id=&#34;responsibilities&#34;&gt;Responsibilities&lt;/h2&gt;
&lt;p&gt;The metastore service is responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metadata index&lt;/strong&gt;: Maintaining an index of all blocks and segments in object storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction coordination&lt;/strong&gt;: Scheduling and coordinating compaction jobs for &lt;a href=&#34;../compaction-worker/&#34;&gt;compaction-workers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query planning&lt;/strong&gt;: Providing metadata to &lt;a href=&#34;../query-frontend/&#34;&gt;query-frontend&lt;/a&gt; for locating data objects&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data placement&lt;/strong&gt;: Managing placement rules for the data distribution algorithm&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention enforcement&lt;/strong&gt;: Applying time-based retention policies and generating tombstones for expired data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;raft-consensus&#34;&gt;Raft consensus&lt;/h2&gt;
&lt;p&gt;The metastore uses the Raft protocol for consensus and replication, ensuring:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: All replicas maintain the same view of the metadata&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High availability&lt;/strong&gt;: The cluster can continue operating if some nodes fail&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fault tolerance&lt;/strong&gt;: Data is replicated across multiple nodes&lt;/li&gt;
&lt;/ul&gt;
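&lt;p&gt;These guarantees follow from Raft&amp;rsquo;s majority quorum: a cluster of N nodes stays available as long as more than N/2 nodes are healthy, so it tolerates floor((N-1)/2) failures:&lt;/p&gt;

```go
package main

import "fmt"

// toleratedFailures returns how many node failures a Raft cluster of
// size n can survive while still forming a majority quorum.
func toleratedFailures(n int) int {
	return (n - 1) / 2
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d nodes tolerate %d failures\n", n, toleratedFailures(n))
	}
}
```

&lt;p&gt;Note that even cluster sizes add no fault tolerance over the next smaller odd size (4 nodes tolerate 1 failure, the same as 3), which is why odd-sized clusters are the norm.&lt;/p&gt;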
&lt;h3 id=&#34;fault-tolerance&#34;&gt;Fault tolerance&lt;/h3&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Cluster size&lt;/th&gt;
          &lt;th&gt;Tolerated failures&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;3 nodes&lt;/td&gt;
          &lt;td&gt;1 node&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;5 nodes&lt;/td&gt;
          &lt;td&gt;2 nodes&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;storage-requirements&#34;&gt;Storage requirements&lt;/h2&gt;
&lt;p&gt;Even at large scale, the metastore only needs a few gigabytes of disk space for the metadata index. The index is implemented using BoltDB as the underlying key-value store.&lt;/p&gt;
&lt;p&gt;For better performance, the index database can be stored on an in-memory volume, as it&amp;rsquo;s recovered from the Raft log and snapshot on startup. Durable storage is not required for the index itself—only for the Raft log.&lt;/p&gt;
&lt;h2 id=&#34;metadata-index&#34;&gt;Metadata index&lt;/h2&gt;
&lt;p&gt;The metadata index stores information about data objects (blocks and segments) including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Block identifiers (ULID)&lt;/li&gt;
&lt;li&gt;Tenant and shard assignments&lt;/li&gt;
&lt;li&gt;Time ranges&lt;/li&gt;
&lt;li&gt;Dataset information (service names, profile types)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The index is partitioned by time, with each partition covering a 6-hour window. Within each partition, data is organized by tenant and shard.&lt;/p&gt;
&lt;p&gt;For detailed information about the metadata index structure, refer to &lt;a href=&#34;../../metadata-index/&#34;&gt;Metadata index&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;compaction-coordination&#34;&gt;Compaction coordination&lt;/h2&gt;
&lt;p&gt;The metastore coordinates the compaction process by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Job planning&lt;/strong&gt;: Creates compaction jobs when enough segments are available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job scheduling&lt;/strong&gt;: Assigns jobs to available compaction-workers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job tracking&lt;/strong&gt;: Monitors job progress and handles failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Index updates&lt;/strong&gt;: Updates the metadata index when compaction completes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The compaction service uses a lease-based ownership model with fencing tokens to prevent conflicts when workers fail or become unresponsive.&lt;/p&gt;
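&lt;p&gt;The fencing-token idea can be illustrated with a small sketch. Each time a job lease is assigned or reassigned, the metastore issues a higher token; updates carrying an older token are rejected. The types below are illustrative, not the actual implementation:&lt;/p&gt;

```go
package main

import "fmt"

// jobState tracks the highest fencing token observed for a compaction job.
type jobState struct {
	highestToken uint64
}

// accept returns true if an update carrying the given fencing token may be
// applied. Tokens issued with each new lease are increasing, so a stale
// worker holding an expired lease (and an old token) is rejected.
func (j *jobState) accept(token uint64) bool {
	if token >= j.highestToken {
		j.highestToken = token
		return true
	}
	return false
}

func main() {
	job := jobState{}
	fmt.Println(job.accept(1)) // worker A holds the lease: accepted
	fmt.Println(job.accept(2)) // lease reassigned to worker B: accepted
	fmt.Println(job.accept(1)) // stale update from worker A: rejected
}
```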
&lt;p&gt;For detailed information about the compaction process, refer to &lt;a href=&#34;../../compaction/&#34;&gt;Compaction&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;dead-letter-queue&#34;&gt;Dead letter queue&lt;/h2&gt;
&lt;p&gt;If the metastore is temporarily unavailable, &lt;a href=&#34;../segment-writer/&#34;&gt;segment writers&lt;/a&gt; fall back to writing metadata to a dead letter queue (DLQ) directory in object storage. The metastore recovers these entries in the background once it becomes available again.&lt;/p&gt;
&lt;h2 id=&#34;retention&#34;&gt;Retention&lt;/h2&gt;
&lt;p&gt;The metastore enforces time-based retention policies on a per-tenant basis. Retention operates at the partition level: entire partitions are removed when they exceed the configured retention period, rather than evaluating individual blocks. When partitions are deleted, tombstones are created for the underlying data objects, which are eventually cleaned up by compaction workers.&lt;/p&gt;
&lt;h2 id=&#34;query-support&#34;&gt;Query support&lt;/h2&gt;
&lt;p&gt;The metastore provides linearizable reads for query operations, ensuring that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Queries observe the most recent committed state&lt;/li&gt;
&lt;li&gt;Previous writes are visible to read operations&lt;/li&gt;
&lt;li&gt;Both leader and follower replicas can serve queries&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;leader-election&#34;&gt;Leader election&lt;/h2&gt;
&lt;p&gt;One metastore instance is elected as the leader through Raft consensus. The leader is responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Processing write requests&lt;/li&gt;
&lt;li&gt;Coordinating compaction scheduling&lt;/li&gt;
&lt;li&gt;Enforcing retention policies&lt;/li&gt;
&lt;li&gt;Running cleanup operations&lt;/li&gt;
&lt;li&gt;Recovering metadata entries from the dead letter queue&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Follower replicas can serve read requests, distributing the query load across the cluster.&lt;/p&gt;
]]></content><description>&lt;h1 id="pyroscope-v2-metastore">Pyroscope v2 metastore&lt;/h1>
&lt;p>The metastore is the only stateful component in the Pyroscope v2 architecture. It maintains the metadata index for all data objects stored in object storage and coordinates the compaction process.&lt;/p></description></item><item><title>Pyroscope v2 compaction-worker</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/compaction-worker/</link><pubDate>Mon, 20 Apr 2026 09:02:32 +0000</pubDate><guid>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/compaction-worker/</guid><content><![CDATA[&lt;h1 id=&#34;pyroscope-v2-compaction-worker&#34;&gt;Pyroscope v2 compaction-worker&lt;/h1&gt;
&lt;p&gt;The compaction-worker is a stateless component responsible for merging small segments into larger blocks. This improves query performance by reducing the number of objects that need to be read from object storage.&lt;/p&gt;
&lt;h2 id=&#34;why-compaction-is-needed&#34;&gt;Why compaction is needed&lt;/h2&gt;
&lt;p&gt;The ingestion pipeline creates many small segments—potentially millions of objects per hour at scale. Without compaction, this leads to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read amplification&lt;/strong&gt;: Queries must fetch many small objects&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Increased costs&lt;/strong&gt;: More API calls to object storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata bloat&lt;/strong&gt;: The metastore index grows without bound&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance degradation&lt;/strong&gt;: Both read and write paths slow down&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Job polling&lt;/strong&gt;: Workers poll the &lt;a href=&#34;../metastore/&#34;&gt;metastore&lt;/a&gt; for available compaction jobs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Segment download&lt;/strong&gt;: Workers download source segments from object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge operation&lt;/strong&gt;: Matching datasets from different segments are merged.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Block upload&lt;/strong&gt;: The compacted block is uploaded to object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Status report&lt;/strong&gt;: Workers report job completion to the metastore.&lt;/li&gt;
&lt;/ol&gt;
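&lt;p&gt;The five steps above can be sketched as one iteration of a worker loop. All names and the merge logic are stand-ins for the real metastore RPCs and object storage I/O:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// job is an illustrative compaction job handed out by the metastore.
type job struct {
	id       string
	segments []string
}

func pollJob() (job, bool) { // 1. poll the metastore for work
	return job{id: "job-1", segments: []string{"seg-a", "seg-b"}}, true
}

func download(seg string) string { return "data(" + seg + ")" } // 2. fetch a source segment

func merge(parts []string) string { // 3. merge matching datasets (stubbed as concatenation)
	return strings.Join(parts, "+")
}

func upload(block string) string { return "blocks/" + block } // 4. write the compacted block

func report(jobID, path string) { // 5. report completion to the metastore
	fmt.Printf("job %s done: %s\n", jobID, path)
}

func main() {
	if j, ok := pollJob(); ok {
		parts := make([]string, 0, len(j.segments))
		for _, s := range j.segments {
			parts = append(parts, download(s))
		}
		report(j.id, upload(merge(parts)))
	}
}
```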
&lt;h2 id=&#34;compaction-speed&#34;&gt;Compaction speed&lt;/h2&gt;
&lt;p&gt;Compaction workers compact data as soon as possible after it&amp;rsquo;s written to object storage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Median time to first compaction&lt;/strong&gt;: Less than 15 seconds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous operation&lt;/strong&gt;: Workers constantly poll for new jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures that query performance remains optimal even under high ingestion rates.&lt;/p&gt;
&lt;h2 id=&#34;job-scheduling&#34;&gt;Job scheduling&lt;/h2&gt;
&lt;p&gt;Compaction jobs are coordinated by the metastore, which:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creates jobs when enough segments are available for compaction&lt;/li&gt;
&lt;li&gt;Assigns jobs to workers based on available capacity&lt;/li&gt;
&lt;li&gt;Tracks job progress and handles failures&lt;/li&gt;
&lt;li&gt;Uses a &amp;ldquo;Small Job First&amp;rdquo; strategy to prioritize smaller blocks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Workers specify their available capacity when polling for jobs, allowing the system to adapt to the available resources.&lt;/p&gt;
&lt;h2 id=&#34;data-layout&#34;&gt;Data layout&lt;/h2&gt;
&lt;p&gt;Profiling data from each service (identified by the &lt;code&gt;service_name&lt;/code&gt; label) is stored as a separate dataset within a block. During compaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Matching datasets from different blocks are merged&lt;/li&gt;
&lt;li&gt;TSDB indexes are combined&lt;/li&gt;
&lt;li&gt;Symbols and profile tables are merged and rewritten&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The output block contains non-overlapping, independent datasets optimized for efficient reading.&lt;/p&gt;
&lt;h2 id=&#34;stateless-design&#34;&gt;Stateless design&lt;/h2&gt;
&lt;p&gt;Compaction workers are completely stateless:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Require no persistent local storage&lt;/li&gt;
&lt;li&gt;Scale horizontally by adding more instances&lt;/li&gt;
&lt;li&gt;Allow instances to be added or removed at any time&lt;/li&gt;
&lt;li&gt;Use default concurrency based on available CPU cores&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;fault-tolerance&#34;&gt;Fault tolerance&lt;/h2&gt;
&lt;p&gt;If a compaction worker fails:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The job lease expires&lt;/li&gt;
&lt;li&gt;The metastore reassigns the job to another worker&lt;/li&gt;
&lt;li&gt;Source segments remain in object storage until compaction succeeds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Jobs that repeatedly fail are deprioritized to prevent blocking the compaction queue.&lt;/p&gt;
&lt;h2 id=&#34;garbage-collection&#34;&gt;Garbage collection&lt;/h2&gt;
&lt;p&gt;After compaction completes, the original source blocks are not immediately deleted. Instead, tombstones are created in the metastore. The actual deletion happens after a configurable delay, giving queries time to discover the new compacted blocks and stop accessing the original ones. Eventually, tombstones are included in compaction jobs, and the worker removes the source objects from object storage.&lt;/p&gt;
&lt;p&gt;For detailed information about the compaction process, refer to &lt;a href=&#34;../../compaction/&#34;&gt;Compaction&lt;/a&gt;.&lt;/p&gt;
]]></content><description>&lt;h1 id="pyroscope-v2-compaction-worker">Pyroscope v2 compaction-worker&lt;/h1>
&lt;p>The compaction-worker is a stateless component responsible for merging small segments into larger blocks. This improves query performance by reducing the number of objects that need to be read from object storage.&lt;/p></description></item><item><title>Pyroscope v2 query-frontend</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/query-frontend/</link><pubDate>Mon, 20 Apr 2026 09:02:32 +0000</pubDate><guid>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/query-frontend/</guid><content><![CDATA[&lt;h1 id=&#34;pyroscope-v2-query-frontend&#34;&gt;Pyroscope v2 query-frontend&lt;/h1&gt;
&lt;p&gt;The query-frontend is a stateless component that serves as the entry point for the query path. It handles query planning and routes requests to &lt;a href=&#34;../query-backend/&#34;&gt;query-backend&lt;/a&gt; instances for execution.&lt;/p&gt;
&lt;h2 id=&#34;responsibilities&#34;&gt;Responsibilities&lt;/h2&gt;
&lt;p&gt;The query-frontend is responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Receiving and validating queries&lt;/strong&gt; through the Query API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Executing queries&lt;/strong&gt; by using the &lt;a href=&#34;../metastore/&#34;&gt;metastore&lt;/a&gt; for block discovery and delegating execution to &lt;a href=&#34;../query-backend/&#34;&gt;query-backend&lt;/a&gt; instances&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;query-flow&#34;&gt;Query flow&lt;/h2&gt;
&lt;p&gt;When a query arrives, the query frontend:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Validates the query request.&lt;/li&gt;
&lt;li&gt;Queries the &lt;a href=&#34;../metastore/&#34;&gt;metastore&lt;/a&gt; to find all blocks matching the query criteria (time range, tenant, and optionally service name).&lt;/li&gt;
&lt;li&gt;Builds a physical query plan as a tree: leaf nodes are read operations targeting specific blocks and datasets, while intermediate nodes are merge operations that combine results from their children.&lt;/li&gt;
&lt;li&gt;Sends the plan root to a &lt;a href=&#34;../query-backend/&#34;&gt;query backend&lt;/a&gt; instance, which distributes subtrees to other query backend instances for parallel execution and merging. For more details, refer to &lt;a href=&#34;../query-backend/&#34;&gt;Query backend&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because the metastore serves block metadata from memory with linearizable reads, query planning is fast and does not require the query frontend to maintain any local state about blocks.&lt;/p&gt;
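&lt;p&gt;Plan construction can be sketched as grouping block reads under merge nodes of bounded fan-out. The node structure and fan-out value are assumptions for illustration, not the actual planner:&lt;/p&gt;

```go
package main

import "fmt"

// planNode is a node in the physical query plan: either a read of one
// block (a leaf) or a merge of its children's results.
type planNode struct {
	block    string     // set on leaf (read) nodes
	children []planNode // set on merge nodes
}

// buildPlan turns the metastore's block list into a tree by repeatedly
// grouping nodes under merge nodes of at most `fanout` children.
func buildPlan(blocks []string, fanout int) planNode {
	nodes := make([]planNode, 0, len(blocks))
	for _, b := range blocks {
		nodes = append(nodes, planNode{block: b})
	}
	for len(nodes) > 1 {
		var next []planNode
		for len(nodes) > 0 {
			end := fanout
			if end > len(nodes) {
				end = len(nodes)
			}
			next = append(next, planNode{children: nodes[:end]})
			nodes = nodes[end:]
		}
		nodes = next
	}
	return nodes[0]
}

// countLeaves counts read nodes, i.e. one per block.
func countLeaves(n planNode) int {
	if len(n.children) == 0 {
		return 1
	}
	total := 0
	for _, c := range n.children {
		total += countLeaves(c)
	}
	return total
}

func main() {
	plan := buildPlan([]string{"b1", "b2", "b3", "b4", "b5"}, 2)
	fmt.Println(countLeaves(plan)) // 5: one read node per block
}
```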
&lt;h2 id=&#34;stateless-design&#34;&gt;Stateless design&lt;/h2&gt;
&lt;p&gt;The query-frontend is completely stateless:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires no persistent storage&lt;/li&gt;
&lt;li&gt;Scales horizontally to hundreds of instances&lt;/li&gt;
&lt;li&gt;Allows instances to be added or removed without coordination&lt;/li&gt;
&lt;li&gt;Supports auto-scaling based on query load&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;scalability&#34;&gt;Scalability&lt;/h2&gt;
&lt;p&gt;The query-frontend can scale independently of the write path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Heavy query workloads don&amp;rsquo;t impact ingestion performance&lt;/li&gt;
&lt;li&gt;Handles increased query volume by adding more instances&lt;/li&gt;
&lt;li&gt;Works with any number of query-backend instances&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;load-balancing&#34;&gt;Load balancing&lt;/h2&gt;
&lt;p&gt;Query-frontends can be load balanced using standard HTTP load balancers. Each instance can handle any query, making round-robin load balancing effective.&lt;/p&gt;
]]></content><description>&lt;h1 id="pyroscope-v2-query-frontend">Pyroscope v2 query-frontend&lt;/h1>
&lt;p>The query-frontend is a stateless component that serves as the entry point for the query path. It handles query planning and routes requests to &lt;a href="../query-backend/">query-backend&lt;/a> instances for execution.&lt;/p></description></item><item><title>Pyroscope v2 query-backend</title><link>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/query-backend/</link><pubDate>Mon, 20 Apr 2026 09:02:32 +0000</pubDate><guid>https://grafana.com/docs/pyroscope/v2.0.x/reference-pyroscope-v2-architecture/components/query-backend/</guid><content><![CDATA[&lt;h1 id=&#34;pyroscope-v2-query-backend&#34;&gt;Pyroscope v2 query-backend&lt;/h1&gt;
&lt;p&gt;The query-backend is a stateless component that executes queries with high parallelism. It reads data directly from object storage and processes it according to the query plan received from the &lt;a href=&#34;../query-frontend/&#34;&gt;query-frontend&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;../query-frontend/&#34;&gt;query frontend&lt;/a&gt; builds a physical query plan as a tree structure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read nodes&lt;/strong&gt; (leaves) fetch and process data from specific blocks in object storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge nodes&lt;/strong&gt; (intermediate) combine results from their child nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query frontend sends the plan root to a query backend instance. That instance distributes subtrees to other query backend instances for parallel execution, collects their results, and merges them. The final merged result is returned to the query frontend, which forwards it to the client.&lt;/p&gt;
&lt;p&gt;This tree-based execution allows queries to fan out across many query backend instances in parallel, with merging happening at each level of the tree rather than in a single aggregation point.&lt;/p&gt;
&lt;h2 id=&#34;direct-object-storage-access&#34;&gt;Direct object storage access&lt;/h2&gt;
&lt;p&gt;Unlike v1 where queries may need to access ingesters for recent data, the v2 query-backend reads directly from object storage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No coordination with write-path components needed&lt;/li&gt;
&lt;li&gt;Simplified query execution&lt;/li&gt;
&lt;li&gt;Better isolation between read and write paths&lt;/li&gt;
&lt;li&gt;Easier horizontal scaling&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;stateless-design&#34;&gt;Stateless design&lt;/h2&gt;
&lt;p&gt;The query-backend is completely stateless:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires no persistent storage&lt;/li&gt;
&lt;li&gt;Needs no caching layer (reads directly from object storage)&lt;/li&gt;
&lt;li&gt;Scales horizontally to hundreds of instances&lt;/li&gt;
&lt;li&gt;Allows instances to be added or removed without coordination&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;scalability&#34;&gt;Scalability&lt;/h2&gt;
&lt;p&gt;The query-backend enables horizontal scaling of the read path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Handles heavier query workloads by adding more instances&lt;/li&gt;
&lt;li&gt;Scales independently of the write path&lt;/li&gt;
&lt;li&gt;Shares no state between instances&lt;/li&gt;
&lt;li&gt;Supports auto-scaling based on query load&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;performance-characteristics&#34;&gt;Performance characteristics&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High parallelism&lt;/strong&gt;: Multiple blocks processed concurrently&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory efficient&lt;/strong&gt;: Tree-based execution minimizes memory requirements&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network optimized&lt;/strong&gt;: Results combined close to the data source&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="pyroscope-v2-query-backend">Pyroscope v2 query-backend&lt;/h1>
&lt;p>The query-backend is a stateless component that executes queries with high parallelism. It reads data directly from object storage and processes it according to the query plan received from the &lt;a href="../query-frontend/">query-frontend&lt;/a>.&lt;/p></description></item></channel></rss>