<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Grafana Mimir architecture on Grafana Labs</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/</link><description>Recent content in Grafana Mimir architecture on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/mimir/v2.2.x/operators-guide/architecture/index.xml" rel="self" type="application/rss+xml"/><item><title>About the Grafana Mimir architecture</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/about-grafana-mimir-architecture/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/about-grafana-mimir-architecture/</guid><content><![CDATA[&lt;h1 id=&#34;about-the-grafana-mimir-architecture&#34;&gt;About the Grafana Mimir architecture&lt;/h1&gt;
&lt;p&gt;Grafana Mimir has a microservices-based architecture.
The system has multiple horizontally scalable microservices that can run separately and in parallel.
Grafana Mimir microservices are called components.&lt;/p&gt;
&lt;p&gt;Grafana Mimir&amp;rsquo;s design compiles the code for all components into a single binary.
The &lt;code&gt;-target&lt;/code&gt; parameter controls which component(s) the single binary runs as. For a simple way to get started, you can also run Grafana Mimir in &lt;a href=&#34;../deployment-modes/#monolithic-mode&#34;&gt;monolithic mode&lt;/a&gt;, with all components running simultaneously in one process.
For more information, refer to &lt;a href=&#34;../deployment-modes/&#34;&gt;Deployment modes&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;grafana-mimir-components&#34;&gt;Grafana Mimir components&lt;/h2&gt;
&lt;p&gt;Most components are stateless and do not require any data persisted between process restarts. Some components are stateful and rely on non-volatile storage to prevent data loss between process restarts. For details about each component, see its page in &lt;a href=&#34;../components/&#34;&gt;Components&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;the-write-path&#34;&gt;The write path&lt;/h3&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;write-path.svg&#34;
  alt=&#34;Architecture of Grafana Mimir&amp;rsquo;s write path&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Ingesters receive incoming samples from the distributors.
Each push request belongs to a tenant, and the ingester appends the received samples to the specific per-tenant TSDB that is stored on the local disk.
The samples that are received are both kept in-memory and written to a write-ahead log (WAL).
If the ingester abruptly terminates, the WAL can help to recover the in-memory series.
The per-tenant TSDB is lazily created in each ingester as soon as the first samples are received for that tenant.&lt;/p&gt;
&lt;p&gt;The in-memory samples are periodically flushed to disk, and the WAL is truncated, when a new TSDB block is created.
By default, this occurs every two hours.
Each newly created block is uploaded to long-term storage and kept in the ingester until the configured &lt;code&gt;-blocks-storage.tsdb.retention-period&lt;/code&gt; expires.
This gives &lt;a href=&#34;../components/querier/&#34;&gt;queriers&lt;/a&gt; and &lt;a href=&#34;../components/store-gateway/&#34;&gt;store-gateways&lt;/a&gt; enough time to discover the new block on the storage and download its index-header.&lt;/p&gt;
&lt;p&gt;To effectively use the WAL, and to be able to recover the in-memory series if an ingester abruptly terminates, store the WAL to a persistent disk that can survive an ingester failure.
For example, when running in the cloud, include an AWS EBS volume or a GCP persistent disk.
If you are running the Grafana Mimir cluster in Kubernetes, you can use a StatefulSet with a persistent volume claim for the ingesters.
The WAL is stored in the same filesystem location as the local TSDB blocks (compacted from head), and the two locations cannot be decoupled.&lt;/p&gt;
&lt;p&gt;For more information, refer to &lt;a href=&#34;../../running-production-environment/production-tips/#how-to-estimate--querierquery-store-after&#34;&gt;timeline of block uploads&lt;/a&gt; and &lt;a href=&#34;../components/ingester/&#34;&gt;Ingester&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;series-sharding-and-replication&#34;&gt;Series sharding and replication&lt;/h4&gt;
&lt;p&gt;By default, each time series is replicated to three ingesters, and each ingester writes its own block to the long-term storage.
The &lt;a href=&#34;../components/compactor/&#34;&gt;Compactor&lt;/a&gt; merges blocks from multiple ingesters into a single block, and removes duplicate samples.
Block compaction significantly reduces storage utilization.
For more information, refer to &lt;a href=&#34;../components/compactor/&#34;&gt;Compactor&lt;/a&gt; and &lt;a href=&#34;../../running-production-environment/production-tips/&#34;&gt;Production tips&lt;/a&gt;.&lt;/p&gt;
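&lt;p&gt;As an illustration of the deduplication step, the following Python sketch merges identical sample streams from replica ingesters, keeping a single copy of each sample. It is a simplification for illustration only: the actual compactor operates on TSDB blocks on object storage, not on in-memory dictionaries.&lt;/p&gt;

```python
def merge_deduplicate(*ingester_samples):
    """Merge samples from replica ingesters, dropping duplicates.

    Each input is a dict mapping (series_labels, timestamp) -> value.
    Hypothetical simplification of what the compactor does when it
    merges blocks written by multiple ingesters.
    """
    merged = {}
    for samples in ingester_samples:
        for key, value in samples.items():
            # First copy wins; replicas carry identical samples.
            merged.setdefault(key, value)
    return merged


# With a replication factor of 3, the same sample appears in three blocks:
replica = {(("__name__", "up"), 1000): 1.0}
merged = merge_deduplicate(replica, dict(replica), dict(replica))
assert len(merged) == 1  # storage holds a single copy after compaction
```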
&lt;h3 id=&#34;the-read-path&#34;&gt;The read path&lt;/h3&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;read-path.svg&#34;
  alt=&#34;Architecture of Grafana Mimir&amp;rsquo;s read path&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Queries coming into Grafana Mimir arrive at the &lt;a href=&#34;../components/query-frontend/&#34;&gt;query-frontend&lt;/a&gt;. The query-frontend then splits queries over longer time ranges into multiple, smaller queries.&lt;/p&gt;
&lt;p&gt;The query-frontend next checks the results cache. If the result of a query has been cached, the query-frontend returns the cached results. Queries that cannot be answered from the results cache are put into an in-memory queue within the query-frontend.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you run the optional &lt;a href=&#34;../components/query-scheduler/&#34;&gt;query-scheduler&lt;/a&gt; component, this queue is maintained in the query-scheduler instead of the query-frontend.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The queriers act as workers, pulling queries from the queue.&lt;/p&gt;
&lt;p&gt;The queriers connect to the store-gateways and the ingesters to fetch all the data needed to execute a query. For more information about how the query is executed, refer to &lt;a href=&#34;../components/querier/&#34;&gt;querier&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After the querier executes the query, it returns the results to the query-frontend for aggregation. The query-frontend then returns the aggregated results to the client.&lt;/p&gt;
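&lt;p&gt;The splitting step can be sketched as follows. The split interval is configurable in Grafana Mimir; the 24-hour value and the interval-aligned boundaries below are illustrative assumptions.&lt;/p&gt;

```python
def split_time_range(start_ms, end_ms, interval_ms):
    """Split [start, end) into interval-aligned sub-ranges, as the
    query-frontend does for queries spanning long time ranges.
    Boundaries are aligned to the interval so cached sub-results can
    be reused across queries (an assumption of this sketch)."""
    splits = []
    cursor = start_ms
    while cursor < end_ms:
        # Advance to the next interval edge, capped at the query end.
        boundary = (cursor // interval_ms + 1) * interval_ms
        splits.append((cursor, min(boundary, end_ms)))
        cursor = boundary
    return splits


DAY_MS = 24 * 3600 * 1000
# A 3-day query starting mid-day becomes four sub-queries.
subqueries = split_time_range(DAY_MS // 2, 3 * DAY_MS + DAY_MS // 2, DAY_MS)
assert len(subqueries) == 4
```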
&lt;h2 id=&#34;the-role-of-prometheus&#34;&gt;The role of Prometheus&lt;/h2&gt;
&lt;p&gt;Prometheus instances scrape samples from various targets and push them to Grafana Mimir by using Prometheus’ &lt;a href=&#34;https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;remote write API&lt;/a&gt;.
The remote write API sends batched &lt;a href=&#34;https://google.github.io/snappy/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Snappy&lt;/a&gt;-compressed &lt;a href=&#34;https://developers.google.com/protocol-buffers/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Protocol Buffer&lt;/a&gt; messages in the body of an HTTP &lt;code&gt;POST&lt;/code&gt; request.&lt;/p&gt;
&lt;p&gt;Mimir requires that each HTTP request has a header that specifies a tenant ID for the request. Request &lt;a href=&#34;../../securing/authentication-and-authorization/&#34;&gt;authentication and authorization&lt;/a&gt; are handled by an external reverse proxy.&lt;/p&gt;
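&lt;p&gt;A minimal sketch of the headers such a push request carries. The request body itself is a Snappy-compressed protobuf &lt;code&gt;WriteRequest&lt;/code&gt;, which needs the Prometheus protobuf definitions and a Snappy library, so only the header handling is shown here.&lt;/p&gt;

```python
def remote_write_headers(tenant_id):
    """Build the HTTP headers for a Prometheus remote-write push to Mimir.

    The first three headers come from the Prometheus remote-write
    specification; X-Scope-OrgID is the header Mimir reads the tenant ID
    from. This sketch omits building the request body.
    """
    return {
        "Content-Encoding": "snappy",
        "Content-Type": "application/x-protobuf",
        "X-Prometheus-Remote-Write-Version": "0.1.0",
        # Without a tenant ID header, a multi-tenant Mimir cluster
        # rejects the request.
        "X-Scope-OrgID": tenant_id,
    }


headers = remote_write_headers("team-a")
assert headers["X-Scope-OrgID"] == "team-a"
```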
&lt;p&gt;Incoming samples (writes from Prometheus) are handled by the &lt;a href=&#34;../components/distributor/&#34;&gt;distributor&lt;/a&gt;, and incoming reads (PromQL queries) are handled by the &lt;a href=&#34;../components/query-frontend/&#34;&gt;query frontend&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;long-term-storage&#34;&gt;Long-term storage&lt;/h2&gt;
&lt;p&gt;The Grafana Mimir storage format is based on &lt;a href=&#34;https://prometheus.io/docs/prometheus/latest/storage/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Prometheus TSDB storage&lt;/a&gt;.
The Grafana Mimir storage format stores each tenant&amp;rsquo;s time series in its own TSDB, which persists the series to on-disk blocks.
By default, each block has a two-hour range.
Each on-disk block directory contains an index file, a file containing metadata, and the time series chunks.&lt;/p&gt;
&lt;p&gt;The TSDB block files contain samples for multiple series.
The series inside the blocks are indexed by a per-block index, which indexes both metric names and labels to time series in the block files.&lt;/p&gt;
&lt;p&gt;Grafana Mimir requires one of the following object stores for the block files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://aws.amazon.com/s3&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Amazon S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloud.google.com/storage/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Google Cloud Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://azure.microsoft.com/en-us/services/storage/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Microsoft Azure Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wiki.openstack.org/wiki/Swift&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;OpenStack Swift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Local Filesystem (single node only)&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="about-the-grafana-mimir-architecture">About the Grafana Mimir architecture&lt;/h1>
&lt;p>Grafana Mimir has a microservices-based architecture.
The system has multiple horizontally scalable microservices that can run separately and in parallel.
Grafana Mimir microservices are called components.&lt;/p></description></item><item><title>Grafana Mimir deployment modes</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/deployment-modes/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/deployment-modes/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-deployment-modes&#34;&gt;Grafana Mimir deployment modes&lt;/h1&gt;
&lt;p&gt;You can deploy Grafana Mimir in one of two modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Monolithic mode&lt;/li&gt;
&lt;li&gt;Microservices mode&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The deployment mode is determined by the &lt;code&gt;-target&lt;/code&gt; parameter, which you can set via CLI flag or YAML configuration.&lt;/p&gt;
&lt;h2 id=&#34;monolithic-mode&#34;&gt;Monolithic mode&lt;/h2&gt;
&lt;p&gt;The monolithic mode runs all required components in a single process and is the default mode of operation, which you can set by specifying &lt;code&gt;-target=all&lt;/code&gt;. Monolithic mode is the simplest way to deploy Grafana Mimir and is useful if you want to get started quickly or want to work with Grafana Mimir in a development environment. To see the list of components that run when &lt;code&gt;-target&lt;/code&gt; is set to &lt;code&gt;all&lt;/code&gt;, run Grafana Mimir with the &lt;code&gt;-modules&lt;/code&gt; flag:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Bash&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-bash&#34;&gt;./mimir -modules&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;monolithic-mode.svg&#34;
  alt=&#34;Mimir&amp;rsquo;s monolithic mode&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Monolithic mode can be horizontally scaled out by deploying multiple Grafana Mimir binaries with &lt;code&gt;-target=all&lt;/code&gt;. This approach provides high availability and increased scale without the configuration complexity of the full &lt;a href=&#34;#microservices-mode&#34;&gt;microservices deployment&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;scaled-monolithic-mode.svg&#34;
  alt=&#34;Mimir&amp;rsquo;s horizontally scaled monolithic mode&#34;/&gt;&lt;/p&gt;
&lt;h2 id=&#34;microservices-mode&#34;&gt;Microservices mode&lt;/h2&gt;
&lt;p&gt;In microservices mode, components are deployed in distinct processes. Scaling is per component, which allows for greater flexibility in scaling and more granular failure domains. Microservices mode is the preferred method for a production deployment, but it is also the most complex.&lt;/p&gt;
&lt;p&gt;In microservices mode, each Grafana Mimir process is invoked with its &lt;code&gt;-target&lt;/code&gt; parameter set to a specific Grafana Mimir component (for example, &lt;code&gt;-target=ingester&lt;/code&gt; or &lt;code&gt;-target=distributor&lt;/code&gt;). To get a working Grafana Mimir instance, you must deploy every required component. For more information about each of the Grafana Mimir components, refer to &lt;a href=&#34;../&#34;&gt;Architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you are interested in deploying Grafana Mimir in microservices mode, we recommend that you use &lt;a href=&#34;https://kubernetes.io/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Kubernetes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;microservices-mode.svg&#34;
  alt=&#34;Mimir&amp;rsquo;s microservices mode&#34;/&gt;&lt;/p&gt;
]]></content><description>&lt;h1 id="grafana-mimir-deployment-modes">Grafana Mimir deployment modes&lt;/h1>
&lt;p>You can deploy Grafana Mimir in one of two modes:&lt;/p>
&lt;ul>
&lt;li>Monolithic mode&lt;/li>
&lt;li>Microservices mode&lt;/li>
&lt;/ul>
&lt;p>The deployment mode is determined by the &lt;code>-target&lt;/code> parameter, which you can set via CLI flag or YAML configuration.&lt;/p></description></item><item><title>Grafana Mimir components</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/components/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/components/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-components&#34;&gt;Grafana Mimir components&lt;/h1&gt;
&lt;p&gt;Grafana Mimir includes a set of components that interact to form a cluster.&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/compactor/&#34;&gt;Compactor&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/distributor/&#34;&gt;Distributor&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/ingester/&#34;&gt;Ingester&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/querier/&#34;&gt;Querier&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/query-frontend/&#34;&gt;Query-frontend&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/store-gateway/&#34;&gt;Store-gateway&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/alertmanager/&#34;&gt;(Optional) Alertmanager&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/overrides-exporter/&#34;&gt;(Optional) Overrides-exporter&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/query-scheduler/&#34;&gt;(Optional) Query-scheduler&lt;/a&gt;&lt;/li&gt;&lt;li&gt;
    &lt;a href=&#34;/docs/mimir/v2.2.x/operators-guide/architecture/components/ruler/&#34;&gt;(Optional) Ruler&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
]]></content><description>&lt;h1 id="grafana-mimir-components">Grafana Mimir components&lt;/h1>
&lt;p>Grafana Mimir includes a set of components that interact to form a cluster.&lt;/p>
&lt;ul>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/compactor/">Compactor&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/distributor/">Distributor&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/ingester/">Ingester&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/querier/">Querier&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/query-frontend/">Query-frontend&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/store-gateway/">Store-gateway&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/alertmanager/">(Optional) Alertmanager&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/overrides-exporter/">(Optional) Overrides-exporter&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/query-scheduler/">(Optional) Query-scheduler&lt;/a>&lt;/li>&lt;li>
&lt;a href="/docs/mimir/v2.2.x/operators-guide/architecture/components/ruler/">(Optional) Ruler&lt;/a>&lt;/li>&lt;/ul></description></item><item><title>Grafana Mimir binary index-header</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/binary-index-header/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/binary-index-header/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-binary-index-header&#34;&gt;Grafana Mimir binary index-header&lt;/h1&gt;
&lt;p&gt;To query series inside blocks from object storage, the &lt;a href=&#34;../components/store-gateway/&#34;&gt;store-gateway&lt;/a&gt; must obtain information about each block index.
To obtain the required information, the store-gateway builds an index-header for each block and stores it on local disk.&lt;/p&gt;
&lt;p&gt;The store-gateway uses &lt;code&gt;GET&lt;/code&gt; byte-range requests to build the index-header, which contains specific sections of the block&amp;rsquo;s index. The store-gateway uses the index-header at query time.&lt;/p&gt;
&lt;p&gt;Because downloading specific sections of the original block&amp;rsquo;s index is a computationally cheap operation, the index-header is not uploaded to the object storage.
If the index-header is not available on local disk, a store-gateway instance (for example, after a rolling update completes without a persistent disk) rebuilds it from the original block&amp;rsquo;s index.&lt;/p&gt;
&lt;h2 id=&#34;format-version-1&#34;&gt;Format (version 1)&lt;/h2&gt;
&lt;p&gt;The index-header is a subset of the block index and contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#symbol-table&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Symbol Table&lt;/a&gt;: Used to unintern string values&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#postings-offset-table&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Posting Offset Table&lt;/a&gt;: Used to look up postings&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following example shows the format of the index-header file that is located in each block store-gateway local directory. It is terminated by a table of contents that serves as an entry point into the index.&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;┌─────────────────────────────┬───────────────────────────────┐
│    magic(0xBAAAD792) &amp;lt;4b&amp;gt;   │      version(1) &amp;lt;1 byte&amp;gt;      │
├─────────────────────────────┬───────────────────────────────┤
│  index version(2) &amp;lt;1 byte&amp;gt;  │ index PostingOffsetTable &amp;lt;8b&amp;gt; │
├─────────────────────────────┴───────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────┐ │
│ │      Symbol Table (exact copy from original index)      │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │      Posting Offset Table (exact copy from index)       │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │                          TOC                            │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
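&lt;p&gt;Reading the fixed-size prefix of this layout can be sketched in Python. The field widths follow the diagram above; the big-endian byte order is an assumption of this sketch.&lt;/p&gt;

```python
import struct

MAGIC = 0xBAAAD792


def parse_index_header_prefix(data):
    """Decode the fixed-size prefix of a version 1 index-header file:
    4-byte magic, 1-byte version, 1-byte index version, then the 8-byte
    offset of the index's posting offset table. Field widths follow the
    layout diagram; big-endian byte order is assumed here."""
    magic, version, index_version, postings_table_off = struct.unpack_from(
        ">IBBQ", data, 0
    )
    if magic != MAGIC:
        raise ValueError("not an index-header file")
    return version, index_version, postings_table_off


# Round-trip a synthetic prefix to exercise the parser:
raw = struct.pack(">IBBQ", MAGIC, 1, 2, 1234)
assert parse_index_header_prefix(raw) == (1, 2, 1234)
```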
]]></content><description>&lt;h1 id="grafana-mimir-binary-index-header">Grafana Mimir binary index-header&lt;/h1>
&lt;p>To query series inside blocks from object storage, the &lt;a href="../components/store-gateway/">store-gateway&lt;/a> must obtain information about each block index.
To obtain the required information, the store-gateway builds an index-header for each block and stores it on local disk.&lt;/p></description></item><item><title>Grafana Mimir bucket index</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/bucket-index/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/bucket-index/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-bucket-index&#34;&gt;Grafana Mimir bucket index&lt;/h1&gt;
&lt;p&gt;The bucket index is a per-tenant file that contains the list of blocks and block deletion marks in the storage. The bucket index is stored in the backend object storage, is periodically updated by the compactor, and used by queriers, store-gateways, and rulers (in &lt;a href=&#34;../components/ruler/#internal&#34;&gt;internal&lt;/a&gt; operational mode) to discover blocks in the storage.&lt;/p&gt;
&lt;p&gt;The bucket index is enabled by default, but is optional. It can be disabled via &lt;code&gt;-blocks-storage.bucket-store.bucket-index.enabled=false&lt;/code&gt; (or its respective YAML configuration option).
Disabling the bucket index is not recommended.&lt;/p&gt;
&lt;h2 id=&#34;benefits&#34;&gt;Benefits&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;../components/querier/&#34;&gt;querier&lt;/a&gt;, &lt;a href=&#34;../components/store-gateway/&#34;&gt;store-gateway&lt;/a&gt;, and &lt;a href=&#34;../components/ruler/&#34;&gt;ruler&lt;/a&gt; need an almost up-to-date view of the storage bucket: the querier must find the right blocks to look up at query time, and the store-gateway must load each block&amp;rsquo;s &lt;a href=&#34;../binary-index-header/&#34;&gt;index-header&lt;/a&gt;.
Without the bucket index, they would need to periodically scan the bucket to look for new blocks uploaded by the ingesters or compactor, and for blocks deleted (or marked for deletion) by the compactor.&lt;/p&gt;
&lt;p&gt;When the bucket index is enabled, the querier, store-gateway, and ruler periodically look up the per-tenant bucket index instead of scanning the bucket via &lt;code&gt;list objects&lt;/code&gt; operations.&lt;/p&gt;
&lt;p&gt;This provides the following benefits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reduced number of API calls to the object storage by querier and store-gateway&lt;/li&gt;
&lt;li&gt;No &amp;ldquo;list objects&amp;rdquo; storage API calls performed by querier and store-gateway&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;../components/querier/&#34;&gt;querier&lt;/a&gt; is up and running immediately after the startup, so there is no need to run an initial bucket scan&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;structure-of-the-index&#34;&gt;Structure of the index&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;bucket-index.json.gz&lt;/code&gt; contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;blocks&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;
List of complete blocks of a tenant, including blocks marked for deletion. Partial blocks are excluded from the index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;block_deletion_marks&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;
List of block deletion marks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;updated_at&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;
A Unix timestamp, with seconds precision, of the last time the index was updated and written to the storage.&lt;/li&gt;
&lt;/ul&gt;
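&lt;p&gt;A minimal sketch of decoding such a file, assuming only the fields listed above (the file is gzip-compressed JSON):&lt;/p&gt;

```python
import gzip
import json
import time


def read_bucket_index(payload):
    """Decode a bucket-index.json.gz payload and report its age.

    Returns (index_dict, age_seconds). Field names follow the structure
    listed above; everything else here is illustrative."""
    index = json.loads(gzip.decompress(payload))
    age_seconds = time.time() - index["updated_at"]
    return index, age_seconds


# Round-trip a minimal index written "now":
doc = {"blocks": [], "block_deletion_marks": [], "updated_at": int(time.time())}
index, age = read_bucket_index(gzip.compress(json.dumps(doc).encode()))
assert index["blocks"] == [] and age < 5
```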
&lt;h2 id=&#34;how-it-gets-updated&#34;&gt;How it gets updated&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;../components/compactor/&#34;&gt;compactor&lt;/a&gt; periodically scans the bucket and uploads an updated bucket index to the storage.
You can configure the frequency with which the bucket index is updated via &lt;code&gt;-compactor.cleanup-interval&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The use of the bucket index is optional, but the index is built and updated by the compactor even if &lt;code&gt;-blocks-storage.bucket-store.bucket-index.enabled=false&lt;/code&gt;.
This behavior ensures that the bucket index for any tenant exists and that query result consistency is guaranteed if a Grafana Mimir cluster operator enables the bucket index in a live cluster.
The overhead introduced by keeping the bucket index updated is not significant.&lt;/p&gt;
&lt;h2 id=&#34;how-its-used-by-the-querier&#34;&gt;How it&amp;rsquo;s used by the querier&lt;/h2&gt;
&lt;p&gt;At query time the &lt;a href=&#34;../components/querier/&#34;&gt;querier&lt;/a&gt; and &lt;a href=&#34;../components/ruler/&#34;&gt;ruler&lt;/a&gt; determine whether the bucket index for the tenant has already been loaded to memory.
If not, the querier and ruler download it from the storage and cache it.&lt;/p&gt;
&lt;p&gt;Because the bucket index is a small file, lazily downloading it doesn&amp;rsquo;t have a significant impact on first-query performance, but it does allow a querier to get up and running without pre-downloading every tenant&amp;rsquo;s bucket index.
In addition, if the &lt;a href=&#34;../components/querier/#metadata-cache&#34;&gt;metadata cache&lt;/a&gt; is enabled, the bucket index is cached for a short time in a shared cache, which reduces the latency and number of API calls to the object storage in case multiple queriers and rulers fetch the same tenant&amp;rsquo;s bucket index within a short time.&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;bucket-index-querier-workflow.png&#34;
  alt=&#34;Querier - Bucket index&#34;/&gt;&lt;/p&gt;
&lt;!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit --&gt;
&lt;p&gt;While the bucket index is in memory, a background process keeps it updated periodically, so that subsequent queries from the same tenant to the same querier instance use the cached (and periodically updated) bucket index.&lt;/p&gt;
&lt;p&gt;The following configuration options determine bucket index update intervals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-blocks-storage.bucket-store.sync-interval&lt;/code&gt;&lt;br /&gt;
This option configures how frequently a cached bucket index is refreshed.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-blocks-storage.bucket-store.bucket-index.update-on-error-interval&lt;/code&gt;&lt;br /&gt;
If downloading a bucket index fails, the failure is cached for a short time so that the backend storage doesn&amp;rsquo;t experience a large volume of storage requests.
This option configures the frequency with which the bucket store attempts to load a failed bucket index.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a bucket index is unused for the amount of time configured via &lt;code&gt;-blocks-storage.bucket-store.bucket-index.idle-timeout&lt;/code&gt; (for example, because a querier instance is not receiving any queries from the tenant), the querier offloads it, which stops the querier from updating it at regular intervals.
This is useful for tenants that are resharded to different queriers when &lt;a href=&#34;../../configuring/configuring-shuffle-sharding/&#34;&gt;shuffle sharding&lt;/a&gt; is enabled.&lt;/p&gt;
&lt;p&gt;At query time the querier and ruler determine how old a bucket index is based on its &lt;code&gt;updated_at&lt;/code&gt;.
If the age is older than the period configured via &lt;code&gt;-blocks-storage.bucket-store.bucket-index.max-stale-period&lt;/code&gt;, the query fails.
This circuit breaker ensures queriers and rulers do not return any partial query results due to a stale view over the long-term storage.&lt;/p&gt;
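&lt;p&gt;The staleness circuit breaker can be sketched as follows (hypothetical function name; all values in seconds):&lt;/p&gt;

```python
def check_bucket_index_staleness(updated_at, now, max_stale_period):
    """Fail the query when the bucket index is older than the configured
    -blocks-storage.bucket-store.bucket-index.max-stale-period, mirroring
    the circuit breaker described above. All arguments are Unix seconds."""
    if now - updated_at > max_stale_period:
        raise RuntimeError(
            "bucket index too stale; failing query to avoid partial results"
        )


check_bucket_index_staleness(updated_at=900, now=1000, max_stale_period=3600)  # fresh: ok
try:
    check_bucket_index_staleness(updated_at=0, now=10_000, max_stale_period=3600)
except RuntimeError:
    pass  # a stale index fails the query instead of returning partial results
```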
&lt;h2 id=&#34;how-its-used-by-the-store-gateway&#34;&gt;How it&amp;rsquo;s used by the store-gateway&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;../components/store-gateway/&#34;&gt;store-gateway&lt;/a&gt;, at startup and then periodically, fetches the bucket index for each tenant that belongs to its shard, and uses it as the source of truth for the blocks and deletion marks in the storage. This removes the need to periodically scan the bucket to discover the blocks belonging to its shard.&lt;/p&gt;
]]></content><description>&lt;h1 id="grafana-mimir-bucket-index">Grafana Mimir bucket index&lt;/h1>
&lt;p>The bucket index is a per-tenant file that contains the list of blocks and block deletion marks in the storage. The bucket index is stored in the backend object storage, is periodically updated by the compactor, and used by queriers, store-gateways, and rulers (in &lt;a href="../components/ruler/#internal">internal&lt;/a> operational mode) to discover blocks in the storage.&lt;/p></description></item><item><title>Grafana Mimir hash rings</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/hash-ring/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/hash-ring/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-hash-rings&#34;&gt;Grafana Mimir hash rings&lt;/h1&gt;
&lt;p&gt;Hash rings are a distributed &lt;a href=&#34;https://en.wikipedia.org/wiki/Consistent_hashing&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;consistent hashing scheme&lt;/a&gt; and are widely used by Grafana Mimir for sharding and replication.&lt;/p&gt;
&lt;h2 id=&#34;how-the-hash-ring-works-in-grafana-mimir&#34;&gt;How the hash ring works in Grafana Mimir&lt;/h2&gt;
&lt;p&gt;The hash ring in Grafana Mimir is used to share work across several replicas of a component in a consistent way, so that any other component can decide which address to talk to.
The workload or data to share is hashed first and the result of the hashing is used to find which ring member owns it.&lt;/p&gt;
&lt;p&gt;Grafana Mimir uses the &lt;code&gt;fnv32a&lt;/code&gt; hash function, which returns 32-bit unsigned integers so its value can be between &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;(2^32)-1&lt;/code&gt;, inclusive.
This value is called &lt;em&gt;token&lt;/em&gt; and used as the ID of the data.
The token determines the location on the hash ring deterministically.
This allows independent determination of what instance of Grafana Mimir is the authoritative owner of any specific data.&lt;/p&gt;
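As a minimal sketch of the token computation described above, the following Go program hashes a tenant ID and label set with the standard library's 32-bit FNV-1a implementation. The concatenation of the tenant ID and sorted label pairs is a simplified assumption for illustration; Grafana Mimir's exact encoding of the label set differs.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardingToken hashes the tenant ID and the series' labels with 32-bit
// FNV-1a, producing a token in the range [0, (2^32)-1]. Labels are sorted
// by name so that the token doesn't depend on map iteration order.
func shardingToken(tenantID string, labels map[string]string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte(labels[name]))
	}
	return h.Sum32()
}

func main() {
	token := shardingToken("tenant-1", map[string]string{
		"__name__": "cpu_seconds_total",
		"instance": "1.1.1.1",
	})
	fmt.Printf("token: %d\n", token) // somewhere in [0, (2^32)-1]
}
```

Because the same inputs always produce the same token, any component can independently compute the owner of a series without coordination.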
&lt;p&gt;For example, series are sharded across &lt;a href=&#34;../components/ingester/&#34;&gt;ingesters&lt;/a&gt;.
The token of a given series is computed by hashing all of the series’ labels and the tenant ID; the result is an unsigned 32-bit integer within the token space.
The ingester that owns that series is the instance that owns the range of tokens containing the series&amp;rsquo; token.&lt;/p&gt;
&lt;p&gt;To divide the set of possible tokens (&lt;code&gt;2^32&lt;/code&gt;) across the available instances within the cluster, all of the running instances of a given Grafana Mimir component, such as the ingesters, join a hash ring.
The hash ring is a data structure that splits the space of the tokens into multiple ranges, and assigns each range to a given Grafana Mimir ring member.&lt;/p&gt;
&lt;p&gt;Upon startup, an instance generates random token values, and it registers them into the ring.
The values that each instance registers determine which instance owns a given token.
A token is owned by the instance that registered the smallest value that is higher than the token being looked up (wrapping around zero after &lt;code&gt;(2^32)-1&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;To replicate the data across multiple instances, Grafana Mimir finds the replicas by starting from the authoritative owner of the data and walking the ring clockwise.
Data is replicated to the next instances found while walking the ring.&lt;/p&gt;
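The ownership lookup and the clockwise replica walk can be sketched in Go as follows. This is a simplified model under stated assumptions: each instance registers a single token here, whereas real instances register many, and zone-aware placement is ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// owner returns the index (into the sorted tokens slice) of the instance
// that owns the given token: the one registered with the smallest value
// strictly greater than it, wrapping around past the highest token.
func owner(tokens []uint32, token uint32) int {
	i := sort.Search(len(tokens), func(i int) bool { return tokens[i] > token })
	if i == len(tokens) { // wrapped around zero
		return 0
	}
	return i
}

// replicas walks the ring clockwise from the authoritative owner and
// returns the indexes of the instances holding the data, for a given
// replication factor.
func replicas(tokens []uint32, token uint32, rf int) []int {
	out := make([]int, 0, rf)
	start := owner(tokens, token)
	for n := 0; n < rf && n < len(tokens); n++ {
		out = append(out, (start+n)%len(tokens))
	}
	return out
}

func main() {
	// Four ingesters registered with tokens 2, 4, 6, and 9.
	tokens := []uint32{2, 4, 6, 9}
	fmt.Println(owner(tokens, 3))       // 1 (the instance registered with token 4)
	fmt.Println(replicas(tokens, 3, 3)) // [1 2 3]
}
```

The `sort.Search` binary search finds the successor token; the modulo in `replicas` implements the wrap-around when the walk passes the end of the ring.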
&lt;h3 id=&#34;a-practical-example&#34;&gt;A practical example&lt;/h3&gt;
&lt;p&gt;To better understand how this works, take four ingesters and a token space between &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;9&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ingester #1 is registered in the ring with the token &lt;code&gt;2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ingester #2 is registered in the ring with the token &lt;code&gt;4&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ingester #3 is registered in the ring with the token &lt;code&gt;6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ingester #4 is registered in the ring with the token &lt;code&gt;9&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Grafana Mimir receives an incoming sample for the series &lt;code&gt;{__name__=&amp;quot;cpu_seconds_total&amp;quot;,instance=&amp;quot;1.1.1.1&amp;quot;}&lt;/code&gt;.
It hashes the series’ labels, and the result of the hashing function is the token &lt;code&gt;3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To find which ingester owns token &lt;code&gt;3&lt;/code&gt;, Grafana Mimir looks up the token &lt;code&gt;3&lt;/code&gt; in the ring and finds the ingester that is registered with the smallest token larger than &lt;code&gt;3&lt;/code&gt;.
The ingester #2, which is registered with token &lt;code&gt;4&lt;/code&gt;, is the authoritative owner of the series &lt;code&gt;{__name__=&amp;quot;cpu_seconds_total&amp;quot;,instance=&amp;quot;1.1.1.1&amp;quot;}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;hash-ring-without-replication.png&#34;
  alt=&#34;Hash ring without replication&#34;/&gt;&lt;/p&gt;
&lt;p&gt;By default, Grafana Mimir replicates each series to three ingesters.
After finding the authoritative owner of the series, Grafana Mimir continues to walk the ring clockwise to find the remaining two instances where the series should be replicated.
In the example that follows, the series is replicated to &lt;code&gt;Ingester #3&lt;/code&gt; and &lt;code&gt;Ingester #4&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;hash-ring-with-replication.png&#34;
  alt=&#34;Hash ring with replication&#34;/&gt;&lt;/p&gt;
&lt;h3 id=&#34;consistent-hashing&#34;&gt;Consistent hashing&lt;/h3&gt;
&lt;p&gt;The hash ring guarantees the property known as consistent hashing.&lt;/p&gt;
&lt;p&gt;When an instance is added to or removed from the ring, consistent hashing minimizes the number of tokens that move from one instance to another.
On average, the number of tokens that need to move to a different instance is only &lt;code&gt;n/m&lt;/code&gt;, where &lt;code&gt;n&lt;/code&gt; is the total number of tokens registered in the ring and &lt;code&gt;m&lt;/code&gt; is the number of instances registered in the ring.&lt;/p&gt;
&lt;h2 id=&#34;components-that-use-the-hash-ring&#34;&gt;Components that use the hash ring&lt;/h2&gt;
&lt;p&gt;There are several Grafana Mimir components that need a hash ring.
Each of the following components builds an independent hash ring:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;../components/ingester/&#34;&gt;Ingesters&lt;/a&gt; shard and replicate series.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;../components/distributor/&#34;&gt;Distributors&lt;/a&gt; enforce rate limits.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;../components/compactor/&#34;&gt;Compactors&lt;/a&gt; shard compaction workload.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;../components/store-gateway/&#34;&gt;Store-gateways&lt;/a&gt; shard blocks to query from long-term storage.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;../components/ruler/&#34;&gt;(Optional) Rulers&lt;/a&gt; shard rule groups to evaluate.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;../components/alertmanager/&#34;&gt;(Optional) Alertmanagers&lt;/a&gt; shard tenants.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-the-hash-ring-is-shared-between-grafana-mimir-instances&#34;&gt;How the hash ring is shared between Grafana Mimir instances&lt;/h2&gt;
&lt;p&gt;Hash ring data structures need to be shared between Grafana Mimir instances.
To propagate changes to the hash ring, Grafana Mimir uses a key-value store.
The key-value store is required and can be configured independently for the hash rings of different components.&lt;/p&gt;
&lt;p&gt;For more information, see the &lt;a href=&#34;../key-value-store/&#34;&gt;key-value store documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;features-that-are-built-using-the-hash-ring&#34;&gt;Features that are built using the hash ring&lt;/h2&gt;
&lt;p&gt;Grafana Mimir primarily uses the hash ring for sharding and replication.
Features that are built using the hash ring:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Service discovery&lt;/strong&gt;: Instances can discover each other by looking up which instances are registered in the ring.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heartbeating&lt;/strong&gt;: Instances periodically send a heartbeat to the ring to signal that they&amp;rsquo;re up and running. An instance is considered unhealthy if it misses the heartbeat for some period of time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zone-aware replication&lt;/strong&gt;: Zone-aware replication is the replication of data across failure domains and can be optionally enabled in Grafana Mimir. For more information, see &lt;a href=&#34;../../configuring/configuring-zone-aware-replication/&#34;&gt;configuring zone-aware replication&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle sharding&lt;/strong&gt;: Grafana Mimir optionally supports shuffle sharding in a multi-tenant cluster, to reduce the blast radius of an outage and better isolate tenants. For more information, refer to &lt;a href=&#34;../../configuring/configuring-shuffle-sharding/&#34;&gt;configure shuffle sharding&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="grafana-mimir-hash-rings">Grafana Mimir hash rings&lt;/h1>
&lt;p>Hash rings are a distributed &lt;a href="https://en.wikipedia.org/wiki/Consistent_hashing" target="_blank" rel="noopener noreferrer">consistent hashing scheme&lt;/a> and are widely used by Grafana Mimir for sharding and replication.&lt;/p>
&lt;h2 id="how-the-hash-ring-works-in-grafana-mimir">How the hash ring works in Grafana Mimir&lt;/h2>
&lt;p>The hash ring in Grafana Mimir is used to share work across several replicas of a component in a consistent way, so that any other component can decide which address to talk to.
The workload or data to share is hashed first and the result of the hashing is used to find which ring member owns it.&lt;/p></description></item><item><title>Grafana Mimir key-value store</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/key-value-store/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/key-value-store/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-key-value-store&#34;&gt;Grafana Mimir key-value store&lt;/h1&gt;
&lt;p&gt;A key-value (KV) store is a database that stores data indexed by key.
Grafana Mimir requires a key-value store for the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;../hash-ring/&#34;&gt;Hash ring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;../../configuring/configuring-high-availability-deduplication/&#34;&gt;(Optional) Distributor high-availability tracker&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;supported-key-value-store-backends&#34;&gt;Supported key-value store backends&lt;/h2&gt;
&lt;p&gt;Grafana Mimir supports the following key-value (KV) store backends:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gossip-based &lt;a href=&#34;https://github.com/hashicorp/memberlist&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;memberlist&lt;/a&gt; protocol (default)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.consul.io&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Consul&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://etcd.io&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Etcd&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;gossip-based-memberlist-protocol-default&#34;&gt;Gossip-based memberlist protocol (default)&lt;/h3&gt;
&lt;p&gt;By default, Grafana Mimir instances use a Gossip-based protocol to join a memberlist cluster.
The data is shared between the instances using peer-to-peer communication and no external dependency is required.&lt;/p&gt;
&lt;p&gt;We recommend that you use memberlist to run Grafana Mimir.&lt;/p&gt;
&lt;p&gt;To configure memberlist, refer to &lt;a href=&#34;../../configuring/configuring-hash-rings/&#34;&gt;configuring hash rings&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;consul&#34;&gt;Consul&lt;/h3&gt;
&lt;p&gt;Grafana Mimir supports &lt;a href=&#34;https://www.consul.io&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Consul&lt;/a&gt; as a backend KV store.
If you want to use Consul, you must install it. The Grafana Mimir installation does not include Consul.&lt;/p&gt;
&lt;p&gt;To configure Consul, refer to &lt;a href=&#34;../../configuring/configuring-hash-rings/&#34;&gt;configuring hash rings&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;etcd&#34;&gt;Etcd&lt;/h3&gt;
&lt;p&gt;Grafana Mimir supports &lt;a href=&#34;https://etcd.io&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;etcd&lt;/a&gt; as a backend KV store.
If you want to use etcd, you must install it. The Grafana Mimir installation does not include etcd.&lt;/p&gt;
&lt;p&gt;To configure etcd, refer to &lt;a href=&#34;../../configuring/configuring-hash-rings/&#34;&gt;configuring hash rings&lt;/a&gt;.&lt;/p&gt;
]]></content><description>&lt;h1 id="grafana-mimir-key-value-store">Grafana Mimir key-value store&lt;/h1>
&lt;p>A key-value (KV) store is a database that stores data indexed by key.
Grafana Mimir requires a key-value store for the following features:&lt;/p></description></item><item><title>Grafana Mimir memberlist and gossip protocol</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/memberlist-and-the-gossip-protocol/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/memberlist-and-the-gossip-protocol/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-memberlist-and-gossip-protocol&#34;&gt;Grafana Mimir memberlist and gossip protocol&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/hashicorp/memberlist&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Memberlist&lt;/a&gt; is a Go library that manages cluster membership, node failure detection, and message passing using a gossip-based protocol.
Memberlist is eventually consistent and network partitions are partially tolerated by attempting to communicate to potentially dead nodes through multiple routes.&lt;/p&gt;
&lt;p&gt;By default, Grafana Mimir uses memberlist to implement a &lt;a href=&#34;../key-value-store/&#34;&gt;key-value (KV) store&lt;/a&gt; to share the &lt;a href=&#34;../hash-ring/&#34;&gt;hash ring&lt;/a&gt; data structures between instances.&lt;/p&gt;
&lt;p&gt;When using a memberlist-based KV store, each instance maintains a copy of the hash rings.
Each Mimir instance updates a hash ring locally and uses memberlist to propagate the changes to other instances.
Updates generated locally and updates received from other instances are merged together to form the current state of the ring on the instance.&lt;/p&gt;
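The merge described above can be sketched in Go. This is a heavily simplified model under stated assumptions: each ring entry carries only an address and a heartbeat timestamp, and the most recently updated entry per instance wins; the real ring entries and merge semantics in Mimir are richer.

```go
package main

import "fmt"

// instanceDesc is a simplified stand-in for a ring entry; real Mimir
// entries also carry tokens, state, and zone information.
type instanceDesc struct {
	Addr      string
	Timestamp int64 // last heartbeat time
}

// mergeRing combines a locally held ring with an update received via
// gossip, keeping the most recently updated entry per instance.
func mergeRing(local, incoming map[string]instanceDesc) map[string]instanceDesc {
	out := make(map[string]instanceDesc, len(local))
	for id, d := range local {
		out[id] = d
	}
	for id, d := range incoming {
		if cur, ok := out[id]; !ok || d.Timestamp > cur.Timestamp {
			out[id] = d
		}
	}
	return out
}

func main() {
	local := map[string]instanceDesc{
		"ingester-1": {Addr: "10.0.0.1", Timestamp: 100},
	}
	incoming := map[string]instanceDesc{
		"ingester-1": {Addr: "10.0.0.1", Timestamp: 120},
		"ingester-2": {Addr: "10.0.0.2", Timestamp: 110},
	}
	merged := mergeRing(local, incoming)
	fmt.Println(len(merged)) // 2
}
```

Because the merge is commutative and keeps the newest information, repeatedly exchanging updates lets all instances converge to the same view of the ring.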
&lt;p&gt;To configure memberlist, refer to &lt;a href=&#34;../../configuring/configuring-hash-rings/&#34;&gt;configuring hash rings&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-memberlist-propagates-hash-ring-changes&#34;&gt;How memberlist propagates hash ring changes&lt;/h2&gt;
&lt;p&gt;When using a memberlist-based KV store, every Grafana Mimir instance propagates the hash ring data structures to other instances using the following techniques:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Propagating only the differences introduced in recent changes.&lt;/li&gt;
&lt;li&gt;Propagating the full hash ring data structure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Every &lt;code&gt;-memberlist.gossip-interval&lt;/code&gt;, an instance randomly selects a subset of the Grafana Mimir cluster instances, whose size is configured by &lt;code&gt;-memberlist.gossip-nodes&lt;/code&gt;, and sends the latest changes to the selected instances.
This operation is performed frequently and it&amp;rsquo;s the primary technique used to propagate changes.&lt;/p&gt;
&lt;p&gt;In addition, every &lt;code&gt;-memberlist.pullpush-interval&lt;/code&gt;, an instance randomly selects another instance in the Grafana Mimir cluster and transfers the full content of the KV store, including all hash rings (unless &lt;code&gt;-memberlist.pullpush-interval&lt;/code&gt; is zero, which disables this behavior).
After this operation is complete, the two instances have the same KV store content.
This operation is computationally more expensive, and as a result, it&amp;rsquo;s performed less frequently. The operation ensures that the hash rings periodically reconcile to a common state.&lt;/p&gt;
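For illustration, the flags above could be set on a Mimir instance as follows; the values shown are arbitrary examples, not recommendations or defaults.

```shell
mimir -target=ingester \
  -memberlist.gossip-interval=200ms \
  -memberlist.gossip-nodes=3 \
  -memberlist.pullpush-interval=30s
```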
]]></content><description>&lt;h1 id="grafana-mimir-memberlist-and-gossip-protocol">Grafana Mimir memberlist and gossip protocol&lt;/h1>
&lt;p>&lt;a href="https://github.com/hashicorp/memberlist" target="_blank" rel="noopener noreferrer">Memberlist&lt;/a> is a Go library that manages cluster membership, node failure detection, and message passing using a gossip-based protocol.
Memberlist is eventually consistent and network partitions are partially tolerated by attempting to communicate to potentially dead nodes through multiple routes.&lt;/p></description></item><item><title>Grafana Mimir query sharding</title><link>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/query-sharding/</link><pubDate>Sat, 11 Apr 2026 21:28:04 +0000</pubDate><guid>https://grafana.com/docs/mimir/v2.2.x/operators-guide/architecture/query-sharding/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-mimir-query-sharding&#34;&gt;Grafana Mimir query sharding&lt;/h1&gt;
&lt;p&gt;Mimir includes the ability to run a single query across multiple machines. This is
achieved by breaking the dataset into smaller pieces. These smaller pieces are
called shards. Each shard then gets queried in a partial query, and those
partial queries are distributed by the query-frontend to run on different
queriers in parallel. The results of those partial queries are aggregated by the
query-frontend to return the full query result.&lt;/p&gt;
&lt;h2 id=&#34;query-sharding-at-glance&#34;&gt;Query sharding at a glance&lt;/h2&gt;
&lt;p&gt;Not all queries are shardable. Even when the full query is not shardable, the inner
parts of a query might still be shardable.&lt;/p&gt;
&lt;p&gt;In particular, associative aggregations (like &lt;code&gt;sum&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;,
&lt;code&gt;avg&lt;/code&gt;) are shardable, while some query functions (like &lt;code&gt;absent&lt;/code&gt;, &lt;code&gt;absent_over_time&lt;/code&gt;,
&lt;code&gt;histogram_quantile&lt;/code&gt;, &lt;code&gt;sort_desc&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;) are not.&lt;/p&gt;
&lt;p&gt;The following examples assume a shard count of
&lt;code&gt;3&lt;/code&gt;. All of the partial queries that include the label selector &lt;code&gt;__query_shard__&lt;/code&gt;
are executed in parallel. The &lt;code&gt;concat()&lt;/code&gt; annotation shows where partial
query results are concatenated and merged by the query-frontend.&lt;/p&gt;
&lt;h3 id=&#34;example-1-full-query-is-shardable&#34;&gt;Example 1: Full query is shardable&lt;/h3&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;sum(rate(metric[1m]))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Is executed as (assuming a shard count of 3):&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;sum(
  concat(
    sum(rate(metric{__query_shard__=&amp;#34;1_of_3&amp;#34;}[1m]))
    sum(rate(metric{__query_shard__=&amp;#34;2_of_3&amp;#34;}[1m]))
    sum(rate(metric{__query_shard__=&amp;#34;3_of_3&amp;#34;}[1m]))
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h3 id=&#34;example-2-inner-part-is-shardable&#34;&gt;Example 2: Inner part is shardable&lt;/h3&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;histogram_quantile(0.99, sum by(le) (rate(metric[1m])))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Is executed as (assuming a shard count of 3):&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;histogram_quantile(0.99, sum by(le) (
  concat(
    sum by(le) (rate(metric{__query_shard__=&amp;#34;1_of_3&amp;#34;}[1m]))
    sum by(le) (rate(metric{__query_shard__=&amp;#34;2_of_3&amp;#34;}[1m]))
    sum by(le) (rate(metric{__query_shard__=&amp;#34;3_of_3&amp;#34;}[1m]))
  )
))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h3 id=&#34;example-3-query-with-two-shardable-portions&#34;&gt;Example 3: Query with two shardable portions&lt;/h3&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;sum(rate(failed[1m])) / sum(rate(total[1m]))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Is executed as (assuming a shard count of 3):&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;sum(
  concat(
    sum (rate(failed{__query_shard__=&amp;#34;1_of_3&amp;#34;}[1m]))
    sum (rate(failed{__query_shard__=&amp;#34;2_of_3&amp;#34;}[1m]))
    sum (rate(failed{__query_shard__=&amp;#34;3_of_3&amp;#34;}[1m]))
  )
)
/
sum(
  concat(
    sum (rate(total{__query_shard__=&amp;#34;1_of_3&amp;#34;}[1m]))
    sum (rate(total{__query_shard__=&amp;#34;2_of_3&amp;#34;}[1m]))
    sum (rate(total{__query_shard__=&amp;#34;3_of_3&amp;#34;}[1m]))
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;query-sharding.png&#34;
  alt=&#34;Flow of a query with two shardable portions&#34;/&gt;&lt;/p&gt;
&lt;h2 id=&#34;how-to-enable-query-sharding&#34;&gt;How to enable query sharding&lt;/h2&gt;
&lt;p&gt;To enable query sharding, opt in by setting
&lt;code&gt;-query-frontend.parallelize-shardable-queries&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Each shardable portion of a query is split into
&lt;code&gt;-query-frontend.query-sharding-total-shards&lt;/code&gt; partial queries. If a query has multiple
inner portions that can be sharded, each portion is sharded
&lt;code&gt;-query-frontend.query-sharding-total-shards&lt;/code&gt; times. In some cases, this could lead to
an explosion of queries. For this reason, the
&lt;code&gt;-query-frontend.query-sharding-max-sharded-queries&lt;/code&gt; parameter sets a hard limit, &lt;code&gt;128&lt;/code&gt; by default, on the total
number of partial queries that a single input query can generate.&lt;/p&gt;
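For example, query sharding could be enabled on the query-frontend as follows. The shard count of 16 is an illustrative value, and 128 restates the default limit.

```shell
mimir -target=query-frontend \
  -query-frontend.parallelize-shardable-queries=true \
  -query-frontend.query-sharding-total-shards=16 \
  -query-frontend.query-sharding-max-sharded-queries=128
```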
&lt;p&gt;When running a query over a large time range and
&lt;code&gt;-query-frontend.split-queries-by-interval&lt;/code&gt; is enabled, the
&lt;code&gt;-query-frontend.query-sharding-max-sharded-queries&lt;/code&gt; limit applies to the total
number of queries after they have been split by time (first) and by shards (second).&lt;/p&gt;
&lt;p&gt;As an example, if &lt;code&gt;-query-frontend.query-sharding-max-sharded-queries=128&lt;/code&gt; and
&lt;code&gt;-query-frontend.split-queries-by-interval=24h&lt;/code&gt;, and you run a query over 8 days, each
daily query can generate at most 128 / 8 = 16 partial queries.&lt;/p&gt;
&lt;p&gt;After you enable query sharding in a microservices deployment, the
query-frontends start processing the aggregation of the partial queries. Therefore,
it is important to also configure the following PromQL engine-specific parameters on the
query-frontend:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-querier.max-concurrent&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-querier.timeout&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-querier.max-samples&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-querier.default-evaluation-interval&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-querier.lookback-delta&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;operational-considerations&#34;&gt;Operational considerations&lt;/h2&gt;
&lt;p&gt;Splitting a single query into sharded queries increases the number of queries
that must be processed. Parallelization decreases the query processing time,
but increases the load on the querier components and their underlying data stores
(ingesters for recent data and store-gateways for historic data). The
caching layer for chunks and indexes also experiences an increased load.&lt;/p&gt;
&lt;p&gt;We also recommend increasing the maximum number of queries scheduled in
parallel by the query-frontend: multiply the previously set value of
&lt;code&gt;-querier.max-query-parallelism&lt;/code&gt; by
&lt;code&gt;-query-frontend.query-sharding-total-shards&lt;/code&gt;.&lt;/p&gt;
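For example, assuming a hypothetical previous value of `-querier.max-query-parallelism=14` and a shard count of 16, the multiplied value would be set as follows.

```shell
# 14 (previous parallelism) * 16 (shard count) = 224
mimir -target=query-frontend \
  -query-frontend.query-sharding-total-shards=16 \
  -querier.max-query-parallelism=224
```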
&lt;h2 id=&#34;verification&#34;&gt;Verification&lt;/h2&gt;
&lt;h3 id=&#34;query-statistics&#34;&gt;Query statistics&lt;/h3&gt;
&lt;p&gt;The query statistics logged by the query-frontend let you check whether query sharding was
used for an individual query. The field &lt;code&gt;sharded_queries&lt;/code&gt; contains the number
of partial queries that were executed in parallel.&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;sharded_queries&lt;/code&gt; is &lt;code&gt;0&lt;/code&gt;, either the query is not shardable or query
sharding is disabled for the cluster or the tenant. The following is a log line for an
unshardable query:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;sharded_queries=0  param_query=&amp;#34;absent(up{job=\&amp;#34;my-service\&amp;#34;})&amp;#34;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;When &lt;code&gt;sharded_queries&lt;/code&gt; matches the configured shard count, query sharding is
operational and the query has only a single leg (assuming time splitting is
disabled or the query doesn&amp;rsquo;t span multiple days). The following log
line represents that case with a shard count of &lt;code&gt;16&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;sharded_queries=16 query=&amp;#34;sum(rate(prometheus_engine_queries[5m]))&amp;#34;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;When &lt;code&gt;sharded_queries&lt;/code&gt; is a multiple of the configured shard count, query
sharding is operational and the query has multiple legs (assuming time
splitting is disabled or the query doesn&amp;rsquo;t span multiple days). The
following log line shows a query with two legs and with a configured shard
count of &lt;code&gt;16&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;sharded_queries=32 query=&amp;#34;sum(rate(prometheus_engine_queries{engine=\&amp;#34;ruler\&amp;#34;}[5m]))/sum(rate(prometheus_engine_queries[5m]))&amp;#34;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The query-frontend also exposes metrics, which can be useful for understanding the
query workload&amp;rsquo;s parallelism as a whole.&lt;/p&gt;
&lt;p&gt;You can run the following query to get the ratio of queries that have been successfully sharded:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-promql&#34;&gt;sum(rate(cortex_frontend_query_sharding_rewrites_succeeded_total[$__rate_interval])) /
sum(rate(cortex_frontend_query_sharding_rewrites_attempted_total[$__rate_interval]))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;cortex_frontend_sharded_queries_per_query&lt;/code&gt; histogram shows
how many sharded subqueries are generated per query.&lt;/p&gt;
]]></content><description>&lt;h1 id="grafana-mimir-query-sharding">Grafana Mimir query sharding&lt;/h1>
&lt;p>Mimir includes the ability to run a single query across multiple machines. This is
achieved by breaking the dataset into smaller pieces. These smaller pieces are
called shards. Each shard then gets queried in a partial query, and those
partial queries are distributed by the query-frontend to run on different
queriers in parallel. The results of those partial queries are aggregated by the
query-frontend to return the full query result.&lt;/p></description></item></channel></rss>