This is documentation for the next version of Metrics enterprise. For the latest stable release, go to the latest version.

Grafana Enterprise Metrics configuration

Grafana Enterprise Metrics uses a configuration file that is a superset of the Cortex configuration file.

The following extensions are available in the Grafana Enterprise Metrics configuration:

Authentication

You can enable the built-in, token-based authentication mechanism in Grafana Enterprise Metrics by adding the following to your configuration file:

auth:
  type: enterprise

Admin backend storage

You need to configure Grafana Enterprise Metrics with a bucket that stores administration objects such as clusters, access policies, tokens, or licenses. Ideally, this bucket is separate from the bucket that stores the TSDB blocks.

If you do not configure the object-storage bucket, or configure it incorrectly, Grafana Enterprise Metrics cannot find its license and logs warnings. Failing to find a license is deliberately not a fatal error, which minimizes operational disruption in the case of misconfiguration or license expiry.

The admin client is configured with the same options as a blocks storage backend. All of the clients supported for blocks storage are also supported for the admin API.

To verify that you have configured the object-storage bucket correctly, see either the S3 / Minio backend if you are using an S3-compatible API, or the Google Cloud Storage (GCS) backend if you are using Google Cloud Storage.

Below are snippets that can be included in your configuration to set up each of the common object storage backends:

S3 / Minio backend

admin_client:
  storage:
    type: s3
    s3:
      # The S3 bucket endpoint. It could be an AWS S3 endpoint listed at
      # https://docs.aws.amazon.com/general/latest/gr/s3.html or the address of an
      # S3-compatible service in hostname:port format.
      [endpoint: <string> | default = ""]

      # S3 bucket name
      [bucket_name: <string> | default = ""]

      # S3 secret access key
      [secret_access_key: <string> | default = ""]

      # S3 access key ID
      [access_key_id: <string> | default = ""]

      # If enabled, use http:// for the S3 endpoint instead of https://. This could
      # be useful in local dev/test environments while using an S3-compatible
      # backend storage, like Minio.
      [insecure: <boolean> | default = false]

Google Cloud Storage (GCS) backend

admin_client:
  storage:
    type: gcs
    gcs:
      # GCS bucket name
      [bucket_name: <string> | default = ""]

      # JSON representing either a Google Developers Console client_credentials.json
      # file or a Google Developers service account key file. If empty, fallback to
      # Google default logic.
      [service_account: <string> | default = ""]

Blocks storage

Grafana Enterprise Metrics extends Cortex’s blocks storage configuration with the following features:

Rate limiting of list calls

List calls to a storage bucket can be rate limited. The blocks-storage.bucket-rate-limit.limit configuration option specifies how many list calls can be made per second before rate limiting applies. The blocks-storage.bucket-rate-limit.burst configuration option specifies the burst size. If blocks-storage.bucket-rate-limit.limit is less than or equal to zero (the default), no rate limiting is applied. The snippet below shows how to set this in your configuration file for an S3 backend:

blocks_storage:
  backend:                 s3
  bucket_rate_limit:       100
  bucket_rate_limit_burst: 1
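
To make the interaction between the limit and the burst size concrete, here is a minimal sketch of a token-bucket limiter. It assumes token-bucket semantics, which this page does not confirm for Grafana Enterprise Metrics; the TokenBucket class and its injectable clock are hypothetical names used only for illustration.

```python
import time


class TokenBucket:
    """Hypothetical limiter: refills `limit` tokens per second, stores at most `burst`."""

    def __init__(self, limit, burst, clock=time.monotonic):
        self.limit = limit
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a list call may proceed right now."""
        if self.limit <= 0:
            return True  # a limit of zero or less disables rate limiting
        now = self.clock()
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.limit)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Under these assumptions, a limit of 100 with a burst of 1 lets one call through immediately, rejects a second call made at the same instant, and admits the next call once enough time has passed to refill a token.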

Compactor

Grafana Enterprise Metrics extends Cortex’s compactor component with several options that improve its performance.

Time-sharding strategy

GEM adds the parameter sharding_strategy to the compactor configuration.

When set to default, the compactor behaves the same as it does in Cortex.

When set to time-sharding, a single compactor instance can parallelize the compaction of multiple groups of blocks. Use the compaction_concurrency parameter to set the maximum number (inclusive) of concurrent compactions allowed on a single compactor.

The workflow of a time-sharding compaction strategy is as follows:

  1. For each tenant that belongs to the compactor’s shard, the following steps occur:
    1. Find groups of compactable blocks, where each group’s time range doesn’t overlap with other groups.
    2. Concurrently compact the groups of blocks, running up to compaction_concurrency compactions at once.
    3. Repeat until the tenant has no remaining compactable blocks.
  2. Repeat until the compactor has compacted all of the tenants that belong to its shard.
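
The per-tenant part of this workflow can be sketched as follows. This is a simplified illustration, not the GEM implementation: blocks are modeled as hypothetical (min_time, max_time) tuples, grouping is approximated by merging overlapping time ranges, and compact is a stand-in that only returns the covering time range.

```python
from concurrent.futures import ThreadPoolExecutor


def group_blocks(blocks):
    """Group blocks so that no group's time range overlaps another group's.

    Each block is a (min_time, max_time) tuple with max_time exclusive.
    """
    groups = []
    group_end = None
    for block in sorted(blocks):
        if group_end is not None and block[0] < group_end:
            # Overlaps the current group: add it and extend the group's range.
            groups[-1].append(block)
            group_end = max(group_end, block[1])
        else:
            groups.append([block])
            group_end = block[1]
    return groups


def compact(group):
    """Stand-in for real compaction: return one block covering the group."""
    return (min(b[0] for b in group), max(b[1] for b in group))


def compact_tenant(blocks, compaction_concurrency):
    """Compact every group, running at most `compaction_concurrency` at once."""
    groups = group_blocks(blocks)
    with ThreadPoolExecutor(max_workers=compaction_concurrency) as pool:
        return list(pool.map(compact, groups))
```

Because the groups do not overlap in time, they can be compacted independently, which is what makes the concurrency safe in this sketch.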

To enable time-based sharding in the compactor, use the following configuration options:

compactor:
  sharding_strategy: time-sharding
  compaction_concurrency: 4

[EXPERIMENTAL] Split and merge compaction

NOTE: split-and-merge compaction is incompatible with the time-sharding strategy described above. To use split-and-merge compaction, sharding_strategy must be set to default. It CANNOT be set to time-sharding.

split-and-merge compaction is an experimental feature that allows the user to vertically and horizontally parallelize compaction for a single tenant. This is useful for metrics clusters with very large tenants.

split-and-merge compaction also allows GEM to overcome TSDB index limitations and prevent compacted blocks from growing indefinitely for a very large tenant (at any compaction stage).

split-and-merge compaction is a two-stage process: split and merge.

For the configured first level of compaction (e.g. 2h), the compactor divides all source blocks into N groups. For each group, the compactor compacts the blocks together, but instead of returning one compacted block (as with the default strategy), it outputs N blocks, called split blocks. Each split block contains a subset of the series, which are sharded across the N split blocks using a stable hashmod function. At the end of the split stage, the compactor has produced N * N blocks, each with a reference to its shard in the block’s meta.json.

Given the split blocks, the compactor runs the merge stage, which compacts together all split blocks of a given shard. Once this stage completes, the number of blocks has been reduced by a factor of N: for a given compaction time range, there is one compacted block per shard.
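
The two stages can be illustrated with a small sketch. It models a block as a set of series names and uses CRC32 as the stable hash; both choices, along with the function names shard_of, split_stage, and merge_stage, are assumptions for illustration and not the hash or data structures GEM actually uses.

```python
import zlib
from collections import defaultdict


def shard_of(series, n):
    """Stable hashmod: the same series always maps to the same shard."""
    return zlib.crc32(series.encode("utf-8")) % n


def split_stage(source_blocks, n):
    """Divide source blocks into n groups; each group yields at most n split blocks.

    Returns a list of (shard, series_set) pairs, one per non-empty split block.
    """
    groups = [source_blocks[i::n] for i in range(n)]
    split_blocks = []
    for group in groups:
        by_shard = defaultdict(set)
        for block in group:
            for series in block:
                by_shard[shard_of(series, n)].add(series)
        split_blocks.extend(sorted(by_shard.items(), key=lambda kv: kv[0]))
    return split_blocks


def merge_stage(split_blocks):
    """Compact together all split blocks that carry the same shard."""
    merged = defaultdict(set)
    for shard, series in split_blocks:
        merged[shard] |= series
    return dict(merged)
```

In this sketch the split stage produces at most N * N split blocks and the merge stage reduces them to at most N blocks, one per shard, with every series ending up in exactly one shard.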

The merge stage is then run for subsequent compaction time ranges (e.g. 12h, 24h), compacting together blocks belonging to the same shard (not shown in the picture below).

Figure: Split and merge compaction strategy

The number of split blocks, N, is configurable on a per-tenant basis (-compactor.split-and-merge-shards) and can be adjusted based on each tenant’s number of series. As a tenant’s series count grows, you can increase the configured number of shards to improve compaction parallelization and keep the size of each per-shard compacted block under control. We currently recommend 1 shard per 25-30 million active series in a tenant. For example, for a tenant with 100 million active series you would set split-and-merge-shards=4. Note: This recommendation may change as this feature is still experimental.
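
The sizing recommendation is simple arithmetic, sketched below. The helper recommended_shards is a hypothetical name, and rounding up is an assumption about how to apply the 25-30 million guideline.

```python
import math


def recommended_shards(active_series, series_per_shard=25_000_000):
    """Round up so that no shard exceeds `series_per_shard` active series."""
    return math.ceil(active_series / series_per_shard)


# 100 million active series gives 4 shards at either end of the recommended range.
print(recommended_shards(100_000_000))              # 25 million series per shard
print(recommended_shards(100_000_000, 30_000_000))  # 30 million series per shard
```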

Vertical scaling with split-and-merge compaction

To get vertical scaling when compaction_strategy=split-and-merge, use the -compactor.compaction-concurrency flag. This flag sets the maximum number of concurrent compactions allowed to run in a single compactor replica (each compaction uses 1 CPU core).

Horizontal scaling with split-and-merge compaction

To get horizontal scaling when compaction_strategy=split-and-merge, sharding must be enabled for the compactor (sharding_enabled=true). Compaction jobs (both split and merge) will then be spread across compactor-tenant-shard-size number of compactor replicas. If compactor-tenant-shard-size is set to 0, jobs will be spread out over all compactor replicas.

Vertical and horizontal scaling both accomplish the same thing: they speed up compaction by parallelizing it, either across multiple compactor replicas (horizontal scaling) or across multiple CPUs on the same compactor replica (vertical scaling). The operator can choose to use neither, one, or both types of scaling.
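
As a rough guide to how the two kinds of scaling combine, the sketch below computes an upper bound on the number of compaction jobs that can run at once for a single tenant when sharding is enabled. The function max_parallel_jobs is hypothetical, and the real number is also limited by how many non-conflicting jobs are available.

```python
def max_parallel_jobs(replicas, compaction_concurrency, tenant_shard_size=0):
    """Upper bound on one tenant's concurrent compaction jobs (sharding enabled)."""
    # A tenant shard size of 0 spreads jobs over all compactor replicas.
    if tenant_shard_size == 0:
        usable_replicas = replicas
    else:
        usable_replicas = min(tenant_shard_size, replicas)
    # Each usable replica runs up to `compaction_concurrency` jobs at once.
    return usable_replicas * compaction_concurrency
```

For example, with 6 compactor replicas, a compaction concurrency of 4, and a tenant shard size of 2, this bound is 8 jobs; with a tenant shard size of 0 it is 24.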

Below is a sample configuration for enabling split-and-merge compaction in the compactor:

compactor:
  # Must be set to 'default'. It cannot be set to 'time-sharding'.
  sharding_strategy: default
  # Enable split-and-merge compaction.
  compaction_strategy: split-and-merge
  # Enable vertical scaling during split-and-merge compaction.
  # A single compactor replica can execute up to compaction_concurrency non-conflicting
  # compaction jobs for a single tenant at once (in this example, 4 jobs at once).
  compaction_concurrency: 4
  # Enable horizontal scaling during split-and-merge compaction.
  # Compaction jobs are sharded across compactor replicas.
  sharding_enabled: true
limits:
  # Number of shards per tenant
  compactor_split_and_merge_shards: 2
  # Number of compactor replicas that a tenant's compaction jobs can be shared across
  compactor_tenant_shard_size: 2

How does split-and-merge compaction behave if -compactor.split-and-merge-shards changes?

If you change the -compactor.split-and-merge-shards setting, the change affects only the compaction of blocks that haven’t been split yet. Blocks that have already gone through the split stage are not split again to match the new setting; they are merged using the old configuration, which is stored in the meta.json of each split block.

Ruler

The ruler is an optional service that executes PromQL queries to evaluate recording rules and alerts. The ruler requires backend storage for each tenant’s recording rules and alerts.

Grafana Enterprise Metrics extends Cortex’s upstream ruler configuration with the following features:

Remote-write forwarding

Use the following configuration options to enable remote-write rule groups and to specify the directory that stores the generated write-ahead log (WAL). To learn more, see the remote-write rule forwarding documentation.

ruler:
  remote_write:
    [enabled: <boolean> | default = false]
    [wal_dir: <string> | default = "./wal"]

Gateway

For information about the gateway and its configuration, refer to Gateway.