Operations on Grafana Labs

Observability

Tue, 16 Jul 2024 15:42:20 +0000

Observing Grafana Loki

Both Grafana Loki and Promtail expose a /metrics endpoint that serves Prometheus metrics. To collect them, run a local Prometheus and add Loki and Promtail as scrape targets. See configuring Prometheus for more information.

All components of Loki expose the following metrics:

Metric Name	Metric Type	Description
`loki_log_messages_total`	Counter	Total number of messages logged by Loki.
`loki_request_duration_seconds`	Histogram	Time (in seconds) spent serving HTTP requests; its `_count` series gives the number of received HTTP requests.

The Loki Distributors expose the following metrics:

Metric Name	Metric Type	Description
`loki_distributor_ingester_appends_total`	Counter	The total number of batch appends sent to ingesters.
`loki_distributor_ingester_append_failures_total`	Counter	The total number of failed batch appends sent to ingesters.
`loki_distributor_bytes_received_total`	Counter	The total number of uncompressed bytes received per both tenant and retention hours.
`loki_distributor_lines_received_total`	Counter	The total number of log entries received per tenant (not necessarily of lines, as an entry can have more than one line of text).

The Loki Ingesters expose the following metrics:

Metric Name	Metric Type	Description
`cortex_ingester_flush_queue_length`	Gauge	The total number of series pending in the flush queue.
`loki_chunk_store_index_entries_per_chunk`	Histogram	Number of index entries written to storage per chunk.
`loki_ingester_memory_chunks`	Gauge	The total number of chunks in memory.
`loki_ingester_memory_streams`	Gauge	The total number of streams in memory.
`loki_ingester_chunk_age_seconds`	Histogram	Distribution of chunk ages when flushed.
`loki_ingester_chunk_encode_time_seconds`	Histogram	Distribution of chunk encode times.
`loki_ingester_chunk_entries`	Histogram	Distribution of lines per-chunk when flushed.
`loki_ingester_chunk_size_bytes`	Histogram	Distribution of chunk sizes when flushed.
`loki_ingester_chunk_utilization`	Histogram	Distribution of chunk utilization (filled uncompressed bytes vs maximum uncompressed bytes) when flushed.
`loki_ingester_chunk_compression_ratio`	Histogram	Distribution of chunk compression ratio when flushed.
`loki_ingester_chunk_stored_bytes_total`	Counter	Total bytes stored in chunks per tenant.
`loki_ingester_chunks_created_total`	Counter	The total number of chunks created in the ingester.
`loki_ingester_chunks_stored_total`	Counter	Total stored chunks per tenant.
`loki_ingester_received_chunks`	Counter	The total number of chunks received by this ingester whilst joining during the handoff process.
`loki_ingester_samples_per_chunk`	Histogram	The number of samples in a chunk.
`loki_ingester_sent_chunks`	Counter	The total number of chunks sent by this ingester whilst leaving during the handoff process.
`loki_ingester_streams_created_total`	Counter	The total number of streams created per tenant.
`loki_ingester_streams_removed_total`	Counter	The total number of streams removed per tenant.

Promtail exposes these metrics:

Metric Name	Metric Type	Description
`promtail_read_bytes_total`	Gauge	Number of bytes read.
`promtail_read_lines_total`	Counter	Number of lines read.
`promtail_dropped_bytes_total`	Counter	Number of bytes dropped because they failed to be sent to the ingester after all retries.
`promtail_dropped_entries_total`	Counter	Number of log entries dropped because they failed to be sent to the ingester after all retries.
`promtail_encoded_bytes_total`	Counter	Number of bytes encoded and ready to send.
`promtail_file_bytes_total`	Gauge	Number of bytes read from files.
`promtail_files_active_total`	Gauge	Number of active files.
`promtail_request_duration_seconds_count`	Histogram	Number of send requests.
`promtail_sent_bytes_total`	Counter	Number of bytes sent.
`promtail_sent_entries_total`	Counter	Number of log entries sent to the ingester.
`promtail_targets_active_total`	Gauge	Number of total active targets.
`promtail_targets_failed_total`	Counter	Number of total failed targets.

Most of these metrics are counters and should continuously increase during normal operations:

Your app emits a log line to a file that is tracked by Promtail.
Promtail reads the new line and increases its counters.
Promtail forwards the log line to a Loki distributor, where the received counters should increase.
The Loki distributor forwards the log line to a Loki ingester, where the request duration counter should increase.
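
One way to check this flow is to compare two scrapes of a /metrics endpoint and confirm that the counters grew. The following is a minimal sketch, assuming the Prometheus plain-text exposition format; the sample values and the helper names parse_samples and stalled_counters are hypothetical and not part of Loki or Promtail:

```python
def parse_samples(text):
    """Parse 'series value' lines from Prometheus text output into a dict.

    The key is everything before the last space, so it includes any label set.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def stalled_counters(before, after, names):
    """Return the counters in `names` that did not increase between scrapes."""
    return [n for n in names if after.get(n, 0.0) <= before.get(n, 0.0)]


# Hypothetical values from two consecutive scrapes of Promtail.
first = parse_samples(
    "promtail_read_lines_total 120\npromtail_sent_entries_total 118\n"
)
second = parse_samples(
    "promtail_read_lines_total 150\npromtail_sent_entries_total 118\n"
)

watched = ["promtail_read_lines_total", "promtail_sent_entries_total"]
print(stalled_counters(first, second, watched))
```

In this made-up example Promtail keeps reading lines but the sent counter stays flat, which would point at the path between Promtail and the distributor.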

If Promtail uses any pipelines with metrics stages, those metrics will also be exposed by Promtail at its /metrics endpoint. See Promtail’s documentation on Pipelines for more information.

An example Grafana dashboard was built by the community and is available as dashboard 10004.

Metrics cardinality

Some of the Loki observability metrics are emitted per tracked file (active), with the file path included in labels. This increases the quantity of label values across the environment, thereby increasing cardinality. Best practices with Prometheus labels discourage increasing cardinality in this way. Review your emitted metrics before scraping with Prometheus, and configure the scraping to avoid this issue.

Mixins

The Loki repository has a mixin that includes a set of dashboards, recording rules, and alerts. Together, the mixin gives you a comprehensive package for monitoring Loki in production.

For more information about mixins, take a look at the docs for the monitoring-mixins project.

Overrides Exporter

Mon, 14 Apr 2025 21:05:47 +0000

Loki is a multi-tenant system that supports applying limits to each tenant as a mechanism for resource management. The overrides-exporter module exposes these limits as Prometheus metrics in order to help operators better understand tenant behavior.

Context

Configuration updates to tenant limits can be applied to Loki without restart via the runtime_config feature.

Example

The overrides-exporter module is disabled by default. We recommend running a single instance per cluster to avoid issues with metric cardinality. The overrides-exporter creates one metric for every scalar field in the limits configuration under the metric loki_overrides_defaults with the default value for that field after loading the Loki configuration. It also exposes another metric for every differing field for every tenant.

Using an example runtime.yaml:

overrides:
  "tenant_1":
    ingestion_rate_mb: 10
    max_streams_per_user: 100000
    max_chunks_per_query: 100000

Launch an instance of the overrides-exporter:

loki -target=overrides-exporter -runtime-config.file=runtime.yaml -config.file=basic_schema_config.yaml -server.http-listen-port=8080

To inspect the tenant limit overrides:

$ curl -sq localhost:8080/metrics | grep override
# HELP loki_overrides Resource limit overrides applied to tenants
# TYPE loki_overrides gauge
loki_overrides{limit_name="ingestion_rate_mb",user="tenant_1"} 10
loki_overrides{limit_name="max_chunks_per_query",user="tenant_1"} 100000
loki_overrides{limit_name="max_streams_per_user",user="tenant_1"} 100000
# HELP loki_overrides_defaults Default values for resource limit overrides applied to tenants
# TYPE loki_overrides_defaults gauge
loki_overrides_defaults{limit_name="cardinality_limit"} 100000
loki_overrides_defaults{limit_name="creation_grace_period"} 6e+11
loki_overrides_defaults{limit_name="ingestion_burst_size_mb"} 6
loki_overrides_defaults{limit_name="ingestion_rate_mb"} 4
loki_overrides_defaults{limit_name="max_cache_freshness_per_query"} 6e+10
loki_overrides_defaults{limit_name="max_chunks_per_query"} 2e+06
loki_overrides_defaults{limit_name="max_concurrent_tail_requests"} 10
loki_overrides_defaults{limit_name="max_entries_limit_per_query"} 5000
loki_overrides_defaults{limit_name="max_global_streams_per_user"} 5000
loki_overrides_defaults{limit_name="max_label_name_length"} 1024
loki_overrides_defaults{limit_name="max_label_names_per_series"} 30
loki_overrides_defaults{limit_name="max_label_value_length"} 2048
loki_overrides_defaults{limit_name="max_line_size"} 0
loki_overrides_defaults{limit_name="max_queriers_per_tenant"} 0
loki_overrides_defaults{limit_name="max_query_length"} 2.5956e+15
loki_overrides_defaults{limit_name="max_query_lookback"} 0
loki_overrides_defaults{limit_name="max_query_parallelism"} 32
loki_overrides_defaults{limit_name="max_query_series"} 500
loki_overrides_defaults{limit_name="max_streams_matchers_per_query"} 1000
loki_overrides_defaults{limit_name="max_streams_per_user"} 0
loki_overrides_defaults{limit_name="min_sharding_lookback"} 0
loki_overrides_defaults{limit_name="per_stream_rate_limit"} 3.145728e+06
loki_overrides_defaults{limit_name="per_stream_rate_limit_burst"} 1.572864e+07
loki_overrides_defaults{limit_name="per_tenant_override_period"} 1e+10
loki_overrides_defaults{limit_name="reject_old_samples_max_age"} 1.2096e+15
loki_overrides_defaults{limit_name="retention_period"} 2.6784e+15
loki_overrides_defaults{limit_name="ruler_evaluation_delay_duration"} 0
loki_overrides_defaults{limit_name="ruler_max_rule_groups_per_tenant"} 0
loki_overrides_defaults{limit_name="ruler_max_rules_per_rule_group"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_batch_send_deadline"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_capacity"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_max_backoff"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_max_samples_per_send"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_max_shards"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_min_backoff"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_min_shards"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_timeout"} 0
loki_overrides_defaults{limit_name="split_queries_by_interval"} 0

Alerts can be created based on these metrics to inform operators when tenants are close to hitting their limits, so that increases can be applied before the tenant limits are exceeded.
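
Because loki_overrides only carries the fields that differ from the defaults, the effective limit for a tenant is the default value unless an override exists. The following is a minimal sketch of that merge, assuming the exposition output shown above; the function name effective_limits is hypothetical and the parser only handles simple label sets like the ones in this example:

```python
import re

# Matches lines such as: loki_overrides{limit_name="x",user="y"} 10
SAMPLE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def effective_limits(metrics_text):
    """Merge loki_overrides_defaults with per-tenant loki_overrides."""
    defaults, overrides = {}, {}
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # HELP/TYPE comments and unrelated lines
        name = match.group(1)
        labels = dict(LABEL.findall(match.group(2)))
        value = float(match.group(3))
        if name == "loki_overrides_defaults":
            defaults[labels["limit_name"]] = value
        elif name == "loki_overrides":
            overrides.setdefault(labels["user"], {})[labels["limit_name"]] = value
    # Tenant-specific values win over the defaults.
    return {user: {**defaults, **fields} for user, fields in overrides.items()}


sample_output = """\
# TYPE loki_overrides gauge
loki_overrides{limit_name="ingestion_rate_mb",user="tenant_1"} 10
# TYPE loki_overrides_defaults gauge
loki_overrides_defaults{limit_name="ingestion_rate_mb"} 4
loki_overrides_defaults{limit_name="max_query_series"} 500
"""

limits = effective_limits(sample_output)
print(limits["tenant_1"])
```

An alert would then compare a tenant's observed usage against the merged value rather than against the default alone.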

Storage

Mon, 14 Apr 2025 21:05:47 +0000

Grafana Loki Storage

High level storage overview here

Grafana Loki needs to store two different types of data: chunks and indexes.

Loki receives logs in separate streams, where each stream is uniquely identified by its tenant ID and its set of labels. As log entries from a stream arrive, they are compressed as “chunks” and saved in the chunks store. See chunk format for how chunks are stored internally.

The index stores each stream’s label set and links them to the individual chunks.

Refer to Loki’s configuration for details on how to configure the storage and the index.

Supported Stores

The following are supported for the index:

Single Store (boltdb-shipper) - Recommended for Loki 2.0 and newer. This index store keeps boltdb index files in the object store
Amazon DynamoDB
Google Bigtable
Apache Cassandra
BoltDB (doesn’t work when clustering Loki)

The following are supported for the chunks:

Amazon DynamoDB
Google Bigtable
Apache Cassandra
Amazon S3
Google Cloud Storage
Filesystem (read more about the filesystem to understand the pros/cons before using with production data)
Baidu Object Storage

Cloud Storage Permissions

S3

When using S3 as object storage, the following permissions are needed:

s3:ListBucket
s3:PutObject
s3:GetObject
s3:DeleteObject (if running the Single Store (boltdb-shipper) compactor)

Resources: arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/*
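
The permissions and resources above can be assembled into an IAM policy document. The following is a minimal sketch that generates one; the function name s3_policy and the bucket name my-loki-chunks are hypothetical placeholders, and the policy should be reviewed against your own security requirements before use:

```python
import json


def s3_policy(bucket_name, with_compactor=False):
    """Build an IAM policy document with the S3 permissions listed above."""
    actions = ["s3:ListBucket", "s3:PutObject", "s3:GetObject"]
    if with_compactor:
        # Only needed when running the Single Store (boltdb-shipper) compactor.
        actions.append("s3:DeleteObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


policy = s3_policy("my-loki-chunks", with_compactor=True)
print(json.dumps(policy, indent=2))
```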

DynamoDB

When using DynamoDB for the index, the following permissions are needed:

dynamodb:BatchGetItem
dynamodb:BatchWriteItem
dynamodb:DeleteItem
dynamodb:DescribeTable
dynamodb:GetItem
dynamodb:ListTagsOfResource
dynamodb:PutItem
dynamodb:Query
dynamodb:TagResource
dynamodb:UntagResource
dynamodb:UpdateItem
dynamodb:UpdateTable
dynamodb:CreateTable
dynamodb:DeleteTable (if table_manager.retention_period is more than 0s)

Resources: arn:aws:dynamodb:<aws_region>:<aws_account_id>:table/<prefix>*

dynamodb:ListTables

Resources: *

AutoScaling

If you enable autoscaling from table manager, the following permissions are needed:

Application Autoscaling

application-autoscaling:DescribeScalableTargets
application-autoscaling:DescribeScalingPolicies
application-autoscaling:RegisterScalableTarget
application-autoscaling:DeregisterScalableTarget
application-autoscaling:PutScalingPolicy
application-autoscaling:DeleteScalingPolicy

Resources: *

IAM

iam:GetRole
iam:PassRole

Resources: arn:aws:iam::<aws_account_id>:role/<role_name>

Chunk Format

  -------------------------------------------------------------------
  |                               |                                 |
  |        MagicNumber(4b)        |           version(1b)           |
  |                               |                                 |
  -------------------------------------------------------------------
  |         block-1 bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |         block-2 bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |         block-n bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |                        #blocks (uvarint)                        |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  |                      checksum(from #blocks)                     |
  -------------------------------------------------------------------
  |           metasOffset - offset to the point with #blocks        |
  -------------------------------------------------------------------
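
To make the per-block metadata rows in the diagram more concrete, the following sketch decodes one row: #entries, mint, maxt, offset, and len. It assumes that uvarint and varint mean the base-128 and zigzag encodings used by Go's encoding/binary package; the function names and the sample bytes are hypothetical, and this is not Loki's own decoder:

```python
def read_uvarint(buf, pos):
    """Decode an unsigned base-128 varint; return (value, next_position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear marks the final byte
            return result, pos
        shift += 7


def read_varint(buf, pos):
    """Decode a signed (zigzag) varint; return (value, next_position)."""
    raw, pos = read_uvarint(buf, pos)
    return (raw >> 1) ^ -(raw & 1), pos


def read_block_meta(buf, pos):
    """Decode one metadata row in the order shown in the diagram."""
    entries, pos = read_uvarint(buf, pos)
    mint, pos = read_varint(buf, pos)
    maxt, pos = read_varint(buf, pos)
    offset, pos = read_uvarint(buf, pos)
    length, pos = read_uvarint(buf, pos)
    meta = {"entries": entries, "mint": mint, "maxt": maxt,
            "offset": offset, "len": length}
    return meta, pos


# Hypothetical row: 3 entries, mint=100, maxt=150, offset=5, len=300.
row = bytes([0x03, 0xC8, 0x01, 0xAC, 0x02, 0x05, 0xAC, 0x02])
meta, next_pos = read_block_meta(row, 0)
print(meta, next_pos)
```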

Loki Canary

Tue, 16 Jul 2024 15:42:20 +0000

Loki Canary is a standalone app that audits the log-capturing performance of a Grafana Loki cluster.

Loki Canary generates artificial log lines. These log lines are sent to the Loki cluster. Loki Canary communicates with the Loki cluster to capture metrics about the artificial log lines, such that Loki Canary forms information about the performance of the Loki cluster. The information is available as Prometheus time series metrics.

Loki Canary writes a log to a file and stores the timestamp in an internal array. The contents look something like this:

1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp

The relevant part of the log entry is the timestamp; the ps are just filler bytes to make the size of the log configurable.

An agent (like Promtail) should be configured to read the log file and ship it to Loki.

Meanwhile, Loki Canary will open a WebSocket connection to Loki and will tail the logs it creates. When a log is received on the WebSocket, the timestamp in the log message is compared to the internal array.

If the received log is:

The next in the array to be received, it is removed from the array and the (current time - log timestamp) is recorded in the response_latency histogram. This is the expected behavior for well behaving logs.
Not the next in the array to be received, it is removed from the array, the response time is recorded in the response_latency histogram, and the out_of_order_entries counter is incremented.
Not in the array at all, it is checked against a separate list of received logs to either increment the duplicate_entries counter or the unexpected_entries counter.

In the background, Loki Canary also runs a timer which iterates through all of the entries in the internal array. If any of the entries are older than the duration specified by the -wait flag (defaulting to 60s), they are removed from the array and the websocket_missing_entries counter is incremented. An additional query is then made directly to Loki for any missing entries to determine if they are truly missing or only missing from the WebSocket. If missing entries are not found in the direct query, the missing_entries counter is incremented.
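
The bookkeeping described above can be sketched as a small tracker. This is an illustration of the logic only, not the canary's actual code: the class name Tracker, its methods, and the integer timestamps are hypothetical, and a plain list stands in for the response_latency histogram:

```python
class Tracker:
    def __init__(self):
        self.pending = []      # timestamps written but not yet seen on the tail
        self.received = set()  # timestamps already seen on the tail
        self.latencies = []    # stand-in for the response_latency histogram
        self.counters = {
            "out_of_order_entries": 0,
            "duplicate_entries": 0,
            "unexpected_entries": 0,
            "websocket_missing_entries": 0,
        }

    def sent(self, ts):
        self.pending.append(ts)

    def on_receive(self, ts, now):
        if ts in self.pending:
            if self.pending[0] != ts:
                # Present, but not the next entry we expected.
                self.counters["out_of_order_entries"] += 1
            self.pending.remove(ts)
            self.latencies.append(now - ts)
            self.received.add(ts)
        elif ts in self.received:
            self.counters["duplicate_entries"] += 1
        else:
            self.counters["unexpected_entries"] += 1

    def prune(self, now, wait):
        """Drop entries older than `wait`; return them for a direct query."""
        stale = [ts for ts in self.pending if now - ts > wait]
        self.pending = [ts for ts in self.pending if now - ts <= wait]
        self.counters["websocket_missing_entries"] += len(stale)
        return stale


t = Tracker()
for ts in (1, 2, 3, 4):
    t.sent(ts)
t.on_receive(1, now=5)   # next in the array
t.on_receive(3, now=5)   # out of order, because 2 is still pending
t.on_receive(3, now=6)   # duplicate
t.on_receive(99, now=6)  # never sent, so unexpected
missing = t.prune(now=70, wait=60)
print(t.counters, missing)
```

The entries returned by prune are the ones that would be looked up with a direct query before missing_entries is incremented.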

Additional Queries

Spot Check

Starting with version 1.6.0, the canary spot checks certain results over time to make sure they are present in Loki. This is helpful for testing the transition of in-memory logs in the ingester to the store, to make sure nothing is lost.

-spot-check-interval and -spot-check-max are used to tune this feature. At every -spot-check-interval, the canary pulls a log entry from the stream and saves it in a separate list; entries are kept in that list for up to -spot-check-max.

Every -spot-check-query-rate, Loki is queried for each entry in this list and loki_canary_spot_check_entries_total is incremented. If a result is missing, loki_canary_spot_check_missing_entries_total is incremented.

The defaults of 15m for spot-check-interval and 4h for spot-check-max mean that after 4 hours of running, the canary will have a list of 16 entries that it queries every minute (the default spot-check-query-rate is 1m). Be aware of the query load this can put on Loki if you have a lot of canaries.
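
The arithmetic behind that load estimate can be sketched as follows, using the default values quoted above. The function name entry_checks_per_hour and the fleet size of 10 canaries are hypothetical, and the sketch assumes each listed entry is checked once per query run:

```python
from datetime import timedelta

spot_check_interval = timedelta(minutes=15)
spot_check_max = timedelta(hours=4)
spot_check_query_rate = timedelta(minutes=1)

# Dividing one timedelta by another with // yields an integer.
entries_per_canary = spot_check_max // spot_check_interval
runs_per_hour = timedelta(hours=1) // spot_check_query_rate


def entry_checks_per_hour(canaries):
    """Entry lookups per hour once every canary's list is full."""
    return canaries * entries_per_canary * runs_per_hour


print(entries_per_canary)         # 16
print(entry_checks_per_hour(10))  # 9600
```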

NOTE: if you are using out-of-order-percentage to test ingestion of out-of-order log lines be sure not to set the two out of order time range flags too far in the past. The defaults are already enough to test this functionality properly, and setting them too far in the past can cause issues with the spot check test.

When using out-of-order-percentage you also need to make use of pipeline stages in your Promtail configuration in order to set the timestamps correctly as the logs are pushed to Loki. The client/promtail/pipelines docs have examples of how to do this.

Metric Test

Loki Canary will run a metric query count_over_time to verify that the rate of logs being stored in Loki corresponds to the rate they are being created by Loki Canary.

-metric-test-interval and -metric-test-range are used to tune this feature. By default, the canary runs a count_over_time instant query against Loki every 1h (the -metric-test-interval default listed under All options) for a range of 24h.

If the canary has not run for -metric-test-range (24h) the query range is adjusted to the amount of time the canary has been running such that the rate can be calculated since the canary was started.

The canary calculates what the expected count of logs would be for the range (also adjusting this based on canary runtime) and compares the expected result with the actual result returned from Loki. The difference is stored as the value in the gauge loki_canary_metric_test_deviation.

Some deviation is expected: computing an expected count from the write rate and comparing it to actual query data is imperfect, and will lead to a deviation of a few log entries.

It’s not expected for there to be a deviation of more than 3-4 log entries.
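
The comparison can be sketched as follows. This is an assumption-laden illustration rather than the canary's implementation: the function names are hypothetical, the expected count is taken to be the effective range divided by the write interval, and the sign convention (expected minus actual) is a guess:

```python
from datetime import timedelta


def expected_count(write_interval, test_range, runtime):
    """Expected number of entries, truncating the range to the canary's runtime."""
    effective_range = min(test_range, runtime)
    return effective_range / write_interval  # timedelta / timedelta -> float


def deviation(expected, actual):
    """Difference that would be reported in the gauge (sign is an assumption)."""
    return expected - actual


# A canary writing one entry per second that has been running for 2 hours.
expected = expected_count(
    write_interval=timedelta(seconds=1),
    test_range=timedelta(hours=24),
    runtime=timedelta(hours=2),
)
print(expected)                   # 7200.0
print(deviation(expected, 7198))  # 2.0
```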

Control

Loki Canary responds to two endpoints to allow dynamic suspending and resuming of the canary process. This can be useful if you’d like to quickly disable or re-enable the canary. To stop or start the canary, issue an HTTP GET request against the /suspend or /resume endpoint.

Installation

Binary

Loki Canary is provided as a pre-compiled binary as part of the Loki Releases on GitHub.

Docker

Loki Canary is also provided as a Docker container image:

# change tag to the most recent release
$ docker pull grafana/loki-canary:2.0.0

Kubernetes

To run on Kubernetes, you can do something simple like:

kubectl run loki-canary --image=grafana/loki-canary:latest --restart=Never --image-pull-policy=IfNotPresent --labels=name=loki-canary -- -addr=loki:3100

Or you can do something more complex, like deploying it as a DaemonSet. There is a Tanka setup for this in the production folder, which you can import using jsonnet-bundler:

jb install github.com/grafana/loki/production/ksonnet/loki-canary

Then in your Tanka environment’s main.jsonnet you’ll want something like this:

local loki_canary = import 'loki-canary/loki-canary.libsonnet';

loki_canary {
  loki_canary_args+:: {
    addr: "loki:3100",
    port: 80,
    labelname: "instance",
    interval: "100ms",
    size: 1024,
    wait: "3m",
  },
  _config+:: {
    namespace: "default",
  }
}

Examples

Standalone Pod Implementation of loki-canary

---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  containers:
  - args:
    - -addr=loki:3100
    image: grafana/loki-canary:latest
    imagePullPolicy: IfNotPresent
    name: loki-canary
    resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500

DaemonSet Implementation of loki-canary

---
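kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  # apps/v1 requires an explicit selector that matches the pod template labels
  selector:
    matchLabels:
      app: loki-canary
  template: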
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      containers:
      - args:
        - -addr=loki:3100
        image: grafana/loki-canary:latest
        imagePullPolicy: IfNotPresent
        name: loki-canary
        resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500

From Source

If the other options are not sufficient for your use case, you can compile loki-canary yourself:

# clone the source tree
$ git clone https://github.com/grafana/loki
$ cd loki

# build the binary
$ make loki-canary

# (optionally build the container image)
$ make loki-canary-image

Configuration

The address of Loki must be passed in with the -addr flag, and if your Loki server uses TLS, -tls=true must also be provided. Note that using TLS will cause the WebSocket connection to use wss:// instead of ws://.

The -labelname and -labelvalue flags should also be provided, as these are used by Loki Canary to filter the log stream to only process logs for the current instance of the canary. Ensure that the values provided to the flags are unique to each instance of Loki Canary. Grafana Labs’ Tanka config accomplishes this by passing in the pod name as the label value.

If Loki Canary reports a high number of unexpected_entries, Loki Canary may not be waiting long enough and the value for the -wait flag should be increased to a larger value than 60s.

Be aware of the relationship between pruneinterval and the interval. For example, with an interval of 10ms (100 logs per second) and a prune interval of 60s, you will write 6000 logs per minute. If those logs were not received over the WebSocket, the canary will attempt to query Loki directly to see if they are completely lost. However the query return is limited to 1000 results so you will not be able to return all the logs even if they did make it to Loki.

Likewise, if you lower the pruneinterval you risk causing a denial of service attack as all your canaries attempt to query for missing logs at whatever your pruneinterval is defined at.
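
The relationship can be checked with a little arithmetic before picking values. The following is a minimal sketch; the function names are hypothetical and the limit of 1000 results is the figure quoted above:

```python
from datetime import timedelta

QUERY_RESULT_LIMIT = 1000  # result limit stated in the text above


def entries_per_prune(interval, prune_interval):
    """Number of log entries written during one prune interval."""
    return prune_interval // interval


def fits_in_one_query(interval, prune_interval):
    """True if a full prune interval of entries fits within the query limit."""
    return entries_per_prune(interval, prune_interval) <= QUERY_RESULT_LIMIT


fast = timedelta(milliseconds=10)
print(entries_per_prune(fast, timedelta(seconds=60)))  # 6000
print(fits_in_one_query(fast, timedelta(seconds=60)))  # False
print(fits_in_one_query(timedelta(seconds=1), timedelta(minutes=1)))  # True
```

With the default interval of 1s and prune interval of 1m, only 60 entries are written per prune cycle, which stays well under the limit.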

All options:

  -addr string
        The Loki server URL:Port, e.g. loki:3100
  -buckets int
        Number of buckets in the response_latency histogram (default 10)
  -interval duration
        Duration between log entries (default 1s)
  -labelname string
        The label name for this instance of Loki Canary to use in the log selector
        (default "name")
  -labelvalue string
        The unique label value for this instance of Loki Canary to use in the log selector
        (default "loki-canary")
  -metric-test-interval duration
        The interval the metric test query should be run (default 1h0m0s)
  -metric-test-range duration
        The range value [24h] used in the metric test instant-query. This value is truncated
        to the running time of the canary until this value is reached (default 24h0m0s)
  -out-of-order-max duration
        Maximum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 60s)
  -out-of-order-min duration
        Minimum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 30s)
  -out-of-order-percentage int
        Percentage (0-100) of log entries that should be sent out of order
  -pass string
        Loki password
  -port int
        Port which Loki Canary should expose metrics (default 3500)
  -pruneinterval duration
        Frequency to check sent versus received logs, and also the frequency at which queries
        for missing logs will be dispatched to Loki, and the frequency spot check queries are run
        (default 1m0s)
  -query-timeout duration
        How long to wait for a query response from Loki (default 10s)
  -size int
        Size in bytes of each log line (default 100)
  -spot-check-interval duration
        Interval that a single result will be kept from sent entries and spot-checked against
        Loki. For example, with the 15 minute default, one entry every 15 minutes will be saved,
        and then queried again every 15 minutes until the time defined by spot-check-max is
        reached (default 15m0s)
  -spot-check-max duration
        How far back to check a spot check entry before dropping it (default 4h0m0s)
  -spot-check-query-rate duration
        Interval that Loki Canary will query Loki for the current list of all spot check entries
        (default 1m0s)
  -streamname string
        The stream name for this instance of Loki Canary to use in the log selector
        (default "stream")
  -streamvalue string
        The unique stream value for this instance of Loki Canary to use in the log selector
        (default "stdout")
  -tenant-id string
        Tenant ID to be set in X-Scope-OrgID header.
  -tls
        Does the Loki connection use TLS?
  -user string
        Loki user name
  -version
        Print this build's version information
  -wait duration
        Duration to wait for log entries before reporting them as lost (default 1m0s)

Shuffle sharding

Tue, 16 Jul 2024 15:42:20 +0000

Shuffle sharding is a resource-management technique used to isolate tenant workloads from other tenant workloads, to give each tenant more of a single-tenant experience when running in a shared cluster. This technique is explained by AWS in their article Workload isolation using shuffle-sharding. A reference implementation has been shown in the Route53 Infima library.

The issues that shuffle sharding mitigates

Shuffle sharding can be configured for the query path.

The query path is sharded by default, and the default does not use shuffle sharding. Each tenant’s query is sharded across all queriers, so the workload uses all querier instances.

In a multi-tenant cluster, sharding across all instances of a component may exhibit these issues:

Any outage of a component instance affects all tenants
A misbehaving tenant affects all other tenants

An individual query may create issues for all tenants. A single tenant or a group of tenants may issue an expensive query: one that causes a querier component to hit an out-of-memory error, or one that causes a querier component to crash. Once the error occurs, the tenant or tenants issuing the error-causing query will be reassigned to other running queriers, up to the limit imposed by the max_queriers_per_tenant configuration. This, in turn, may affect the queriers that have been reassigned.

How shuffle sharding works

The idea of shuffle sharding is to assign each tenant to a shard composed of a subset of the Loki queriers, aiming to minimize the overlapping instances between distinct tenants.

A misbehaving tenant will affect only its shard’s queriers. Due to the low overlap of queriers among tenants, only a small subset of tenants will be affected by the misbehaving tenant. Shuffle sharding requires no more resources than the default sharding strategy.

Shuffle sharding does not fix all issues. If a tenant repeatedly sends a problematic query, the crashed querier will be disconnected from the query-frontend, and a new querier will be immediately assigned to the tenant’s shard. This invalidates the positive effects of shuffle sharding. In this case, configuring a delay between when a querier disconnects because of a crash, and when the crashed querier is actually removed from the tenant’s shard and another healthy querier is added as a replacement improves the situation. A delay of 1 minute may be a reasonable value in the query-frontend with configuration parameter -query-frontend.querier-forget-delay=1m, and in the query-scheduler with configuration parameter -query-scheduler.querier-forget-delay=1m.

Low probability of overlapping instances

If an example Loki cluster runs 50 queriers and assigns each tenant 4 out of 50 queriers, shuffling instances between each tenant, there are 230K possible combinations.

Statistically, randomly picking two distinct tenants, there is:

a 71% chance that they will not share any instance
a 26% chance that they will share only 1 instance
a 2.7% chance that they will share 2 instances
a 0.08% chance that they will share 3 instances
only a 0.0004% chance that their instances will fully overlap
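
These figures follow from counting combinations: the number of ways to share exactly k queriers, divided by the number of possible shards. The following is a minimal sketch that reproduces them; the function name overlap_probability is hypothetical, and it assumes each tenant's shard is drawn uniformly at random:

```python
from math import comb


def overlap_probability(total, per_tenant, shared):
    """Probability that two random shards share exactly `shared` queriers."""
    ways = comb(per_tenant, shared) * comb(total - per_tenant, per_tenant - shared)
    return ways / comb(total, per_tenant)


print(comb(50, 4))  # 230300 possible shards
for shared in range(5):
    print(shared, f"{overlap_probability(50, 4, shared):.4%}")
```

Changing the arguments shows how the overlap grows as the shard size approaches the number of queriers.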

Configuration

Enable shuffle sharding by setting -frontend.max-queriers-per-tenant to a value higher than 0 and lower than the number of available queriers. The value of the per-tenant configuration max_queriers_per_tenant sets the quantity of allocated queriers. This option is only available when using the query-frontend, with or without a scheduler.

The per-tenant configuration parameter max_query_parallelism describes how many sub queries, after query splitting and query sharding, can be scheduled to run at the same time for each request of any tenant.

The configuration parameter querier.concurrency controls the quantity of worker threads (goroutines) per single querier.

The maximum number of queriers can be overridden on a per-tenant basis in the limits overrides configuration by max_queriers_per_tenant.

Shuffle sharding metrics

These metrics reveal information relevant to shuffle sharding:

the overall query-scheduler queue duration, cortex_query_scheduler_queue_duration_seconds_*
the query-scheduler queue length per tenant, cortex_query_scheduler_queue_length

the query-scheduler queue duration per tenant can be found with this query:

max_over_time({cluster="$cluster",container="query-frontend", namespace="$namespace"} |= "metrics.go" |logfmt | unwrap duration(queue_time) | __error__="" [5m]) by (org_id)

Too many spikes in any of these metrics may imply:

A particular tenant is trying to use more query resources than they were allocated.
That tenant may need an increase in the value of max_queriers_per_tenant.
Loki instances may be under provisioned.

A useful query checks how many queriers are being used by each tenant:

count by (org_id) (sum by (org_id, pod) (count_over_time({job="$namespace/querier", cluster="$cluster"} |= "metrics.go" | logfmt [$__interval])))

Recording Rules

Tue, 16 Jul 2024 15:42:20 +0000

Recording rules are evaluated by the ruler component. Each ruler acts as its own querier, in the sense that it executes queries against the store without using the query-frontend or querier components. It will respect all query limits put in place for the querier.

Loki’s implementation of recording rules largely reuses Prometheus’ code.

Samples generated by recording rules are sent to Prometheus using Prometheus’ remote-write feature.

Write-Ahead Log (WAL)

All samples generated by recording rules are written to a WAL. The WAL’s main benefit is that it persists the samples generated by recording rules to disk, which means that if your ruler crashes, you won’t lose any data. We are trading off extra memory usage and slower start-up times for this functionality.

A WAL is created per tenant; this is done to prevent cross-tenant interactions. If all samples were to be written to a single WAL, this would increase the chances that one tenant could cause data-loss for others. A typical scenario here is that Prometheus will, for example, reject a remote-write request with 100 samples if just 1 of those samples is invalid in some way.
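The isolation argument can be illustrated with a small simulation. It assumes the all-or-nothing rejection described in the scenario above; the function names and sample data are hypothetical and do not mirror the ruler's internals.

```python
from collections import defaultdict


def remote_accepts(batch):
    # Assumed all-or-nothing behaviour: a single invalid sample causes the
    # whole request to be rejected.
    return all(sample["valid"] for sample in batch)


def delivered_shared(samples):
    """One batch for everyone; returns tenants whose samples were accepted."""
    return {s["tenant"] for s in samples} if remote_accepts(samples) else set()


def delivered_per_tenant(samples):
    """One batch per tenant; a bad sample only affects its own tenant."""
    batches = defaultdict(list)
    for s in samples:
        batches[s["tenant"]].append(s)
    return {tenant for tenant, batch in batches.items() if remote_accepts(batch)}


# 99 valid samples from tenant-a and 1 invalid sample from tenant-b.
samples = [{"tenant": "tenant-a", "valid": True} for _ in range(99)]
samples.append({"tenant": "tenant-b", "valid": False})
shared_result = delivered_shared(samples)
isolated_result = delivered_per_tenant(samples)
```

With a shared batch, tenant-b's single invalid sample causes tenant-a's 99 valid samples to be rejected as well; with per-tenant batches, only tenant-b is affected.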

Start-up

When the ruler starts up, it will load the WALs for the tenants who have recording rules. These WAL files are stored on disk and are loaded into memory.

Note: WALs are loaded one at a time upon start-up. This is a current limitation of the Loki ruler. For this reason, it is advisable to keep the number of rule groups serviced by a ruler to a reasonable size, since no rule evaluation occurs while WAL replay is in progress (this includes alerting rules).

Truncation

WAL files are regularly truncated to reduce their size on disk. This guide from one of the Prometheus maintainers (Ganesh Vernekar) gives an excellent overview of the truncation, checkpointing, and replaying of the WAL.

Cleaner

WAL Cleaner is an experimental feature.

The WAL Cleaner watches for abandoned WALs (tenants who no longer have recording rules associated) and deletes them. Enable this feature only if you are running into storage concerns with WALs that are too large. Because WALs are truncated regularly, they should not normally grow excessively large.
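The check the cleaner performs can be pictured as a set difference between the tenants that have a WAL on disk and the tenants that still have recording rules. The sketch below only reports candidates and deletes nothing; the one-directory-per-tenant layout and the function name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_abandoned_wals(wal_root, tenants_with_rules):
    """Return tenant WAL directories whose tenant has no recording rules."""
    active = set(tenants_with_rules)
    return sorted(
        entry.name
        for entry in Path(wal_root).iterdir()
        if entry.is_dir() and entry.name not in active
    )


# Demo on a throwaway directory; the per-tenant layout is an assumption.
with tempfile.TemporaryDirectory() as wal_root:
    for tenant in ("tenant-a", "tenant-b", "tenant-c"):
        (Path(wal_root) / tenant).mkdir()
    abandoned = find_abandoned_wals(wal_root, ["tenant-a", "tenant-c"])
```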

Scaling

See Mimir’s guide for configuring Grafana Mimir hash rings for scaling the ruler using a ring.

Note: the ruler shards by rule group, not by individual rules. This is an artifact of Prometheus, where recording rules within a group must run in order because one recording rule can reuse another; this reuse is not possible in Loki.
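As a rough picture of what sharding by rule group means, the sketch below maps each rule group to exactly one ruler, so every rule in a group is evaluated by the same instance. It uses plain modulo hashing rather than a real token-based hash ring, and the key format and names are assumptions.

```python
import hashlib


def owner_of_group(tenant, namespace, group, rulers):
    """Pick the ruler responsible for a whole rule group (simplified)."""
    # The key identifies the group, not an individual rule, so all rules in
    # the group always land on the same ruler.
    key = f"{tenant}/{namespace}/{group}".encode()
    token = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return rulers[token % len(rulers)]


# Hypothetical ruler instances and rule group.
rulers = ["ruler-0", "ruler-1", "ruler-2"]
assigned = owner_of_group("tenant-a", "prod", "latency-rules", rulers)
```

Unlike a real ring, modulo hashing reshuffles most groups when the number of rulers changes, which is one reason ring-based schemes are used in practice.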

Deployment

The ruler needs to persist its WAL files to disk, and it incurs a bit of a start-up cost by reading these WALs into memory. As such, it is recommended that you try to minimize churn of individual ruler instances since rule evaluation is blocked while the WALs are being read from disk.

Kubernetes

It is recommended that you run the rulers using StatefulSets. The ruler will write its WAL files to persistent storage, so a Persistent Volume should be utilised.

Remote-Write

Per-Tenant Limits

Remote-write can be configured at a global level in the base configuration, and certain parameters can be tuned on a per-tenant basis. Most of the configuration options defined here have override options (which can also be applied at runtime!).

Tuning

Remote-write can be tuned if the default configuration is insufficient (see Failure Modes below).

There is a guide on the Prometheus website, all of which applies to Loki, too.

Observability

Since Loki reuses the Prometheus code for recording rules and WALs, it also gains all of Prometheus’ observability.

Prometheus exposes a number of metrics for its WAL implementation, and these have all been prefixed with loki_ruler_wal_.

For example: prometheus_remote_storage_bytes_total → loki_ruler_wal_prometheus_remote_storage_bytes_total

Additional metrics are exposed, also with the prefix loki_ruler_wal_. All per-tenant metrics contain a tenant label, so be aware that cardinality could begin to be a concern if the number of tenants grows sufficiently large.

Some key metrics to note are:

loki_ruler_wal_appender_ready: whether a WAL appender is ready to accept samples (1) or not (0)
loki_ruler_wal_prometheus_remote_storage_samples_total: number of samples sent per tenant to remote storage
loki_ruler_wal_prometheus_remote_storage_samples...
- loki_ruler_wal_prometheus_remote_storage_samples_pending_total: samples buffered in memory, waiting to be sent to remote storage
- loki_ruler_wal_prometheus_remote_storage_samples_failed_total: samples that failed when sent to remote storage
- loki_ruler_wal_prometheus_remote_storage_samples_dropped_total: samples dropped by relabel configurations
- loki_ruler_wal_prometheus_remote_storage_samples_retried_total: samples resent to remote storage
loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds: highest timestamp of sample appended to WAL
loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds: highest timestamp of sample sent to remote storage.

We’ve created a basic dashboard in our loki-mixin which you can use to administer recording rules.

Failure Modes

Remote-Write Lagging

Remote-write can lag behind for many reasons:

1. Remote-write storage (Prometheus) is temporarily unavailable
2. A tenant is producing samples too quickly from a recording rule
3. Remote-write is tuned too low, creating backpressure

The lag can be determined by subtracting loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds from loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds.

In case 1, the ruler will continue to retry sending these samples until the remote storage becomes available again. Be aware that if the remote storage is down for longer than ruler.wal.max-age, data loss may occur once the WAL is truncated.

In cases 2 & 3, you should consider tuning remote-write appropriately.
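The lag calculation described above is a simple subtraction of two gauge values. The sketch below shows it together with a rough check against the WAL's maximum age; the timestamps, the 4-hour value, the clamping to zero, and the use of lag as a stand-in for outage duration are all assumptions for illustration.

```python
def remote_write_lag_seconds(highest_appended_ts, highest_sent_ts):
    """Seconds between the newest sample in the WAL and the newest sample sent."""
    # Clamp to zero so small clock or scrape skew never yields a negative lag.
    return max(0.0, highest_appended_ts - highest_sent_ts)


def at_risk_of_data_loss(lag_seconds, wal_max_age_seconds):
    """True when unsent samples are older than the WAL is allowed to keep."""
    return lag_seconds > wal_max_age_seconds


# Hypothetical values scraped from the two gauges for one tenant.
highest_appended = 1_721_144_540.0  # ..._highest_timestamp_in_seconds
highest_sent = 1_721_143_940.0      # ..._queue_highest_sent_timestamp_seconds
lag = remote_write_lag_seconds(highest_appended, highest_sent)
risky = at_risk_of_data_loss(lag, wal_max_age_seconds=4 * 3600)
```

In this example the lag is ten minutes, well below the assumed maximum age, so no samples would be expected to be truncated before being sent.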

Further reading: see this blog post by Prometheus maintainer Callum Styan.

Appender Not Ready

Each tenant’s WAL has an “appender” internally; this appender is used to append samples to the WAL. The appender is marked as not ready until the WAL replay is complete upon startup. If the WAL is corrupted for some reason, or is taking a long time to replay, you can determine this by alerting on loki_ruler_wal_appender_ready < 1.
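Outside of an alerting rule, the same condition can be checked directly against the text served on /metrics. This is a minimal sketch that handles only simple label sets; the sample payload is made up, and a real deployment would more likely rely on a Prometheus alert.

```python
import re

# Matches lines such as: loki_ruler_wal_appender_ready{tenant="x"} 1
SAMPLE_RE = re.compile(r'^loki_ruler_wal_appender_ready\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def tenants_not_ready(metrics_text):
    """Return tenants whose WAL appender reports a value below 1."""
    not_ready = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group(1)))
        if float(match.group(2)) < 1:
            not_ready.append(labels.get("tenant", ""))
    return sorted(not_ready)


# Hypothetical scrape output.
sample = """
# TYPE loki_ruler_wal_appender_ready gauge
loki_ruler_wal_appender_ready{tenant="tenant-a"} 1
loki_ruler_wal_appender_ready{tenant="tenant-b"} 0
"""
stuck = tenants_not_ready(sample)
```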

Corrupt WAL

If a disk fails or the ruler does not terminate correctly, there’s a chance one or more tenant WALs can become corrupted. A mechanism exists for automatically repairing the WAL, but this cannot handle every conceivable scenario. When automatic repair fails, the loki_ruler_wal_corruptions_repair_failed_total metric is incremented.

Found another failure mode?

Please open an issue and tell us about it!