Manage Loki on Grafana Labs

Audit data propagation latency and correctness using Loki Canary

Thu, 09 Apr 2026 02:28:18 +0000

Loki Canary is a standalone app that audits the log-capturing performance of a Grafana Loki cluster.
This component emits and periodically queries for logs, making sure that Loki is ingesting logs without any data loss. When something is wrong with Loki, the Canary often provides the first indication.

Loki Canary generates artificial log lines, which are sent to the Loki cluster. Loki Canary then communicates with the cluster to capture metrics about those log lines, and these metrics describe how the cluster is performing. The metrics are exposed as Prometheus time series.

Loki Canary writes a log to standard output and stores the timestamp in an internal array. The contents look something like this:

1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp

The relevant part of the log entry is the timestamp; the ps are just filler bytes to make the size of the log configurable.
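As a minimal sketch of that line format, the following Python builds and parses an entry. It assumes the configured size covers the timestamp, the separating space, and the filler together; the function names are hypothetical and are not part of Loki Canary.

```python
import time


def make_canary_line(size: int, now_ns: int) -> str:
    """Build a line of `size` characters: nanosecond timestamp, space, 'p' filler."""
    ts = str(now_ns)
    # Assumption: the size budget includes the timestamp and the space.
    padding = max(size - len(ts) - 1, 0)
    return f"{ts} {'p' * padding}"


def parse_canary_timestamp(line: str) -> int:
    """Return the leading nanosecond timestamp; the filler bytes are ignored."""
    return int(line.split(" ", 1)[0])


line = make_canary_line(100, time.time_ns())
print(len(line), parse_canary_timestamp(line))
```

Only the parsed timestamp matters for the comparison against the internal array; the filler exists purely to reach the requested size.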

Loki Canary’s standard output should be captured and written to a file. An agent (like Grafana Alloy) should be configured to read the log file and ship it to Loki.

Meanwhile, Loki Canary will open a WebSocket connection to Loki and will tail the logs it creates. When a log is received on the WebSocket, the timestamp in the log message is compared to the internal array.

If the received log is:

- The next in the array to be received, it is removed from the array and the latency (current time - log timestamp) is recorded in the response_latency histogram. This is the expected behavior for well-behaved logs.
- Not the next in the array to be received, it is removed from the array, the response time is recorded in the response_latency histogram, and the out_of_order_entries counter is incremented.
- Not in the array at all, it is checked against a separate list of received logs to increment either the duplicate_entries counter or the unexpected_entries counter.
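The three cases above can be sketched as follows. This is an illustration of the described behavior, not Loki Canary's actual implementation; the class, its attributes, and the use of plain lists in place of real Prometheus metrics are all assumptions.

```python
class EntryTracker:
    """Tracks sent timestamps and classifies entries received on the tail."""

    def __init__(self):
        self.pending = []       # sent timestamps not yet confirmed, oldest first
        self.confirmed = set()  # timestamps already received once
        self.latencies = []     # stand-in for the response_latency histogram
        self.counters = {
            "out_of_order_entries": 0,
            "duplicate_entries": 0,
            "unexpected_entries": 0,
        }

    def sent(self, ts):
        self.pending.append(ts)

    def received(self, ts, now):
        if ts in self.pending:
            # In the array: record latency, and flag it if it was not the next one.
            if self.pending[0] != ts:
                self.counters["out_of_order_entries"] += 1
            self.pending.remove(ts)
            self.confirmed.add(ts)
            self.latencies.append(now - ts)
        elif ts in self.confirmed:
            self.counters["duplicate_entries"] += 1
        else:
            self.counters["unexpected_entries"] += 1


tracker = EntryTracker()
for ts in (10, 20, 30):
    tracker.sent(ts)
tracker.received(20, now=25)  # not the next expected entry: out of order
tracker.received(10, now=26)  # now the head of the list: in order
tracker.received(10, now=27)  # already confirmed: duplicate
tracker.received(99, now=28)  # never sent: unexpected
print(tracker.counters, tracker.latencies)
```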

In the background, Loki Canary also runs a timer which iterates through all of the entries in the internal array. If any of the entries are older than the duration specified by the -wait flag (defaulting to 60s), they are removed from the array and the websocket_missing_entries counter is incremented. An additional query is then made directly to Loki for any missing entries to determine if they are truly missing or only missing from the WebSocket. If missing entries are not found in the direct query, the missing_entries counter is incremented.

Additional Queries

Spot Check

Starting with version 1.6.0, the canary spot checks certain results over time to make sure they are still present in Loki. This is helpful for testing the transition of in-memory logs in the ingester to the store, to make sure nothing is lost.

The -spot-check-interval and -spot-check-max flags tune this feature. At every -spot-check-interval, the canary pulls a log entry from the stream and saves it in a separate list, keeping each entry for up to -spot-check-max.

Every -spot-check-query-rate, Loki is queried for each entry in this list and loki_canary_spot_check_entries_total is incremented. If a result is missing, loki_canary_spot_check_missing_entries_total is incremented.

With the defaults of 15m for -spot-check-interval and 4h for -spot-check-max, after 4 hours of running the canary holds a list of 16 entries, which it queries every minute (the default -spot-check-query-rate is 1m). Be aware of the query load this can put on Loki if you run many canaries.
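To estimate that load for your own flag values, the arithmetic can be sketched in Python. The function name and the canaries parameter are hypothetical, and the sketch assumes one query per saved entry per cycle, as described above.

```python
from datetime import timedelta


def spot_check_load(interval, max_age, query_rate, canaries=1):
    """Return (entries held per canary, spot-check queries per hour overall)."""
    entries = max_age // interval                 # timedelta // timedelta gives an int
    cycles_per_hour = timedelta(hours=1) // query_rate
    return entries, entries * cycles_per_hour * canaries


entries, per_hour = spot_check_load(
    interval=timedelta(minutes=15),   # -spot-check-interval default
    max_age=timedelta(hours=4),       # -spot-check-max default
    query_rate=timedelta(minutes=1),  # -spot-check-query-rate default
)
print(entries, per_hour)  # 16 960
```

With ten canaries on the same defaults, the steady-state total would be 9600 spot-check queries per hour.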

NOTE: if you are using out-of-order-percentage to test ingestion of out-of-order log lines be sure not to set the two out of order time range flags too far in the past. The defaults are already enough to test this functionality properly, and setting them too far in the past can cause issues with the spot check test.

When using out-of-order-percentage you also need to make use of pipeline stages in your Alloy configuration in order to set the timestamps correctly as the logs are pushed to Loki. The Alloy loki.process docs have examples of how to do this.

Metric Test

Loki Canary will run a metric query count_over_time to verify that the rate of logs being stored in Loki corresponds to the rate they are being created by Loki Canary.

The -metric-test-interval and -metric-test-range flags tune this feature. With the flag defaults listed under All options (1h and 24h), the canary runs a count_over_time instant query against Loki every hour over a range of 24h.

If the canary has not run for -metric-test-range (24h) the query range is adjusted to the amount of time the canary has been running such that the rate can be calculated since the canary was started.

The canary calculates what the expected count of logs would be for the range (also adjusting this based on canary runtime) and compares the expected result with the actual result returned from Loki. The difference is stored as the value of the gauge loki_canary_metric_test_deviation.

Some deviation is expected. Deriving an expected count from the write rate and comparing it with actual query data is imperfect, and it leads to a deviation of a few log entries.

It’s not expected for there to be a deviation of more than 3-4 log entries.
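The comparison can be sketched like this. It is an assumption-laden illustration: the function name is hypothetical, the expected count is taken as the effective range divided by the write interval, and the sign convention (expected minus actual) is a guess rather than something the text specifies.

```python
def metric_test_deviation(actual_count, write_interval_s, range_s, runtime_s):
    """Expected-minus-actual log count for a count_over_time check."""
    # The query range is truncated to the canary's runtime until the
    # full -metric-test-range has elapsed.
    effective_range_s = min(range_s, runtime_s)
    expected = effective_range_s / write_interval_s
    return expected - actual_count


# One entry per second, 24h range, but the canary has only run for 2h.
deviation = metric_test_deviation(
    actual_count=7198, write_interval_s=1.0, range_s=24 * 3600, runtime_s=2 * 3600
)
print(deviation)  # 2.0
```

A result of this size falls inside the few-entry tolerance described above; a persistently larger value would point to lost or delayed entries.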

Control

Loki Canary exposes two endpoints that allow dynamic suspending and resuming of the canary process. This can be useful if you’d like to quickly disable or re-enable the canary. To stop or start the canary, issue an HTTP GET request against the /suspend or /resume endpoint.
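A small Python helper for those requests might look like the following. The helper names are hypothetical, and the base URL in the comment assumes the endpoints are served on the same port as the metrics (3500 by default), which this page does not state explicitly.

```python
from urllib.request import urlopen


def control_url(base_url: str, action: str) -> str:
    """Build the URL for the /suspend or /resume endpoint."""
    if action not in ("suspend", "resume"):
        raise ValueError(f"unsupported action: {action!r}")
    return f"{base_url.rstrip('/')}/{action}"


def send_control(base_url: str, action: str, timeout: float = 5.0) -> int:
    """Issue the HTTP GET request and return the response status code."""
    with urlopen(control_url(base_url, action), timeout=timeout) as resp:
        return resp.status


# Example (assumed address): send_control("http://localhost:3500", "suspend")
```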

Installation

Binary

Loki Canary is provided as a pre-compiled binary as part of the Loki Releases on GitHub.

Docker

Loki Canary is also provided as a Docker container image:

# change tag to the most recent release
$ docker pull grafana/loki-canary:3.7.1

Kubernetes

To run on Kubernetes, you can do something simple like:

kubectl run loki-canary --image=grafana/loki-canary:latest --restart=Never --image-pull-policy=IfNotPresent --labels=name=loki-canary -- -addr=loki:3100

Or you can do something more complex, such as deploying it as a DaemonSet. There is a Tanka setup for this in the production folder, which you can import using jsonnet-bundler:

jb install github.com/grafana/loki/production/ksonnet/loki-canary

Then in your Tanka environment’s main.jsonnet you’ll want something like this:

local loki_canary = import 'loki-canary/loki-canary.libsonnet';

loki_canary {
  loki_canary_args+:: {
    addr: "loki:3100",
    port: 80,
    labelname: "instance",
    interval: "100ms",
    size: 1024,
    wait: "3m",
  },
  _config+:: {
    namespace: "default",
  }
}

Examples

Standalone Pod Implementation of loki-canary

---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  containers:
  - args:
    - -addr=loki:3100
    image: grafana/loki-canary:latest
    imagePullPolicy: IfNotPresent
    name: loki-canary
    resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500

DaemonSet Implementation of loki-canary

---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  # apps/v1 requires a selector that matches the Pod template labels
  selector:
    matchLabels:
      app: loki-canary
  template:
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      containers:
      - args:
        - -addr=loki:3100
        image: grafana/loki-canary:latest
        imagePullPolicy: IfNotPresent
        name: loki-canary
        resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500

From Source

If the other options are not sufficient for your use case, you can compile loki-canary yourself:

Clone the source tree.

$ git clone https://github.com/grafana/loki

Build the binary.
```
$ make loki-canary
```
Optional: Build the container image.
```
$ make loki-canary-image
```

Configuration

The address of Loki must be passed in with the -addr flag or by setting the environment variable LOKI_ADDRESS, and if your Loki server uses TLS, -tls=true must also be provided. Note that using TLS will cause the WebSocket connection to use wss:// instead of ws://.

The -labelname and -labelvalue flags should also be provided, as these are used by Loki Canary to filter the log stream to only process logs for the current instance of the canary. Ensure that the values provided to the flags are unique to each instance of Loki Canary. Grafana Labs’ Tanka config accomplishes this by passing in the Pod name as the label value.

If Loki Canary reports a high number of unexpected_entries, Loki Canary may not be waiting long enough and the value for the -wait flag should be increased to a larger value than 60s.

Be aware of the relationship between -pruneinterval and -interval. For example, with an interval of 10ms (100 logs per second) and a prune interval of 60s, you will write 6000 logs per minute. If those logs are not received over the WebSocket, the canary queries Loki directly to see if they are completely lost. However, the query response is limited to 1000 results, so the canary cannot retrieve all of the logs even if they did make it to Loki.

Likewise, if you lower -pruneinterval, you risk an effective denial of service against Loki, because all of your canaries query for missing logs at whatever interval you set.
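A quick way to sanity-check a flag combination is to compare the number of entries written per prune interval with the 1000-result limit mentioned above. The helper names below are hypothetical.

```python
QUERY_RESULT_LIMIT = 1000  # result limit for the direct query, per the text above


def entries_per_prune(interval_ms: int, prune_interval_s: int) -> int:
    """Number of log entries written during one prune interval."""
    return prune_interval_s * 1000 // interval_ms


def can_verify_all(interval_ms: int, prune_interval_s: int) -> bool:
    """True if a single direct query could return every entry in the window."""
    return entries_per_prune(interval_ms, prune_interval_s) <= QUERY_RESULT_LIMIT


print(entries_per_prune(10, 60), can_verify_all(10, 60))      # 6000 False
print(entries_per_prune(1000, 60), can_verify_all(1000, 60))  # 60 True
```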

All options:

  -addr string
    	The Loki server URL:Port, e.g. loki:3100. Loki address can also be set using the environment variable LOKI_ADDRESS.
  -buckets int
    	Number of buckets in the response_latency histogram (default 10)
  -ca-file string
    	Client certificate authority for optional use with TLS connection to Loki
  -cert-file string
    	Client PEM encoded X.509 certificate for optional use with TLS connection to Loki
  -insecure
    	Allow insecure TLS connections
  -interval duration
    	Duration between log entries (default 1s)
  -key-file string
    	Client PEM encoded X.509 key for optional use with TLS connection to Loki
  -labels string
        Comma-separated string of labels for the query e.g. 'service=loki,app=canary'. The parsing logic for this argument is simple, label values must not contain a comma or special characters and should not be quoted. Overwrites labelname and streamname
  -labelname string
    	The label name for this instance of loki-canary to use in the log selector (default "name")
  -labelvalue string
    	The unique label value for this instance of loki-canary to use in the log selector (default "loki-canary")
  -max-wait duration
    	Duration to keep querying Loki for missing websocket entries before reporting them missing (default 5m0s)
  -metric-test-interval duration
    	The interval the metric test query should be run (default 1h0m0s)
  -metric-test-range duration
    	The range value [24h] used in the metric test instant-query. Note: this value is truncated to the running time of the canary until this value is reached (default 24h0m0s)
  -out-of-order-max duration
    	Maximum amount of time to go back for out of order entries (in seconds). (default 1m0s)
  -out-of-order-min duration
    	Minimum amount of time to go back for out of order entries (in seconds). (default 30s)
  -out-of-order-percentage int
    	Percentage (0-100) of log entries that should be sent out of order.
  -pass string
    	Loki password. This credential should have both read and write permissions to Loki endpoints
  -port int
    	Port which loki-canary should expose metrics (default 3500)
  -pruneinterval duration
    	Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki (default 1m0s)
  -push
    	Push the logs directly to given Loki address
  -query-append string
        LogQL filters to be appended to the Canary query e.g. '| json | line_format `{{.log}}`'  	
  -query-timeout duration
    	How long to wait for a query response from Loki (default 10s)
  -size int
    	Size in bytes of each log line (default 100)
  -spot-check-initial-wait duration
    	How long should the spot check query wait before starting to check for entries (default 10s)
  -spot-check-interval duration
    	Interval that a single result will be kept from sent entries and spot-checked against Loki, e.g. 15min default one entry every 15 min will be saved and then queried again every 15min until spot-check-max is reached (default 15m0s)
  -spot-check-max duration
    	How far back to check a spot check entry before dropping it (default 4h0m0s)
  -spot-check-query-rate duration
    	Interval that the canary will query Loki for the current list of all spot check entries (default 1m0s)
  -streamname string
    	The stream name for this instance of loki-canary to use in the log selector (default "stream")
  -streamvalue string
    	The unique stream value for this instance of loki-canary to use in the log selector (default "stdout")
  -tenant-id string
    	Tenant ID to be set in X-Scope-OrgID header.
  -tls
    	Does the loki connection use TLS?
  -user string
    	Loki username.
  -version
    	Print this builds version information
  -wait duration
    	Duration to wait for log entries on websocket before querying loki for them (default 1m0s)
  -write-max-backoff duration
    	Maximum backoff time between retries  (default 5m0s)
  -write-max-retries int
    	Maximum number of retries when push a log entry  (default 10)
  -write-min-backoff duration
    	Initial backoff time before first retry  (default 500ms)
  -write-timeout duration
    	How long to wait write response from Loki (default 10s)

Monolithic mode setup

This section describes how to set up Loki Canary for Loki’s monolithic mode using Systemd, Alloy, and Prometheus.

Systemd

Create a systemd service file that writes Loki Canary’s standard output to the file /var/log/loki-canary.log.

[Unit]
Description=Loki Canary
Documentation=https://grafana.com/docs/loki/latest/operations/loki-canary/

[Service]
User=loki
ExecStart=/usr/bin/loki-canary -addr=localhost:3100 -labelname=job -labelvalue=loki_canary -streamname=job -streamvalue=loki_canary
Restart=on-failure
RestartSec=5
StandardOutput=append:/var/log/loki-canary.log
StandardError=journal

[Install]
WantedBy=multi-user.target

-labelname and -labelvalue flags specify a label pair used to identify Loki Canary’s logs. -streamname and -streamvalue flags specify an additional label pair and must be provided. The same values can be provided to both label pairs if no additional label exists. Labels can be added when Alloy scrapes the logs.

Scrape logs

Scrape the /var/log/loki-canary.log file with Alloy.

loki.source.file "canary" {
  forward_to = [loki.write.local.receiver]
  targets = [{
    __path__ = "/var/log/loki-canary.log",
    job      = "loki_canary",
  }]
}

loki.write "local" {
  endpoint {
    url  = "http://localhost:3100/loki/api/v1/push"
  }
}

Scrape metrics

Scrape Loki Canary’s metrics with Alloy or Prometheus.

Scrape metrics with Alloy

prometheus.scrape "loki" {
  targets    = [{__address__ = "localhost:3500"}]
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {  
    url = "<PROMETHEUS_REMOTE_WRITE_URL>"
  }  
}

Scrape metrics with Prometheus

scrape_configs:
  - job_name: loki-canary
    static_configs:
      - targets: ['localhost:3500']

Block unwanted queries

In certain situations, you may not be able to control the queries being sent to your Loki installation. These queries may be intentionally or unintentionally expensive to run, and they may affect the overall stability or cost of running your service.

You can block queries using per-tenant overrides, like so:

overrides:
  "tenant-id":
    blocked_queries:
      # block this query exactly
      - pattern: 'sum(rate({env="prod"}[1m]))'

      # block any query matching this regex pattern 
      - pattern: '.*prod.*'
        regex: true

      # block all metric queries
      - types: metric

      # block any filter or limited queries matching this regex pattern 
      - pattern: '.*prod.*'
        regex: true
        types: filter,limited

      # block any query that matches this query hash
      - hash: 2943214005          # hash of {stream="stdout",pod="loki-canary-9w49x"}
        types: filter,limited

      # block queries originating from specific sources via X-Query-Tags
      # Keys and values are matched case-insensitively.
      - pattern: '.*'             # optional; if pattern and regex are omitted, they default to '.*' and true
        regex: true
        query_tags:
          source: grafana
          feature: beta

Note
Changes to these configurations do not require a restart; they are defined in the runtime configuration file.

The available query types are:

metric: a query with an aggregation, e.g. sum(rate({env="prod"}[1m]))
filter: a query with a log filter, e.g. {env="prod"} |= "error"
limited: a query without a filter or a metric aggregation

The hash option uses a 32-bit FNV-1 hash of the query string, represented as a 32-bit unsigned integer. This can often be easier to use than query strings that are long or require lots of string escaping. A query_hash field is logged with every query request in the query-frontend and querier logs, for easy reference. Here’s an example log line:

level=info ts=2023-03-30T09:08:15.2614555Z caller=metrics.go:152 component=frontend org_id=29 latency=fast 
query="{stream=\"stdout\",pod=\"loki-canary-9w49x\"}" query_hash=2943214005 query_type=limited range_type=range ...
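For reference, a 32-bit FNV-1 hash can be computed in a few lines of Python. This sketch hashes the UTF-8 bytes of the raw query string; whether Loki hashes the string exactly as typed or a normalized form is not stated here, so treat the query_hash value in the logs as the source of truth.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply by the prime first, then XOR in each byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
        h ^= byte
    return h


query = '{stream="stdout",pod="loki-canary-9w49x"}'
print(fnv1_32(query.encode("utf-8")))
```

Note the operation order: FNV-1a, a closely related variant, XORs before multiplying and produces different values.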

Note
The order of patterns is preserved, so the first matching pattern will be used.

Observing blocked queries

Blocked queries are logged, as well as counted in the loki_blocked_queries metric on a per-tenant basis.

When a policy matches by pattern/hash/regex, Loki logs whether the query type and request tags matched that policy:

level=warn msg="query blocker matched with regex policy" user=29 type=metric pattern=".*rate\\(.*\\).*" query="sum(rate({app=\"foo\"}[5m]))" typesMatched=true tagsMatched=false blocked=false

If tag constraints fail to match, Loki emits a debug log showing the missing key and the raw header value that was received:

level=debug msg="query blocker tags mismatch: missing or mismatched key" key=feature tagsRaw="Source=grafana,Feature=alpha"

Scope

Blocking applies both to queries received via the API and to queries executed as alerting or recording rules.

Tag-based blocking

You can scope a blocked query rule to requests that include specific key=value pairs in the X-Query-Tags header.

Header format: key=value pairs separated by commas, for example: Source=grafana,Feature=beta.
Allowed characters are alphanumeric plus space, comma, equals, ‘@’, ‘.’, and ‘-’. Any other characters are replaced with _.
Parsing keeps only canonical key=value tokens; malformed tokens are ignored.
Matching rules:
- Keys are matched case-insensitively (the server lowercases keys).
- Values are matched case-insensitively.
- Every key=value pair listed under query_tags in the rule must be present in the request for the block to apply.
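The header handling and matching rules above can be sketched in Python. This is not Loki's parser: the function names are hypothetical, and the sketch assumes a "canonical" token is one with exactly one equals sign and a non-empty key and value.

```python
import re

# Anything outside the allowed character set is replaced with "_".
_DISALLOWED = re.compile(r"[^A-Za-z0-9 ,=@.\-]")


def parse_query_tags(header: str) -> dict:
    """Parse an X-Query-Tags value into a dict with lowercased keys."""
    tags = {}
    for token in _DISALLOWED.sub("_", header).split(","):
        key, sep, value = token.strip().partition("=")
        if not sep or not key or not value or "=" in value:
            continue  # malformed token: ignored
        tags[key.lower()] = value
    return tags


def rule_tags_match(rule_tags: dict, header: str) -> bool:
    """True only if every tag in the rule is present with a matching value."""
    request = parse_query_tags(header)
    return all(
        key.lower() in request and request[key.lower()].lower() == value.lower()
        for key, value in rule_tags.items()
    )


rule = {"source": "grafana", "feature": "beta"}
print(rule_tags_match(rule, "Source=Grafana,Feature=BETA"))   # True
print(rule_tags_match(rule, "Source=grafana,Feature=alpha"))  # False
```

The second call corresponds to the debug log example earlier on this page, where the feature key is present but carries a different value.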

Examples:

overrides:
  tenant-a:
    blocked_queries:
      # Block only metric queries from a beta feature flag
      - types: metric
        query_tags:
          feature: beta

      # Combine with regex to narrow scope further
      - pattern: '.*rate\(.*\).*'   # single-quoted YAML keeps backslashes literal, so escape each parenthesis once
        regex: true
        query_tags:
          source: grafana

Configure caches to speed up queries

Loki supports two types of caching, for query results and for chunks, to speed up query performance and reduce calls to the storage layer. Memcached is included in the Loki Helm chart and enabled by default for the chunksCache and resultsCache. This section describes the recommended Memcached configuration to enable caching for chunks and query results.

Results cache

The results cache stores the results for index-stat, instant-metric, label and volume queries and it supports negative caching for log queries. It is sometimes called frontend cache in some configurations. For details of each supported request type, refer to the Components section. The results cache is consulted by query-frontends to be used in subsequent queries. If the cached results are incomplete, the query frontend calculates the required sub-queries and sends them further along to be executed in queriers, then also caches those results. To orchestrate all of the above, the results cache uses a query hash as the key that is computed and stored in the headers.

The index lookup cache only supports the legacy BoltDB index storage and is configured to be in-memory by default. Since the move to TSDB indexes, the attached disks or persistent volumes are used as the cache, and the in-memory index lookup cache is obsolete.

Chunks cache

The chunks are cached using the chunkRef as the cache key, which is the unique reference to a chunk when it’s cut in the Loki ingesters. The chunk cache is consulted by queriers each time a set of chunkRefs are calculated to serve the query, before going to the storage layer.

Query results are significantly smaller than chunks. As the Loki cluster grows in ingested volume, the results cache can continue to perform, whereas the chunks cache needs more memory in proportion to demand. To support the growing needs of a cluster, in 2023 we introduced support for memcached-extstore. Extstore is an additional Memcached feature that supports attaching SSD disks to Memcached pods to maximize their capacity.

See the blog post on Loki’s experience with memcached-extstore for our SaaS offering, Grafana Cloud. For more information on how to tune memcached-extstore, consult the open source Memcached documentation.

Before you begin

- It is recommended to deploy a separate Memcached cluster for each cache type (memcached_frontend and memcached_chunks).
- As of 2025-02-06, the memcached:1.6.32-alpine image version is recommended.
- Consult the Loki ksonnet memcached deployment and the ksonnet memcached library.
- Index caching is not required for the TSDB index format.
- For recommendations on scaling the cache, refer to the Size the cluster page.

Steps

To enable and configure Memcached:

Deploy each Memcached service with at least three replicas and configure each as follows:
1. Chunk cache
```
--memory-limit=4096 --max-item-size=2m --conn-limit=1024
```
2. Query result cache
```
--memory-limit=1024 --max-item-size=5m --conn-limit=1024
```

Configure Loki to use the cache.

If the Helm chart is used

Set memcached.chunk_cache.host to the Memcached address for the chunk cache, memcached.results_cache.host to the Memcached address for the query result cache, memcached.chunk_cache.enabled=true and memcached.results_cache.enabled=true.

Ensure that the connection limit of Memcached is at least number_of_clients * max_idle_conns.
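That sizing rule is simple enough to check with a few lines of Python. The function names are hypothetical; the numbers in the example reuse the --conn-limit=1024 and max_idle_conns: 16 values that appear elsewhere on this page.

```python
def required_conn_limit(number_of_clients: int, max_idle_conns: int) -> int:
    """Minimum Memcached connection limit for the given client fleet."""
    return number_of_clients * max_idle_conns


def max_clients(conn_limit: int, max_idle_conns: int) -> int:
    """Largest number of clients a given connection limit can accommodate."""
    return conn_limit // max_idle_conns


print(required_conn_limit(40, 16))  # 640
print(max_clients(1024, 16))        # 64
```

In other words, with those values a single Memcached instance would satisfy the rule for up to 64 clients before --conn-limit needs to be raised.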

The options host and service depend on the type of installation. For example, using the bitnami/memcached Helm Charts with the following commands, the service values are always memcached.
```
helm upgrade --install chunk-cache -n loki bitnami/memcached -f memcached-overrides-chunk.yaml
helm upgrade --install results-cache -n loki bitnami/memcached -f memcached-overrides-results.yaml
```
The current Helm Chart only supports the chunk and results cache.

In this case, the Loki configuration would be:
```
loki:
  memcached:
    chunk_cache:
      enabled: true
      host: chunk-cache-memcached.loki.svc
      service: memcached-client
      batch_size: 256
      parallelism: 10
    results_cache:
      enabled: true
      host: results-cache-memcached.loki.svc
      service: memcached-client
      default_validity: 12h
```

If the Loki configuration is used, modify the following two sections in the Loki configuration file.

Configure the chunk cache

chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      host: <chunk cache memcached host>
      service: <port name of memcached service>

Configure the query result cache

query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        consistent_hash: true
        host: <memcached host>
        service: <port name of memcached service>
        max_idle_conns: 16
        timeout: 200ms
        update_interval: 1m

Enforce rate limits and push request validation

Loki will reject requests if they exceed a usage threshold (rate limit error) or if they are invalid (validation error).

All occurrences of these errors can be observed using the loki_discarded_samples_total and loki_discarded_bytes_total metrics. The sections below describe the various possible reasons specified in the reason label of these metrics.

It is recommended that Loki operators set up alerts or dashboards with these metrics to detect when rate limits or validation errors occur.

Terminology

sample: a log line with structured metadata
stream: samples with a unique combination of labels
active stream: streams that are present in the ingesters - these have recently received log lines within the chunk_idle_period period (default: 30m)

Rate-Limit Errors

Rate-limits are enforced when Loki cannot handle more requests from a tenant.

`rate_limited`

This rate limit is enforced when a tenant has exceeded their configured log ingestion rate limit.

One solution if you’re seeing samples dropped due to rate_limited is simply to increase the rate limits on your Loki cluster. These limits can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. The config options to use are ingestion_rate_mb and ingestion_burst_size_mb.

Note that you’ll want to make sure your Loki cluster has sufficient resources provisioned to be able to accommodate these higher limits. Otherwise your cluster may experience performance degradation as it tries to handle this higher volume of log lines to ingest.

Another option to address samples being dropped due to rate_limited is simply to decrease the rate of log lines being sent to your Loki cluster. Consider collecting logs from fewer targets or setting up drop stages in Alloy to filter out certain log lines. You can also use Alloy’s rate limiting to control the volume of logs sent to your Loki cluster.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	Yes
Sample discarded	No
Configurable per tenant	Yes
HTTP status code	`429 Too Many Requests`
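To build intuition for how ingestion_rate_mb and ingestion_burst_size_mb interact, here is a generic token-bucket sketch in Python. It is not Loki's limiter, and the class name and the 4 MB and 6 MB values are hypothetical; it only illustrates a sustained rate combined with a burst allowance.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate_mb` per second, capped at `burst_mb`."""

    def __init__(self, rate_mb: float, burst_mb: float, now: float = 0.0):
        self.rate = rate_mb * 1024 * 1024        # bytes added per second
        self.capacity = burst_mb * 1024 * 1024   # maximum bytes available at once
        self.tokens = self.capacity
        self.last = now

    def allow(self, size_bytes: int, now: float) -> bool:
        """Return True if the push fits; False models a 429 response."""
        elapsed = max(now - self.last, 0.0)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes > self.tokens:
            return False
        self.tokens -= size_bytes
        return True


MIB = 1024 * 1024
bucket = TokenBucket(rate_mb=4, burst_mb=6)
results = [
    bucket.allow(6 * MIB, now=0.0),  # fits within the burst
    bucket.allow(1 * MIB, now=0.0),  # bucket is empty: rejected
    bucket.allow(4 * MIB, now=1.0),  # one second of refill covers it
]
print(results)  # [True, False, True]
```

Because a rejected push is retryable and the sample is not discarded, a client that backs off and resends after the bucket refills loses no data.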

`per_stream_rate_limit`

This limit is enforced when a single stream reaches its rate limit.

Each stream has a rate limit applied to it to prevent individual streams from overwhelming the set of ingesters it is distributed to (the size of that set is equal to the replication_factor value).

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. The config options to adjust are per_stream_rate_limit and per_stream_rate_limit_burst.

Another option you could consider to decrease the rate of samples dropped due to per_stream_rate_limit is to split the stream that is getting rate limited into several smaller streams. A third option is to use the Alloy stage.limit block to limit the rate of samples sent to the stream hitting the per_stream_rate_limit.

We typically recommend setting per_stream_rate_limit no higher than 5MB, and per_stream_rate_limit_burst no higher than 20MB.

Property	Value
Enforced by	`ingester`
Outcome	Request rejected
Retryable	Yes
Sample discarded	No
Configurable per tenant	Yes
HTTP status code	`429 Too Many Requests`

`stream_limit`

This limit is enforced when a tenant reaches their maximum number of active streams.

Active streams are held in memory buffers in the ingesters, and if this value becomes sufficiently large then it will cause the ingesters to run out of memory.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. To increase the allowable active streams, adjust max_global_streams_per_user. Alternatively, the number of active streams can be reduced by removing extraneous labels or removing excessive unique label values.

Property	Value
Enforced by	`ingester`
Outcome	Request rejected
Retryable	Yes
Sample discarded	No
Configurable per tenant	Yes
HTTP status code	`429 Too Many Requests`

Validation Errors

Validation errors occur when a request violates a validation rule defined by Loki.

`line_too_long`

This error occurs when a log line exceeds the maximum allowable length in bytes. The HTTP response will include the stream to which the offending log line belongs as well as its size in bytes.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. To increase the maximum line size, adjust max_line_size. We recommend that you do not increase this value above 256 KB for performance reasons. Alternatively, Loki can be configured to ingest truncated versions of log lines over the length limit by using the max_line_size_truncate option.

Property	Value
Enforced by	`distributor`
Retryable	No
Sample discarded	Yes
Configurable per tenant	Yes

`invalid_labels`

This error occurs when one or more labels in the submitted streams fail validation.

Loki uses the same validation rules as Prometheus for validating labels.

Label names may contain ASCII letters, numbers, as well as underscores. They must match the regex [a-zA-Z_][a-zA-Z0-9_]*. Label names beginning with __ are reserved for internal use.
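The label-name rule can be checked before sending with a short Python function. The function name and its three return values are hypothetical; the regular expression is the one quoted above.

```python
import re

LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_label_name(name: str) -> str:
    """Classify a label name as 'ok', 'reserved', or 'invalid'."""
    if not LABEL_NAME_RE.fullmatch(name):
        return "invalid"
    if name.startswith("__"):
        return "reserved"  # names beginning with __ are for internal use
    return "ok"


for candidate in ("job", "pod_name", "9lives", "app-name", "__path__"):
    print(candidate, check_label_name(candidate))
```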

Property	Value
Enforced by	`distributor`
Retryable	No
Sample discarded	Yes
Configurable per tenant	No

`missing_labels`

This validation error is returned when a stream is submitted without any labels.

Property	Value
Enforced by	`distributor`
Retryable	No
Sample discarded	Yes
Configurable per tenant	No

`too_far_behind` and `out_of_order`

The too_far_behind and out_of_order reasons are identical. Loki clusters with unordered_writes=true (the default value as of Loki v2.4) use reason=too_far_behind. Loki clusters with unordered_writes=false use reason=out_of_order.

This validation error is returned when a stream is submitted out of order. For details, refer to the Loki documentation on ordering constraints.

The unordered_writes config value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file, whereas max_chunk_age is a global configuration.

This problem can be solved by ensuring that log delivery is configured correctly, or by increasing the max_chunk_age value.

It is recommended not to modify the default value of max_chunk_age, as this has other implications, and to instead try to track down the cause of the delayed log delivery. Note also that this is a per-stream error, so splitting streams (adding more labels) can circumvent the problem, especially if multiple hosts are sending samples for a single stream.

Property	Value
Enforced by	`ingester`
Retryable	No
Sample discarded	Yes
Configurable per tenant	No

`greater_than_max_sample_age`

If the reject_old_samples config option is set to true (it is by default), then samples will be rejected with reason=greater_than_max_sample_age if they are older than the reject_old_samples_max_age value. You should not see samples rejected for reason=greater_than_max_sample_age if reject_old_samples=false.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. This error can be solved by increasing the reject_old_samples_max_age value, or investigating why log delivery is delayed for this particular stream. The stream in question will be returned in the body of the HTTP response.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	No
Sample discarded	Yes
Configurable per tenant	Yes
HTTP status code	`400 Bad Request`

`too_far_in_future`

If a sample’s timestamp is greater than the current timestamp, Loki allows for a certain grace period during which samples will be accepted. If the grace period is exceeded, the error will occur.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. This error can be solved by increasing the creation_grace_period value, or investigating why this particular stream has a timestamp too far into the future. The stream in question will be returned in the body of the HTTP response.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	No
Sample discarded	Yes
Configurable per tenant	Yes
HTTP status code	`400 Bad Request`
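
The two timestamp checks described above (greater_than_max_sample_age and too_far_in_future) together define an acceptance window around the current time. The sketch below illustrates that window under stated assumptions: the function name, the one-week and ten-minute values, and the exact boundary comparisons are illustrative choices, not taken from Loki's source.

```python
from datetime import datetime, timedelta, timezone


def timestamp_reason(sample_ts, now, max_age, grace_period, reject_old=True):
    """Return the discard reason for a sample timestamp, or None if accepted.

    max_age plays the role of reject_old_samples_max_age, grace_period the
    role of creation_grace_period, and reject_old the role of
    reject_old_samples.
    """
    if reject_old and sample_ts < now - max_age:
        return "greater_than_max_sample_age"
    if sample_ts > now + grace_period:
        return "too_far_in_future"
    return None


# Hypothetical values chosen for illustration only.
NOW = datetime(2026, 1, 1, tzinfo=timezone.utc)
MAX_AGE = timedelta(days=7)
GRACE = timedelta(minutes=10)

for offset in (timedelta(days=-8), timedelta(0), timedelta(minutes=11)):
    print(offset, timestamp_reason(NOW + offset, NOW, MAX_AGE, GRACE))
```

Comparing agent clocks and delivery delays against this window is usually a better first step than widening either limit.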

`max_label_names_per_series`

If a sample is submitted with more labels than Loki has been configured to allow, it will be rejected with the max_label_names_per_series reason. Note that ‘series’ is the same thing as a ‘stream’ in Loki - the ‘series’ term is a legacy name.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. This error can be solved by increasing the max_label_names_per_series value. The stream to which the offending sample (i.e. the one with too many label names) belongs will be returned in the body of the HTTP response.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	No
Sample discarded	Yes
Configurable per tenant	Yes
HTTP status code	`400 Bad Request`

`label_name_too_long`

If a sample is sent with a label name that has a length in bytes greater than Loki has been configured to allow, it will be rejected with the label_name_too_long reason.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. This error can be solved by increasing the max_label_name_length value, though we do not recommend raising it significantly above the default value of 1024 for performance reasons. The offending stream will be returned in the body of the HTTP response.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	No
Sample discarded	Yes
Configurable per tenant	Yes
HTTP status code	`400 Bad Request`

`label_value_too_long`

If a sample has a label value with a length in bytes greater than Loki has been configured to allow, it will be rejected for the label_value_too_long reason.

This value can be modified globally in the limits_config block, or on a per-tenant basis in the runtime overrides file. This error can be solved by increasing the max_label_value_length value. The offending stream will be returned in the body of the HTTP response.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	No
Sample discarded	Yes
Configurable per tenant	Yes
HTTP status code	`400 Bad Request`

`duplicate_label_names`

If a sample is sent with two or more identical labels, it will be rejected for the duplicate_label_names reason.

The offending stream will be returned in the body of the HTTP response.

Property	Value
Enforced by	`distributor`
Outcome	Request rejected
Retryable	No
Sample discarded	Yes
Configurable per tenant	No
HTTP status code	`400 Bad Request`

Ensure query fairness within tenants using actors

Thu, 09 Apr 2026 02:28:18 +0000

Loki uses shuffle sharding to minimize impact across tenants in case of querier failures or misbehaving neighboring tenants.

When there are potentially a lot of different actors using the same tenant to query logs, such as users accessing Loki from Grafana or via LogCLI or other applications using the HTTP API, it can lead to contention between queries of different users, because they all share the same resources for a tenant.

In that case, as an operator, you would also want to ensure some sort of query fairness across these actors within the tenants. An actor could be a Grafana user, a CLI user, or an application accessing the API. To achieve that, Loki 2.9 introduced hierarchical scheduler queues, based on LID 0003: Query fairness across users within tenants. They are enabled by default.

What are hierarchical queues and how do they work

To understand hierarchical queues, we first need to know that in the scheduler component each tenant has its own first in first out (FIFO) queue where sub-queries are enqueued. Sub-queries are queries that result from splitting and sharding of a query sent by a client using HTTP.

Tenant queues are the first level of the queue hierarchy. When a tenant executes a query without any further controls, all of its sub-queries are enqueued to the first level queue.

At the second level of the queue hierarchy, a tenant queue can have sub-queues.

Similar to how shuffle sharding assigns queries at the tenant level, each time the Loki Scheduler makes a round-robin pick at the second level of the query hierarchy, it selects a query from the tenant’s local queue and subqueues.

The figure above shows that a tenant queue has a local queue, which is a leaf node in the queue tree, and a set of sub-queues. Each sub-queue, again like the tenant queue, consists of a local queue, and possible sub-queues, resulting in a recursive tree structure.

So, how can we make use of these tree-like queue structures to achieve query fairness?

How to control query fairness

As already mentioned, by default, sub-queries are only enqueued at the first (tenant) level of the queue tree. The tenant is provided by the X-Scope-OrgID header that is required when running Loki in multi-tenant mode.

You use the HTTP header X-Loki-Actor-Path to control to which sub-queue a query (or more correctly its sub-queries) is enqueued.

The following example shows a curl command that invokes the HTTP endpoint for range queries and passes both the X-Scope-OrgID and the X-Loki-Actor-Path headers.

curl -s 'http://localhost:3100/loki/api/v1/query_range?xxx' \
    -H 'X-Scope-OrgID: grafana' \
    -H 'X-Loki-Actor-Path: joe'

The query that this request invokes ends up in the sub-queue joe of the tenant queue grafana. Another user can use their own name in the actor path header to enqueue their queries to their own sub-queue.

Since the scheduler chooses the next task for a tenant in a round-robin manner, both actors (in our case human users) get their 50% share when the scheduler dequeues a sub-query to send to the querier.

With N actors, each actor gets 1/N of the tenant's share. In our example with two users, if the tenant's local queue also contains sub-queries, the local queue and each of the two sub-queues get 1/3 of the tenant's share.

As the explained implementation and the header name already suggest, it is possible to enqueue queries several levels deep. To do so, you can construct a path to the sub-queue using the | delimiter in the header value, as shown in the following examples.

curl -s 'http://localhost:3100/loki/api/v1/query_range?xxx' \
    -H 'X-Scope-OrgID: grafana' \
    -H 'X-Loki-Actor-Path: users|joe'

curl -s 'http://localhost:3100/loki/api/v1/query_range?xxx' \
    -H 'X-Scope-OrgID: grafana' \
    -H 'X-Loki-Actor-Path: apps|logcli'
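
The round-robin behavior and the | path delimiter described above can be illustrated with a small tree of queues. This is a simplified sketch for intuition only, not the scheduler's actual implementation: the QueueNode class, the parse_actor_path helper, and the second actor jane are all hypothetical.

```python
from collections import deque


class QueueNode:
    """One node of the queue tree: a local FIFO queue plus named sub-queues."""

    def __init__(self):
        self.local = deque()
        self.children = {}
        self._order = [None]  # round-robin order; None stands for the local queue
        self._next = 0

    def enqueue(self, path, item):
        """Enqueue item in the sub-queue addressed by path (a list of names)."""
        if not path:
            self.local.append(item)
            return
        head, rest = path[0], path[1:]
        if head not in self.children:
            self.children[head] = QueueNode()
            self._order.append(head)
        self.children[head].enqueue(rest, item)

    def dequeue(self):
        """Pick the next item, rotating over the local queue and sub-queues."""
        for _ in range(len(self._order)):
            name = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if name is None:
                if self.local:
                    return self.local.popleft()
            else:
                item = self.children[name].dequeue()
                if item is not None:
                    return item
        return None  # every queue in this subtree is empty


def parse_actor_path(header_value):
    """Split an actor path header value such as 'users|joe' into its parts."""
    return [part for part in header_value.split("|") if part]


tenant = QueueNode()
for i in range(2):
    tenant.enqueue(parse_actor_path(""), f"local-{i}")
    tenant.enqueue(parse_actor_path("joe"), f"joe-{i}")
    tenant.enqueue(parse_actor_path("jane"), f"jane-{i}")

order = [tenant.dequeue() for _ in range(6)]
print(order)
```

In this sketch the local queue and the two sub-queues are served in turn, so each receives one third of the dequeues while all three have work pending.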

There is a limit to how deep a path and thus the queue tree can be. This is controlled by the Loki -query-scheduler.max-queue-hierarchy-levels CLI argument or its respective YAML configuration block:

query_scheduler:
  max_queue_hierarchy_levels: 2 # defaults to 3

It is advised to keep the number of levels small (ideally 1 to 3), both for performance reasons and so that it remains easy to understand how query fairness is ensured across all sub-queues.

Enforcing headers

In the examples above the client that invoked the query directly against Loki also provided the HTTP header that controls where in the queue tree the sub-queries are enqueued. However, as an operator, you would usually want to avoid this scenario and control yourself where the header is set.

When using Grafana as the Loki user interface, you can, for example, create multiple data sources with the same tenant, but with a different additional HTTP header X-Loki-Actor-Path and restrict which Grafana user can use which data source.

Alternatively, if you have a proxy for authentication in front of Loki, you can pass the (hashed) user from the authentication as a downstream header to Loki.

Isolate tenant workflows using shuffle sharding

Thu, 09 Apr 2026 02:28:18 +0000

Shuffle sharding is a resource-management technique used to isolate tenant workloads from other tenant workloads, to give each tenant more of a single-tenant experience when running in a shared cluster. This technique is explained by AWS in their article Workload isolation using shuffle-sharding. A reference implementation has been shown in the Route53 Infima library.

The issues that shuffle sharding mitigates

Shuffle sharding can be configured for the query path.

The query path is sharded by default, and the default does not use shuffle sharding. Each tenant’s query is sharded across all queriers, so the workload uses all querier instances.

In a multi-tenant cluster, sharding across all instances of a component may exhibit these issues:

Any outage of a component instance affects all tenants
A misbehaving tenant affects all other tenants

An individual query may create issues for all tenants. A single tenant or a group of tenants may issue an expensive query: one that causes a querier component to hit an out-of-memory error, or one that causes a querier component to crash. Once the error occurs, the tenant or tenants issuing the error-causing query will be reassigned to other running queriers (remember that all tenants can use all available queriers). This, in turn, may affect the queriers that they have been reassigned to.

How shuffle sharding works

The idea of shuffle sharding is to assign each tenant to a shard composed of a subset of the Loki queriers, aiming to minimize the overlapping instances between distinct tenants.

A misbehaving tenant will affect only its shard’s queriers. Due to the low overlap of queriers among tenants, only a small subset of tenants will be affected by the misbehaving tenant. Shuffle sharding requires no more resources than the default sharding strategy.

Shuffle sharding does not fix all issues. If a tenant repeatedly sends a problematic query, the crashed querier will be disconnected from the query-frontend, and a new querier will be immediately assigned to the tenant’s shard. This invalidates the positive effects of shuffle sharding. In this case, configuring a delay between when a querier disconnects because of a crash, and when the crashed querier is actually removed from the tenant’s shard and another healthy querier is added as a replacement improves the situation. A delay of 1 minute may be a reasonable value in the query-frontend with configuration parameter -query-frontend.querier-forget-delay=1m, and in the query-scheduler with configuration parameter -query-scheduler.querier-forget-delay=1m.

Low probability of overlapping instances

If an example Loki cluster runs 50 queriers and assigns each tenant 4 out of 50 queriers, shuffling instances between each tenant, there are 230K possible combinations.

Statistically, randomly picking two distinct tenants, there is:

a 71% chance that they will not share any instance
a 26% chance that they will share only 1 instance
a 2.7% chance that they will share 2 instances
a 0.08% chance that they will share 3 instances
only a 0.0004% chance that their instances will fully overlap
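
The figures above follow from counting combinations. The sketch below reproduces them, assuming each tenant's shard is an independent, uniformly random choice of 4 queriers out of 50; the function name overlap_probabilities is illustrative.

```python
from math import comb


def overlap_probabilities(total: int, per_tenant: int) -> list[float]:
    """P(two tenants share exactly k instances) for k = 0..per_tenant.

    With one tenant's shard fixed, the other tenant must pick k of those
    per_tenant instances and the remaining ones from the other instances.
    """
    combos = comb(total, per_tenant)
    return [
        comb(per_tenant, k) * comb(total - per_tenant, per_tenant - k) / combos
        for k in range(per_tenant + 1)
    ]


print("possible shards:", comb(50, 4))
for k, p in enumerate(overlap_probabilities(50, 4)):
    print(f"share {k} instances: {p:.4%}")
```

Changing the two arguments shows how the overlap odds shift for other cluster sizes and shard sizes.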

Configuration

Enable shuffle sharding by setting -frontend.max-queriers-per-tenant to a value higher than 0 and lower than the number of available queriers. The value of the per-tenant configuration max_queriers_per_tenant sets the quantity of allocated queriers. This option is only available when using the query-frontend, with or without a scheduler.

The per-tenant configuration parameter max_query_parallelism describes how many sub queries, after query splitting and query sharding, can be scheduled to run at the same time for each request of any tenant.

Configuration parameter querier.concurrency controls the quantity of worker threads (goroutines) per single querier.

The maximum number of queriers can be overridden on a per-tenant basis in the limits overrides configuration by max_queriers_per_tenant.

Shuffle sharding metrics

These metrics reveal information relevant to shuffle sharding:

the overall query-scheduler queue duration, loki_query_scheduler_queue_duration_seconds_*
the query-scheduler queue length per tenant, loki_query_scheduler_queue_length

the query-scheduler queue duration per tenant can be found with this query:

max_over_time({cluster="$cluster",container="query-frontend", namespace="$namespace"} |= "metrics.go" |logfmt | unwrap duration(queue_time) | __error__="" [5m]) by (org_id)

Too many spikes in any of these metrics may imply:

A particular tenant is trying to use more query resources than they were allocated.
That tenant may need an increase in the value of max_queriers_per_tenant.
Loki instances may be underprovisioned.

A useful query checks how many queriers are being used by each tenant:

count by (org_id) (sum by (org_id, pod) (count_over_time({job="$namespace/querier", cluster="$cluster"} |= "metrics.go" | logfmt [$__interval])))

Loki meta-monitoring

Thu, 09 Apr 2026 02:28:18 +0000

As part of your Loki implementation, you will also want to monitor your Loki cluster.

As a best practice, you should collect data about Loki in a separate instance of Loki, Prometheus, and Grafana. For example, send your Loki cluster data to a Grafana Cloud account. This will let you troubleshoot a broken Loki cluster from a working one.

Loki exposes the following observability data about itself:

Metrics: Loki provides a /metrics endpoint that sends information about Loki in Prometheus format. These metrics provide aggregated metrics of the health of your Loki cluster, allowing you to observe query response times, etc. Each Loki component sends its own metrics, allowing for fine-grained monitoring of the health of your Loki cluster. For more information about the metrics Loki exposes, refer to metrics. It is important to keep metrics cardinality in mind when running a large distributed Loki cluster.
Logs: Loki emits a detailed log line metrics.go for every query, which shows query duration, number of lines returned, query throughput, the specific LogQL that was executed, chunks searched, and much more. You can use these log lines to improve and optimize your query performance. You can also collect pod logs from your Loki components to monitor and drill down into specific issues.

Monitoring Loki

There are three primary components to monitoring Loki:

Kubernetes Monitoring Helm: The Kubernetes Monitoring Helm chart provides a comprehensive monitoring solution for Kubernetes clusters. It also provides direct integrations for monitoring the full LGTM (Loki, Grafana, Tempo and Mimir) stack. To learn how to deploy the Kubernetes Monitoring Helm chart, refer to deploy meta-monitoring.
Grafana Cloud account or a separate LGTM stack: The data collected from the Loki cluster can be sent to a Grafana Cloud account or a separate LGTM stack. We recommend using Grafana Cloud since it is Grafana Labs' responsibility to maintain the availability and performance of the Grafana Cloud services.
The Loki mixin: An opinionated set of dashboards, alerts, and recording rules to monitor your Loki cluster. The mixin provides a comprehensive package for monitoring Loki in production. You can install the mixin into a Grafana instance. To install it, refer to the Loki mixin installation instructions.

You should also plan separately for infrastructure-level monitoring, to monitor the capacity or throughput of your storage provider, for example, or your networking layer.

The Kubernetes Monitoring Helm chart Grafana Labs uses to monitor Loki also provides these features out of the box with Kubernetes monitoring enabled by default. You can choose which of these features to enable or disable based on how much data you want to collect and your meta-monitoring budget.

Loki Metrics

As Loki is a distributed system, each component exports its own metrics. The /metrics endpoint exposes hundreds of different metrics. You can find a sampling of the metrics exposed by Loki and their descriptions, in the sections below.

You can find a complete list of the exposed metrics by checking the /metrics endpoint.

http://<host>:<http_listen_port>/metrics

For example:

http://localhost:3100/metrics

Both Grafana Loki and Alloy expose a /metrics endpoint that exposes Prometheus metrics (the default port is 3100 for Loki and 12345 for Alloy). To store these metrics, you can use Prometheus or Mimir.

All components of Loki expose the following metrics:

Metric Name	Metric Type	Description
`loki_internal_log_messages_total`	Counter	Total number of log messages created by Loki itself.
`loki_request_duration_seconds`	Histogram	Time (in seconds) spent serving HTTP requests.

For a deeper look at which metrics are most important for detecting negative trends and abnormal behavior, refer to Key metrics for monitoring Loki.

Note that most of the metrics are counters and should continuously increase during normal operations. For example:

Your app emits a log line to a file that is tracked by Alloy.
Alloy reads the new line and increases its counters.
Alloy forwards the log line to a Loki distributor, where the received counters should increase.
The Loki distributor forwards the log line to a Loki ingester, where the request duration counter should increase.

If Alloy uses any pipelines with metrics stages, those metrics will also be exposed by Alloy at its /metrics endpoint.

Metrics cardinality

Some metrics carry labels that increase cardinality in large environments:

Client-side: Alloy and Promtail emit per-file metrics using a filename label. In environments with many tracked files, this can produce a large number of unique time series.
Server-side: Loki metrics such as loki_discarded_samples_total and loki_ingester_chunks_stored_total include a tenant label. Multi-tenant deployments with many tenants see proportional cardinality growth.

The Kubernetes Monitoring Helm chart includes metric relabeling rules to manage cardinality. If you auto-scale Loki components, be aware that each new pod adds its own set of per-instance time series.

Example Loki log line: metrics.go

Loki emits a metrics.go log line from the Querier, Query frontend and Ruler components, which lets you inspect query and recording rule performance. This is an example of a detailed log line metrics.go for a query.

Example log

level=info ts=2024-03-11T13:44:10.322919331Z caller=metrics.go:143 component=frontend org_id=mycompany latency=fast query="sum(count_over_time({kind=\"auditing\"} | json | user_userId =`` [1m]))" query_type=metric range_type=range length=10m0s start_delta=10m10.322900424s end_delta=10.322900663s step=1s duration=47.61044ms status=200 limit=100 returned_lines=0 throughput=9.8MB total_bytes=467kB total_entries=1 queue_time=0s subqueries=2 cache_chunk_req=1 cache_chunk_hit=1 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=14394 cache_index_req=19 cache_index_hit=19 cache_result_req=1 cache_result_hit=1

You can use the query-frontend metrics.go lines to understand a query’s overall performance. The metrics.go line output by the Queriers contains the same information as the Query frontend but is often more helpful in understanding and troubleshooting query performance. This is largely because it can tell you how the querier spent its time executing the subquery. Here are the most useful stats:

total_bytes: how many total bytes the query processed
duration: how long the query took to execute
throughput: total_bytes/duration
total_lines: how many total lines the query processed
length: how much time the query was executed over
post_filter_lines: how many lines matched the filters in the query
cache_chunk_req: total number of chunks fetched for the query (the cache will be asked for every chunk so this is equivalent to the total chunks requested)
splits: how many pieces the query was split into based on time and split_queries_by_interval
shards: how many shards the query was split into
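
The relationship throughput = total_bytes/duration from the list above can be checked against a metrics.go line. The following sketch parses a shortened copy of the example log line; the helper names are hypothetical, and it assumes sizes use decimal units (kB = 1000 bytes) and durations use Go-style formatting.

```python
import re
import shlex

# Shortened sample with the same shape as the example log line above.
LINE = (
    'level=info caller=metrics.go:143 component=frontend org_id=mycompany '
    'query="sum(count_over_time({kind=\\"auditing\\"} | json [1m]))" '
    'duration=47.61044ms throughput=9.8MB total_bytes=467kB queue_time=0s'
)

_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
                 "s": 1.0, "m": 60.0, "h": 3600.0}
_BYTE_UNITS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12}


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a logfmt line into key/value pairs (tokens without '=' are skipped)."""
    pairs = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            pairs[key] = value
    return pairs


def go_duration_seconds(text: str) -> float:
    """Convert a Go-style duration such as '47.61044ms' or '10m0s' to seconds."""
    parts = re.findall(r"([0-9.]+)(ns|us|µs|ms|s|m|h)", text)
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def size_bytes(text: str) -> float:
    """Convert a size such as '467kB' to bytes, assuming decimal (SI) units."""
    match = re.fullmatch(r"([0-9.]+)\s*([kMGT]?B)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _BYTE_UNITS[match.group(2)]


stats = parse_logfmt(LINE)
seconds = go_duration_seconds(stats["duration"])
throughput = size_bytes(stats["total_bytes"]) / seconds
print(f"{throughput / 1e6:.1f} MB/s")
```

For the sample values, the computed figure agrees with the throughput=9.8MB field reported in the log line.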

For more information, refer to the blog post The concise guide to Loki: How to get the most out of your query performance.

Configure Logging Levels

To change the configuration for Loki logging levels, update the log_level configuration parameter in your config.yaml file.

# Only log messages with the given severity or above. Valid levels: [debug,
# info, warn, error]
# CLI flag: -log.level
[log_level: <string> | default = "info"]

Manage and debug errors

Thu, 09 Apr 2026 02:28:18 +0000

This section provides information to help you troubleshoot issues with Grafana Loki.

Manage authentication

Thu, 09 Apr 2026 02:28:18 +0000

Grafana Loki does not come with any included authentication layer. You must run an authenticating reverse proxy in front of your services.

The simple scalable and microservices deployment modes require a reverse proxy to be deployed in front of Loki, to direct client API requests to the various components.

By default, the Loki Helm chart includes a reverse proxy configuration that uses an nginx container to handle traffic routing and authorization.

A list of open-source reverse proxies you can use:

HAProxy
nginx using their guide on restricting access with HTTP basic authentication
OAuth2 proxy
Pomerium, which has a guide for securing Grafana

Note
When using Loki in multi-tenant mode, Loki requires the HTTP header X-Scope-OrgID to be set to a string identifying the tenant; the responsibility of populating this value should be handled by the authenticating reverse proxy. For more information, read the multi-tenancy documentation.

For information on configuring authentication for your log shipping agent, see the Grafana Alloy documentation.

Enable basic authentication for Loki using nginx

This section describes the process of enabling basic authentication for Loki using nginx.

Prerequisites

A running Loki instance
A running nginx instance

Configure nginx

You must create a new nginx configuration file for the Loki instance.

This example assumes the following:

nginx is running in /opt/homebrew
Loki is running on port 3100 on the local machine
Your Loki tenant id is fake
The configuration file is named /opt/homebrew/etc/nginx/loki.conf

If you used different configuration parameters for Loki, adjust the examples to match your configuration.

loki.conf configuration:

upstream loki {
  server 127.0.0.1:3100;
  keepalive 15;
}

server {
  listen 80;
  server_name loki.localhost;

  auth_basic "loki auth";
  auth_basic_user_file /opt/homebrew/etc/nginx/passwords;

  location / {
    proxy_read_timeout 1800s;
    proxy_connect_timeout 1600s;
    proxy_pass http://loki;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_redirect off;
  }

  location /ready {
    proxy_pass http://loki;
    proxy_http_version 1.1;
    proxy_set_header Connection "Keep-Alive";
    proxy_set_header Proxy-Connection "Keep-Alive";
    proxy_redirect off;
    auth_basic "off";
  }
}

This configuration must be included in your main nginx configuration, for example, by including it in nginx.conf like:

include /opt/homebrew/etc/nginx/loki.conf;

Restart the nginx server to ensure all configuration changes are updated.

Validate your nginx configuration

To validate the nginx configuration for Loki, you can send a curl request to two endpoints:

The /ready endpoint, which is not protected by a basic authentication mechanism.

% curl -i http://loki.localhost/ready

HTTP/1.1 200 OK
Server: nginx/1.29.2
Date: Thu, 16 Oct 2025 14:28:31 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 6
Connection: keep-alive
X-Content-Type-Options: nosniff

ready

The / endpoint, which is protected by a basic authentication mechanism.

curl -i http://loki.localhost/

HTTP/1.1 401 Unauthorized
Server: nginx/1.29.2
Date: Thu, 16 Oct 2025 14:32:43 GMT
Content-Type: text/html
Content-Length: 179
Connection: keep-alive
WWW-Authenticate: Basic realm="loki auth"

<html>
<head><title>401 Authorization Required</title></head>
<body>
<center><h1>401 Authorization Required</h1></center>
<hr><center>nginx/1.29.2</center>
</body>
</html>

Update passwords

The password file can be seeded using whatever mechanism you may use for other web services.

In this example, htpasswd is utilized:

% htpasswd -c /opt/homebrew/etc/nginx/passwords loki123

New password:
Re-type new password:
Adding password for user loki123

Restart the nginx server to ensure all configuration changes are updated.

Validate passwords

Enter your password into a temporary file, such as:

% vi lokipw

Then, store it in a shell variable:

% pass=$(cat lokipw)

You can then validate that basic authentication is working by issuing a curl command to the protected resource:

curl -i -u "loki123:$pass" -H "X-Scope-OrgID:fake" "http://loki.localhost/loki/api/v1/labels"

HTTP/1.1 200 OK
Server: nginx/1.29.2
Date: Thu, 16 Oct 2025 14:46:09 GMT
Content-Type: application/json; charset=UTF-8
Content-Length: 21
Connection: keep-alive

{"status":"success"}
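
The same authenticated request can be issued from a script. The sketch below uses only the Python standard library to build a request that carries the basic authentication credentials and the tenant header; the helper name build_loki_request and the placeholder password are hypothetical, and the host, user, and tenant mirror the example above.

```python
import base64
import urllib.request


def build_loki_request(url: str, user: str, password: str, tenant: str) -> urllib.request.Request:
    """Build (but do not send) a GET request with basic auth and a tenant header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    # urllib normalizes the stored header name; HTTP header names are
    # case-insensitive, so the proxy and Loki still see X-Scope-OrgID.
    request.add_header("X-Scope-OrgID", tenant)
    return request


req = build_loki_request(
    "http://loki.localhost/loki/api/v1/labels",
    "loki123",
    "example-password",  # placeholder; read the real password from a secret store
    "fake",
)
print(req.get_header("Authorization"))
# To send the request: urllib.request.urlopen(req)
```

Building the header explicitly makes it clear what nginx checks against the password file: the base64 encoding of user:password.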

Manage bloom filter building and querying (Experimental)

Thu, 09 Apr 2026 02:28:18 +0000

Warning
In Loki and Grafana Enterprise Logs (GEL), Query acceleration using blooms is an experimental feature. Engineering and on-call support is not available. No SLA is provided. Note that this feature is intended for users who are ingesting more than 75TB of logs a month, as it is designed to accelerate queries against large volumes of logs.

In Grafana Cloud, Query acceleration using bloom filters is enabled as a public preview for select large-scale customers that are ingesting more than 75TB of logs a month. Limited support and no SLA are provided.

Loki leverages bloom filters to speed up queries by reducing the amount of data Loki needs to load from the store and iterate through. Loki is often used to run “needle in a haystack” queries; these are queries where a large number of log lines are searched, but only a few log lines match the query. Some common use cases are searching all logs tied to a specific trace ID or customer ID.

An example of such queries would be looking for a trace ID on a whole cluster for the past 24 hours:

{cluster="prod"} | traceID="3c0e3dcd33e7"

Without accelerated filtering, Loki downloads all the chunks for all the streams matching {cluster="prod"} for the last 24 hours and iterates through each log line in the chunks, checking if the structured metadata key traceID with value 3c0e3dcd33e7 is present.

With accelerated filtering, Loki is able to skip most of the chunks and only process the ones where we have a statistical confidence that the structured metadata pair might be present.

To learn how to write queries to use bloom filters, refer to Query acceleration.

Enable bloom filters

Warning
Building and querying bloom filters is, by design, not supported in single binary deployments. It can be used with the Simple Scalable deployment (SSD) mode, but it is recommended to run bloom components only in fully distributed microservice mode, because building and querying bloom filters carries a relatively high cost that only pays off in large-scale deployments.

To start building and using blooms you need to:

Deploy the Bloom Planner and Builder components (as microservices or via the SSD backend target) and enable the components in the Bloom Build config.
Deploy the Bloom Gateway component (as a microservice or via the SSD backend target) and enable the component in the Bloom Gateway config.
Enable blooms building and filtering for each tenant individually, or for all of them by default.

# Configuration block for the bloom creation.
bloom_build:
  enabled: true
  planner:
    planning_interval: 6h
  builder:
    planner_address: bloom-planner.<namespace>.svc.cluster.local:9095

# Configuration block for bloom filtering.
bloom_gateway:
  enabled: true
  client:
    addresses: dnssrvnoa+_bloom-gateway-grpc._tcp.bloom-gateway-headless.<namespace>.svc.cluster.local

# Enable blooms creation and filtering for all tenants by default
# or do it on a per-tenant basis.
limits_config:
  bloom_creation_enabled: true
  bloom_split_series_keyspace_by: 1024
  bloom_gateway_enable_filtering: true

For more configuration options, refer to the Bloom Gateway, Bloom Build, and per-tenant limits configuration docs. We strongly recommend reading the whole documentation for this experimental feature before using it.

Bloom Planner and Builder

Building bloom filters from the chunks in the object storage is done by two components: the Bloom Planner and the Bloom Builder, where the planner creates tasks for bloom building, and sends the tasks to the builders to process and upload the resulting blocks. Bloom filters are grouped in bloom blocks spanning multiple streams (also known as series) and chunks from a given day. To learn more about how blocks and metadata files are organized, refer to the Building blooms section below.

The Bloom Planner runs as a single instance and calculates the gaps in fingerprint ranges for a certain time period for a tenant for which bloom filters need to be built. It dispatches these tasks to the available builders. The planner also applies the blooms retention.

Warning
Do not run more than one instance of the Bloom Planner.

The Bloom Builder is a stateless horizontally scalable component and can be scaled independently of the planner to fulfill the processing demand of the created tasks.

You can find all the configuration options for these components in the Configure section for the Bloom Builder. Refer to the Enable bloom filters section above for a configuration snippet enabling this feature.

Retention

The Bloom Planner applies bloom block retention on object storage. Retention is disabled by default. When enabled, retention is applied to all tenants. The retention for each tenant is the longest of its configured general retention (retention_period) and the streams retention (retention_stream).

In the following example, tenant A has a bloom retention of 30 days, and tenant B has a bloom retention of 40 days because of its retention for the {namespace="prod"} stream.

overrides:
    "A":
        retention_period: 30d
    "B":
        retention_period: 30d
        retention_stream:
            - selector: '{namespace="prod"}'
              priority: 1
              period: 40d
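
The "longest of" rule described above can be expressed in a few lines. This sketch mirrors the overrides from the example as a Python dictionary; the function names are hypothetical, and the duration parser only handles the day unit used in the example.

```python
def parse_days(value: str) -> int:
    """Parse a day-based duration such as '30d' (only the 'd' unit is handled)."""
    if not value.endswith("d"):
        raise ValueError(f"unsupported duration: {value!r}")
    return int(value[:-1])


def bloom_retention_days(limits: dict) -> int:
    """Return the longest of the general retention and all stream retentions."""
    periods = [parse_days(limits["retention_period"])]
    periods += [parse_days(s["period"]) for s in limits.get("retention_stream", [])]
    return max(periods)


overrides = {
    "A": {"retention_period": "30d"},
    "B": {
        "retention_period": "30d",
        "retention_stream": [
            {"selector": '{namespace="prod"}', "priority": 1, "period": "40d"},
        ],
    },
}

for tenant, limits in overrides.items():
    print(tenant, bloom_retention_days(limits))
```

Because the planner keeps blooms for the longest applicable period, a single long stream retention extends bloom retention for the whole tenant.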

Planner and Builder sizing and configuration

The single planner instance runs the planning phase for bloom blocks for each tenant in the given interval and puts the created tasks into an internal task queue. Builders process tasks sequentially by pulling them from the queue. The number of builder replicas required to complete all pending tasks before the next planning iteration depends on the value of -bloom-build.planner.bloom_split_series_keyspace_by, the number of tenants, and the log volume of the streams.

The maximum block size is configured per tenant via -bloom-build.max-block-size. The actual block size might exceed this limit given that we append streams blooms to the block until the block is larger than the configured maximum size. Blocks are created in memory and as soon as they are written to the object store they are freed. Chunks and TSDB files are downloaded from the object store to the file system. We estimate that builders are able to process 4MB worth of data per second per core.

Bloom Gateway

Bloom Gateways handle chunk-filtering requests from the index gateway. The service takes a list of chunks and a filtering expression and matches them against the blooms, filtering out those chunks that do not match the given filter expression.

This component is horizontally scalable and every instance only owns a subset of the stream fingerprint range for which it performs the filtering. The sharding of the data is performed on the client side using DNS discovery of the server instances and the jumphash algorithm for consistent hashing and even distribution of the stream fingerprints across Bloom Gateway instances.
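
For readers unfamiliar with jump hashing, the following is a minimal Python sketch of the published jump consistent hash algorithm (Lamping and Veach). It shows how a 64-bit key maps to one of N buckets. How Loki derives the key from a stream fingerprint and how buckets map to discovered Bloom Gateway instances is not shown here; the fingerprints in the usage example are made up.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= 0xFFFFFFFFFFFFFFFF
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, masked to mimic unsigned overflow.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


# Hypothetical fingerprints spread over 3 gateway instances.
for fingerprint in (0x1A2B3C4D, 0xDEADBEEF, 42):
    print(hex(fingerprint), "-> instance", jump_hash(fingerprint, 3))
```

A useful property of this algorithm is that growing from N to N+1 buckets only moves keys into the new bucket, so most fingerprints keep their owner when an instance is added.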

You can find all the configuration options for this component in the Configure section for the Bloom Gateways. Refer to the Enable bloom filters section above for a configuration snippet enabling this feature.

Gateway sizing and configuration

Bloom Gateways use their local file system as a Least Recently Used (LRU) cache for blooms that are downloaded from object storage. The size of the blooms depends on the ingest volume and number of unique structured metadata key-value pairs, as well as on build settings of the blooms, namely false-positive-rate. With default settings, bloom filters make up <1% of the raw structured metadata size.

Since reading blooms depends heavily on disk IOPS, Bloom Gateways should make use of multiple, locally attached SSD disks (NVMe) to increase I/O throughput. Multiple directories on different disk mounts can be specified using the -bloom.shipper.working-directory setting when using a comma separated list of mount points, for example:

-bloom.shipper.working-directory="/mnt/data0,/mnt/data1,/mnt/data2,/mnt/data3"

Bloom Gateways need to deal with relatively large files: the bloom filter blocks. Even though the binary format of the bloom blocks allows for reading them into memory in smaller pages, the memory consumption depends on the number of pages that are concurrently loaded into memory for processing. The product of three settings controls the maximum amount of bloom data in memory at any given time: -bloom-gateway.worker-concurrency, -bloom-gateway.block-query-concurrency, and -bloom.max-query-page-size.

Example, assuming 4 CPU cores:

-bloom-gateway.worker-concurrency=4      // 1x NUM_CORES
-bloom-gateway.block-query-concurrency=8 // 2x NUM_CORES
-bloom.max-query-page-size=64MiB

4 x 8 x 64MiB = 2048MiB

Here, the memory requirement for block processing is 2GiB. To get the minimum requirements for the Bloom Gateways, you need to double the value.
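
The same calculation can be written as a small Python helper, which makes it easy to try other core counts or page sizes. The function name is an assumption made for illustration; the three parameters correspond to the flags above.

```python
MIB = 1024 * 1024


def bloom_gateway_memory_bytes(worker_concurrency: int,
                               block_query_concurrency: int,
                               max_query_page_size: int) -> tuple:
    """Return (block processing bytes, minimum requirement bytes)."""
    block_processing = worker_concurrency * block_query_concurrency * max_query_page_size
    # The minimum requirement is double the block processing figure.
    return block_processing, 2 * block_processing


num_cores = 4
processing, minimum = bloom_gateway_memory_bytes(num_cores, 2 * num_cores, 64 * MIB)
print(processing // MIB, "MiB for block processing")
print(minimum // MIB, "MiB minimum requirement")
```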

Building blooms

Bloom filters are built per stream and aggregated together into block files. Streams are assigned to blocks by their fingerprint, following the same ordering scheme as Loki’s TSDB and sharding calculation. This gives a data locality benefit when querying as streams in the same shard are likely to be in the same block.

In addition to blocks, builders maintain a list of metadata files containing references to bloom blocks and the TSDB index files they were built from. Gateways and the planner use these metadata files to discover existing blocks.

Every -bloom-build.planner.interval, the planner loads the latest TSDB files for all tenants for which bloom building is enabled and compares them with the latest bloom metadata files. If there are new TSDB files or any of them have changed, the planner creates a task for the streams and chunks referenced by the TSDB file.

The builder pulls a task from the planner’s queue and processes the streams and chunks it contains. For a given stream, the builder iterates through all the log lines inside its new chunks and builds a bloom for the stream. In case of changes for a previously processed TSDB file, builders try to reuse blooms from existing blocks instead of building new ones from scratch. The builder converts structured metadata from each log line of each chunk of a stream and appends the hash of each key and of each key-value pair to the bloom, followed by the same hashes combined with the chunk identifier. The first set of hashes allows gateways to skip whole streams, while the second set is for skipping individual chunks.

For example, given structured metadata foo=bar in the chunk c6dj8g, we append to the stream bloom the following hashes: hash("foo"), hash("foo=bar"), hash("c6dj8g" + "foo") and hash("c6dj8g" + "foo=bar").
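
The four inputs in this example can be generated with a short Python sketch. It only produces the strings that are passed to hash(); the hash function itself is not specified in this section, so it is left out. The function name `bloom_keys` is hypothetical.

```python
from typing import List


def bloom_keys(chunk_id: str, key: str, value: str) -> List[str]:
    """Return the strings whose hashes are appended to the stream bloom."""
    pair = f"{key}={value}"
    return [
        key,              # lets gateways skip whole streams by key
        pair,             # lets gateways skip whole streams by key-value pair
        chunk_id + key,   # lets gateways skip an individual chunk by key
        chunk_id + pair,  # lets gateways skip an individual chunk by key-value pair
    ]


print(bloom_keys("c6dj8g", "foo", "bar"))
```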

Query sharding

Query acceleration does not just happen while processing chunks; it also starts in the query planning phase, where the query frontend applies query sharding. Loki 3.0 introduces a new per-tenant configuration flag, tsdb_sharding_strategy. By default, it computes shards as in previous versions of Loki, using the index stats to come up with the closest power of two that would optimistically divide the data to process into shards of roughly the same size. However, data is often unevenly distributed across streams, so some shards end up processing more data than others.

Query acceleration introduces a new sharding strategy, bounded, which uses blooms to reduce the chunks to be processed during the planning phase in the query frontend, and evenly distributes the number of chunks each sharded query needs to process.

Manage large volume log streams with automatic stream sharding

Thu, 09 Apr 2026 02:28:18 +0000

Automatic stream sharding can keep streams under a desired_rate by adding new labels and values to existing streams. When properly tuned, this can eliminate issues where log producers are rate limited due to the per-stream rate limit.

To enable automatic stream sharding:

Edit the global limits_config of the Loki configuration file:
```
limits_config:
  shard_streams:
    enabled: true
```
Optionally lower the desired_rate in bytes if you find that the system is still hitting the per_stream_rate_limit:
```
limits_config:
  shard_streams:
    enabled: true
    desired_rate: 2097152 #2MiB
```
Optionally enable logging_enabled for debugging stream sharding.
Note
This may affect the ingestion performance of Loki.
```
limits_config:
  shard_streams:
    enabled: true
    logging_enabled: true
```

When to use automatic stream sharding

Large log streams present several problems for Loki, namely increased and uneven resource usage on Ingesters and Distributors. The general recommendation is to explore existing log streams for additional label values that are both useful for querying and sufficiently low cardinality. There are many cases, however, where no more labels can be extracted, or cardinality for a label is dangerously large. To protect itself from such volume leading to operational failure, Loki implements per-stream rate limits; but the result is that some data is lost. The per-stream limit also needs human intervention to change, which is not ideal when log volumes increase and decrease.

Loki uses automatic stream sharding to avoid rate limiting and large streams for any log stream by ensuring it is close to a configured desired_rate.

How automatic stream sharding works

Automatic stream sharding works by adding a new label, __stream_shard__, to streams and incrementing its value to try and keep all streams below a configured desired_rate.

The feature adds a new API to Ingesters that reports the size of all existing log streams. Once per second, Distributors query the API to get a picture of all stream rates in the system. Distributors use the existing stream-rate data and a configured desired_rate to determine how many shards a given stream should have. The desired number of new log streams are created with the label __stream_shard__ and logs are divided evenly among the streams.

Because automatic stream sharding is reactive and relies on successive calls to Ingesters, the view of current rates is always somewhat behind. As a result, the actual size of sharded streams will always be higher than the desired_rate. In practice, this is still sufficient to keep log producers from being rate limited by per-stream rate limits.
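
The idea can be illustrated with a simplified Python sketch. It assumes the shard count is the observed rate divided by desired_rate, rounded up, and that entries are spread round-robin across shards with values starting at 0. Loki's actual calculation and distribution may differ, and both function names are hypothetical.

```python
import math


def shard_count(stream_rate: float, desired_rate: float) -> int:
    """Number of shards needed to bring each shard near desired_rate (bytes/second)."""
    return max(1, math.ceil(stream_rate / desired_rate))


def shard_labels(labels: dict, shards: int, entry_index: int) -> dict:
    """Return the label set for one entry, adding __stream_shard__ when sharded."""
    if shards <= 1:
        return dict(labels)
    # Round-robin assignment is an assumption made for this sketch.
    return {**labels, "__stream_shard__": str(entry_index % shards)}


rate = 5 * 1024 * 1024  # hypothetical stream writing 5 MiB/s
shards = shard_count(rate, desired_rate=2097152)
for i in range(4):
    print(shard_labels({"app": "foo"}, shards, i))
```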

Automatic stream sharding metrics

Use these metrics to help tune Loki so that it is sharding streams aggressively enough to avoid the per-stream rate limit:

loki_rate_store_refresh_failures_total: The total number of failed attempts to refresh the distributor’s view of stream rates.
loki_rate_store_streams: The number of unique streams reported by all Ingesters. Sharded streams are reported as if they were unsharded.
loki_rate_store_max_stream_shards: The maximum number of shards for any tenant of the system.
loki_rate_store_stream_shards: A histogram of the distribution of shard counts across all streams.
loki_rate_store_max_stream_rate_bytes: The maximum stream size in bytes/second for any tenant of the system. Sharded streams are reported as if they are unsharded.
loki_rate_store_max_unique_stream_rate_bytes: The maximum size of any stream across all tenants. Stream shards are individually reported.
loki_rate_store_stream_rate_bytes: A histogram of the distribution of stream sizes across all tenants in bytes/second.
loki_stream_sharding_count: The total number of times that streams have been sharded. Useful for calculating the sharding rate.

Manage larger production deployments

Thu, 09 Apr 2026 02:28:18 +0000

When you need to scale Loki due to increased log volume, consider running several Loki processes partitioned by role (ingester, distributor, querier, and so on) rather than a single Loki process. Grafana Labs’ production setup contains .libsonnet files that demonstrate configuring separate components and scaling for resource usage.

Separate Query Scheduler

The Query frontend has an in-memory queue that can be moved out into a separate process similar to the Grafana Mimir query-scheduler. This allows running multiple query frontends.

To run with the Query Scheduler, the frontend needs to be passed the scheduler’s address via -frontend.scheduler-address, and the querier processes need to be started with -querier.scheduler-address set to the same address. Both options can also be defined via the configuration file.

It is not valid to start the querier with both a configured frontend and a scheduler address.

The query scheduler process itself can be started via the -target=query-scheduler option of the Loki Docker image. For instance, docker run grafana/loki:latest -config.file=/etc/loki/config.yaml -target=query-scheduler -server.http-listen-port=8009 -server.grpc-listen-port=9009 starts the query scheduler listening on ports 8009 and 9009.

Memory ballast

In compute-constrained environments, garbage collection can become a significant performance factor. Frequently-run garbage collection interferes with running the application by using CPU resources. The use of memory ballast can mitigate the issue. Memory ballast allocates extra, but unused virtual memory in order to inflate the quantity of live heap space. Garbage collection is triggered by the growth of heap space usage. The inflated quantity of heap space reduces the perceived growth, so garbage collection occurs less frequently.

Configure memory ballast using the ballast_bytes configuration option.
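
The effect of ballast can be illustrated with a deliberately simplified model of Go's garbage collection pacing. With the default GOGC value of 100, a collection is triggered roughly when the heap has grown by 100% over the live heap measured after the previous collection. The sketch below assumes the ballast counts as live heap and ignores other pacing factors in the Go runtime, so treat the numbers as an intuition aid rather than a prediction. The function name is hypothetical.

```python
MIB = 1024 * 1024


def allocation_headroom(live_heap: int, ballast: int = 0, gogc: int = 100) -> int:
    """Bytes that can be allocated before the next collection is triggered."""
    baseline = live_heap + ballast
    # Simplified pacing rule: trigger when the heap reaches baseline * (1 + GOGC/100).
    trigger = baseline * (100 + gogc) // 100
    return trigger - baseline


print(allocation_headroom(100 * MIB) // MIB, "MiB without ballast")
print(allocation_headroom(100 * MIB, ballast=1024 * MIB) // MIB, "MiB with 1 GiB ballast")
```

In this model, the same application can allocate far more between collections once the ballast is in place, which is why collections run less often.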

Remote rule evaluation

This feature was first proposed in LID-0002; it contains the design decisions which informed the implementation.

By default, the ruler component embeds a query engine to evaluate rules. This generally works fine, except when rules are complex or have to process a large amount of data regularly. Poor performance of the ruler manifests as recording rules metrics with gaps or missed alerts. This situation can be detected by alerting on the loki_prometheus_rule_group_iterations_missed_total metric when it has a non-zero value.

A solution to this problem is to externalize rule evaluation from the ruler process. The ruler embedded query engine is single-threaded, meaning that rules are not split, sharded, or otherwise accelerated like regular Loki queries. The query-frontend component exists explicitly for this purpose and, when combined with a number of querier instances, can massively improve rule evaluation performance and lead to fewer missed iterations.

It is generally recommended to create a separate query-frontend deployment and querier pool from your existing one - which handles adhoc queries via Grafana, logcli, or the API. Rules should be given priority over adhoc queries because they are used to produce metrics or alerts which may be crucial to the reliable operation of your service; if you use the same query-frontend and querier pool for both, your rules will be executed with the same priority as adhoc queries which could lead to unpredictable performance.

To enable remote rule evaluation, set the following configuration options:

ruler:
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///<query-frontend-service>:<grpc-port>

See here for further configuration options.

When you enable remote rule evaluation, the ruler component becomes a gRPC client to the query-frontend service; this will result in far lower ruler resource usage because the majority of the work has been externalized. The LogQL queries coming from the ruler will be executed against the given query-frontend service. Requests will be load-balanced across all query-frontend IPs if the dns:/// prefix is used.

Note
Queries that fail to execute are not retried.

Limits and Observability

Remote rule evaluation can be tuned with the following options:

ruler_remote_evaluation_timeout: maximum allowable execution time for rule evaluations
ruler_remote_evaluation_max_response_size: maximum allowable response size over gRPC connection from query-frontend to ruler

Both of these can be specified globally in the limits_config section or on a per-tenant basis.

Remote rule evaluation exposes a number of metrics:

loki_ruler_remote_eval_request_duration_seconds: time taken for rule evaluation (histogram)
loki_ruler_remote_eval_response_bytes: number of bytes in rule evaluation response (histogram)
loki_ruler_remote_eval_response_samples: number of samples in rule evaluation response (histogram)
loki_ruler_remote_eval_success_total: successful rule evaluations (counter)
loki_ruler_remote_eval_failure_total: unsuccessful rule evaluations with reasons (counter)

Each of these metrics is per-tenant, so cardinality must be taken into consideration.

Manage recording rules

Thu, 09 Apr 2026 02:28:18 +0000

Recording rules are queries that run in an interval and produce metrics from logs that can be pushed to a Prometheus compatible backend.

Recording rules are evaluated by the ruler component. Each ruler acts as its own querier, in the sense that it executes queries against the store without using the query-frontend or querier components. It will respect all query limits put in place for the querier.

The Loki implementation of recording rules largely reuses Prometheus’ code.

Samples generated by recording rules are sent to Prometheus using Prometheus’ remote-write feature.

Write-Ahead Log (WAL)

All samples generated by recording rules are written to a WAL. The WAL’s main benefit is that it persists the samples generated by recording rules to disk, which means that if your ruler crashes, you won’t lose any data. We are trading off extra memory usage and slower start-up times for this functionality.

A WAL is created per tenant; this is done to prevent cross-tenant interactions. If all samples were to be written to a single WAL, this would increase the chances that one tenant could cause data-loss for others. A typical scenario here is that Prometheus will, for example, reject a remote-write request with 100 samples if just 1 of those samples is invalid in some way.

Start-up

When the ruler starts up, it will load the WALs for the tenants who have recording rules. These WAL files are stored on disk and are loaded into memory.

Note
WALs are loaded one at a time upon start-up. This is a current limitation of the Loki ruler. For this reason, it is advisable to keep the number of rule groups serviced by a ruler to a reasonable size, since no rule evaluation occurs while WAL replay is in progress (this includes alerting rules).

Truncation

WAL files are regularly truncated to reduce their size on disk. This guide from one of the Prometheus maintainers (Ganesh Vernekar) gives an excellent overview of the truncation, checkpointing, and replaying of the WAL.

Cleaner

WAL Cleaner is an experimental feature.

The WAL Cleaner watches for abandoned WALs (tenants who no longer have recording rules associated) and deletes them. Enable this feature only if you are running into storage concerns with WALs that are too large. Because WALs are truncated regularly, they should not grow excessively large.

Scaling

See Mimir’s guide for configuring Grafana Mimir hash rings for scaling the ruler using a ring.

Note
The ruler shards by rule group, not by individual rules. This is an artifact of the fact that Prometheus recording rules need to run in order since one recording rule can reuse another - but this is not possible in Loki.

Deployment

The ruler needs to persist its WAL files to disk, and it incurs a bit of a start-up cost by reading these WALs into memory. As such, it is recommended that you try to minimize churn of individual ruler instances since rule evaluation is blocked while the WALs are being read from disk.

Kubernetes

It is recommended that you run the rulers using StatefulSets. The ruler will write its WAL files to persistent storage, so a Persistent Volume should be utilised.

Remote-Write

Client configuration

Remote-write client configuration is fully compatible with the Prometheus configuration format.

remote_write:
  clients:
    mimir:
      url: http://mimir/api/v1/push
      write_relabel_configs:
      - action: replace
        target_label: job
        replacement: loki-recording-rules

Per-Tenant Limits

Remote-write can be configured at a global level in the base configuration, and certain parameters tuned specifically on a per-tenant basis. Most of the configuration options defined here have override options (which can be also applied at runtime!).

Tuning

Remote-write can be tuned if the default configuration is insufficient (see Failure Modes below).

There is a guide on the Prometheus website, all of which applies to Loki, too.

Rules can be evenly distributed across available rulers by using -ruler.enable-sharding=true and -ruler.sharding-strategy="by-rule". Rule groups execute in order; this is a feature inherited from Prometheus’ rule engine (which Loki uses), but Loki has no need for this constraint because rules cannot depend on each other. The default sharding strategy shards by rule group, but this may be undesirable because some rule groups could contain more expensive rules, which can lead to subsequent rules missing evaluations. The by-rule sharding strategy creates one rule group for each rule the ruler instance “owns” (based on its hash ring), and these rule groups are all executed concurrently.
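
The difference between the two strategies can be sketched in Python. This is a toy model: a simple modulo over a SHA-256 digest stands in for the hash ring, and the ruler names, group names, and rule names are all hypothetical. It is meant only to show what the unit of ownership is in each strategy.

```python
import hashlib
from collections import defaultdict


def owner(key: str, rulers: list) -> str:
    """Pick a ruler for a key; modulo hashing stands in for the hash ring."""
    digest = hashlib.sha256(key.encode()).digest()
    return rulers[int.from_bytes(digest[:8], "big") % len(rulers)]


def assign(groups: dict, rulers: list, by_rule: bool) -> dict:
    """Map each ruler to the (group, rule) pairs it owns."""
    assignment = defaultdict(list)
    for group, rules in groups.items():
        for rule in rules:
            # Default strategy: the whole group is the unit of ownership.
            # by-rule strategy: every rule is hashed on its own.
            key = f"{group}/{rule}" if by_rule else group
            assignment[owner(key, rulers)].append((group, rule))
    return dict(assignment)


rulers = ["ruler-0", "ruler-1", "ruler-2"]
groups = {"cheap": ["r1"], "expensive": ["r2", "r3", "r4"]}
print(assign(groups, rulers, by_rule=False))
print(assign(groups, rulers, by_rule=True))
```

With the default strategy, all three rules of the "expensive" group always land on the same ruler, whereas by-rule lets them spread across rulers.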

Observability

Since Loki reuses the Prometheus code for recording rules and WALs, it also gains all of Prometheus’ observability.

Prometheus exposes a number of metrics for its WAL implementation, and these have all been prefixed with loki_ruler_wal_.

For example: prometheus_remote_storage_bytes_total → loki_ruler_wal_prometheus_remote_storage_bytes_total

Additional metrics are exposed, also with the prefix loki_ruler_wal_. All per-tenant metrics contain a tenant label, so be aware that cardinality could begin to be a concern if the number of tenants grows sufficiently large.

Some key metrics to note are:

loki_ruler_wal_appender_ready: whether a WAL appender is ready to accept samples (1) or not (0)
loki_ruler_wal_prometheus_remote_storage_samples_total: number of samples sent per tenant to remote storage
loki_ruler_wal_prometheus_remote_storage_samples...
- loki_ruler_wal_prometheus_remote_storage_samples_pending_total: samples buffered in memory, waiting to be sent to remote storage
- loki_ruler_wal_prometheus_remote_storage_samples_failed_total: samples that failed when sent to remote storage
- loki_ruler_wal_prometheus_remote_storage_samples_dropped_total: samples dropped by relabel configurations
- loki_ruler_wal_prometheus_remote_storage_samples_retried_total: samples resent to remote storage
loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds: highest timestamp of sample appended to WAL
loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds: highest timestamp of sample sent to remote storage.

We’ve created a basic dashboard in our loki-mixin which you can use to administer recording rules.

Failure Modes

Remote-Write Lagging

Remote-write can lag behind for many reasons:

1. Remote-write storage (Prometheus) is temporarily unavailable
2. A tenant is producing samples too quickly from a recording rule
3. Remote-write is tuned too low, creating backpressure

The lag can be determined by subtracting loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds from loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds.

In case 1, the ruler will continue to retry sending these samples until the remote storage becomes available again. Be aware that if the remote storage is down for longer than ruler.wal.max-age, data loss may occur once the WAL is truncated.

In cases 2 and 3, you should consider tuning remote-write appropriately.
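
The lag calculation and the ruler.wal.max-age check can be sketched as follows. The two timestamp arguments correspond to the metrics named above. The function names, the sample values, and the 4-hour max-age used in the example are assumptions for illustration.

```python
def remote_write_lag_seconds(highest_appended_ts: float, highest_sent_ts: float) -> float:
    """Lag between the newest sample in the WAL and the newest sample sent."""
    return max(0.0, highest_appended_ts - highest_sent_ts)


def at_risk_of_data_loss(lag_seconds: float, wal_max_age_seconds: float) -> bool:
    """True when unsent samples are older than the WAL is allowed to keep them."""
    return lag_seconds > wal_max_age_seconds


# Hypothetical metric values, in Unix seconds.
lag = remote_write_lag_seconds(1_775_700_000, 1_775_680_000)
print("lag:", lag, "seconds")
print("at risk:", at_risk_of_data_loss(lag, wal_max_age_seconds=4 * 3600))
```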

Further reading: see this blog post by Prometheus maintainer Callum Styan.

Appender Not Ready

Each tenant’s WAL has an “appender” internally; this appender is used to append samples to the WAL. The appender is marked as not ready until the WAL replay is complete upon startup. If the WAL is corrupted for some reason, or is taking a long time to replay, you can determine this by alerting on loki_ruler_wal_appender_ready < 1.

Corrupt WAL

If a disk fails or the ruler does not terminate correctly, there’s a chance one or more tenant WALs can become corrupted. A mechanism exists for automatically repairing the WAL, but this cannot handle every conceivable scenario. In this case, the loki_ruler_wal_corruptions_repair_failed_total metric will be incremented.

Found another failure mode?

Open an issue and tell us about it!

Manage storage

Thu, 09 Apr 2026 02:28:18 +0000

You can read a high level overview of Loki storage here

Grafana Loki needs to store two different types of data: chunks and indexes.

When Accelerated Search (experimental) is used, Loki stores a third data type: bloom blocks.

Loki receives logs in separate streams, where each stream is uniquely identified by its tenant ID and its set of labels. As log entries from a stream arrive, they are compressed as chunks and saved in the chunks store. See chunk format for how chunks are stored internally.

The index stores each stream’s label set and links them to the individual chunks. Refer to the Loki configuration for details on how to configure the storage and the index.

For more information:

Store Types

✅ Supported index stores

Single Store TSDB index store which stores TSDB index files in the object store. This is the recommended index store for Loki 2.8 and newer.
Single Store BoltDB (boltdb-shipper) index store which stores boltdb index files in the object store. Recommended store for Loki 2.0 through 2.7.x.

❌ Deprecated index stores

Amazon DynamoDB. Support for this is deprecated and will be removed in a future release.
Google Bigtable. Support for this is deprecated and will be removed in a future release.
Apache Cassandra. Support for this is deprecated and will be removed in a future release.
BoltDB (doesn’t work when clustering Loki)

✅ Supported and recommended chunks stores

⚠️ Supported chunks stores, not typically recommended for production use

Filesystem (please read more about the filesystem to understand the pros/cons before using with production data)
S3 API compatible storage, such as MinIO

❌ Deprecated chunks stores

Amazon DynamoDB. Support for this is deprecated and will be removed in a future release.
Google Bigtable. Support for this is deprecated and will be removed in a future release.
Apache Cassandra. Support for this is deprecated and will be removed in a future release.

Cloud Storage Permissions

S3

When using S3 as object storage, the following permissions are needed:

s3:ListBucket
s3:PutObject
s3:GetObject
s3:DeleteObject (if running the Single Store (boltdb-shipper) compactor)

Resources: arn:aws:s3:::<bucket_name>, arn:aws:s3:::<bucket_name>/*

See the AWS deployment section on the storage page for a detailed setup guide.

DynamoDB

Note
DynamoDB support is deprecated and will be removed in a future release.

When using DynamoDB for the index, the following permissions are needed:

dynamodb:BatchGetItem
dynamodb:BatchWriteItem
dynamodb:DeleteItem
dynamodb:DescribeTable
dynamodb:GetItem
dynamodb:ListTagsOfResource
dynamodb:PutItem
dynamodb:Query
dynamodb:TagResource
dynamodb:UntagResource
dynamodb:UpdateItem
dynamodb:UpdateTable
dynamodb:CreateTable
dynamodb:DeleteTable (if table_manager.retention_period is more than 0s)

Resources: arn:aws:dynamodb:<aws_region>:<aws_account_id>:table/<prefix>*

dynamodb:ListTables

Resources: *

AutoScaling

If you enable autoscaling from table manager, the following permissions are needed:

Application Autoscaling

application-autoscaling:DescribeScalableTargets
application-autoscaling:DescribeScalingPolicies
application-autoscaling:RegisterScalableTarget
application-autoscaling:DeregisterScalableTarget
application-autoscaling:PutScalingPolicy
application-autoscaling:DeleteScalingPolicy

Resources: *

IAM

iam:GetRole
iam:PassRole

Resources: arn:aws:iam::<aws_account_id>:role/<role_name>

IBM Cloud Object Storage

When using IBM Cloud Object Storage (COS) as object storage, IAM Writer role is needed.

See the IBM Cloud Object Storage section on the storage page for a detailed setup guide.

Chunk Format

// Header
+-----------------------------------+
| Magic Number (uint32, 4 bytes)    |
+-----------------------------------+
| Version (1 byte)                  |
+-----------------------------------+
| Encoding (1 byte)                 |
+-----------------------------------+

// Blocks
+--------------------+----------------------------+
| block 1 (n bytes)  | checksum (uint32, 4 bytes) |
+--------------------+----------------------------+
| block 2 (n bytes)  | checksum (uint32, 4 bytes) |
+--------------------+----------------------------+
| ...                                             |
+--------------------+----------------------------+
| block N (n bytes)  | checksum (uint32, 4 bytes) |
+--------------------+----------------------------+

// Metas
+------------------------------------------------------------------------------------------------------------------------+
| #blocks (uvarint)                                                                                                      |
+--------------------+-----------------+-----------------+------------------+---------------+----------------------------+
| #entries (uvarint) | minTs (uvarint) | maxTs (uvarint) | offset (uvarint) | len (uvarint) | uncompressedSize (uvarint) |
+--------------------+-----------------+-----------------+------------------+---------------+----------------------------+
| #entries (uvarint) | minTs (uvarint) | maxTs (uvarint) | offset (uvarint) | len (uvarint) | uncompressedSize (uvarint) |
+--------------------+-----------------+-----------------+------------------+---------------+----------------------------+
| ...                                                                                                                    |
+--------------------+-----------------+-----------------+------------------+---------------+----------------------------+
| #entries (uvarint) | minTs (uvarint) | maxTs (uvarint) | offset (uvarint) | len (uvarint) | uncompressedSize (uvarint) |
+--------------------+-----------------+-----------------+------------------+---------------+----------------------------+
| checksum (uint32, 4 bytes)                                                                                             | 
+------------------------------------------------------------------------------------------------------------------------+

// Structured Metadata
+---------------------------------+
| #labels (uvarint)               |
+---------------+-----------------+
| len (uvarint) | value (n bytes) |
+---------------+-----------------+
| ...                             |
+---------------+-----------------+
| checksum (uint32, 4 bytes)      |
+---------------------------------+

// Footer
+-----------------------+--------------------------+
| len (uint64, 8 bytes) | offset (uint64, 8 bytes) |   // offset to Structured Metadata
+-----------------------+--------------------------+
| len (uint64, 8 bytes) | offset (uint64, 8 bytes) |   // offset to Metas
+-----------------------+--------------------------+
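
To make the layout concrete, the following Python sketch reads the header and footer fields shown in the diagrams. It assumes big-endian byte order and that the footer occupies the last 32 bytes of the chunk in the order drawn above. Neither assumption is stated in this section, and the magic number in the usage example is a made-up value.

```python
import struct

# Big-endian byte order is an assumption of this sketch.
HEADER = struct.Struct(">IBB")   # magic (uint32), version (1 byte), encoding (1 byte)
FOOTER = struct.Struct(">QQQQ")  # (len, offset) for structured metadata, then for metas


def parse_header(chunk: bytes) -> dict:
    """Read the fixed-size header from the start of the chunk."""
    magic, version, encoding = HEADER.unpack_from(chunk, 0)
    return {"magic": magic, "version": version, "encoding": encoding}


def parse_footer(chunk: bytes) -> dict:
    """Read the fixed-size footer from the end of the chunk."""
    sm_len, sm_off, metas_len, metas_off = FOOTER.unpack_from(chunk, len(chunk) - FOOTER.size)
    return {
        "structured_metadata": {"len": sm_len, "offset": sm_off},
        "metas": {"len": metas_len, "offset": metas_off},
    }


# Build a fake chunk: header, 10 placeholder bytes, footer.
fake = HEADER.pack(0xABCD1234, 4, 1) + b"x" * 10 + FOOTER.pack(3, 6, 7, 9)
print(parse_header(fake))
print(parse_footer(fake))
```

Reading the footer first gives the offsets of the Metas and Structured Metadata sections, which in turn locate the individual blocks.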

Manage tenant isolation

Thu, 09 Apr 2026 02:28:18 +0000

Grafana Loki is a multi-tenant system; requests and data for tenant A are isolated from tenant B. Requests to the Loki API should include an HTTP header (X-Scope-OrgID) that identifies the tenant for the request.

Tenant IDs can be any alphanumeric string that fits within the Go HTTP header limit (1MB). Operators are recommended to use a reasonable limit for uniquely identifying tenants; 20 bytes is usually enough.

Loki defaults to running in multi-tenant mode. Multi-tenant mode is set in the configuration with auth_enabled: true.

When configured with auth_enabled: false, Loki uses a single tenant. The X-Scope-OrgID header is not required in Loki API requests. The single tenant ID will be the string fake.

Multi-tenant Queries

In multi-tenant mode, queries may gather results from multiple tenants. Set the querier configuration option multi_tenant_queries_enabled: true to enable queries across tenants. The query API request defines the tenants. Specify multiple tenants in the query request HTTP header X-Scope-OrgID by separating the tenant IDs with the pipe character (|). For example, a query for tenants A and B requires the header X-Scope-OrgID: A|B.

Only query endpoints support multi-tenant calls. Calls to GET /loki/api/v1/tail and POST /loki/api/v1/push will return an HTTP 400 error if more than one tenant is defined in the HTTP header.
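
A client can build the header with a small helper like the one below. This is a sketch: the function name and the local check that mirrors the HTTP 400 behavior for the tail and push endpoints are assumptions, and it does not send any request.

```python
# Endpoints listed above as rejecting more than one tenant.
SINGLE_TENANT_ENDPOINTS = {
    ("GET", "/loki/api/v1/tail"),
    ("POST", "/loki/api/v1/push"),
}


def org_id_header(method: str, path: str, tenants: list) -> dict:
    """Build the X-Scope-OrgID header, joining tenant IDs with the pipe character."""
    if not tenants:
        raise ValueError("at least one tenant ID is required")
    if len(tenants) > 1 and (method.upper(), path) in SINGLE_TENANT_ENDPOINTS:
        # Loki would answer with HTTP 400; fail early on the client side instead.
        raise ValueError(f"{method} {path} accepts a single tenant only")
    return {"X-Scope-OrgID": "|".join(tenants)}


print(org_id_header("GET", "/loki/api/v1/query_range", ["A", "B"]))
```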

Instant and range queries support label filtering using tenant IDs. For example, the query

{app="foo", __tenant_id__=~"a.+"} | logfmt

will return results for all tenants that have a tenant ID that begins with the character a.

If the label __tenant_id__ is already present in a log stream, it is prepended with the string original_.

Tenant ID filtering in stages is not supported. An example of a query that will not work:

{app="foo"} | __tenant_id__="1" | logfmt

Restrictions

Tenant IDs must not be longer than 150 bytes and can only include the following characters:

Alphanumeric characters
- 0-9
- a-z
- A-Z
Special characters
- Exclamation point (!)
- Hyphen (-)
- Underscore (_)
- Single period (.)
- Asterisk (*)
- Single quote (')
- Open parenthesis (()
- Close parenthesis ())

Note
For security reasons, . and .. aren’t valid tenant IDs.
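
These restrictions can be checked before sending a request. The following Python sketch encodes the rules listed above; the function name is hypothetical, and Loki performs its own validation regardless of what a client checks.

```python
import re

# Alphanumerics plus the special characters listed above.
_ALLOWED = re.compile(r"[0-9a-zA-Z!\-_.*'()]+")


def is_valid_tenant_id(tenant_id: str) -> bool:
    """Check a tenant ID against the length and character restrictions."""
    if tenant_id in (".", ".."):
        return False  # rejected for security reasons
    if len(tenant_id.encode("utf-8")) > 150:
        return False  # must not be longer than 150 bytes
    return _ALLOWED.fullmatch(tenant_id) is not None


for candidate in ("team-a_1", "a|b", "..", "prod(eu)"):
    print(repr(candidate), is_valid_tenant_id(candidate))
```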

Manage varying workloads at scale with autoscaling queriers

Thu, 09 Apr 2026 02:28:18 +0000

A microservices deployment of a Loki cluster that runs on Kubernetes typically handles a workload that varies throughout the day. To make Loki easier to operate and optimize the cost of running Loki at scale, we have designed a set of resources to help you autoscale your Loki queriers.

Prerequisites

You need to run Loki in Kubernetes as a set of microservices. You need to use the query-scheduler.

We recommend using Kubernetes Event-Driven Autoscaling (KEDA) to configure autoscaling based on Prometheus metrics. Refer to Deploying KEDA to learn more about setting up KEDA in your Kubernetes cluster.

Scaling metric

Because queriers pull queries from the query-scheduler queue and process them on the querier workers, you should scale the queriers based on:

The scheduler queue size.
The queries running in the queriers.

The query-scheduler exposes the loki_query_scheduler_inflight_requests metric. It tracks the sum of queued queries plus the number of queries currently running in the querier workers. The following query is useful to scale queriers based on the inflight requests.

sum(
  max_over_time(
    loki_query_scheduler_inflight_requests{namespace="loki-cluster", quantile="<Q>"}[<R>]
  )
)

Use the quantile (Q) and the range (R) parameters to fine-tune the metric. The higher Q is, the more sensitive the metric is to short-lived spikes. Increasing R reduces the variation in the metric over time; a higher R value helps prevent the autoscaler from modifying the number of replicas too frequently.

In our experience, we have found that a Q of 0.75 and an R of 2 minutes work well. You can adjust these values according to your workload.
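To make the two tuning parameters concrete, this small sketch fills Q and R into the query template shown above. The helper name is hypothetical; the namespace and the values 0.75 and 2m are the ones used in this section.

```python
def scaling_query(namespace: str, quantile: float, range_: str) -> str:
    """Fill the quantile (Q) and range (R) into the scaling query template."""
    selector = (
        "loki_query_scheduler_inflight_requests"
        f'{{namespace="{namespace}", quantile="{quantile}"}}'
    )
    return f"sum(max_over_time({selector}[{range_}]))"


print(scaling_query("loki-cluster", 0.75, "2m"))
```

The printed string can be pasted into Prometheus to check how the metric behaves for a given Q and R before wiring it into an autoscaler.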

Cluster capacity planning

To scale the Loki queriers, you configure the following settings:

The threshold for scaling up and down
The scale down stabilization period
The minimum and the maximum number of queriers

Querier workers process queries from the queue. You can configure each Loki querier to run several workers. To reserve workforce headroom to address workload spikes, our recommendation is not to use more than 75% of the workers. For example, if you configure the Loki queriers to run 6 workers, set a threshold of floor(0.75 * 6) = 4.
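The threshold arithmetic above is simple enough to express directly. The following sketch assumes the 75% target from the text; the function name is hypothetical.

```python
import math


def scaling_threshold(workers_per_querier: int, target_utilization: float = 0.75) -> int:
    """Number of busy workers per querier at which to scale up."""
    return math.floor(workers_per_querier * target_utilization)


print(scaling_threshold(6))  # 4, matching floor(0.75 * 6)
print(scaling_threshold(8))  # 6
```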

To determine the minimum number of queriers that you should run, run at least one querier and determine the average number of inflight requests the system processes 75% of the time over seven days. The target utilization of the queriers is 75%. So if we use 6 workers per querier, we will use the following query:

clamp_min(ceil(
    avg(
        avg_over_time(loki_query_scheduler_inflight_requests{namespace="loki-cluster", quantile="0.75"}[7d])
    ) / scalar(floor(vector(6 * 0.75)))
), 1)

The maximum number of queriers to run is equal to the number of queriers required to process all inflight requests 50% of the time during a seven-day timespan. As for the previous example, if each querier runs 6 workers, divide the inflight requests by 6. The resulting query becomes:

ceil(
    max(
        max_over_time(loki_query_scheduler_inflight_requests{namespace="loki-cluster", quantile="0.5"}[7d])
    ) / 6
)
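The two capacity queries reduce to the arithmetic below. This is a sketch of that arithmetic only: the inflight values passed in are invented for illustration (chosen so the results match the replica counts used in the KEDA example), and the function names are hypothetical.

```python
import math


def min_queriers(avg_inflight_p75: float, workers: int, utilization: float = 0.75) -> int:
    """Mirror clamp_min(ceil(avg / floor(workers * utilization)), 1)."""
    threshold = math.floor(workers * utilization)
    if threshold < 1:
        raise ValueError("workers * utilization must be at least 1")
    return max(1, math.ceil(avg_inflight_p75 / threshold))


def max_queriers(peak_inflight_p50: float, workers: int) -> int:
    """Mirror ceil(max / workers): every worker is assumed to be usable."""
    return math.ceil(peak_inflight_p50 / workers)


# Hypothetical seven-day measurements, not real data.
print(min_queriers(avg_inflight_p75=38, workers=6))    # 10
print(max_queriers(peak_inflight_p50=300, workers=6))  # 50
```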

To minimize the scenario where Loki scales up shortly after scaling down, set a stabilization window for scaling down.

KEDA configuration

This KEDA ScaledObject example configures autoscaling for the querier deployment in the loki-cluster namespace. The example shows the minimum number of replicas set to 10 and the maximum number of replicas set to 50. Because each querier runs 6 workers, aiming to use 75% of those workers, the threshold is set to 4. The metric is served at http://prometheus.default:9090/prometheus. We configure a stabilization window of 30 minutes.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: querier
  namespace: loki-cluster
spec:
  maxReplicaCount: 50
  minReplicaCount: 10
  scaleTargetRef:
    kind: Deployment
    name: querier
  triggers:
  - metadata:
      metricName: querier_autoscaling_metric
      query: sum(max_over_time(loki_query_scheduler_inflight_requests{namespace="loki-cluster", quantile="0.75"}[2m]))
      serverAddress: http://prometheus.default:9090/prometheus
      threshold: "4"
    type: prometheus
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 1800
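To get a feel for how this configuration behaves, the sketch below approximates the replica count the autoscaler would target for a given metric value. It assumes the trigger is evaluated as an average value per replica, so the target is roughly the metric divided by the threshold, clamped to the configured bounds. It ignores the autoscaler's tolerance and the stabilization window, so treat it as an approximation rather than the exact algorithm.

```python
import math


def desired_replicas(metric_value: float, threshold: int = 4,
                     min_replicas: int = 10, max_replicas: int = 50) -> int:
    """Approximate target replica count for the ScaledObject above."""
    wanted = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(100))  # 25
print(desired_replicas(8))    # 10, held at the minimum
print(desired_replicas(400))  # 50, capped at the maximum
```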

Prometheus alerting when at capacity

Because the configured maximum might not be sufficient, a Prometheus alert can identify when the quantity of queriers has been at its configured maximum for an extended time. The following example specifies three hours (3h) as the extended time:

alert: LokiAutoscalerMaxedOut
expr: kube_horizontalpodautoscaler_status_current_replicas{namespace=~"loki-cluster"} == kube_horizontalpodautoscaler_spec_max_replicas{namespace=~"loki-cluster"}
for: 3h
labels:
  severity: warning
annotations:
  description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 3h; this can indicate underprovisioning.
  summary: HPA has been running at max replicas for an extended time

Manage version upgrades

Thu, 09 Apr 2026 02:28:18 +0000

Upgrade from one Loki version to a newer version.
Upgrade Helm from Helm v2.x to Helm v3.x.

Monitor tenant limits using the Overrides Exporter

Thu, 09 Apr 2026 02:28:18 +0000

Loki is a multi-tenant system that supports applying limits to each tenant as a mechanism for resource management. The overrides-exporter module exposes these limits as Prometheus metrics in order to help operators better understand tenant behavior.

Context

Configuration updates to tenant limits can be applied to Loki without restart via the runtime_config feature.

Example

The overrides-exporter module is disabled by default. We recommend running a single instance per cluster to avoid issues with metric cardinality. After loading the Loki configuration, the overrides-exporter exposes one loki_overrides_defaults series for every scalar field in the limits configuration, set to the default value for that field. It also exposes one loki_overrides series for every field that differs from the default, for every tenant.

Using an example runtime.yaml:

overrides:
  "tenant_1":
    ingestion_rate_mb: 10
    max_streams_per_user: 100000
    max_chunks_per_query: 100000

Launch an instance of the overrides-exporter:

loki -target=overrides-exporter -runtime-config.file=runtime.yaml -config.file=basic_schema_config.yaml -server.http-listen-port=8080

To inspect the tenant limit overrides:

$ curl -sq localhost:8080/metrics | grep override
# HELP loki_overrides Resource limit overrides applied to tenants
# TYPE loki_overrides gauge
loki_overrides{limit_name="ingestion_rate_mb",user="tenant_1"} 10
loki_overrides{limit_name="max_chunks_per_query",user="tenant_1"} 100000
loki_overrides{limit_name="max_streams_per_user",user="tenant_1"} 100000
# HELP loki_overrides_defaults Default values for resource limit overrides applied to tenants
# TYPE loki_overrides_defaults gauge
loki_overrides_defaults{limit_name="cardinality_limit"} 100000
loki_overrides_defaults{limit_name="creation_grace_period"} 6e+11
loki_overrides_defaults{limit_name="ingestion_burst_size_mb"} 6
loki_overrides_defaults{limit_name="ingestion_rate_mb"} 4
loki_overrides_defaults{limit_name="max_cache_freshness_per_query"} 6e+10
loki_overrides_defaults{limit_name="max_chunks_per_query"} 2e+06
loki_overrides_defaults{limit_name="max_concurrent_tail_requests"} 10
loki_overrides_defaults{limit_name="max_entries_limit_per_query"} 5000
loki_overrides_defaults{limit_name="max_global_streams_per_user"} 5000
loki_overrides_defaults{limit_name="max_label_name_length"} 1024
loki_overrides_defaults{limit_name="max_label_names_per_series"} 30
loki_overrides_defaults{limit_name="max_label_value_length"} 2048
loki_overrides_defaults{limit_name="max_line_size"} 0
loki_overrides_defaults{limit_name="max_queriers_per_tenant"} 0
loki_overrides_defaults{limit_name="max_query_length"} 2.5956e+15
loki_overrides_defaults{limit_name="max_query_lookback"} 0
loki_overrides_defaults{limit_name="max_query_parallelism"} 32
loki_overrides_defaults{limit_name="max_query_series"} 500
loki_overrides_defaults{limit_name="max_streams_matchers_per_query"} 1000
loki_overrides_defaults{limit_name="max_streams_per_user"} 0
loki_overrides_defaults{limit_name="min_sharding_lookback"} 0
loki_overrides_defaults{limit_name="per_stream_rate_limit"} 3.145728e+06
loki_overrides_defaults{limit_name="per_stream_rate_limit_burst"} 1.572864e+07
loki_overrides_defaults{limit_name="per_tenant_override_period"} 1e+10
loki_overrides_defaults{limit_name="reject_old_samples_max_age"} 1.2096e+15
loki_overrides_defaults{limit_name="retention_period"} 2.6784e+15
loki_overrides_defaults{limit_name="ruler_evaluation_delay_duration"} 0
loki_overrides_defaults{limit_name="ruler_max_rule_groups_per_tenant"} 0
loki_overrides_defaults{limit_name="ruler_max_rules_per_rule_group"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_batch_send_deadline"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_capacity"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_max_backoff"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_max_samples_per_send"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_max_shards"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_min_backoff"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_queue_min_shards"} 0
loki_overrides_defaults{limit_name="ruler_remote_write_timeout"} 0
loki_overrides_defaults{limit_name="split_queries_by_interval"} 0

Alerts can be created based on these metrics to inform operators when tenants are close to hitting their limits, so that increases can be applied before the tenant limits are exceeded.
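As a sketch of how the exporter output can be consumed, the following Python function parses metrics text in the format shown above and merges each tenant's overrides over the defaults. The function name is hypothetical, and the parser handles only the simple label layout seen in the sample output.

```python
import re

_SAMPLE = re.compile(r"^(loki_overrides(?:_defaults)?)\{(.*)\}\s+(\S+)$")
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def effective_limits(metrics_text: str) -> dict:
    """Map each tenant with overrides to its effective limits."""
    defaults, overrides = {}, {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # skips HELP/TYPE comments and unrelated metrics
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels))
        if name == "loki_overrides_defaults":
            defaults[labels["limit_name"]] = float(value)
        else:
            overrides.setdefault(labels["user"], {})[labels["limit_name"]] = float(value)
    # Tenant-specific values take precedence over the defaults.
    return {tenant: {**defaults, **limits} for tenant, limits in overrides.items()}


sample = """
# TYPE loki_overrides gauge
loki_overrides{limit_name="ingestion_rate_mb",user="tenant_1"} 10
loki_overrides_defaults{limit_name="ingestion_rate_mb"} 4
loki_overrides_defaults{limit_name="max_query_series"} 500
"""
print(effective_limits(sample))
```

Only tenants that have at least one override appear in the result; every other tenant runs with the defaults.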

Speed up ingester rollout using zone awareness

Thu, 09 Apr 2026 02:28:18 +0000

Grafana Labs uses zone-aware ingesters to make rollouts of large Loki deployments easier. You can think of them as three logical zones; with some extra Kubernetes configuration, you could also deploy them in separate zones.

By default, an incoming log stream’s logs are replicated to 3 random ingesters. Except when replicas are scaled up or down, a given stream is always replicated to the same 3 ingesters. This means that if one of those ingesters is restarted, no data is lost. However, two or more ingesters restarting can result in data loss and also impacts the system’s ability to ingest logs because of an unhealthy ring status.

With zone awareness enabled, an incoming log line is replicated to one ingester in each zone. This means that the only remaining concern is ingesters in multiple zones restarting at the same time; we can now roll out, or lose, an entire zone at once without impacting writes. This allows deployments with a large number of ingesters to be rolled out much more quickly.
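The following sketch illustrates why losing a whole zone is tolerable. It assumes a replication factor of 3 and that a write needs a majority (2) of its replicas to be healthy; the quorum rule and the ingester names are assumptions made for this illustration and are not stated in the text above.

```python
QUORUM = 2  # assumption: a majority of the 3 replicas must accept a write


def write_succeeds(replica_set, down):
    """Return True if enough replicas of a stream are still healthy."""
    healthy = [ingester for ingester in replica_set if ingester not in down]
    return len(healthy) >= QUORUM


# Hypothetical ingester names, two per zone.
zones = {
    "zone-a": ["ingester-zone-a-0", "ingester-zone-a-1"],
    "zone-b": ["ingester-zone-b-0", "ingester-zone-b-1"],
    "zone-c": ["ingester-zone-c-0", "ingester-zone-c-1"],
}

# With zone awareness, a stream has exactly one replica in each zone.
replica_set = [zones["zone-a"][0], zones["zone-b"][1], zones["zone-c"][0]]

# An entire zone restarting still leaves two healthy replicas.
print(write_succeeds(replica_set, down=set(zones["zone-a"])))  # True

# Ingesters in two different zones restarting at once is the risky case.
print(write_succeeds(replica_set, down={"ingester-zone-a-0", "ingester-zone-b-1"}))  # False
```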

At Grafana Labs, we also make use of rollout-operator to manage rollouts to the 3 StatefulSets gracefully. The rollout-operator looks for labels on StatefulSets to know which ones are part of a certain rollout group, and coordinates rollouts of pods from only a single StatefulSet in the group at a time. See the README in the rollout-operator repo for a more in-depth explanation.

Migration

This section describes migrating from a single ingester StatefulSet to 3 zone-aware ingester StatefulSets. The migration follows a few general steps, regardless of deployment method.

Configure your existing ingesters to be part of a zone, for example zone-default. This allows us to later exclude them from the write path while still allowing for graceful shutdowns.
Prepare for the increase in active streams (due to the way streams are split between ingesters) by increasing the number of active streams allowed for your tenants.
Add and scale up your new zone-aware ingester StatefulSets such that each has 1/3rd of the total number of replicas you want to run.
Enable zone awareness on the write path by setting distributor.zone-awareness-enabled to true for distributors and rulers.
Wait some time to ensure that the new zone-aware ingesters have data for the time period they are queried for (query_ingesters_within).
Enable zone awareness on the read path by setting distributor.zone-awareness-enabled to true for queriers.
Configure distributors and rulers to exclude ingesters in the zone-default so those ingesters no longer receive write traffic using distributor.excluded-zones.
Use the shutdown endpoint to flush data from the default ingesters, then scale down and remove the associated StatefulSet.
Clean up any config remaining from the migration.
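The sizing in the scale-up step can be sketched as follows. The helper is hypothetical; requiring the total to divide evenly across the zones and requiring a max-unavailable value of at least 1 are assumptions of this sketch, not documented behavior.

```python
def plan_zone_replicas(total_replicas: int, max_unavailable: int, zones: int = 3) -> dict:
    """Split the total ingester replicas evenly across the zone StatefulSets."""
    if total_replicas % zones != 0:
        raise ValueError("this sketch expects the total to divide evenly across zones")
    per_zone = total_replicas // zones
    # The max-unavailable value should not exceed the replicas in one StatefulSet.
    if not 0 < max_unavailable <= per_zone:
        raise ValueError("max unavailable must be between 1 and the replicas per zone")
    return {"replicas_per_zone": per_zone, "rollout_max_unavailable": max_unavailable}


print(plan_zone_replicas(total_replicas=6, max_unavailable=2))
```

With 6 total replicas, each of the three StatefulSets runs 2 replicas, so 2 is the largest max-unavailable value this sketch accepts.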

Detailed Migration Steps

The following are steps to live migrate (no downtime) an existing Loki deployment from a single ingester StatefulSet to 3 zone aware ingester StatefulSets.

These instructions assume you are using the zone-aware ingester jsonnet deployment code from this repo. If you are not using jsonnet, see the annotations in some steps that describe how to perform that step manually.

Configure the zone for the existing ingester StatefulSet as zone-default by setting multi_zone_default_ingester_zone: true. This allows us to later filter out that zone from the write path.
Configure ingester-pdb with maxUnavailable as 0 and deploy 3x zone-aware StatefulSets with 0 replicas by setting:
```jsonnet
_config+:: {
    multi_zone_ingester_enabled: true,
    multi_zone_ingester_migration_enabled: true,
    multi_zone_ingester_replicas: 0,
    // These last two lines are necessary now that we enable zone aware ingester by default
    // so that newly created cells will not be migrated later on. If you miss them you will
    // break writes in the cell.
    multi_zone_ingester_replication_write_path_enabled: false,
    multi_zone_ingester_replication_read_path_enabled: false,
},
```
If you’re not using jsonnet, the new ingester StatefulSets should have the label rollout-group: ingester and the annotation rollout-max-unavailable: x, and their update strategy should be set to OnDelete. Use a placeholder value for x for now; later, set it to some portion of the StatefulSet’s total replicas. For example, in jsonnet we template this so that each StatefulSet runs 1/3 of the total replicas and the max unavailable is 1/3 of each StatefulSet’s replicas.
Diff the ingester and ingester-zone-a StatefulSets and make sure all config matches:
```bash
kubectl get statefulset -n loki-dev-008 ingester -o yaml > ingester.yaml
kubectl get statefulset -n loki-dev-008 ingester-zone-a -o yaml > ingester-zone-a.yaml
diff ingester.yaml ingester-zone-a.yaml
```
Expected differences include the creation time and revision number, the zone, fields used by the rollout-operator, the number of replicas, anything related to kustomize/flux, and the PVC for the WAL, since the containers don’t exist yet.

Temporarily double max series limits for users that are using more than 50% of their current limit. The queries are as follows (add label selectors as appropriate):

sum by (tenant)(sum (loki_ingester_memory_streams) by (cluster, namespace, tenant) / on (namespace) group_left max by(namespace) (loki_distributor_replication_factor))
>
on (tenant) (
max by (tenant) (label_replace(loki_overrides{limit_name="max_global_streams_per_user"} / 2.5, "tenant", "$1", "user", "(.+)"))
)

((sum (loki_ingester_memory_streams) by (cluster, namespace, tenant) / on (namespace) group_left max by(namespace) (loki_distributor_replication_factor)
) / ignoring(tenant) group_left max by (cluster, namespace)(loki_overrides_defaults{limit_name="max_global_streams_per_user"}) > 0.4)
unless on (tenant) (
(label_replace(loki_overrides{limit_name="max_global_streams_per_user"},"tenant", "$1", "user", "(.+)")))

Scale up zone-aware StatefulSets until they have 1/3rd of replicas each. In smaller cells you can do this all at once; in larger cells it is safer to do it in chunks. The config value you need to change is multi_zone_ingester_replicas, for example multi_zone_ingester_replicas: 6. The value is split across the three StatefulSets, so in this case each StatefulSet would run 2 replicas.

If you’re not using jsonnet, this is the step where you would also set the annotation rollout-max-unavailable to some value that is less than or equal to the number of replicas each StatefulSet is running.
Enable zone awareness on the write path by setting multi_zone_ingester_replication_write_path_enabled: true. This causes distributors and rulers to reshuffle series to ingesters in each zone. Be sure to check that all the distributors and rulers have restarted properly.

If you’re not using jsonnet, enable zone awareness on the write path by setting distributor.zone-awareness-enabled to true for distributors and rulers.
Wait for the number of hours configured in query_ingesters_within. The default is 3h. This ensures that no data will be missing if we query a new ingester. However, because we cut chunks at least every 30m due to chunk_idle_period, we can likely reduce this amount of time.
Check that rule evaluations are still correct during the migration. Look for increases in the rate for metrics whose names have the following suffixes:
```
rule_evaluations_total
rule_evaluation_failures_total
rule_group_iterations_missed_total
```
Enable zone-aware replication on the read path by setting multi_zone_ingester_replication_read_path_enabled: true. If you’re not using jsonnet, set distributor.zone-awareness-enabled to true for queriers.
Check that queries are still executing correctly. For example, look at loki_logql_querystats_latency_seconds_count to see that you don’t have a big increase in latency or error count for a specific query type.
Configure distributors and rulers to exclude ingesters in the zone-default so those ingesters no longer receive write traffic by setting multi_zone_ingester_exclude_default: true. If you’re not using jsonnet, set distributor.excluded-zones on distributors and rulers.

It is a good idea to check rule evaluations again at this point, and also that the zone-aware ingester StatefulSets are now receiving all the write traffic. You can compare sum(loki_ingester_memory_streams{cluster="<cluster>",job=~"(<namespace>)/ingester"}) to sum(loki_ingester_memory_streams{cluster="<cluster>",job=~"(<namespace>)/ingester-zone.*"}).
If you’re using an automated reconciliation or deployment system like flux, disable it now (for example using flux ignore), if possible for just the default ingester StatefulSet.
Shut down and flush the default ingesters, unregistering them from the ring. You can do this by port-forwarding each ingester Pod and using the endpoint: "http://url:PORT/ingester/shutdown?flush=true&delete_ring_tokens=true&terminate=false"
Manually scale down the default ingester StatefulSet to 0 replicas. We do this via tk apply, but you could do it by modifying the YAML.
Merge a PR to your central config repo to keep the StatefulSet at 0 replicas, and then remove the flux ignore.
Clean up any remaining temporary config from the migration, for example multi_zone_ingester_migration_enabled: true is no longer needed.
Ensure that all the old default ingester PVCs/PVs are removed.
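For the shutdown step, the endpoint URL for each port-forwarded ingester can be assembled as in this sketch. The host, the local ports, and the number of ingesters are placeholders chosen for illustration; only the path and query parameters come from the step above. Sending the request is deliberately left out.

```python
from urllib.parse import urlencode


def shutdown_url(host: str, port: int) -> str:
    """Build the flush-and-unregister URL for one port-forwarded ingester."""
    params = urlencode(
        {"flush": "true", "delete_ring_tokens": "true", "terminate": "false"}
    )
    return f"http://{host}:{port}/ingester/shutdown?{params}"


# Hypothetical setup: three default ingesters forwarded to local ports 3100-3102.
for index in range(3):
    print(shutdown_url("localhost", 3100 + index))
```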
