<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Operations on Grafana Labs</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/</link><description>Recent content in Operations on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/enterprise-logs/v1.9.x/loki/operations/index.xml" rel="self" type="application/rss+xml"/><item><title>Observability</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/observability/</link><pubDate>Tue, 16 Jul 2024 15:42:20 +0000</pubDate><guid>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/observability/</guid><content><![CDATA[&lt;h1 id=&#34;observing-grafana-loki&#34;&gt;Observing Grafana Loki&lt;/h1&gt;
&lt;p&gt;Both Grafana Loki and Promtail expose a &lt;code&gt;/metrics&lt;/code&gt; endpoint that serves Prometheus
metrics. To collect them, you need a Prometheus instance with Loki and Promtail added as scrape targets.
See &lt;a href=&#34;https://prometheus.io/docs/prometheus/latest/configuration/configuration&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;configuring
Prometheus&lt;/a&gt;
for more information.&lt;/p&gt;
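&lt;p&gt;As a minimal sketch, a Prometheus scrape configuration for both targets might look like the following. The hostnames are hypothetical, and the ports assume that Loki serves HTTP on 3100 and Promtail on 9080; substitute the addresses from your own deployment.&lt;/p&gt;
&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;scrape_configs:
  # metrics_path defaults to /metrics, so it is not set explicitly.
  - job_name: loki
    static_configs:
      - targets:
          # Hypothetical host; port 3100 is an assumption.
          - &amp;#34;loki.example.internal:3100&amp;#34;
  - job_name: promtail
    static_configs:
      - targets:
          # Hypothetical host; port 9080 is an assumption.
          - &amp;#34;promtail.example.internal:9080&amp;#34;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;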
&lt;p&gt;All components of Loki expose the following metrics:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Metric Name&lt;/th&gt;
              &lt;th&gt;Metric Type&lt;/th&gt;
              &lt;th&gt;Description&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_log_messages_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Total number of messages logged by Loki.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_request_duration_seconds&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
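              &lt;td&gt;Time, in seconds, spent serving HTTP requests. The &lt;code&gt;_count&lt;/code&gt; series gives the number of received HTTP requests.&lt;/td&gt;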
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;The Loki Distributors expose the following metrics:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Metric Name&lt;/th&gt;
              &lt;th&gt;Metric Type&lt;/th&gt;
              &lt;th&gt;Description&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_distributor_ingester_appends_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of batch appends sent to ingesters.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_distributor_ingester_append_failures_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of failed batch appends sent to ingesters.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_distributor_bytes_received_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of uncompressed bytes received per tenant and retention hours.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_distributor_lines_received_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of log &lt;em&gt;entries&lt;/em&gt; received per tenant (not necessarily of &lt;em&gt;lines&lt;/em&gt;, as an entry can have more than one line of text).&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;The Loki Ingesters expose the following metrics:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Metric Name&lt;/th&gt;
              &lt;th&gt;Metric Type&lt;/th&gt;
              &lt;th&gt;Description&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;cortex_ingester_flush_queue_length&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;The total number of series pending in the flush queue.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_chunk_store_index_entries_per_chunk&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Number of index entries written to storage per chunk.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_memory_chunks&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;The total number of chunks in memory.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_memory_streams&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;The total number of streams in memory.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_age_seconds&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Distribution of chunk ages when flushed.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_encode_time_seconds&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Distribution of chunk encode times.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_entries&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Distribution of lines per chunk when flushed.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_size_bytes&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Distribution of chunk sizes when flushed.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_utilization&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Distribution of chunk utilization (filled uncompressed bytes vs maximum uncompressed bytes) when flushed.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_compression_ratio&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Distribution of chunk compression ratio when flushed.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunk_stored_bytes_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Total bytes stored in chunks per tenant.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunks_created_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of chunks created in the ingester.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_chunks_stored_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Total stored chunks per tenant.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_received_chunks&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of chunks received by this ingester whilst joining during the handoff process.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_samples_per_chunk&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;The number of samples in a chunk.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_sent_chunks&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of chunks sent by this ingester whilst leaving during the handoff process.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_streams_created_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of streams created per tenant.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;loki_ingester_streams_removed_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;The total number of streams removed per tenant.&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;Promtail exposes these metrics:&lt;/p&gt;
&lt;section class=&#34;expand-table-wrapper&#34;&gt;&lt;div class=&#34;responsive-table-wrapper&#34;&gt;
    &lt;table&gt;
      &lt;thead&gt;
          &lt;tr&gt;
              &lt;th&gt;Metric Name&lt;/th&gt;
              &lt;th&gt;Metric Type&lt;/th&gt;
              &lt;th&gt;Description&lt;/th&gt;
          &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_read_bytes_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;Number of bytes read.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_read_lines_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of lines read.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_dropped_bytes_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of bytes dropped because they failed to be sent to the ingester after all retries.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_dropped_entries_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of log entries dropped because they failed to be sent to the ingester after all retries.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_encoded_bytes_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of bytes encoded and ready to send.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_file_bytes_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;Number of bytes read from files.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_files_active_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;Number of active files.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_request_duration_seconds_count&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Histogram&lt;/td&gt;
              &lt;td&gt;Number of send requests.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_sent_bytes_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of bytes sent.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_sent_entries_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of log entries sent to the ingester.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_targets_active_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Gauge&lt;/td&gt;
              &lt;td&gt;Number of total active targets.&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
              &lt;td&gt;&lt;code&gt;promtail_targets_failed_total&lt;/code&gt;&lt;/td&gt;
              &lt;td&gt;Counter&lt;/td&gt;
              &lt;td&gt;Number of total failed targets.&lt;/td&gt;
          &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;/div&gt;
&lt;/section&gt;&lt;p&gt;Most of these metrics are counters and should continuously increase during normal operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Your app emits a log line to a file that is tracked by Promtail.&lt;/li&gt;
&lt;li&gt;Promtail reads the new line and increases its counters.&lt;/li&gt;
&lt;li&gt;Promtail forwards the log line to a Loki distributor, where the received
counters should increase.&lt;/li&gt;
&lt;li&gt;The Loki distributor forwards the log line to a Loki ingester, where the
request duration counter should increase.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If Promtail uses any pipelines with metrics stages, those metrics will also be
exposed by Promtail at its &lt;code&gt;/metrics&lt;/code&gt; endpoint. See Promtail&amp;rsquo;s documentation on
&lt;a href=&#34;../../clients/promtail/pipelines/&#34;&gt;Pipelines&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;An example Grafana dashboard was built by the community and is available as
dashboard &lt;a href=&#34;/dashboards/10004&#34;&gt;10004&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;metrics-cardinality&#34;&gt;Metrics cardinality&lt;/h2&gt;
&lt;p&gt;Some of the Loki observability metrics are emitted per tracked (active) file, with the file path included in labels.
This increases the quantity of label values across the environment, thereby increasing cardinality. Best practices with Prometheus &lt;a href=&#34;https://prometheus.io/docs/practices/naming/#labels&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;labels&lt;/a&gt; discourage increasing cardinality in this way.
Review your emitted metrics before scraping with Prometheus, and configure the scraping to avoid this issue.&lt;/p&gt;
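&lt;p&gt;One possible way to configure scraping for this is to drop the per-file series at scrape time with &lt;code&gt;metric_relabel_configs&lt;/code&gt;. The sketch below assumes that the per-file series are &lt;code&gt;promtail_file_bytes_total&lt;/code&gt; and &lt;code&gt;promtail_read_bytes_total&lt;/code&gt;, and it uses a hypothetical target address; check the output of your own &lt;code&gt;/metrics&lt;/code&gt; endpoint before adopting it.&lt;/p&gt;
&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;scrape_configs:
  - job_name: promtail
    static_configs:
      - targets:
          # Hypothetical host and port.
          - &amp;#34;promtail.example.internal:9080&amp;#34;
    metric_relabel_configs:
      # Drop the series assumed to carry a per-file label.
      # Prometheus anchors the regex, so it must match the whole metric name.
      - source_labels: [__name__]
        regex: promtail_(file_bytes_total|read_bytes_total)
        action: drop&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Dropping whole series avoids the collisions that can occur when only the high-cardinality label is removed and several series end up with identical label sets.&lt;/p&gt;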
&lt;h2 id=&#34;mixins&#34;&gt;Mixins&lt;/h2&gt;
&lt;p&gt;The Loki repository has a &lt;a href=&#34;https://github.com/grafana/loki/blob/master/production/loki-mixin&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mixin&lt;/a&gt; that includes a
set of dashboards, recording rules, and alerts. Together, these give you a
comprehensive package for monitoring Loki in production.&lt;/p&gt;
&lt;p&gt;For more information about mixins, take a look at the docs for the
&lt;a href=&#34;https://github.com/monitoring-mixins/docs&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;monitoring-mixins project&lt;/a&gt;.&lt;/p&gt;
]]></content><description>&lt;h1 id="observing-grafana-loki">Observing Grafana Loki&lt;/h1>
&lt;p>Both Grafana Loki and Promtail expose a &lt;code>/metrics&lt;/code> endpoint that serves Prometheus
metrics. To collect them, you need a Prometheus instance with Loki and Promtail added as scrape targets.
See &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration" target="_blank" rel="noopener noreferrer">configuring
Prometheus&lt;/a>
for more information.&lt;/p></description></item><item><title>Overrides Exporter</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/overrides-exporter/</link><pubDate>Mon, 14 Apr 2025 21:05:47 +0000</pubDate><guid>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/overrides-exporter/</guid><content><![CDATA[&lt;p&gt;Loki is a multi-tenant system that supports applying limits to each tenant as a mechanism for resource management. The &lt;code&gt;overrides-exporter&lt;/code&gt; module exposes these limits as Prometheus metrics in order to help operators better understand tenant behavior.&lt;/p&gt;
&lt;h2 id=&#34;context&#34;&gt;Context&lt;/h2&gt;
&lt;p&gt;Configuration updates to tenant limits can be applied to Loki without restart via the &lt;a href=&#34;/docs/enterprise-logs/v1.9.x/loki/configuration/#runtime-configuration-file&#34;&gt;&lt;code&gt;runtime_config&lt;/code&gt; feature&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;overrides-exporter&lt;/code&gt; module is disabled by default. We recommend running a single instance per cluster to avoid issues with metric cardinality. For every scalar field in the limits configuration, the &lt;code&gt;overrides-exporter&lt;/code&gt; exposes one series under the metric &lt;code&gt;loki_overrides_defaults&lt;/code&gt;, set to the default value for that field after loading the Loki configuration. It also exposes one series under the metric &lt;code&gt;loki_overrides&lt;/code&gt; for &lt;em&gt;every&lt;/em&gt; differing field for &lt;em&gt;every&lt;/em&gt; tenant.&lt;/p&gt;
&lt;p&gt;Using an example &lt;code&gt;runtime.yaml&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;overrides:
  &amp;#34;tenant_1&amp;#34;:
    ingestion_rate_mb: 10
    max_streams_per_user: 100000
    max_chunks_per_query: 100000&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Launch an instance of the &lt;code&gt;overrides-exporter&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;loki -target=overrides-exporter -runtime-config.file=runtime.yaml -config.file=basic_schema_config.yaml -server.http-listen-port=8080&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To inspect the tenant limit overrides:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;$ curl -sq localhost:8080/metrics | grep override
# HELP loki_overrides Resource limit overrides applied to tenants
# TYPE loki_overrides gauge
loki_overrides{limit_name=&amp;#34;ingestion_rate_mb&amp;#34;,user=&amp;#34;tenant_1&amp;#34;} 10
loki_overrides{limit_name=&amp;#34;max_chunks_per_query&amp;#34;,user=&amp;#34;tenant_1&amp;#34;} 100000
loki_overrides{limit_name=&amp;#34;max_streams_per_user&amp;#34;,user=&amp;#34;tenant_1&amp;#34;} 100000
# HELP loki_overrides_defaults Default values for resource limit overrides applied to tenants
# TYPE loki_overrides_defaults gauge
loki_overrides_defaults{limit_name=&amp;#34;cardinality_limit&amp;#34;} 100000
loki_overrides_defaults{limit_name=&amp;#34;creation_grace_period&amp;#34;} 6e&amp;#43;11
loki_overrides_defaults{limit_name=&amp;#34;ingestion_burst_size_mb&amp;#34;} 6
loki_overrides_defaults{limit_name=&amp;#34;ingestion_rate_mb&amp;#34;} 4
loki_overrides_defaults{limit_name=&amp;#34;max_cache_freshness_per_query&amp;#34;} 6e&amp;#43;10
loki_overrides_defaults{limit_name=&amp;#34;max_chunks_per_query&amp;#34;} 2e&amp;#43;06
loki_overrides_defaults{limit_name=&amp;#34;max_concurrent_tail_requests&amp;#34;} 10
loki_overrides_defaults{limit_name=&amp;#34;max_entries_limit_per_query&amp;#34;} 5000
loki_overrides_defaults{limit_name=&amp;#34;max_global_streams_per_user&amp;#34;} 5000
loki_overrides_defaults{limit_name=&amp;#34;max_label_name_length&amp;#34;} 1024
loki_overrides_defaults{limit_name=&amp;#34;max_label_names_per_series&amp;#34;} 30
loki_overrides_defaults{limit_name=&amp;#34;max_label_value_length&amp;#34;} 2048
loki_overrides_defaults{limit_name=&amp;#34;max_line_size&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;max_queriers_per_tenant&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;max_query_length&amp;#34;} 2.5956e&amp;#43;15
loki_overrides_defaults{limit_name=&amp;#34;max_query_lookback&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;max_query_parallelism&amp;#34;} 32
loki_overrides_defaults{limit_name=&amp;#34;max_query_series&amp;#34;} 500
loki_overrides_defaults{limit_name=&amp;#34;max_streams_matchers_per_query&amp;#34;} 1000
loki_overrides_defaults{limit_name=&amp;#34;max_streams_per_user&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;min_sharding_lookback&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;per_stream_rate_limit&amp;#34;} 3.145728e&amp;#43;06
loki_overrides_defaults{limit_name=&amp;#34;per_stream_rate_limit_burst&amp;#34;} 1.572864e&amp;#43;07
loki_overrides_defaults{limit_name=&amp;#34;per_tenant_override_period&amp;#34;} 1e&amp;#43;10
loki_overrides_defaults{limit_name=&amp;#34;reject_old_samples_max_age&amp;#34;} 1.2096e&amp;#43;15
loki_overrides_defaults{limit_name=&amp;#34;retention_period&amp;#34;} 2.6784e&amp;#43;15
loki_overrides_defaults{limit_name=&amp;#34;ruler_evaluation_delay_duration&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_max_rule_groups_per_tenant&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_max_rules_per_rule_group&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_batch_send_deadline&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_capacity&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_max_backoff&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_max_samples_per_send&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_max_shards&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_min_backoff&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_queue_min_shards&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;ruler_remote_write_timeout&amp;#34;} 0
loki_overrides_defaults{limit_name=&amp;#34;split_queries_by_interval&amp;#34;} 0&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Alerts can be created based on these metrics to inform operators when tenants are close to hitting their limits, so that increases can be applied before the limits are exceeded.&lt;/p&gt;
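&lt;p&gt;As an illustration only, a Prometheus alerting rule along these lines could compare the ingestion rate of each tenant with its override. The rule and group names are hypothetical. The sketch also assumes that &lt;code&gt;loki_distributor_bytes_received_total&lt;/code&gt; carries a &lt;code&gt;tenant&lt;/code&gt; label and that &lt;code&gt;ingestion_rate_mb&lt;/code&gt; can be compared in units of 10^6 bytes per second; verify both against your own metrics.&lt;/p&gt;
&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;groups:
  - name: loki-tenant-limits  # hypothetical group name
    rules:
      - alert: LokiTenantNearIngestionRateLimit  # hypothetical alert name
        # Left side: received bytes per second per tenant, scaled to MB.
        # Right side: 80% of the override, with the user label copied to
        # a tenant label so that both sides can be matched.
        expr: |
          sum by (tenant) (rate(loki_distributor_bytes_received_total[5m])) / 1e6
            &amp;gt; on (tenant)
          0.8 * label_replace(loki_overrides{limit_name=&amp;#34;ingestion_rate_mb&amp;#34;}, &amp;#34;tenant&amp;#34;, &amp;#34;$1&amp;#34;, &amp;#34;user&amp;#34;, &amp;#34;(.+)&amp;#34;)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Tenant is above 80% of its ingestion rate override.&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;This rule only covers tenants that have an explicit override, because &lt;code&gt;loki_overrides&lt;/code&gt; has no series for tenants that use the defaults. It also relies on a single &lt;code&gt;overrides-exporter&lt;/code&gt; instance, so that each tenant matches exactly one series.&lt;/p&gt;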
]]></content><description>&lt;p>Loki is a multi-tenant system that supports applying limits to each tenant as a mechanism for resource management. The &lt;code>overrides-exporter&lt;/code> module exposes these limits as Prometheus metrics in order to help operators better understand tenant behavior.&lt;/p></description></item><item><title>Storage</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/storage/</link><pubDate>Mon, 14 Apr 2025 21:05:47 +0000</pubDate><guid>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/storage/</guid><content><![CDATA[&lt;h1 id=&#34;grafana-loki-storage&#34;&gt;Grafana Loki Storage&lt;/h1&gt;
&lt;p&gt;For a high-level introduction, see the &lt;a href=&#34;/docs/enterprise-logs/v1.9.x/loki/storage/&#34;&gt;storage overview&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Grafana Loki needs to store two different types of data: &lt;strong&gt;chunks&lt;/strong&gt; and &lt;strong&gt;indexes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Loki receives logs in separate streams, where each stream is uniquely identified
by its tenant ID and its set of labels. As log entries from a stream arrive,
they are compressed as &amp;ldquo;chunks&amp;rdquo; and saved in the chunks store. See &lt;a href=&#34;#chunk-format&#34;&gt;chunk
format&lt;/a&gt; for how chunks are stored internally.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;index&lt;/strong&gt; stores each stream&amp;rsquo;s label set and links it to the individual
chunks.&lt;/p&gt;
&lt;p&gt;Refer to Loki&amp;rsquo;s &lt;a href=&#34;../../configuration/&#34;&gt;configuration&lt;/a&gt; for details on
how to configure the storage and the index.&lt;/p&gt;
&lt;p&gt;For more information:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;table-manager/&#34;&gt;Table Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;retention/&#34;&gt;Retention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;logs-deletion/&#34;&gt;Logs Deletion&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;supported-stores&#34;&gt;Supported Stores&lt;/h2&gt;
&lt;p&gt;The following are supported for the index:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;boltdb-shipper/&#34;&gt;Single Store (boltdb-shipper)&lt;/a&gt;: recommended for Loki 2.0 and newer; an index store that keeps BoltDB index files in the object store&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://aws.amazon.com/dynamodb&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Amazon DynamoDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloud.google.com/bigtable&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Google Bigtable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cassandra.apache.org&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/boltdb/bolt&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;BoltDB&lt;/a&gt; (doesn&amp;rsquo;t work when clustering Loki)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following are supported for the chunks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://aws.amazon.com/dynamodb&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Amazon DynamoDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloud.google.com/bigtable&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Google Bigtable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cassandra.apache.org&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Apache Cassandra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://aws.amazon.com/s3&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Amazon S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloud.google.com/storage/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Google Cloud Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;filesystem/&#34;&gt;Filesystem&lt;/a&gt; (read more about the filesystem to understand the pros/cons before using with production data)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloud.baidu.com/product/bos.html&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Baidu Object Storage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cloud-storage-permissions&#34;&gt;Cloud Storage Permissions&lt;/h2&gt;
&lt;h3 id=&#34;s3&#34;&gt;S3&lt;/h3&gt;
&lt;p&gt;When using S3 as object storage, the following permissions are needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;s3:ListBucket&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;s3:PutObject&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;s3:GetObject&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;s3:DeleteObject&lt;/code&gt; (if running the Single Store (boltdb-shipper) compactor)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Resources: &lt;code&gt;arn:aws:s3:::&amp;lt;bucket_name&amp;gt;&lt;/code&gt;, &lt;code&gt;arn:aws:s3:::&amp;lt;bucket_name&amp;gt;/*&lt;/code&gt;&lt;/p&gt;
&lt;h3 id=&#34;dynamodb&#34;&gt;DynamoDB&lt;/h3&gt;
&lt;p&gt;When using DynamoDB for the index, the following permissions are needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dynamodb:BatchGetItem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:BatchWriteItem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:DeleteItem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:DescribeTable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:GetItem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:ListTagsOfResource&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:PutItem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:Query&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:TagResource&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:UntagResource&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:UpdateItem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:UpdateTable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:CreateTable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dynamodb:DeleteTable&lt;/code&gt; (if &lt;code&gt;table_manager.retention_period&lt;/code&gt; is more than 0s)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Resources: &lt;code&gt;arn:aws:dynamodb:&amp;lt;aws_region&amp;gt;:&amp;lt;aws_account_id&amp;gt;:table/&amp;lt;prefix&amp;gt;*&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dynamodb:ListTables&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Resources: &lt;code&gt;*&lt;/code&gt;&lt;/p&gt;
&lt;h4 id=&#34;autoscaling&#34;&gt;AutoScaling&lt;/h4&gt;
&lt;p&gt;If you enable autoscaling from the table manager, the following permissions are needed:&lt;/p&gt;
&lt;h5 id=&#34;application-autoscaling&#34;&gt;Application Autoscaling&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;application-autoscaling:DescribeScalableTargets&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application-autoscaling:DescribeScalingPolicies&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application-autoscaling:RegisterScalableTarget&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application-autoscaling:DeregisterScalableTarget&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application-autoscaling:PutScalingPolicy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application-autoscaling:DeleteScalingPolicy&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Resources: &lt;code&gt;*&lt;/code&gt;&lt;/p&gt;
&lt;h5 id=&#34;iam&#34;&gt;IAM&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;iam:GetRole&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iam:PassRole&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Resources: &lt;code&gt;arn:aws:iam::&amp;lt;aws_account_id&amp;gt;:role/&amp;lt;role_name&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;h2 id=&#34;chunk-format&#34;&gt;Chunk Format&lt;/h2&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;  -------------------------------------------------------------------
  |                               |                                 |
  |        MagicNumber(4b)        |           version(1b)           |
  |                               |                                 |
  -------------------------------------------------------------------
  |         block-1 bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |         block-2 bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |         block-n bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |                        #blocks (uvarint)                        |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  |                      checksum(from #blocks)                     |
  -------------------------------------------------------------------
  |           metasOffset - offset to the point with #blocks        |
  -------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
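&lt;p&gt;Several fields in the layout above are labeled uvarint. The sketch below decodes one such field, assuming that uvarint means the unsigned base-128 variable-length integer encoding implemented by the Go encoding/binary package. That is an assumption; this page does not define the encoding. The function name &lt;code&gt;decode_uvarint&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def decode_uvarint(buf, pos=0):
    """Decode one unsigned base-128 varint from buf, starting at pos.

    Returns (value, next_pos). Each byte carries 7 payload bits, least
    significant group first; a set high bit means more bytes follow.
    Assumption: this matches the uvarint fields in the chunk layout.
    """
    value = 0
    shift = 0
    while True:
        if pos == len(buf):
            raise ValueError("buffer ended inside a uvarint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7
```

&lt;p&gt;Because the function returns the next read position, consecutive fields such as #entries, offset, and len could be read by calling it repeatedly on the same buffer.&lt;/p&gt;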
]]></content><description>&lt;h1 id="grafana-loki-storage">Grafana Loki Storage&lt;/h1>
&lt;p>&lt;a href="/docs/enterprise-logs/v1.9.x/loki/storage/">High level storage overview here&lt;/a>&lt;/p>
&lt;p>Grafana Loki needs to store two different types of data: &lt;strong>chunks&lt;/strong> and &lt;strong>indexes&lt;/strong>.&lt;/p>
&lt;p>Loki receives logs in separate streams, where each stream is uniquely identified
by its tenant ID and its set of labels. As log entries from a stream arrive,
they are compressed as &amp;ldquo;chunks&amp;rdquo; and saved in the chunks store. See &lt;a href="#chunk-format">chunk
format&lt;/a> for how chunks are stored internally.&lt;/p></description></item><item><title>Loki Canary</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/loki-canary/</link><pubDate>Tue, 16 Jul 2024 15:42:20 +0000</pubDate><guid>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/loki-canary/</guid><content><![CDATA[&lt;h1 id=&#34;loki-canary&#34;&gt;Loki Canary&lt;/h1&gt;
&lt;p&gt;Loki Canary is a standalone app that audits the log-capturing performance of
a Grafana Loki cluster.&lt;/p&gt;
&lt;p&gt;Loki Canary generates artificial log lines, which are sent to the Loki cluster.
Loki Canary then communicates with the Loki cluster to capture metrics about those
log lines, and from these metrics it derives information about the performance of the
Loki cluster.
The information is available as Prometheus time series metrics.&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../loki-canary-block.png&#34;
  alt=&#34;block_diagram&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Loki Canary writes a log to a file and stores the timestamp in an internal
array. The contents look something like this:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;nohighlight&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-nohighlight&#34;&gt;1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The relevant part of the log entry is the timestamp; the &lt;code&gt;p&lt;/code&gt;s are just filler
bytes to make the size of the log configurable.&lt;/p&gt;
&lt;p&gt;An agent (like Promtail) should be configured to read the log file and ship it
to Loki.&lt;/p&gt;
&lt;p&gt;Meanwhile, Loki Canary will open a WebSocket connection to Loki and will tail
the logs it creates. When a log is received on the WebSocket, the timestamp
in the log message is compared to the internal array.&lt;/p&gt;
&lt;p&gt;If the received log is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The next in the array to be received, it is removed from the array and the
(current time - log timestamp) is recorded in the &lt;code&gt;response_latency&lt;/code&gt;
histogram. This is the expected behavior for well-behaved logs.&lt;/li&gt;
&lt;li&gt;Not the next in the array to be received, it is removed from the array, the
response time is recorded in the &lt;code&gt;response_latency&lt;/code&gt; histogram, and the
&lt;code&gt;out_of_order_entries&lt;/code&gt; counter is incremented.&lt;/li&gt;
&lt;li&gt;Not in the array at all, it is checked against a separate list of received
logs to either increment the &lt;code&gt;duplicate_entries&lt;/code&gt; counter or the
&lt;code&gt;unexpected_entries&lt;/code&gt; counter.&lt;/li&gt;
&lt;/ul&gt;
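&lt;p&gt;The three cases above can be sketched as follows. This is a simplified illustration, not the canary source code: the function name, the data structures, and the &lt;code&gt;response_latency_observations&lt;/code&gt; key (a stand-in for observing the &lt;code&gt;response_latency&lt;/code&gt; histogram) are all hypothetical.&lt;/p&gt;

```python
from collections import Counter

def classify(entry_ts, pending, seen, counters):
    """Classify one timestamp received over the WebSocket.

    pending:  sent timestamps not yet received, oldest first (the internal array)
    seen:     set of timestamps already received (the separate list of received logs)
    counters: Counter keyed by the counter names used in the text
    """
    if entry_ts in pending:
        if pending[0] != entry_ts:
            # Present, but not the next one expected.
            counters["out_of_order_entries"] += 1
        pending.remove(entry_ts)
        # Stand-in for recording (current time - log timestamp) in the histogram.
        counters["response_latency_observations"] += 1
        seen.add(entry_ts)
        return
    if entry_ts in seen:
        counters["duplicate_entries"] += 1
    else:
        counters["unexpected_entries"] += 1
```

&lt;p&gt;Calling &lt;code&gt;classify&lt;/code&gt; once per received entry with a shared &lt;code&gt;Counter&lt;/code&gt; mirrors how the counters accumulate over the lifetime of the canary.&lt;/p&gt;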
&lt;p&gt;In the background, Loki Canary also runs a timer which iterates through all of
the entries in the internal array. If any of the entries are older than the
duration specified by the &lt;code&gt;-wait&lt;/code&gt; flag (defaulting to 60s), they are removed
from the array and the &lt;code&gt;websocket_missing_entries&lt;/code&gt; counter is incremented. An
additional query is then made directly to Loki for any missing entries to
determine if they are truly missing or only missing from the WebSocket. If
missing entries are not found in the direct query, the &lt;code&gt;missing_entries&lt;/code&gt; counter
is incremented.&lt;/p&gt;
&lt;h3 id=&#34;additional-queries&#34;&gt;Additional Queries&lt;/h3&gt;
&lt;h4 id=&#34;spot-check&#34;&gt;Spot Check&lt;/h4&gt;
&lt;p&gt;Starting with version 1.6.0, the canary spot checks certain results over time
to make sure they are present in Loki. This is helpful for testing the transition
of in-memory logs in the ingester to the store, to make sure nothing is lost.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;-spot-check-interval&lt;/code&gt; and &lt;code&gt;-spot-check-max&lt;/code&gt; are used to tune this feature.
At every &lt;code&gt;-spot-check-interval&lt;/code&gt;, the canary pulls a log entry from the stream
and saves it in a separate list; entries older than &lt;code&gt;-spot-check-max&lt;/code&gt; are dropped from the list.&lt;/p&gt;
&lt;p&gt;Every &lt;code&gt;-spot-check-query-rate&lt;/code&gt;, Loki is queried for each entry in this list and
&lt;code&gt;loki_canary_spot_check_entries_total&lt;/code&gt; is incremented. If a result
is missing, &lt;code&gt;loki_canary_spot_check_missing_entries_total&lt;/code&gt; is incremented.&lt;/p&gt;
&lt;p&gt;The defaults of &lt;code&gt;15m&lt;/code&gt; for &lt;code&gt;spot-check-interval&lt;/code&gt; and &lt;code&gt;4h&lt;/code&gt; for &lt;code&gt;spot-check-max&lt;/code&gt;
mean that after 4 hours of running, the canary will have a list of 16 entries
that it queries every minute (the default &lt;code&gt;spot-check-query-rate&lt;/code&gt; interval is 1m).
Be aware of the query load this can put on Loki if you have a lot of canaries.&lt;/p&gt;
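&lt;p&gt;To estimate that query load for other settings, the arithmetic can be sketched as below. The function name and the &lt;code&gt;canaries&lt;/code&gt; parameter are hypothetical, and the sketch assumes one query per list entry on every spot-check run, as described above.&lt;/p&gt;

```python
from datetime import timedelta

def spot_check_load(interval, max_age, query_rate, canaries=1):
    """Return (entries kept per canary, spot-check queries per hour overall)."""
    entries = max_age // interval                    # timedelta // timedelta is an int
    runs_per_hour = timedelta(hours=1) // query_rate
    return entries, entries * runs_per_hour * canaries

# Defaults quoted in the text: 15m interval, 4h max, 1m query rate.
entries, per_hour = spot_check_load(
    interval=timedelta(minutes=15),
    max_age=timedelta(hours=4),
    query_rate=timedelta(minutes=1),
)
```

&lt;p&gt;Passing a larger &lt;code&gt;canaries&lt;/code&gt; value scales the hourly total linearly, which is the effect to keep in mind when running one canary per node.&lt;/p&gt;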
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; If you are using &lt;code&gt;out-of-order-percentage&lt;/code&gt; to test ingestion of out-of-order
log lines, be sure not to set the two out-of-order time range flags too far in the past.
The defaults are already enough to test this functionality properly, and setting them
too far in the past can cause issues with the spot check test.&lt;/p&gt;
&lt;p&gt;When using &lt;code&gt;out-of-order-percentage&lt;/code&gt; you also need to make use of pipeline stages
in your Promtail configuration in order to set the timestamps correctly as the logs are pushed
to Loki. The &lt;code&gt;client/promtail/pipelines&lt;/code&gt; docs have examples of how to do this.&lt;/p&gt;
&lt;h4 id=&#34;metric-test&#34;&gt;Metric Test&lt;/h4&gt;
&lt;p&gt;Loki Canary will run a metric query &lt;code&gt;count_over_time&lt;/code&gt; to
verify that the rate of logs being stored in Loki corresponds to the rate they are being
created by Loki Canary.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;-metric-test-interval&lt;/code&gt; and &lt;code&gt;-metric-test-range&lt;/code&gt; are used to tune this feature. By default,
every &lt;code&gt;1h&lt;/code&gt; the canary runs a &lt;code&gt;count_over_time&lt;/code&gt; instant query against Loki
for a range of &lt;code&gt;24h&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If the canary has not run for &lt;code&gt;-metric-test-range&lt;/code&gt; (&lt;code&gt;24h&lt;/code&gt;) the query range is adjusted
to the amount of time the canary has been running such that the rate can be calculated
since the canary was started.&lt;/p&gt;
&lt;p&gt;The canary calculates what the expected count of logs would be for the range
(also adjusting this based on canary runtime) and compares the expected result with
the actual result returned from Loki. The &lt;em&gt;difference&lt;/em&gt; is stored as the value in
the gauge &lt;code&gt;loki_canary_metric_test_deviation&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Some deviation is expected: the method of calculating an expected count
from the rate and comparing it with the actual query data is imperfect
and will lead to a deviation of a few log entries.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not expected for there to be a deviation of more than 3-4 log entries.&lt;/p&gt;
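&lt;p&gt;The expected-count calculation described above can be sketched as follows. The function names are hypothetical, and the sign of the deviation (expected minus actual) is an assumption; the text only says that the difference is stored.&lt;/p&gt;

```python
from datetime import timedelta

def expected_count(runtime, test_range, interval):
    """Expected number of canary log lines inside the query window."""
    # The query range is truncated to the time the canary has been running.
    window = min(runtime, test_range)
    return window // interval

def deviation(expected, actual):
    """Difference between expected and returned counts (sign is an assumption)."""
    return expected - actual

# A canary writing one line per second that has been running for 2 hours.
expected = expected_count(
    runtime=timedelta(hours=2),
    test_range=timedelta(hours=24),
    interval=timedelta(seconds=1),
)
```

&lt;p&gt;Once the canary has run for longer than the test range, the window stops growing and the expected count stays constant.&lt;/p&gt;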
&lt;h3 id=&#34;control&#34;&gt;Control&lt;/h3&gt;
&lt;p&gt;Loki Canary exposes two endpoints that allow dynamic suspending and resuming of the
canary process. This can be useful if you&amp;rsquo;d like to quickly disable or re-enable the
canary. To stop or start the canary, issue an HTTP GET request against the &lt;code&gt;/suspend&lt;/code&gt; or
&lt;code&gt;/resume&lt;/code&gt; endpoint.&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation&lt;/h2&gt;
&lt;h3 id=&#34;binary&#34;&gt;Binary&lt;/h3&gt;
&lt;p&gt;Loki Canary is provided as a pre-compiled binary as part of the
&lt;a href=&#34;https://github.com/grafana/loki/releases&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Loki Releases&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;h3 id=&#34;docker&#34;&gt;Docker&lt;/h3&gt;
&lt;p&gt;Loki Canary is also provided as a Docker container image:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Bash&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-bash&#34;&gt;# change tag to the most recent release
$ docker pull grafana/loki-canary:2.0.0&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h3 id=&#34;kubernetes&#34;&gt;Kubernetes&lt;/h3&gt;
&lt;p&gt;To run on Kubernetes, you can do something simple like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;kubectl run loki-canary --generator=run-pod/v1 --image=grafana/loki-canary:latest --restart=Never --image-pull-policy=IfNotPresent --labels=name=loki-canary -- -addr=loki:3100&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Or you can do something more complex, such as deploying it as a DaemonSet. There is a
Tanka setup for this in the &lt;code&gt;production&lt;/code&gt; folder, which you can import using
&lt;code&gt;jsonnet-bundler&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;shell&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-shell&#34;&gt;jb install github.com/grafana/loki-canary/production/ksonnet/loki-canary&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then in your Tanka environment&amp;rsquo;s &lt;code&gt;main.jsonnet&lt;/code&gt; you&amp;rsquo;ll want something like
this:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;jsonnet&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-jsonnet&#34;&gt;local loki_canary = import &amp;#39;loki-canary/loki-canary.libsonnet&amp;#39;;

loki_canary {
  loki_canary_args&amp;#43;:: {
    addr: &amp;#34;loki:3100&amp;#34;,
    port: 80,
    labelname: &amp;#34;instance&amp;#34;,
    interval: &amp;#34;100ms&amp;#34;,
    size: 1024,
    wait: &amp;#34;3m&amp;#34;,
  },
  _config&amp;#43;:: {
    namespace: &amp;#34;default&amp;#34;,
  }
}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h4 id=&#34;examples&#34;&gt;Examples&lt;/h4&gt;
&lt;p&gt;Standalone Pod Implementation of loki-canary&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  containers:
  - args:
    - -addr=loki:3100
    image: grafana/loki-canary:latest
    imagePullPolicy: IfNotPresent
    name: loki-canary
    resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;DaemonSet Implementation of loki-canary&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: loki-canary
    name: loki-canary
  name: loki-canary
spec:
  selector:
    matchLabels:
      app: loki-canary
  template:
    metadata:
      name: loki-canary
      labels:
        app: loki-canary
    spec:
      containers:
      - args:
        - -addr=loki:3100
        image: grafana/loki-canary:latest
        imagePullPolicy: IfNotPresent
        name: loki-canary
        resources: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki-canary
  labels:
    app: loki-canary
spec:
  type: ClusterIP
  selector:
    app: loki-canary
  ports:
  - name: metrics
    protocol: TCP
    port: 3500
    targetPort: 3500&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h3 id=&#34;from-source&#34;&gt;From Source&lt;/h3&gt;
&lt;p&gt;If the other options are not sufficient for your use case, you can compile
&lt;code&gt;loki-canary&lt;/code&gt; yourself:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;Bash&lt;/span&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-bash&#34;&gt;# clone the source tree
$ git clone https://github.com/grafana/loki
$ cd loki

# build the binary
$ make loki-canary

# (optionally build the container image)
$ make loki-canary-image&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&#34;configuration&#34;&gt;Configuration&lt;/h2&gt;
&lt;p&gt;The address of Loki must be passed in with the &lt;code&gt;-addr&lt;/code&gt; flag, and if your Loki
server uses TLS, &lt;code&gt;-tls=true&lt;/code&gt; must also be provided. Note that using TLS will
cause the WebSocket connection to use &lt;code&gt;wss://&lt;/code&gt; instead of &lt;code&gt;ws://&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-labelname&lt;/code&gt; and &lt;code&gt;-labelvalue&lt;/code&gt; flags should also be provided, as these are
used by Loki Canary to filter the log stream to only process logs for the
current instance of the canary. Ensure that the values provided to the flags are
unique to each instance of Loki Canary. Grafana Labs&amp;rsquo; Tanka config
accomplishes this by passing in the pod name as the label value.&lt;/p&gt;
&lt;p&gt;If Loki Canary reports a high number of &lt;code&gt;unexpected_entries&lt;/code&gt;, Loki Canary may
not be waiting long enough and the value for the &lt;code&gt;-wait&lt;/code&gt; flag should be
increased to a larger value than 60s.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be aware&lt;/strong&gt; of the relationship between &lt;code&gt;pruneinterval&lt;/code&gt; and the &lt;code&gt;interval&lt;/code&gt;.
For example, with an interval of 10ms (100 logs per second) and a prune interval
of 60s, you will write 6000 logs per minute. If those logs were not received
over the WebSocket, the canary will attempt to query Loki directly to see if
they are completely lost. &lt;strong&gt;However&lt;/strong&gt;, the query result is limited to 1000
entries, so you will not be able to retrieve all the logs even if they did make it
to Loki.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Likewise&lt;/strong&gt;, if you lower the &lt;code&gt;pruneinterval&lt;/code&gt;, you risk an effective denial of
service on Loki, because all your canaries query for missing logs at
the frequency set by &lt;code&gt;pruneinterval&lt;/code&gt;.&lt;/p&gt;
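&lt;p&gt;The relationship between &lt;code&gt;interval&lt;/code&gt;, &lt;code&gt;pruneinterval&lt;/code&gt;, and the 1000-result query limit can be checked with a small calculation. This is a sketch with hypothetical function names; the limit of 1000 is the figure quoted above.&lt;/p&gt;

```python
from datetime import timedelta

QUERY_LIMIT = 1000  # result cap quoted in the text above

def logs_per_prune_window(interval, prune_interval):
    """Number of log lines written between two prune runs."""
    return prune_interval // interval

def fits_in_one_query(interval, prune_interval, limit=QUERY_LIMIT):
    """True if a single direct query could return every line from one window."""
    return limit >= logs_per_prune_window(interval, prune_interval)

# The example from the text: 10ms between lines, 60s prune interval.
written = logs_per_prune_window(timedelta(milliseconds=10), timedelta(seconds=60))
```

&lt;p&gt;With the default 1s interval the same window holds only 60 lines, which stays well under the limit.&lt;/p&gt;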
&lt;p&gt;All options:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
    &lt;span class=&#34;code-clipboard&#34;&gt;
      &lt;button x-data=&#34;app_code_snippet()&#34; x-init=&#34;init()&#34; @click=&#34;copy()&#34;&gt;
        &lt;img class=&#34;code-clipboard__icon&#34; src=&#34;/media/images/icons/icon-copy-small-2.svg&#34; alt=&#34;Copy code to clipboard&#34; width=&#34;14&#34; height=&#34;13&#34;&gt;
        &lt;span&gt;Copy&lt;/span&gt;
      &lt;/button&gt;
    &lt;/span&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;  -addr string
        The Loki server URL:Port, e.g. loki:3100
  -buckets int
        Number of buckets in the response_latency histogram (default 10)
  -interval duration
        Duration between log entries (default 1s)
  -labelname string
        The label name for this instance of Loki Canary to use in the log selector
        (default &amp;#34;name&amp;#34;)
  -labelvalue string
        The unique label value for this instance of Loki Canary to use in the log selector
        (default &amp;#34;loki-canary&amp;#34;)
  -metric-test-interval duration
        The interval the metric test query should be run (default 1h0m0s)
  -metric-test-range duration
        The range value [24h] used in the metric test instant-query. This value is truncated
        to the running time of the canary until this value is reached (default 24h0m0s)
  -out-of-order-max duration
        Maximum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 60s)
  -out-of-order-min duration
        Minimum amount of time (in seconds) in the past an out of order entry may have as a
        timestamp. (default 30s)
  -out-of-order-percentage int
        Percentage (0-100) of log entries that should be sent out of order
  -pass string
        Loki password
  -port int
        Port which Loki Canary should expose metrics (default 3500)
  -pruneinterval duration
        Frequency to check sent versus received logs, and also the frequency at which queries
        for missing logs will be dispatched to Loki, and the frequency spot check queries are run
        (default 1m0s)
  -query-timeout duration
        How long to wait for a query response from Loki (default 10s)
  -size int
        Size in bytes of each log line (default 100)
  -spot-check-interval duration
        Interval that a single result will be kept from sent entries and spot-checked against
        Loki. For example, with the 15 minute default, one entry every 15 minutes will be saved,
        and then queried again every 15 minutes until the time defined by spot-check-max is
        reached (default 15m0s)
  -spot-check-max duration
        How far back to check a spot check an entry before dropping it (default 4h0m0s)
  -spot-check-query-rate duration
        Interval that Loki Canary will query Loki for the current list of all spot check entries
        (default 1m0s)
  -streamname string
        The stream name for this instance of Loki Canary to use in the log selector
        (default &amp;#34;stream&amp;#34;)
  -streamvalue string
        The unique stream value for this instance of Loki Canary to use in the log selector
        (default &amp;#34;stdout&amp;#34;)
  -tenant-id string
        Tenant ID to be set in X-Scope-OrgID header.
  -tls
        Does the Loki connection use TLS?
  -user string
        Loki user name
  -version
        Print this build&amp;#39;s version information
  -wait duration
        Duration to wait for log entries before reporting them as lost (default 1m0s)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
]]></content><description>&lt;h1 id="loki-canary">Loki Canary&lt;/h1>
&lt;p>Loki Canary is a standalone app that audits the log-capturing performance of
a Grafana Loki cluster.&lt;/p>
&lt;p>Loki Canary generates artificial log lines.
These log lines are sent to the Loki cluster.
Loki Canary communicates with the Loki cluster to capture metrics about the
artificial log lines,
such that Loki Canary forms information about the performance of the
Loki cluster.
The information is available as Prometheus time series metrics.&lt;/p></description></item><item><title>Shuffle sharding</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/shuffle-sharding/</link><pubDate>Tue, 16 Jul 2024 15:42:20 +0000</pubDate><guid>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/shuffle-sharding/</guid><content><![CDATA[&lt;h1 id=&#34;shuffle-sharding&#34;&gt;Shuffle sharding&lt;/h1&gt;
&lt;p&gt;Shuffle sharding is a resource-management technique used to isolate tenant workloads from other tenant workloads, to give each tenant more of a single-tenant experience when running in a shared cluster.
This technique is explained by AWS in their article &lt;a href=&#34;https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Workload isolation using shuffle-sharding&lt;/a&gt;.
A reference implementation has been shown in the &lt;a href=&#34;https://github.com/awslabs/route53-infima/blob/master/src/main/java/com/amazonaws/services/route53/infima/SimpleSignatureShuffleSharder.java&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Route53 Infima library&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;the-issues-that-shuffle-sharding-mitigates&#34;&gt;The issues that shuffle sharding mitigates&lt;/h2&gt;
&lt;p&gt;Shuffle sharding can be configured for the query path.&lt;/p&gt;
&lt;p&gt;The query path is sharded by default, and the default does not use shuffle sharding.
Each tenant’s query is sharded across all queriers, so the workload uses all querier instances.&lt;/p&gt;
&lt;p&gt;In a multi-tenant cluster, sharding across all instances of a component may exhibit these issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Any outage of a component instance affects all tenants&lt;/li&gt;
&lt;li&gt;A misbehaving tenant affects all other tenants&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An individual query may create issues for all tenants.
A single tenant or a group of tenants may issue an expensive query:
one that causes a querier component to hit an out-of-memory error,
or one that causes a querier component to crash.
Once the error occurs,
the tenant or tenants issuing the error-causing query will be reassigned
to other running queriers,
up to the limit imposed by the &lt;code&gt;max_queriers_per_tenant&lt;/code&gt; configuration.
This, in turn, may affect the queriers that have been reassigned.&lt;/p&gt;
&lt;h2 id=&#34;how-shuffle-sharding-works&#34;&gt;How shuffle sharding works&lt;/h2&gt;
&lt;p&gt;The idea of shuffle sharding is to assign each tenant to a shard composed of a subset of the Loki queriers, aiming to minimize the overlapping instances between distinct tenants.&lt;/p&gt;
&lt;p&gt;A misbehaving tenant will affect only its shard&amp;rsquo;s queriers. Due to the low overlap of queriers among tenants, only a small subset of tenants will be affected by the misbehaving tenant.
Shuffle sharding requires no more resources than the default sharding strategy.&lt;/p&gt;
&lt;p&gt;Shuffle sharding does not fix all issues.
If a tenant repeatedly sends a problematic query, the crashed querier
will be disconnected from the query-frontend, and a new querier
will be immediately assigned to the tenant’s shard.
This invalidates the positive effects of shuffle sharding.
In this case, it helps to configure a delay between when a querier disconnects because of a crash
and when the crashed querier is actually removed from the tenant’s shard
and replaced by another healthy querier.
A delay of 1 minute may be a reasonable value. Set it
in the query-frontend with the configuration parameter
&lt;code&gt;-query-frontend.querier-forget-delay=1m&lt;/code&gt;, and in the query-scheduler with the configuration parameter
&lt;code&gt;-query-scheduler.querier-forget-delay=1m&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&#34;low-probability-of-overlapping-instances&#34;&gt;Low probability of overlapping instances&lt;/h3&gt;
&lt;p&gt;If an example Loki cluster runs 50 queriers and assigns each tenant 4 out of 50 queriers, shuffling instances between each tenant, there are 230K possible combinations.&lt;/p&gt;
&lt;p&gt;Statistically, randomly picking two distinct tenants, there is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a 71% chance that they will not share any instance&lt;/li&gt;
&lt;li&gt;a 26% chance that they will share only 1 instance&lt;/li&gt;
&lt;li&gt;a 2.7% chance that they will share 2 instances&lt;/li&gt;
&lt;li&gt;a 0.08% chance that they will share 3 instances&lt;/li&gt;
&lt;li&gt;only a 0.0004% chance that their instances will fully overlap&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../shuffle-sharding-probability.png&#34;
  alt=&#34;overlapping instances probability&#34;/&gt;&lt;/p&gt;
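&lt;p&gt;The percentages above can be reproduced with a short calculation. The sketch assumes that each tenant shard is an independent, uniformly random choice of 4 out of 50 queriers, which makes the overlap follow a hypergeometric distribution; the function name is hypothetical.&lt;/p&gt;

```python
from math import comb

def overlap_probability(total, shard_size, shared):
    """Probability that two randomly chosen shards share exactly `shared` instances."""
    # Choose which of the first tenant's instances are shared, then fill the
    # rest of the second tenant's shard from the remaining instances.
    ways = comb(shard_size, shared) * comb(total - shard_size, shard_size - shared)
    return ways / comb(total, shard_size)

combinations = comb(50, 4)  # number of distinct 4-querier shards out of 50

for shared in range(5):
    print(shared, f"{overlap_probability(50, 4, shared):.4%}")
```

&lt;p&gt;Changing the first two arguments shows how the overlap probabilities shift for a different cluster size or shard size.&lt;/p&gt;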
&lt;h2 id=&#34;configuration&#34;&gt;Configuration&lt;/h2&gt;
&lt;p&gt;Enable shuffle sharding by setting &lt;code&gt;-frontend.max-queriers-per-tenant&lt;/code&gt; to a value higher than 0 and lower than the number of available queriers.
The value of the per-tenant configuration
&lt;code&gt;max_queriers_per_tenant&lt;/code&gt; sets the quantity of allocated queriers.
This option is only available when using the query-frontend, with or without a scheduler.&lt;/p&gt;
&lt;p&gt;The per-tenant configuration parameter
&lt;code&gt;max_query_parallelism&lt;/code&gt; describes how many sub queries, after query splitting and query sharding, can be scheduled to run at the same time for each request of any tenant.&lt;/p&gt;
&lt;p&gt;Configuration parameter
&lt;code&gt;querier.concurrency&lt;/code&gt; controls the quantity of worker threads (goroutines) per single querier.&lt;/p&gt;
&lt;p&gt;The maximum number of queriers can be overridden on a per-tenant basis in the limits overrides configuration by &lt;code&gt;max_queriers_per_tenant&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;shuffle-sharding-metrics&#34;&gt;Shuffle sharding metrics&lt;/h2&gt;
&lt;p&gt;These metrics reveal information relevant to shuffle sharding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the overall query-scheduler queue duration, &lt;code&gt;cortex_query_scheduler_queue_duration_seconds_*&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the query-scheduler queue length per tenant, &lt;code&gt;cortex_query_scheduler_queue_length&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the query-scheduler queue duration per tenant can be found with this query:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;max_over_time({cluster=&amp;#34;$cluster&amp;#34;,container=&amp;#34;query-frontend&amp;#34;, namespace=&amp;#34;$namespace&amp;#34;} |= &amp;#34;metrics.go&amp;#34; |logfmt | unwrap duration(queue_time) | __error__=&amp;#34;&amp;#34; [5m]) by (org_id)&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Too many spikes in any of these metrics may imply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A particular tenant is trying to use more query resources than they were allocated.&lt;/li&gt;
&lt;li&gt;That tenant may need an increase in the value of &lt;code&gt;max_queriers_per_tenant&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Loki instances may be under provisioned.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A useful query checks how many queriers are being used by each tenant:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;count by (org_id) (sum by (org_id, pod) (count_over_time({job=&amp;#34;$namespace/querier&amp;#34;, cluster=&amp;#34;$cluster&amp;#34;} |= &amp;#34;metrics.go&amp;#34; | logfmt [$__interval])))&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
]]></content><description>&lt;h1 id="shuffle-sharding">Shuffle sharding&lt;/h1>
&lt;p>Shuffle sharding is a resource-management technique used to isolate tenant workloads from other tenant workloads, to give each tenant more of a single-tenant experience when running in a shared cluster.
This technique is explained by AWS in their article &lt;a href="https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/" target="_blank" rel="noopener noreferrer">Workload isolation using shuffle-sharding&lt;/a>.
A reference implementation has been shown in the &lt;a href="https://github.com/awslabs/route53-infima/blob/master/src/main/java/com/amazonaws/services/route53/infima/SimpleSignatureShuffleSharder.java" target="_blank" rel="noopener noreferrer">Route53 Infima library&lt;/a>.&lt;/p></description></item><item><title>Recording Rules</title><link>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/recording-rules/</link><pubDate>Tue, 16 Jul 2024 15:42:20 +0000</pubDate><guid>https://grafana.com/docs/enterprise-logs/v1.9.x/loki/operations/recording-rules/</guid><content><![CDATA[&lt;h1 id=&#34;recording-rules&#34;&gt;Recording Rules&lt;/h1&gt;
&lt;p&gt;Recording rules are evaluated by the &lt;code&gt;ruler&lt;/code&gt; component. Each &lt;code&gt;ruler&lt;/code&gt; acts as its own &lt;code&gt;querier&lt;/code&gt;, in the sense that it
executes queries against the store without using the &lt;code&gt;query-frontend&lt;/code&gt; or &lt;code&gt;querier&lt;/code&gt; components. It will respect all query
&lt;a href=&#34;/docs/loki/latest/configuration/#limits_config&#34;&gt;limits&lt;/a&gt; put in place for the &lt;code&gt;querier&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Loki&amp;rsquo;s implementation of recording rules largely reuses Prometheus&amp;rsquo; code.&lt;/p&gt;
&lt;p&gt;Samples generated by recording rules are sent to Prometheus using Prometheus&amp;rsquo; &lt;strong&gt;remote-write&lt;/strong&gt; feature.&lt;/p&gt;
&lt;h2 id=&#34;write-ahead-log-wal&#34;&gt;Write-Ahead Log (WAL)&lt;/h2&gt;
&lt;p&gt;All samples generated by recording rules are written to a WAL. The WAL&amp;rsquo;s main benefit is that it persists the samples
generated by recording rules to disk, which means that if your &lt;code&gt;ruler&lt;/code&gt; crashes, you won&amp;rsquo;t lose any data.
We are trading off extra memory usage and slower start-up times for this functionality.&lt;/p&gt;
&lt;p&gt;A WAL is created per tenant; this is done to prevent cross-tenant interactions. If all samples were to be written
to a single WAL, this would increase the chances that one tenant could cause data-loss for others. A typical scenario here
is that Prometheus will, for example, reject a remote-write request with 100 samples if just 1 of those samples is invalid in some way.&lt;/p&gt;
&lt;h3 id=&#34;start-up&#34;&gt;Start-up&lt;/h3&gt;
&lt;p&gt;When the &lt;code&gt;ruler&lt;/code&gt; starts up, it will load the WALs for the tenants who have recording rules. These WAL files are stored
on disk and are loaded into memory.&lt;/p&gt;
&lt;p&gt;Note: WALs are loaded one at a time upon start-up. This is a current limitation of the Loki ruler.
For this reason, it is advisable that the number of rule groups serviced by a ruler be kept to a reasonable size, since
&lt;em&gt;no rule evaluation occurs while WAL replay is in progress (this includes alerting rules)&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id=&#34;truncation&#34;&gt;Truncation&lt;/h3&gt;
&lt;p&gt;WAL files are regularly truncated to reduce their size on disk.
&lt;a href=&#34;https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/#wal-truncation-and-checkpointing&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;This guide&lt;/a&gt;
from one of the Prometheus maintainers (Ganesh Vernekar) gives an excellent overview of the truncation, checkpointing,
and replaying of the WAL.&lt;/p&gt;
&lt;h3 id=&#34;cleaner&#34;&gt;Cleaner&lt;/h3&gt;
&lt;p&gt;&lt;span style=&#34;background-color:#f3f973;&#34;&gt;WAL Cleaner is an experimental feature.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The WAL Cleaner watches for abandoned WALs (tenants who no longer have recording rules associated) and deletes them.
Enable this feature only if you are running into storage concerns with WALs that are too large. Because WALs are
truncated regularly, they should not normally grow excessively large.&lt;/p&gt;
&lt;h2 id=&#34;scaling&#34;&gt;Scaling&lt;/h2&gt;
&lt;p&gt;See Mimir&amp;rsquo;s guide for &lt;a href=&#34;/docs/mimir/latest/configure/configure-hash-rings/&#34;&gt;configuring Grafana Mimir hash rings&lt;/a&gt; for scaling the ruler using a ring.&lt;/p&gt;
&lt;p&gt;Note: the &lt;code&gt;ruler&lt;/code&gt; shards by rule &lt;em&gt;group&lt;/em&gt;, not by individual rules. This is an artifact of the fact that Prometheus
recording rules need to run in order, since one recording rule can reuse another; such reuse is not possible in Loki.&lt;/p&gt;
&lt;h2 id=&#34;deployment&#34;&gt;Deployment&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ruler&lt;/code&gt; needs to persist its WAL files to disk, and it incurs a bit of a start-up cost by reading these WALs into memory.
As such, it is recommended that you try to minimize churn of individual &lt;code&gt;ruler&lt;/code&gt; instances since rule evaluation is blocked
while the WALs are being read from disk.&lt;/p&gt;
&lt;h3 id=&#34;kubernetes&#34;&gt;Kubernetes&lt;/h3&gt;
&lt;p&gt;It is recommended that you run the &lt;code&gt;rulers&lt;/code&gt; using &lt;code&gt;StatefulSets&lt;/code&gt;. The &lt;code&gt;ruler&lt;/code&gt; will write its WAL files to persistent storage,
so a &lt;code&gt;Persistent Volume&lt;/code&gt; should be utilised.&lt;/p&gt;
&lt;h2 id=&#34;remote-write&#34;&gt;Remote-Write&lt;/h2&gt;
&lt;h3 id=&#34;per-tenant-limits&#34;&gt;Per-Tenant Limits&lt;/h3&gt;
&lt;p&gt;Remote-write can be configured at a global level in the base configuration, and certain parameters tuned specifically on
a per-tenant basis. Most of the configuration options &lt;a href=&#34;../../configuration/#ruler&#34;&gt;defined here&lt;/a&gt;
have &lt;a href=&#34;../../configuration/#limits_config&#34;&gt;override options&lt;/a&gt; (which can also be applied at runtime).&lt;/p&gt;
&lt;h3 id=&#34;tuning&#34;&gt;Tuning&lt;/h3&gt;
&lt;p&gt;Remote-write can be tuned if the default configuration is insufficient (see &lt;a href=&#34;#failure-modes&#34;&gt;Failure Modes&lt;/a&gt; below).&lt;/p&gt;
&lt;p&gt;There is a &lt;a href=&#34;https://prometheus.io/docs/practices/remote_write/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;guide&lt;/a&gt; on the Prometheus website, all of which applies to Loki, too.&lt;/p&gt;
&lt;h2 id=&#34;observability&#34;&gt;Observability&lt;/h2&gt;
&lt;p&gt;Since Loki reuses the Prometheus code for recording rules and WALs, it also gains all of Prometheus&amp;rsquo; observability.&lt;/p&gt;
&lt;p&gt;Prometheus exposes a number of metrics for its WAL implementation, and these have all been prefixed with &lt;code&gt;loki_ruler_wal_&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example: &lt;code&gt;prometheus_remote_storage_bytes_total&lt;/code&gt; → &lt;code&gt;loki_ruler_wal_prometheus_remote_storage_bytes_total&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Additional metrics are exposed, also with the prefix &lt;code&gt;loki_ruler_wal_&lt;/code&gt;. All per-tenant metrics contain a &lt;code&gt;tenant&lt;/code&gt;
label, so be aware that cardinality could begin to be a concern if the number of tenants grows sufficiently large.&lt;/p&gt;
&lt;p&gt;Some key metrics to note are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_appender_ready&lt;/code&gt;: whether a WAL appender is ready to accept samples (1) or not (0)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_samples_total&lt;/code&gt;: number of samples sent per tenant to remote storage&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_samples...&lt;/code&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_samples_pending_total&lt;/code&gt;: samples buffered in memory, waiting to be sent to remote storage&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_samples_failed_total&lt;/code&gt;: samples that failed when sent to remote storage&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_samples_dropped_total&lt;/code&gt;: samples dropped by relabel configurations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_samples_retried_total&lt;/code&gt;: samples resent to remote storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds&lt;/code&gt;: highest timestamp of sample appended to WAL&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds&lt;/code&gt;: highest timestamp of sample sent to remote storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We&amp;rsquo;ve created a basic &lt;a href=&#34;https://github.com/grafana/loki/tree/main/production/loki-mixin/dashboards/recording-rules.libsonnet&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;dashboard in our loki-mixin&lt;/a&gt;
which you can use to administer recording rules.&lt;/p&gt;
&lt;h2 id=&#34;failure-modes&#34;&gt;Failure Modes&lt;/h2&gt;
&lt;h3 id=&#34;remote-write-lagging&#34;&gt;Remote-Write Lagging&lt;/h3&gt;
&lt;p&gt;Remote-write can lag behind for many reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Remote-write storage (Prometheus) is temporarily unavailable&lt;/li&gt;
&lt;li&gt;A tenant is producing samples too quickly from a recording rule&lt;/li&gt;
&lt;li&gt;Remote-write is tuned too low, creating backpressure&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The lag can be determined by subtracting
&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds&lt;/code&gt; from
&lt;code&gt;loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds&lt;/code&gt;.&lt;/p&gt;
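&lt;p&gt;The subtraction can be sketched as follows. This is an illustrative example only: it assumes the two metrics have already been scraped into plain per-tenant dictionaries of Unix timestamps, and the function names and the 120-second threshold are hypothetical, not Loki defaults.&lt;/p&gt;

```python
def remote_write_lag(highest_appended, highest_sent):
    """Per-tenant lag in seconds.

    highest_appended: tenant -> value of
        loki_ruler_wal_prometheus_remote_storage_highest_timestamp_in_seconds
    highest_sent: tenant -> value of
        loki_ruler_wal_prometheus_remote_storage_queue_highest_sent_timestamp_seconds
    Tenants missing from either input are skipped.
    """
    return {
        tenant: highest_appended[tenant] - highest_sent[tenant]
        for tenant in highest_appended
        if tenant in highest_sent
    }


def lagging_tenants(lag, threshold_seconds=120.0):
    """Tenants whose lag exceeds the (hypothetical) threshold, sorted by name."""
    return sorted(t for t, seconds in lag.items() if seconds > threshold_seconds)


if __name__ == "__main__":
    appended = {"tenant-a": 1000.0, "tenant-b": 1000.0}
    sent = {"tenant-a": 995.0, "tenant-b": 700.0}
    lag = remote_write_lag(appended, sent)
    print(lag)
    print(lagging_tenants(lag))
```

&lt;p&gt;A lag that keeps growing for one tenant while others stay flat suggests a per-tenant cause rather than an unavailable remote storage.&lt;/p&gt;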
&lt;p&gt;In case 1, the &lt;code&gt;ruler&lt;/code&gt; will continue to retry sending these samples until the remote storage becomes available again. Be
aware that if the remote storage is down for longer than &lt;code&gt;ruler.wal.max-age&lt;/code&gt;, data loss may occur once the WAL is truncated.&lt;/p&gt;
&lt;p&gt;In cases 2 &amp;amp; 3, you should consider &lt;a href=&#34;#tuning&#34;&gt;tuning&lt;/a&gt; remote-write appropriately.&lt;/p&gt;
&lt;p&gt;Further reading: see &lt;a href=&#34;/blog/2021/04/12/how-to-troubleshoot-remote-write-issues-in-prometheus/&#34;&gt;this blog post&lt;/a&gt;
by Prometheus maintainer Callum Styan.&lt;/p&gt;
&lt;h3 id=&#34;appender-not-ready&#34;&gt;Appender Not Ready&lt;/h3&gt;
&lt;p&gt;Each tenant&amp;rsquo;s WAL has an &amp;ldquo;appender&amp;rdquo; internally; this appender is used to &lt;em&gt;append&lt;/em&gt; samples to the WAL. The appender is marked
as &lt;em&gt;not ready&lt;/em&gt; until the WAL replay is complete upon startup. If the WAL is corrupted for some reason, or is taking a long
time to replay, you can detect this by alerting on &lt;code&gt;loki_ruler_wal_appender_ready &amp;lt; 1&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&#34;corrupt-wal&#34;&gt;Corrupt WAL&lt;/h3&gt;
&lt;p&gt;If a disk fails or the &lt;code&gt;ruler&lt;/code&gt; does not terminate correctly, there&amp;rsquo;s a chance one or more tenant WALs can become corrupted.
A mechanism exists for automatically repairing the WAL, but it cannot handle every conceivable scenario. When a repair fails,
the &lt;code&gt;loki_ruler_wal_corruptions_repair_failed_total&lt;/code&gt; metric is incremented.&lt;/p&gt;
&lt;h3 id=&#34;found-another-failure-mode&#34;&gt;Found another failure mode?&lt;/h3&gt;
&lt;p&gt;Please open an &lt;a href=&#34;https://github.com/grafana/loki/issues&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;issue&lt;/a&gt; and tell us about it!&lt;/p&gt;
]]></content><description>&lt;h1 id="recording-rules">Recording Rules&lt;/h1>
&lt;p>Recording rules are evaluated by the &lt;code>ruler&lt;/code> component. Each &lt;code>ruler&lt;/code> acts as its own &lt;code>querier&lt;/code>, in the sense that it
executes queries against the store without using the &lt;code>query-frontend&lt;/code> or &lt;code>querier&lt;/code> components. It will respect all query
&lt;a href="/docs/loki/latest/configuration/#limits_config">limits&lt;/a> put in place for the &lt;code>querier&lt;/code>.&lt;/p></description></item></channel></rss>