<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Loki Improvement Documents (LIDs) on Grafana Labs</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/</link><description>Recent content in Loki Improvement Documents (LIDs) on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/docs/loki/v3.7.x/community/lids/index.xml" rel="self" type="application/rss+xml"/><item><title>0001: Introducing LIDs</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/0001-introduction/</link><pubDate>Thu, 09 Apr 2026 02:28:18 +0000</pubDate><guid>https://grafana.com/docs/loki/v3.7.x/community/lids/0001-introduction/</guid><content><![CDATA[&lt;h1 id=&#34;0001-introducing-lids&#34;&gt;0001: Introducing LIDs&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Danny Kopping (&lt;a href=&#34;mailto:danny.kopping@grafana.com&#34;&gt;danny.kopping@grafana.com&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 01/2023&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sponsor(s):&lt;/strong&gt; @dannykopping&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; Process&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Accepted&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related issues/PRs:&lt;/strong&gt; N/A&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thread from &lt;a href=&#34;https://groups.google.com/forum/#!forum/lokiproject&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mailing list&lt;/a&gt;:&lt;/strong&gt; N/A&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;As the Grafana Loki project grows, we have seen more and more contributions from external (outside Grafana Labs) contributors.&lt;/p&gt;
&lt;h2 id=&#34;problem-statement&#34;&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;Many of these external contributions are large and complex, and have taken these contributors significant time to implement. Large contributions that are made without prior discussion with maintainers are at risk of being rejected if they are misguided, implemented inefficiently, or simply undesired; this is obviously suboptimal both for the contributors and the maintainers.&lt;/p&gt;
&lt;p&gt;Aside from external contributions, changes being proposed by Grafana Loki maintainers may also require community engagement before being worked on.&lt;/p&gt;
&lt;h2 id=&#34;goals&#34;&gt;Goals&lt;/h2&gt;
&lt;p&gt;It would be preferable to engage with contributors &lt;em&gt;before&lt;/em&gt; they make large contributions to ensure that both their and the project&amp;rsquo;s interests are aligned. The community at large must also have a voice when feature or process changes are being proposed, to protect their own interests.&lt;/p&gt;
&lt;p&gt;We should implement a &lt;strong&gt;lightweight&lt;/strong&gt; process that guides the implementation of major changes to the project.&lt;/p&gt;
&lt;h2 id=&#34;proposals&#34;&gt;Proposals&lt;/h2&gt;
&lt;h3 id=&#34;proposal-0-do-nothing&#34;&gt;Proposal 0: Do nothing&lt;/h3&gt;
&lt;p&gt;We will continue to attract large, often complex, external contributions that have not been discussed with maintainers before the work was put in; this may lead to suboptimal outcomes for the relationship between the project and its community.&lt;/p&gt;
&lt;h3 id=&#34;proposal-1-loki-improvement-documents&#34;&gt;Proposal 1: Loki Improvement Documents&lt;/h3&gt;
&lt;p&gt;Inspired by Python&amp;rsquo;s &lt;a href=&#34;https://peps.python.org/pep-0001/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;PEP&lt;/a&gt; and Kafka&amp;rsquo;s &lt;a href=&#34;https://cwiki.apache.org/confluence/display/KAFKA/Kafka&amp;#43;Improvement&amp;#43;Proposals&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;KIP&lt;/a&gt; approaches, we should create a process for formally documenting improvements to Loki which are permanently viewable, and document our decisions.&lt;/p&gt;
&lt;h2 id=&#34;other-notes&#34;&gt;Other Notes&lt;/h2&gt;
&lt;p&gt;Google Docs were considered for this, but they are less useful because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;they would need to be owned by the Grafana Labs organisation, so that they remain viewable even if the author closes their account&lt;/li&gt;
&lt;li&gt;we already have previous &lt;a href=&#34;../../design-documents/&#34;&gt;design documents&lt;/a&gt; in our documentation and, in a recent (&lt;a href=&#34;https://docs.google.com/document/d/1MNjiHQxwFukm2J4NJRWyRgRIiK7VpokYyATzJ5ce-O8/edit#heading=h.78vexgrrtw5a&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;5th Jan 2023&lt;/a&gt;) community call, the community expressed a preference for this type of approach&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="0001-introducing-lids">0001: Introducing LIDs&lt;/h1>
&lt;p>&lt;strong>Author:&lt;/strong> Danny Kopping (&lt;a href="mailto:danny.kopping@grafana.com">danny.kopping@grafana.com&lt;/a>)&lt;/p>
&lt;p>&lt;strong>Date:&lt;/strong> 01/2023&lt;/p>
&lt;p>&lt;strong>Sponsor(s):&lt;/strong> @dannykopping&lt;/p>
&lt;p>&lt;strong>Type:&lt;/strong> Process&lt;/p>
&lt;p>&lt;strong>Status:&lt;/strong> Accepted&lt;/p>
&lt;p>&lt;strong>Related issues/PRs:&lt;/strong> N/A&lt;/p>
&lt;p>&lt;strong>Thread from &lt;a href="https://groups.google.com/forum/#!forum/lokiproject" target="_blank" rel="noopener noreferrer">mailing list&lt;/a>:&lt;/strong> N/A&lt;/p>
&lt;hr />
&lt;h2 id="background">Background&lt;/h2>
&lt;p>As the Grafana Loki project grows, we have seen more and more contributions from external (outside Grafana Labs) contributors.&lt;/p></description></item><item><title>0002: Remote Rule Evaluation</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/0002-remoteruleevaluation/</link><pubDate>Thu, 09 Apr 2026 02:28:18 +0000</pubDate><guid>https://grafana.com/docs/loki/v3.7.x/community/lids/0002-remoteruleevaluation/</guid><content><![CDATA[&lt;h1 id=&#34;0002-remote-rule-evaluation&#34;&gt;0002: Remote Rule Evaluation&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Danny Kopping (&lt;a href=&#34;mailto:danny.kopping@grafana.com&#34;&gt;danny.kopping@grafana.com&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 01/2023&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sponsor(s):&lt;/strong&gt; @dannykopping&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; Feature&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Accepted&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related issues/PRs:&lt;/strong&gt; &lt;a href=&#34;https://github.com/grafana/mimir/pull/1536&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;https://github.com/grafana/mimir/pull/1536&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thread from &lt;a href=&#34;https://groups.google.com/forum/#!forum/lokiproject&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mailing list&lt;/a&gt;:&lt;/strong&gt; N/A&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ruler&lt;/code&gt; is a component that evaluates alerting and recording rules. Loki reuses Prometheus&amp;rsquo; rule evaluation engine. The &lt;code&gt;ruler&lt;/code&gt; currently operates by initialising a &lt;code&gt;querier&lt;/code&gt; internally and evaluating all rules &amp;ldquo;locally&amp;rdquo; (i.e. it does not rely on any other components). Each rule group executes concurrently, and rules within the rule group are evaluated sequentially (this is an implementation detail from Prometheus).&lt;/p&gt;
&lt;p&gt;Recording rules produce metric series which are sent to a Prometheus-compatible source. Alerting rules send notifications to Alertmanager when a condition is met. Both of these rule types can play a vital role in an organisation&amp;rsquo;s observability strategy, and so their reliable evaluation is essential.&lt;/p&gt;
&lt;h2 id=&#34;problem-statement&#34;&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;Rule evaluations can contain expensive queries. The &lt;code&gt;ruler&lt;/code&gt; initialises a &lt;code&gt;querier&lt;/code&gt;, but the &lt;code&gt;querier&lt;/code&gt; does not have the capability to accelerate queries; the &lt;code&gt;query-frontend&lt;/code&gt; component is responsible for query acceleration through splitting, sharding, caching, and other techniques.&lt;/p&gt;
&lt;p&gt;An expensive rule query can cause an entire &lt;code&gt;ruler&lt;/code&gt; instance to use excessive resources and even crash. This is highly problematic for the following reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a slow rule evaluation can cause subsequent rules in a group to be delayed or missed, leading to missing alerts or gaps in recording rule metrics&lt;/li&gt;
&lt;li&gt;excessive resource usage can impede the evaluation of rules for other tenants (noisy neighbour)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;goals&#34;&gt;Goals&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;faster, more efficient rule evaluation&lt;/li&gt;
&lt;li&gt;greater isolation between tenants&lt;/li&gt;
&lt;li&gt;more reliable service&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;non-goals&#34;&gt;Non-Goals&lt;/h2&gt;
&lt;p&gt;This proposal does not aim to make this option the default mode of evaluation; it should be optional because it increases operational complexity.&lt;/p&gt;
&lt;h2 id=&#34;proposals&#34;&gt;Proposals&lt;/h2&gt;
&lt;h3 id=&#34;proposal-0-do-nothing&#34;&gt;Proposal 0: Do nothing&lt;/h3&gt;
&lt;p&gt;Loki&amp;rsquo;s current &lt;code&gt;ruler&lt;/code&gt; implementation is sufficient for small installations running relatively simple or inexpensive queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nothing to be done&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Loki&amp;rsquo;s &lt;code&gt;ruler&lt;/code&gt; will remain unreliable and inefficient when used in large multi-tenant environments with expensive queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;proposal-1-remote-execution&#34;&gt;Proposal 1: Remote Execution&lt;/h3&gt;
&lt;p&gt;Taking inspiration from &lt;a href=&#34;/docs/mimir/latest/operators-guide/architecture/components/ruler/#remote&#34;&gt;Grafana Mimir&amp;rsquo;s implementation&lt;/a&gt;, the &lt;code&gt;ruler&lt;/code&gt; would be configured to send its rule queries to the &lt;code&gt;query-frontend&lt;/code&gt; component over gRPC. The &lt;code&gt;querier&lt;/code&gt; instances receiving queries from the &lt;code&gt;query-frontend&lt;/code&gt; (or optionally via the &lt;code&gt;query-scheduler&lt;/code&gt;) will handle the requests and send their responses back to the &lt;code&gt;query-frontend&lt;/code&gt;, where they are combined. The &lt;code&gt;ruler&lt;/code&gt; will receive and process these responses as if the query had been executed locally.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Takes full advantage of Loki&amp;rsquo;s query acceleration techniques, leading to faster and more efficient rule evaluation&lt;/li&gt;
&lt;li&gt;Operationally simple as existing &lt;code&gt;query-frontend&lt;/code&gt;/&lt;code&gt;query-scheduler&lt;/code&gt;/&lt;code&gt;querier&lt;/code&gt; setup can be used&lt;/li&gt;
&lt;li&gt;Per-tenant isolation available in Loki&amp;rsquo;s query path (shuffle-sharding, per-tenant queues) can be used to reduce or eliminate the noisy neighbour problem&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increased interdependence in components, increased cross-component networking&lt;/li&gt;
&lt;li&gt;Reusing the same &lt;code&gt;query-frontend&lt;/code&gt;/&lt;code&gt;query-scheduler&lt;/code&gt;/&lt;code&gt;querier&lt;/code&gt; setup can cause expensive queries to starve rule evaluations of query resources, and vice versa
&lt;ul&gt;
&lt;li&gt;Additional complexity introduced if this setup needs to be duplicated for rule evaluations (recommended: see &lt;strong&gt;Other Notes&lt;/strong&gt; section below)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;other-notes&#34;&gt;Other Notes&lt;/h2&gt;
&lt;p&gt;If this feature were to be used in conjunction with &lt;a href=&#34;https://github.com/grafana/loki/pull/8092&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;rule-based sharding&lt;/a&gt;, this can present some further optimisation but also some additional challenges to consider.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Aside: the &lt;code&gt;ruler&lt;/code&gt; shards by rule group by default, which means that rules can be unevenly balanced across &lt;code&gt;ruler&lt;/code&gt; instances if some rule groups have more expensive queries than others. Additionally, rules within a group execute sequentially, so an expensive query can cause subsequent rules in the group to be delayed or even missed; rule groups themselves are evaluated concurrently.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Rule-based sharding distributes rules evenly across all available &lt;code&gt;ruler&lt;/code&gt; instances, each in their own rule group. Consequently, each rule that belongs to a &lt;code&gt;ruler&lt;/code&gt; instance will be evaluated concurrently (as they&amp;rsquo;re each in their own rule group). For tenants with hundreds or thousands of rules, this can result in large batches of queries being sent to the &lt;code&gt;query-frontend&lt;/code&gt; in quick succession, should they all use the same interval or happen to overlap.&lt;/p&gt;
&lt;p&gt;Assuming the remote rule evaluation takes place on the same read path that is used to execute tenant queries, care must be taken by operators who run large multi-tenant setups to ensure that large volumes of queries can be received, queued, and processed in an acceptable timeframe. The &lt;code&gt;query-scheduler&lt;/code&gt; component is highly recommended in these situations, as it will enable the &lt;code&gt;query-frontend&lt;/code&gt; and &lt;code&gt;querier&lt;/code&gt; components to scale out to accommodate the load. Shuffle-sharding should also be implemented to ensure that tenants with particularly large workloads do not starve out the query resources of other tenants. Alerting should also be put in place to notify operators if rule evaluations are being routinely missed or a tenant&amp;rsquo;s query queues become full.&lt;/p&gt;
&lt;p&gt;If rule evaluations and tenant queries are slowing each other down, the read path setup would need to be duplicated so that tenant queries and rule evaluations would not share the same query execution resources.&lt;/p&gt;
&lt;p&gt;Rule-based sharding and remote evaluation can (and should) be implemented separately. Operators should first implement remote evaluation to improve &lt;code&gt;ruler&lt;/code&gt; reliability, and &lt;em&gt;then&lt;/em&gt; further investigate rule-based sharding if rule evaluations are still being missed due to the sequential execution of rule groups, or advise their tenants to split these rule groups up.&lt;/p&gt;
]]></content><description>&lt;h1 id="0002-remote-rule-evaluation">0002: Remote Rule Evaluation&lt;/h1>
&lt;p>&lt;strong>Author:&lt;/strong> Danny Kopping (&lt;a href="mailto:danny.kopping@grafana.com">danny.kopping@grafana.com&lt;/a>)&lt;/p>
&lt;p>&lt;strong>Date:&lt;/strong> 01/2023&lt;/p>
&lt;p>&lt;strong>Sponsor(s):&lt;/strong> @dannykopping&lt;/p>
&lt;p>&lt;strong>Type:&lt;/strong> Feature&lt;/p>
&lt;p>&lt;strong>Status:&lt;/strong> Accepted&lt;/p>
&lt;p>&lt;strong>Related issues/PRs:&lt;/strong> &lt;a href="https://github.com/grafana/mimir/pull/1536" target="_blank" rel="noopener noreferrer">https://github.com/grafana/mimir/pull/1536&lt;/a>&lt;/p>
&lt;p>&lt;strong>Thread from &lt;a href="https://groups.google.com/forum/#!forum/lokiproject" target="_blank" rel="noopener noreferrer">mailing list&lt;/a>:&lt;/strong> N/A&lt;/p>
&lt;hr />
&lt;h2 id="background">Background&lt;/h2>
&lt;p>The &lt;code>ruler&lt;/code> is a component that evaluates alerting and recording rules. Loki reuses Prometheus&amp;rsquo; rule evaluation engine. The &lt;code>ruler&lt;/code> currently operates by initialising a &lt;code>querier&lt;/code> internally and evaluating all rules &amp;ldquo;locally&amp;rdquo; (i.e. it does not rely on any other components). Each rule group executes concurrently, and rules within the rule group are evaluated sequentially (this is an implementation detail from Prometheus).&lt;/p></description></item><item><title>0003: Query fairness across users within tenants</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/0003-queryfairnessinscheduler/</link><pubDate>Thu, 09 Apr 2026 02:28:18 +0000</pubDate><guid>https://grafana.com/docs/loki/v3.7.x/community/lids/0003-queryfairnessinscheduler/</guid><content><![CDATA[&lt;h1 id=&#34;0003-query-fairness-across-users-within-tenants&#34;&gt;0003: Query fairness across users within tenants&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Christian Haudum (&lt;a href=&#34;mailto:christian.haudum@grafana.com&#34;&gt;christian.haudum@grafana.com&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 02/2023&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sponsor(s):&lt;/strong&gt; @chaudum @owen-d&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; Feature&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Accepted&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related issues/PRs:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thread from &lt;a href=&#34;https://groups.google.com/forum/#!forum/lokiproject&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mailing list&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;The query scheduler (or scheduler, for short) is a component of Loki that distributes requests (sub-queries) from the query frontend (or frontend, for short) to the querier workers so that execution fairness between tenants can be guaranteed.&lt;/p&gt;
&lt;p&gt;By maintaining separate FIFO queues for each tenant and assigning the correct amount of querier workers to these queues, the scheduler ensures that a single tenant cannot compromise all other tenants&amp;rsquo; query capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Component diagram:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../scheduler-component-diagram.png&#34;
  alt=&#34;scheduler-component-diagram.plantuml&#34;/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sequence diagram:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../scheduler-sequence-diagram.png&#34;
  alt=&#34;scheduler-sequence-diagram.plantuml&#34;/&gt;&lt;/p&gt;
&lt;h2 id=&#34;problem-statement&#34;&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;Even though Loki is built as a multi-tenant system by default, there are use-cases where a Loki installation has only a single, very large tenant, e.g. dedicated Loki cells for customers in Grafana Cloud.&lt;/p&gt;
&lt;p&gt;However, there are potentially a lot of different users using the same tenant to query logs, such as users accessing Loki from Grafana or via CLI or HTTP API. This can lead to contention between queries of different users, because they all share the same tenant.&lt;/p&gt;
&lt;p&gt;While the current implementation of the scheduler queues allows for QoS guarantees between tenants, it does not account for QoS guarantees across individual users within a single tenant.&lt;/p&gt;
&lt;p&gt;Note, however, that Loki does not currently have a notion of individual users.&lt;/p&gt;
&lt;h2 id=&#34;goals&#34;&gt;Goals&lt;/h2&gt;
&lt;p&gt;The main goal of the following proposals is to lay out ideas on how to improve the scheduler component so that it assures QoS not only across tenants, but also across actors (users) within a tenant, without requiring any changes to the deployment model of frontend, scheduler and queriers.
This should also include changes to the queue structure so that it is easily extensible for future scheduling improvements.&lt;/p&gt;
&lt;h2 id=&#34;non-goals-optional&#34;&gt;Non-Goals (optional)&lt;/h2&gt;
&lt;p&gt;While changing and extending the scheduler also requires user-facing API changes, the public API is not part of the discussion of this document.&lt;/p&gt;
&lt;h2 id=&#34;proposals&#34;&gt;Proposals&lt;/h2&gt;
&lt;h3 id=&#34;proposal-0-do-nothing&#34;&gt;Proposal 0: Do nothing&lt;/h3&gt;
&lt;p&gt;An alternative to changing the scheduling mechanism is to handle QoS control via multiple tenants and multi-tenant querying.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keeps the scheduler as simple as it is now&lt;/li&gt;
&lt;li&gt;No development time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;While that separation into tenants may work for some deployments, it might not be feasible to implement for others.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;proposal-1-add-fixed-second-level-to-scheduler&#34;&gt;Proposal 1: Add fixed second level to scheduler&lt;/h3&gt;
&lt;p&gt;The current scheduler is implemented in a way that it maintains a separate FIFO queue for each tenant. When a request (sub-query) is enqueued, the scheduler puts it into the existing queue for that tenant. If the queue does not exist yet, it creates it first and re-assigns the connected querier workers to the available tenant queues. Each querier worker pulls round-robin from the assigned queues in a loop.&lt;/p&gt;
&lt;p&gt;Now, instead of enqueuing and pulling directly from the per-tenant queue, requests get enqueued in per-user queues, and the per-tenant queue pulls round-robin from the user queues assigned to it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Component diagram:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../scheduler-proposal-1-component-diagram.png&#34;
  alt=&#34;scheduler-proposal-1-component-diagram.plantuml&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Like the current implementation, the scheduler enqueues requests based on the &lt;code&gt;X-Scope-OrgID&lt;/code&gt; header (or equivalent key in the request context), but also takes a second key (such as &lt;code&gt;X-Scope-UserID&lt;/code&gt;) into account. This establishes a fixed two-level hierarchy in which the tenant-to-user relation is one-to-many.
However, this has the disadvantage that the concept of users (which does not exist yet in Loki) leaks into the scheduler domain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Relatively simple to implement&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Not extensible&lt;/li&gt;
&lt;li&gt;Leaks domain knowledge&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;proposal-2-fully-hierarchical-scheduler&#34;&gt;Proposal 2: Fully hierarchical scheduler&lt;/h3&gt;
&lt;p&gt;This proposal is similar to &lt;em&gt;Proposal 1&lt;/em&gt;, but with the difference that there are no fixed levels and levels can be nested arbitrarily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Component diagram:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../scheduler-proposal-2-component-diagram.png&#34;
  alt=&#34;scheduler-proposal-2-component-diagram.plantuml&#34;/&gt;&lt;/p&gt;
&lt;p&gt;The implementation of the &lt;code&gt;RequestQueue&lt;/code&gt;, which controls what querier workers are connected to which root queues (aka tenant queues), can be kept as is. However, the concept of tenants and users is dropped and replaced by a concept of hierarchical actors, which can be represented as a slice of identifiers. Note, this does &lt;strong&gt;not&lt;/strong&gt; drop the concept of tenants throughout Loki (represented in the &lt;code&gt;X-Scope-OrgID&lt;/code&gt; header and/or request context).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example of identifiers:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;actorA := []string{&amp;#34;tenant_a&amp;#34;, &amp;#34;user_1&amp;#34;}
actorB := []string{&amp;#34;tenant_b&amp;#34;, &amp;#34;user_2&amp;#34;}
actorC := []string{&amp;#34;tenant_b&amp;#34;, &amp;#34;user_3&amp;#34;, &amp;#34;service_foo&amp;#34;}
actorD := []string{&amp;#34;tenant_b&amp;#34;, &amp;#34;user_3&amp;#34;, &amp;#34;service_bar&amp;#34;}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;More generally:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;actorN := []string{&amp;#34;L0 Queue&amp;#34;, &amp;#34;L1 Queue&amp;#34;, &amp;#34;L2 Queue&amp;#34;, ... &amp;#34;Ln Queue&amp;#34;}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The L0 queue (root queue) needs to be able to handle worker connections and therefore needs additional functionality compared to its leaf queues.&lt;/p&gt;
&lt;p&gt;The following code snippet is meant to show the simplified recursive structure of the queues.&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;type Request interface{}

type Queue interface {
    Dequeue(actor []string) Request
    Enqueue(r Request, actor []string) error
}

// RequestQueue implements Queue
type RequestQueue struct {
    queriers   map[string]*querier
    rootQueues map[string]*RootQueue
}

// RootQueue implements Queue
type RootQueue struct {
    queriers map[string]*querier
    leafs    map[string]*LeafQueue
    ch       chan Request
}

// LeafQueue implements Queue
type LeafQueue struct {
    leafs map[string]*LeafQueue
    ch    chan Request
}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Backwards compatible, because a tenant can be identified as &lt;code&gt;[]string{&amp;quot;tenantID&amp;quot;}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Queue hierarchy can be extended without changing the scheduler implementation&lt;/li&gt;
&lt;li&gt;Implementation does not require knowledge outside of its domain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More complex to implement than a fixed number of levels&lt;/li&gt;
&lt;li&gt;Each queue comes with memory overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;proposal-3-multiple-per-tenant-sub-queues&#34;&gt;Proposal 3: Multiple per-tenant sub-queues&lt;/h3&gt;
&lt;p&gt;Another option to keep the concept of users out of Loki and still provide some query fairness guarantees would be to simply shard requests across multiple sub-queues within a tenant&amp;rsquo;s queue. The shard size could be a per-tenant setting to account for different tenant sizes.&lt;/p&gt;
&lt;p&gt;This is similar to Proposal 1 in that it adds another fixed level of sub-queues.
The difference is that each query request is assigned a random identifier that is hashed. When the query is split, the sub-requests maintain the same hashed identifier. The hash modulo the number of sub-queues determines to which sub-queue of a tenant a request is enqueued.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User agnostic per-request QoS control&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requests of individual users can still affect other users&lt;/li&gt;
&lt;li&gt;Not extensible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Alternative:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sharding on a per-request basis can still be achieved with Proposal 2, by adding the request hash as an additional level in the hierarchy.&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-go&#34;&gt;actor := []string{&amp;#34;tenant&amp;#34;, &amp;#34;user&amp;#34;, &amp;#34;request_hash&amp;#34;}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h2 id=&#34;consensus&#34;&gt;Consensus&lt;/h2&gt;
&lt;p&gt;Proposal 2 is going to be implemented.&lt;/p&gt;
]]></content><description>&lt;h1 id="0003-query-fairness-across-users-within-tenants">0003: Query fairness across users within tenants&lt;/h1>
&lt;p>&lt;strong>Author:&lt;/strong> Christian Haudum (&lt;a href="mailto:christian.haudum@grafana.com">christian.haudum@grafana.com&lt;/a>)&lt;/p>
&lt;p>&lt;strong>Date:&lt;/strong> 02/2023&lt;/p>
&lt;p>&lt;strong>Sponsor(s):&lt;/strong> @chaudum @owen-d&lt;/p>
&lt;p>&lt;strong>Type:&lt;/strong> Feature&lt;/p>
&lt;p>&lt;strong>Status:&lt;/strong> Accepted&lt;/p>
&lt;p>&lt;strong>Related issues/PRs:&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Thread from &lt;a href="https://groups.google.com/forum/#!forum/lokiproject" target="_blank" rel="noopener noreferrer">mailing list&lt;/a>:&lt;/strong>&lt;/p></description></item><item><title>0004: Index Gateway Sharding</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/0004-indexgatewaysharding/</link><pubDate>Thu, 09 Apr 2026 02:28:18 +0000</pubDate><guid>https://grafana.com/docs/loki/v3.7.x/community/lids/0004-indexgatewaysharding/</guid><content><![CDATA[&lt;h1 id=&#34;0004-index-gateway-sharding&#34;&gt;0004: Index Gateway Sharding&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Christian Haudum (&lt;a href=&#34;mailto:christian.haudum@grafana.com&#34;&gt;christian.haudum@grafana.com&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 02/2023&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sponsor(s):&lt;/strong&gt; @chaudum @owen-d&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; Feature&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Rejected / Not Implemented&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related issues/PRs:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thread from &lt;a href=&#34;https://groups.google.com/forum/#!forum/lokiproject&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mailing list&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;This document proposes a better way to shard data across the index gateways, so that the service can be scaled horizontally to meet the growing demand for metadata queries from large tenants.&lt;/p&gt;
&lt;p&gt;The index gateway service can be run in &amp;ldquo;simple mode&amp;rdquo;, where an index gateway instance is responsible for handling, storing, and serving the indices of all tenants, or in &amp;ldquo;ring mode&amp;rdquo;, where an instance is responsible for a subset of tenants instead of all tenants.&lt;/p&gt;
&lt;p&gt;On top of that, in order to achieve redundancy and to spread load, the index gateway ring uses a replication factor of 3 by default.&lt;/p&gt;
&lt;p&gt;This means that before an index gateway client makes a request to the index gateway server, it first hashes the tenant ID and then requests a replication set for that hash from the index gateway ring. Due to the fixed replication factor (RF), the replication set contains three server addresses. On every request, a random server from that list is picked to execute the request.&lt;/p&gt;
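&lt;p&gt;A minimal sketch of this lookup in Go. This is an illustration only, not Loki&amp;rsquo;s actual ring implementation: placement is reduced to modular arithmetic over a fixed instance list, and all names are made up.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// replicationSet returns the rf instances responsible for a tenant.
// The real ring uses consistent hashing with tokens; here placement is
// simplified to a modular walk over a fixed list of addresses.
func replicationSet(tenantID string, instances []string, rf int) []string {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	start := int(h.Sum32() % uint32(len(instances)))
	set := make([]string, 0, rf)
	for i := 0; i < rf && i < len(instances); i++ {
		set = append(set, instances[(start+i)%len(instances)])
	}
	return set
}

func main() {
	instances := []string{"gw-0", "gw-1", "gw-2", "gw-3", "gw-4"}
	set := replicationSet("tenant-a", instances, 3)
	// On every request, a random member of the replication set is picked.
	fmt.Println(set[rand.Intn(len(set))])
}
```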
&lt;h2 id=&#34;problem-statement&#34;&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;The current strategy of sharding by tenant ID with a fixed replication factor fails in the long run, because even when running many index gateways, a single tenant can utilize at most &lt;code&gt;n&lt;/code&gt; instances, where &lt;code&gt;n&lt;/code&gt; is the value of the configured RF.&lt;/p&gt;
&lt;p&gt;Another problem is that the RF is fixed and the same for all tenants, independent of their actual size in terms of log volume or query rate.&lt;/p&gt;
&lt;h2 id=&#34;goals&#34;&gt;Goals&lt;/h2&gt;
&lt;p&gt;The goal of this document is to find a better sharding mechanism for the index gateway, so that there are no boundaries for scaling the service horizontally.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The sharding needs to account for the &amp;ldquo;size&amp;rdquo; of a tenant.&lt;/li&gt;
&lt;li&gt;A single tenant needs to be able to utilize more than three index gateways.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;proposals&#34;&gt;Proposals&lt;/h2&gt;
&lt;h3 id=&#34;proposal-0-do-nothing&#34;&gt;Proposal 0: Do nothing&lt;/h3&gt;
&lt;p&gt;If we do not improve the sharding mechanism for the index gateways and leave it as it is, it will become more and more difficult to serve metadata queries for large tenants in a reasonable amount of time, proportionally to the demand for these queries.&lt;/p&gt;
&lt;h3 id=&#34;proposal-1-dynamic-replication-factor&#34;&gt;Proposal 1: Dynamic replication factor&lt;/h3&gt;
&lt;p&gt;Instead of using a fixed replication factor of 3, the RF can be derived from the number of active members in the index gateway ring. That means the RF would be a percentage of the available gateway instances. For example, a ring with 12 instances and 30% replication utilization would result in an RF of 3 (&lt;code&gt;floor(12*0.3)&lt;/code&gt;). Scaling up to 18 instances would result in an RF of 5.&lt;/p&gt;
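&lt;p&gt;The arithmetic above can be sketched as follows. The &lt;code&gt;minRF&lt;/code&gt; lower bound addresses the availability concern for small rings and is an assumption of this sketch, not part of the proposal.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// dynamicRF derives the replication factor from the ring size: a fixed
// percentage of the active instances, floored, with a minimum to preserve
// redundancy in small rings (the minimum is this sketch's assumption).
func dynamicRF(instances int, utilization float64, minRF int) int {
	rf := int(math.Floor(float64(instances) * utilization))
	if rf < minRF {
		return minRF
	}
	return rf
}

func main() {
	fmt.Println(dynamicRF(12, 0.3, 3)) // 3
	fmt.Println(dynamicRF(18, 0.3, 3)) // 5
}
```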
&lt;p&gt;This approach would solve the problem of horizontal scaling. However, it does not solve the problem of different tenant sizes. It also fails to ensure replication for availability when running a small number of instances, unless there is a fixed lower value for the RF. It also tends to over-shard data in large index gateway deployments.&lt;/p&gt;
&lt;h3 id=&#34;proposal-2-fixed-per-tenant-replication-factor&#34;&gt;Proposal 2: Fixed per-tenant replication factor&lt;/h3&gt;
&lt;p&gt;Adding a random shard ID (e.g. &lt;code&gt;shard-0&lt;/code&gt;, &lt;code&gt;shard-1&lt;/code&gt;, &amp;hellip; &lt;code&gt;shard-n&lt;/code&gt;) to the tenant ID allows a tenant to utilize up to &lt;code&gt;n&lt;/code&gt; instances. The number of shards can be implemented as a per-tenant override setting, which allows a different number of instances to be used for each tenant. However, this approach results in non-deterministic hash keys.&lt;/p&gt;
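&lt;p&gt;A sketch of the per-tenant shard suffix. The &lt;code&gt;shards&lt;/code&gt; per-tenant override and the key format are hypothetical. Because the shard ID is chosen randomly per request, the resulting hash key is non-deterministic, as noted above.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shardKey appends a random shard ID to the tenant ID so that a tenant's
// requests spread across up to `shards` ring entries. `shards` would come
// from a per-tenant override (a hypothetical setting in this sketch).
func shardKey(tenantID string, shards int) string {
	return fmt.Sprintf("%s/shard-%d", tenantID, rand.Intn(shards))
}

// hashKey is the ring hash of a key; FNV-1a stands in for the real hash.
func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

func main() {
	// Two requests for the same tenant may hash to different ring tokens:
	// this is the non-deterministic hash key mentioned above.
	fmt.Println(hashKey(shardKey("tenant-a", 4)))
	fmt.Println(hashKey(shardKey("tenant-a", 4)))
}
```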
&lt;h3 id=&#34;proposal-3-shard-by-index-files&#34;&gt;Proposal 3: Shard by index files&lt;/h3&gt;
&lt;p&gt;In order to answer requests, the index gateway needs to download index files from the object storage, and since Loki builds a daily index file per tenant, these index files can be sharded evenly across all available index gateway instances. Each instance is then assigned a unique set of index files which it can answer metadata queries for.&lt;/p&gt;
&lt;p&gt;This means that the sharding key is the name of the file in object storage. While this name encodes both the tenant and the date, this is not strictly necessary. Such a sharding mechanism could shard any files from object storage across a set of instances of a ring.&lt;/p&gt;
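&lt;p&gt;A sketch of sharding by object-storage key. The file names are hypothetical; the point is that a single tenant&amp;rsquo;s daily index files can land on different instances, so one tenant can utilize the whole ring.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerOf maps an index file (identified by its object-storage key) to the
// ring instance responsible for it. The modular placement stands in for the
// real consistent-hashing ring.
func ownerOf(indexFile string, instances []string) string {
	h := fnv.New32a()
	h.Write([]byte(indexFile))
	return instances[h.Sum32()%uint32(len(instances))]
}

func main() {
	instances := []string{"gw-0", "gw-1", "gw-2", "gw-3"}
	// Consecutive daily index files of the same tenant (made-up names)
	// are spread across the available instances.
	for _, f := range []string{
		"index_19843/tenant-a",
		"index_19844/tenant-a",
		"index_19845/tenant-a",
	} {
		fmt.Println(f, "->", ownerOf(f, instances))
	}
}
```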
&lt;p&gt;If the time range for the requested metadata is within a single day, then a single index gateway instance can answer the metadata request.
However, if a metadata request spans multiple days, multiple index gateway instances are involved as well. There are two ways to solve this:&lt;/p&gt;
&lt;h4 id=&#34;a-split-and-merge-on-client-side&#34;&gt;A) Split and merge on client side&lt;/h4&gt;
&lt;p&gt;The client resolves the necessary index files and their respective gateway instances. It splits the request into multiple sub-requests, executes them and merges them into a single result.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Only the minimum necessary amount of requests are performed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The client requires information about how to split and merge requests.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&#34;b-split-and-merge-on-index-gateway-handler-side&#34;&gt;B) Split and merge on index gateway handler side&lt;/h4&gt;
&lt;p&gt;The client can execute a request on any index gateway. This handler instance then identifies the index files that are involved, splits the query, and resolves the appropriate instances. Once it has received the sub-query results, it assembles the full response and sends it back to the client.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sharding is handled transparently to the client.&lt;/li&gt;
&lt;li&gt;Clients can communicate with any instance of the index gateway ring.&lt;/li&gt;
&lt;li&gt;Domain information about splitting and merging is kept within index gateway server implementation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Due to its architectural advantages, option B is proposed.&lt;/p&gt;
&lt;h2 id=&#34;other-notes&#34;&gt;Other Notes&lt;/h2&gt;
&lt;h3 id=&#34;architectural-diagram-of-proposal-3&#34;&gt;Architectural diagram of proposal 3&lt;/h3&gt;
&lt;p&gt;&lt;img
  class=&#34;lazyload d-inline-block&#34;
  data-src=&#34;../index-gw-sharding-diagram.svg&#34;
  alt=&#34;index gateway sharding&#34;/&gt;&lt;/p&gt;
]]></content><description>&lt;h1 id="0004-index-gateway-sharding">0004: Index Gateway Sharding&lt;/h1>
&lt;p>&lt;strong>Author:&lt;/strong> Christian Haudum (&lt;a href="mailto:christian.haudum@grafana.com">christian.haudum@grafana.com&lt;/a>)&lt;/p>
&lt;p>&lt;strong>Date:&lt;/strong> 02/2023&lt;/p>
&lt;p>&lt;strong>Sponsor(s):&lt;/strong> @chaudum @owen-d&lt;/p>
&lt;p>&lt;strong>Type:&lt;/strong> Feature&lt;/p>
&lt;p>&lt;strong>Status:&lt;/strong> Rejected / Not Implemented&lt;/p>
&lt;p>&lt;strong>Related issues/PRs:&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Thread from &lt;a href="https://groups.google.com/forum/#!forum/lokiproject" target="_blank" rel="noopener noreferrer">mailing list&lt;/a>:&lt;/strong>&lt;/p></description></item><item><title>0005: Loki mixin configuration improvements</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/0005-loki-mixin-configuration-improvements/</link><pubDate>Thu, 09 Apr 2026 02:28:18 +0000</pubDate><guid>https://grafana.com/docs/loki/v3.7.x/community/lids/0005-loki-mixin-configuration-improvements/</guid><content><![CDATA[&lt;h1 id=&#34;0005-loki-mixin-configuration-improvements&#34;&gt;0005: Loki mixin configuration improvements&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Alexandre Chouinard (&lt;a href=&#34;mailto:Daazku@gmail.com&#34;&gt;Daazku@gmail.com&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 03/2025&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sponsor(s):&lt;/strong&gt; N/A&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; Feature&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Draft&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related issues/PRs:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/grafana/loki/issues/15881&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Issue #15881&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/grafana/loki/issues/13631&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Issue #13631&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/grafana/loki/issues/11820&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Issue #11820&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/grafana/loki/issues/11806&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Issue #11806&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/grafana/loki/issues/7730&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Issue #7730&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;and more &amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Thread from &lt;a href=&#34;https://groups.google.com/forum/#!forum/lokiproject&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mailing list&lt;/a&gt;:&lt;/strong&gt; N/A&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;There is no easy way to set up dashboards and alerts for Loki on a pre-existing Prometheus stack that does not use the Prometheus Operator with a specific configuration.&lt;/p&gt;
&lt;p&gt;The metrics selectors are hardcoded, making the dashboards unusable without manual modifications in many cases.
It is assumed that &lt;code&gt;job&lt;/code&gt;, &lt;code&gt;cluster&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;container&lt;/code&gt; and/or a combination of other labels are present on metrics and have very specific values.&lt;/p&gt;
&lt;h2 id=&#34;problem-statement&#34;&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;This renders the dashboards and alerts unusable for setups that do not conform to the current assumptions about which label(s) should be present in the metrics.&lt;/p&gt;
&lt;p&gt;A good example of that would be the &amp;ldquo;job&amp;rdquo; label used everywhere:
&lt;a href=&#34;https://github.com/grafana/loki/blob/475d25f459575312adb25ff90abf8f10d521ad4b/production/loki-mixin/dashboards/dashboard-bloom-build.json#L267C101-L267C134&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;&lt;code&gt;job=~\&amp;quot;$namespace/bloom-planner\&amp;quot;&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Usually, the &lt;code&gt;job&lt;/code&gt; label refers to the scrape job name used to scrape the targets, as per the &lt;a href=&#34;https://prometheus.io/docs/concepts/jobs_instances/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Prometheus documentation&lt;/a&gt;, and
in k8s, if you are not using &lt;code&gt;prometheus-operator&lt;/code&gt; with &lt;code&gt;ServiceMonitor&lt;/code&gt;, it&amp;rsquo;s pretty common to have something like this as a scraping config:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;YAML&lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-yaml&#34;&gt;        - job_name: &amp;#34;kubernetes-pods&amp;#34; # Can actually be anything you want.
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Cluster label is &amp;#34;required&amp;#34; by kubernetes-mixin dashboards
            - target_label: cluster
              replacement: &amp;#39;${cluster_label}&amp;#39;
            ...&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;which would scrape all pods and yield something like:&lt;/p&gt;

&lt;div class=&#34;code-snippet code-snippet__mini&#34;&gt;&lt;div class=&#34;lang-toolbar__mini&#34;&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet code-snippet__border&#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-none&#34;&gt;up{job=&amp;#34;kubernetes-pods&amp;#34;, ...}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Right off the bat, that makes the dashboards unusable because it&amp;rsquo;s incompatible with what is &lt;strong&gt;hardcoded&lt;/strong&gt; in the dashboards and alerts.&lt;/p&gt;
&lt;h2 id=&#34;goals&#34;&gt;Goals&lt;/h2&gt;
&lt;p&gt;Ideally, selectors should default to the values required internally by Grafana but remain configurable so users can tailor them to their setup.&lt;/p&gt;
&lt;p&gt;A good example of this is how &lt;a href=&#34;https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/1fa3b6731c93eac6d5b8c3c3b087afab2baabb42/config.libsonnet#L20-L33&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;kubernetes-monitoring/kubernetes-mixin&lt;/a&gt; did it:
Every possible selector is configurable, which allows various setups to work properly.&lt;/p&gt;
&lt;p&gt;The structure is already there to support this. It just has not been leveraged properly.&lt;/p&gt;
&lt;h2 id=&#34;non-goals-optional&#34;&gt;Non-Goals (optional)&lt;/h2&gt;
&lt;p&gt;It would be desirable to create automated checks verifying that all metrics used in dashboards and alerts use the proper selector(s) from the configuration.
There are many issues in the repository about new dashboards or dashboard updates not using the proper labels on metrics.&lt;/p&gt;
&lt;h2 id=&#34;proposals&#34;&gt;Proposals&lt;/h2&gt;
&lt;h3 id=&#34;proposal-0-do-nothing&#34;&gt;Proposal 0: Do nothing&lt;/h3&gt;
&lt;p&gt;This forces the community to either manually edit the dashboards/alerts or conform to a specific metric collection approach for Loki.&lt;/p&gt;
&lt;h3 id=&#34;proposal-1-allow-metrics-label-selectors-to-be-configurable&#34;&gt;Proposal 1: Allow metrics label selectors to be configurable&lt;/h3&gt;
&lt;p&gt;This will require a good amount of refactoring.&lt;/p&gt;
&lt;p&gt;It allows easier adoption of the &amp;ldquo;official&amp;rdquo; dashboards and alerts by the community.&lt;/p&gt;
&lt;p&gt;Define once, reuse everywhere. (Currently, updating requires extensive search and replace.)&lt;/p&gt;
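&lt;p&gt;A sketch of what such configurable selectors could look like, modeled on the kubernetes-mixin approach linked above. The field names are illustrative, not the actual loki-mixin configuration:&lt;/p&gt;

```jsonnet
{
  _config+:: {
    // Hypothetical fields; the defaults mirror what the dashboards
    // currently hardcode, so existing setups keep working.
    per_cluster_label: 'cluster',
    per_namespace_label: 'namespace',
    bloom_planner_selector: 'job=~"($namespace)/bloom-planner"',
  },
}
```

Dashboards and alerts would then reference these fields instead of hardcoded label matchers, so a single override adapts every panel and rule to a non-standard scrape setup.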
&lt;h2 id=&#34;other-notes&#34;&gt;Other Notes&lt;/h2&gt;
&lt;p&gt;If this proposal is accepted, I am willing to do the necessary work to move it forward.&lt;/p&gt;
]]></content><description>&lt;h1 id="0005-loki-mixin-configuration-improvements">0005: Loki mixin configuration improvements&lt;/h1>
&lt;p>&lt;strong>Author:&lt;/strong> Alexandre Chouinard (&lt;a href="mailto:Daazku@gmail.com">Daazku@gmail.com&lt;/a>)&lt;/p>
&lt;p>&lt;strong>Date:&lt;/strong> 03/2025&lt;/p>
&lt;p>&lt;strong>Sponsor(s):&lt;/strong> N/A&lt;/p>
&lt;p>&lt;strong>Type:&lt;/strong> Feature&lt;/p>
&lt;p>&lt;strong>Status:&lt;/strong> Draft&lt;/p>
&lt;p>&lt;strong>Related issues/PRs:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/grafana/loki/issues/15881" target="_blank" rel="noopener noreferrer">Issue #15881&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/grafana/loki/issues/13631" target="_blank" rel="noopener noreferrer">Issue #13631&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/grafana/loki/issues/11820" target="_blank" rel="noopener noreferrer">Issue #11820&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/grafana/loki/issues/11806" target="_blank" rel="noopener noreferrer">Issue #11806&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/grafana/loki/issues/7730" target="_blank" rel="noopener noreferrer">Issue #7730&lt;/a>&lt;/li>
&lt;li>and more &amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Thread from &lt;a href="https://groups.google.com/forum/#!forum/lokiproject" target="_blank" rel="noopener noreferrer">mailing list&lt;/a>:&lt;/strong> N/A&lt;/p></description></item><item><title>0006: Expose Split Logic in API</title><link>https://grafana.com/docs/loki/v3.7.x/community/lids/0006-api-expose-split/</link><pubDate>Thu, 09 Apr 2026 02:28:18 +0000</pubDate><guid>https://grafana.com/docs/loki/v3.7.x/community/lids/0006-api-expose-split/</guid><content><![CDATA[&lt;h1 id=&#34;0006-expose-split-logic-in-api&#34;&gt;0006: Expose Split Logic in API&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Karsten Jeschkies (&lt;a href=&#34;mailto:karsten.jeschkies@grafana.com&#34;&gt;karsten.jeschkies@grafana.com&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 03/2025&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sponsor(s):&lt;/strong&gt; @trevorwhitney&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; API&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Review&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related issues/PRs:&lt;/strong&gt; N/A&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thread from &lt;a href=&#34;https://groups.google.com/forum/#!forum/lokiproject&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;mailing list&lt;/a&gt;:&lt;/strong&gt; N/A&lt;/p&gt;
&lt;hr /&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;Loki has internal logic to split and shard log and metric queries by time into multiple queries. However, this logic is not
accessible outside of the code base. This proposal intends to create an API for clients to split queries by exposing the
internal split logic.&lt;/p&gt;
&lt;p&gt;A split query is divided by time. The results of a split query can be concatenated in order to form the final
result.&lt;/p&gt;
&lt;p&gt;A sharded query is divided by label values. The results of a sharded query cannot always be concatenated but require
extra logic to form the final result. Some queries, such as &lt;code&gt;topk&lt;/code&gt;, cannot be sharded at all.&lt;/p&gt;
&lt;h2 id=&#34;problem-statement&#34;&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;Loki clients such as the Grafana Loki datasource or the &lt;a href=&#34;https://github.com/trinodb/trino/tree/master/plugin/trino-loki&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Trino Loki
connector&lt;/a&gt; benefit from splitting LogQL queries into multiple sub-queries either to process
smaller chunks or to distribute work on query results.&lt;/p&gt;
&lt;p&gt;Splitting a query requires parsing the LogQL query first, but parsers exist only for Go and
JavaScript.&lt;/p&gt;
&lt;h2 id=&#34;goals&#34;&gt;Goals&lt;/h2&gt;
&lt;p&gt;The intended goal is to enable any client to split a query into multiple sub-queries that can be executed either
sequentially or in parallel. The joined result of the sub-queries must be the same as the result of executing the original query.&lt;/p&gt;
&lt;h2 id=&#34;non-goals&#34;&gt;Non-Goals&lt;/h2&gt;
&lt;p&gt;This proposal does not aim to provide pagination for query results.&lt;/p&gt;
&lt;h2 id=&#34;proposals&#34;&gt;Proposals&lt;/h2&gt;
&lt;h3 id=&#34;proposal-0-do-nothing&#34;&gt;Proposal 0: Do nothing&lt;/h3&gt;
&lt;p&gt;Without an API each client will have to use a LogQL parser.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The split logic in Loki can be changed at will without breaking client behavior.&lt;/li&gt;
&lt;li&gt;There is no maintenance overhead for an API.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Currently, the LogQL grammar is specific to Go. It is not easy to port the grammar and parser to other languages.&lt;/li&gt;
&lt;li&gt;Any changes to the splitting logic must be implemented for each client/platform.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;proposal-1-expose-splitting-in-an-api&#34;&gt;Proposal 1: Expose Splitting in an API&lt;/h3&gt;
&lt;p&gt;A new endpoint &lt;code&gt;GET /loki/api/v1/split_query&lt;/code&gt; is introduced that takes a &lt;code&gt;splits&lt;/code&gt; parameter and the same parameters as the &lt;a href=&#34;/docs/loki/latest/reference/loki-http-api/#query-logs-within-a-range-of-time&#34;&gt;/loki/api/v1/query_range&lt;/a&gt; endpoint. The new endpoint returns sub-queries split by time.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;splits&lt;/code&gt; parameter optionally defines the number of desired splits. The API is allowed to return fewer splits than requested.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; parameter has extended semantics. Setting it to &lt;code&gt;0&lt;/code&gt; for a log stream query indicates that all logs should be queried.&lt;/p&gt;
&lt;p&gt;The response body is JSON encoded:&lt;/p&gt;

&lt;div class=&#34;code-snippet &#34;&gt;&lt;div class=&#34;lang-toolbar&#34;&gt;
    &lt;span class=&#34;lang-toolbar__item lang-toolbar__item-active&#34;&gt;JSON&lt;/span&gt;
    &lt;div class=&#34;lang-toolbar__border&#34;&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;div class=&#34;code-snippet &#34;&gt;
    &lt;pre data-expanded=&#34;false&#34;&gt;&lt;code class=&#34;language-json&#34;&gt;{
  &amp;#34;resultType&amp;#34;: &amp;#34;matrix&amp;#34; | &amp;#34;streams&amp;#34; | &amp;#34;vector&amp;#34;,
  &amp;#34;subqueries&amp;#34;: [
    {
      &amp;#34;start&amp;#34;: &amp;lt;timestamp nanoseconds&amp;gt;,
      &amp;#34;end&amp;#34;: &amp;lt;timestamp nanoseconds&amp;gt;,
      &amp;#34;limit&amp;#34;: &amp;lt;number&amp;gt;,
      &amp;#34;query&amp;#34;: &amp;lt;query string&amp;gt;
    },
    {
      &amp;#34;start&amp;#34;: &amp;lt;timestamp nanoseconds&amp;gt;,
      &amp;#34;end&amp;#34;: &amp;lt;timestamp nanoseconds&amp;gt;,
      &amp;#34;limit&amp;#34;: &amp;lt;number&amp;gt;,
      &amp;#34;query&amp;#34;: &amp;lt;query string&amp;gt;
    }
  ]
}&lt;/code&gt;&lt;/pre&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clients can split queries independent of the implementation language and platform.&lt;/li&gt;
&lt;li&gt;Split logic is controlled by Loki and not the client. This means it can be improved, for example, by introducing sharding
labels.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new API endpoint increases the compatibility surface area and thus the maintenance overhead for Loki maintainers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;proposal-2-support-apache-arrow-flight-rpc&#34;&gt;Proposal 2: Support Apache Arrow Flight RPC&lt;/h3&gt;
&lt;p&gt;Loki could support Apache &lt;a href=&#34;https://arrow.apache.org/docs/format/Flight.html&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;Arrow Flight RPC&lt;/a&gt; which is designed to
exchange large data sets in shards between services.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Supporting an open standard comes with support for other non-Loki clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Loki would have to support Apache Arrow, which makes the implementation more complicated.&lt;/li&gt;
&lt;li&gt;Arrow Flight RPC assumes the data is being queried on the first request, which means all shards are available at the
same time. However, the intent of this document is that shards can be queried independently.&lt;/li&gt;
&lt;/ul&gt;
]]></content><description>&lt;h1 id="0006-expose-split-logic-in-api">0006: Expose Split Logic in API&lt;/h1>
&lt;p>&lt;strong>Author:&lt;/strong> Karsten Jeschkies (&lt;a href="mailto:karsten.jeschkies@grafana.com">karsten.jeschkies@grafana.com&lt;/a>)&lt;/p>
&lt;p>&lt;strong>Date:&lt;/strong> 03/2025&lt;/p>
&lt;p>&lt;strong>Sponsor(s):&lt;/strong> @trevorwhitney&lt;/p>
&lt;p>&lt;strong>Type:&lt;/strong> API&lt;/p>
&lt;p>&lt;strong>Status:&lt;/strong> Review&lt;/p>
&lt;p>&lt;strong>Related issues/PRs:&lt;/strong> N/A&lt;/p>
&lt;p>&lt;strong>Thread from &lt;a href="https://groups.google.com/forum/#!forum/lokiproject" target="_blank" rel="noopener noreferrer">mailing list&lt;/a>:&lt;/strong> N/A&lt;/p></description></item></channel></rss>