0002: Remote Rule Evaluation

Author: Danny Kopping (danny.kopping@grafana.com)

Date: 01/2023

Sponsor(s): @dannykopping

Type: Feature

Status: Accepted

Related issues/PRs: https://github.com/grafana/mimir/pull/1536

Thread from mailing list: N/A

Background

The ruler is a component that evaluates alerting and recording rules. Loki reuses Prometheus’ rule evaluation engine. The ruler currently operates by initialising a querier internally and evaluating all rules “locally” (i.e. it does not rely on any other components). Each rule group executes concurrently, and rules within the rule group are evaluated sequentially (this is an implementation detail from Prometheus).

Recording rules produce metric series which are sent to a Prometheus-compatible source. Alerting rules send notifications to Alertmanager when a condition is met. Both of these rule types can play a vital role in an organisation’s observability strategy, and so their reliable evaluation is essential.

Problem Statement

Rule evaluations can contain expensive queries. The ruler initialises a querier, but the querier does not have the capability to accelerate queries; the query-frontend component is responsible for query acceleration through splitting, sharding, caching, and other techniques.

An expensive rule query can cause an entire ruler instance to use excessive resources and even crash. This is highly problematic for the following reasons:

slow rule evaluations can lead to subsequent rules in a group to be delayed or missed, leading to missing alerts or gaps in recording rule metrics
excessive resource usage can impede the evaluation of rules for other tenants (noisy neighbour)

Goals

faster, more efficient rule evaluation
greater isolation between tenants
more reliable service

Non-Goals

This proposal does not aim to make this option the default mode of evaluation; it should be optional because it increases operational complexity.

Proposals

Proposal 0: Do nothing

Loki’s current ruler implementation is sufficient for small installations running relatively simple or inexpensive queries.

Pros:

Nothing to be done

Cons:

Loki’s ruler will remain unreliable and inefficient when used in large multi-tenant environments with expensive queries.

Proposal 1: Remote Execution

Taking inspiration from Grafana Mimir’s implementation, the ruler would be configured to send its rule query to the query-frontend component over gRPC. The querier instances receiving queries from the query-frontend (or optionally via the query-scheduler) will handle the request and send the responses to the query-frontend and be combined. The ruler will receive and process these responses as if the query had been executed locally.

Pros:

Takes full advantage of Loki’s query acceleration techniques, leading to faster and more efficient rule evaluation
Operationally simple as existing query-frontend/query-scheduler/querier setup can be used
Per-tenant isolation available in Loki’s query path (shuffle-sharding, per-tenant queues) can be used to reduce or eliminate the noisy neighbour problem

Cons:

Increased interdependence in components, increased cross-component networking
Reusing the same query-frontend/query-scheduler/querier setup can cause expensive queries to starve rule evaluations of query resources, and vice versa
- Additional complexity introduced if this setup needs to be duplicated for rule evaluations (recommended: see Other Notes section below)

Other Notes

If this feature were to be used in conjunction with rule-based sharding, this can present some further optimisation but also some additional challenges to consider.

Aside: the ruler shards by rule group by default, which means that rules can be unevenly balanced across ruler instances if some rule groups have more expensive queries than others. Another consequence of this is that rule groups execute sequentially, so expensive queries can cause subsequent rules in the group to be delayed or even missed. Rule groups are evaluated concurrently.

Rule-based sharding distributes rules evenly across all available ruler instances, each in their own rule group. Consequentially, each rule that belongs to a ruler instance will be evaluated concurrently (as they’re each in their own rule group). For tenants with hundreds or thousands of rules, this can result in large batches of queries being sent to the query-frontend in quick succession, should they all use the same interval or happen to overlap.

Assuming the remote rule evaluation takes place on the same read path that is used to execute tenant queries, care must be taken by operators who run large multi-tenant setups to ensure that large volumes of queries can be received, queued, and processed in an acceptable timeframe. The query-scheduler component is highly recommended in these situations, as it will enable the query-frontend and query components to scale out to accommodate the load. Shuffle-sharding should also be implemented to ensure that tenants with particularly large workloads do not starve out the query resources of other tenants. Alerting should also be put in place to notify operators if rule evaluations are being routinely missed or a tenants’ query queues become full.

If rule evaluations and tenant queries are slowing each other down, the read path setup would need to be duplicated so that tenant queries and rule evaluations would not share the same query execution resources.

Rule-based sharding and remote evaluation can (and should) be implemented separately. Operators should first implement remote evaluation to improve ruler reliability, and then further investigate rule-based sharding if rule evaluations are still being missed due to the sequential execution of rule groups, or advise their tenants to split these rule groups up.