Sift configuration

Background

By default, a Sift investigation attempts to run each of its checks once, using the default value for each of that check’s fields. The results of each check are shown on the investigation page, with the check’s name on the left of the page.

Note
Some checks require specific sets of labels to run. If those labels aren’t present on a given investigation, the check will be skipped. These labels are documented per-check in the check configuration section below.

Configuring Sift

Note
Sift’s configuration can only be edited by users with the Editor or Admin role.

The Sift configuration page lists the checks that will currently attempt to run along with their current configuration. Checks can be disabled by clicking the Disable button.

Some checks allow you to customize their parameters. For example, you can alter the title, override the datasource, or tune a check to be more sensitive or less noisy.

To do so, click the Edit button next to the check instance. The modal that appears shows the current values for each setting, with the default value shown in the placeholder if no custom value has been set. See the tooltips or the documentation below for details on each setting.

All check instances have a Title field which determines how the check instance is referred to in investigations. This can be customized to provide more detail in specific cases.

Many checks also contain a Datasource field. By default, Sift automatically detects the best instance of a datasource for the check by searching the available datasources in Grafana for the instance with the most series, streams, or labels matching the current investigation’s labels. Setting this field tells Sift to skip this automatic detection and always use the datasource you specify.
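The selection rule described above can be sketched as follows. This is a minimal illustration, not Sift's implementation: the function name pickDatasource, the datasource names, and the idea of a precomputed match count per datasource are all assumptions made for the example.

```go
package main

import "fmt"

// pickDatasource is a hypothetical helper. It returns the configured override
// when one is set. Otherwise it returns the candidate with the highest match
// count, or ok=false when no candidate matches anything.
func pickDatasource(override string, matchCounts map[string]int) (name string, ok bool) {
	if override != "" {
		return override, true // the Datasource field is set: skip detection
	}
	best, bestCount := "", 0
	for candidate, n := range matchCounts {
		// Ties are broken by name so the result does not depend on map order.
		if n > bestCount || (n == bestCount && n > 0 && candidate < best) {
			best, bestCount = candidate, n
		}
	}
	return best, bestCount > 0
}

func main() {
	// Assumed counts of series, streams, or labels matching the investigation.
	counts := map[string]int{"loki-dev": 3, "loki-prod": 42}
	auto, _ := pickDatasource("", counts)
	pinned, _ := pickDatasource("loki-dev", counts)
	fmt.Println(auto, pinned)
}
```

With no override the sketch picks loki-prod; with the field set to loki-dev it returns loki-dev without looking at the counts.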

Running checks multiple times

In some cases you may want Sift to run a check more than once with different configurations for each instance. An example of this could be searching for patterns in error logs with different initial queries, or having multiple ‘slow requests’ checks with different thresholds to identify extremely slow requests separately.

Clicking the + Add button will create a new instance of a check with the default configuration. We recommend changing the title of the check instances to make them easier to distinguish when viewing investigation results.

Limiting when Sift checks run

Sift allows you to limit when an instance of a check runs based on the labels of an investigation. The conditions for a check run are expressed using PromQL selectors, e.g. app="shopping-cart" or environment=~"prod.+". You can combine conditions using ‘AND’ and ‘OR’ to ensure checks only run exactly when you need them to.

Each check’s config modal contains a ‘Conditions’ section which can be used to express this. To use it, first click the + Add condition button in the config modal. This adds a condition with some inputs for label names and values; within this condition, every label must match the selector for the check to run (the labels are combined using ‘AND’ logic). To express an ‘OR’, click the + Add condition button again and add your second condition to the new input field.

For example, you may have a specific log query which you only want to run for investigations whose labels match the PromQL selector {namespace="gateway", cluster=~"prod.+"}. To express this, click the + Add condition button once, then type ‘namespace’ into the Label name box and ‘gateway’ into the Label value box. Next click + Add label and type ‘cluster’ into the Label name box, change the selector type to =~, and type ‘prod.+’ into the Label value box.
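The AND/OR semantics above can be sketched in a few lines of Go. This is an illustrative model only: the Matcher type and shouldRun function are hypothetical names, and the rule that a check with no conditions always runs is an assumption. Regex matchers are fully anchored, as they are in PromQL selectors.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a hypothetical model of one label row inside a condition.
type Matcher struct {
	Name  string
	Op    string // one of "=", "!=", "=~", "!~"
	Value string
}

// matches reports whether a single matcher accepts the investigation labels.
func (m Matcher) matches(labels map[string]string) bool {
	got := labels[m.Name] // a missing label behaves like an empty value
	switch m.Op {
	case "=":
		return got == m.Value
	case "!=":
		return got != m.Value
	case "=~", "!~":
		re, err := regexp.Compile("^(?:" + m.Value + ")$")
		if err != nil {
			return false // this sketch treats an invalid pattern as a non-match
		}
		return re.MatchString(got) == (m.Op == "=~")
	}
	return false
}

// shouldRun ORs the conditions together; matchers within a condition are ANDed.
func shouldRun(conditions [][]Matcher, labels map[string]string) bool {
	if len(conditions) == 0 {
		return true // assumption: no conditions means the check always runs
	}
	for _, cond := range conditions {
		all := true
		for _, m := range cond {
			if !m.matches(labels) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	// Equivalent of {namespace="gateway", cluster=~"prod.+"}.
	conditions := [][]Matcher{{
		{Name: "namespace", Op: "=", Value: "gateway"},
		{Name: "cluster", Op: "=~", Value: "prod.+"},
	}}
	fmt.Println(shouldRun(conditions, map[string]string{"namespace": "gateway", "cluster": "prod-eu-1"}))
	fmt.Println(shouldRun(conditions, map[string]string{"namespace": "gateway", "cluster": "dev-1"}))
}
```

The first call prints true and the second prints false. Adding a second inner slice to conditions corresponds to clicking + Add condition again to express an ‘OR’.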

Check configuration

Error Pattern Logs

Required labels: none.

Maximum examples

Default: 3
Minimum: 1
Maximum: 10

The maximum number of example logs to show for each pattern found.

Minimum count

Default: 5
Minimum: 1
Maximum: 10

The minimum number of log occurrences before a pattern is considered interesting. Decreasing this number will increase the sensitivity of the check, with more patterns being considered interesting. Increasing will have the opposite effect, with fewer patterns appearing in the results.
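To make the interaction between Minimum count and Maximum examples concrete, here is a small sketch. The Pattern type and interestingPatterns function are hypothetical, and treating the minimum as inclusive (a count equal to the minimum passes) is an assumption.

```go
package main

import "fmt"

// Pattern is a hypothetical model of one detected log pattern.
type Pattern struct {
	Text     string
	Count    int
	Examples []string
}

// interestingPatterns keeps patterns seen at least minCount times and trims
// each pattern's examples to at most maxExamples entries.
func interestingPatterns(patterns []Pattern, minCount, maxExamples int) []Pattern {
	var out []Pattern
	for _, p := range patterns {
		if p.Count < minCount {
			continue // too rare to be considered interesting
		}
		if len(p.Examples) > maxExamples {
			p.Examples = p.Examples[:maxExamples]
		}
		out = append(out, p)
	}
	return out
}

func main() {
	patterns := []Pattern{
		{Text: "connection refused to <_>", Count: 12, Examples: []string{"a", "b", "c", "d"}},
		{Text: "unexpected EOF", Count: 2, Examples: []string{"e"}},
	}
	// Defaults from this page: Minimum count 5, Maximum examples 3.
	for _, p := range interestingPatterns(patterns, 5, 3) {
		fmt.Printf("%s (%d occurrences, %d examples)\n", p.Text, p.Count, len(p.Examples))
	}
}
```

With the defaults, only the first pattern is reported, with three of its four examples. Lowering the minimum count to 2 would also surface the second pattern.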

Initial Query

Default: !~ "debug|DEBUG|info|INFO" |~ "error|ERROR"

The query used to find error logs.

This could be customized to only search for HTTP error logs, for example.
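The default query chains two LogQL line filters: !~ drops lines matching the first regular expression, and |~ keeps lines matching the second. The sketch below emulates that pair with Go's regexp package (RE2 syntax, unanchored matching) so you can try a customized pattern against sample lines. The filterLines function, the sample log lines, and the status=5\d\d pattern are hypothetical, and the sketch does not call Loki.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterLines keeps lines that do not match exclude and do match include,
// mirroring a `!~ "<exclude>" |~ "<include>"` pair of line filters.
func filterLines(lines []string, exclude, include string) []string {
	ex := regexp.MustCompile(exclude)
	in := regexp.MustCompile(include)
	var kept []string
	for _, l := range lines {
		if !ex.MatchString(l) && in.MatchString(l) {
			kept = append(kept, l)
		}
	}
	return kept
}

func main() {
	lines := []string{
		`level=info msg="retrying after error"`,
		`level=warn msg="upstream ERROR: connection reset"`,
		`level=warn msg="GET /cart status=502 error"`,
	}
	// Default filters from this page.
	fmt.Println(len(filterLines(lines, "debug|DEBUG|info|INFO", "error|ERROR")))
	// Hypothetical variant that keeps only HTTP 5xx lines.
	fmt.Println(len(filterLines(lines, "debug|DEBUG|info|INFO", `status=5\d\d`)))
}
```

The default filters keep the second and third lines; the HTTP-only variant keeps just the third.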

HTTP Error Series

Required labels: cluster and namespace.

Cut off time

Default: 90 minutes
Minimum: 20 minutes
Maximum: 2 hours

The maximum time to look back for anomalies. Increase this value to look further in the past for erroring series, or decrease it to reduce false positives.

Threshold

Default: 60%
Minimum: 50%

The minimum percentage change of HTTP errors from the rolling average before a series is considered anomalous.
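As a rough illustration of how a percentage threshold like this behaves, the sketch below compares the current error rate with the mean of earlier samples. How Sift actually computes its rolling average is not described here, so the isAnomalous function, the simple mean, and the handling of a zero average are all assumptions.

```go
package main

import "fmt"

// isAnomalous reports whether current exceeds the average of history by at
// least thresholdPct percent. The plain mean stands in for a rolling average.
func isAnomalous(history []float64, current, thresholdPct float64) bool {
	if len(history) == 0 {
		return false // nothing to compare against
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	avg := sum / float64(len(history))
	if avg == 0 {
		return current > 0 // assumption: any errors after none count as a change
	}
	change := (current - avg) / avg * 100
	return change >= thresholdPct
}

func main() {
	history := []float64{10, 10, 10, 10} // errors per minute, hypothetical
	fmt.Println(isAnomalous(history, 17, 60)) // +70% against a 60% threshold
	fmt.Println(isAnomalous(history, 15, 60)) // +50% against a 60% threshold
}
```

Raising the threshold makes the check less sensitive: the 17 errors per minute sample would no longer be flagged at a threshold of 80%.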

Kube Crashes

Required labels: namespace.

This check has no configurable parameters except for the Prometheus and Loki datasources.

Log Query

Required labels: none.

Query

The custom LogQL query expression to run.

Message template

A Go template string used to format the output of the check.

The template string has access to the following variables:

expr: the input expression string
interesting: a boolean indicating whether this check found any interesting results
streams: an array of log streams. Each element has two fields:
- Labels, a map from label name to label value identifying the stream
- Entries, an array of log entries. Each element has two fields:
  - Timestamp, the timestamp of the log entry.
  - Line, the log line itself.
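A message template for this check might look like the one in the sketch below, which renders it locally with Go's text/template package. The Entry and Stream types mirror the fields listed above, but how Sift exposes the variables is an assumption: here they are keys of the root data (.expr, .interesting, .streams), and Timestamp is assumed to be a time.Time. The query, label names, and log line are made up.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
	"time"
)

// Entry and Stream are stand-ins shaped like the variables documented above.
type Entry struct {
	Timestamp time.Time
	Line      string
}

type Stream struct {
	Labels  map[string]string
	Entries []Entry
}

// messageTemplate is a hypothetical Message template value.
const messageTemplate = `{{if .interesting}}{{len .streams}} stream(s) matched {{.expr}}:
{{range .streams}}- namespace={{.Labels.namespace}}: {{len .Entries}} line(s)
{{range .Entries}}  {{.Line}}
{{end}}{{end}}{{else}}No logs matched {{.expr}}.{{end}}`

// render parses and executes a template against the given data.
func render(tmpl string, data map[string]interface{}) (string, error) {
	t, err := template.New("message").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	data := map[string]interface{}{
		"expr":        `{namespace="gateway"} |= "timeout"`,
		"interesting": true,
		"streams": []Stream{{
			Labels:  map[string]string{"namespace": "gateway"},
			Entries: []Entry{{Timestamp: time.Unix(1700000000, 0), Line: "upstream timeout after 30s"}},
		}},
	}
	out, err := render(messageTemplate, data)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Rendering a template locally like this is a quick way to catch syntax errors before pasting it into the Message template field.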

Max log lines

Default: 5
Minimum: 1

The maximum number of log lines to include for each stream in the output.

Metric Query

Required labels: none.

Query

The custom PromQL query expression to run.

Message template

A Go template string used to format the output of the check.

The template string has access to the following variables:

expr: the input expression string
interesting: a boolean indicating whether this check found any interesting results
streams: an array of time series. Each element has three fields:
- Labels, a Prometheus Metric implemented as a map from label name to label value used to identify the series.
- LastTimestamp, the latest timestamp found in the input query.
- LastValue, the latest value found in the input query.
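For the metric variant, a template typically formats LastValue and LastTimestamp for each series. The sketch below renders such a template with Go's text/template package. As assumptions for the example, the variables are exposed as keys of the root data, Labels is modeled as a plain map[string]string, and LastTimestamp is a time.Time; the query and the pod label are made up.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
	"time"
)

// Series is a stand-in shaped like the fields documented above.
type Series struct {
	Labels        map[string]string
	LastTimestamp time.Time
	LastValue     float64
}

// messageTemplate is a hypothetical Message template value.
const messageTemplate = `{{if .interesting}}{{range .streams}}{{.Labels.pod}}: {{printf "%.1f" .LastValue}} at {{.LastTimestamp.Format "15:04"}}
{{end}}{{else}}Nothing interesting for {{.expr}}.{{end}}`

// render parses and executes a template against the given data.
func render(tmpl string, data map[string]interface{}) (string, error) {
	t, err := template.New("message").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	data := map[string]interface{}{
		"expr":        `rate(http_requests_total{code=~"5.."}[5m])`,
		"interesting": true,
		"streams": []Series{{
			Labels:        map[string]string{"pod": "cart-7f9c"},
			LastTimestamp: time.Date(2024, 1, 2, 15, 4, 0, 0, time.UTC),
			LastValue:     0.934,
		}},
	}
	out, err := render(messageTemplate, data)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The printf call rounds the value to one decimal place, which keeps the rendered message short when a query returns many series.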

Noisy Neighbors

Required labels: cluster and namespace.

Load threshold

Default: 100%
Minimum: 30%
Maximum: 100%

The threshold above which nodes will be considered to have ‘high load’.

Usage quantile

Default: 0.8
Minimum: 0.5
Maximum: 0.99

The quantile used to determine if a pod is using too much of a specific resource.

Slow Requests

Threshold

Default: 3 seconds
Minimum: 1 second

The threshold above which traces are considered ‘slow’.
