Sift configuration

Background

By default, a Sift investigation attempts to run each of its checks once, using the default values for each of that check’s fields. The results of each check are shown on the investigation page, with the check’s name on the left of the page.

Note

Some checks require specific sets of labels to run. If those labels aren’t present on a given investigation, the check will be skipped. These labels are documented per-check in the check configuration section below.
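As a rough illustration of the skipping rule, the sketch below copies the required labels listed for each check in the check configuration section and works out which checks could run for a given set of investigation labels. The function name `runnable_checks` and the dictionary layout are hypothetical and are not part of Sift itself.

```python
# Required labels per check, copied from the check configuration section.
REQUIRED_LABELS = {
    "Error Pattern Logs": set(),
    "Kube Crashes": {"namespace"},
    "Noisy Neighbors": {"cluster", "namespace"},
    "Recent Deployments": {"namespace"},
    "Resource Contentions": {"cluster", "namespace"},
    "Slow Requests": set(),
    "HTTP Error Series": {"cluster", "namespace"},
}


def runnable_checks(investigation_labels):
    """Return the checks whose required labels are all present.

    Hypothetical helper: it only models the rule described above.
    """
    present = set(investigation_labels)
    return [
        name
        for name, required in REQUIRED_LABELS.items()
        if required <= present  # subset test: every required label is present
    ]


# An investigation that only carries a namespace label.
print(runnable_checks({"namespace": "checkout"}))
# ['Error Pattern Logs', 'Kube Crashes', 'Recent Deployments', 'Slow Requests']
```

In this model, adding a cluster label to the investigation would allow all seven checks to run.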

Configuring Sift

Note

Sift’s configuration can only be edited by users with the Editor or Admin role.

The Sift configuration page lists the checks that Sift currently attempts to run, along with their current configuration. To disable a check, click the Disable button.

Some checks allow you to customize their parameters, for example to alter the title, override the datasource, increase the sensitivity, or reduce the noise of a check.

To do so, click the Edit button next to the check instance. The modal that appears shows the current values for each setting, with the default value shown in the placeholder if no custom value has been set. See the tooltips or the documentation below for details on each setting.

All check instances have a Title field, which determines how the check instance is referred to in investigations. This can be customized to provide more detail in specific cases.

Many checks also contain a Datasource field. By default, Sift automatically detects the best instance of a datasource for the check by searching through the available datasources in Grafana for the instance with the most series, streams, or labels matching the current investigation’s labels. To skip this automatic detection and always use a specific datasource, set this field.
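The selection rule can be pictured with a minimal sketch. It assumes the number of matching series, streams, or labels has already been counted for each datasource; how Sift gathers those counts is not described here. The function `pick_datasource`, its arguments, and the datasource names are all hypothetical.

```python
def pick_datasource(match_counts, override=None):
    """Choose a datasource for a check (illustrative only).

    match_counts: dict mapping a datasource name to the number of series,
                  streams, or labels that match the investigation's labels.
    override:     the value of the Datasource field, or None if it is unset.
    """
    if override is not None:
        # A configured datasource bypasses automatic detection entirely.
        return override
    if not match_counts:
        return None
    # Otherwise pick the instance with the most matches.
    return max(match_counts, key=match_counts.get)


counts = {"loki-prod": 120, "loki-dev": 4}  # hypothetical datasource names
print(pick_datasource(counts))                       # loki-prod
print(pick_datasource(counts, override="loki-dev"))  # loki-dev
```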

Running checks multiple times

In some cases you may want Sift to run a check more than once, with a different configuration for each instance. For example, you could search for patterns in error logs with different initial queries, or have multiple ‘slow requests’ checks with different thresholds to identify extremely slow requests separately.

Clicking the + Add button will create a new instance of a check with the default configuration. We recommend changing the title of the check instances to make them easier to distinguish when viewing investigation results.

Check configuration

Error Pattern Logs

Required labels: none.

Maximum examples

  • Default: 3
  • Minimum: 1
  • Maximum: 10

The maximum number of example logs to show for each pattern found.

Minimum count

  • Default: 5
  • Minimum: 1
  • Maximum: 10

The minimum number of log occurrences before a pattern is considered interesting. Decreasing this number increases the sensitivity of the check, so more patterns are considered interesting. Increasing it has the opposite effect, so fewer patterns appear in the results.
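To make the effect of this setting concrete, here is a small sketch that counts how often each pattern occurs and keeps only those that reach the minimum count. It assumes a pattern counts as interesting when its occurrences are greater than or equal to the setting. The helper `interesting_patterns` and the pattern names are hypothetical, and Sift's own pattern extraction is not modelled.

```python
from collections import Counter


def interesting_patterns(pattern_per_log, minimum_count=5):
    """Keep patterns seen at least `minimum_count` times (assumed rule).

    pattern_per_log: one pattern identifier per matching log line.
    """
    counts = Counter(pattern_per_log)
    return {pattern: n for pattern, n in counts.items() if n >= minimum_count}


logs = ["timeout"] * 6 + ["oom"] * 2
print(interesting_patterns(logs))                   # {'timeout': 6}
print(interesting_patterns(logs, minimum_count=1))  # {'timeout': 6, 'oom': 2}
```

Lowering the value from 5 to 1 surfaces the rarer pattern as well, which is the increase in sensitivity described above.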

Initial Query

  • Default: !~ "debug|DEBUG|info|INFO" |~ "error|ERROR"

The query used to find error logs.

This could be customized to only search for HTTP error logs, for example.
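The default query chains two LogQL line filters: `!~` drops lines that match a regular expression and `|~` keeps lines that match one. The sketch below approximates that behaviour with Python's `re` module so you can see which lines the default would select. It is an emulation for illustration, not how Sift or Loki evaluates the query, and the function name and sample log lines are hypothetical.

```python
import re

# The two expressions from the default Initial Query.
EXCLUDE = re.compile(r"debug|DEBUG|info|INFO")  # corresponds to: !~ "..."
INCLUDE = re.compile(r"error|ERROR")            # corresponds to: |~ "..."


def matches_default_query(line):
    """Approximate the default line filters: not excluded, and included."""
    return EXCLUDE.search(line) is None and INCLUDE.search(line) is not None


samples = [
    'level=error msg="connection refused"',
    'level=info msg="no error found"',
    'level=error msg="missing information"',
    'level=warn msg="timeout"',
]
for line in samples:
    print(matches_default_query(line), line)
```

Only the first sample passes. The third is dropped because the expressions match anywhere in the line, so the substring "info" inside "information" triggers the exclusion. This is worth keeping in mind when writing a custom query.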

Kube Crashes

Required labels: namespace.

This check has no configurable parameters except for the Prometheus and Loki datasources.

Noisy Neighbors

Required labels: cluster and namespace.

Load threshold

  • Default: 100%
  • Minimum: 30%
  • Maximum: 100%

The threshold above which nodes will be considered to have ‘high load’.

Usage quantile

  • Default: 0.8
  • Minimum: 0.5
  • Maximum: 0.99

The quantile used to determine if a pod is using too much of a specific resource.

Recent Deployments

Required labels: namespace.

This check has no configurable parameters except for the Prometheus datasource.

Resource Contentions

Required labels: cluster and namespace.

This check has no configurable parameters except for the Prometheus datasource.

Slow Requests

Required labels: none.

Threshold

  • Default: 3 seconds
  • Minimum: 1 second

The threshold above which traces are considered ‘slow’.

HTTP Error Series

Required labels: cluster and namespace.

Cut off time

  • Default: 90 minutes
  • Minimum: 20 minutes
  • Maximum: 2 hours

The maximum time to look back for anomalies. Increase this value to look further in the past for erroring series, or decrease it to reduce false positives.

Threshold

  • Default: 60%
  • Minimum: 50%

The minimum percentage change of HTTP errors from the rolling average before a series is considered anomalous.
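As a worked illustration of this threshold, the sketch below flags points whose error count rises by at least the threshold percentage over the average of the preceding points. The window length, the use of a simple trailing mean, and the focus on increases only are assumptions made for the example. The function `anomalous_points` is hypothetical and does not describe Sift's actual detection method.

```python
def anomalous_points(error_counts, window=5, threshold_pct=60.0):
    """Return indexes where errors rise >= threshold_pct above the rolling mean.

    error_counts:  HTTP error counts per interval, oldest first.
    window:        number of preceding points in the rolling average (assumed).
    threshold_pct: minimum percentage increase to flag, as in the Threshold field.
    """
    flagged = []
    for i in range(window, len(error_counts)):
        baseline = sum(error_counts[i - window:i]) / window
        if baseline == 0:
            continue  # percentage change is undefined against a zero baseline
        change_pct = (error_counts[i] - baseline) / baseline * 100
        if change_pct >= threshold_pct:
            flagged.append(i)
    return flagged


series = [10, 10, 10, 10, 10, 17, 10]
print(anomalous_points(series))                    # [5]  (70% above the baseline of 10)
print(anomalous_points(series, threshold_pct=80))  # []
```

Raising the threshold from 60% to 80% removes the single flagged point, which shows how a higher value trades sensitivity for less noise.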