Sift investigations

Sift is a powerful diagnostic assistant in Grafana Cloud designed to perform investigations on your infrastructure telemetry, helping you identify critical details during incidents. By employing a series of individual checks, Sift examines specific aspects of your infrastructure during investigations, providing valuable insights to guide your incident response efforts.

Before you begin

  • If needed, have an administrator initialize Grafana Machine Learning.

Sift checks

Sift offers a range of checks to analyze your system’s telemetry during investigations. These checks currently include:

  • Error Pattern Logs: Analyzes error logs and identifies groups of similar log lines, highlighting groups with significantly increased log rates based on shared patterns.

  • Kube Crashes: Detects recent container crashes by analyzing Kubernetes metrics and provides information on the cause of the crash (e.g., Error, OOMKill, etc.).

  • Noisy Neighbors: Identifies over-saturated hosts where load exceeds CPU core count, leading to high latency, and examines pods on those hosts for deeper insights into the underlying issues.

  • Recent Deployments: Identifies resources that recently underwent changes in Kubernetes, such as service updates or configuration modifications.

  • Resource Contention: Focuses on containers with significant CPU throttling due to reaching CPU limits, or significant packet loss due to networking issues. Unlike noisy neighbors, CPU throttling is caused by the container itself and not by other processes on the underlying infrastructure.

  • Slow Requests: Analyzes traces in Tempo (Grafana’s distributed tracing system) to identify requests taking longer than a specified threshold (default: 3 seconds).

  • HTTP Error Series: Checks for series exhibiting elevated HTTP errors within a specified cluster and namespace.
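As an illustration of the Noisy Neighbors condition described above (load exceeding CPU core count), the following is a minimal sketch and not Sift's actual implementation. The `HostSample` type, its field names, and the sample values are hypothetical, and the choice of load-average window is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """Hypothetical snapshot of one host; not a Sift data structure."""
    name: str
    load_average: float  # assumed to be a recent load average for the host
    cpu_cores: int


def oversaturated_hosts(samples):
    """Return the names of hosts whose load exceeds their CPU core count."""
    return [s.name for s in samples if s.load_average > s.cpu_cores]


hosts = [
    HostSample("node-a", load_average=3.2, cpu_cores=4),
    HostSample("node-b", load_average=9.5, cpu_cores=8),
]
print(oversaturated_hosts(hosts))
```

In this sketch only `node-b` is flagged, because its load of 9.5 is above its 8 cores.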

Running a Sift investigation

Sift investigations can be started from various locations in Grafana. In all cases, Sift requires some inputs so that it can look in the right places for issues. These inputs are labels, such as cluster, namespace or container, and a time range.

Note

In most cases Sift doesn’t require any specific labels to run an investigation, but investigations that include labels such as cluster and namespace tend to produce the best results.

Investigations can be started from:

  • Grafana Explore: use the + Add button in the toolbar and choose Run investigation. Sift will extract labels from the query and use the current Explore time range.

  • Grafana dashboards: use the dropdown on a panel and choose Run investigation. Sift will extract labels from the query and use the current dashboard time range.

  • Grafana Incident: see the Sift in Grafana Incident section below.

Make sure you have enabled Grafana Machine Learning before running an investigation. See Enable Grafana Machine Learning for more information.

Note

Currently, Sift only extracts labels from PromQL queries in Explore and dashboard panels. Support for more data sources will be added in the future. In the meantime, you can manually add labels to the investigation using the form.
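To give a feel for what extracting labels from a PromQL query involves, here is a rough sketch. It is an assumption-laden illustration, not how Sift parses queries: the function name `extract_equality_labels` is hypothetical, it only picks up `label="value"` equality matchers, and it uses a regular expression instead of a real PromQL parser, so values containing `}` are not handled.

```python
import re

# Matches label="value" equality matchers only; !=, =~ and !~ are skipped.
_MATCHER = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"')


def extract_equality_labels(promql):
    """Collect equality-matched labels from every {...} selector in a query."""
    labels = {}
    for selector in re.findall(r'\{([^}]*)\}', promql):
        for key, value in _MATCHER.findall(selector):
            labels[key] = value
    return labels


query = (
    'sum(rate(http_requests_total{cluster="prod-eu", '
    'namespace="checkout", status=~"5.."}[5m]))'
)
print(extract_equality_labels(query))
```

The regex matcher on `status` is ignored here, leaving only the `cluster` and `namespace` labels from the example query.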

Label Management

Sift uses the provided labels to identify the scope of investigation and discover issues.

Auto-discovering datasources

While the default datasource to be used can be configured for every Sift check, Sift is capable of autodiscovering datasources based on provided labels.

Sift queries all Prometheus, Loki and Tempo datasources configured in Grafana for the labelset provided and identifies the right datasources based on number of matching series/streams. If the provided labelset matches too many series/streams, Sift will not run the investigation because a large scope can lead to noisy results and less value.
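The selection logic can be pictured roughly as follows. This is a sketch under stated assumptions: the function `pick_datasource`, the `max_matches` threshold and its value, and the rule of preferring the datasource with the most matches are all hypothetical, since the exact heuristics and limits Sift uses are not described here.

```python
def pick_datasource(match_counts, max_matches=1000):
    """Pick a datasource from a mapping of name -> matching series/streams.

    Returns None when nothing matches. Raises ValueError when the labelset
    is too broad. The threshold is an illustrative assumption.
    """
    matching = {name: count for name, count in match_counts.items() if count > 0}
    if not matching:
        return None
    best = max(matching, key=matching.get)
    if matching[best] > max_matches:
        raise ValueError(
            f"labelset matches {matching[best]} series/streams in {best!r}; "
            "narrow the scope with more labels"
        )
    return best


print(pick_datasource({"prom-a": 0, "prom-b": 42, "prom-c": 7}))
```

With the counts above, the sketch selects `prom-b`. A labelset that matched more than `max_matches` series would raise an error, mirroring Sift declining to run an overly broad investigation.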

Label usage by Sift checks

Sift checks use different subsets of the provided labelset depending on their scope of operation. Checks like ‘Error Pattern Logs’ use the complete labelset and analyze the resulting Loki streams, while checks like ‘Kube Crashes’ use only the ‘cluster’ and ‘namespace’ (or ‘k8s.cluster.name’ and ‘k8s.namespace.name’) labels from the supplied labelset to query Prometheus for crashed pods.
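The difference between the two checks mentioned above can be sketched like this. The function `labels_for_check` and its dispatch on check name are hypothetical, and only the two checks described in this section are covered.

```python
# Label keys named in this section for cluster and namespace scoping.
CLUSTER_KEYS = ("cluster", "k8s.cluster.name")
NAMESPACE_KEYS = ("namespace", "k8s.namespace.name")


def labels_for_check(check, labels):
    """Return the subset of labels a given check would query with."""
    if check == "Error Pattern Logs":
        # Uses the complete labelset to select Loki streams.
        return dict(labels)
    if check == "Kube Crashes":
        # Uses only the cluster and namespace labels.
        keep = set(CLUSTER_KEYS + NAMESPACE_KEYS)
        return {k: v for k, v in labels.items() if k in keep}
    raise ValueError(f"check not covered by this sketch: {check}")


provided = {"cluster": "prod-eu", "namespace": "checkout", "container": "api"}
print(labels_for_check("Error Pattern Logs", provided))
print(labels_for_check("Kube Crashes", provided))
```

Here the `container` label is kept for ‘Error Pattern Logs’ but dropped for ‘Kube Crashes’.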

Label filtering

Since Sift uses the provided labels in Prometheus/Loki queries as described above, it is important to filter out labels that are not helpful. Sift automatically filters out the following labels: grafana_folder, account_id, ref_id, alertname, severity, datasource_uid, filename and mountpoint.

Any labels containing whitespace in the key or value field are also filtered out for the same reason.
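The two filtering rules above can be expressed as a short sketch. The label names in `FILTERED_LABELS` come from this section; the function names and the sample input are hypothetical, and this is an approximation of the behavior, not Sift's code.

```python
# Labels that this section says Sift filters out automatically.
FILTERED_LABELS = {
    "grafana_folder", "account_id", "ref_id", "alertname",
    "severity", "datasource_uid", "filename", "mountpoint",
}


def has_whitespace(text):
    """True if the string contains any whitespace character."""
    return any(ch.isspace() for ch in text)


def filter_labels(labels):
    """Drop filtered label names and any label with whitespace in key or value."""
    return {
        key: value
        for key, value in labels.items()
        if key not in FILTERED_LABELS
        and not has_whitespace(key)
        and not has_whitespace(value)
    }


raw = {
    "cluster": "prod-eu",
    "namespace": "checkout",
    "alertname": "HighErrorRate",
    "team": "payments platform",
}
print(filter_labels(raw))
```

In this example `alertname` is removed by name and `team` is removed because its value contains a space, leaving only `cluster` and `namespace`.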

Viewing investigation results

Sift investigations can be viewed in the Grafana Machine Learning page. In the Alerts & IRM category of the sidebar, click Machine learning then View investigations. Your investigations will be listed and can be filtered from the toolbar.

Click an investigation to view the results. The checks are shown in a column on the left, grouped by status:

  • Interesting results contains checks which found something potentially useful.
  • Completed checks contains checks which ran and determined that nothing unusual had happened during the investigation.
  • Failed checks contains checks which failed to run for any reason.

Click a check to view the results. Each check has a custom-built UI designed to convey the information surfaced by the check.

Sift in Grafana Incident

Note

cluster and namespace are currently required to initiate a Sift investigation from Grafana Incident.

You can use Sift investigations in Grafana Incident to get valuable suggestions while working to resolve an active incident. Currently, there are two ways you can leverage Sift within Grafana Incident:

  • Run a Sift investigation within an incident: From the Suggestions section in the right sidebar of the incident timeline, click Start Sift investigation. Manually enter the cluster and namespace to start a Sift investigation specifically tailored to the incident.

  • Add some context to the Incident timeline: link to a dashboard, Explore query, alert rule or OnCall alert rule, and Sift will automatically extract cluster and namespace labels and start investigations.

Note

When a Sift investigation is triggered from within an incident, the time range is automatically set to span from the incident start time to the time the investigation is triggered.

View and manage Sift suggestions

When a Sift check identifies interesting results, clickable links appear in the right sidebar under Suggestions. Click these links to review detailed information about the specific Sift check.

You can add important Sift suggestions directly to the main Incident timeline. Alternatively, if a Sift check result is deemed irrelevant, you can dismiss it from the suggestions.