
SLI example for latency

This guide provides examples of how to define latency SLIs using different Prometheus metric types. The basic SLO example for demonstration purposes is as follows:

| SLI category | SLI description | Time window | Target |
| --- | --- | --- | --- |
| Latency | Requests respond within 2 seconds | 28d | 99% |

The SLI in this example includes all requests, and the SLO defines the target percentage.

When possible, avoid using percentiles in SLIs, such as 95th percentile latency with a 99% target, to maintain simplicity and consistency across SLO types. Refer to Building good SLOs—CRE life lessons from Google Cloud for more on this topic.

Before you begin, read the SLI availability examples to understand how SLIs are defined in Grafana SLO:

Note

The SLI query result must return a ratio between 0 and 1, where 1 means 100% of events were successful. This is required to evaluate whether the SLI meets the SLO target.

Screenshot of the graph result of an SLI ratio

Probe latency (using Prometheus Gauge)

This example uses the probe_duration_seconds metric from Synthetic Monitoring probes to verify public latency. For details on how Synthetic Monitoring probes work, see the SLI availability examples using probes.

| Metric | Type | Description |
| --- | --- | --- |
| probe_duration_seconds | Gauge | How long the probe took to complete in seconds |

In the Grafana SLO wizard, you can create SLIs using two options:

  • Ratio query builder: Enter counter metrics for successful and total events.
  • Advanced: Enter the ratio SLI query directly.

Because probe_duration_seconds is not a counter metric, choose the Advanced option to create the SLI query.

SLIs are defined as ratio-like queries, either as the ratio of successful events or the ratio of successful event rates:

# ratio of successful event rates formula
Success rate = rate of successful events (over a period)
               /  
               rate of total events (over a period)

# ratio of successful events formula
Success rate = number of successful events (over a period)
               /  
               total number of events (over a period)

With gauge metrics, you can implement the ratio of successful events formula as follows:

promql
# number of successful probe requests over the rate interval
sum(
  count_over_time(
    (probe_duration_seconds{job="<JOB_NAME>"} < 2)[$__rate_interval:]
  )
)
/
# number of total probe requests over the rate interval
sum(
  count_over_time(
    probe_duration_seconds{job="<JOB_NAME>"}[$__rate_interval:]
  )
)

Here’s the breakdown of the numerator query:

promql
# number of successful probe requests over the rate interval
sum(
  count_over_time(
    (probe_duration_seconds{job="<JOB_NAME>"} < 2)[$__rate_interval:]
  )
)
  • probe_duration_seconds{job="<JOB_NAME>"} < 2

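    Selects probe latency samples. The < 2 comparison acts as a filter: it keeps only the samples where latency is within the SLI threshold (less than two seconds).

    The result contains a sample for each success, carrying its original latency value, and no sample for a failure. To get a binary 0/1 series instead, PromQL requires the bool modifier (< bool 2).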

  • [$__rate_interval:]

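    Evaluates the previous expression repeatedly over the past $__rate_interval.

    Because count_over_time works only on range vectors, the query uses a subquery [:] to turn the filtered expression into a range vector. With no resolution after the colon, the subquery evaluates the expression at the default evaluation interval, so the samples it counts are evaluation steps rather than raw scrapes. The denominator uses the same subquery, so both counts use the same steps.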

  • count_over_time(...)

    Counts the samples in the range vector, which is the number of successful probe requests in the interval.

  • Finally, sum(...) aggregates across all series (dimensions).

The numerator is then divided by the total number of probe requests over the same interval using a similar query:

promql
/
# number of total probe requests over the rate interval
sum(
  count_over_time(
    probe_duration_seconds{job="<JOB_NAME>"}[$__rate_interval:]
  )
)
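
One caveat with the filter-based numerator: if every probe in the interval exceeds the threshold, the filter drops all samples, the numerator returns no series, and the ratio returns no data rather than 0. A possible variant, sketched here as an assumption and not as the query Grafana SLO recommends, uses the bool modifier so that each evaluation yields 1 (success) or 0 (failure), and then sums those values:

```promql
# number of successful probe requests: sum of the 1/0 results of the bool comparison
sum(
  sum_over_time(
    (probe_duration_seconds{job="<JOB_NAME>"} < bool 2)[$__rate_interval:]
  )
)
/
# number of total probe requests over the rate interval
sum(
  count_over_time(
    probe_duration_seconds{job="<JOB_NAME>"}[$__rate_interval:]
  )
)
```

With this form, an interval in which all probes are slow evaluates to 0 instead of an empty result.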

Probe latency (using Histogram)

This SLI example uses the probe_all_duration_seconds histogram metric, which requires a different SLI query than the gauge example.

| Metric | Type | Description |
| --- | --- | --- |
| probe_all_duration_seconds | Histogram | How long the probe took to complete in seconds |

Prometheus histogram metrics store samples based on their value (latency in this case) and expose additional series:

  • *_count: The total number of observations, regardless of latency.

  • *_bucket: The cumulative number of observations at or below each bucket's upper bound, which is exposed in the le label. The buckets for this metric are 0, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, and +Inf.

    Graph visualizing the different buckets of the `probe_all_duration_seconds` histogram metric

You can use a histogram metric to return the number of successful samples if the metric includes a bucket for the specific SLI threshold.

However, probe_all_duration_seconds does not include a bucket for 2s, and cannot be used to filter histogram samples at that threshold. For alternatives, refer to handle a threshold not available as a bucket.
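
To check which bucket boundaries a histogram actually exposes before choosing a threshold, a query along these lines can help. It is a minimal sketch that assumes the same job label placeholder used elsewhere in this guide, and it returns one series per le value:

```promql
# one result per bucket boundary (le label) exposed by the histogram
count by (le) (
  probe_all_duration_seconds_bucket{job="<JOB_NAME>"}
)
```

If the intended threshold does not appear among the returned le values, the bucket series for that threshold does not exist.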

This example uses a different threshold (2.5s) for demonstration purposes. Use the Ratio option to build the SLI query as follows:

| Ratio query builder | Value | Description |
| --- | --- | --- |
| Success metric | probe_all_duration_seconds_bucket{job="<JOB_NAME>", le="2.5"} | Number of probe requests that took 2.5s or less |
| Total metric | probe_all_duration_seconds_count{job="<JOB_NAME>"} | Total number of probe requests |
| Grouping | (leave empty) | Creates a single SLI dimension |

See the multidimensional SLI example

Click Run queries to generate the final SLI ratio query:

Screenshot of the Grafana SLO wizard creating an SLI for latency using a Prometheus histogram metric

The auto-generated SLI implements the ratio of successful event rates formula:

Success rate = rate of successful events (over a period)
               /  
               rate of total events (over a period)

The SLI query returns a ratio between 0 and 1, where 1 means 100% of events were successful.
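
As a rough illustration of that formula applied to the two metrics above, the query has approximately the following shape. This is a hand-written sketch, not a copy of the wizard's output; the generated query may include additional handling, for example for missing series:

```promql
# rate of probe requests that completed within 2.5s
sum(rate(probe_all_duration_seconds_bucket{job="<JOB_NAME>", le="2.5"}[$__rate_interval]))
/
# rate of all probe requests
sum(rate(probe_all_duration_seconds_count{job="<JOB_NAME>"}[$__rate_interval]))
```

Because both series are counters, rate() can be applied to them directly, and no subquery is needed as in the gauge example.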

To learn why the auto-generated SLI is formed this way and how it works, refer to the breakdown of the ratio SLI query of the HTTP availability example.

Handle a threshold not available as a bucket

It is common for your SLI threshold to not match an existing histogram bucket, as in this example:

  1. The SLI counts responses under 2 seconds.
  2. The nearest available buckets are 1 and 2.5; there is no bucket for 2.

In this case, probe_all_duration_seconds_bucket{job="<JOB_NAME>", le="2"} returns no data because no series has that le value, so consider other approaches:

  • Add a bucket for your threshold: If you control the instrumentation, update the histogram metric to include a bucket for the exact SLI threshold.
  • Use a fallback metric: Check whether a latency gauge metric is available, as in the previous gauge example.
  • Approximate using the nearest bucket: Use the nearest higher or lower bucket. A lower bucket makes the SLI stricter than intended, and a higher bucket makes it more lenient. Document this clearly and adapt your SLO settings, as the SLO no longer matches the intended SLI threshold.
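
As a sketch of the third approach, the two nearest buckets bracket the intended 2-second threshold. The query below uses the lower bucket (le="1"), which undercounts successes and is therefore the stricter choice; replacing it with le="2.5" gives the more lenient variant used in the histogram example. The query shape assumes the ratio of successful event rates formula described earlier:

```promql
# stricter approximation: counts only probe requests that completed within 1s
sum(rate(probe_all_duration_seconds_bucket{job="<JOB_NAME>", le="1"}[$__rate_interval]))
/
# rate of all probe requests
sum(rate(probe_all_duration_seconds_count{job="<JOB_NAME>"}[$__rate_interval]))
```

Whichever bucket you choose, the true ratio for the 2-second threshold lies between the results of the le="1" and le="2.5" variants.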