SLI example for availability

This guide shows how to define availability SLIs based on successful HTTP responses and probe results. The examples cover several ways to define SLIs with different Prometheus metric types.

First, it’s necessary to understand what Grafana SLO expects from an SLI.

Grafana SLO supports event-based SLIs (also known as ratio-based SLIs), which measure the ratio of successful events to total events:

Success rate = number of successful events (over a period)
               /
               total number of events (over a period)

SLI queries can also be defined as the ratio of successful event rates:

Success rate = rate of successful events (over a period)
               /
               rate of total events (over a period)

Note

The SLI query result must return a ratio between 0 and 1, where 1 means 100% of events were successful. This is required to evaluate whether the SLI meets the SLO target.
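As a quick check of the two formulas above, the following Python sketch shows that dividing event counts and dividing per-second rates give the same ratio when both use the same period. The event counts and the 300-second window are hypothetical values chosen for illustration, and the function names are not part of Grafana SLO.

```python
def success_ratio_from_counts(successful: int, total: int) -> float:
    """Ratio of successful events to total events over one period."""
    return successful / total


def success_ratio_from_rates(successful: int, total: int, window_seconds: float) -> float:
    """Same ratio, computed from per-second rates over the same period."""
    success_rate = successful / window_seconds
    total_rate = total / window_seconds
    return success_rate / total_rate


# Hypothetical numbers: 990 of 1000 requests succeeded in a 5-minute window.
by_counts = success_ratio_from_counts(990, 1000)
by_rates = success_ratio_from_rates(990, 1000, window_seconds=300)
print(by_counts, by_rates)
```

The window length cancels out in the division, which is why either form yields a value between 0 and 1.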

Screenshot of the graph result of an SLI ratio

In the Grafana SLO wizard, you can create these SLIs using two methods:

  • Ratio query builder: Enter a counter metric for success events and a counter metric for total events, and it auto-generates the final SLI query.
  • Advanced: Enter the ratio SLI query directly.

HTTP availability (using Prometheus Counter)

HTTP availability is a common SLI for frontend and API services. It defines availability as the proportion of requests that do not return server errors (5xx status codes).

  • Number of successful events: All non-5xx requests
  • Total number of events: All HTTP requests

This example uses http_requests_total, a Prometheus counter metric that counts the number of HTTP requests by status code, method, and other labels.

This metric can be used to calculate the SLI as number of successful events / total number of events, using the Ratio option in the Grafana SLO wizard:

Ratio query builder | Value                              | Description
Success metric      | http_requests_total{status!~"5.."} | Metric for successful requests
Total metric        | http_requests_total                | Metric for total requests
Grouping            | (leave empty)                      | Creates a single SLI dimension

See the multidimensional SLI example

Click Run queries to generate the final SLI ratio query:

promql
(
  sum(rate(http_requests_total{status!~"5.."}[$__rate_interval] offset 2m))
  or 0 * sum(rate(http_requests_total[$__rate_interval] offset 2m))
)
/
sum(rate(http_requests_total[$__rate_interval] offset 2m))

Breakdown of the ratio SLI query

The auto-generated SLI query includes a few common additions for building reliable SLIs.

This section breaks down how it works. The SLI definition conceptually transitions from:

Success rate = number of successful events / total number of events

->

Success rate = rate of successful events / rate of total events

The SLI query result is still the same: a ratio between 0 and 1, where 1 means 100% of events were successful. This ratio is required for evaluating an SLO target.

Rate of successful events

The following part of the query measures successful events over time:

promql
(
  sum(rate(http_requests_total{status!~"5.."}[$__rate_interval] offset 2m))
  or 0 * sum(rate(http_requests_total[$__rate_interval] offset 2m))
)

  • http_requests_total{status!~"5.."}: Selects only non-5xx requests, which this SLI treats as successful.

  • rate(...[$__rate_interval]): Calculates the per-second rate of successful requests over the recommended rate interval.

  • offset 2m: Shifts the query two minutes into the past to account for scrape or data ingestion delays.

  • sum(...): Aggregates across all series (dimensions) to get the total success rate.

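  • or 0 * sum(rate(http_requests_total...)): Fallback for missing data. If the success selector returns no series (for example, when every request in the window failed), the numerator falls back to 0 multiplied by the total rate, so the SLI evaluates to 0 instead of returning no data.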

This numerator is then divided by the rate of total requests.
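To illustrate why the fallback matters, here is a simplified Python model of the numerator and the division. It assumes each side of the query is a single label-less aggregate, represented as a float, or as None when the selector matches no series. The helper names are hypothetical and only approximate PromQL semantics for this specific case.

```python
import math
from typing import Optional


def promql_or(left: Optional[float], right: Optional[float]) -> Optional[float]:
    """Simplified `left or right`: keep the left value if it exists, else the right."""
    return left if left is not None else right


def divide(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Simplified `/`: no result if either side is missing."""
    if numerator is None or denominator is None:
        return None
    if denominator == 0:
        # Floating-point division by zero: 0/0 is NaN, otherwise infinity.
        return math.nan if numerator == 0 else math.copysign(math.inf, numerator)
    return numerator / denominator


def sli(success_rate: Optional[float], total_rate: Optional[float]) -> Optional[float]:
    """Numerator with the `or 0 * total` fallback, divided by the total rate."""
    fallback = None if total_rate is None else 0 * total_rate
    return divide(promql_or(success_rate, fallback), total_rate)


# Hypothetical window in which every request failed: no success series exists.
print(divide(None, 10.0))  # without the fallback there is no result at all
print(sli(None, 10.0))     # with the fallback the SLI is reported as 0
```

Without the fallback, a window with only failures would produce no data point rather than an SLI of 0, so the worst outages would be missing from the SLO evaluation.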

Rate of total events

promql
/
sum(rate(http_requests_total[$__rate_interval] offset 2m))

The denominator uses the same rate(...) and sum(...) pattern as the numerator, without the status filter, and applies the same offset so that both sides of the division cover the same time range.

The full query returns a ratio between 0 and 1, representing the proportion of successful requests. The Grafana SLO wizard displays the final SLI query and a graph of its results:

Screenshot of the Grafana SLO wizard creating an SLI for HTTP availability

Note that you can also use the Advanced SLI option to create the same SLI query directly.

Probe availability (using Prometheus Summary)

This example uses Synthetic Monitoring probes, such as local probes or Grafana Cloud probes, to verify service availability.

The process is as follows:

  1. Configure a synthetic check that continuously verifies system availability from one or more probe locations.

    The check runs regularly, based on the configured frequency, and stores its results in Prometheus.

  2. Define an SLO whose SLI queries the Prometheus probe results.

    Grafana SLO then evaluates the SLI query and reports the SLO compliance.

This example uses the probe_all_success metric, a summary metric that tracks whether the probe succeeded.

Prometheus summary metrics expose additional *_sum and *_count series. These can be used to define the SLI as number of successful events / total number of events using the Ratio option in the Grafana SLO wizard:

Ratio query builder | Value                                     | Description
Success metric      | probe_all_success_sum{job="<JOB_NAME>"}   | Number of successful probes
Total metric        | probe_all_success_count{job="<JOB_NAME>"} | Total number of probe executions
Grouping            | (leave empty)                             | Creates a single SLI dimension

See the multidimensional SLI example

Click Run queries to generate the final SLI ratio query:

promql
(
  sum(rate(probe_all_success_sum{job="<JOB_NAME>"}[$__rate_interval] offset 2m))
  or 0 * sum(rate(probe_all_success_count{job="<JOB_NAME>"}[$__rate_interval] offset 2m))
)
/
sum(rate(probe_all_success_count{job="<JOB_NAME>"}[$__rate_interval] offset 2m))

The SLI query returns a ratio between 0 and 1, where 1 means 100% of probe executions were successful.
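The following Python sketch is a toy model of how the _sum and _count series of such a summary can yield a success ratio. It assumes the summary observes 1 for a successful probe and 0 for a failed one, as the table above describes. The ProbeSummary class and the probe results are hypothetical and are not taken from Synthetic Monitoring.

```python
class ProbeSummary:
    """Toy stand-in for a summary that observes 1 (success) or 0 (failure)."""

    def __init__(self) -> None:
        self.sum = 0.0  # plays the role of probe_all_success_sum
        self.count = 0  # plays the role of probe_all_success_count

    def observe(self, success: bool) -> None:
        self.sum += 1.0 if success else 0.0
        self.count += 1


summary = ProbeSummary()

# Probe executions before the evaluation window starts.
for ok in [True, True, False, True]:
    summary.observe(ok)
start_sum, start_count = summary.sum, summary.count

# Probe executions inside the evaluation window.
for ok in [True, False, True, True, True]:
    summary.observe(ok)

# Both series only ever increase, so the window is measured by their increase.
success_increase = summary.sum - start_sum
total_increase = summary.count - start_count
window_ratio = success_increase / total_increase
print(window_ratio)
```

Because both series are cumulative, they behave like counters, which is why the Ratio query builder can apply rate(...) to them.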

This example works exactly like the HTTP availability example. To learn why the SLI is formed this way and how it works, refer to the breakdown of the ratio SLI query.

Probe availability (using Prometheus Gauge)

Sometimes, a binary gauge metric is used to track successes, such as the probe_success metric.

  • probe_success is 1 on success.
  • probe_success is 0 on failure.

In the SLO wizard, the Ratio option expects a counter metric and cannot generate the correct ratio SLI for this case. Use the Advanced query option instead.

Define an SLI that returns the ratio of successes, represented as a value between 0 and 1, as in the previous examples. You can use the event-based success rate formula for this SLI:

Success rate = number of successful events (over a period)
               /
               total number of events (over a period)

The SLI can then be defined as follows:

promql
# `sum_over_time` sums the 1s to calculate the number of successful probes
sum(sum_over_time(probe_success{job="<JOB_NAME>"}[$__rate_interval]))
/
# `count_over_time` counts the total number of probe executions (1=success, 0=failure)
sum(count_over_time(probe_success{job="<JOB_NAME>"}[$__rate_interval]))

  • probe_success{job="<JOB_NAME>"}: Returns probe results for the specified job. Each sample is either 1 (success) or 0 (failure).
  • sum_over_time(...[$__rate_interval]): Sums the sample values over the given interval. Because each success is 1 and each failure is 0, the result is the number of successful probes.
  • count_over_time(...[$__rate_interval]): Counts all probe executions in the given interval, including both successes (1) and failures (0).
  • sum(...): Aggregates across all series (dimensions) to get the total number of successful probes and total probe executions.

Like the other SLI examples, this SLI returns a value between 0 and 1, representing the ratio of successful executions.
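The arithmetic behind this query can be sketched in Python as follows. The two series and their sample values are hypothetical, and the helper functions only mimic what sum_over_time and count_over_time do to the samples inside one interval.

```python
from typing import Dict, List


def sum_over_time(samples: List[int]) -> int:
    """Sum of the sample values in the interval (number of 1s for a binary gauge)."""
    return sum(samples)


def count_over_time(samples: List[int]) -> int:
    """Number of samples in the interval (successes and failures alike)."""
    return len(samples)


# Hypothetical probe_success samples inside one interval, one list per series.
series: Dict[str, List[int]] = {
    "probe-a": [1, 1, 0, 1],
    "probe-b": [1, 1, 1, 1, 0, 1],
}

# The outer sum(...) aggregates across series before dividing.
successes = sum(sum_over_time(s) for s in series.values())
executions = sum(count_over_time(s) for s in series.values())
sli = successes / executions
print(successes, executions, sli)
```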

Alternatively, you can use avg_over_time to calculate the same type of SLI:

promql
avg(avg_over_time(probe_success{job="<JOB_NAME>"}[$__rate_interval]))

This is also a valid SLI because the binary gauge metric only returns 0 and 1:

  • avg_over_time(...[$__rate_interval]): Returns the mean of the samples over the interval. Because each sample is 0 or 1, this is the fraction of successful probes, a number between 0 and 1.
  • avg(...): Averages across all series (dimensions), giving each series equal weight.
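The Python sketch below compares the averaging approach with the sum and count approach on hypothetical samples. It suggests the two agree when every series has the same number of samples in the interval, and can differ when the sample counts differ, because averaging weights each series equally regardless of how many samples it has. The function names are illustrative only.

```python
from typing import List


def avg_over_time(samples: List[int]) -> float:
    """Mean of the samples in the interval."""
    return sum(samples) / len(samples)


def avg_of_series(series: List[List[int]]) -> float:
    """Mimics avg(avg_over_time(...)): mean of the per-series means."""
    per_series = [avg_over_time(s) for s in series]
    return sum(per_series) / len(per_series)


def pooled_ratio(series: List[List[int]]) -> float:
    """Mimics sum(sum_over_time(...)) / sum(count_over_time(...))."""
    return sum(sum(s) for s in series) / sum(len(s) for s in series)


# Same number of samples per series: both approaches agree.
equal = [[1, 1, 0, 1], [1, 0, 0, 1]]
print(avg_of_series(equal), pooled_ratio(equal))

# Different number of samples per series: the results can diverge.
unequal = [[1, 1, 1, 1, 1, 1, 1, 1], [0, 0]]
print(avg_of_series(unequal), pooled_ratio(unequal))
```

If probes run at different frequencies, the sum and count form weights every probe execution equally, while the averaging form weights every series equally.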

Availability using failure metric

Sometimes, you might have a metric that counts failures instead of successes, because your instrumentation doesn’t use a single metric with a label indicating success or failure.

For example, you may only have the following counters:

  • failure_total: Counts failed requests or operations.
  • all_total: Counts all requests or operations.

In this case, you can calculate availability by subtracting failures from the total:

Success rate = (total events - failed events)
               /
               total events

Avoid using 1 - (<failure rate> / <total rate>). It can return NaN on missing data, and the query can’t be parsed as a success ratio, which limits certain SLO dashboard visualizations and filtering.

In the Grafana SLO wizard, use the Advanced option:

promql
# rate of successful events
(
  sum(rate(all_total[$__rate_interval]))
  -
  sum(rate(failure_total[$__rate_interval]))
)
/
# rate of total events
sum(rate(all_total[$__rate_interval]))

The SLI query result is still the same: a ratio between 0 and 1, where 1 means 100% of events were successful.
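As a worked example of the subtraction formula, the following Python sketch computes availability from a total rate and a failure rate. The rates are hypothetical, and the function is an illustration of the arithmetic rather than part of Grafana SLO.

```python
def availability(total_rate: float, failure_rate: float) -> float:
    """(total events - failed events) / total events, using per-second rates."""
    if total_rate <= 0:
        raise ValueError("total_rate must be positive to compute a ratio")
    return (total_rate - failure_rate) / total_rate


# Hypothetical rates: 50 requests per second, of which 0.5 per second fail.
ratio = availability(total_rate=50.0, failure_rate=0.5)
print(ratio)

# With no failures, every event counts as successful.
print(availability(total_rate=50.0, failure_rate=0.0))
```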