Grafana Cloud

Custom labels

Synthetic Monitoring allows you to define custom labels in your check definitions.

Query custom labels

Synthetic Monitoring uses the sm_check_info metric, which is an info metric, to present information about executed checks. Custom labels are applied to this metric.

All custom labels have a label_ prefix to ensure they don’t conflict with system-defined labels.

For example, if you create a custom label env with value prod, you can query the metric sm_check_info{label_env="prod"} to see the details for any checks with that custom label.
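
As a quick way to explore which checks carry the label, the following sketch matches any non-empty value of label_env and then counts checks per value. It assumes a check is identified by its job and instance pair, and that sm_check_info has one series per probe. Run each query separately.

```promql
# Every sm_check_info series that has the custom label env set to any value.
sm_check_info{label_env!=""}

# Number of checks per value of the custom label.
# The inner count collapses the per-probe series into one series per check.
count by (label_env) (
  count by (job, instance, label_env) (sm_check_info{label_env!=""})
)
```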

Query check metrics with custom labels

To apply custom labels to other metrics, you must use a Prometheus join. In short, PromQL can join two metrics together: at query time, you can append a label set from one metric to another. This is useful in Prometheus rule evaluations because it lets you generate a new metric for a series by appending labels from an info metric.
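
The general shape of such a join can be sketched as follows. The metric and label names here (some_metric, some_info, shared_label_a, shared_label_b, extra_label) are hypothetical placeholders, not real Synthetic Monitoring names, and the sketch assumes some_info has at most one series per combination of the shared labels.

```promql
# some_metric: the series you care about (hypothetical name).
# some_info: an info metric whose value is always 1 (hypothetical name).
some_metric
* on (shared_label_a, shared_label_b)   # match series on these labels only
group_left (extra_label)                # copy extra_label from some_info
some_info
```

Because the info metric has the value 1, the multiplication leaves the values of some_metric unchanged and only adds extra_label to each result series.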

Many resources, including the Prometheus documentation, explain joins in more depth.

The following section provides a Synthetic Monitoring example.

Example

This example shows how to create an alert rule that triggers if the duration of a probe exceeds 250 ms for production environments.

This example uses the probe_duration_seconds metric for demonstration, but the information here applies to all the metrics produced by Synthetic Monitoring.

probe_duration_seconds, like all Synthetic Monitoring metrics, has four labels:

  • config_version
  • instance
  • job
  • probe

Some metrics have additional labels, but all of them include these four.

To differentiate between environments, this example uses a custom label called env with the value prod. Because of the label_ prefix, it appears as label_env in query results.

Define a new HTTP check

Start by defining a new HTTP check with https://www.grafana.com/ as the target and example as the job name, and select a few locations. You can change the target and job to something more relevant to you. For the rest of this example, https://www.grafana.com/ is the instance and example is the job.

In the Advanced options section, add one custom label. For this example, the name is env and the value is prod.

Note

Alternatively, you can also use an existing check. Just note the values of instance and job and make sure it has at least one custom label defined.

Let the check execute for a couple of minutes so that Grafana Cloud can collect some metrics.

Use Explore to examine the probe_duration_seconds metric

In your Grafana Cloud account, go to the Explore tab and enter the following expression:

promql
probe_duration_seconds{job="example", instance="https://www.grafana.com/"}

The expression returns the series for that job/instance pair, one for each location you have selected.

Notice that the custom label isn’t included in the result.
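
To see where the custom label does appear, you can query the info metric for the same check. This is a small sketch that reuses the job and instance values from this example:

```promql
# Info metric for the same check. Each series has the value 1 and carries
# the custom labels with the label_ prefix.
sm_check_info{job="example", instance="https://www.grafana.com/"}
```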

Join with sm_check_info to retrieve the custom labels

The custom labels are applied to sm_check_info. That means you need to join your metric of interest with this one.

Change the probe_duration_seconds expression from the previous step to look like this:

promql
probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
*
on (job, instance, probe, config_version)
sm_check_info{job="example", instance="https://www.grafana.com/"}

That produces the exact same result as before.

Note

Unpacking the new expression:

  • on (job, instance, probe, config_version) is saying “use these labels to join the values”.
  • The sm_check_info metric always has the value 1.
  • Since the operation is *, this produces the same values.

You can find more information about this topic in the binary operator and vector matching keyword sections of the Prometheus documentation.
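
If you want to confirm that multiplying by sm_check_info can't alter the values, one possible sanity check is to look for series of the info metric whose value isn't 1. This sketch should return an empty result:

```promql
# Comparison operators filter series by default, so this keeps only
# the series whose value differs from 1. An empty result is expected.
sm_check_info{job="example", instance="https://www.grafana.com/"} != 1
```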

Propagate custom labels from the sm_check_info metric to the final result

Update the query like this:

promql
probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
*
on (job, instance, probe, config_version)
group_right ()
sm_check_info{job="example", instance="https://www.grafana.com/"}

Now you have the custom labels in the result and some other labels you might not want, such as config_version.

Note

PromQL allows you to perform many-to-one and one-to-many joins using one of two keywords. The question you need to ask is, “Which side of the operator has many elements and which side has one element?” (for each set of the joining labels).

  • Many-to-one is done using the group_left modifier.
  • One-to-many is done using the group_right modifier.

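In this example, you have a one-to-one relation: for each combination of the joining labels, there is exactly one series on the left side and one on the right side. The right side, sm_check_info, carries additional labels, and group_right tells PromQL to keep the label set of the right side in the result. That’s why this example uses group_right.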
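
As an alternative sketch, not the approach used in the rest of this example, you can keep the label set of the left side and copy only the labels you name from sm_check_info by using group_left with a label list. This assumes label_env is the only custom label you need:

```promql
# Keep the labels of probe_duration_seconds and copy only label_env
# from sm_check_info. Other sm_check_info labels aren't copied.
probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
*
on (job, instance, probe, config_version)
group_left (label_env)
sm_check_info{job="example", instance="https://www.grafana.com/"}
```

The result still includes config_version, because that label belongs to probe_duration_seconds itself.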

Drop excess labels

To drop the excess labels, you can change the expression to:

promql
sum without (config_version) (
  probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
  *
  on (job, instance, probe, config_version)
  group_right ()
  sm_check_info{job="example", instance="https://www.grafana.com/"}
)

sum without (config_version) says “sum all the values, grouping them by their labels, ignoring config_version”.

Note

Why would you want to drop config_version?

Because the value of that label changes every time you modify the check (for example, when you add a location or modify the frequency). If you leave it there, any time you modify the check, you end up with two different series: one that ends before the modification and one that starts after the modification. To make the series continuous again, you drop config_version.
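
To check whether a check’s label set changed recently, one option is to count how many distinct sm_check_info series existed per probe over a time window. This is a sketch; the one-day window is an arbitrary assumption.

```promql
# count_over_time returns one element per series that had samples in the
# window, so the outer count is the number of distinct label sets per probe.
# A value greater than 1 means a label such as config_version changed.
count by (job, instance, probe) (
  count_over_time(sm_check_info{job="example", instance="https://www.grafana.com/"}[1d])
)
```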

Note that the result also includes the frequency and geohash labels, which come from sm_check_info. frequency is the interval in milliseconds between runs of the check, so changing it creates a discontinuity in the result. geohash comes from the probe’s configuration and represents the geographic coordinates of the location where the probe runs. It’s unlikely to change, but it can. For that reason, you might want to drop these labels as well using this expression:

promql
sum without (config_version, frequency, geohash) (
  probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
  *
  on (job, instance, probe, config_version)
  group_right ()
  sm_check_info{job="example", instance="https://www.grafana.com/"}
)

At this point, you have multiple time series, one per location where the check is running.

Besides the job, instance, and probe labels, the result has label_env with prod as its value.
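
If you prefer to name the labels to keep rather than the labels to drop, a possible equivalent for this example uses sum by. Note the assumption: any label not listed, including other custom labels, is removed from the result.

```promql
# Keep only the listed labels. Everything else is aggregated away.
sum by (job, instance, probe, label_env) (
  probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
  *
  on (job, instance, probe, config_version)
  group_right ()
  sm_check_info{job="example", instance="https://www.grafana.com/"}
)
```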

Create the final expression that you can alert on

Add a threshold to the expression like this:

promql
sum without (config_version, frequency, geohash) (
  probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
  *
  on (job, instance, probe, config_version)
  group_right ()
  sm_check_info{job="example", instance="https://www.grafana.com/"}
)
> 0.250

If you use this expression in an alert rule, it will trigger if the duration of the probe exceeds 250 ms. You can use label_env for alert routing decisions.

Optionally filter based on custom labels

As explained before, custom labels are applied to the sm_check_info metric. If you want to filter based on them, apply the filter there.

Following the example, modify the query like this:

promql
sum without (config_version, frequency, geohash) (
  probe_duration_seconds{job="example", instance="https://www.grafana.com/"}
  *
  on (job, instance, probe, config_version)
  group_right ()
  sm_check_info{job="example", instance="https://www.grafana.com/", label_env="prod"}
)

That doesn’t change the result because the query still specifies the job and instance labels. In other words, this limits the result to a single check.

This means you can drop some of the label matchers, depending on what you want to achieve. For example, if you have consistently applied the custom label env with the value prod to all checks that are relevant to the production environment, you could use something like this:

promql
sum without (config_version, frequency, geohash) (
  probe_duration_seconds{}
  *
  on (job, instance, probe, config_version)
  group_right ()
  sm_check_info{label_env="prod"}
)
> 0.250

That produces a time series per location for every check that has that label defined on it. The alert triggers if any of those series exceeds the threshold.
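
If a single slow location is too noisy for your alerting, one possible variation counts how many locations exceed the threshold for each check and requires a minimum number of them. The minimum of 2 is an assumption chosen for illustration.

```promql
# Count, per check, the locations whose probe duration exceeds 250 ms,
# and keep only the checks where at least 2 locations do.
count by (job, instance, label_env) (
  sum without (config_version, frequency, geohash) (
    probe_duration_seconds{}
    *
    on (job, instance, probe, config_version)
    group_right ()
    sm_check_info{label_env="prod"}
  )
  > 0.250
)
>= 2
```

The result has one series per check instead of one per location, so the probe label is no longer available for routing.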