Best practices for meta-monitoring the Grafana Cloud Agent

Published: 18 Nov 2020

Note: Refer to the official documentation page for any potential updates to the recommendations below.

Earlier this year, we introduced the Grafana Cloud Agent, a subset of Prometheus built for hosted metrics that runs lean on memory and uses the same service discovery, relabeling, WAL, and remote_write code found in Prometheus. Because it is trimmed down to only the parts needed for interaction with Cortex, tests of our first release have seen up to a 40% memory-usage reduction compared to an equivalent Prometheus process.

But what are the meta-monitoring best practices when using the Grafana Cloud Agent?

With the Prometheus pull model, there is a straightforward way of monitoring when a node is unavailable. The up metric is automatically generated for each scrape target. The value is 1 when the target is reachable, and 0 when the target is not reachable.

The Grafana Cloud Agent reverses the model into a push model. This means that the “up” metric has a value of 1 when the Agent is running and does not exist at all when the Agent is down. The problem with absent data is that the alerting engine depends on the alert expression returning something in order to fire an alert. You also need data points with a value of 0 to compute a percentage of uptime (sum of values) over a period of time (count of values).
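To make the uptime problem concrete, here is a minimal Python sketch of the sum-of-values over count-of-values calculation described above. The function name and the sample window are hypothetical; nothing here is part of Prometheus or the Agent.

```python
from typing import Optional, Sequence


def uptime_ratio(samples: Sequence[Optional[int]], fill_missing: bool) -> Optional[float]:
    """Return sum(values) / count(values) for a window of `up` samples.

    None stands for an evaluation step where no sample exists (Agent down).
    With fill_missing=False, missing steps are skipped, as happens when the
    series simply has no data. With fill_missing=True, they count as 0.
    """
    if fill_missing:
        values = [0 if s is None else s for s in samples]
    else:
        values = [s for s in samples if s is not None]
    if not values:
        return None  # nothing to compute: the series does not exist at all
    return sum(values) / len(values)


# Hypothetical window: the Agent reported for 3 steps, then went silent for 1.
window = [1, 1, 1, None]
print(uptime_ratio(window, fill_missing=False))  # 1.0  (the outage is invisible)
print(uptime_ratio(window, fill_missing=True))   # 0.75 (the outage is counted)
```

Without something that turns "no sample" into an explicit 0, the outage never shows up in the ratio. The solutions below are different ways of producing that missing information.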

This raises the question of how to monitor your monitoring system if the collector agent or the host it’s running on has a problem. If you host your metrics on Grafana Cloud, then the Grafana Labs team monitors the availability of your metrics once they are sent to the cloud. You only have to worry about monitoring the agents themselves.

By contrast, with Prometheus you would set up high availability by running two Prometheus instances in parallel, and meta-monitoring by having those instances scrape each other. This is not always the best solution when using the Grafana Cloud Agent. The whole point of the Agent is that each node reports itself to the cloud in a push model.

This means we need to come up with new solutions and best practices for the use case of monitoring the monitoring agent.

Solution 1: max_over_time(up[]) unless up

This clever PromQL expression will return a vector for a time series that suddenly stops existing:

max_over_time(up{job="integrations/agent"}[1h]) unless up

We take advantage of the range vector in max_over_time() which will catch all series that have existed in that time range. The particular _over_time() function that we use doesn’t really matter because up always has a value of 1 here.

We use the unless logical operator which is described in the Prometheus documentation:

vector1 unless vector2 results in a vector consisting of the elements of vector1 for which there are no elements in vector2 with exactly matching label sets. All matching elements in both vectors are dropped.

If a series was present in the past hour and is currently present, it is not returned in the result set. If a series was present within the past hour and is currently not present, it is returned in the result set.

Thus, series that have existed recently and do not currently exist will be returned with a value of 1.

The alert will fire for at most 1 hour after the metric stops being reported, or whatever time range you set in the alert expression.
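The set logic behind this expression can be sketched in a few lines of Python. This is only an illustration of the `unless` semantics quoted above, with label sets modeled as frozensets. The helper names and the host names are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


def labels(**kv: str) -> LabelSet:
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())


def unless(left: Dict[LabelSet, float], right: Dict[LabelSet, float]) -> Dict[LabelSet, float]:
    """Keep the elements of `left` whose exact label set is not in `right`."""
    return {ls: value for ls, value in left.items() if ls not in right}


# max_over_time(up[1h]): every series seen during the last hour (value 1).
seen_last_hour = {
    labels(job="integrations/agent", agent_hostname="host01"): 1.0,
    labels(job="integrations/agent", agent_hostname="host02"): 1.0,
}
# up: the series that exist right now; host02 has stopped reporting.
up_now = {labels(job="integrations/agent", agent_hostname="host01"): 1.0}

missing = unless(seen_last_hour, up_now)
print(sorted(dict(ls)["agent_hostname"] for ls in missing))  # ['host02']
```

Because the match is on the exact label set, a series whose labels change is treated as one that vanished plus one that appeared, so it also lands in the result for the length of the range.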

Pros:

  • Quick to implement
  • Picks up a down host instantly

Cons:

  • Only fires an alert for the duration of the range vector
  • Any label values that change cause the expression to fire an alert

Make sure to enable the agent integration in the configuration to have this metric reported in the first place.

integrations:
  agent:
    enabled: true

Solution 2: absent()

From the absent() function documentation:

The absent() function returns a 1-element vector with the value 1 if the vector passed to it has no elements. This is useful for alerting on when no time series exist for a given metric name and label combination.

While you can’t write a single alert expression that covers all of your hosts, using a templating engine and some automation, you can create a rule group with as many alert rules as you have things to monitor.

For example, this absent_alert mixin in Kubernetes creates an individual Prometheus alert for components that need to be alerted on when they stop reporting.

Here is an example of an alert rule group:

groups:
  - name: HostAvailability
    rules:
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host01",environment="dev",region="us"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
          summary: Host is down (host01)
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host02",environment="prod",region="au"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
          summary: Host is down (host02)
      - alert: HostAvailability
        ...
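As a rough idea of the automation involved, the following Python sketch renders a rule group like the one above from a list of hosts. It uses plain string formatting rather than a real templating engine. The function names and the inventory list are hypothetical, and it assumes label values contain no quotes or backslashes.

```python
from typing import Dict, List


def absent_rule(labels: Dict[str, str], hostname_label: str = "agent_hostname") -> str:
    """Render one alert rule whose expr is absent(up{...}) for the given labels."""
    selector = ",".join(f'{name}="{value}"' for name, value in labels.items())
    host = labels[hostname_label]
    lines = [
        "      - alert: HostAvailability",
        f"        expr: absent(up{{{selector}}})",
        "        for: 10m",
        "        labels:",
        "          severity: critical",
        "        annotations:",
        "          description: |-",
        "            Host availability for VM resources",
        "              VALUE = {{ $value }}",    # left as-is for Prometheus to expand
        "              LABELS: {{ $labels }}",
        f"          summary: Host is down ({host})",
    ]
    return "\n".join(lines)


def rule_group(hosts: List[Dict[str, str]]) -> str:
    """Render a full rule group with one absent() alert per host."""
    header = ["groups:", "  - name: HostAvailability", "    rules:"]
    return "\n".join(header + [absent_rule(h) for h in hosts]) + "\n"


# Hypothetical inventory; in practice this would come from your CMDB or IaC tool.
hosts = [
    {"agent_hostname": "host01", "environment": "dev", "region": "us"},
    {"agent_hostname": "host02", "environment": "prod", "region": "au"},
]
print(rule_group(hosts))
```

The generated file still has to be uploaded to your ruler, and regenerated whenever the inventory changes.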

Here is an example query to calculate the percentage of uptime over one day (the [1d:] range in the query):

count_over_time(up{job="integrations/agent"}[1d:])
/ on () group_left
count_over_time(vector(1)[1d:])

For this to work, every instance must have been up for at least one sampling interval during the time range selected. Otherwise, you will have to hardcode the label values in as many queries as you have instances. For example:

sum by (agent_hostname) (
  sum_over_time(
    (
    0 * absent(up{job="integrations/agent",agent_hostname="host01"})
    or
    up{job="integrations/agent",agent_hostname="host01"}
    )[1d:]
  )
)
/ on () group_left
count_over_time(vector(1)[1d:])

Solution 3: blackbox_exporter

You can do an ICMP check with a blackbox_exporter. This is a return to the pull model, and in a way, it defeats the purpose of using the Grafana Cloud Agent. In some environments, it may not be possible due to network firewall restrictions. Still, black box monitoring with an external agent would be “the right way” of checking if a host or a service is up.

blackbox_exporter configuration:

modules:
  ping_check:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

Prometheus configuration:

scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [ping_check]
    static_configs:
      - targets:
        - host01@clusterA@prod
        - host02@clusterA@dev
        - host03@clusterB@dev
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)@.*@.*
        replacement: $1
        target_label: server_agent_hostname
      - source_labels: [__address__]
        regex: .*@(.*)@.*
        replacement: $1
        target_label: server_cluster
      - source_labels: [__address__]
        regex: .*@.*@(.*)
        replacement: $1
        target_label: server_environment
      - source_labels: [server_agent_hostname]
        target_label: __param_target
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

This way of passing multiple pieces of information via scrape targets and regular expressions is described on my personal blog: Supercharge your blackbox_exporter modules.
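To check what the three regular expressions in the relabel_configs extract from a target such as host01@clusterA@prod, you can try them outside Prometheus. The sketch below uses Python's re module as a stand-in for Prometheus' regex engine; `re.fullmatch` reproduces the fact that relabel regexes are anchored at both ends. The function name is hypothetical.

```python
import re
from typing import Dict

# Same patterns as in the relabel_configs above, keyed by their target_label.
PATTERNS = {
    "server_agent_hostname": r"(.*)@.*@.*",
    "server_cluster": r".*@(.*)@.*",
    "server_environment": r".*@.*@(.*)",
}


def split_target(address: str) -> Dict[str, str]:
    """Derive the server_* labels from a 'host@cluster@environment' target."""
    result = {}
    for label, pattern in PATTERNS.items():
        match = re.fullmatch(pattern, address)
        if match:  # a regex that does not match leaves the label unset
            result[label] = match.group(1)
    return result


print(split_target("host01@clusterA@prod"))
# {'server_agent_hostname': 'host01', 'server_cluster': 'clusterA', 'server_environment': 'prod'}
```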

Then we use this relabeling config in the remote_write section of the Grafana Cloud Agent:

remote_write:
  write_relabel_configs:
    - regex: "server_(.*)"
      replacement: "$1"
      action: labelmap
    - regex: "server_.*"
      action: labeldrop

Solution 4: Use service discovery to register agents

You could use a service discovery mechanism (for example: etcd or consul) to register Agents to be scraped, and then do either blackbox monitoring or set your Agents to scrape each other in a full or partial mesh.

This may require installing extra software, which is not ideal. I haven’t implemented this so let’s move on to the next solution.

Solution 5: node_exporter’s textfile collector

Using node_exporter’s textfile collector, we create a custom_server_info metric containing label values for all the instances we want to monitor. The purpose of this file is to provide the system with a database of nodes that are expected to be up. We distribute the file on all nodes so the information is not on a single point of failure. We use the Prometheus HA deduplication feature in Cortex to deduplicate the series. Then we use PromQL logic to synthesize a value of 0 when the up metric is absent.

I suggest using the metric name custom_server_info to align with naming best practices.

We will also want to add some labels that will later override the external labels set by the Agent. The custom metric should have labels with a common custom prefix for easy relabeling.

Here is an example of a custom metric with labels prefixed with server_ (newlines added for clarity):

custom_server_info{server_agent_hostname="host01",
                   server_cluster="...",
                   server_environment="...",
                   etc.,etc.,} 1
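One possible way to produce such a file is sketched below in Python. The inventory, the file name, and the helper names are all hypothetical. The sketch writes to a temporary file and renames it, so the textfile collector never reads a half-written file.

```python
import os
import tempfile
from typing import Dict, Iterable


def escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(servers: Iterable[Dict[str, str]]) -> str:
    """Render one custom_server_info sample per server, labels prefixed with server_."""
    lines = ["# TYPE custom_server_info gauge"]
    for server in servers:
        labels = ",".join(f'server_{k}="{escape(v)}"' for k, v in sorted(server.items()))
        lines.append(f"custom_server_info{{{labels}}} 1")
    return "\n".join(lines) + "\n"


def write_atomically(directory: str, filename: str, content: str) -> str:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path


# Hypothetical inventory of the nodes that are expected to be up.
servers = [
    {"agent_hostname": "host01", "cluster": "clusterA", "environment": "prod"},
    {"agent_hostname": "host02", "cluster": "clusterA", "environment": "dev"},
]
print(render(servers))
```

On a real node you would call `write_atomically()` with the directory configured as textfile_directory and a file name ending in .prom, since the collector only reads files with that extension.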

You can use this write_relabel_configs in the remote_write section of the Grafana Cloud Agent configuration:

remote_write:
  write_relabel_configs:
    - regex: "server_(.*)"
      replacement: "$1"
      action: labelmap
    - regex: "server_.*"
      action: labeldrop

Using the labelmap action, we extract the part after server_ and overwrite the labels without the server_ prefix.
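If the effect of these two relabeling steps is hard to picture, the following Python sketch mimics them on a single label set. It is a simplified model of labelmap and labeldrop, not the Prometheus implementation. The sample labels are hypothetical, and the replacement is written as `\1` (Python syntax) where the Agent configuration uses `$1`.

```python
import re
from typing import Dict


def labelmap(labels: Dict[str, str], regex: str, replacement: str) -> Dict[str, str]:
    """Copy each label whose name fully matches `regex` to the rewritten name."""
    pattern = re.compile(regex)
    result = dict(labels)
    for name, value in labels.items():
        match = pattern.fullmatch(name)
        if match:
            result[match.expand(replacement)] = value  # overwrites an existing label
    return result


def labeldrop(labels: Dict[str, str], regex: str) -> Dict[str, str]:
    """Remove every label whose name fully matches `regex`."""
    pattern = re.compile(regex)
    return {name: value for name, value in labels.items() if not pattern.fullmatch(name)}


# Hypothetical sample as sent by the Agent running on host03.
sample = {
    "__name__": "custom_server_info",
    "agent_hostname": "host03",         # external label of the reporting Agent
    "server_agent_hostname": "host01",  # label read from the .prom file
    "server_environment": "prod",
}
mapped = labelmap(sample, r"server_(.*)", r"\1")
final = labeldrop(mapped, r"server_.*")
print(final)
# {'__name__': 'custom_server_info', 'agent_hostname': 'host01', 'environment': 'prod'}
```

The agent_hostname of the reporting Agent (host03) is replaced by the hostname the sample describes (host01), which is what lets every Agent report on every expected node.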

We now have multiple instances of Grafana Cloud Agent sending the same time series but with a different instance label value (see replace_instance_label and use_hostname_label in the agent configuration reference).

There are two ways to deal with this duplication of series:

  1. When writing queries, aggregate by (agent_hostname). This is not ideal because this grouping needs to be done on every query.
  2. Use the Prometheus HA deduplication mechanism built into Cortex and Grafana Cloud Metrics.

The deduplication of Prometheus HA pairs is introduced in another blog article and is documented in the Grafana Cloud documentation and in the Cortex documentation.

In Grafana Cloud, Prometheus HA deduplication is enabled by default. All you need to do is to add the cluster and __replica__ labels to your samples.

The usual way to add cluster and __replica__ labels is in the external_labels section. However, if we do that, all the metrics sent from every Grafana Cloud Agent except the one elected as master will be dropped. We only want to deduplicate the custom_server_info metric.

Adding the cluster and __replica__ labels in the .prom file read by the textfile collector won’t work. Label names starting with a double underscore (__) are reserved for internal use. You can set them inside the Prometheus process with relabel configs, but you can’t set them in exporters.

Furthermore, deduplication happens on a per-series basis, but only the first sample in a write request is checked for the presence of the deduplication labels. If the labels are present in the first sample, the deduplication code engages, assuming all the samples in the write request also have the labels. All the samples in a write request must therefore either have or not have the deduplication labels. This is because the feature was not designed for the use case of sending some metrics with deduplication and some without.

The workaround is to configure two remote_write targets: one which drops every metric except custom_server_info and adds the cluster and __replica__ labels, and one which sends everything else without the cluster and __replica__ labels.

integrations:
  agent:
    enabled: true
  node_exporter:
    enabled: true
    textfile_directory: /var/local/node_exporter
  prometheus_remote_write:
    - url: https://prometheus-us-central1.grafana.net/api/prom/push
      basic_auth:
        username: $INSTANCE_ID
        password: $API_TOKEN
      write_relabel_configs:
        - source_labels: [__name__]
          regex: "custom_server_info"
          action: drop
    - url: https://prometheus-us-central1.grafana.net/api/prom/push
      basic_auth:
        username: $INSTANCE_ID
        password: $API_TOKEN
      write_relabel_configs:
        - source_labels: [__name__]
          regex: "custom_server_info"
          action: keep
        - target_label: cluster
          replacement: global
        - source_labels: [agent_hostname]
          target_label: __replica__
        - regex: "server_(.*)"
          replacement: "$1"
          action: labelmap
        - regex: "server_.*"
          action: labeldrop

Now here are the queries that you can use with your custom_server_info metric. They mix the up{job="integrations/agent"} and custom_server_info metrics to return vectors with 0 values when the up values are missing.

A query that returns 0 or 1 values (instant up or down check):

  custom_server_info * 0
unless on (agent_hostname)
  up{job="integrations/agent"}
or on (agent_hostname)
  custom_server_info
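The operator precedence in this query is not obvious at first glance: `*` binds tighter than `unless`, which binds tighter than `or`. The Python sketch below mimics the evaluation order with plain sets keyed by agent_hostname. It is only a model of the logic, and the function and host names are hypothetical.

```python
from typing import Dict, Set


def agent_up(expected: Set[str], reporting: Set[str]) -> Dict[str, int]:
    """Mimic the query above, keyed by agent_hostname.

    expected:  hosts listed in custom_server_info
    reporting: hosts that currently have an up{job="integrations/agent"} series
    """
    # (custom_server_info * 0) unless on (agent_hostname) up -> 0 for silent hosts
    result = {host: 0 for host in expected if host not in reporting}
    # ... or on (agent_hostname) custom_server_info -> 1 for hosts not already present
    for host in expected:
        result.setdefault(host, 1)
    return result


status = agent_up({"host01", "host02", "host03"}, {"host01", "host03"})
print(sorted(status.items()))  # [('host01', 1), ('host02', 0), ('host03', 1)]
```

A host that reports `up` but is not listed in custom_server_info never appears in the result, so the file really is the source of truth for what is expected to exist.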

You can add the following recording rule in your Grafana Cloud Prometheus rules to simplify the alert and graphing queries. This saves the result of the above expression into a new metric named agent:custom_server_info:up which you can then use in alert queries and Grafana panels.

groups:
  - name: Grafana Cloud Agent metamonitoring
    rules:
    - record: agent:custom_server_info:up
      expr: |2
          custom_server_info * 0
        unless on (agent_hostname)
          up{job="integrations/agent"}
        or on (agent_hostname)
          custom_server_info

Without such a recording rule, you may find the following PromQL expressions useful in various use-cases.

A query that returns 0 or 1 values with a label selector:

  custom_server_info{agent_hostname=~"$host"} * 0
unless on (agent_hostname)
  up{job="integrations/agent"}
or on (agent_hostname)
  custom_server_info{agent_hostname=~"$host"}

Because of the “or” boolean logic, if you want to query the availability of a subset of hosts, you need to add the filter on both sides of the “or” operator.

An alert expression that returns a vector (and fires an alert) when a host is down:

custom_server_info unless on (agent_hostname) up{job="integrations/agent"}

A query that computes the %uptime over a time period:

(
  count_over_time(
    (
      up{job="integrations/agent"}
      and on (agent_hostname)
      custom_server_info
    )[$__interval:]
  )
  or on (agent_hostname)
  0 * custom_server_info
)
/ on (agent_hostname)
count_over_time(custom_server_info[$__interval:])

We have to “or” the numerator with “0 * custom_server_info” because the count_over_time() function won’t return a vector for series that have 0 data points during the selected time range.

In the above expression, we used PromQL subqueries (the [$__interval:] syntax). Leaving the resolution after the colon empty makes it default to the global evaluation interval, so both counts are sampled at the same interval even if the integrations/node_exporter and integrations/agent scrape jobs have different scrape intervals.

A query that returns %uptime with a label selector:

(
  count_over_time(
    (
      up{job="integrations/agent"}
      and on (agent_hostname)
      custom_server_info{agent_hostname=~"$host"}
    )[$__interval:]
  )
  or on (agent_hostname)
  0 * custom_server_info{agent_hostname=~"$host"}
)
/ on (agent_hostname)
count_over_time(custom_server_info[$__interval:])

Because of the “or” boolean logic, if you want to query the availability of a subset of hosts, you need to add the filter on both sides of the “or” operator.

Engineering roadmap

I sought input about the future of the Grafana Cloud Agent from our engineering team:

The long-term goal is to make the default and recommended operational mode of the Grafana Cloud Agent to be clustered via a second iteration on the scraping service mode. With this in place, you’d be able to alert on Agents being offline by using their state in the cluster. If they unregister themselves from the cluster, it’s a natural shutdown. Otherwise if they’re unhealthy, it should generate an alert.

There is currently no ETA for the second iteration of Agent clustering, but some efforts are being put into it.

Conclusion

Monitoring a host, device, or service uptime is best done with black box monitoring or with a pull scraping model. The Grafana Cloud Agent reverses the Prometheus pull model into a push model without a database of expected targets. Monitoring up/downtime requires centralized knowledge of which entities are expected to exist. A static file dropped in the textfile collector directory and some complex Agent configuration and PromQL expressions can do the trick. Without such a database file, a clever query that “notices” the sudden disappearance of a time series and then settles back to normality after a set amount of time could also satisfy your requirement.