Monitoring the Grafana Agent

The Grafana Agent uses a subset of Prometheus code. The Prometheus documentation provides a list of best practices related to monitoring Prometheus itself. This is called meta-monitoring.

Typically, Prometheus pulls metrics: the remote monitoring system polls (scrapes) a set of defined targets. The Grafana Agent reverses this into a push model. The agent installed on a monitoring target scrapes metrics locally and pushes them to the remote monitoring system.

With a pull model, it is straightforward to determine whether a node is available using an up metric with a value of 1 when the target is reachable and 0 when it is not.

With the agent’s push model, the up metric has a value of 1 when the agent is running and no value at all when it is not.

This distinction matters when determining whether your monitoring system is up or down right now. It also matters when determining whether it has been running as expected over time: computing a percentage of uptime requires both the 1 and the 0 values, because the calculation is the sum of all values in a given time period divided by the count of values over that same period. Absent data is also a problem for alerting, because the alerting engine depends on the alert expression returning something in order to fire an alert.
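
To make the arithmetic concrete, here is a minimal Python sketch. It is not part of the agent; the function names and sample data are illustrative assumptions. It contrasts the pull-model calculation, where 0 samples exist, with the push-model case, where missing samples have to be compared against the number of samples you expected.

```python
from typing import Optional, Sequence


def uptime_pull(samples: Sequence[int]) -> float:
    """Pull model: every scrape yields 1 (up) or 0 (down)."""
    return sum(samples) / len(samples)


def uptime_push(samples: Sequence[Optional[int]]) -> float:
    """Push model: a stopped agent yields no sample at all (None here).

    The denominator must be the number of samples we expected,
    not the number of samples we received.
    """
    received = [s for s in samples if s is not None]
    return sum(received) / len(samples)


# Six scrape intervals; the target is down during two of them.
pull = [1, 1, 0, 0, 1, 1]
push = [1, 1, None, None, 1, 1]

print(uptime_pull(pull))  # four of six intervals up
print(uptime_push(push))  # same result, because we divide by expected samples

# Dividing by the samples actually received hides the outage entirely.
received = [s for s in push if s is not None]
naive = sum(received) / len(received)
print(naive)
```

The last calculation shows why the absence of data is a problem: with only 1 values available, the naive ratio is always 1.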

Monitoring the uptime of a host, device, or service is typically done with black box monitoring or with a pull scraping model. Because the Grafana Agent pushes metrics and has no database of expected targets, monitoring uptime and downtime requires centralized knowledge of which entities are expected to exist. A static file dropped in the textfile collector directory, combined with some Agent configuration and PromQL expressions, can provide that knowledge. Without such a file, a query that “notices” the sudden disappearance of a time series, and then stops returning it after a set amount of time, could also satisfy your requirement.

The following are some methods you can use or adapt to monitor the Grafana Agent. All were originally developed as part of an article by Alexandre de Verteuil posted on the Grafana blog.

In all cases, the prerequisite is that you have deployed and configured the Grafana Agent.

Method 1: Use PromQL to create an alert

First, enable the agent integration so that the metric we are about to use is reported. If it is not already present, add the following to the integrations section of the agent configuration YAML file:

integrations:
  agent:
    enabled: true

Create a Grafana Cloud Alert using this PromQL expression, which will return a vector for a time series that suddenly stops existing:

max_over_time(up{job="integrations/agent"}[1h]) unless up

Pros:

  • Quick to implement
  • Picks up a down host instantly

Cons:

  • Only fires an alert for the duration of the range vector
  • Any label values that change cause the expression to fire an alert

What this PromQL expression collects

The range vector in max_over_time() catches all series that have existed in that time range. The particular _over_time() function that we use doesn’t really matter because up always has a value of 1 here.

We use the unless logical operator which is described in the Prometheus documentation:

vector1 unless vector2 results in a vector consisting of the elements of vector1 for which there are no elements in vector2 with exactly matching label sets. All matching elements in both vectors are dropped.

What this PromQL expression evaluates

If a series was present in the past hour and is currently present, it is not returned in the result set. If a series was present within the past hour and is currently not present, it is returned in the result set.

Thus, series that have existed recently and do not currently exist will be returned with a value of 1.

The alert will fire for at most 1 hour after the metric stops being reported, or whatever time range you set in the alert expression.
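
The set logic behind this expression can be sketched in a few lines of Python. This is only a model of the unless operator on plain sets, not how Prometheus evaluates it; the version label and the host names are hypothetical and exist only to show the label-change drawback listed above.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


def label_set(**labels: str) -> LabelSet:
    """Build a hashable label set (the metric name is not part of it)."""
    return frozenset(labels.items())


def recently_disappeared(
    seen_in_window: Iterable[LabelSet], present_now: Iterable[LabelSet]
) -> Dict[LabelSet, int]:
    """Mimic `max_over_time(up[1h]) unless up` on plain Python sets."""
    current = set(present_now)
    # Keep every label set seen in the window that has no exact match now.
    return {labels: 1 for labels in seen_in_window if labels not in current}


# Label sets that reported `up` at some point during the last hour.
seen = [
    label_set(agent_hostname="host01", version="1"),
    label_set(agent_hostname="host02", version="1"),
    label_set(agent_hostname="host03", version="1"),
    label_set(agent_hostname="host03", version="2"),
]
# Label sets that report `up` right now.
now = [
    label_set(agent_hostname="host01", version="1"),
    label_set(agent_hostname="host03", version="2"),
]

for labels in recently_disappeared(seen, now):
    print(dict(labels))
```

In this example host02 is returned because it stopped reporting, and the old label set of host03 is returned even though the host is still up, because one of its label values changed.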

Method 2: Use the absent() function

In this method, we use the absent() function with a templating engine and some automation to create a rule group containing an alert rule for each thing we need to monitor.

From the Prometheus documentation:

The absent() function returns a 1-element vector with the value 1 if the vector passed to it has no elements. This is useful for alerting on when no time series exist for a given metric name and label combination.

There are a couple of ways to go about this.

Use the absent_alert mixin to programmatically create a group of alert rules

For this, use the absent_alert mixin in Kubernetes to create an individual Prometheus alert for each component that needs to be alerted on when it stops reporting.

Configure an alert rule group

Alternatively, create a script that generates the absent() alert rules to monitor agents, with output along these lines:

groups:
  - name: HostAvailability
    rules:
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host01",environment="dev",region="us"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
          summary: Host is down (host01)
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host02",environment="prod",region="au"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
          summary: Host is down (host02)
      - alert: HostAvailability
        ...
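
One possible shape for such a script is sketched below in Python, using only the standard library and plain string formatting. The inventory dictionary and the function names are assumptions made for this example, and label values are not escaped, so it only suits simple values like the ones shown.

```python
from typing import Dict


def absent_rule(hostname: str, labels: Dict[str, str], wait: str = "10m") -> str:
    """Render one absent() alert rule as YAML text."""
    selector = {"agent_hostname": hostname, **labels}
    matchers = ",".join(f'{name}="{value}"' for name, value in selector.items())
    lines = [
        "      - alert: HostAvailability",
        f"        expr: absent(up{{{matchers}}})",
        f"        for: {wait}",
        "        labels:",
        "          severity: critical",
        "        annotations:",
        "          description: |-",
        "            Host availability for VM resources",
        # These are alert template variables, so they stay literal here.
        "              VALUE = {{ $value }}",
        "              LABELS: {{ $labels }}",
        f"          summary: Host is down ({hostname})",
    ]
    return "\n".join(lines)


def rule_group(hosts: Dict[str, Dict[str, str]]) -> str:
    """Render a rule group with one alert rule per expected host."""
    header = ["groups:", "  - name: HostAvailability", "    rules:"]
    rules = [absent_rule(name, extra) for name, extra in hosts.items()]
    return "\n".join(header + rules) + "\n"


# Hypothetical inventory of hosts that are expected to report.
inventory = {
    "host01": {"environment": "dev", "region": "us"},
    "host02": {"environment": "prod", "region": "au"},
}

print(rule_group(inventory))
```

In practice the inventory would come from whatever system already knows which hosts should exist, and the script would be rerun whenever that list changes.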

Query against collected metrics

Here is an example query to calculate the percentage of uptime over one day:

count_over_time(up{job="integrations/agent"}[1d:])
/ on () group_left
count_over_time(vector(1)[1d:])

Note: For this to work, every instance being monitored must have been up for at least one sampling interval during the time range selected. Otherwise, you will have to hardcode the label values in as many queries as you have instances.

Here’s an example of hardcoding the label values in a query:

sum by (agent_hostname) (
  sum_over_time(
    (
    0 * absent(up{job="integrations/agent",agent_hostname="host01"})
    or
    up{job="integrations/agent",agent_hostname="host01"}
    )[1d:]
  )
)
/ on () group_left
count_over_time(vector(1)[1d:])
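
Because this query has to be repeated for every instance, it may be convenient to generate it. The Python sketch below builds the query text above for one host; the function name and its parameters are assumptions made for this example.

```python
def hardcoded_uptime_query(
    hostname: str, window: str = "1d", job: str = "integrations/agent"
) -> str:
    """Return the hardcoded-label uptime query for a single host."""
    selector = f'up{{job="{job}",agent_hostname="{hostname}"}}'
    return (
        "sum by (agent_hostname) (\n"
        "  sum_over_time(\n"
        "    (\n"
        f"    0 * absent({selector})\n"
        "    or\n"
        f"    {selector}\n"
        f"    )[{window}:]\n"
        "  )\n"
        ")\n"
        "/ on () group_left\n"
        f"count_over_time(vector(1)[{window}:])"
    )


# One query per expected host; the host names are placeholders.
for host in ["host01", "host02"]:
    print(hardcoded_uptime_query(host))
    print()
```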

Method 3: Use a blackbox exporter

You can do an ICMP check such as with the Prometheus blackbox_exporter. This is a return to the pull model, and in a way, it defeats the purpose of using the Grafana Agent. Also, in some environments, it may not be possible due to network firewall restrictions.

For this, create something similar to Alexandre’s blog post titled Supercharge your blackbox_exporter modules.

Method 4: Use node_exporter’s textfile collector

Here we use the textfile collector from the Prometheus node_exporter and create a custom_server_info metric containing label values for all the instances we want to monitor. The purpose of this file is to provide the system with a database of nodes that are expected to be up. We distribute the file on all nodes so the information is not on a single point of failure. We use the Prometheus HA deduplication feature in Cortex to deduplicate the series. Then we use PromQL logic to synthesize a value of 0 when the up metric is absent.

This example uses the metric name custom_server_info to align with naming best practices. It adds some labels that will later override the external labels set by the Grafana Agent. Create labels with a common custom prefix for easy relabeling.

Create a custom metric

Here is an example of a custom metric with labels prefixed with server_ (newlines added for clarity):

custom_server_info{server_agent_hostname="host01",
                   server_cluster="...",
                   server_environment="...",
                   etc.,etc.,} 1
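
The file itself can be produced by a small script. The following Python sketch is one way to do it, under a few assumptions: the file name, the inventory contents, and the help text are made up for this example, and the script writes to a temporary file first and renames it so that the collector, which only reads files ending in .prom, never sees a half-written file.

```python
import os
import tempfile
from typing import Dict, Iterable


def render_series(labels: Dict[str, str]) -> str:
    """Render one custom_server_info sample in the Prometheus text format."""
    pairs = ",".join(f'{name}="{value}"' for name, value in labels.items())
    return f"custom_server_info{{{pairs}}} 1"


def write_prom_file(path: str, servers: Iterable[Dict[str, str]]) -> None:
    """Write the inventory to a temporary file, then rename it into place."""
    lines = [
        "# HELP custom_server_info Servers that are expected to be up.",
        "# TYPE custom_server_info gauge",
    ]
    lines += [render_series(labels) for labels in servers]
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # assumption: the collector runs as another user
    os.replace(tmp_path, path)  # the rename is atomic on the same filesystem


# Hypothetical inventory; the label values are placeholders.
servers = [
    {"server_agent_hostname": "host01", "server_environment": "dev"},
    {"server_agent_hostname": "host02", "server_environment": "prod"},
]

# For the demo we write to a scratch directory. On a real node, point this at
# the directory configured as the textfile directory for node_exporter.
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "custom_server_info.prom")
write_prom_file(target, servers)

with open(target) as handle:
    print(handle.read())
```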

In the remote_write section of the Grafana Agent configuration, add write_relabel_configs configuration info, like this:

remote_write:
  write_relabel_configs:
    - regex: "server_(.*)"
      replacement: "$1"
      action: labelmap
    - regex: "server_.*"
      action: labeldrop

This configuration uses the labelmap action to extract the part after server_ and overwrite the labels, omitting the server_ prefix.

Note: We first create a custom metric with the server_ prefix and then use relabeling rules to drop that prefix later. The following paragraphs explain why.

According to the Prometheus documentation, write relabeling is applied after external labels are added. This allows us to override the external labels by relabeling the server_* metric labels.

Our intent here is to ensure the cluster label from the text file can override what’s set in the agent config. With meta-monitoring, you don’t want the external labels from the existing config to take precedence, because the metric is only useful if it carries information about the designated hosts. The file may have information about other hosts that exist in other regions, such as the host that may have forwarded this monitoring information via remote_write.

For example, if we had three servers A, B, and C:

  • Server A in region EU sets its external label region="EU" for most of its own metrics.
  • When sending the custom_server_info metrics about server B in US-East and server C in APAC, server A’s region label is irrelevant.
  • Server B and C also send the same custom_server_info about server A, B and C. For meta-monitoring, none of their region labels are relevant. You want to specify the region label in the text file.

So, to reiterate, in order to accomplish what we want we must first prefix the server_ label and then do some relabeling after external labels are applied.
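
The effect of the two relabeling rules can be illustrated with a small Python model. This is a simplified sketch of the labelmap and labeldrop actions, not the Prometheus implementation, and the sample series is invented. Python writes the first capture group as \1 where the relabel config uses $1.

```python
import re
from typing import Dict


def labelmap(labels: Dict[str, str], regex: str, replacement: str) -> Dict[str, str]:
    """Copy every label whose name fully matches regex to the rewritten name."""
    pattern = re.compile(regex)
    out = dict(labels)
    for name, value in labels.items():
        match = pattern.fullmatch(name)  # relabel regexes are anchored
        if match:
            out[match.expand(replacement)] = value
    return out


def labeldrop(labels: Dict[str, str], regex: str) -> Dict[str, str]:
    """Remove every label whose name fully matches regex."""
    pattern = re.compile(regex)
    return {n: v for n, v in labels.items() if not pattern.fullmatch(n)}


# A series about server B, as sent by server A after external labels are added.
series = {
    "__name__": "custom_server_info",
    "region": "EU",  # external label of the sending agent (server A)
    "server_agent_hostname": "serverB",
    "server_region": "US-East",
}

mapped = labelmap(series, r"server_(.*)", r"\1")
relabeled = labeldrop(mapped, r"server_.*")
print(relabeled)
```

After both steps the region label carries the value from the text file rather than the external label of the sending agent, and no server_ labels remain.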

Handling duplicate series

We now have multiple instances of Grafana Agent sending the same time series but with a different instance label value (see replace_instance_label and use_hostname_label in the agent configuration reference).

There are two ways to deal with this duplication of series:

  1. When writing queries, aggregate by (agent_hostname). This is not ideal because this grouping needs to be done on every query.
  2. Use the Prometheus HA deduplication mechanism built into Cortex and Grafana Cloud Metrics.

The deduplication of Prometheus HA pairs is introduced in another blog article and is documented in Grafana Cloud documentation and in Cortex documentation.

In Grafana Cloud, Prometheus HA deduplication is enabled by default. All you need to do is to add the cluster and __replica__ labels to your samples.

Avoid a deduplication pitfall

The usual way to add cluster and __replica__ labels is in the external_labels section. However, if we do that, all the metrics sent from every Grafana Agent except the one elected as master will be dropped. We only want to deduplicate the custom_server_info metric.

Adding the cluster and __replica__ labels in the .prom file read by the textfile collector won’t work. Label names starting with a double underscore (__) are reserved for internal use. You can set them inside the Prometheus process with relabel configs, but you can’t set them in exporters.

Furthermore, deduplication happens on a per-series basis, but only the first sample in a write request is checked for the presence of the deduplication labels. If the labels are present in the first sample, the deduplication code engages, assuming all the samples in the write request also have the labels. All the samples in a write request must therefore either have or not have the deduplication labels. The feature was not designed for the use case of sending some metrics with deduplication and some without.

The solution around this is to configure two remote_write targets: one which drops every metric except custom_server_info and adds the cluster and __replica__ labels, and one which sends everything else without the cluster and __replica__ labels.

integrations:
  agent:
    enabled: true
  node_exporter:
    enabled: true
    textfile_directory: /var/local/node_exporter
  prometheus_remote_write:
    - url: <Your Metrics instance remote_write endpoint>
      basic_auth:
        username: $INSTANCE_ID
        password: $API_TOKEN
      write_relabel_configs:
        - source_labels: [__name__]
          regex: "custom_server_info"
          action: drop
    - url: <Your Metrics instance remote_write endpoint>
      basic_auth:
        username: $INSTANCE_ID
        password: $API_TOKEN
      write_relabel_configs:
        - source_labels: [__name__]
          regex: "custom_server_info"
          action: keep
        - target_label: cluster
          replacement: global
        - source_labels: [agent_hostname]
          target_label: __replica__
        - regex: "server_(.*)"
          replacement: "$1"
          action: labelmap
        - regex: "server_.*"
          action: labeldrop
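
To see how the two remote_write entries split the traffic, here is a simplified Python model of the two relabeling pipelines. It is a sketch for reasoning about the configuration, not the agent's code; the function names and the sample series are assumptions.

```python
import re
from typing import Dict, Optional

Labels = Dict[str, str]


def metrics_target(series: Labels) -> Optional[Labels]:
    """First remote_write entry: drop custom_server_info, send the rest."""
    if series["__name__"] == "custom_server_info":
        return None
    return dict(series)


def inventory_target(series: Labels) -> Optional[Labels]:
    """Second remote_write entry: keep only custom_server_info and relabel it."""
    if series["__name__"] != "custom_server_info":
        return None
    out = dict(series)
    out["cluster"] = "global"
    if "agent_hostname" in out:
        # The replica is the agent that sent the series, so this must be
        # read before the labelmap step overwrites agent_hostname.
        out["__replica__"] = out["agent_hostname"]
    for name, value in series.items():
        match = re.fullmatch(r"server_(.*)", name)
        if match:
            out[match.group(1)] = value  # labelmap
    return {n: v for n, v in out.items() if not re.fullmatch(r"server_.*", n)}


# Series as seen on host01 after its external labels were added.
up_series = {"__name__": "up", "job": "integrations/agent", "agent_hostname": "host01"}
info_series = {
    "__name__": "custom_server_info",
    "agent_hostname": "host01",
    "server_agent_hostname": "host02",
    "server_environment": "prod",
}

for series in (up_series, info_series):
    print(metrics_target(series), inventory_target(series))
```

Only the custom_server_info series ends up with the cluster and __replica__ labels, so only that metric takes part in deduplication.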

You can find the /api/prom/push URL, username, and password for your metrics endpoint by clicking on Details in the Prometheus card of the Cloud Portal.

Query against collected metrics

Now here are the queries that you can use with your custom_server_info metric. They mix the up{job="integrations/agent"} and custom_server_info metrics to return vectors with 0 values when the up values are missing.

A query that returns 0 or 1 values (instant up or down check):

  custom_server_info * 0
unless on (agent_hostname)
  up{job="integrations/agent"}
or on (agent_hostname)
  custom_server_info
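
The logic of this query can be modeled in Python as follows. This is only a sketch of the set operations, assuming one custom_server_info series per expected host after deduplication; the host names are placeholders.

```python
from typing import Dict, Iterable


def instant_up(expected: Iterable[str], reporting: Iterable[str]) -> Dict[str, int]:
    """Return 1 for expected hosts that report `up`, and 0 for those that do not."""
    expected_hosts = list(expected)
    reporting_hosts = set(reporting)
    # custom_server_info * 0 unless on (agent_hostname) up{...}
    result = {host: 0 for host in expected_hosts if host not in reporting_hosts}
    # ... or on (agent_hostname) custom_server_info
    for host in expected_hosts:
        result.setdefault(host, 1)
    return result


status = instant_up(
    expected=["host01", "host02", "host03"],
    reporting=["host01", "host03", "host99"],
)
print(status)
```

Note that host99 reports up but is not in the inventory, so it does not appear in the result. The query only covers hosts listed in custom_server_info.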

You can add the following recording rule in your Grafana Cloud Prometheus rules to simplify the alert and graphing queries. This saves the result of the above expression into a new metric named agent:custom_server_info:up which you can then use in alert queries and Grafana panels.

groups:
  - name: Grafana Agent metamonitoring
    rules:
    - record: agent:custom_server_info:up
      expr: custom_server_info * 0 unless on (agent_hostname) up{job="integrations/agent"} or on (agent_hostname) custom_server_info

This instant query expression calculates the proportion of uptime over a period of time, as a value between 0 and 1 (multiply by 100 for a percentage):

avg_over_time(agent:custom_server_info:up[$__range])

The $__range variable refers to the currently selected dashboard time range.

Without such a recording rule, you may find the following PromQL expressions useful in various use-cases.

A query that returns 0 or 1 values with a label selector:

  custom_server_info{agent_hostname=~"$host"} * 0
unless on (agent_hostname)
  up{job="integrations/agent"}
or on (agent_hostname)
  custom_server_info{agent_hostname=~"$host"}

Because of the “or” boolean logic, if you want to query the availability of a subset of hosts, you need to add the filter on both sides of the “or” operator.

An alert expression that returns a vector (and fires an alert) when a host is down:

custom_server_info unless on (agent_hostname) up{job="integrations/agent"}

A query that computes the percentage of uptime over a period of time:

(
  count_over_time(
    (
      up{job="integrations/agent"}
      and on (agent_hostname)
      custom_server_info
    )[$__range:]
  )
  or on (agent_hostname)
  0 * custom_server_info
)
/ on (agent_hostname)
count_over_time(custom_server_info[$__range:])

We have to “or” the numerator with “0 * custom_server_info” because the count_over_time() function won’t return a vector for series that have 0 data points during the selected time range.
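
The need for this fallback can be shown with a short Python sketch. It models count_over_time() results as dictionaries that simply lack a key for hosts with no data points; the counts are invented for the example.

```python
from typing import Dict


def uptime_with_fallback(
    up_counts: Dict[str, int], info_counts: Dict[str, int]
) -> Dict[str, float]:
    """Numerator falls back to 0 for hosts without any `up` data points."""
    return {
        host: up_counts.get(host, 0) / total  # the `or 0 * custom_server_info` part
        for host, total in info_counts.items()
    }


def uptime_without_fallback(
    up_counts: Dict[str, int], info_counts: Dict[str, int]
) -> Dict[str, float]:
    """Without the fallback, only hosts present on both sides are returned."""
    return {
        host: count / info_counts[host]
        for host, count in up_counts.items()
        if host in info_counts
    }


# Data points per host over the selected range (24 evaluation steps).
info_counts = {"host01": 24, "host02": 24, "host03": 24}
up_counts = {"host01": 24, "host02": 18}  # host03 was down the whole time

with_fallback = uptime_with_fallback(up_counts, info_counts)
without_fallback = uptime_without_fallback(up_counts, info_counts)
print(with_fallback)
print(without_fallback)
```

Without the fallback, host03 silently disappears from the result instead of showing an uptime of 0.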

In the above expression, we used PromQL subqueries (the [$__range:] syntax) so that both sides are sampled at the same interval, in case the integrations/node_exporter and integrations/agent scrape jobs have different scrape intervals. When the resolution after the colon is omitted, it defaults to the global evaluation interval.

A query that returns percentage of uptime with a label selector:

(
  count_over_time(
    (
      up{job="integrations/agent"}
      and on (agent_hostname)
      custom_server_info{agent_hostname=~"$host"}
    )[$__range:]
  )
  or on (agent_hostname)
  0 * custom_server_info{agent_hostname=~"$host"}
)
/ on (agent_hostname)
count_over_time(custom_server_info[$__range:])