Monitoring the Grafana Cloud Agent

The Grafana Cloud Agent uses a subset of Prometheus code. The Prometheus documentation provides a list of best practices related to monitoring Prometheus itself. This is called meta-monitoring.

Typically, Prometheus pulls metrics. The Grafana Cloud Agent reverses this into a push model: the agent installed on a monitoring target scrapes metrics locally and pushes them to the remote monitoring system, rather than the remote monitoring system polling (or pulling) metrics from a set of defined targets, as is the case with non-agent Prometheus.

With a pull model, it is straightforward to determine whether a node is available using an up metric with a value of 1 when the target is reachable and 0 when it is not.

With the agent’s push model, the up metric has a value of 1 when the agent is running and no value at all when it is not.

This distinction is important when determining whether your monitoring system is up or down right now. It is also important when determining whether it has been running as expected over time, because computing a percentage of uptime requires both the 1 and 0 values: the sum of all values in a given time period divided by the count of values over that same period. The problem with absent data is that the alerting engine depends on the alert expression returning something in order to fire an alert.
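
For example, in a pull model where up is recorded as both 1 and 0, the percentage of uptime over one day can be computed with a query along these lines (the job label is illustrative):

sum_over_time(up{job="node"}[1d]) / count_over_time(up{job="node"}[1d]) * 100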

Monitoring the uptime of a host, device, or service is typically done with blackbox monitoring or with a pull scraping model. The Grafana Cloud Agent reverses the Prometheus pull model into a push model without a database of expected targets. Therefore, monitoring up/downtime requires centralized knowledge of which entities are expected to exist. A static file dropped in the textfile collector directory, together with some complex Agent configuration and PromQL expressions, can do the trick. Without such a database file, a clever query that “notices” the sudden disappearance of a time series and then settles back to normality after a set amount of time can also satisfy your requirement.
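
As a rough sketch of the static-file approach (the metric name, file name, and host labels here are illustrative, and assume the textfile collector of node_exporter or the agent’s node_exporter integration is enabled on a central host), a file such as expected_targets.prom could list every host that is supposed to report:

# expected_targets.prom: one series per host that should be reporting
expected_up{agent_hostname="host01"} 1
expected_up{agent_hostname="host02"} 1

A PromQL expression along these lines could then return the expected hosts that are not currently reporting:

expected_up unless on (agent_hostname) up{job="integrations/agent"}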

Following are some methods you can use or adapt to monitor the Grafana Cloud Agent, all originally developed as part of an article by Alexandre de Verteuil posted on the Grafana blog.

In all cases, the prerequisite is that you have deployed and configured the Grafana Cloud Agent.

Method 1: Use PromQL to create an alert

First, enable the agent integration so that the metric we are about to use is reported. If it is not already present, add the following to the integrations section of the agent configuration YAML file:

integrations:
  agent:
    enabled: true

Create a Grafana Cloud Alert using this PromQL expression, which returns a result for any time series that suddenly stops existing:

max_over_time(up{job="integrations/agent"}[1h]) unless up

Pros:

  • Quick to implement
  • Picks up a down host instantly

Cons:

  • Only fires an alert for the duration of the range vector
  • Any label values that change cause the expression to fire an alert

What this PromQL expression collects

The range vector in max_over_time() catches all series that have existed in that time range. The particular _over_time() function we use doesn’t really matter, because up always has a value of 1 here.

We use the unless logical operator, which is described in the Prometheus documentation:

vector1 unless vector2 results in a vector consisting of the elements of vector1 for which there are no elements in vector2 with exactly matching label sets. All matching elements in both vectors are dropped.

What this PromQL expression evaluates

If a series was present in the past hour and is currently present, it is not returned in the result set. If a series was present within the past hour and is currently not present, it is returned in the result set.

Thus, series that have existed recently and do not currently exist will be returned with a value of 1.

The alert will fire for at most 1 hour after the metric stops being reported, or whatever time range you set in the alert expression.
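
If you manage alerts as Prometheus-style rule files rather than through the UI, a minimal rule wrapping this expression might look like the following sketch (the group name, alert name, and labels are illustrative):

groups:
  - name: AgentAvailability
    rules:
      - alert: GrafanaAgentStoppedReporting
        expr: max_over_time(up{job="integrations/agent"}[1h]) unless up
        labels:
          severity: critical
        annotations:
          summary: A Grafana Cloud Agent stopped reporting
          description: "An agent that reported within the past hour is not reporting now. LABELS: {{ $labels }}"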

Method 2: Use the absent() function

In this method, we use the absent() function with a templating engine and some automation to create a rule group containing an alert rule for each thing we need to monitor.

From the Prometheus documentation:

The absent() function returns a 1-element vector with the value 1 if the vector passed to it has no elements. This is useful for alerting on when no time series exist for a given metric name and label combination.
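
For example (the agent_hostname value is illustrative), the following expression returns nothing while a matching series exists, and returns a single element with the value 1, carrying the labels from the equality matchers, once the series disappears:

absent(up{job="integrations/agent", agent_hostname="host01"})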

There are a couple ways to go about this.

Use the absent_alert mixin to programmatically create a group of alert rules

For this, use the absent_alert mixin in Kubernetes to create an individual Prometheus alert for each component that needs to be alerted on when it stops reporting.

Configure an alert rule group

Alternatively, create a script that generates the absent() alert rules to monitor agents, perhaps something with output like this:

groups:
  - name: HostAvailability
    rules:
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host01",environment="dev",region="us"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
          summary: Host is down (host01)
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host02",environment="prod",region="au"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
          summary: Host is down (host02)
      - alert: HostAvailability
        ...

Query against collected metrics

Here is an example query to calculate the percentage of uptime over one day:

count_over_time(up{job="integrations/agent"}[1d:])
/ on () group_left
count_over_time(vector(1)[1d:])

Note: For this to work, every instance being monitored must have been up for at least one sampling interval during the time range selected. Otherwise, you will have to hardcode the label values in as many queries as you have instances.

Here’s an example of hardcoding the label values in a query:

sum by (agent_hostname) (
  sum_over_time(
    (
    0 * absent(up{job="integrations/agent",agent_hostname="host01"})
    or
    up{job="integrations/agent",agent_hostname="host01"}
    )[1d:]
  )
)
/ on () group_left
count_over_time(vector(1)[1d:])

Method 3: Use a blackbox exporter

You can do an ICMP check, for example with the Prometheus blackbox_exporter. This is a return to the pull model and, in a way, it defeats the purpose of using the Grafana Cloud Agent. Also, in some environments it may not be possible due to network firewall restrictions.

For this, you can create something similar to what is described in Alexandre’s blog post titled Supercharge your blackbox_exporter modules.
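
As a rough sketch of that approach (the target hostnames and the exporter address are illustrative; 9115 is the blackbox_exporter default port), a minimal ICMP module in the blackbox_exporter configuration could be:

modules:
  icmp:
    prober: icmp
    timeout: 5s

A scrape job that probes each host through the exporter might then look like this:

scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - host01.example.com
          - host02.example.com
    relabel_configs:
      # Pass the original target as the probe parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed host as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox_exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

You can then alert on probe_success == 0 for these targets.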