Monitor Grafana Agent
Grafana Agent uses a subset of Prometheus code. The Prometheus documentation provides a list of best practices related to monitoring Prometheus itself. This is called meta-monitoring.
Typically, Prometheus pulls metrics. The Grafana Agent reverses this into a push model: the agent installed on a monitoring target scrapes metrics locally and pushes them to the remote monitoring system, rather than that remote system polling (or pulling) metrics from a set of defined targets, as is the case with non-agent Prometheus.
With a pull model, it is straightforward to determine whether a node is available using an up metric with a value of 1 when the target is reachable and 0 when it is not.
With the agent’s push model, the up metric has a value of 1 when the agent is running and no value at all when it is not.
This distinction matters when determining whether your monitoring system is up or down right now. It also matters when determining whether the system has been running as expected over time, where computing a percentage of uptime requires both the 1 and 0 values: the sum of all values in a given time period divided by the count of values over that same period. The problem with missing data is that the alerting engine depends on the alert expression returning something in order to fire an alert.
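For example, in the pull model the uptime percentage over a week follows directly from those 1 and 0 samples. Here is a minimal sketch, assuming a conventional pull-model scrape job labeled job="node":

sum_over_time(up{job="node"}[7d]) / count_over_time(up{job="node"}[7d])

The equivalent avg_over_time(up{job="node"}[7d]) gives the same result. Neither works as-is with the agent's push model, because the 0 samples are never written.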
Monitoring host, device, or service uptime is typically done with black box monitoring or with a pull scraping model. The Grafana Agent reverses the Prometheus pull model into a push model without a database of expected targets, so monitoring up/downtime requires centralized knowledge of which entities are expected to exist. A static file dropped in the textfile collector directory and some complex Agent configuration and PromQL expressions can do the trick. Without such a database file, a clever query that “notices” the sudden disappearance of a time series and then settles back to normality after a set amount of time can also satisfy the requirement.
The following are some methods you can use or adapt to monitor the Grafana Agent, all originally developed as part of an article by Alexandre de Verteuil posted on the Grafana blog.
In all cases, the prerequisite is that you have deployed and configured the Grafana Agent.
Method 1: Use PromQL to create an alert
First, enable the agent integration to ensure the metric we are about to use is reported. Add the following to the integrations section of the agent configuration YAML file, if it is not already present:
integrations:
  agent:
    enabled: true
Create an alert using this PromQL expression, which will return a vector for a time series that suddenly stops existing:
max_over_time(up{job="integrations/agent"}[1h]) unless up
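For reference, here is a sketch of an alert rule that wraps this expression. The group name, alert name, for duration, and severity label are illustrative choices rather than prescribed values:

groups:
  - name: GrafanaAgentMetamonitoring
    rules:
      - alert: GrafanaAgentStoppedReporting
        expr: max_over_time(up{job="integrations/agent"}[1h]) unless up
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Grafana Agent on {{ $labels.instance }} stopped reporting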
Pros:
- Quick to implement
- Picks up a down host instantly
Cons:
- Only fires an alert for the duration of the range vector
- Any label values that change cause the expression to fire an alert
What this PromQL expression collects
The range vector in max_over_time() catches all series that have existed in that time range. We then use the unless logical operator, which is described in the Prometheus documentation:
vector1 unless vector2 results in a vector consisting of the elements of vector1 for which there are no elements in vector2 with exactly matching label sets. All matching elements in both vectors are dropped.
What this PromQL expression evaluates
If a series was present in the past hour and is currently present, it is not returned in the result set. If a series was present within the past hour and is currently not present, it is returned in the result set.
Thus, series that have existed recently and do not currently exist will be returned with a value of 1.
The alert will fire for at most 1 hour after the metric stops being reported, or whatever time range you set in the alert expression.
Method 2: Use the absent() function
In this method, we use the absent() function with a templating engine and some automation to create a rule group containing an alert rule for each entity we need to monitor.
From the Prometheus documentation:
The absent() function returns a 1-element vector with the value 1 if the vector passed to it has no elements. This is useful for alerting on when no time series exist for a given metric name and label combination.
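As a minimal sketch (host01 is a placeholder), the following expression returns a one-element vector with the value 1 only while no agent series exists for that host, and returns nothing otherwise:

absent(up{job="integrations/agent", agent_hostname="host01"})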
There are a couple ways to go about this.
Use the absent_alert mixin to programmatically create a group of alert rules
For this, use the absent_alert mixin in Kubernetes to create an individual Prometheus alert for each component that should be alerted on when it stops reporting.
Configure an alert rule group
Alternatively, create a script that generates the absent() alert rules to monitor agents, perhaps with output like this:
groups:
  - name: HostAvailability
    rules:
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host01",environment="dev",region="us"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
          summary: Host is down (host01)
      - alert: HostAvailability
        expr: absent(up{agent_hostname="host02",environment="prod",region="au"})
        for: 10m
        labels:
          severity: critical
        annotations:
          description: |-
            Host availability for VM resources
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
          summary: Host is down (host02)
      - alert: HostAvailability
        ...
Query against collected metrics
Here is an example query to calculate the percentage of uptime over one day:
count_over_time(up{job="integrations/agent"}[1d:])
/ on () group_left
count_over_time(vector(1)[1d:])
Note: For this to work, every instance being monitored must have been up for at least one sampling interval during the time range selected. Otherwise, you will have to hardcode the label values in as many queries as you have instances.
Here’s an example of hardcoding the label values in a query:
sum by (agent_hostname) (
  sum_over_time(
    (
      0 * absent(up{job="integrations/agent",agent_hostname="host01"})
      or
      up{job="integrations/agent",agent_hostname="host01"}
    )[1d:]
  )
)
/ on () group_left
count_over_time(vector(1)[1d:])
Method 3: Use a blackbox exporter
You can do an ICMP check, for example with the Prometheus blackbox_exporter. This is a return to the pull model, and in a way it defeats the purpose of using the Grafana Agent. Also, in some environments it may not be possible due to network firewall restrictions.
For this, create something similar to what is described in Alexandre’s blog post titled Supercharge your blackbox_exporter modules.
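As a rough sketch, a Prometheus-style scrape job for ICMP probes could look like the following. The module name, target hostnames, and blackbox_exporter address are assumptions to adapt to your environment, and an icmp module must also be defined in the exporter’s own configuration:

scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - host01.example.com
          - host02.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

In this model, the probe_success metric plays the role of up for availability alerting.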
Method 4: Use node_exporter’s textfile collector
Here we use the textfile collector from the Prometheus node_exporter and create a custom_server_info metric containing label values for all the instances we want to monitor. The purpose of this file is to provide the system with a database of nodes that are expected to be up. We distribute the file to all nodes so the information does not sit on a single point of failure. We use the Prometheus HA deduplication feature in Cortex to deduplicate the series. Then we use PromQL logic to synthesize a value of 0 when the up metric is absent.
This example uses the metric name custom_server_info to align with naming best practices. It adds some labels that will later override the external labels set by the Grafana Agent. Create labels with a common custom prefix for easy relabeling.
Create a custom metric
Here is an example of a custom metric with labels prefixed with server_ (newlines added for clarity):
custom_server_info{server_agent_hostname="host01",
                   server_cluster="...",
                   server_environment="...",
                   ...} 1
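For example, the contents of a file such as custom_server_info.prom placed in the textfile collector directory might look like this (the file name and all label values are hypothetical):

# HELP custom_server_info Hosts that are expected to be running the Grafana Agent.
# TYPE custom_server_info gauge
custom_server_info{server_agent_hostname="host01",server_cluster="cluster-a",server_environment="prod",server_region="eu"} 1
custom_server_info{server_agent_hostname="host02",server_cluster="cluster-b",server_environment="dev",server_region="us"} 1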
In the remote_write section of the Grafana Agent configuration, add write_relabel_configs configuration, like this:
remote_write:
  write_relabel_configs:
    - regex: "server_(.*)"
      replacement: "$1"
      action: labelmap
    - regex: "server_.*"
      action: labeldrop
This configuration uses the labelmap action to extract the part after server_ and overwrite the labels, omitting the server_ prefix.
Note: We first create a custom metric with the server_ prefix and then use relabeling rules to drop that prefix later. The following paragraphs explain why.
According to the Prometheus documentation, write relabeling is applied after external labels. This means that write_relabel_configs rules run after the external labels are added, which allows us to override the external labels by relabeling the server_* metric labels.
Our intent here is to ensure that the labels from the text file can override what is set in the agent configuration. With meta-monitoring, you do not want the agent’s own external labels to take precedence, because to be useful the series must carry only the information about the designated hosts. The file may describe hosts that exist in other regions, including the host that forwarded this monitoring information via remote_write.
For example, if we had three servers A, B, and C:
- Server A in region EU sets its external label region="EU" for most of its own metrics.
- When server A sends the custom_server_info metrics about server B in US-East and server C in APAC, server A’s region label is irrelevant.
- Servers B and C also send the same custom_server_info metrics about servers A, B, and C. For meta-monitoring, none of their region labels are relevant. You want to specify the region label in the text file.
So, to reiterate: to accomplish what we want, we must first prefix the labels with server_ and then do the relabeling after external labels are applied.
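To make the effect concrete, here is a hedged before/after sketch of a single series as it passes through the relabeling rules (the label values are hypothetical):

# Before write relabeling, with the forwarding agent's external label region="us":
custom_server_info{server_agent_hostname="host01",server_region="eu",region="us"} 1
# After labelmap ("server_(.*)" -> "$1") overrides region and adds agent_hostname, and labeldrop removes the prefixed labels:
custom_server_info{agent_hostname="host01",region="eu"} 1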
Handling duplicate series
We now have multiple instances of Grafana Agent sending the same time series but with a different instance label value (see replace_instance_label and use_hostname_label in the agent configuration reference).
There are two ways to deal with this duplication of series:
- When writing queries, aggregate by (agent_hostname). This is not ideal because this grouping needs to be done on every query (see the sketch after this list).
- Use the Prometheus HA deduplication mechanism built into Cortex and Grafana Cloud Metrics.
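A minimal sketch of the first option, collapsing the duplicate series sent by each agent into one series per host:

max by (agent_hostname) (custom_server_info)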
The deduplication of Prometheus HA pairs is introduced in another blog article and is documented in Grafana Cloud documentation and in Cortex documentation.
In Grafana Cloud, Prometheus HA deduplication is enabled by default. All you need to do is add the cluster and __replica__ labels to your samples.
Avoid a deduplication pitfall
The usual way to add the cluster and __replica__ labels is in the external_labels section. However, if we do that, all the metrics sent from every Grafana Agent except the one elected as leader will be dropped. We only want to deduplicate the custom_server_info metric.
Adding the cluster and __replica__ labels in the .prom file read by the textfile collector won’t work. Label names starting with a double underscore (__) are reserved for internal use. You can set them inside the Prometheus process with relabel configs, but you can’t set them in exporters.
Furthermore, deduplication happens on a per-series basis, but only the first sample in a write request is checked for the presence of the deduplication labels. If the labels are present in the first sample, the deduplication code engages, assuming that all the samples in the write request also have the labels. All the samples in a write request must therefore either have or not have the deduplication labels, because this feature was not designed for the use case of sending some metrics with deduplication and some metrics without.
The way around this is to configure two remote_write targets: one which drops every metric except custom_server_info and adds the cluster and __replica__ labels, and one which sends everything else without the cluster and __replica__ labels.
integrations:
  agent:
    enabled: true
  node_exporter:
    enabled: true
    textfile_directory: /var/local/node_exporter
  prometheus_remote_write:
    - url: <Your Metrics instance remote_write endpoint>
      basic_auth:
        username: $INSTANCE_ID
        password: $API_TOKEN
      write_relabel_configs:
        - source_labels: [__name__]
          regex: "custom_server_info"
          action: drop
    - url: <Your Metrics instance remote_write endpoint>
      basic_auth:
        username: $INSTANCE_ID
        password: $API_TOKEN
      write_relabel_configs:
        - source_labels: [__name__]
          regex: "custom_server_info"
          action: keep
        - target_label: cluster
          replacement: global
        - source_labels: [agent_hostname]
          target_label: __replica__
        - regex: "server_(.*)"
          replacement: "$1"
          action: labelmap
        - regex: "server_.*"
          action: labeldrop
        - regex: "instance"
          action: labeldrop
It’s important that each agent is sending the same custom_server_info metrics with the same labels. If any per-agent labels are being applied, they should be dropped in the second write_relabel_configs section. Otherwise, when a new leader is elected by Prometheus HA deduplication, new series will be created with the unique labels.
You can find the /api/prom/push URL, username, and password for your metrics endpoint by clicking on Details in the Prometheus card of the Cloud Portal.
Query against collected metrics
Now here are the queries that you can use with your custom_server_info metric. They mix the up{job="integrations/agent"} and custom_server_info metrics to return vectors with 0 values when the up values are missing.
A query that returns 0 or 1 values (instant up or down check):
custom_server_info * 0
unless on (agent_hostname)
up{job="integrations/agent"}
or on (agent_hostname)
custom_server_info
You can add the following recording rule to your Grafana Cloud Prometheus rules to simplify the alert and graphing queries. This saves the result of the above expression into a new metric named agent:custom_server_info:up, which you can then use in alert queries and Grafana panels.
groups:
  - name: Grafana Agent metamonitoring
    rules:
      - record: agent:custom_server_info:up
        expr: custom_server_info * 0 unless on (agent_hostname) up{job="integrations/agent"} or on (agent_hostname) custom_server_info
This instant query expression calculates the percentage of uptime over a period of time:
avg_over_time(agent:custom_server_info:up[$__range])
The $__range variable refers to the currently selected dashboard time range.
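With the recording rule in place, a down-host alert rule can be added to the same rule group. In this sketch, the alert name, for duration, and severity label are illustrative choices:

      - alert: GrafanaAgentHostDown
        expr: agent:custom_server_info:up == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Grafana Agent on {{ $labels.agent_hostname }} is down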
Without such a recording rule, you may find the following PromQL expressions useful in various use-cases.
A query that returns 0 or 1 values with a label selector:
custom_server_info{agent_hostname=~"$host"} * 0
unless on (agent_hostname)
up{job="integrations/agent"}
or on (agent_hostname)
custom_server_info{agent_hostname=~"$host"}
Because of the “or” boolean logic, if you want to query the availability of a subset of hosts, you need to add the filter on both sides of the “or” operator.
An alert expression that returns a vector (and fires an alert) when a host is down:
custom_server_info unless on (agent_hostname) up{job="integrations/agent"}
A query that computes the percentage of uptime over a period of time:
(
  count_over_time(
    (
      up{job="integrations/agent"}
      and on (agent_hostname)
      custom_server_info
    )[$__range:]
  )
  or on (agent_hostname)
  0 * custom_server_info
)
/ on (agent_hostname)
count_over_time(custom_server_info[$__range:])
We have to “or” the numerator with “0 * custom_server_info” because the count_over_time() function won’t return a vector for series that have 0 data points during the selected time range.
In the above expression, we used PromQL subqueries (the [1d:] syntax) to specify the sampling interval (which defaults to the global evaluation interval) in case the integrations/node_exporter and integrations/agent scrape jobs have different scrape intervals.
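For instance, to force an explicit 1-minute resolution rather than relying on the default evaluation interval, the subquery step can be given after the colon (the 1m step here is an arbitrary choice):

count_over_time(custom_server_info[1d:1m])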
A query that returns percentage of uptime with a label selector:
(
  count_over_time(
    (
      up{job="integrations/agent"}
      and on (agent_hostname)
      custom_server_info{agent_hostname=~"$host"}
    )[$__range:]
  )
  or on (agent_hostname)
  0 * custom_server_info{agent_hostname=~"$host"}
)
/ on (agent_hostname)
count_over_time(custom_server_info[$__range:])