Self-monitoring
NOTE Self-monitoring is an experimental feature. As such, the configuration settings, command line flags, or specifics of the implementation are subject to change.
Overview
Since version 1.4, Grafana Enterprise Metrics (GEM) can record self-monitoring metrics directly, so you can monitor the health and stability of GEM itself. The metrics GEM collects about itself are written to a built-in __system__ tenant and can be queried as usual using tokens created under the built-in __system__ access policy. Since version 1.8, GEM also records exemplars as part of self-monitoring metrics.
With self-monitoring, any metrics available from GEM via its /metrics endpoints are available directly in GEM without being scraped by an external process. These metrics would ordinarily need to be scraped using Prometheus or the Grafana Agent; with self-monitoring they are available after you follow the quick setup described below.
This feature provides a simple, out-of-the-box way to monitor GEM itself with a minimum of configuration and extra dependencies. To get the most value from this feature, we recommend installing the GEM Grafana plugin, which automatically provisions a set of dashboards that use the self-monitoring metrics. The dashboards follow Grafana Labs’ best practices for understanding GEM system health. Self-monitoring is compatible with plugin versions >= 3.0.4 (which require Grafana 8). Grafana 7.5 users should use plugin version 2.1.1.
Configuration
The sections below describe the steps needed to set up self-monitoring.
Single binary mode
Self-monitoring is enabled by default; no action is necessary in single binary mode.
Microservices mode
To use self-monitoring in microservices mode, you need a hostname that addresses the gRPC port (9095 by default) of each GEM distributor. This could be a load balancer that balances between the distributors, a DNS A record that includes the IPs of each distributor, or a Kubernetes service that balances between the gRPC ports of the distributor pods. This example assumes that you are using a Kubernetes service and that GEM is running in a namespace called enterprise-metrics.
Add the following section to the GEM configuration file used by each GEM pod or process.
instrumentation:
  distributor_client:
    address: dns:///distributor.enterprise-metrics.svc.cluster.local:9095
Alternatively, add the following command line flag to the arguments passed to each GEM pod or process.
-instrumentation.distributor-client.address='dns:///distributor.enterprise-metrics.svc.cluster.local:9095'
The configuration above gives you system health metrics for the entire GEM cluster. To better understand GEM behavior, you also need to understand resource usage at a per-tenant level. To collect the self-monitoring metrics that show this (and to populate the “Per Tenant Usage” dashboards provisioned by the GEM plugin), you must also deploy the overrides-exporter component.
Exemplars
Since GEM 1.8, self-monitoring can record exemplars directly. However, recording exemplars under the __system__ tenant is still controlled by the same limits applied to all other tenants. This means that recording exemplars for the __system__ tenant is disabled by default (as it is for all tenants) and must be enabled using the runtime configuration file or enabled globally.
Because the __system__ tenant is built into GEM itself and immutable, limits for it (such as enabling exemplars) cannot be set using the Admin API. Instead, to emit exemplars for the __system__ tenant, you must either override the max_global_exemplars_per_user setting for the __system__ tenant using the runtime configuration file or enable exemplars globally.
Here is an example of using the runtime configuration file:
overrides:
  __system__:
    max_global_exemplars_per_user: 300000
Verification
After you’ve deployed the configuration changes above, verify that self-monitoring is working correctly. Querying the self-monitoring metrics is covered later; to verify that they are working, you can check a simple counter that is incremented when self-monitoring metrics are emitted.
Pick a single pod or process that is part of your GEM cluster. For this example, we’ll assume that you have picked an ingester.
Make a curl request to the /metrics endpoint of the ingester.
$ curl -s 'http://ingester-01.example.com/metrics' | grep 'cortex_self_monitoring_pushes_total'
# HELP cortex_self_monitoring_pushes_total Number of successes pushing self-monitoring metrics
# TYPE cortex_self_monitoring_pushes_total counter
cortex_self_monitoring_pushes_total 15
NOTE If you are running GEM in a Kubernetes cluster, individual pods might not be directly accessible from outside the Kubernetes cluster. In this case you can make the request from another pod running in the Kubernetes cluster, or you can make use of the kubectl port-forward command.
If the metric above is 0 or doesn’t exist, check the logs of each GEM component for errors or warnings related to pushing metrics to a distributor.
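The check above can also be scripted. The following is a minimal sketch, assuming you have already saved the text returned by the /metrics endpoint; the helper names read_counter and self_monitoring_ok are hypothetical, and the parser only handles the unlabelled sample shown in the example output.

```python
def read_counter(metrics_text, name):
    """Return the value of an unlabelled sample called `name`, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def self_monitoring_ok(metrics_text):
    """True when the push counter exists and is greater than zero."""
    value = read_counter(metrics_text, "cortex_self_monitoring_pushes_total")
    return value is not None and value > 0


# Sample input copied from the curl output above.
sample = """\
# HELP cortex_self_monitoring_pushes_total Number of successes pushing self-monitoring metrics
# TYPE cortex_self_monitoring_pushes_total counter
cortex_self_monitoring_pushes_total 15
"""
print(self_monitoring_ok(sample))  # True
```

A result of False corresponds to the "0 or doesn’t exist" case, which is the cue to inspect the component logs.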
Querying
To query self-monitoring metrics directly, you need to create a token associated with the __system__ access policy. The steps below assume you have already done this and copied the token. The following examples further assume that your GEM cluster is available at the host gem.example.com over HTTPS.
First, set the token as a variable to use for the subsequent commands.
$ export API_TOKEN="the long token string you copied"
Next, make a request to the Prometheus query endpoint of GEM for a particular metric, in this case grafana_metrics_enterprise_build_info.
$ curl -s -u "__system__:$API_TOKEN" "https://gem.example.com/prometheus/api/v1/query?query=grafana_metrics_enterprise_build_info" | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "grafana_metrics_enterprise_build_info",
          "branch": "gem-release-1.4",
          "goversion": "go1.16.3",
          "instance": "ingester-01:80",
          "revision": "ccd12b7a",
          "target": "ingester",
          "version": "v1.4.1"
        },
        "value": [
          1622833381.751,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "grafana_metrics_enterprise_build_info",
          "branch": "gem-release-1.4",
          "goversion": "go1.16.3",
          "instance": "distributor-01:80",
          "revision": "ccd12b7a",
          "target": "distributor",
          "version": "v1.4.1"
        },
        "value": [
          1622833381.751,
          "1"
        ]
      },
      <...snip...>
    ]
  }
}
As you can see, querying self-monitoring metrics with GEM is the same process as querying any other type of metrics.
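If you prefer to query from a script rather than curl, the same request can be sketched with the Python standard library. This is only an illustration under the assumptions used above (the gem.example.com host and a token for the __system__ access policy); the function names are hypothetical, and run_query needs network access to a real GEM cluster, so it is defined but not called here.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, token, query):
    """Build an instant-query request using basic auth, like `curl -u`."""
    params = urllib.parse.urlencode({"query": query})
    url = base_url.rstrip("/") + "/prometheus/api/v1/query?" + params
    credentials = base64.b64encode(f"__system__:{token}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": "Basic " + credentials})


def instance_targets(response_body):
    """Map each `instance` label to its `target` label in a query response."""
    payload = json.loads(response_body)
    return {r["metric"]["instance"]: r["metric"]["target"] for r in payload["data"]["result"]}


def run_query(base_url, token, query):
    """Send the request and return the raw JSON body (requires a reachable cluster)."""
    with urllib.request.urlopen(build_query_request(base_url, token, query)) as resp:
        return resp.read().decode()


# Offline demonstration: only builds the request, nothing is sent.
req = build_query_request("https://gem.example.com", "example-token",
                          "grafana_metrics_enterprise_build_info")
print(req.full_url)
```

Passing the body returned by run_query to instance_targets gives a quick view of which components are reporting, for example ingester-01:80 mapped to ingester.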
Implementation
You don’t need to be familiar with how self-monitoring works at a technical level to use it, but the details are described below in case they are useful.
Gathering
Self-monitoring metrics are gathered internally the same way as metrics exposed via the /metrics endpoint: they are registered with a Prometheus Registerer on application startup. The metrics are updated during the normal course of running the application and periodically (every 15 seconds by default) flushed directly to a distributor. Any metric available from the /metrics endpoint of a GEM component is also available in the self-monitoring system.
The metrics are written to the distributor over its gRPC interface. This gives the self-monitoring system control over the exact tenant the metrics are stored under, which lets it cleanly separate system metrics (under the __system__ tenant) from user data.
Injected labels
Normally, when metrics are scraped by Prometheus, labels are automatically added by Prometheus that identify where the metrics came from. Since self-monitoring metrics are not scraped by any external system, labels are automatically added internally to help identify which component the metrics came from.
The following labels are added to metrics emitted by the self-monitoring system.
instance: this label is made up of the node or host name a component is running on in combination with the HTTP port used. For example, a value for this label in a GEM cluster running on Kubernetes might be ingester-1:80 or querier-5bf6ddccd7-hzbtn:80.
target: this label is made up of a comma-separated list of the targets a GEM process is running as (ingester, querier, etc.), or all in single binary mode.
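To make the shape of these labels concrete, here is a small illustrative sketch. It is not GEM’s implementation; the function injected_labels and its parameters are hypothetical and simply follow the description above.

```python
def injected_labels(hostname, http_port, targets):
    """Compose the two injected labels as described in the text (illustrative only)."""
    return {
        # Host or node name combined with the HTTP port.
        "instance": f"{hostname}:{http_port}",
        # Comma-separated list of targets; pass ["all"] for single binary mode.
        "target": ",".join(targets),
    }


print(injected_labels("ingester-1", 80, ["ingester"]))
# {'instance': 'ingester-1:80', 'target': 'ingester'}
print(injected_labels("gem-0", 80, ["all"]))
# {'instance': 'gem-0:80', 'target': 'all'}
```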
System tenant and access policy
To cleanly separate self-monitoring data from user data, GEM comes with a built-in __system__ tenant and __system__ access policy. All self-monitoring data is written to the __system__ tenant and can be queried using tokens associated with the __system__ access policy. Because these are built into GEM itself, they cannot be removed. However, writing self-monitoring metrics to the system tenant can be turned off using the flag -instrumentation.enabled=false or the associated configuration setting.
Recording rules
To power the associated self-monitoring dashboards, the GEM ruler also includes built-in recording rules. These recording rules aggregate self-monitoring metrics the same way the ruler aggregates other metrics. Because these recording rules are built into GEM itself, they cannot be removed. However, they can be turned off using the same flag that disables self-monitoring, -instrumentation.enabled=false, or the associated configuration setting.
Overhead
Self-monitoring metrics are stored in GEM itself. Like any other metrics, they consume space in object storage. When enabled in microservices mode, each GEM component (ingester, querier, etc.) emits approximately 2000 series. GEM then duplicates these series in the ingesters according to the replication factor.
To estimate how many series will be written under the __system__ tenant as part of self-monitoring, you can use the following formula:
2000 * $NUMBER_OF_GEM_PROCESSES * $REPLICATION_FACTOR
Because these series are written to GEM in a similar way to other series, the compactor deduplicates them in object storage to reduce the space required. To estimate how many series will end up in object storage for the __system__ tenant, you can use the following formula:
2000 * $NUMBER_OF_GEM_PROCESSES
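As a worked example of both formulas, the sketch below plugs in a hypothetical cluster of 10 GEM processes with a replication factor of 3. Both numbers are assumptions chosen for illustration, and 2000 is the approximate per-process figure quoted above.

```python
SERIES_PER_PROCESS = 2000  # approximate figure from the text


def series_written(num_processes, replication_factor):
    """Series written under the __system__ tenant, including ingester replication."""
    return SERIES_PER_PROCESS * num_processes * replication_factor


def series_in_object_storage(num_processes):
    """Series remaining in object storage after compactor deduplication."""
    return SERIES_PER_PROCESS * num_processes


print(series_written(10, 3))         # 60000
print(series_in_object_storage(10))  # 20000
```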