Important: This documentation is about an older version. It's relevant only to the release noted; many of the features and functions have been updated or replaced. Please view the current version.

Self-monitoring

NOTE Self-monitoring is an experimental feature. As such, the configuration settings, command line flags, or specifics of the implementation are subject to change.

Overview

Since version 1.4, Grafana Enterprise Metrics (GEM) includes the ability to directly record self-monitoring metrics to allow you to easily monitor the health and stability of GEM itself. The metrics GEM collects about itself are written to a built-in __system__ tenant. The metrics written can be queried as usual using tokens created under the built-in __system__ access policy. Since version 1.8, GEM directly records exemplars as part of self-monitoring metrics.

The way self-monitoring works ensures that any metrics available from GEM via /metrics endpoints will be available directly in GEM without needing to be scraped by an external process. While these metrics would ordinarily need to be scraped using Prometheus or the Grafana Agent, with self-monitoring they will be available after following the quick setup described below.

This feature provides a simple, out-of-the-box way to monitor GEM itself with a minimum of configuration and extra dependencies. To get the most value from this feature, we recommend you install GEM’s Grafana plugin, which automatically provisions a set of dashboards that use the self-monitoring metrics. The dashboards are in line with Grafana Labs’ best practices for understanding GEM system health. Self-monitoring is compatible with plugin versions >= 3.0.4 (which require Grafana 8). Grafana 7.5 users should use plugin version 2.1.1.

Configuration

The sections below describe the steps needed to set up self-monitoring.

Single binary mode

Self-monitoring is enabled by default. No action is necessary in single binary mode.

Microservices mode

In order to use self-monitoring in microservices mode, you’ll need a hostname that you can use to address the gRPC port (9095 by default) of each of the GEM distributors. This could be a load balancer that balances between each distributor, a DNS A record that includes IPs for each distributor, or a Kubernetes service that balances between each gRPC port of the distributor pods. For the purposes of this example, we’ll assume that you are using a Kubernetes service and GEM is running in a namespace called enterprise-metrics.

Add the following section to the GEM configuration file used by each GEM pod or process.

yaml
instrumentation:
  distributor_client:
    address: dns:///distributor.enterprise-metrics.svc.cluster.local:9095

Alternatively, add the following command-line flag to the arguments passed to each GEM pod or process.

  • -instrumentation.distributor-client.address='dns:///distributor.enterprise-metrics.svc.cluster.local:9095'

The configuration above gives you system health metrics about the entire GEM cluster. To better understand GEM behavior, you also need to understand resource usage at a per-tenant level. To get the self-monitoring metrics that expose this usage (and populate the “Per Tenant Usage” dashboards provisioned by the GEM plugin), you must also deploy the overrides-exporter component.

Exemplars

Since GEM 1.8, self-monitoring has the ability to directly record exemplars. However, recording of the exemplars under the __system__ tenant is still controlled by the same limits applied to all other tenants. This means that recording of exemplars for the __system__ tenant is disabled by default (as it is for all tenants) and must be enabled using the runtime configuration file or enabled globally.

Because the __system__ tenant is built into GEM itself and is immutable, its limits (such as the one that enables exemplars) cannot be set using the Admin API. Instead, to emit exemplars for the __system__ tenant, you must either override the max_global_exemplars_per_user setting for the __system__ tenant using the runtime configuration file or enable exemplars globally.

Here is an example of using the runtime configuration file:

yaml
overrides:
  __system__:
    max_global_exemplars_per_user: 300000

Verification

After you’ve deployed the configuration changes above, verify that self-monitoring is working correctly. Querying the self-monitoring metrics is covered later; for a quick check, you can look at a simple counter that is incremented when self-monitoring metrics are emitted.

Pick a single pod or process that is part of your GEM cluster. For this example, we’ll assume that you have picked an ingester.

Make a curl request to the /metrics endpoint of the ingester.

$ curl -s 'http://ingester-01.example.com/metrics' | grep 'cortex_self_monitoring_pushes_total'
# HELP cortex_self_monitoring_pushes_total Number of successes pushing self-monitoring metrics
# TYPE cortex_self_monitoring_pushes_total counter
cortex_self_monitoring_pushes_total 15

NOTE If you are running GEM in a Kubernetes cluster, individual pods might not be directly accessible from outside the Kubernetes cluster. In this case you can make the request from another pod running in the Kubernetes cluster, or you can make use of the kubectl port-forward command.

If the metric above is 0 or doesn’t exist, check the logs of each GEM component for errors or warnings related to pushing metrics to a distributor.
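The check above can also be scripted. The following is a minimal sketch, not part of GEM, that parses the text returned by a /metrics endpoint and reports whether the counter is present and greater than zero. The function names are hypothetical, and the parser assumes the counter is exposed without labels, as in the sample output above.

```python
from typing import Optional

# Sample /metrics output, copied from the curl example in this section.
SAMPLE = """\
# HELP cortex_self_monitoring_pushes_total Number of successes pushing self-monitoring metrics
# TYPE cortex_self_monitoring_pushes_total counter
cortex_self_monitoring_pushes_total 15
"""


def read_counter(metrics_text: str, name: str) -> Optional[float]:
    """Return the value of the unlabelled sample `name`, or None if it is absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def pushes_look_healthy(metrics_text: str) -> bool:
    """True when the self-monitoring push counter exists and is above zero."""
    value = read_counter(metrics_text, "cortex_self_monitoring_pushes_total")
    return value is not None and value > 0


print(pushes_look_healthy(SAMPLE))  # True
```

In practice you would pass in the body fetched from a component's /metrics endpoint instead of the embedded sample.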

Querying

In order to query self-monitoring metrics directly, you’ll need to create a token associated with the __system__ access policy. The steps below assume you have already done this and copied down the token. The following examples further assume that your GEM cluster is available at the host gem.example.com over HTTPS.

First, set the token as a variable to use for the subsequent commands.

$ export API_TOKEN="the long token string you copied"

Next, make a request to the Prometheus query endpoint of GEM for a particular metric, in this case grafana_metrics_enterprise_build_info.

$ curl -s -u "__system__:$API_TOKEN" "https://gem.example.com/prometheus/api/v1/query?query=grafana_metrics_enterprise_build_info" | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "grafana_metrics_enterprise_build_info",
          "branch": "gem-release-1.4",
          "goversion": "go1.16.3",
          "instance": "ingester-01:80",
          "revision": "ccd12b7a",
          "target": "ingester",
          "version": "v1.4.1"
        },
        "value": [
          1622833381.751,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "grafana_metrics_enterprise_build_info",
          "branch": "gem-release-1.4",
          "goversion": "go1.16.3",
          "instance": "distributor-01:80",
          "revision": "ccd12b7a",
          "target": "distributor",
          "version": "v1.4.1"
        },
        "value": [
          1622833381.751,
          "1"
        ]
      },

      <...snip...>
    ]
  }
}

As you can see, querying self-monitoring metrics with GEM is the same process as querying any other type of metrics.
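If you prefer a script to curl and jq, the same query could be sketched in Python using only the standard library. This is an illustration under the same assumptions as the example above (host gem.example.com, a token for the __system__ access policy, HTTP basic authentication as used by curl -u); the function names are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_query_request(base_url: str, token: str, promql: str) -> urllib.request.Request:
    """Build an instant-query request authenticated as the __system__ tenant."""
    query = urllib.parse.urlencode({"query": promql})
    url = f"{base_url}/prometheus/api/v1/query?{query}"
    # Equivalent to curl's -u "__system__:$API_TOKEN" (HTTP basic auth).
    credentials = base64.b64encode(f"__system__:{token}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {credentials}"})


def summarize_build_info(response: dict) -> list:
    """Return (target, instance, version) for each series in a query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    return [
        (s["metric"].get("target"), s["metric"].get("instance"), s["metric"].get("version"))
        for s in response["data"]["result"]
    ]


def fetch_build_info(base_url: str, token: str) -> list:
    """Run the query against a live cluster (performs a network call)."""
    request = build_query_request(base_url, token, "grafana_metrics_enterprise_build_info")
    with urllib.request.urlopen(request) as resp:
        return summarize_build_info(json.load(resp))


# Offline demonstration: only builds the request, nothing is sent.
demo = build_query_request(
    "https://gem.example.com", "example-token", "grafana_metrics_enterprise_build_info"
)
print(demo.full_url)
```

Calling fetch_build_info("https://gem.example.com", token) would return one tuple per GEM process reporting the metric.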

Implementation

Though you don’t need to be familiar with how self-monitoring works at a technical level, the details are described below in case they are useful.

Gathering

Self-monitoring metrics are gathered internally the same way metrics exposed via the /metrics endpoint are: they are registered with a Prometheus Registerer on application startup. The metrics are updated during the normal course of running the application and periodically (every 15 seconds by default) flushed directly to a distributor. Any metric available from the /metrics endpoint of a GEM component will also be available in the self-monitoring system.

The metrics are written to the distributor over its gRPC interface. This gives the self-monitoring system control over the exact tenant the metrics are stored under, which enables it to cleanly separate system metrics (under the __system__ tenant) from user data.
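As a conceptual model only, the gather-and-flush cycle described above might look like the following sketch. It is not GEM's implementation: the SelfMonitor class, the push callable (standing in for the gRPC write to a distributor), and the metric name are all hypothetical. Only the 15 second default interval and the __system__ tenant come from the text.

```python
import threading
from typing import Callable, Dict

SYSTEM_TENANT = "__system__"
FLUSH_INTERVAL_SECONDS = 15.0  # default flush interval stated in the text

# Hypothetical stand-in for the distributor write: (tenant, samples) -> None
PushFn = Callable[[str, Dict[str, float]], None]


class SelfMonitor:
    """Keeps counters in memory and periodically pushes a snapshot of them."""

    def __init__(self, push: PushFn, interval: float = FLUSH_INTERVAL_SECONDS) -> None:
        self._push = push
        self._interval = interval
        self._values: Dict[str, float] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def inc(self, name: str, amount: float = 1.0) -> None:
        """Update a counter during the normal course of running the application."""
        with self._lock:
            self._values[name] = self._values.get(name, 0.0) + amount

    def flush_once(self) -> None:
        """Push the current values under the system tenant."""
        with self._lock:
            snapshot = dict(self._values)
        self._push(SYSTEM_TENANT, snapshot)

    def run(self) -> None:
        """Flush every interval until stop() is called."""
        # Event.wait returns False on timeout and True once the event is set.
        while not self._stop.wait(self._interval):
            self.flush_once()

    def stop(self) -> None:
        self._stop.set()


received = []
monitor = SelfMonitor(push=lambda tenant, samples: received.append((tenant, samples)))
monitor.inc("requests_total")
monitor.inc("requests_total")
monitor.flush_once()
print(received)  # [('__system__', {'requests_total': 2.0})]
```

Because the writer chooses the tenant on every push, system metrics never mix with user data, which is the separation the text describes.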

Injected labels

Normally, when metrics are scraped by Prometheus, labels are automatically added by Prometheus that identify where the metrics came from. Since self-monitoring metrics are not scraped by any external system, labels are automatically added internally to help identify which component the metrics came from.

The following labels are added to metrics emitted by the self-monitoring system.

  • instance: this label is made up of the node or host name a component is running on in combination with the HTTP port used. For example, a value for this label in a GEM cluster running on Kubernetes might be ingester-1:80 or querier-5bf6ddccd7-hzbtn:80.

  • target: this label is made up of a comma-separated list of the targets a GEM process is running as (ingester, querier, etc.), or all in single binary mode.
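To make the shape of these two labels concrete, here is a small illustrative sketch. The injected_labels function is hypothetical and is not GEM code; in particular, the order of targets within the comma-separated list is an assumption (input order is preserved here), and gem-0 is a made-up host name.

```python
from typing import Dict, Sequence


def injected_labels(hostname: str, http_port: int, targets: Sequence[str]) -> Dict[str, str]:
    """Compose the instance and target labels as described in the list above."""
    return {
        # Node or host name combined with the HTTP port.
        "instance": f"{hostname}:{http_port}",
        # Comma-separated list of the targets the process runs as.
        "target": ",".join(targets),
    }


# A dedicated ingester in microservices mode.
print(injected_labels("ingester-1", 80, ["ingester"]))
# A single binary deployment, which runs as the "all" target.
print(injected_labels("gem-0", 80, ["all"]))
```

These labels are what you would filter on when querying, for example to select only series emitted by ingesters.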

System tenant and access policy

In order to cleanly separate self-monitoring data from user data, GEM comes with a built-in __system__ tenant and __system__ access policy. All self-monitoring data is written to the __system__ tenant. The self-monitoring data may be queried using tokens associated with the __system__ access policy. Because these are built into GEM itself, they cannot be removed. However, writing self-monitoring metrics to the system tenant can be turned off using the flag -instrumentation.enabled=false or the associated configuration setting.

Recording rules

In order to use self-monitoring metrics to power the associated self-monitoring dashboards, the GEM ruler also includes built-in recording rules. These recording rules perform aggregations of self-monitoring metrics the same way the ruler aggregates other metrics. Because these recording rules are built into GEM itself, they cannot be removed. However, they can be turned off using the same flag that enables or disables self-monitoring, -instrumentation.enabled=false, or the associated configuration setting.

Overhead

Self-monitoring metrics are stored in GEM itself. Like any other metrics, they consume space in object storage. When enabled in microservices mode, each GEM component process (ingester, querier, etc.) emits approximately 2000 series. GEM then duplicates these series in the ingesters according to the replication factor.

To understand how many series will be written under the __system__ tenant as part of self-monitoring, you can use the following formula:

2000 * $NUMBER_OF_GEM_PROCESSES * $REPLICATION_FACTOR

Since these series are written to GEM in a similar way to other series, they’ll be deduplicated by the compactor in object storage to reduce the space required. To understand how many series will end up in object storage via the __system__ tenant, you can use the following formula:

2000 * $NUMBER_OF_GEM_PROCESSES
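As a worked example of the two formulas above, the following sketch computes both figures for a hypothetical cluster of 10 GEM processes with a replication factor of 3. The cluster size and replication factor are illustrative assumptions, and 2000 is the approximate per-process figure from the text, so treat the results as estimates.

```python
SERIES_PER_PROCESS = 2000  # approximate figure from the text


def series_written(num_processes: int, replication_factor: int) -> int:
    """Series written under the __system__ tenant, including ingester replication."""
    return SERIES_PER_PROCESS * num_processes * replication_factor


def series_in_object_storage(num_processes: int) -> int:
    """Series left in object storage after the compactor deduplicates replicas."""
    return SERIES_PER_PROCESS * num_processes


# Hypothetical cluster: 10 GEM processes, replication factor 3.
print(series_written(10, 3))         # 60000
print(series_in_object_storage(10))  # 20000
```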