This is archived documentation for v1.5.0.

Self-monitoring

NOTE Self-monitoring is an experimental feature. As such, the configuration settings, command line flags, or specifics of the implementation are subject to change.

Overview

Since version 1.4, Grafana Enterprise Metrics (GEM) can record self-monitoring metrics directly, so that you can monitor the health and stability of GEM itself. The metrics GEM collects about itself are written to a built-in __system__ instance. You can query them as usual using tokens created under the built-in __system__ access policy.

With self-monitoring, any metric that GEM exposes via its /metrics endpoints is available directly in GEM without being scraped by an external process. These metrics would ordinarily need to be scraped using Prometheus or the Grafana Agent; with self-monitoring, they are available after you follow the quick setup described below.

The goal of this feature is to provide a simple, out-of-the-box way to monitor GEM itself with a minimum of configuration and extra dependencies. To get the full value of this feature, we recommend you install GEM’s Grafana plugin, which automatically provisions a set of dashboards that use the self-monitoring metrics. The dashboards are designed in line with Grafana Labs' best practices for understanding GEM system health. Self-monitoring is compatible with plugin versions >= 3.0.4 (which require Grafana 8). Grafana 7.5 users should use plugin version 2.1.1.

Configuration

The sections below describe the steps needed to set up self-monitoring.

Single binary mode

If you’re running GEM in single binary mode, you’ll need to do the following to enable each component to start emitting self-monitoring metrics.

Add the following section to your GEM configuration file used by each GEM pod or process.

instrumentation:
  enabled: true

Alternatively, add the command line flag -instrumentation.enabled=true to the arguments passed to each GEM pod or process.

In single binary mode, that should be all you need!

Microservices mode

To enable self-monitoring, you’ll need a hostname that you can use to address the gRPC port (9095 by default) of each of the GEM distributors. This could be a load balancer that balances between each distributor, a DNS A record that includes IPs for each distributor, or a Kubernetes service that balances between each gRPC port of the distributor pods. For the purposes of this example, we’ll assume that you are using a Kubernetes service and GEM is running in a namespace called enterprise-metrics.

Add the following section to your GEM configuration file used by each GEM pod or process.

instrumentation:
  enabled: true
  distributor_client:
    address: dns:///distributor.enterprise-metrics.svc.cluster.local:9095

Alternatively, add the following command line flags to the arguments passed to each GEM pod or process.

  • -instrumentation.enabled=true
  • -instrumentation.distributor-client.address='dns:///distributor.enterprise-metrics.svc.cluster.local:9095'

Verification

After you’ve deployed the configuration changes above, you’ll need to verify that self-monitoring is working correctly. We’ll learn how to query the self-monitoring metrics later, but to verify they’re working we can check a simple counter incremented when self-monitoring metrics are emitted.

Pick a single pod or process that is part of your GEM cluster. For this example, we’ll assume that you have picked an ingester.

Make a curl request to the /metrics endpoint of the ingester.

$ curl -s 'http://ingester-01.example.com/metrics' | grep 'cortex_self_monitoring_pushes_total'
# HELP cortex_self_monitoring_pushes_total Number of successes pushing self-monitoring metrics
# TYPE cortex_self_monitoring_pushes_total counter
cortex_self_monitoring_pushes_total 15

NOTE If you are running GEM in a Kubernetes cluster, individual pods might not be directly accessible from outside the Kubernetes cluster. In this case you can make the request from another pod running in the Kubernetes cluster, or you can make use of the kubectl port-forward command.

If the metric above is 0 or doesn’t exist, check the logs of each GEM component for errors or warnings related to pushing metrics to a distributor.
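The same check can be scripted. The following is a minimal sketch, not part of GEM: it assumes you have already fetched the text of the /metrics endpoint (for example with the curl command above) and that the counter is exposed without labels, as in the sample output. The function names read_counter and self_monitoring_ok are illustrative only.

```python
def read_counter(metrics_text, name="cortex_self_monitoring_pushes_total"):
    """Return the value of an unlabeled counter from /metrics text, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # An unlabeled sample looks like "<name> <value>".
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def self_monitoring_ok(metrics_text):
    """True only when the push counter exists and is greater than zero."""
    value = read_counter(metrics_text)
    return value is not None and value > 0


# Sample text copied from the curl output shown above.
sample = """\
# HELP cortex_self_monitoring_pushes_total Number of successes pushing self-monitoring metrics
# TYPE cortex_self_monitoring_pushes_total counter
cortex_self_monitoring_pushes_total 15
"""

print(self_monitoring_ok(sample))
```

A missing counter and a counter stuck at 0 are treated the same way here, matching the guidance above to inspect the component logs in either case.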

Querying

In order to query self-monitoring metrics directly, you’ll need to create a token associated with the __system__ access policy. The steps below assume you have already done this and copied down the token. The following examples further assume that your GEM cluster is available at the host gem.example.com over HTTPS.

First, set the token as a variable to use for the subsequent commands.

$ export API_TOKEN="the long token string you copied"

Next, we’ll make a request to the Prometheus query endpoint of GEM looking for a particular metric, in this case grafana_metrics_enterprise_build_info.

$ curl -s -u "__system__:$API_TOKEN" "https://gem.example.com/prometheus/api/v1/query?query=grafana_metrics_enterprise_build_info" | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "grafana_metrics_enterprise_build_info",
          "branch": "gem-release-1.4",
          "goversion": "go1.16.3",
          "instance": "ingester-01:80",
          "revision": "ccd12b7a",
          "target": "ingester",
          "version": "v1.4.1"
        },
        "value": [
          1622833381.751,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "grafana_metrics_enterprise_build_info",
          "branch": "gem-release-1.4",
          "goversion": "go1.16.3",
          "instance": "distributor-01:80",
          "revision": "ccd12b7a",
          "target": "distributor",
          "version": "v1.4.1"
        },
        "value": [
          1622833381.751,
          "1"
        ]
      },
      
      <...snip...>
    ]
  }
}

As you can see, querying self-monitoring metrics with GEM is the same process as querying any other type of metrics.
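If you prefer to script the query, the sketch below builds the same request using only the Python standard library and summarizes a response. It assumes the host gem.example.com and the basic-auth scheme used in the curl example; the function names and the placeholder token are hypothetical, and the response is a trimmed copy of the output above rather than the result of a live call.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, token, query):
    """Build (but do not send) an instant-query request authenticated as __system__."""
    url = (
        base_url.rstrip("/")
        + "/prometheus/api/v1/query?"
        + urllib.parse.urlencode({"query": query})
    )
    # Same credentials as `curl -u "__system__:$API_TOKEN"`: HTTP basic auth.
    credentials = base64.b64encode(f"__system__:{token}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {credentials}"})


def summarize_build_info(response_body):
    """Return (instance, target, version) for each series in a query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('status')!r}")
    return [
        (s["metric"].get("instance"), s["metric"].get("target"), s["metric"].get("version"))
        for s in payload["data"]["result"]
    ]


# Trimmed copy of the response shown above, used here instead of a live call.
sample_response = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "grafana_metrics_enterprise_build_info",
              "instance": "ingester-01:80", "target": "ingester", "version": "v1.4.1"},
   "value": [1622833381.751, "1"]}
]}}
"""

request = build_query_request(
    "https://gem.example.com", "example-token", "grafana_metrics_enterprise_build_info"
)
print(request.full_url)
print(summarize_build_info(sample_response))
```

To send the request against a real cluster, pass it to urllib.request.urlopen and feed the response body to summarize_build_info.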

Implementation

Though you don’t need to be familiar with how self-monitoring works at a technical level, it’s detailed below in the hopes that it’s useful.

Gathering

Self-monitoring metrics are gathered internally the same way metrics exposed via the /metrics endpoint are: they are registered with a Prometheus Registerer on application startup. The metrics are updated during the normal course of running the application and periodically (every 15 seconds by default) flushed directly to a distributor. Any metric available from the /metrics endpoint of a GEM component will also be available in the self-monitoring system.

The metrics are written to the distributor over its gRPC interface. This gives the self-monitoring system control over the exact instance the metrics are stored under, which lets it cleanly separate system metrics (under the __system__ instance) from user data.
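The gather-and-flush cycle described above can be pictured with a small model. This is a conceptual sketch only and is not GEM's implementation: the SelfMonitor class, its methods, and the push callback (standing in for the gRPC write to a distributor) are all hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class SelfMonitor:
    """Toy model: counters updated in-process and flushed on a fixed interval."""

    def __init__(self, push, start, flush_interval=15.0):
        self._metrics = {}               # stands in for the metrics registry
        self._push = push                # stands in for the write to a distributor
        self._interval = flush_interval  # 15 seconds, the default mentioned above
        self._last_flush = start

    def inc(self, name, amount=1.0):
        # Metrics are updated during the normal course of running the application.
        self._metrics[name] = self._metrics.get(name, 0.0) + amount

    def maybe_flush(self, now):
        # Flush a snapshot only once a full interval has elapsed.
        if now - self._last_flush < self._interval:
            return False
        # The writer names the instance explicitly, keeping system data separate.
        self._push("__system__", dict(self._metrics))
        self._last_flush = now
        return True


pushes = []
monitor = SelfMonitor(lambda instance, snapshot: pushes.append((instance, snapshot)), start=0.0)
monitor.inc("requests_total")
early = monitor.maybe_flush(10.0)    # less than 15 seconds elapsed: nothing is pushed
on_time = monitor.maybe_flush(15.0)  # a full interval elapsed: snapshot is pushed
print(early, on_time, pushes)
```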

Injected labels

Normally, when metrics are scraped by Prometheus, labels are automatically added by Prometheus that identify where the metrics came from. Since self-monitoring metrics are not scraped by any external system, labels are automatically added internally to help identify which component the metrics came from.

The following labels are added to metrics emitted by the self-monitoring system.

  • instance: this label is made up of the node or host name a component is running on in combination with the HTTP port used. For example, a value for this label in a GEM cluster running on Kubernetes might be ingester-1:80 or querier-5bf6ddccd7-hzbtn:80.

  • target: this label is made up of a comma-separated list of the targets a GEM process is running as (ingester, querier, etc.) or all in single binary mode.
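As an illustration of how these two label values are composed, here is a minimal sketch. The helper injected_labels is hypothetical and not part of GEM; it only mirrors the descriptions above, and the order of targets within the target label is an assumption.

```python
def injected_labels(hostname, http_port, targets):
    """Compose the instance and target labels as described above."""
    return {
        # Host (or pod) name combined with the HTTP port.
        "instance": f"{hostname}:{http_port}",
        # Comma-separated list of targets the process is running as.
        "target": ",".join(targets),
    }


print(injected_labels("querier-5bf6ddccd7-hzbtn", 80, ["querier"]))
print(injected_labels("gem-0", 80, ["all"]))  # single binary mode
```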

System instance and access policy

In order to cleanly separate self-monitoring data from user data, GEM comes with a built-in __system__ instance and __system__ access policy. All self-monitoring data is written to the __system__ instance. The self-monitoring data may be queried using tokens associated with the __system__ access policy. Because these are built into GEM itself, they cannot be removed. However, writing self-monitoring metrics to the system instance can be turned off using the flag -instrumentation.enabled=false or the associated configuration setting.

Recording rules

In order to use self-monitoring metrics to power the associated self-monitoring dashboards, the GEM ruler also includes built-in recording rules. These recording rules perform aggregations of self-monitoring metrics the same way the ruler aggregates other metrics. Because these recording rules are built into GEM itself, they cannot be removed. However, they can be turned off using the same flag that enables or disables self-monitoring, -instrumentation.enabled=false, or the associated configuration setting.

Overhead

Self-monitoring metrics are stored in GEM itself. Like any other metrics, they consume space in object storage. When enabled in microservices mode, each GEM component (ingester, querier, etc.) emits approximately 2000 series. GEM then duplicates these series in the ingesters according to the replication factor.

To understand how many series will be written under the __system__ instance as part of self-monitoring, you can use the following formula:

2000 * $NUMBER_OF_GEM_PROCESSES * $REPLICATION_FACTOR

Since these series are written to GEM in a similar way to other series, they’ll be deduplicated by the compactor in object storage to reduce the space required. To understand how many series will end up in object storage via the __system__ instance, you can use the following formula:

2000 * $NUMBER_OF_GEM_PROCESSES
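As a worked example of both formulas, the sketch below plugs in a hypothetical cluster of 10 GEM processes with a replication factor of 3. Both numbers are assumptions chosen for illustration, and 2000 is the approximate per-process figure quoted above, so the results are estimates.

```python
SERIES_PER_PROCESS = 2000  # approximate figure quoted above


def series_written(num_processes, replication_factor):
    """Series written under the __system__ instance, including replication."""
    return SERIES_PER_PROCESS * num_processes * replication_factor


def series_in_object_storage(num_processes):
    """Series left in object storage after the compactor deduplicates replicas."""
    return SERIES_PER_PROCESS * num_processes


# Hypothetical cluster: 10 GEM processes, replication factor 3.
print(series_written(10, 3))         # 60000
print(series_in_object_storage(10))  # 20000
```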