Operation guide

Operation guide

This guide helps you operate the Grafana Agent.

Stability

The core of Grafana Agent is considered stable and suitable for production use. Features and other functionality that are subject to change and are not recommended for production use will be tagged as either “beta” or “experimental.”

Horizontal Scaling

There are three options to horizontally scale your deployment of Grafana Agents:

  • Host filtering requires you to run one Agent on every machine you wish to collect metrics from. Agents will only collect metrics from the machines they run on.
  • Hashmod sharding allows you to roughly shard the discovered set of targets by using hashmod/keep relabel rules.
  • The scraping service allows you to cluster Grafana Agents and have them distribute per-tenant configs throughout the cluster.

Each has their own set of tradeoffs:

  • Host Filtering
    • Pros
      • Does not need specialized configs per agent
      • No external dependencies required to operate
    • Cons
      • Can cause significant load on service discovery APIs
      • Requires each Agent to have the same list of scrape configs/remote_writes
  • Hashmod sharding
    • Pros
      • Exact control on the number of shards to run
      • Smaller load on SD compared to host filtering (as there are a smaller # of Agents)
      • No external dependencies required to operate
    • Cons
      • Each Agent must have a specialized config with their shard number inserted into the hashmod/keep relabel rule pair.
      • Requires each Agent to have the same list of scrape configs/remote_writes, with the exception of the hashmod rule being different.
      • Hashmod is not consistent hashing, so up to 100% of jobs will move to a new machine when scaling shards.
  • Scraping service
    • Pros
      • Agents don’t have to have a synchronized set of scrape configs / remote_writes (they pull from a centralized location).
      • Exact control on the number of shards to run.
      • Uses consistent hashing, so only 1/N jobs will move to a new machine when scaling shards.
      • Smallest load on SD compared to host filtering, as only one Agent is responsible for a config.
    • Cons
      • Centralized configs must discover a minimal set of targets to distribute evenly.
      • Requires running a separate KV store to store the centralized configs.
      • Managing centralized configs adds operational burden over managing a config file.

Host filtering

Host filtering implements a form of “dumb sharding,” where operators may deploy one Grafana Agent instance per machine in a cluster, all using the same configuration, and the Grafana Agents will only scrape targets that are running on the same node as the Agent.

Running with host_filter: true means that if you have a target whose host machine is not also running a Grafana Agent process, that target will not be scraped!

Host filtering is usually paired with a dedicated Agent process that is used for scraping targets that are running outside of a given cluster. For example, when running the Grafana Agent on GKE, you would have a DaemonSet with host_filter for scraping in-cluster targets, and a single dedicated Deployment for scraping other targets that are not running on a cluster node, such as the Kubernetes control plane API.

If you want to scale your scrape load without host filtering, you may use the scraping service instead.

The host name of the Agent is determined by reading $HOSTNAME. If $HOSTNAME isn’t defined, the Agent will use Go’s os.Hostname to determine the hostname.

The following meta-labels are used to determine if a target is running on the same machine as the target:

  • __address__
  • __meta_consul_node
  • __meta_dockerswarm_node_id
  • __meta_dockerswarm_node_hostname
  • __meta_dockerswarm_node_address
  • __meta_kubernetes_pod_node_name
  • __meta_kubernetes_node_name
  • __host__

The final label, __host__, isn’t a label added by any Prometheus service discovery mechanism. Rather, __host__ can be generated by using host_filter_relabel_configs. This allows for custom relabeling rules to determine the hostname where the predefined ones fail. Relabeling rules added with host_filter_relabel_configs are temporary and just used for the host_filtering mechanism. Full relabeling rules should be applied in the appropriate scrape_config instead.

Note that scrape_config relabel_configs do not apply to the host filtering logic; only host_filter_relabel_configs will work.

If the determined hostname matches any of the meta labels, the discovered target is allowed. Otherwise, the target is ignored, and will not show up in the targets API.

Hashmod sharding

Grafana Agents can be sharded by using a pair of hashmod/keep relabel rules. These rules will hash the address of a target and modulus it with the number of Agent shards that are running.

scrape_configs:
- job_name: some_job
  # Add usual service discovery here, such as static_configs
  relabel_configs:
  - source_labels: [__address__]
    modulus:       4    # 4 shards
    target_label:  __tmp_hash
    action:        hashmod
  - source_labels: [__tmp_hash]
    regex:         ^1$  # This is the 2nd shard
    action:        keep

Add the relabel_configs to all of your scrape_config blocks. Ensure that each running Agent shard has a different value for the regex; the first Agent shard should have ^0$, the second should have ^1$, and so on, up to ^3$.

This sharding mechanism means each Agent will ignore roughly 1/N of the total targets, where N is the number of shards. This allows for horizontal scaling the number of Agents and distributing load between them.

Note that the hashmod used here is not a consistent hashing algorithm; this means that changing the number of shards may cause any number of targets to move to a new shard, up to 100%. When moving to a new shard, any existing data in the WAL from the old machine is effectively discarded.

Prometheus instances

The Grafana Agent defines a concept of a Prometheus Instance, which is its own mini Prometheus-lite server. The instance runs a combination of Prometheus service discovery, scraping, a WAL for storage, and remote_write.

Instances allow for fine grained control of what data gets scraped and where it gets sent. Users can easily define two Instances that scrape different subsets of metrics and send them to two completely different remote_write systems.

Instances are especially relevant to the scraping service mode, where breaking up your scrape configs into multiple Instances is required for sharding and balancing scrape load across a cluster of Agents.

Instance sharing

The v0.5.0 release of the Agent introduced the concept of instance sharing, which combines scrape_configs from compatible instance configs into a single, shared Instance. Instance configs are compatible when they have no differences in configuration with the exception of what they scrape. remote_write configs may also differ in the order which endpoints are declared, but the unsorted remote_writes must still be an exact match.

In the shared instances mode, the name field of remote_write configs is ignored. The resulting remote_write configs will have a name identical to the first six characters of the group name and the first six characters of the hash from that remote_write config separated by a -.

The shared instances mode is the new default, and the previous behavior is deprecated. If you wish to restore the old behavior, set instance_mode: distinct in the metrics_config block of your config file.

Shared instances are completely transparent to the user with the exception of exposed metrics. With instance_mode: shared, metrics for Prometheus components (WAL, service discovery, remote_write, etc) have a instance_group_name label, which is the hash of all settings used to determine the shared instance. When instance_mode: distinct is set, the metrics for Prometheus components will instead have an instance_name label, which matches the name set on the individual Instance config. It is recommended to use the default of instance_mode: shared unless you don’t mind the performance hit and really need granular metrics.

Users can use the targets API to see all scraped targets, and the name of the shared instance they were assigned to.