This guide helps you operate the Grafana Agent.
The core of Grafana Agent is considered stable and suitable for production use. Individual features of Grafana Agent fall into one of three stability categories:
Experimental: we are exploring a new use case and would like feedback. Experimental features are subject to frequent breaking changes during development. Experimental features may be removed with no equivalent replacement. Experimental features are always hidden behind feature flags. Unless removed, experimental features will eventually graduate to beta.
Beta: we are working on maturing a specific feature. Beta features may be subject to some breaking changes during development. Beta features may be replaced by equivalent functionality which covers that same use case. Beta features can be used without feature flags. Unless replaced by equivalent functionality, beta features will eventually graduate to stable.
Stable: we believe this functionality is stable, and breaking changes to configuration will be rare and well-documented. We will communicate a deprecation and removal timeline if a stable feature is chosen to be removed or replaced. Stable features can be used without feature flags.
There is a best-effort attempt to mark features as one of these three in documentation; open an issue if it’s not clear what the stability of a specific feature is.
There are three options to horizontally scale your deployment of Grafana Agents:
- Host filtering requires you to run one Agent on every machine you wish to collect metrics from. Agents will only collect metrics from the machines they run on.
- Hashmod sharding allows you to roughly shard the discovered set of targets by using hashmod/keep relabel rules.
- The scraping service allows you to cluster Grafana Agents and have them distribute per-tenant configs throughout the cluster.
Each has its own set of tradeoffs:
- Host filtering (Beta)
  - Pros
    - Does not need specialized configs per Agent
    - No external dependencies required to operate
  - Cons
    - Can cause significant load on service discovery APIs
    - Requires each Agent to have the same list of scrape configs/remote_writes
- Hashmod sharding (Stable)
  - Pros
    - Exact control on the number of shards to run
    - Smaller load on SD compared to host filtering (as there are fewer Agents)
    - No external dependencies required to operate
  - Cons
    - Each Agent must have a specialized config with its shard number inserted into the hashmod/keep relabel rule pair
    - Requires each Agent to have the same list of scrape configs/remote_writes, with the exception of the hashmod rule being different
    - Hashmod is not consistent hashing, so up to 100% of jobs will move to a new machine when scaling shards
- Scraping service (Beta)
  - Pros
    - Agents don’t have to have a synchronized set of scrape configs/remote_writes (they pull from a centralized location)
    - Exact control on the number of shards to run
    - Uses consistent hashing, so only 1/N jobs will move to a new machine when scaling shards
    - Smallest load on SD of the three options, as only one Agent is responsible for a config
  - Cons
    - Centralized configs must discover a minimal set of targets to distribute evenly
    - Requires running a separate KV store to store the centralized configs
    - Managing centralized configs adds operational burden over managing a config file
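The "only 1/N jobs will move" property of consistent hashing can be illustrated with a generic hash ring. This is a minimal sketch, not the scraping service's actual implementation: the `Ring` class, the `agent-N` and `tenant-N` names, and the choice of MD5 with 64 virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random positions.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._points]

    def owner(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's position.
        idx = bisect.bisect_right(self._keys, _hash(key)) % len(self._points)
        return self._points[idx][1]


# Hypothetical per-tenant config names.
configs = [f"tenant-{i}" for i in range(1000)]
old = Ring(["agent-0", "agent-1", "agent-2"])
new = Ring(["agent-0", "agent-1", "agent-2", "agent-3"])

moved = [c for c in configs if old.owner(c) != new.owner(c)]
print(f"{len(moved)} of {len(configs)} configs moved after adding agent-3")
```

In this sketch, a config only changes owner when the newly added Agent takes it over; configs never shuffle between the Agents that were already running.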
Host filtering (Beta)
Host filtering implements a form of “dumb sharding,” where operators may deploy one Grafana Agent instance per machine in a cluster, all using the same configuration, and the Grafana Agents will only scrape targets that are running on the same node as the Agent.
Setting host_filter: true means that if you have a target whose host
machine is not also running a Grafana Agent process, that target will not be scraped.
Host filtering is usually paired with a dedicated Agent process that is used for
scraping targets that are running outside of a given cluster. For example, when
running the Grafana Agent on GKE, you would have a DaemonSet with
host_filter for scraping in-cluster targets, and a single dedicated Deployment
for scraping other targets that are not running on a cluster node, such as the
Kubernetes control plane API.
If you want to scale your scrape load without host filtering, you may use the scraping service instead.
The host name of the Agent is determined by reading
isn’t defined, the Agent will use Go’s os.Hostname
to determine the hostname.
The following meta-labels are used to determine if a target is running on the same machine as the Agent:
The final label,
__host__, isn’t a label added by any Prometheus service
discovery mechanism. Rather,
__host__ can be generated by using
host_filter_relabel_configs. This allows for custom relabeling
rules to determine the hostname where the predefined ones fail. Relabeling rules in
host_filter_relabel_configs are temporary and just used for the
host_filtering mechanism. Full relabeling rules should be applied in the
Note that scrape_config
relabel_configs do not apply to the host filtering mechanism; only
host_filter_relabel_configs will work.
If the determined hostname matches any of the meta labels, the discovered target is allowed. Otherwise, the target is ignored, and will not show up in the targets API.
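The matching rule above can be sketched as a simple filter. This is only an illustration of the described behavior, not the Agent's code: `is_local_target` and `HOST_LABELS` are hypothetical names, and the label set used here (`__meta_kubernetes_pod_node_name` plus `__host__`) is an assumed subset chosen for the example.

```python
def is_local_target(hostname: str, labels: dict, host_labels: tuple) -> bool:
    """Keep a target only if one of the host-identifying labels equals the
    Agent's hostname."""
    return any(labels.get(name) == hostname for name in host_labels)


# Assumed label names for illustration; the Agent defines the real list.
HOST_LABELS = ("__meta_kubernetes_pod_node_name", "__host__")

# Hypothetical discovered targets and their labels.
discovered = [
    {"__address__": "10.0.0.5:9100", "__meta_kubernetes_pod_node_name": "node-a"},
    {"__address__": "10.0.1.7:9100", "__meta_kubernetes_pod_node_name": "node-b"},
    {"__address__": "10.0.2.9:8080", "__host__": "node-a"},
]

# An Agent running on "node-a" keeps only the targets on its own machine.
kept = [
    t["__address__"]
    for t in discovered
    if is_local_target("node-a", t, HOST_LABELS)
]
print(kept)
```

The third target carries no service discovery meta-label and is kept only because of its `__host__` label, which is the case host_filter_relabel_configs is meant to cover.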
Hashmod sharding (Stable)
Grafana Agents can be sharded by using a pair of hashmod/keep relabel rules. These rules hash the address of a target and take the result modulo the number of Agent shards that are running.
    scrape_configs:
      - job_name: some_job
        # Add usual service discovery here, such as static_configs
        relabel_configs:
          - source_labels: [__address__]
            modulus: 4 # 4 shards
            target_label: __tmp_hash
            action: hashmod
          - source_labels: [__tmp_hash]
            regex: ^1$ # This is the 2nd shard
            action: keep
Add these relabel_configs to all of your scrape_config blocks. Ensure that each
running Agent shard has a different value for the
regex; the first Agent shard should have
^0$, the second should have
^1$, and so on, up to the modulus minus one (^3$ when running 4 shards).
This sharding mechanism means each Agent will keep roughly 1/N of the total targets and ignore the rest, where N is the number of shards. This allows for horizontally scaling the number of Agents and distributing load between them.
Note that the hashmod used here is not a consistent hashing algorithm; this means that changing the number of shards may cause any number of targets to move to a new shard, up to 100%. When moving to a new shard, any existing data in the WAL from the old machine is effectively discarded.
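To see why changing the shard count reshuffles targets, the hashmod/keep pair can be approximated in a few lines. This is a minimal sketch that assumes hashmod takes the MD5 of the source label value and reduces the low 8 bytes, read as a big-endian integer, modulo the shard count; the target addresses and the `hashmod` and `assign` helpers are made up for illustration.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Approximate the hashmod relabel action (assumed: MD5, low 8 bytes,
    big-endian, then modulo)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def assign(targets, shards):
    """Map each target address to the index of the shard that keeps it."""
    return {t: hashmod(t, shards) for t in targets}


# Hypothetical target addresses, used only for illustration.
targets = [f"10.0.{i // 250}.{i % 250}:9100" for i in range(1000)]

before = assign(targets, 4)  # 4 shards, regexes ^0$ through ^3$
after = assign(targets, 5)   # scaled up to 5 shards

moved = sum(1 for t in targets if before[t] != after[t])
print(f"{moved / len(targets):.0%} of targets changed shard going from 4 to 5 shards")
```

A target stays on its shard only when its hash gives the same remainder under both moduli, which is why most targets change owner when the shard count changes.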
The Grafana Agent defines a concept of a Prometheus Instance, which is
its own mini Prometheus-lite server. The instance runs a combination of
Prometheus service discovery, scraping, a WAL for storage, and
remote_write.
Instances allow for fine-grained control of what data gets scraped and where it gets sent. Users can easily define two Instances that scrape different subsets of metrics and send them to two completely different remote_write systems.
Instances are especially relevant to the scraping service mode, where breaking up your scrape configs into multiple Instances is required for sharding and balancing scrape load across a cluster of Agents.
Instance sharing (Stable)
The v0.5.0 release of the Agent introduced the concept of instance sharing,
which combines scrape_configs from compatible instance configs into a single,
shared Instance. Instance configs are compatible when they have no differences
in configuration with the exception of what they scrape. Their remote_write configs
may also differ in the order in which endpoints are declared, but apart from ordering, the
remote_writes must still be an exact match.
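The compatibility rule can be pictured as grouping instance configs by a hash of everything except their name and what they scrape. The sketch below is an assumption-laden illustration, not the Agent's implementation: the dictionary layout, the `group_key` and `merge` helpers, the example URLs, and the use of MD5 over JSON are all hypothetical.

```python
import hashlib
import json


def group_key(instance: dict) -> str:
    """Hash every setting except the instance name and its scrape_configs.
    remote_write entries are sorted first so declaration order is ignored."""
    shared = {
        k: v for k, v in instance.items() if k not in ("name", "scrape_configs")
    }
    shared["remote_write"] = sorted(
        json.dumps(rw, sort_keys=True) for rw in instance.get("remote_write", [])
    )
    payload = json.dumps(shared, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def merge(instances):
    """Combine the scrape_configs of instances that share a group key."""
    groups = {}
    for inst in instances:
        groups.setdefault(group_key(inst), []).extend(inst.get("scrape_configs", []))
    return groups


# Hypothetical instance configs: a and b differ only in scrape_configs and
# in the order of their remote_write endpoints; c writes somewhere else.
a = {"name": "a", "scrape_configs": [{"job_name": "api"}],
     "remote_write": [{"url": "http://one.example/push"},
                      {"url": "http://two.example/push"}]}
b = {"name": "b", "scrape_configs": [{"job_name": "db"}],
     "remote_write": [{"url": "http://two.example/push"},
                      {"url": "http://one.example/push"}]}
c = {"name": "c", "scrape_configs": [{"job_name": "cache"}],
     "remote_write": [{"url": "http://three.example/push"}]}

groups = merge([a, b, c])
print({key[:6]: [sc["job_name"] for sc in scs] for key, scs in groups.items()})
```

Here `a` and `b` collapse into one shared instance that scrapes both jobs, while `c` stays on its own because its remote_write settings differ.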
In the shared instances mode, the
name field of
remote_write configs is
ignored. The resulting
remote_write configs will have a name identical to the
first six characters of the group name and the first six characters of the hash
remote_write config separated by a
The shared instances mode is the new default, and the previous behavior is
deprecated. If you wish to restore the old behavior, set
instance_mode: distinct in the
metrics_config block of
your config file.
Shared instances are completely transparent to the user with the exception of
exposed metrics. With
instance_mode: shared, metrics for Prometheus components
(WAL, service discovery, remote_write, etc.) have a label
whose value is the hash of all settings used to determine the shared instance. When
instance_mode: distinct is set, the metrics for Prometheus components will
instead have an
instance_name label, which matches the name set on the
individual Instance config. It is recommended to use the default of
instance_mode: shared unless you don’t mind the performance hit and really
need granular metrics.
Users can use the targets API to see all scraped targets, and the name of the shared instance they were assigned to.