Caution
Grafana Alloy is the new name for our distribution of the OTel collector. Grafana Agent has been deprecated and is in Long-Term Support (LTS) through October 31, 2025. Grafana Agent will reach an End-of-Life (EOL) on November 1, 2025. Read more about why we recommend migrating to Grafana Alloy.
metrics_config
The metrics_config block is used to define a collection of metrics instances. Each instance defines a collection of Prometheus-compatible scrape_configs and remote_write rules. Most users will only need to define one instance.
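For example, a minimal metrics block with a single instance might look like the following sketch. The remote_write URL and scrape target are placeholders.

metrics:
  global:
    scrape_interval: 1m
    remote_write:
      # Placeholder endpoint; replace with your Prometheus-compatible remote_write URL.
      - url: https://prometheus.example.com/api/v1/write
  configs:
    # A single metrics instance; most deployments only need one.
    - name: default
      scrape_configs:
        - job_name: agent
          static_configs:
            - targets: ['127.0.0.1:12345']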
scraping_service_config
The scraping_service block configures the scraping service, an operational mode where configurations are stored centrally in a KV store and a cluster of agents distributes discovery and scrape load between nodes.
# Whether to enable scraping service mode. When enabled, local configs
# cannot be used.
[enabled: <boolean> | default = false]
# Note: the next three configuration options have less-than-ideal names due to
# backwards compatibility.
# How often the agent should manually refresh the configuration. Useful if KV
# change events are not sent by an agent.
[reshard_interval: <duration> | default = "1m"]
# The timeout for configuration refreshes. This can occur on cluster events or
# on the reshard interval. A timeout of 0 indicates no timeout.
[reshard_timeout: <duration> | default = "30s"]
# The timeout for cluster reshard events. A timeout of 0 indicates no timeout.
[cluster_reshard_event_timeout: <duration> | default = "30s"]
# Configuration for the KV store to store configurations.
kvstore: <kvstore_config>
# When set, allows configs pushed to the KV store to specify configuration
# fields that can read secrets from files.
#
# This is disabled by default. When enabled, a malicious user can craft an
# instance config that reads arbitrary files on the machine the Agent runs
# on and sends its contents to a specifically crafted remote_write endpoint.
#
# If enabled, ensure that no untrusted users have access to the Agent API.
[dangerous_allow_reading_files: <boolean>]
# Configuration for how agents will cluster together.
lifecycler: <lifecycler_config>
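For example, enabling scraping service mode with a Consul-backed KV store might look like the following sketch. The Consul address is a placeholder, and the lifecycler is left at its defaults.

scraping_service:
  enabled: true
  kvstore:
    store: consul
    consul:
      # Placeholder address for the Consul server.
      host: consul.example.local:8500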
kvstore_config
The kvstore_config
block configures the KV store used as storage for
configurations in the scraping service mode.
# Which underlying KV store to use. Can be either consul or etcd
[store: <string> | default = ""]
# Key prefix to store all configurations with. Must end in /.
[prefix: <string> | default = "configurations/"]
# Configuration for a Consul client. Only applies if store
# is "consul"
consul:
# The hostname and port of Consul.
[host: <string> | default = "localhost:8500"]
# The ACL Token used to interact with Consul.
[acltoken: <string>]
# The HTTP timeout when communicating with Consul
[httpclienttimeout: <duration> | default = 20s]
# Whether or not consistent reads to Consul are enabled.
[consistentreads: <boolean> | default = true]
# Configuration for an ETCD v3 client. Only applies if
# store is "etcd"
etcd:
# The ETCD endpoints to connect to.
endpoints:
- <string>
# The Dial timeout for the ETCD connection.
[dial_timeout: <duration> | default = 10s]
# The maximum number of retries to do for failed ops to ETCD.
[max_retries: <int> | default = 10]
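As an illustration, a KV store backed by etcd might be configured like the following sketch. The endpoints are placeholders.

kvstore:
  store: etcd
  prefix: configurations/
  etcd:
    endpoints:
      # Placeholder etcd endpoints.
      - etcd-0.example.local:2379
      - etcd-1.example.local:2379
    dial_timeout: 10s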
lifecycler_config
The lifecycler_config
block configures the lifecycler; the component that
Agents use to cluster together.
scraping_service_client_config
The scraping_service_client_config
block configures how clustered Agents will
generate gRPC clients to connect to each other.
grpc_client_config:
# Maximum size in bytes the gRPC client will accept from the connected server.
[max_recv_msg_size: <int> | default = 104857600]
# Maximum size in bytes the gRPC client will send to the connected server.
[max_send_msg_size: <int> | default = 16777216]
# Whether messages should be gzipped.
[use_gzip_compression: <boolean> | default = false]
# The rate limit for gRPC clients; 0 means no rate limit.
[rate_limit: <float64> | default = 0]
# gRPC burst allowed for rate limits.
[rate_limit_burst: <int> | default = 0]
# Controls whether the client should retry the request when a rate limit is
# hit.
[backoff_on_ratelimits: <boolean> | default = false]
# Configures the retry backoff when backoff_on_ratelimits is
# true.
backoff_config:
# The minimum delay when backing off.
[min_period: <duration> | default = "100ms"]
# The maximum delay when backing off.
[max_period: <duration> | default = "10s"]
# The number of times to backoff and retry before failing.
[max_retries: <int> | default = 10]
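As an illustration, a client configuration that enables gzip compression and retries when rate limited might look like the following sketch. The rate limit values are arbitrary examples, not recommendations.

grpc_client_config:
  use_gzip_compression: true
  rate_limit: 5
  rate_limit_burst: 10
  backoff_on_ratelimits: true
  backoff_config:
    min_period: 100ms
    max_period: 10s
    max_retries: 10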
global_config
The global_config
block configures global values for all launched Prometheus
instances.
# How frequently should Prometheus instances scrape.
[scrape_interval: <duration> | default = "1m"]
# How long to wait before timing out a scrape from a target.
[scrape_timeout: <duration> | default = "10s"]
# A list of static labels to add for all metrics.
external_labels:
{ <string>: <string> }
# Default set of remote_write endpoints. If an instance doesn't define any
# remote_writes, it will use this list.
remote_write:
- [<remote_write>]
Note: For more information on remote_write, refer to the Prometheus documentation.
The following default values set by Grafana Agent Static Mode are different than the defaults set by Prometheus:
- remote_write: send_exemplars default value is true
- remote_write: queue_config: retry_on_http_429 default value is true
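For example, a global block that scrapes every 30 seconds, attaches a static cluster label, and sends to one default endpoint might look like the following sketch. The URL and label value are placeholders.

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  external_labels:
    cluster: production
  remote_write:
    # Placeholder endpoint used by any instance that doesn't define its own remote_write.
    - url: https://prometheus.example.com/api/v1/write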
metrics_instance_config
The metrics_instance_config block configures an individual metrics instance, which acts as its own mini Prometheus-compatible agent, though without support for the TSDB.
Note: For more information on the Prometheus-compatible types used in this block, refer to the Prometheus documentation.
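A minimal instance that scrapes one static target and defines its own remote_write endpoint might look like the following sketch. The job name, target, and URL are placeholders.

name: default
scrape_configs:
  - job_name: node
    static_configs:
      # Placeholder scrape target.
      - targets: ['localhost:9100']
remote_write:
  # Placeholder endpoint; omit this list to fall back to the global remote_write endpoints.
  - url: https://prometheus.example.com/api/v1/write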
Data retention
The prometheus.remote_write component uses a Write Ahead Log (WAL) to prevent data loss during network outages. The component buffers the received metrics in a WAL for each configured endpoint. The queue shards can use the WAL after the network outage is resolved and flush the buffered metrics to the endpoints.
The WAL records metrics in 128 MB files called segments. To avoid having a WAL that grows on-disk indefinitely, the component truncates its segments on a set interval.
On each truncation, the WAL deletes references to series that are no longer present and also checkpoints roughly the oldest two thirds of the segments (rounded down to the nearest integer) written to it since the last truncation period. A checkpoint means that the WAL only keeps track of the unique identifier for each existing metrics series, and can no longer use the samples for remote writing. If that data has not yet been pushed to the remote endpoint, it is lost.
This behavior dictates the data retention for the prometheus.remote_write component. It also means that it's impossible to directly correlate data retention to the data age itself, as the truncation logic works on segments, not the samples themselves. This makes data retention less predictable when the component receives an inconsistent rate of data.
The WAL block in Flow mode and the metrics configuration in Static mode contain configurable parameters that control the tradeoff between memory usage, disk usage, and data retention.
The truncate_frequency or wal_truncate_frequency parameter configures the interval at which truncations happen. A lower value leads to reduced memory usage, but also provides less resiliency to long outages.
When a WAL clean-up starts, the most recently successfully sent timestamp is
used to determine how much data is safe to remove from the WAL.
The min_keepalive_time or min_wal_time parameter controls the minimum age of samples considered for removal. No samples more recent than min_keepalive_time are removed. The max_keepalive_time or max_wal_time parameter controls the maximum age of samples that can be kept in the WAL. Samples older than max_keepalive_time are forcibly removed.
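For example, in Static mode these parameters can be set on a metrics instance. The following sketch shows the parameter names with illustrative values; treat the placement and values as an assumption to verify against your Agent version rather than a recommendation.

name: default
# Interval at which the WAL is truncated.
wal_truncate_frequency: 60m
# Samples newer than this are never removed from the WAL.
min_wal_time: 5m
# Samples older than this are forcibly removed from the WAL.
max_wal_time: 4h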
Extended remote_write outages
When the remote write endpoint is unreachable over a period of time, the most recent successfully sent timestamp is not updated. The min_keepalive_time and max_keepalive_time arguments control the age range of data kept in the WAL.
If the remote write outage is longer than the max_keepalive_time parameter, then the WAL is truncated, and the oldest data is lost.
Intermittent remote_write outages
If the remote write endpoint is intermittently reachable, the most recent successfully sent timestamp is updated whenever the connection is successful. A successful connection updates the series’ comparison with min_keepalive_time and triggers a truncation on the next truncate_frequency interval, which checkpoints two thirds of the segments (rounded down to the nearest integer) written since the previous truncation.
Falling behind
If the queue shards cannot flush data quickly enough to keep up with the most recent data buffered in the WAL, the component is said to be ‘falling behind’. It’s not unusual for the component to temporarily fall behind two or three scrape intervals. If the component falls behind by more than one third of the data written since the last truncate interval, it is possible for the truncate loop to checkpoint data before it has been pushed to the remote_write endpoint.
WAL corruption
WAL corruption can occur when Grafana Agent unexpectedly stops while the latest WAL segments are still being written to disk. For example, the host computer has a general disk failure and crashes before you can stop Grafana Agent and other running services. When you restart Grafana Agent, it verifies the WAL, removing any corrupt segments it finds. Sometimes, this repair is unsuccessful, and you must manually delete the corrupted WAL to continue.
If the WAL becomes corrupted, Grafana Agent writes error messages such as err="failed to find segment for index" to the log file.
Note: Deleting a WAL segment or a WAL file permanently deletes the stored WAL data.
To delete the corrupted WAL:
1. Stop Grafana Agent.
2. Find and delete the contents of the wal directory.
   By default, the wal directory is a subdirectory of the data-agent directory located in the Grafana Agent working directory. The WAL data directory may be different than the default depending on the wal_directory setting in your Static configuration file or the path specified by the Flow command line flag --storage-path.
   Note: There is one wal directory per:
   - Metrics instance running in Static mode
   - prometheus.remote_write component running in Flow mode
3. Start Grafana Agent and verify that the WAL is working correctly.
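If you are unsure where the WAL is stored in Static mode, it is controlled by the wal_directory setting in the metrics block; the following sketch uses a placeholder path.

metrics:
  # Placeholder path; by default the WAL lives under a data-agent subdirectory
  # of the Agent's working directory.
  wal_directory: /var/lib/grafana-agent/wal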