Grafana Agent is an telemetry collector for sending metrics, logs, and trace data to the opinionated Grafana observability stack. It works best with:
- Grafana Cloud
- Grafana Enterprise Stack
- OSS deployments of Grafana Loki, Prometheus, Cortex, and Grafana Tempo
The Agent supports collecting telemetry data by utilizing the same battle-tested code from the official platforms. It uses Prometheus for metrics collection, Grafana Loki for log collection, and OpenTelemetry Collector for trace collection.
Grafana Agent uses less memory on average than Prometheus – by doing less (only focusing on
Grafana Agent allows for deploying multiple instances of the Agent in a cluster and only scraping metrics from targets that running at the same host. This allows distributing memory requirements across the cluster rather than pressurizing a single node.
Unlike Prometheus, the Grafana Agent is just targeting
so some Prometheus features, such as querying, local storage, recording rules,
and alerts aren’t present.
remote_write, service discovery, and relabeling
rules are included.
The Grafana Agent has a concept of an “instance”, each of which acts as
its own mini Prometheus agent with their own
scrape_configs section and
remote_write rules. More than one instance is useful when you want to have
completely separate configs that write to two different locations without
needing to worry about advanced metric relabeling rules. Multiple instances also
come into play for the Scraping Service Mode.
The Grafana Agent can be deployed in three modes:
The default deployment mode of the Grafana Agent is a drop-in
replacement for Prometheus
remote_write. The Agent will act similarly to a
single-process Prometheus, doing service discovery, scraping, and remote
Host Filtering mode is achieved by setting a
host_filter flag on a specific
instance inside the Agent’s configuration file. When this flag is set, the
instance will only scrape metrics from targets that are running on the same
machine as the instance itself. This is extremely useful to migrate to sharded
Prometheus instances in a Kubernetes cluster, where the Agent can be deployed as
a DaemonSet and distribute memory requirements across multiple nodes.
Note that Host Filtering mode and sharding your instances means that if an Agent’s metrics are being sent to an alerting system, alerts for that Agent may not be able to be generated if the entire node has problems. This changes the semantics of failure detection, and alerts would have to be configured to catch agents not reporting in.
The final mode, Scraping Service Mode
clusters a subset of agents. It acts as a go-between for the drop-in mode
(which does no automatic sharding) and
host_filter mode (which forces sharding
by node). The Scraping Service Mode clusters a set of agents with a set of
shared configs and distributes the scrape load automatically between them. For
more information, refer to (/docs/agent/latest/scraping-service/).
Host filtering configures Agents to scrape targets that are running on the same machine as the Grafana Agent process. It:
- Gets the hostname of the agent by the
HOSTNAMEenvironment variable or through the default.
- Checks if the hostname of the agent matches the label value for
__address__service-discovery-specific node labels against the discovered target.
If the filter passes, the target is allowed to be scraped. Otherwise, the target will be silently ignored and not scraped.
For detailed information on the host filtering mode, refer to the operation guide.
Grafana Agent supports collecting logs and sending them to Loki using its
loki subsystem. This is done using the upstream
Promtail client, which
is the official first-party log collection client created by the Loki
Grafana Agent supports collecting traces and sending them to Tempo using its
traces subsystem. This is done using the upstream OpenTelemetry Collector.
Agent can ingest OpenTelemetry, OpenCensus, Jaeger, Zipkin, or Kafka spans.
See documentation on how to configure receivers.
The agent is capable of exporting to any OpenTelemetry GRPC compatible system.
Comparison to alternatives
Grafana Agent is optimized for Grafana Cloud,
but can be used while using an on-prem
remote_write-compatible Prometheus API
and an on-prem Loki. Unlike alternatives, Grafana Agent extends the
official code with extra functionality. This allows the Agent to give an
experience closest to its official counterparts, unlike existing alternatives which
typically try to re-implement everything from scratch.
Why not just use Telegraf?
Telegraf is a fantastic project and was actually considered as an alternative to building our own agent. It could work, but ultimately it was not chosen due to lacking service discovery and metadata label propagation. While these features could theoretically be added to Telegraf as OSS contributions, there would be a lot of forced hacks involved due to its current design.
Additionally, Telegraf is a much larger project with its own goals for its community, so any changes need to fit the general use cases it was designed for.
With the Grafana Agent as its own project, we can deliver a more curated agent
specifically designed to work seamlessly with Grafana Cloud and other
remote_write compatible Prometheus endpoints as well as Loki for logs
and Tempo for traces, all-in-one.