
Grafana Agent Operator architecture

Grafana Agent Operator works by watching for Kubernetes custom resources that specify how to collect telemetry data from your Kubernetes cluster and where to send it. Agent Operator manages corresponding Grafana Agent deployments in your cluster by watching those custom resources for changes.

Grafana Agent Operator works in two phases—it discovers a hierarchy of custom resources and it reconciles that hierarchy into a Grafana Agent deployment.

Custom resource hierarchy

The root of the custom resource hierarchy is the GrafanaAgent resource, the primary resource Agent Operator looks for. GrafanaAgent is called the root because it discovers the sub-resources MetricsInstance and LogsInstance. The GrafanaAgent resource defines the Grafana Agent image and endows the discovered instances with the Pod attributes set in the GrafanaAgent specification, for example Pod requests, limits, affinities, and tolerations. Pod attributes can only be defined at the GrafanaAgent level; they are propagated to the MetricsInstance and LogsInstance Pods.
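For orientation, the following is a minimal sketch of a GrafanaAgent resource that sets the image and some Pod attributes. The resource names, labels, and image tag are placeholders, and the `resources` and `tolerations` field names are assumptions to be verified against the custom resource definition reference.

```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
  namespace: monitoring
spec:
  image: grafana/agent:v0.26.1      # Grafana Agent image used for all generated Pods (placeholder tag)
  serviceAccountName: grafana-agent
  resources:                        # assumed field: Pod requests/limits propagated to instance Pods
    requests:
      cpu: 100m
      memory: 256Mi
  tolerations:                      # assumed field: tolerations propagated to instance Pods
    - operator: Exists
      effect: NoSchedule
  metrics:
    instanceSelector:               # label selector used to discover MetricsInstances
      matchLabels:
        instance: primary
  logs:
    instanceSelector:               # label selector used to discover LogsInstances
      matchLabels:
        instance: primary
```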

The full hierarchy of custom resources is as follows:

  • GrafanaAgent
    • MetricsInstance
      • PodMonitor
      • Probe
      • ServiceMonitor
    • LogsInstance
      • PodLogs

The following table describes these custom resources:

| Custom resource | Description |
| --- | --- |
| GrafanaAgent | Discovers one or more MetricsInstance and LogsInstance resources. |
| MetricsInstance | Defines where to ship collected metrics. This rolls out a Grafana Agent StatefulSet that will scrape and ship metrics to a remote_write endpoint. |
| ServiceMonitor | Collects cAdvisor and kubelet metrics. This configures the MetricsInstance / Agent StatefulSet. |
| LogsInstance | Defines where to ship collected logs. This rolls out a Grafana Agent DaemonSet that will tail log files on your cluster nodes. |
| PodLogs | Collects container logs from Kubernetes Pods. This configures the LogsInstance / Agent DaemonSet. |

Most Grafana Agent Operator resources can reference a ConfigMap or a Secret. All referenced ConfigMaps and Secrets are added to the resource hierarchy.
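As an example of such a reference, a MetricsInstance commonly points at a Secret for remote_write credentials. The following is a hedged sketch; the endpoint URL and the Secret name and keys are placeholders.

```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  namespace: monitoring
  labels:
    instance: primary
spec:
  remoteWrite:
    - url: https://prometheus-example.grafana.net/api/prom/push   # placeholder endpoint
      basicAuth:
        username:
          name: metrics-credentials    # placeholder Secret name; pulled into the hierarchy
          key: username
        password:
          name: metrics-credentials
          key: password
```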

When a hierarchy is established, each item is watched for changes. Any changed item causes a reconcile of the root GrafanaAgent resource, either creating, modifying, or deleting the corresponding Grafana Agent deployment.

A single resource can belong to multiple hierarchies. For example, if two GrafanaAgents use the same Probe, modifying that Probe causes both GrafanaAgents to be reconciled.

To set up monitoring, Grafana Agent Operator works in the following two phases:

  • Builds (discovers) a hierarchy of custom resources.
  • Reconciles that hierarchy into a Grafana Agent deployment.

Agent Operator also performs sharding and replication and adds labels to every metric.

How Agent Operator builds the custom resource hierarchy

Grafana Agent Operator builds the hierarchy using label matching on the custom resources. For example, a GrafanaAgent resource picks up the MetricsInstance and LogsInstance resources that match the label instance: primary, and the instances pick up PodMonitors, Probes, ServiceMonitors, and PodLogs the same way, through their own label selectors.
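Continuing the GrafanaAgent sketch above, which selects instances labeled instance: primary, the instance side of the matching might look like the sketch below. Names, namespaces, and labels are illustrative, and the selector field names are assumed from the MetricsInstance custom resource definition.

```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  namespace: monitoring
  labels:
    instance: primary              # matched by the GrafanaAgent's metrics instanceSelector
spec:
  serviceMonitorSelector:          # picks up ServiceMonitors labeled instance: primary;
    matchLabels:                   # PodMonitors and Probes are selected analogously via
      instance: primary            # podMonitorSelector and probeSelector
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    instance: primary              # matched by the MetricsInstance above
spec:
  selector:
    matchLabels:
      app: my-app                  # Services to scrape (placeholder label)
  endpoints:
    - port: http-metrics           # placeholder port name on the Service
```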

To validate the Secrets

The generated configurations are saved in Secrets. To download and validate them manually, use the following commands:

$ kubectl get secrets <???>-logs-config -o json | jq -r '.data."agent.yml"' | base64 --decode
$ kubectl get secrets <???>-config -o json | jq -r '.data."agent.yml"' | base64 --decode

How Agent Operator reconciles the custom resource hierarchy

When a resource hierarchy is created, updated, or deleted, a reconcile occurs. When a GrafanaAgent resource is deleted, the corresponding Grafana Agent deployment will also be deleted.

Reconciling creates the following cluster resources:

  1. A Secret that holds the Grafana Agent configuration is generated.
  2. A Secret that holds all referenced Secrets or ConfigMaps from the resource hierarchy is generated. This ensures that Secrets referenced from a custom resource in another namespace can still be read.
  3. A Service is created to govern the StatefulSets that are generated.
  4. One StatefulSet per Prometheus shard is created.

PodMonitors, Probes, and ServiceMonitors are turned into individual scrape jobs which all use Kubernetes Service Discovery (SD).
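The exact generated configuration is internal to the operator, but a rough, simplified sketch of one generated scrape job might look like the following. The job name, namespace, and labels are illustrative, and the real generated configuration contains many more relabeling rules.

```yaml
scrape_configs:
  - job_name: serviceMonitor/monitoring/my-app/0   # illustrative generated job name
    kubernetes_sd_configs:
      - role: endpoints              # ServiceMonitors discover scrape targets via Endpoints objects
        namespaces:
          names:
            - monitoring
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: my-app                # keep only targets behind Services matching the monitor's selector
        action: keep
```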

Sharding and replication

The GrafanaAgent resource can specify a number of shards. Each shard results in the creation of a StatefulSet with a hashmod + keep relabel_config per job:

```yaml
# Hash each target's address, then keep only the targets whose hash maps to
# this StatefulSet's shard number.
- source_labels: [__address__]
  target_label: __tmp_hash
  modulus: NUM_SHARDS
  action: hashmod
- source_labels: [__tmp_hash]
  regex: CURRENT_STATEFULSET_SHARD
  action: keep
```

This allows for horizontal scaling, where each shard handles roughly 1/N of the total scrape load. Note that the assignment does not use consistent hashing, so changing the number of shards causes anywhere from 1/N of the targets up to all of them to be reassigned to a different shard.

The sharding mechanism is borrowed from the Prometheus Operator.

The number of replicas can be defined similarly to the number of shards. Each shard is then duplicated across the configured number of replicas. Because every replica scrapes and ships the same data, replication must be paired with a remote_write system that can perform HA deduplication. Grafana Cloud and Mimir provide this out of the box, and the Grafana Agent Operator defaults support these two systems.

The total number of created metrics pods will be the product of numShards * numReplicas.
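A hedged sketch of how this might be configured is shown below; the shards and replicas field names are assumed from the GrafanaAgent metrics specification and should be checked against the custom resource definition reference.

```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
spec:
  metrics:
    shards: 3        # assumed field: three StatefulSets, each scraping roughly 1/3 of the targets
    replicas: 2      # assumed field: every shard is duplicated, so 3 * 2 = 6 metrics Pods in total
    instanceSelector:
      matchLabels:
        instance: primary
```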

Added labels

Two labels are added by default to every metric:

  • cluster, representing the GrafanaAgent deployment. Holds the value of <GrafanaAgent.metadata.namespace>/<GrafanaAgent.metadata.name>.
  • __replica__, representing the replica number of the Agent. This label works out of the box with Grafana Cloud and Cortex’s HA deduplication.

The shard number is not added as a label, as sharding is designed to be transparent on the receiver end.
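As a rough illustration, the two labels typically surface as external labels in the generated Agent configuration, similar to the fragment below. The exact placement in the generated configuration and the replica value format are assumptions; only the label names and the cluster value format come from the description above.

```yaml
global:
  external_labels:
    cluster: monitoring/grafana-agent   # <GrafanaAgent namespace>/<GrafanaAgent name>
    __replica__: replica-0              # assumed value format; dropped downstream by HA deduplication
```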