Grafana Agent Operator works by watching for Kubernetes custom resources that specify how to collect telemetry data from your Kubernetes cluster and where to send it. Agent Operator manages corresponding Grafana Agent deployments in your cluster by watching for changes against the custom resources.
Grafana Agent Operator works in two phases—it discovers a hierarchy of custom resources and it reconciles that hierarchy into a Grafana Agent deployment.
Custom resource hierarchy
The root of the custom resource hierarchy is the GrafanaAgent resource, the primary resource Agent Operator looks for. GrafanaAgent is called the root because it discovers other sub-resources, MetricsInstance and LogsInstance. The GrafanaAgent resource endows them with Pod attributes defined in the GrafanaAgent specification, for example, Pod requests, limits, affinities, and tolerations, and defines the Grafana Agent image. You can only define Pod attributes at the GrafanaAgent level. They are propagated to MetricsInstance and LogsInstance Pods.
The full hierarchy of custom resources is as follows:
- GrafanaAgent
  - MetricsInstance
    - PodMonitor
    - Probe
    - ServiceMonitor
  - LogsInstance
    - PodLogs
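The following table describes these custom resources:
|Custom resource|Description|
|---|---|
|GrafanaAgent|Discovers one or more MetricsInstance and LogsInstance resources.|
|MetricsInstance|Defines where to ship collected metrics. This rolls out a Grafana Agent StatefulSet that will scrape and ship metrics to a remote_write endpoint.|
|ServiceMonitor|Collects cAdvisor and kubelet metrics. This configures the MetricsInstance / Agent StatefulSet.|
|LogsInstance|Defines where to ship collected logs. This rolls out a Grafana Agent DaemonSet that will tail log files on your cluster nodes.|
|PodLogs|Collects container logs from Kubernetes Pods. This configures the LogsInstance / Agent DaemonSet.|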
Most of the Grafana Agent Operator resources have the ability to reference a ConfigMap or a Secret. All referenced ConfigMaps or Secrets are added into the resource hierarchy.
When a hierarchy is established, each item is watched for changes. Any changed item causes a reconcile of the root GrafanaAgent resource, which creates, modifies, or deletes the corresponding Grafana Agent deployment.
A single resource can belong to multiple hierarchies. For example, if two GrafanaAgents use the same Probe, modifying that Probe causes both GrafanaAgents to be reconciled.
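The relationship between a changed resource and the roots it triggers can be pictured as a reverse index. The following is a minimal sketch, not the operator's actual implementation: the resource names, the namespace, and the helper functions are all hypothetical and chosen only to illustrate how one shared Probe leads to two reconciles.

```python
from collections import defaultdict

# Hypothetical hierarchies: each root GrafanaAgent (namespace/name) maps to
# the resources it discovered. The names are illustrative only.
hierarchies = {
    "monitoring/agent-a": {"MetricsInstance/primary", "Probe/blackbox"},
    "monitoring/agent-b": {"MetricsInstance/secondary", "Probe/blackbox"},
}


def build_reverse_index(hierarchies):
    """Map each resource to the set of roots whose hierarchy contains it."""
    index = defaultdict(set)
    for root, members in hierarchies.items():
        for member in members:
            index[member].add(root)
    return index


def roots_to_reconcile(index, changed_resource):
    """Return the roots that need a reconcile when changed_resource changes."""
    return sorted(index.get(changed_resource, set()))


index = build_reverse_index(hierarchies)

# The shared Probe belongs to both hierarchies, so both roots are reconciled.
print(roots_to_reconcile(index, "Probe/blackbox"))
# A resource used by a single root only triggers that root.
print(roots_to_reconcile(index, "MetricsInstance/primary"))
```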
To set up monitoring, Grafana Agent Operator works in the following two phases:
- Builds (discovers) a hierarchy of custom resources.
- Reconciles that hierarchy into a Grafana Agent deployment.
Agent Operator also performs sharding and replication and adds labels to every metric.
How Agent Operator builds the custom resource hierarchy
Grafana Agent Operator builds the hierarchy using label matching on the custom resources. The following figure illustrates the matching. The GrafanaAgent picks up the MetricsInstance and LogsInstance that match the label instance: primary. The instances pick up the resources the same way.
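As a rough illustration of the label matching described above, the following sketch filters resources by an equality-based selector. It assumes simple key/value matching only (the equivalent of a Kubernetes matchLabels selector) and does not model expression-based selectors. The resource names and the matches and discover helpers are hypothetical.

```python
def matches(selector, labels):
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def discover(selector, resources):
    """Return the names of the resources whose labels satisfy the selector."""
    return [r["name"] for r in resources if matches(selector, r["labels"])]


# Hypothetical resources in the cluster; only the labels matter here.
resources = [
    {"kind": "MetricsInstance", "name": "primary-metrics", "labels": {"instance": "primary"}},
    {"kind": "LogsInstance", "name": "primary-logs", "labels": {"instance": "primary"}},
    {"kind": "MetricsInstance", "name": "other-metrics", "labels": {"instance": "secondary"}},
]

# A GrafanaAgent selecting on instance: primary picks up the first two only.
picked = discover({"instance": "primary"}, resources)
print(picked)
```

The same kind of check would apply one level down, where an instance selects its own sub-resources by label.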
To validate the Secrets
The generated configurations are saved in Secrets. To download and validate them manually, use the following commands:
$ kubectl get secrets <???>-logs-config -o json | jq -r '.data."agent.yml"' | base64 --decode
$ kubectl get secrets <???>-config -o json | jq -r '.data."agent.yml"' | base64 --decode
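For scripted validation, the jq and base64 steps of those commands can be reproduced in a few lines. This is a sketch under assumptions: it takes the JSON text that kubectl get secrets ... -o json would print, and the sample Secret and its payload are placeholders built in the script rather than real operator output. The decode_agent_config helper is hypothetical.

```python
import base64
import json


def decode_agent_config(secret_json, key="agent.yml"):
    """Extract one key from a Secret's JSON and base64-decode its value."""
    secret = json.loads(secret_json)
    return base64.b64decode(secret["data"][key]).decode("utf-8")


# Placeholder Secret for illustration; a real one comes from kubectl output.
placeholder_config = "server:\n  log_level: info\n"
sample_secret = json.dumps(
    {
        "kind": "Secret",
        "data": {
            "agent.yml": base64.b64encode(placeholder_config.encode("utf-8")).decode("ascii")
        },
    }
)

decoded = decode_agent_config(sample_secret)
print(decoded)
```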
How Agent Operator reconciles the custom resource hierarchy
When a resource hierarchy is created, updated, or deleted, a reconcile occurs. When a GrafanaAgent resource is deleted, the corresponding Grafana Agent deployment is also deleted.
Reconciling creates the following cluster resources:
- A Secret that holds the Grafana Agent configuration.
- A Secret that holds all referenced Secrets or ConfigMaps from the resource hierarchy. This ensures that Secrets referenced from a custom resource in another namespace can still be read.
- A Service that governs the generated StatefulSets.
- One StatefulSet per Prometheus shard.
PodMonitors, Probes, and ServiceMonitors are turned into individual scrape jobs which all use Kubernetes Service Discovery (SD).
Sharding and replication
The GrafanaAgent resource can specify a number of shards. Each shard results in the creation of a StatefulSet with a hashmod + keep relabel_config per job:
- source_labels: [__address__]
  target_label: __tmp_hash
  modulus: NUM_SHARDS
  action: hashmod
- source_labels: [__tmp_hash]
  regex: CURRENT_STATEFULSET_SHARD
  action: keep
This allows for horizontal scaling, where each shard handles roughly 1/N of the total scrape load. Note that this does not use consistent hashing, so changing the number of shards can move a large share of the targets, potentially nearly all of them, to a different shard.
The sharding mechanism is borrowed from the Prometheus Operator.
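The effect of the hashmod and keep rules can be simulated outside the cluster. The sketch below assumes that hashmod takes the MD5 digest of the source label value, interprets the last 8 bytes as a big-endian integer, and applies the modulus, which is how Prometheus relabeling is understood to compute it. The target addresses are made up, and the hashmod and assign helpers are hypothetical.

```python
import hashlib
from collections import Counter


def hashmod(value, modulus):
    """Assumed hashmod: MD5 of the value, last 8 bytes as an integer, modulo."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def assign(targets, num_shards):
    """Map each target address to the single shard whose keep rule matches it."""
    return {target: hashmod(target, num_shards) for target in targets}


# 1000 made-up scrape addresses standing in for __address__ values.
targets = [f"10.0.{i // 256}.{i % 256}:9100" for i in range(1000)]

before = assign(targets, 3)  # NUM_SHARDS = 3
after = assign(targets, 4)   # NUM_SHARDS = 4

per_shard = Counter(before.values())
moved = sum(1 for target in targets if before[target] != after[target])

print("targets per shard with 3 shards:", dict(sorted(per_shard.items())))
print("targets that changed shard going from 3 to 4 shards:", moved)
```

Each target lands on exactly one shard, so the load splits roughly evenly. Because the assignment is a plain modulo rather than a consistent hash, a large share of targets changes shard when the shard count changes.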
The number of replicas can be defined, similarly to the number of shards. This creates duplicate shards, so it must be paired with a remote_write system that can perform HA deduplication. Grafana Cloud and Mimir provide this out of the box, and the Grafana Agent Operator defaults support these two systems.
The total number of created metrics Pods is the product numShards * numReplicas. For example, 3 shards with 2 replicas result in 6 Pods.
Two labels are added by default to every metric:
- cluster, representing the GrafanaAgent deployment. Holds the value of <GrafanaAgent.metadata.namespace>/<GrafanaAgent.metadata.name>.
- __replica__, representing the replica number of the Agent. This label works out of the box with Grafana Cloud and Cortex’s HA deduplication.
The shard number is not added as a label, as sharding is designed to be transparent on the receiver end.