When the load on Grafana Alloy is low, process all necessary telemetry signals in a single process. For example, a single collector can handle all incoming metrics, logs, and traces.
As telemetry volume increases, scale the data collector to handle the load.
The following section guides you through our recommendations for scaling the data collector when sampling is enabled.
For Application Observability, sample at the data collector after metrics generation so all traces are available to generate accurate metrics. The sampling strategy applies to traces, and only sampled traces are sent to the backend.
This pipeline makes the data collector stateful. In tracing, a stateful component must aggregate certain spans to work correctly. The pipeline contains these stateful components:
The span metrics connector requires all spans with the same service.name to be processed by the same collector.
The service graph connector pairs each “client” span with a “server” span to calculate metrics such as span duration.
The tail sampling processor requires all spans with the same traceID to be processed by the same collector.
To scale this pipeline, deploy two layers of collectors.
The first layer enriches data, exports application metrics and logs to backends, and load balances traces using otelcol.exporter.loadbalancing for Grafana Alloy or loadbalancingexporter for the OpenTelemetry Collector.
The second layer performs metrics generation and sampling, then exports sampled traces and generated metrics.
To identify series generated by different collectors in the second layer, add a collector_id label.
Solve cardinality issues from collector_id labels using Adaptive Metrics.
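One way to attach the collector_id label is to insert it as a data point attribute before metrics leave the second layer. The following is a minimal sketch, not part of the reference configuration. It assumes that each collector instance has a unique HOSTNAME environment variable, which is typically the case for Kubernetes pods. The component label add_collector_id and its placement in front of the batch processor are hypothetical choices.

```river
// Hypothetical component: stamps every metric data point with the ID of
// the collector instance that produced it.
otelcol.processor.attributes "add_collector_id" {
  action {
    key    = "collector_id"
    // Assumption: HOSTNAME is unique for each collector instance.
    value  = env("HOSTNAME")
    // "insert" only adds the attribute when it isn't already present.
    action = "insert"
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}
```

To use a component like this, the metrics outputs of the span metrics filter and the service graph connector would point at otelcol.processor.attributes.add_collector_id.input instead of the batch processor.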
To view the Grafana Alloy configuration for the first layer, select the river tab below. To view the OpenTelemetry Collector configuration for the first layer, select the yaml tab below.
The first layer collector is stateless. Scaling a stateless collector is straightforward because an off-the-shelf layer 4 load balancer is sufficient.
The collector has three resolvers for the load-balancing exporter: static, dns, and k8s (named kubernetes in Grafana Alloy).
static: A static list of backends is provided in the configuration. This is suitable when the backends are static and scaling isn’t expected.
dns: A hostname is provided as a parameter, which the resolver periodically queries to discover IPs and update the load-balancer ring. When multiple instances are used, they can momentarily have different views of the system while they sync after a refresh. This can result in some spans for the same trace ID being sent to multiple hosts. Determine if this is acceptable for the system, and use a longer refresh interval to reduce the effect of being out of sync.
k8s: A resolver that implements a watcher using Kubernetes APIs to get notifications when the list of pods backing a service changes. This should reduce the amount of time when cluster views differ between nodes, which makes it a better solution than the DNS resolver when Kubernetes is used.
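The resolver choice described above can be sketched for Grafana Alloy as follows. This is an illustration under stated assumptions, not the reference first-layer configuration. The component label layer_two, the service name, the DNS hostname, and the plaintext TLS setting are all hypothetical. Routing by trace ID is shown because the tail sampling processor needs all spans of a trace on the same collector.

```river
// Sketch of the trace load-balancing step in the first layer.
otelcol.exporter.loadbalancing "layer_two" {
  // Send all spans that share a trace ID to the same second-layer collector.
  routing_key = "traceID"

  resolver {
    // Assumption: the second layer is exposed through a Kubernetes service
    // with this hypothetical name, written as <service>.<namespace>.
    kubernetes {
      service = "alloy-sampling.monitoring"
    }

    // Alternative outside Kubernetes (configure only one resolver):
    // dns {
    //   hostname = "alloy-sampling.example.internal"
    //   interval = "30s"
    // }
  }

  protocol {
    otlp {
      client {
        // Assumption: plaintext gRPC between the two layers.
        tls {
          insecure = true
        }
      }
    }
  }
}
```

A trace receiver or processor in the first layer would then list otelcol.exporter.loadbalancing.layer_two.input in its traces output.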
To view the Grafana Alloy configuration for the second layer, select the river tab below. To view the OpenTelemetry Collector configuration for the second layer, select the yaml tab below.
river
otelcol.receiver.otlp "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.receiver.otlp/

  // configures the default grpc endpoint "0.0.0.0:4317"
  grpc { }

  output {
    traces = [
      otelcol.processor.tail_sampling.default.input,
      otelcol.connector.servicegraph.default.input,
      otelcol.connector.spanmetrics.default.input,
    ]
  }
}

otelcol.connector.spanmetrics "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.spanmetrics/
  dimension {
    name = "service.namespace"
  }

  dimension {
    name = "service.version"
  }

  dimension {
    name = "deployment.environment"
  }

  dimension {
    name = "k8s.cluster.name"
  }

  dimension {
    name = "k8s.namespace.name"
  }

  dimension {
    name = "cloud.region"
  }

  dimension {
    name = "cloud.availability_zone"
  }

  histogram {
    explicit {
      buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
    }
    unit = "s"
  }

  output {
    metrics = [otelcol.processor.filter.drop_unneeded_span_metrics.input]
  }
}

otelcol.processor.filter "drop_unneeded_span_metrics" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.filter/
  error_mode = "ignore"

  metrics {
    datapoint = [
      "IsMatch(metric.name, \"calls|duration\") and IsMatch(attributes[\"span.kind\"], \"SPAN_KIND_INTERNAL\")",
    ]
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.connector.servicegraph "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.servicegraph/
  dimensions = [
    "service.namespace",
    "service.version",
    "deployment.environment",
    "k8s.cluster.name",
    "k8s.namespace.name",
    "cloud.region",
    "cloud.availability_zone",
  ]
  latency_histogram_buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]

  store {
    ttl = "2s"
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.batch/
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.processor.tail_sampling "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.tail_sampling/

  // Examples: keep all traces that take more than 5000 ms
  policy {
    name = "all_traces_above_5000ms"
    type = "latency"

    latency {
      threshold_ms = 5000
    }
  }

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.exporter.otlphttp "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.exporter.otlphttp/
  client {
    endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

otelcol.auth.basic "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.auth.basic/
  username = env("GRAFANA_CLOUD_INSTANCE_ID")
  password = env("GRAFANA_CLOUD_API_KEY")
}
yaml
# Tested with OpenTelemetry Collector Contrib v0.94.0

receivers:
  otlp:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver
    protocols:
      grpc:

processors:
  batch:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor
  filter/drop_unneeded_span_metrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(metric.name, "calls|duration") and IsMatch(attributes["span.kind"], "SPAN_KIND_INTERNAL")'
  tail_sampling:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
    policies:
      # Examples: keep all traces that take more than 5000 ms
      [
        {
          name: all_traces_above_5000ms,
          type: latency,
          latency: { threshold_ms: 5000 },
        },
      ]

connectors:
  servicegraph:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/servicegraphconnector
    dimensions:
      - service.namespace
      - service.version
      - deployment.environment
      - k8s.cluster.name
      - k8s.namespace.name
      - cloud.region
      - cloud.availability_zone
  spanmetrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector
    histogram:
      unit: s
    dimensions:
      - name: service.namespace
      - name: service.version
      - name: deployment.environment
      - name: k8s.cluster.name
      - name: k8s.namespace.name
      - name: cloud.region
      - name: cloud.availability_zone

exporters:
  otlphttp/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter
    endpoint: "${env:GRAFANA_CLOUD_OTLP_ENDPOINT}"
    auth:
      authenticator: basicauth/grafana_cloud
    add_metric_suffixes: false

extensions:
  basicauth/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/basicauthextension
    client_auth:
      username: "${env:GRAFANA_CLOUD_INSTANCE_ID}"
      password: "${env:GRAFANA_CLOUD_API_KEY}"

service:
  extensions: [basicauth/grafana_cloud]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [servicegraph, spanmetrics]
    traces/grafana_cloud_traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp/grafana_cloud]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors: [filter/drop_unneeded_span_metrics, batch]
      exporters: [otlphttp/grafana_cloud]
    metrics/servicegraph:
      receivers: [servicegraph]
      processors: [batch]
      exporters: [otlphttp/grafana_cloud]
The Legacy option for span metrics source in the configuration is for customers who use Grafana Alloy or OpenTelemetry Collector with metric names that match those used by the Tempo metrics generator.
If you chose the Legacy option for span metrics source, use the legacy configuration below.
To view the Grafana Alloy legacy configuration for the second layer, select the river tab below. To view the OpenTelemetry Collector legacy configuration for the second layer, select the yaml tab below.
river
otelcol.receiver.otlp "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.receiver.otlp/

  // configures the default grpc endpoint "0.0.0.0:4317"
  grpc { }

  output {
    traces = [
      otelcol.processor.tail_sampling.default.input,
      otelcol.connector.servicegraph.default.input,
      otelcol.connector.spanmetrics.default.input,
    ]
  }
}

otelcol.connector.spanmetrics "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.spanmetrics/
  dimension {
    name = "service.namespace"
  }

  dimension {
    name = "service.version"
  }

  dimension {
    name = "deployment.environment"
  }

  dimension {
    name = "k8s.cluster.name"
  }

  dimension {
    name = "k8s.namespace.name"
  }

  dimension {
    name = "cloud.region"
  }

  dimension {
    name = "cloud.availability_zone"
  }

  histogram {
    explicit {
      buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
    }
    unit = "s"
  }

  namespace = "traces.spanmetrics"

  output {
    metrics = [otelcol.processor.filter.drop_unneeded_span_metrics.input]
  }
}

otelcol.processor.transform "use_grafana_metric_names" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.transform/
  error_mode = "ignore"

  metric_statements {
    context    = "metric"
    statements = [
      "set(name, \"traces.spanmetrics.latency\") where name == \"traces.spanmetrics.duration\"",
      "set(name, \"traces.spanmetrics.calls.total\") where name == \"traces.spanmetrics.calls\"",
    ]
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.filter "drop_unneeded_span_metrics" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.filter/
  error_mode = "ignore"

  metrics {
    datapoint = [
      "IsMatch(metric.name, \"traces.spanmetrics.calls|traces.spanmetrics.duration\") and IsMatch(attributes[\"span.kind\"], \"SPAN_KIND_INTERNAL\")",
    ]
  }

  output {
    metrics = [otelcol.processor.transform.use_grafana_metric_names.input]
  }
}

otelcol.connector.servicegraph "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.servicegraph/
  dimensions = [
    "service.namespace",
    "service.version",
    "deployment.environment",
    "k8s.cluster.name",
    "k8s.namespace.name",
    "cloud.region",
    "cloud.availability_zone",
  ]
  latency_histogram_buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]

  store {
    ttl = "2s"
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.batch/
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.processor.tail_sampling "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.tail_sampling/

  // Examples: keep all traces that take more than 5000 ms
  policy {
    name = "all_traces_above_5000ms"
    type = "latency"

    latency {
      threshold_ms = 5000
    }
  }

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.exporter.otlphttp "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.exporter.otlphttp/
  client {
    endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

otelcol.auth.basic "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.auth.basic/
  username = env("GRAFANA_CLOUD_INSTANCE_ID")
  password = env("GRAFANA_CLOUD_API_KEY")
}
yaml
# Tested with OpenTelemetry Collector Contrib v0.94.0

receivers:
  otlp:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver
    protocols:
      grpc:

processors:
  batch:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor
  filter/drop_unneeded_span_metrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(metric.name, "traces.spanmetrics.calls|traces.spanmetrics.duration") and IsMatch(attributes["span.kind"], "SPAN_KIND_INTERNAL")'
  transform/use_grafana_metric_names:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor
    error_mode: ignore
    metric_statements:
      - context: metric
        statements:
          - set(name, "traces.spanmetrics.latency") where name == "traces.spanmetrics.duration"
          - set(name, "traces.spanmetrics.calls.total") where name == "traces.spanmetrics.calls"
  tail_sampling:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
    policies:
      # Examples: keep all traces that take more than 5000 ms
      [
        {
          name: all_traces_above_5000ms,
          type: latency,
          latency: { threshold_ms: 5000 },
        },
      ]

connectors:
  servicegraph:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/servicegraphconnector
    dimensions:
      - service.namespace
      - service.version
      - deployment.environment
      - k8s.cluster.name
      - k8s.namespace.name
      - cloud.region
      - cloud.availability_zone
  spanmetrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector
    namespace: traces.spanmetrics
    histogram:
      unit: s
    dimensions:
      - name: service.namespace
      - name: service.version
      - name: deployment.environment
      - name: k8s.cluster.name
      - name: k8s.namespace.name
      - name: cloud.region
      - name: cloud.availability_zone

exporters:
  otlphttp/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter
    endpoint: "${env:GRAFANA_CLOUD_OTLP_ENDPOINT}"
    auth:
      authenticator: basicauth/grafana_cloud

extensions:
  basicauth/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/basicauthextension
    client_auth:
      username: "${env:GRAFANA_CLOUD_INSTANCE_ID}"
      password: "${env:GRAFANA_CLOUD_API_KEY}"

service:
  extensions: [basicauth/grafana_cloud]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [servicegraph, spanmetrics]
    traces/grafana_cloud_traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp/grafana_cloud]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors:
        [
          filter/drop_unneeded_span_metrics,
          transform/use_grafana_metric_names,
          batch,
        ]
      exporters: [otlphttp/grafana_cloud]
    metrics/servicegraph:
      receivers: [servicegraph]
      processors: [batch]
      exporters: [otlphttp/grafana_cloud]