If the load on Grafana Alloy is small, it's recommended to process all necessary telemetry signals in the same process. For example, a single collector can process all of the incoming metrics, logs, and traces.
As telemetry volume increases, you should consider ways to scale the data collector.
The following section guides you through our recommendations for scaling the data collector when sampling is enabled.
For Application Observability, we recommend sampling at the data collector after metrics generation so that all traces are available to generate accurate metrics.
The sampling strategy applies to traces, and only sampled traces are sent to the backend.
This pipeline makes the data collector stateful. In the context of tracing, a stateful component is a component that needs to aggregate certain spans to work correctly. The pipeline contains these stateful components:

- Span metrics connector: needs all spans with the same service.name to be processed by the same collector.
- Service graph connector: needs to pair each "client" span with its matching "server" span to calculate metrics such as span duration.
- Tail sampling processor: needs all spans with the same trace ID to be processed by the same collector.
To scale this pipeline, we recommend deploying two layers of collectors.
The first layer enriches data, exports application metrics and logs to backends, and load balances traces using the otelcol.exporter.loadbalancing component for Grafana Alloy or the loadbalancingexporter for the OpenTelemetry Collector.
The second layer performs metrics generation and sampling, and exports the sampled traces and generated metrics.
To differentiate series generated by different collectors on the second layer, we recommend adding an additional "collector_id" label.
Cardinality issues caused by the "collector_id" label can be solved using Adaptive Metrics.
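As an illustration, one way to attach this label is to set it as a resource attribute on the second-layer collector before export. The following is a minimal sketch in Grafana Alloy syntax, not a prescribed configuration: the component label "add_collector_id" is hypothetical, and it assumes each collector instance exposes a unique name in the HOSTNAME environment variable (for example, the pod name on Kubernetes).

```river
// Hypothetical sketch: tag exported metrics with the collector that produced them.
// Assumes HOSTNAME uniquely identifies this collector instance.
otelcol.processor.transform "add_collector_id" {
  error_mode = "ignore"
  metric_statements {
    context = "resource"
    statements = [
      "set(attributes[\"collector_id\"], \"" + env("HOSTNAME") + "\")",
    ]
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}
```

Placed between metrics generation and the batch processor, every series leaving this instance carries its own collector_id value, so duplicate series from different second-layer replicas remain distinguishable.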
To view the Grafana Alloy configuration for the first layer, select the river tab below. To view the OpenTelemetry Collector configuration for the first layer, select the yaml tab below.
The first layer collector is stateless. Scaling a stateless collector is straightforward: an off-the-shelf layer 4 load balancer is sufficient.
The load-balancing exporter supports three resolvers: static, dns, and kubernetes.

- static: A static list of backends is provided in the configuration. This is suitable when the backends are fixed and scaling isn't expected.
- dns: A hostname is provided as a parameter, which the resolver periodically queries to discover IPs and update the load-balancer ring. When multiple instances are used, there is a chance they momentarily have a different view of the system while they sync after a refresh. This can result in some spans for the same trace ID being sent to multiple hosts. Determine whether this is acceptable for the system, and use a longer refresh interval to reduce the effect of being out of sync.
- kubernetes: A resolver that implements a watcher using Kubernetes APIs to get notified when the list of pods backing a service changes. This reduces the amount of time when cluster views differ between nodes, making it a better option than the dns resolver when Kubernetes is used.
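For reference, a first-layer trace export with the kubernetes resolver might look like the following sketch in Grafana Alloy syntax. This is an assumption-laden example, not the exact first-layer configuration: the headless service name "collector-second-layer" and the "observability" namespace are placeholders, and plaintext transport is assumed for in-cluster traffic.

```river
// Hypothetical first-layer sketch: route all spans of a trace to the same
// second-layer collector, discovering backends through the Kubernetes API.
// "collector-second-layer.observability" is an assumed headless service.
otelcol.exporter.loadbalancing "default" {
  resolver {
    kubernetes {
      service = "collector-second-layer.observability"
    }
  }
  protocol {
    otlp {
      client {
        tls {
          insecure = true
        }
      }
    }
  }
}
```

By default the exporter routes by trace ID, which is what the stateful second-layer components require.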
To view the Grafana Alloy configuration for the second layer, select the river tab below. To view the OpenTelemetry Collector configuration for the second layer, select the yaml tab below.
river
```river
otelcol.receiver.otlp "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.receiver.otlp/
  // configures the default grpc endpoint "0.0.0.0:4317"
  grpc {}
  output {
    traces = [
      otelcol.processor.tail_sampling.default.input,
      otelcol.connector.servicegraph.default.input,
      otelcol.connector.spanmetrics.default.input,
    ]
  }
}

otelcol.connector.spanmetrics "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.spanmetrics/
  dimension {
    name = "service.namespace"
  }
  dimension {
    name = "service.version"
  }
  dimension {
    name = "deployment.environment"
  }
  dimension {
    name = "k8s.cluster.name"
  }
  dimension {
    name = "k8s.namespace.name"
  }
  dimension {
    name = "cloud.region"
  }
  dimension {
    name = "cloud.availability_zone"
  }
  histogram {
    explicit {
      buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
    }
    unit = "s"
  }
  output {
    metrics = [otelcol.processor.filter.drop_unneeded_span_metrics.input]
  }
}

otelcol.processor.filter "drop_unneeded_span_metrics" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.filter/
  error_mode = "ignore"
  metrics {
    datapoint = [
      "IsMatch(metric.name, \"calls|duration\") and IsMatch(attributes[\"span.kind\"], \"SPAN_KIND_INTERNAL\")",
    ]
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.connector.servicegraph "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.servicegraph/
  dimensions = [
    "service.namespace",
    "service.version",
    "deployment.environment",
    "k8s.cluster.name",
    "k8s.namespace.name",
    "cloud.region",
    "cloud.availability_zone",
  ]
  latency_histogram_buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
  store {
    ttl = "2s"
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.batch/
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.processor.tail_sampling "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.tail_sampling/
  // Example: keep all traces that take more than 5000 ms
  policy {
    name = "all_traces_above_5000ms"
    type = "latency"
    latency = {
      threshold_ms = 5000,
    }
  }
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.exporter.otlphttp "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.exporter.otlphttp/
  client {
    endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

otelcol.auth.basic "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.auth.basic/
  username = env("GRAFANA_CLOUD_INSTANCE_ID")
  password = env("GRAFANA_CLOUD_API_KEY")
}
```
yaml
```yaml
# Tested with OpenTelemetry Collector Contrib v0.94.0
receivers:
  otlp:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver
    protocols:
      grpc:

processors:
  batch:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor
  filter/drop_unneeded_span_metrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(metric.name, "calls|duration") and IsMatch(attributes["span.kind"], "SPAN_KIND_INTERNAL")'
  tail_sampling:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
    policies:
      # Example: keep all traces that take more than 5000 ms
      [
        {
          name: all_traces_above_5000ms,
          type: latency,
          latency: { threshold_ms: 5000 },
        },
      ]

connectors:
  servicegraph:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/servicegraphconnector
    dimensions:
      - service.namespace
      - service.version
      - deployment.environment
      - k8s.cluster.name
      - k8s.namespace.name
      - cloud.region
      - cloud.availability_zone
  spanmetrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector
    histogram:
      unit: s
    dimensions:
      - name: service.namespace
      - name: service.version
      - name: deployment.environment
      - name: k8s.cluster.name
      - name: k8s.namespace.name
      - name: cloud.region
      - name: cloud.availability_zone

exporters:
  otlphttp/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter
    endpoint: "${env:GRAFANA_CLOUD_OTLP_ENDPOINT}"
    auth:
      authenticator: basicauth/grafana_cloud
    add_metric_suffixes: false

extensions:
  basicauth/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/basicauthextension
    client_auth:
      username: "${env:GRAFANA_CLOUD_INSTANCE_ID}"
      password: "${env:GRAFANA_CLOUD_API_KEY}"

service:
  extensions: [basicauth/grafana_cloud]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [servicegraph, spanmetrics]
    traces/grafana_cloud_traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp/grafana_cloud]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors: [filter/drop_unneeded_span_metrics, batch]
      exporters: [otlphttp/grafana_cloud]
    metrics/servicegraph:
      receivers: [servicegraph]
      processors: [batch]
      exporters: [otlphttp/grafana_cloud]
```
The Legacy option for the span metrics source in the configuration is for customers who use Grafana Alloy or the OpenTelemetry Collector with metric names that match those used by the Tempo metrics generator.
If you chose the Legacy option for the span metrics source, use the legacy configuration below.
To view the Grafana Alloy legacy configuration for the second layer, select the river tab below. To view the OpenTelemetry Collector legacy configuration for the second layer, select the yaml tab below.
river
```river
otelcol.receiver.otlp "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.receiver.otlp/
  // configures the default grpc endpoint "0.0.0.0:4317"
  grpc {}
  output {
    traces = [
      otelcol.processor.tail_sampling.default.input,
      otelcol.connector.servicegraph.default.input,
      otelcol.connector.spanmetrics.default.input,
    ]
  }
}

otelcol.connector.spanmetrics "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.spanmetrics/
  dimension {
    name = "service.namespace"
  }
  dimension {
    name = "service.version"
  }
  dimension {
    name = "deployment.environment"
  }
  dimension {
    name = "k8s.cluster.name"
  }
  dimension {
    name = "k8s.namespace.name"
  }
  dimension {
    name = "cloud.region"
  }
  dimension {
    name = "cloud.availability_zone"
  }
  histogram {
    explicit {
      buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
    }
    unit = "s"
  }
  namespace = "traces.spanmetrics"
  output {
    metrics = [otelcol.processor.filter.drop_unneeded_span_metrics.input]
  }
}

otelcol.processor.transform "use_grafana_metric_names" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.transform/
  error_mode = "ignore"
  metric_statements {
    context = "metric"
    statements = [
      "set(name, \"traces.spanmetrics.latency\") where name == \"traces.spanmetrics.duration\"",
      "set(name, \"traces.spanmetrics.calls.total\") where name == \"traces.spanmetrics.calls\"",
    ]
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.filter "drop_unneeded_span_metrics" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.filter/
  error_mode = "ignore"
  metrics {
    datapoint = [
      "IsMatch(metric.name, \"traces.spanmetrics.calls|traces.spanmetrics.duration\") and IsMatch(attributes[\"span.kind\"], \"SPAN_KIND_INTERNAL\")",
    ]
  }
  output {
    metrics = [otelcol.processor.transform.use_grafana_metric_names.input]
  }
}

otelcol.connector.servicegraph "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.connector.servicegraph/
  dimensions = [
    "service.namespace",
    "service.version",
    "deployment.environment",
    "k8s.cluster.name",
    "k8s.namespace.name",
    "cloud.region",
    "cloud.availability_zone",
  ]
  latency_histogram_buckets = ["0s", "0.005s", "0.01s", "0.025s", "0.05s", "0.075s", "0.1s", "0.25s", "0.5s", "0.75s", "1s", "2.5s", "5s", "7.5s", "10s"]
  store {
    ttl = "2s"
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.batch/
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.processor.tail_sampling "default" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.processor.tail_sampling/
  // Example: keep all traces that take more than 5000 ms
  policy {
    name = "all_traces_above_5000ms"
    type = "latency"
    latency = {
      threshold_ms = 5000,
    }
  }
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}

otelcol.exporter.otlphttp "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.exporter.otlphttp/
  client {
    endpoint = env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

otelcol.auth.basic "grafana_cloud" {
  // https://grafana.com/docs/alloy/latest/reference/components/otelcol.auth.basic/
  username = env("GRAFANA_CLOUD_INSTANCE_ID")
  password = env("GRAFANA_CLOUD_API_KEY")
}
```
yaml
```yaml
# Tested with OpenTelemetry Collector Contrib v0.94.0
receivers:
  otlp:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver
    protocols:
      grpc:

processors:
  batch:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor
  filter/drop_unneeded_span_metrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(metric.name, "traces.spanmetrics.calls|traces.spanmetrics.duration") and IsMatch(attributes["span.kind"], "SPAN_KIND_INTERNAL")'
  transform/use_grafana_metric_names:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor
    error_mode: ignore
    metric_statements:
      - context: metric
        statements:
          - set(name, "traces.spanmetrics.latency") where name == "traces.spanmetrics.duration"
          - set(name, "traces.spanmetrics.calls.total") where name == "traces.spanmetrics.calls"
  tail_sampling:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
    policies:
      # Example: keep all traces that take more than 5000 ms
      [
        {
          name: all_traces_above_5000ms,
          type: latency,
          latency: { threshold_ms: 5000 },
        },
      ]

connectors:
  servicegraph:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/servicegraphconnector
    dimensions:
      - service.namespace
      - service.version
      - deployment.environment
      - k8s.cluster.name
      - k8s.namespace.name
      - cloud.region
      - cloud.availability_zone
  spanmetrics:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector
    namespace: traces.spanmetrics
    histogram:
      unit: s
    dimensions:
      - name: service.namespace
      - name: service.version
      - name: deployment.environment
      - name: k8s.cluster.name
      - name: k8s.namespace.name
      - name: cloud.region
      - name: cloud.availability_zone

exporters:
  otlphttp/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter
    endpoint: "${env:GRAFANA_CLOUD_OTLP_ENDPOINT}"
    auth:
      authenticator: basicauth/grafana_cloud

extensions:
  basicauth/grafana_cloud:
    # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/basicauthextension
    client_auth:
      username: "${env:GRAFANA_CLOUD_INSTANCE_ID}"
      password: "${env:GRAFANA_CLOUD_API_KEY}"

service:
  extensions: [basicauth/grafana_cloud]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [servicegraph, spanmetrics]
    traces/grafana_cloud_traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      # Note: exports through the otlphttp/grafana_cloud exporter defined above
      exporters: [otlphttp/grafana_cloud]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors: [filter/drop_unneeded_span_metrics, transform/use_grafana_metric_names, batch]
      exporters: [otlphttp/grafana_cloud]
    metrics/servicegraph:
      receivers: [servicegraph]
      processors: [batch]
      exporters: [otlphttp/grafana_cloud]
```