Sampling strategies for tracing

Sampling controls trace volume and cost by selecting which traces to retain. This page explains common strategies and how to configure them with Grafana Alloy or the OpenTelemetry Collector.

Sampling helps you control ingestion and storage costs by focusing on high-value traces, such as errors, latency outliers, and traces from specific tenants or endpoints.

There are two main sampling strategies: head sampling and tail sampling.

  • Head sampling: the decision is made when a trace starts; low overhead, but it can miss rare errors and latency outliers because outcomes are not yet known.
  • Tail sampling: the decision is made after spans have been collected for a period, so it can select traces based on outcome (status, duration, attributes).
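
As a minimal sketch of head-style sampling in a collector pipeline, the OpenTelemetry Collector's probabilistic_sampler processor keeps a fixed percentage of traces, chosen deterministically from the trace ID; the percentage below is an example value:

```yaml
processors:
  probabilistic_sampler:
    # Keep roughly 10% of traces, selected by hashing the trace ID so
    # every collector instance makes the same decision for a given trace.
    sampling_percentage: 10
```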

Refer to Sampling and policies in the Tempo documentation for more information.

Tip

Grafana Cloud users can use Adaptive Traces for managed tail sampling without the operational overhead of a self-managed pipeline. Adaptive Traces handles large-scale tail sampling and propagates policy updates in around 10 seconds.

Adaptive Traces (Grafana Cloud)

In addition to letting you create your own sampling policies with types such as probabilistic, latency, status code, string attribute, and drop, Adaptive Traces includes capabilities that go beyond manual policy configuration.

Opinionated recommendations

Adaptive Traces regularly analyzes your trace data and generates recommendations to fine-tune your sampling strategy. Recommendations typically suggest policies to keep error traces, capture high-latency traces, or retain a representative probabilistic sample. You can review, apply, or dismiss recommendations from the UI, and a history feed lets you correlate applied changes with shifts in ingestion volume or cost.

Diversity sampling

The diversity policy ensures rare, low-traffic, and unique traces are always captured. It builds a fingerprint from span attributes like service.name, http.route, and http.response.status_code, then guarantees at least one trace per fingerprint every 15 minutes. This means you can lower your probabilistic sampling rate to reduce costs while still maintaining visibility into edge cases, background jobs, and infrequently executed code paths.
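
The fingerprint-and-window idea behind diversity sampling can be illustrated with a short conceptual sketch (this is not the Adaptive Traces implementation; the attribute keys and 15-minute window mirror the description above):

```python
import time


class DiversitySampler:
    """Conceptual sketch: keep at least one trace per attribute
    fingerprint within each time window (15 minutes by default)."""

    def __init__(self, window_seconds=900):
        self.window = window_seconds
        self.last_kept = {}  # fingerprint -> timestamp of last kept trace

    def fingerprint(self, span_attributes):
        # Combine identifying attributes into a hashable key.
        keys = ("service.name", "http.route", "http.response.status_code")
        return tuple(span_attributes.get(k) for k in keys)

    def should_keep(self, span_attributes, now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(span_attributes)
        last = self.last_kept.get(fp)
        if last is None or now - last >= self.window:
            # First trace with this fingerprint in the window: keep it.
            self.last_kept[fp] = now
            return True
        # Already represented this window; defer to other policies
        # (for example, a low probabilistic baseline).
        return False
```

Because each fingerprint is guaranteed at least one kept trace per window, rare code paths stay visible even when the probabilistic baseline rate is lowered.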

Anomaly detection

Anomaly detection learns what “normal” looks like for your services, then monitors incoming traces for significant deviations such as latency spikes. When an anomaly is detected, Adaptive Traces automatically retains the relevant traces and creates a temporary sampling policy, surfacing problems you might not have known to look for. You can drill down directly from the anomaly to the specific traces in the affected time range.

Resilient pipeline

Self-managed tail sampling typically requires a centralized collector that becomes a single point of failure. If it goes down, sampling decisions are lost along with the traces they would have retained. Because Adaptive Traces runs as managed infrastructure within Grafana Cloud, availability, scaling, and fault tolerance are handled for you, so the sampling pipeline continues to operate reliably even during traffic spikes or infrastructure disruptions.

Common tail-sampling policies

  • Latency-based: keep traces with duration above a threshold.
  • Error-based: keep traces whose span status is ERROR or whose HTTP status is 5xx.
  • Attribute-based: keep critical tenants, endpoints, or transaction types.
  • Probabilistic: sample a percentage for baseline coverage.

Combine policies to ensure broad coverage plus targeted retention of valuable traces.

Configuration references
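
For self-managed pipelines, the policies above can be combined in the OpenTelemetry Collector's tail_sampling processor. The sketch below shows one possible combination; the policy names, thresholds, and the tenant.id attribute key are illustrative:

```yaml
processors:
  tail_sampling:
    # Wait this long after a trace's first span before deciding; it
    # should cover typical trace durations.
    decision_wait: 10s
    policies:
      # Targeted retention: always keep errors and slow traces.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      # Attribute-based: keep a critical tenant (tenant.id is an example key).
      - name: key-tenant
        type: string_attribute
        string_attribute:
          key: tenant.id
          values: [premium]
      # Baseline coverage: a probabilistic sample of everything else.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```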

Best practices

  • Decision wait period: set it long enough to cover typical trace durations; too long a wait buffers spans in memory and delays downstream delivery, reducing slack for metrics generation.
  • Batch timeouts/size: large buffers add latency; tune alongside sampling.
  • Composite/AND samplers: use to require multiple conditions; avoid unintentionally dropping most traces.
  • Span dropping vs trace sampling: span filtering can reduce noise without dropping the entire trace.
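
The composite/AND point above can be sketched with the tail_sampling processor's and policy, which keeps a trace only when every sub-policy matches (thresholds are illustrative):

```yaml
processors:
  tail_sampling:
    policies:
      # Keep only traces that are BOTH errors AND slow. On its own this
      # drops everything else, so pair it with a baseline policy.
      - name: slow-errors
        type: and
        and:
          and_sub_policy:
            - name: is-error
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: is-slow
              type: latency
              latency:
                threshold_ms: 1000
```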