Capture high-value traces without managing a pipeline: Tail sampling with Adaptive Traces

2025-12-18 · 10 min

Tracing is the richest observability signal in common use today. In distributed systems, it reveals how requests flow across multiple services, allowing you to uncover and address performance bottlenecks. 

Yet teams often scale back or abandon tracing altogether, because most successful requests produce redundant data that’s noisy and expensive to store.

Earlier this year, we introduced Adaptive Traces, which is part of our end-to-end Adaptive Telemetry suite in Grafana Cloud, to offer a better path. By storing only high-value traces, such as those with errors, elevated latency, or other critical signals, it delivers faster insights while reducing spend. 

“Before Adaptive Traces, we had two bad options: send everything and blow our budget, or send so little we couldn’t get meaningful insight,” said Geoff Schultz, Manager, Infrastructure Engineering at AuditBoard. “Now tracing is actually usable. We can dial sampling up or down as needed, keep costs in check, and still give teams the visibility they need.”

A big part of how Adaptive Traces achieves this is through an observability technique called tail sampling — where the decision to sample (aka, keep) or drop a trace is made after collecting all or most of its spans — without requiring you to maintain and manage a sampling pipeline yourself. 

In this post, we’ll explore how tail sampling works, how it differs from head sampling, and how Adaptive Traces makes it easy for teams to implement tail sampling and see immediate value.

The age-old question: heads or tails?

Sampling, in general, refers to the practice of selectively capturing only a subset of tracing data to reduce the amount of data being ingested and optimize costs. When it comes to how traces are sampled, however, there are tradeoffs. You’re balancing how much you store, whether you keep the traces you actually need, and how much complexity you’re willing to run just to make sampling work.

Most users start with head sampling, where they configure a percentage or rate of traces to keep. That sampling decision is made when a trace is created and propagated throughout the system. This approach is simple to implement and lowers tracing costs. The tradeoff, however, is that the traces you care about most, like rare errors or slow requests, can get buried under a flood of successful requests. You can increase sampling rates to improve your odds, but your costs increase with it. 
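
Head sampling is usually configured in the tracing SDK, for instance via the standard OTEL_TRACES_SAMPLER=parentbased_traceidratio environment variable. The same up-front decision can also be made in a collector; here is a minimal sketch using the OpenTelemetry Collector’s probabilistic_sampler processor, which hashes each trace ID and keeps roughly the configured percentage (the 10% rate is illustrative):

```yaml
# Head-style sampling: the keep/drop decision is made up front per trace ID,
# without waiting for the rest of the trace to arrive. Add the processor to
# your traces pipeline.
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # keep ~10% of traces, uniformly at random
```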

This is where tail sampling comes in. Instead of deciding whether to keep or drop a trace when a trace is first created, tail sampling waits until the trace has collected all or most of its spans, then makes the call.

This allows you to implement logic that can capture more errors and slow requests, and even surface anomalous traces, so you can dig deeper when performance starts to drift. Tail sampling also lets you lightly sample during normal operations, then automatically retain the traces that matter when an issue hits.

The diagram below helps illustrate these differences. It compares uniform head sampling (at 10% of traces) to a tail sampling approach that uses a 5% base rate, but increases sampling to 50% for traces slower than the 2.5-second SLO. With tail sampling, you keep fewer fast traces and capture more of the slow ones that are most likely to explain what went wrong. This results in a lower overall sampling rate and a tighter focus on the signals that matter.

Figure: latency distributions under uniform 10% head sampling (left; blue shows all traces, red the sampled subset) and tail sampling (right; blue shows all traces, green the sampled subset).
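
To make the comparison concrete, the diagram’s tail sampling policy can be sketched with the upstream OpenTelemetry tail sampling processor, the same processor Adaptive Traces builds on (see the compatibility section below). The wait time, rates, and threshold mirror the example above and are illustrative rather than a recommendation:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s              # buffer spans, then decide per trace
    policies:
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5    # keep 5% of ordinary traffic
      - name: slow-requests
        type: and                   # both sub-policies must match
        and:
          and_sub_policy:
            - name: over-slo
              type: latency
              latency:
                threshold_ms: 2500  # slower than the 2.5-second SLO
            - name: keep-half
              type: probabilistic
              probabilistic:
                sampling_percentage: 50
```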

Where head sampling falls short

Most teams treat sampling like a cost lever. The real test is whether you have the evidence you need when an alert fires, especially when the failure only hits a small slice of traffic.

You’ll always need accurate percentiles when building metrics from traces, and you’ll also need enough detail to debug what’s going wrong when an SLO or alert fires. During a full outage, head sampling is usually sufficient because most requests fail, and even a uniform sample will capture those failures. In a partial failure, however, only a small percentage of requests are impacted. A head sampling policy may not capture them, making the issue harder to diagnose. 

Set against these considerations is the need for a low operational burden. Tail sampling can solve the partial-failure problem, but most teams don’t want to maintain a growing rule set or run additional infrastructure just to decide which traces to keep. The ideal system keeps the right traces automatically, without adding complexity.

Three common sampling approaches in the market

Most vendors support head sampling and usually offer ways to adjust metric generation based on the sampling rate to keep metrics accurate. However, as mentioned above, when only a small slice of traffic is experiencing an outage, uniform head sampling can miss important traces. This forces a choice: increase the sampling rate and pay for extra noise, or keep it low and compromise visibility.

Another common approach is self-hosted tail sampling, where you send all of your traces to a system you run yourself and apply tail sampling rules there. This offers greater precision and control, letting you define rules such as sampling traces that fall outside an SLO, but it requires you to run and maintain the sampling pipeline yourself. That can create a lot of operational overhead, especially as services scale and change.

Finally, there are pipeline solutions that ingest traces (and usually other telemetry), and modify the data according to a centralized policy. The tradeoff of these solutions is generally reasonable: offload the operational complexity of a self-hosted tail sampling solution at the expense of the pipeline cost. However, these are general pipelines that require teams to maintain complicated rules to realize value.

Ultimately, the goal isn’t just tail sampling. The goal is to keep the right traces when things go awry, and avoid inheriting a new operational burden.

Adaptive Traces: automatically capture your most valuable traces

Adaptive Traces brings tail sampling into Grafana Cloud as a fully managed capability, so you can capture high-value traces and control spend without running a distributed sampling pipeline yourself.

In addition, Adaptive Traces addresses another common issue with tail sampling: if you downsample traces (aka, reduce the volume of tracing data) before generating metrics, the resulting rates are no longer representative. Because tail sampling deliberately over-represents errors and slow requests, it skews percentiles and ratios, making the resulting metrics unreliable. For example, if 2% of 1,000 requests fail and a policy keeps every error plus 5% of successes, the stored traces alone would suggest an error rate of roughly 29% (20 of 69 kept traces). Adaptive Traces avoids this by generating metrics from raw trace data before any downsampling. You can store fewer traces, keep the ones you care about, and still trust the span metrics you build from your full trace volume.

Recommendations and custom policies 

Getting started with tail sampling can be daunting, so Adaptive Traces recommends three base policies that provide a well-rounded sampling setup, cutting down on volume while retaining slow and error traces. You can also define flexible, customizable sampling policies to capture only the tracing data you need. For example, you might configure rules to ingest all traces with an error status from a particular service, or to capture a random 5% of traces from another (both are sketched in collector terms after the screenshot below).

Figure: the Adaptive Traces overview in Grafana Cloud, showing 41.3 MiB of traces received, 2.18 MiB ingested into Tempo, a 94.7% reduction, a sampling-by-policy graph, and a recommendation card.
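
Adaptive Traces policies are managed in the Grafana Cloud UI, but because they follow the upstream tail sampling processor’s policy model (see the OpenTelemetry compatibility section below), the two example rules can be sketched in the equivalent collector syntax. The service names are hypothetical:

```yaml
# Excerpt of a tail_sampling policies list expressing the two rules above.
policies:
  - name: checkout-errors            # keep every error trace from one service
    type: and
    and:
      and_sub_policy:
        - name: checkout-service
          type: string_attribute
          string_attribute:
            key: service.name
            values: [checkout]       # hypothetical service name
        - name: error-status
          type: status_code
          status_code:
            status_codes: [ERROR]
  - name: catalog-baseline           # random 5% of traces from another service
    type: and
    and:
      and_sub_policy:
        - name: catalog-service
          type: string_attribute
          string_attribute:
            key: service.name
            values: [catalog]        # hypothetical service name
        - name: five-percent
          type: probabilistic
          probabilistic:
            sampling_percentage: 5
```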

As Adaptive Traces analyzes your data, it will surface additional recommendations. In the end, though, the user is always in control: all policies are accessible in Grafana Cloud and any changes are active in the ingestion infrastructure within 10 seconds.

“Day to day, we keep sampling restrictive to control spend,” said AuditBoard’s Geoff Schultz. “When a team is troubleshooting, we can ingest useful traces for just their namespace, give them the full detail they need, and then return to baseline volume, without risking a runaway bill.”

Anomaly detection 

Adaptive Traces continuously analyzes ingested traces and uses machine learning forecasts on span metrics to detect when an operation behaves anomalously. If an anomaly is detected, it automatically creates a temporary policy that samples traces exceeding the predicted P90 latency or error rate. This ensures the most interesting traces are there when your team needs them most.

Figure: an anomaly policy in Adaptive Traces, with policy details covering sampled traces per minute and span latency.
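
Conceptually, the temporary policy behaves like a latency policy pinned at the forecast threshold. The sketch below is purely illustrative, with a made-up name and P90 value; the real policy is created, applied, and expired automatically:

```yaml
policies:
  - name: anomaly-checkout-latency  # auto-generated and temporary; name is hypothetical
    type: latency
    latency:
      threshold_ms: 1840            # hypothetical ML-forecast P90 for the operation
```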

Adaptive Traces will mark the anomalous traces with the attribute grafana.adaptivetraces.anomaly, which makes them easy to find using either the TraceQL query language or Grafana Traces Drilldown, an application that lets you explore tracing data through a queryless, point-and-click experience. 
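
For example, a TraceQL query along these lines should surface them; the attribute name comes from above, while its scope and value type are assumptions here:

```
{ .grafana.adaptivetraces.anomaly = "true" }
```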

Compatibility with OpenTelemetry 

Adaptive Traces uses the OpenTelemetry tail sampling processor, which is the same code that runs in Grafana Alloy and the OpenTelemetry Collector. By using the upstream tail sampling processor, we can run the same policies as Alloy directly in Grafana Cloud. This also means that policy configuration in Adaptive Traces will be familiar to those who already use tail sampling in Alloy or the OpenTelemetry Collector.

To uphold our stringent reliability standards, we run the processor inside a durable, distributed sampling pipeline. And because we run inside Grafana Cloud, we can integrate with other services; for anomaly detection, for example, we query RED metrics and apply machine learning forecasts.

Getting started with Adaptive Traces

To get started with Adaptive Traces, write OpenTelemetry Protocol (OTLP) traces to Grafana Cloud. No special configuration is needed to use Adaptive Traces, as it is integrated directly into Grafana Cloud Traces.
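
If you route telemetry through the OpenTelemetry Collector (or Grafana Alloy), that typically means pointing an OTLP exporter at your stack’s OTLP gateway. A minimal collector-style sketch, with the gateway zone and credentials as placeholders for your own stack’s values:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/grafana_cloud:
    # Find your stack's OTLP endpoint and credentials in the Grafana Cloud portal.
    endpoint: https://otlp-gateway-<zone>.grafana.net/otlp
    headers:
      # Basic auth: base64 of "<instance-id>:<access-policy-token>"
      Authorization: Basic <base64-encoded-credentials>

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/grafana_cloud]
```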

Next, navigate to the Adaptive Traces plugin in Grafana Cloud, which you can find in the Adaptive Telemetry menu in the sidebar.

Figure: the Grafana Cloud sidebar with the Adaptive Telemetry menu expanded and Adaptive Traces highlighted.

When you open the plugin for the first time, it will automatically set up three base policies that will reduce your data volume, while retaining slow and error traces. You can then build on these policies to retain must-keep paths and other essential traces, according to your business needs. 

To learn more, please reference our best practices guides for setting up policies and onboarding a group of services. If your infrastructure emits low-value traces, you can also drop them completely to reduce noise.

You can get started today in all tiers of Grafana Cloud, including in our free tier.

FAQ: Adaptive Traces in Grafana Cloud

What is Adaptive Traces?

Adaptive Traces is a Grafana Cloud feature that uses tail sampling to keep only your most valuable traces (for example, those with errors, high latency, or other critical signals) and drop the rest, reducing cost and noise while preserving visibility into application performance and availability.

Adaptive Traces is part of our end-to-end Adaptive Telemetry suite in Grafana Cloud. 

What are the key benefits of Adaptive Traces?

  • Lower observability costs: By storing only the traces that matter, you reduce costs while maintaining the visibility you need. We typically observe customers reducing their ingested trace volumes by 75-90%.
  • Reduce MTTR: Ensure every critical trace is preserved with full context, so during incidents you have exactly the data you need. Generate accurate metrics from ingested traces to observe your applications with confidence.
  • Managed tail sampling at scale: Get a durable, easy-to-use tail sampling pipeline without the operational overhead of managing it yourself.
  • Fine-grained control via policies: Define flexible policies for precise control over what is kept or dropped. 
  • Anomaly-aware sampling and closed-loop investigations: Adaptive Traces automatically retains a sample of anomalous traces and surfaces them as part of the investigative workflow.

Is Adaptive Traces available in the Grafana Cloud free tier?

Yes, Adaptive Traces is available in all tiers of Grafana Cloud, including the free tier.

What else is included in the Adaptive Telemetry suite?

In addition to Adaptive Traces, the Adaptive Telemetry suite in Grafana Cloud consists of the following features, spanning all core pillars of observability:

  • Adaptive Metrics: Helps you identify and eliminate unused or partially used time series data through aggregation.
  • Adaptive Logs: Reduces log volume and associated costs by automatically identifying and removing low-value logs that are rarely or never used.
  • Adaptive Profiles: Dynamically adjusts data collection based on workload behavior, allowing you to deploy continuous profiling more broadly across your infrastructure without incurring excessive costs.

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!
