Trace discovery in Grafana Tempo using Prometheus exemplars, Loki 2.0 queries, and more

Published: 9 Nov 2020

Grafana Tempo, the recently announced distributed tracing backend, relies on integrations with other data sources for trace discovery. Tempo’s job is to store massive numbers of traces in object storage and retrieve them by ID. Logs and exemplars let users jump directly to traces more quickly and more powerfully than ever before.

Let’s dig into some examples with a live playground to try it out!

TNS Demo

The TNS demo is a commonly used playground/example application to test and demo basic Grafana, Loki, Prometheus and Tempo features. Let’s walk through some examples using it. Follow the main readme to install prerequisites and then set up the cluster. Then navigate to http://localhost:8080, click the “Grafana” link, and let’s get started.

Loki 2.0

Example

Loki 2.0 has some amazing new query features that you really should try out. These improvements are great on their own, but they also have big implications for trace search with Tempo.

In the TNS Demo Grafana, navigate to Explore and choose Loki as your data source. Let’s start with a simple query: {job="tns/app", level="info"}. This will return a number of log lines like:

2020-11-06T15:02:10.261121224Z stdout F level=info msg="HTTP client success" status=200 url=http://db duration=1.03636ms traceID=fb0fbe73200e474
2020-11-06T15:02:10.014657751Z stdout F level=info msg="HTTP client success" status=200 url=http://db duration=2.116557ms traceID=2c963a78f1ee0c78
2020-11-06T15:02:09.98055353Z stdout F level=info msg="HTTP client success" status=200 url=http://db duration=2.24091ms traceID=7efd169fbc41ff4a

We could then click on any of these trace IDs and jump straight to Tempo.

But what if we only wanted to see traces that failed? Or with certain latencies? This was possible in Loki 1.x, but often required tricky and brittle regex searches. Check out how easy this is in Loki 2.0:

{job="tns/app", level="info"} | logfmt | status >= 500 and status <= 599 and duration > 50ms

The logfmt pipe operator parses each logfmt-formatted line and lets us filter on the values of its fields. How cool is that! You can now log any value alongside a trace ID and use it to index your traces.
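To make the idea concrete, here is a minimal Python sketch of what the query above does conceptually: parse key=value pairs out of each line, then keep only the trace IDs whose status and duration match. This is not how Loki implements it. The function names and the sample lines are made up for illustration, and the duration parser only handles a single number followed by a unit.

```python
import re

# Matches key=value or key="quoted value" pairs, as found in logfmt lines.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# Conversion factors to milliseconds for simple single-unit durations.
UNITS_IN_MS = {"ns": 1e-6, "us": 1e-3, "µs": 1e-3, "ms": 1.0, "s": 1000.0}


def parse_logfmt(line):
    """Return a dict of the key=value pairs found in one log line."""
    fields = {}
    for key, value in PAIR.findall(line):
        if value.startswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields


def duration_ms(text):
    """Convert a duration such as '2.24091ms' to milliseconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|µs|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return float(match.group(1)) * UNITS_IN_MS[match.group(2)]


def matching_trace_ids(lines, min_status=500, max_status=599, min_ms=50.0):
    """Mimic: | logfmt | status >= 500 and status <= 599 and duration > 50ms"""
    ids = []
    for line in lines:
        fields = parse_logfmt(line)
        try:
            status = int(fields["status"])
            took = duration_ms(fields["duration"])
        except (KeyError, ValueError):
            continue  # skip lines that lack usable fields
        if min_status <= status <= max_status and took > min_ms:
            ids.append(fields.get("traceID"))
    return ids


# Hypothetical sample lines, shaped like the TNS demo output above.
SAMPLE = [
    'level=info msg="HTTP client success" status=200 url=http://db duration=1.03636ms traceID=aaa111',
    'level=info msg="HTTP client error" status=503 url=http://db duration=75.2ms traceID=bbb222',
    'level=info msg="HTTP client error" status=500 url=http://db duration=12ms traceID=ccc333',
]

print(matching_trace_ids(SAMPLE))  # ['bbb222']
```

Only the second line survives: it is the one request that both failed and took longer than 50ms.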

Configuration

All of the above features are available in current Grafana, Loki, and Tempo builds. The only other notable piece of configuration is a Loki Derived Field, which turns the trace ID in each log line into a link. You can view it in the data source config in the example.
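For a rough idea of what a derived field does, here is a small Python sketch: a regex captures the trace ID from a log line, and the captured value is dropped into a link. The regex, the URL template, and the function name are all assumptions for illustration. Check the demo's data source config for the real values.

```python
import re

# Assumed matcher regex; the demo's actual derived field may use a different one.
MATCHER = re.compile(r"traceID=(\w+)")

# Hypothetical link template; tempo.example is a placeholder host.
TEMPLATE = "http://tempo.example/trace/{trace_id}"


def trace_link(line, template=TEMPLATE):
    """Return a link for the first trace ID found in the line, or None."""
    match = MATCHER.search(line)
    if match is None:
        return None
    return template.format(trace_id=match.group(1))


line = 'level=info msg="HTTP client success" status=200 url=http://db duration=2.116557ms traceID=2c963a78f1ee0c78'
print(trace_link(line))  # http://tempo.example/trace/2c963a78f1ee0c78
```

Lines without a trace ID simply produce no link, so the field is harmless on logs that were never traced.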

Exemplars

Example

Exemplars are being worked on as we speak. Grafana support is expected in 7.3.x, and Prometheus support is coming soon. Note that this example uses some custom images built off of feature branches. Expect them in master soon!

In the TNS Demo Grafana, navigate to Explore and choose the prometheus-exemplars data source. Let’s try this query:

histogram_quantile(.99, sum(rate(tns_request_duration_seconds_bucket{}[1m])) by (le))

Executing this query should show the p99 of this histogram, with exemplars plotted as dots alongside it.

We can mouse over any dot and click it to jump straight from this metric over to a trace. If we were only interested in failing requests we could try:

histogram_quantile(.99, sum(rate(tns_request_duration_seconds_bucket{status_code="500"}[1m])) by (le))

Now the only exemplars shown come from requests that were aggregated into this metric; i.e., they are all failed requests. Note that exemplars are currently enabled only for the latency histograms, so you should only see them for tns_request_duration_seconds_bucket.
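Why does filtering on status_code filter the exemplars too? Because an exemplar is stored with the series it was observed on. Here is a toy Python sketch of that idea, assuming each histogram bucket keeps the most recent trace ID it saw. It is not the Prometheus client or server implementation, the class names are invented, and unlike real Prometheus buckets these counts are not cumulative.

```python
import bisect


class ExemplarHistogram:
    """Toy histogram that remembers one exemplar (trace ID) per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, seconds, trace_id):
        # First bucket whose upper bound is >= the observed value.
        index = bisect.bisect_left(self.bounds, seconds)
        self.counts[index] += 1
        self.exemplars[index] = (trace_id, seconds)  # keep the latest one


class RequestDurations:
    """One histogram per status_code label value."""

    def __init__(self, bounds):
        self.bounds = bounds
        self.series = {}

    def observe(self, status_code, seconds, trace_id):
        if status_code not in self.series:
            self.series[status_code] = ExemplarHistogram(self.bounds)
        self.series[status_code].observe(seconds, trace_id)

    def exemplars(self, status_code):
        """Exemplars recorded on the series selected by status_code."""
        hist = self.series.get(status_code)
        if hist is None:
            return []
        return [e for e in hist.exemplars if e is not None]


durations = RequestDurations([0.05, 0.1, 0.5])
durations.observe("200", 0.002, "trace-ok")
durations.observe("500", 0.3, "trace-fail-1")
durations.observe("500", 0.07, "trace-fail-2")

print(durations.exemplars("500"))
# [('trace-fail-2', 0.07), ('trace-fail-1', 0.3)]
```

Selecting the status_code="500" series can only ever surface trace IDs from failed requests, which is the behavior the query above relies on.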

Configuration

Exemplars do require some not-yet-released features. Note that the Prometheus and Grafana images are not from master. The example also includes some exemplar-linking configuration.

Don’t fret about this too much, though! The example sets it all up for you nicely! Expect these features soon in these open source applications.

Trace discovery

Somehow, even though Tempo does not support native search, trace discovery is more powerful and easier than ever! Use logs to build a perfectly crafted index into your traces with the fields and values that work for you. Use exemplars on the fly to discover traces related directly to the issue you are currently triaging, with just a single click.

If these ideas excite you, join us in the #tempo channel in the Grafana public Slack or hop on over to the repo and let us know what you think! You can also watch our ObservabilityCon session, “Tracing made simple with Grafana,” on demand or request access to the private beta of Tempo on Grafana Cloud.
