
What are traces?

A user on your website enters their email address into a form to sign up for your mailing list. They click Enter.

The user’s email address is data that flows through your system. In a cloud computing world, it is possible that clicking that one button causes data to touch multiple nodes across your cluster of microservices.

The email address may be sent to a verification algorithm sitting in a microservice that exists solely for that purpose. If it passes the check, the information is stored in a database.

Along the way, an anonymization node strips personally identifying data from the address and sends metadata collected to a marketing qualifying algorithm to determine whether the request was sent from a targeted part of the internet.

Services respond and data flows back from each, sometimes triggering new events across the system. Along the way, each node writes logs with a timestamp showing when the information passed through.

Finally, the request and response activity ends and a record of that request is sent to Grafana Cloud.
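To make the shape of that record concrete, here is a minimal sketch of a trace as a set of timed spans linked by parent IDs. The `Span` class, its field names, the service names, and the timings are all hypothetical and chosen to mirror the sign-up story above; this is not the data model Grafana Cloud or Tempo uses internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (all field names are illustrative)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: int             # start time, in ms since the request began
    end_ms: int               # end time, in ms since the request began

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Hypothetical spans for the mailing-list sign-up request.
signup_trace = [
    Span("t1", "a", None, "web-frontend", 0, 120),
    Span("t1", "b", "a", "email-verifier", 10, 40),
    Span("t1", "c", "a", "database", 45, 70),
    Span("t1", "d", "a", "anonymizer", 75, 95),
    Span("t1", "e", "d", "marketing-qualifier", 80, 92),
]


def render_tree(spans, parent_id=None, depth=0):
    """Return the trace as indented lines, children listed under their parent."""
    lines = []
    children = sorted(
        (s for s in spans if s.parent_id == parent_id),
        key=lambda s: s.start_ms,
    )
    for span in children:
        lines.append(f"{'  ' * depth}{span.service} ({span.duration_ms} ms)")
        lines.extend(render_tree(spans, span.span_id, depth + 1))
    return lines


print("\n".join(render_tree(signup_trace)))
```

The parent links are what turn a flat list of timed events into a picture of the request's path through the system.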

Traces versus metrics and logs

Each observability signal plays a unique role in providing insights into your systems. Metrics act as the high-level indicators of system health. They alert you that something is wrong or deviating from the norm. Logs then help you understand what exactly is going wrong, for example, the nature or cause of the elevated error rates you’re seeing in your metrics. Traces illustrate where in the sequence of events something is going wrong. They let you pinpoint which service in the many services that any given request traverses is the source of the delay or the error.

Let’s say a server takes too long to send data. Your metrics that track the latency of your system will increase, and they may then trigger an alert once that latency rises outside of an acceptable threshold.

Sending that data likely requires that a request interact with many different services in your system. Traces help you pinpoint the specific service that’s introducing the added latency that you’re seeing in your metrics. Alternatively, if you’re seeing an elevated rate of errors when sending data, traces help you figure out which service the errors are originating from.
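One way to picture how a trace localizes latency is to compute each span's self time: its own duration minus the time spent in its direct children. The sketch below does this for a handful of hypothetical spans. The tuple layout, service names, and timings are assumptions made for illustration, and the calculation assumes that sibling spans do not overlap in time.

```python
from collections import defaultdict

# Hypothetical spans: (span_id, parent_id, service, start_ms, end_ms)
spans = [
    ("a", None, "api-gateway", 0, 900),
    ("b", "a", "auth", 10, 60),
    ("c", "a", "email-server", 70, 880),
    ("d", "c", "template-renderer", 80, 130),
]


def self_time_by_service(spans):
    """Sum, per service, the time each span spent outside its direct children.

    Assumes child spans of the same parent do not overlap in time.
    """
    # Total duration of the direct children of each span.
    child_time = defaultdict(int)
    for _, parent_id, _, start, end in spans:
        if parent_id is not None:
            child_time[parent_id] += end - start

    # Self time = own duration minus time attributed to children.
    totals = defaultdict(int)
    for span_id, _, service, start, end in spans:
        totals[service] += (end - start) - child_time[span_id]
    return dict(totals)


totals = self_time_by_service(spans)
slowest = max(totals, key=totals.get)
print(slowest, totals[slowest])
```

In this made-up trace the gateway span is the longest overall, but almost all of that time is spent waiting on the email server, which is where the self time concentrates.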

Logs provide a granular view of what exactly is going wrong. For example, there could be multiple connection refused errors in your log lines. This explains why the email server took too long to send data.

Grafana Cloud Traces

Grafana Cloud Traces is based on Tempo, an open-source, easy-to-use, and high-scale distributed tracing backend. Tempo is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki. Tempo can be used with any of the open-source tracing protocols, including Jaeger, Zipkin, and OpenTelemetry.

Grafana Cloud Traces lets you search for traces, generate metrics from spans, and link your tracing data with logs and metrics.

A deeper introduction to Tempo

Grafana Tempo is a high-volume distributed tracing backend that can retrieve a trace when queried for the trace-id. It builds an index on the high-cardinality trace-id field and uses an object store as its backend, which allows for high parallelization of queries. Read more about this in the architecture section of the docs.

Tempo has strong integrations with a number of existing open source tools, including:

  • Grafana. Grafana ships with native support for Tempo using the built-in Tempo data source.
  • Grafana Loki. Loki, with its powerful query language LogQL v2, lets you filter down to the requests you care about and jump to traces using the Derived fields support in Grafana.
  • Prometheus exemplars. Exemplars let you jump from Prometheus metrics to Tempo traces by clicking on recorded exemplars. Read more about this integration in this blog post.

Search for traces

Sample search visualization

Search for traces using common dimensions such as time range, duration, span tags, service names, etc. Use the trace view to quickly diagnose errors and high latency events in your system.

Refine your search using TraceQL

Inspired by PromQL and LogQL, TraceQL is a query language designed for selecting traces.

The default traces search reviews the whole trace. TraceQL provides a method for formulating precise queries so you can zoom in to the data you need. Query results are returned faster because the queries limit what is searched.

If you are using Cloud Traces, you can construct queries using the TraceQL query editor or use the Search query type (preview feature).

Note

The traceqlEditor feature flag needs to be enabled to access the TraceQL editor in Grafana Cloud. Contact Grafana Support to open a ticket to enable this feature.

For details about how queries are constructed, read the TraceQL documentation.
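As a rough illustration of what a TraceQL query looks like, the sketch below assembles a span selector string from a few optional filters. The helper `traceql_selector` and its parameters are hypothetical; only the query text it produces is TraceQL. It does not escape quotes in the service name, so treat it as a sketch rather than a query builder to rely on.

```python
def traceql_selector(service=None, min_duration=None, errors_only=False):
    """Build a TraceQL span selector from optional filters.

    `service` is matched against the resource.service.name attribute,
    `min_duration` is a TraceQL duration literal such as "2s" or "500ms",
    and `errors_only` keeps only spans whose status is error.
    """
    conditions = []
    if service is not None:
        conditions.append(f'resource.service.name = "{service}"')
    if min_duration is not None:
        conditions.append(f"duration > {min_duration}")
    if errors_only:
        conditions.append("status = error")
    if not conditions:
        return "{ }"  # an empty selector matches every span
    return "{ " + " && ".join(conditions) + " }"


# "email-server" is a made-up service name.
print(traceql_selector(service="email-server", min_duration="2s"))
print(traceql_selector(errors_only=True))
```

The resulting strings can be pasted into the TraceQL query editor. Each added condition narrows what is searched.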

Metrics from spans

RED metrics can be used to drive service graphs and other ready-to-go visualizations of your span data. RED metrics represent:

  • Rate, the number of requests per second
  • Errors, the number of those requests that are failing
  • Duration, the amount of time those requests take

For more information about the RED method, refer to The RED Method: How to instrument your services.
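The three RED values can be sketched as a simple aggregation over the spans seen in a time window. The span tuples and the `red_metrics` helper below are hypothetical, and reporting duration as a mean is a simplification made for this example; it is not a description of how Grafana Cloud generates its span metrics.

```python
# Hypothetical finished spans: (service, duration_ms, is_error)
spans = [
    ("email-server", 200, False),
    ("email-server", 1400, True),
    ("email-server", 300, False),
    ("email-server", 100, False),
]


def red_metrics(spans, window_seconds):
    """Compute Rate, Errors, and Duration for spans seen in one time window."""
    count = len(spans)
    errors = sum(1 for _, _, is_error in spans if is_error)
    total_ms = sum(duration_ms for _, duration_ms, _ in spans)
    return {
        "rate_per_second": count / window_seconds,              # Rate
        "errors": errors,                                       # Errors
        "mean_duration_ms": total_ms / count if count else 0.0, # Duration
    }


print(red_metrics(spans, window_seconds=60))
```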

Metrics generation is disabled by default. Contact Grafana Support to enable metrics generation for your organization.

Service graph view

These metrics exist in your Hosted Metrics instance and can also be easily used to generate powerful custom dashboards.

Custom Metrics Dashboard

Metrics generation also produces exemplars automatically, which makes it easy to link from metrics to traces. Exemplars are GA in Grafana Cloud, so you can also push your own.

Trace Exemplars

Service graph view

Service graph view displays a table of request rate, error rate, and duration metrics (RED) calculated from your incoming spans. It also includes a node graph view built from your spans. To use the service graph view, you need to enable service graphs and span metrics. Once enabled, this pre-configured view is immediately available in Explore > Service Graphs.

See service graph view documentation for further explanation of this view and how to enable it.

Service graph view overview

If you’re already doing request/response logging with trace IDs, they can be easily extracted from logs to jump directly to your traces.
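The extraction step usually comes down to a regular expression applied to each log line. The sketch below assumes the ID appears as `trace_id=<hex>` or `traceID=<hex>` with 16 to 32 hexadecimal characters. Both the log format and the sample line are assumptions made for illustration, so adapt the pattern to whatever your services actually log.

```python
import re
from typing import Optional

# Assumed log convention: trace_id=<hex> or traceID=<hex>, 16 to 32 hex chars.
TRACE_ID_PATTERN = re.compile(r"\btrace_?id=([0-9a-f]{16,32})\b", re.IGNORECASE)


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# A made-up log line in logfmt style.
line = 'level=info msg="signup accepted" trace_id=4bf92f3577b34da6a3ce929d0e0e4736 status=200'
print(extract_trace_id(line))
```

A pattern with a single capture group like this one is the kind of expression a derived field in Grafana takes to turn the matched ID into a link to the trace.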

Logs to Traces visualization

In the other direction, you can configure Grafana Cloud to create a link from an individual span to your Loki logs. If you see a long-running or errored span, you can immediately jump to the logs of the process causing the error.

Traces to Logs visualization

Refer to Set up and use tracing to get started.

Note

Cloud Traces only supports custom tags added by Grafana Support. Cloud Traces supports these default tags: cluster, hostname, namespace, and pod. Contact Support to add a custom tag.
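Conceptually, the link from a span to its logs amounts to turning the span's tags into a LogQL stream selector. The sketch below shows that idea using the four default tags named in the note. The `logql_selector` helper and the sample attribute values are hypothetical, and this is not how Grafana implements the feature.

```python
# The default tags listed in the note above.
DEFAULT_TAGS = ("cluster", "hostname", "namespace", "pod")


def logql_selector(span_attributes, tags=DEFAULT_TAGS):
    """Build a LogQL stream selector from the span attributes that match `tags`."""
    matchers = [
        f'{tag}="{span_attributes[tag]}"'
        for tag in tags
        if tag in span_attributes
    ]
    if not matchers:
        # LogQL requires at least one matcher in a stream selector.
        raise ValueError("span has none of the configured tags")
    return "{" + ", ".join(matchers) + "}"


# Made-up span attributes; http.method is ignored because it is not a configured tag.
attributes = {"namespace": "prod", "pod": "email-7d9f", "http.method": "POST"}
print(logql_selector(attributes))
```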

Grafana can correlate different signals by linking traces to metrics. The trace to metrics feature, a beta feature in Grafana 9.1, lets you quickly see trends or aggregated data related to each span.

You can try it out by enabling the traceToMetrics feature toggle in your Grafana configuration file.

For example, you can convert span attributes to metric labels by using the $__tags keyword in your query.
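To illustrate the idea behind `$__tags`, the sketch below replaces the keyword in a metrics query with label matchers built from a span's attributes. The `expand_tags` helper, the tag mapping, the `requests_total` metric, and the attribute values are all assumptions made for this example. It imitates the behavior in spirit only and is not Grafana's implementation.

```python
def expand_tags(query, span_attributes, tag_mapping):
    """Replace $__tags in `query` with label matchers built from a span.

    `tag_mapping` maps a span attribute name to the metric label it becomes.
    Attributes missing from the span are skipped.
    """
    matchers = ", ".join(
        f'{label}="{span_attributes[attribute]}"'
        for attribute, label in tag_mapping.items()
        if attribute in span_attributes
    )
    return query.replace("$__tags", matchers)


# Hypothetical mapping: the span attribute k8s.pod becomes the metric label pod.
mapping = {"cluster": "cluster", "k8s.pod": "pod"}
span = {"k8s.pod": "email-7d9f", "cluster": "us-east"}
print(expand_tags("requests_total{$__tags}", span, mapping))
```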

For more information, refer to the trace to metric configuration documentation.

Trace to profiles lets Grafana correlate signals by linking traces to profiles, so you can go from a span to the related profiling data. Refer to the relevant documentation for configuration instructions.

Selecting a link in the span queries the profile data source