Tracing with the Grafana Cloud Agent and Grafana Tempo
Back in March, we introduced the Grafana Cloud Agent, a subset of Prometheus built for hosted metrics. It uses a lot of the same battle-tested code as Prometheus and can save 40 percent on memory usage.
Ever since the launch, we’ve been adding features to the Agent. Now, there’s a clustering mechanism, additional Prometheus exporters, and support for Loki.
Our latest feature: Grafana Tempo! It’s an easy-to-operate, high-scale, and cost-effective distributed tracing system.
In this post, we’ll explore how you can configure the Agent to collect traces and ship them to Tempo.
Configuring Tempo support
Adding tracing support to your existing Agent configuration file is simple. All you need to do is add a tempo block. Those familiar with the OpenTelemetry Collector may recognize some of the settings in the following code block:
# other Agent settings
tempo:
  receivers:
    jaeger:
      protocols:
        thrift_compact:
  attributes:
    actions:
      - action: upsert
        key: env
        value: prod
  push_config:
    endpoint: tempo-us-central1.grafana.net:443
    basic_auth:
      username: 12345
      # Replace <Grafana API Key> below with an API key that has the "Metrics Publisher" role
      password: <Grafana API Key>
While the OpenTelemetry Collector allows you to configure metrics and logging receivers, we’re currently only exposing tracing-related receivers. We believe that the existing Prometheus and Loki support within the Agent will meet the needs of the other pillars of observability.
If you want, you can configure the Agent to accept data from every single receiver:
tempo:
  # The keys configure enabling a receiver or its protocol. Setting
  # it to an empty value enables the default configuration for
  # that receiver or protocol.
  receivers:
    # Configure jaeger support. grpc supports spans over port
    # 14250, thrift_binary over 6832, thrift_compact over 6831,
    # and thrift_http over 14268. Specific port numbers may be
    # customized within the config for the protocol.
    jaeger:
      protocols:
        grpc:
        thrift_binary:
        thrift_compact:
        thrift_http:
    # Configure opencensus support. Spans can be sent over port 55678
    # by default.
    opencensus:
    # Configure otlp support. Spans can be sent to port 55680 by
    # default.
    otlp:
      protocols:
        grpc:
        http:
    # Configure zipkin support. Spans can be sent to port 9411 by
    # default.
    zipkin:
Attributes, on the other hand, enable operators to manipulate tags on incoming spans sent to the Grafana Cloud Agent. This is really useful when you want to add a fixed set of metadata, such as noting an environment:
attributes:
  actions:
    - action: upsert
      key: env
      value: prod
The example config above sets an “env” tag with a value of “prod” on all received spans. The “upsert” action means that a span with an existing “env” tag will have its value overwritten, while a span without one will have the tag added. That is useful for guaranteeing you’ll know which Agent received a span and which environment it was running in.
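To make the overwrite behavior concrete, here is a minimal Python sketch of how the "upsert" action compares with the "insert" and "update" actions that the OpenTelemetry attributes processor also offers. It is only a model of the semantics, not the Agent's actual implementation; the function name apply_action and the sample tags are made up for illustration.

```python
def apply_action(attributes, action, key, value):
    """Return a copy of a span's attributes with one action applied.

    Hypothetical helper that models the semantics only:
    - upsert: set the key whether or not it already exists
    - insert: set the key only if it is missing
    - update: set the key only if it already exists
    """
    result = dict(attributes)  # never mutate the caller's dict
    exists = key in result
    if action == "upsert":
        result[key] = value
    elif action == "insert":
        if not exists:
            result[key] = value
    elif action == "update":
        if exists:
            result[key] = value
    else:
        raise ValueError(f"unsupported action: {action}")
    return result


# A span with no "env" tag gets one added.
tagged = apply_action({"service": "checkout"}, "upsert", "env", "prod")

# A span that already carries "env" has it overwritten by upsert...
overwritten = apply_action({"env": "staging"}, "upsert", "env", "prod")

# ...whereas insert would have left the existing value alone.
kept = apply_action({"env": "staging"}, "insert", "env", "prod")
```

If you would rather keep whatever value the application already set, "insert" is the action to reach for instead of "upsert".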
Attributes are really powerful and support use cases beyond the example here. Check out OpenTelemetry’s documentation on them for more information.
But at Grafana Labs, we didn’t just use a subset of the OpenTelemetry Collector and call it a day; we’ve added support for Prometheus-style scrape_configs that can be used to automatically tag incoming spans based on metadata for discovered targets.
Attaching metadata with Prometheus Service Discovery
Promtail is a logging client used to collect logs and send them to Loki. One of its most powerful capabilities is its support for using Prometheus' service discovery mechanisms. These service discovery mechanisms enable you to attach the same metadata to your logs as your metrics.
When your metrics and logs have the same metadata, you lower the cognitive overhead in switching between systems, and it gives the “feel” of all of your data being stored in one system. We wanted this capability to be extended to tracing, as well.
Joe Elliott added the same Prometheus Service Discovery mechanisms within the Agent’s tracing subsystem. It works by matching the IP address of the system sending spans to the address of a discovered Service Discovery target.
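The IP-matching idea can be sketched in a few lines of Python. This is a simplified model, not the Agent's code: the target dictionaries, the function names, and the choice to let discovered labels overwrite same-named span tags are all assumptions made for illustration, and it only handles IPv4-style host:port addresses.

```python
def build_target_index(targets):
    """Map each discovered target's IP address to its labels.

    `targets` is a hypothetical list of dicts shaped like
    {"address": "10.0.0.5:8080", "labels": {...}}.
    """
    index = {}
    for target in targets:
        host = target["address"].rsplit(":", 1)[0]  # drop the port, if any
        index[host] = target["labels"]
    return index


def attach_metadata(span, sender_ip, index):
    """Return a copy of the span tagged with the labels of the matching target.

    Spans from an address with no discovered target pass through unchanged.
    """
    merged = dict(span["attributes"])
    merged.update(index.get(sender_ip, {}))  # assumption: discovered labels win
    return {**span, "attributes": merged}


discovered = [
    {
        "address": "10.0.0.5:8080",
        "labels": {"namespace": "shop", "pod": "checkout-abc", "container": "app"},
    },
]
index = build_target_index(discovered)

span = {"name": "GET /cart", "attributes": {"http.method": "GET"}}
matched = attach_metadata(span, "10.0.0.5", index)
unmatched = attach_metadata(span, "10.0.0.9", index)
```

Here `matched` picks up the namespace, pod, and container labels because the sender's address equals the discovered target's address, while `unmatched` keeps only the tags the application set itself.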
For Kubernetes users, this means that you can dynamically attach metadata for namespace, pod, and container name of the container sending spans:
tempo:
  receivers:
    jaeger:
      protocols:
        thrift_compact:
  scrape_configs:
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: false
  # push_config, etc
Screenshot of a span with Kubernetes Service Discovery Metadata attached
This feature isn’t just useful for Kubernetes users, however. All of Prometheus' various service discovery mechanisms are supported here. This means you can use the same scrape_configs for your metrics, logs, and traces to get the same set of labels, and easily move between the three when exploring your observability data.
Configuring how spans are pushed
Of course, just collecting spans isn’t very useful! The final part of configuring Tempo support is push_config, a remote_write-like configuration block that controls where collected spans are sent.
For the curious, this is a wrapper for the OpenTelemetry Collector’s OTLP exporter. Since the Agent exports OTLP-formatted spans, it means you can send the spans to any system that supports OTLP data. We’re focused on Tempo today, but you could even have the Agent send spans to another OpenTelemetry Collector.
Aside from the endpoint and authentication, push_config allows you to control the batching, queueing, and retry functionality of spans. Batching allows for better compression of spans and reduces the number of outgoing connections used for transmitting data to Tempo. As before, OpenTelemetry has some pretty good documentation on this.
tempo:
  push_config:
    endpoint: tempo-us-central1.grafana.net:443
    basic_auth:
      username: 12345
      password: api_key
    # Batch settings for spans. Finish a batch after collecting
    # 10,000 spans or after 10s, whichever comes first.
    batch:
      send_batch_size: 10000
      timeout: 10s
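The "whichever comes first" rule can be modeled with a small Python sketch. It is a toy illustration of size-or-timeout batching, not the batch processor's real implementation: the SpanBatcher class is hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class SpanBatcher:
    """Toy batcher: flush when the batch is full or has been open too long."""

    def __init__(self, send_batch_size, timeout_seconds):
        self.send_batch_size = send_batch_size
        self.timeout_seconds = timeout_seconds
        self.pending = []             # spans waiting in the current batch
        self.batch_started_at = None  # time the first pending span arrived
        self.flushed = []             # batches "sent" so far

    def add(self, span, now):
        """Queue a span and flush immediately if the batch is full."""
        if not self.pending:
            self.batch_started_at = now
        self.pending.append(span)
        if len(self.pending) >= self.send_batch_size:
            self._flush()

    def tick(self, now):
        """Called periodically: flush a partial batch once it times out."""
        if self.pending and now - self.batch_started_at >= self.timeout_seconds:
            self._flush()

    def _flush(self):
        self.flushed.append(self.pending)
        self.pending = []
        self.batch_started_at = None


# Small numbers stand in for send_batch_size: 10000 and timeout: 10s.
batcher = SpanBatcher(send_batch_size=3, timeout_seconds=10.0)

for i in range(3):
    batcher.add(f"span-{i}", now=0.0)  # third span fills the batch

batcher.add("span-3", now=1.0)  # starts a new, partial batch
batcher.tick(now=5.0)           # only 4s old: nothing is sent yet
batcher.tick(now=11.0)          # 10s old: the partial batch goes out
```

A busy service fills batches by size and rarely waits for the timer, while a quiet one mostly flushes on the timeout, which bounds how long a span waits before it is sent.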
On the other side, queues and retries allow you to configure how many batches will be kept in memory and how long you will retry a batch if it happens to fail. These settings are identical to the sending_queue settings from OpenTelemetry’s OTLP exporter:
tempo:
  push_config:
    endpoint: tempo-us-central1.grafana.net:443
    basic_auth:
      username: 12345
      password: api_key
    # Double the default queue size to keep more batches
    # in memory but give up on retrying failed spans after 5s.
    sending_queue:
      queue_size: 10000
    retry_on_failure:
      max_elapsed_time: 5s
While it may be tempting to set the max retry time very high, it can quickly become dangerous. Retries will increase the total amount of network traffic from the Agent to Tempo, and it may be better to drop spans rather than continually retrying. Another risk is with memory usage: If your backend were to have an outage, a high retry time will quickly fill up span queues and may topple over the Agent with an out-of-memory error.
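A rough back-of-the-envelope calculation shows how quickly a full queue can add up. The Python sketch below multiplies the queue_size and send_batch_size values used in the examples above by an average span size. The 500-byte figure is purely an assumption for illustration, and the estimate treats every queued batch as full, so read it as an upper bound rather than a measurement.

```python
def worst_case_queue_bytes(queue_size, send_batch_size, avg_span_bytes):
    """Upper bound on memory held by a completely full sending queue.

    Assumes the queue counts batches and that every batch is full.
    """
    return queue_size * send_batch_size * avg_span_bytes


estimate = worst_case_queue_bytes(
    queue_size=10_000,       # batches kept in memory
    send_batch_size=10_000,  # spans per batch
    avg_span_bytes=500,      # hypothetical average span size
)
print(f"{estimate / 1e9:.0f} GB")  # prints "50 GB"
```

In practice batches are often far smaller than the maximum, but the exercise is a useful sanity check before raising the queue size or the retry time on an Agent with a fixed memory limit.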
Since it’s not practical for 100 percent of the spans to be stored for a system with a lot of span throughput, controlling the batching, queue, and retry logic to meet your specific network usage will be crucial for effective tracing.
See you next time!
We’ve touched on how to manually configure the Grafana Cloud Agent for tracing support, but for a practical example, check out the production-ready tracing Kubernetes manifest. This manifest comes with a configuration that touches on everything here, including the Service Discovery mechanism to automatically attach Kubernetes metadata to incoming spans.
I’d like to extend a huge thanks to Joe for taking time out of his very busy schedule with Tempo to add tracing support within the Agent. I’m really excited that the Grafana Cloud Agent now supports most of the Grafana stack, and I’m more excited for what’s to come down the line!