Java HTTP Metrics from OpenTelemetry Traces
As of today, traces are the most mature, most widely used, and best-supported OpenTelemetry signal. In many SDKs, such as OpenTelemetry’s Go SDK, metrics are still in beta while traces are stable.
So if you want to create a Grafana dashboard for monitoring HTTP services, it is attractive to build it on OpenTelemetry traces rather than on metrics directly: that way you can reuse the same dashboard across a large number of frameworks and SDKs.
This tutorial shows how to generate HTTP metrics from trace data. We’ll use Java as an example, but the same approach works with any OpenTelemetry-compliant traces.
System Architecture
There are different components that can generate metrics from traces:
- The Grafana Agent’s spanmetrics generator
- The OpenTelemetry Collector’s spanmetrics connector
- Tempo’s built-in span-metrics processor
In this example we’ll go with Tempo’s built-in span-metrics processor. The architecture looks as follows:
- We’ll run an example Java REST service and instrument it with OpenTelemetry’s Java instrumentation.
- The instrumentation will send Spans to Grafana Tempo, which is an open source trace database.
- Tempo will generate metrics from trace data and write these metrics to a Prometheus server. We are using Prometheus in this example. However, the same scenario also works with compatible Prometheus alternatives like Mimir.
- We’ll set up an example dashboard for visualizing the metrics in Grafana.
Set Up the Prometheus Server
The Prometheus setup does not require any specific configuration for our example. However, we need to pass the --web.enable-remote-write-receiver command line parameter to enable remote write, because Tempo will use Prometheus’ remote write interface to push the generated metrics to the Prometheus server. We’ll also enable exemplars with the --enable-feature=exemplar-storage feature flag to allow navigation from metrics to traces.
You can either download the Prometheus server from the GitHub releases, or run the Docker image like this:
docker run --rm --name=prometheus --network=host -p 9090:9090 prom/prometheus:v2.37.7 --web.enable-remote-write-receiver --enable-feature=exemplar-storage --config.file=/etc/prometheus/prometheus.yml
After successful startup, the Prometheus server’s web interface will be accessible at http://localhost:9090.
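To confirm that the server is up and that both flags took effect, you can query Prometheus’ readiness and flags endpoints. This is just a sanity check and not required for the rest of the tutorial:
# Readiness check
curl -s http://localhost:9090/-/ready

# The flags endpoint lists the effective command line flags; check that
# web.enable-remote-write-receiver is true and enable-feature contains exemplar-storage
curl -s http://localhost:9090/api/v1/status/flags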
Set Up Tempo
The span-metrics processor is not enabled in Tempo by default, so we create a file config.yaml with the following content:
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:

storage:
  trace:
    backend: local
    wal:
      path: /tmp/tempo/wal
    local:
      path: /tmp/tempo/blocks

metrics_generator:
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://localhost:9090/api/v1/write
        send_exemplars: true

overrides:
  metrics_generator_processors: [span-metrics]
The server, distributor, and storage sections are just a minimal example setup. The interesting parts are the overrides section, where the span-metrics processor is enabled, and the metrics_generator section, where we configure where to send the generated metrics. Note that localhost:9090 is our Prometheus server.
Tempo provides a variety of config options for span-metrics; see the Tempo documentation for reference.
You can either download Tempo from the GitHub releases, or run the Docker image. With the config.yaml in the current directory, the Docker command looks like this:
docker run --rm --name=tempo --network=host -v "$(pwd)/config.yaml:/config.yaml" -p 3200:3200 -p 4317:4317 -p 4318:4318 grafana/tempo:2.1.1 --config.file=/config.yaml
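As a quick sanity check, Tempo exposes a readiness endpoint and its own Prometheus metrics on the configured http_listen_port. Once spans arrive, the metrics-generator counters should start increasing; the exact metric names vary between Tempo versions:
# Readiness check
curl -s http://localhost:3200/ready

# Tempo's internal metrics; grep for metrics-generator activity
curl -s http://localhost:3200/metrics | grep metrics_generator | head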
Set Up Grafana
Setting up Grafana is straightforward: you can either download it from the GitHub releases page, or run the Docker image:
docker run --rm --name=grafana --network=host -p 3000:3000 grafana/grafana:9.5.1
The Grafana Web UI will start up on http://localhost:3000.
Log in to Grafana’s Web UI (user: admin, password: admin), and configure the Prometheus server and Tempo as data sources:
The Prometheus URL is http://localhost:9090.
The Tempo URL is http://localhost:3200.
Then go back to the Prometheus data source and configure Tempo as the target for exemplar links.
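If you prefer scripting over clicking through the UI, the data sources can also be created via Grafana’s HTTP API. The following is only a sketch: it assumes the default admin:admin credentials, picks the fixed uid tempo for the Tempo data source, and guesses trace_id as the exemplar label name, which you may need to adjust to whatever label Tempo attaches to its exemplars.
# Tempo data source (fixed uid so the Prometheus exemplar links can reference it)
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources -d '{
    "name": "Tempo",
    "type": "tempo",
    "uid": "tempo",
    "url": "http://localhost:3200",
    "access": "proxy"
  }'

# Prometheus data source with exemplar links pointing at the Tempo data source;
# the label name "trace_id" is an assumption and may need adjusting
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "jsonData": {
      "exemplarTraceIdDestinations": [
        { "datasourceUid": "tempo", "name": "trace_id" }
      ]
    }
  }'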
Instrument the Java Application
As an example, we implemented a simple Hello World REST service using Spring Boot 2. You can create the example as follows:
- Go to start.spring.io
- Select Spring Boot version 2.7.11.
- Select Java version 11.
- Add “Spring Web” as a dependency. This will add spring-boot-starter-web as a dependency.
- Click “Generate” to download demo.zip.
- Add the code below to the example in src/main/java/com/example/demo/DemoApplication.java.
- Build with ./gradlew build.
- The application can be found in build/libs/demo-0.0.1-SNAPSHOT.jar.
The Java code does not include any explicit instrumentation:
package com.example.demo;

import java.util.Random;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DemoApplication {

    private final Random random = new Random(1);

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }

    // Simulate ~90ms of latency and fail roughly 30% of requests
    // so that the generated metrics have something interesting to show.
    @GetMapping("/")
    public String sayHello() throws InterruptedException {
        Thread.sleep(90);
        if (random.nextInt(10) < 3) {
            throw new RuntimeException("simulating an error");
        }
        return "Hello, World!\n";
    }
}
To instrument our Java application, we attach the OpenTelemetry Java instrumentation agent. Download opentelemetry-javaagent.jar from the GitHub releases:
curl -OL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v1.27.0/opentelemetry-javaagent.jar
Configure the instrumentation via OpenTelemetry’s standard environment variables, and run the demo service with the agent attached:
export OTEL_SERVICE_NAME=hello-world-service
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=none
export OTEL_LOGS_EXPORTER=none
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
java -javaagent:opentelemetry-javaagent.jar -jar ./build/libs/demo-0.0.1-SNAPSHOT.jar
Now access http://localhost:8080 a couple of times to produce some example telemetry.
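For a steadier stream of example data, you can run a small loop like the following, which sends one request per second; roughly 30% of the requests will fail because of the simulated error:
# Send 100 requests, about one per second; ~30% will return an error response
for i in $(seq 1 100); do
  curl -s http://localhost:8080/
  sleep 1
done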
Verify that Data is Available
Let’s use Grafana’s Explore view to verify that data is available, starting with traces.
- Log in to Grafana at http://localhost:3000. Default user is admin with password admin.
- Navigate to Explore in the menu on the left.
- Select Tempo as a data source.
- Use the Search form and select hello-world-service as the service name.
- Click the Run query button.
You should see example traces representing your test calls to http://localhost:8080.
Now let’s verify that Tempo successfully generated span metrics.
- Remain in the Explore view, but select Prometheus as the data source.
- In the query window, type traces_spanmetrics_latency_count.
- Click the Run query button.
You should now see a metric representing the total number of your test calls to http://localhost:8080.
Note that there are plans to change the names of Tempo’s span metrics. If you cannot find the metric named traces_spanmetrics_latency_count, take a look at Tempo’s metrics generator documentation and verify the metric name.
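One way to check which span-metric names Tempo actually wrote is to list all metric names known to Prometheus and filter for the traces_ prefix:
# List the metric names known to Prometheus and keep the span metrics
curl -s http://localhost:9090/api/v1/label/__name__/values | tr ',' '\n' | grep traces_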
Create a Dashboard for Monitoring the HTTP Service
The most common methodology for creating a dashboard for an HTTP service is the RED method: the dashboard shows the request rate, the error rate, and the request duration.
In Grafana’s menu on the left, click “Dashboards”, create a new dashboard, and start adding visualizations. The following sections show which visualizations and which queries to use.
Request Rate
The request rate in requests per second can be calculated from the traces_spanmetrics_latency_count metric with the following Prometheus query:
sum by (span_name) (rate(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))
Note that a single request may create multiple spans, so we filter by span_kind="SPAN_KIND_SERVER" to make sure each request is counted exactly once.
For showing the request rate on a dashboard, use the time series visualization and configure “requests/sec” as the unit.
Note that requests to unknown URLs are represented as GET /** by the OpenTelemetry instrumentation. You’ll see this if you point your web browser to localhost:8080, because the browser will also attempt to access localhost:8080/favicon.ico, which is an unknown URL in our REST service.
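If you want to debug a query outside of Grafana, you can also evaluate it directly against the Prometheus HTTP API, for example:
# Evaluate the request-rate query via the Prometheus HTTP API
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (span_name) (rate(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))'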
Error Rate
The basic query for the error rate looks like this:
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[15m]))
/
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))
This takes the number of erroneous requests in the past 15 minutes (the requests with status_code="STATUS_CODE_ERROR"), and divides it by the total number of requests in the past 15 minutes. The result is the error rate, represented as a number between 0 and 1.
However, there’s a caveat: if there are no errors, the filter status_code="STATUS_CODE_ERROR" won’t match anything, and the query won’t produce results. That means endpoints with a 100% success rate will be missing.
The following is a PromQL trick to work around this:
(
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[15m]))
or
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]) * 0)
) /
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))
The or in the numerator means that if there is no match with status_code="STATUS_CODE_ERROR", we use 0 instead (the total number of requests multiplied by 0).
That way, even endpoints with a 100% success rate (0% error rate) will be included.
Note that the query will return NaN if there was no traffic in the past 15m, because in that case it’s a division by zero. This is expected: if there is no traffic, you cannot calculate a meaningful error rate, and therefore NaN is a reasonable representation.
To visualize the error rate on a dashboard, choose the time series visualization with unit Percent (0.0-1.0).
Duration
For the latency example, we use the histogram_quantile() function to calculate the 95th percentile. The query looks like this:
histogram_quantile(0.95, sum by (span_name, le) (rate(traces_spanmetrics_latency_bucket{span_kind="SPAN_KIND_SERVER"}[15m])))
To visualize the 95th percentile, choose the time series visualization with unit seconds.
The dots on the duration visualization are exemplars. Clicking on an exemplar provides a link for querying the corresponding trace in Tempo.
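To verify that exemplars actually arrived in Prometheus, you can also query its query_exemplars API directly; the timestamp arithmetic below assumes GNU date:
# Fetch exemplars for the latency histogram over the last 15 minutes
# (on macOS, replace the date calls with e.g. `date -v-15M +%s`)
curl -s -G http://localhost:9090/api/v1/query_exemplars \
  --data-urlencode 'query=traces_spanmetrics_latency_bucket{span_kind="SPAN_KIND_SERVER"}' \
  --data-urlencode "start=$(date -d '15 minutes ago' +%s)" \
  --data-urlencode "end=$(date +%s)"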
Summary
This tutorial showed a complete example of how to set up a dashboard for monitoring HTTP services purely based on metrics generated from trace data. As distributed tracing is the most mature signal in OpenTelemetry, this approach will work across a wide range of SDKs and frameworks.