Java HTTP Metrics from OpenTelemetry Traces

As of today, traces are the most mature, most widely used, and best-supported OpenTelemetry signal. In many SDKs, such as OpenTelemetry’s Go SDK, metrics are still in beta while traces are stable.

So if you want to create a Grafana dashboard for monitoring HTTP services, it is attractive to build it on OpenTelemetry traces rather than on metrics directly: that way, you can reuse the dashboard across a large number of frameworks and SDKs.

This tutorial shows how to generate HTTP metrics from trace data. We’ll use Java as an example, but the same approach works with any OpenTelemetry-compliant traces.

System Architecture

There are different components that can generate metrics from traces.

In this example we’ll go with Tempo’s built-in span-metrics processor. The architecture looks as follows:

Span metrics demo architecture

  • We’ll run an example Java REST service and instrument it with OpenTelemetry’s Java instrumentation.
  • The instrumentation will send Spans to Grafana Tempo, which is an open source trace database.
  • Tempo will generate metrics from the trace data and write them to a Prometheus server. We are using Prometheus in this example, but the same scenario also works with Prometheus-compatible alternatives like Mimir.
  • We’ll set up an example dashboard for visualizing the metrics in Grafana.
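
If you prefer to start all three backend components together, the whole architecture can also be sketched as a Docker Compose file. The snippet below is only a hypothetical convenience and not part of the step-by-step instructions: it assumes the Tempo config.yaml that we create in the “Set up Tempo” section, and it uses host networking so that the localhost URLs used throughout this tutorial keep working unchanged.

yaml
# Hypothetical docker-compose.yaml combining the three backend components.
# Host networking keeps all localhost URLs from this tutorial working.
services:
  prometheus:
    image: prom/prometheus:v2.37.7
    network_mode: host
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver
      - --enable-feature=exemplar-storage
  tempo:
    image: grafana/tempo:2.1.1
    network_mode: host
    volumes:
      - ./config.yaml:/config.yaml   # the Tempo config created below
    command:
      - --config.file=/config.yaml
  grafana:
    image: grafana/grafana:9.5.1
    network_mode: host

The remainder of this tutorial starts each container individually with docker run, which is equivalent.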

Set Up the Prometheus Server

The Prometheus setup does not require any specific configuration for our example. However, we need to pass the --web.enable-remote-write-receiver command line parameter to enable remote write, because Tempo will use Prometheus’ remote write interface to push the generated metrics to the Prometheus server. We’ll also enable Exemplars with the --enable-feature=exemplar-storage feature flag to allow navigation from metrics to traces.

You can either download the Prometheus server from the GitHub releases, or run the Docker image like this:

shell
docker run --rm --name=prometheus --network=host -p 9090:9090 prom/prometheus:v2.37.7 --web.enable-remote-write-receiver --enable-feature=exemplar-storage --config.file=/etc/prometheus/prometheus.yml

After successful startup, the Prometheus server’s Web interface will be accessible at http://localhost:9090.

Set up Tempo

The span-metrics processor is not enabled in Tempo by default, so we create a file config.yaml with the following content:

yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:

storage:
  trace:
    backend: local
    wal:
      path: /tmp/tempo/wal
    local:
      path: /tmp/tempo/blocks

metrics_generator:
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://localhost:9090/api/v1/write
        send_exemplars: true

overrides:
  metrics_generator_processors: [span-metrics]

The server, distributor, and storage sections are just a minimal example setup. The interesting parts are the overrides section, where the span-metrics processor is enabled, and the metrics_generator section, where we configure where the generated metrics are sent. Note that localhost:9090 is our Prometheus server. Tempo provides a variety of config options for span-metrics; see the Tempo documentation for reference.
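
For example, at the time of writing the span-metrics processor can add selected span attributes as extra labels on the generated metrics. The snippet below is only an illustrative sketch; option names may differ between Tempo versions, so verify them against the documentation of the version you are running.

yaml
metrics_generator:
  processor:
    span_metrics:
      # Add selected span attributes as labels on the generated metrics
      # (illustrative; check your Tempo version's documentation).
      dimensions:
        - http.method
        - http.status_code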

You can either download Tempo from the GitHub releases, or run the Docker image.

With the config.yaml in the current directory, the Docker command looks like this:

shell
docker run --rm --name=tempo --network=host -v "$(pwd)/config.yaml:/config.yaml" -p 3200:3200 -p 4317:4317 -p 4318:4318 grafana/tempo:2.1.1 --config.file=/config.yaml

Set up Grafana

You can either download Grafana from the GitHub releases page, or run the Docker image.

Setting up Grafana does not require any special configuration for this example. With Docker, the command looks like this:

shell
docker run --rm --name=grafana --network=host -p 3000:3000 grafana/grafana:9.5.1

The Grafana Web UI will start up on http://localhost:3000.

Log in to Grafana’s Web UI (user: admin, password: admin), and configure the Prometheus server and Tempo as data sources:

The Prometheus URL is http://localhost:9090.

Prometheus Data Source Configuration

The Tempo URL is http://localhost:3200.

Tempo Data Source Configuration

Now go back to the Prometheus data source, and configure Tempo as the target for Exemplar links.

Prometheus Data Source Exemplar Configuration
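
If you prefer configuring data sources as code instead of clicking through the UI, the same setup can be expressed with Grafana’s data source provisioning. The following file is a hypothetical sketch for this demo: the data source names, the tempo uid, and the trace_id exemplar label are assumptions made for illustration and may need to be adjusted for your environment.

yaml
# Hypothetical provisioning file, e.g. provisioning/datasources/demo.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://localhost:3200
    access: proxy
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    access: proxy
    jsonData:
      exemplarTraceIdDestinations:
        # Link Exemplars to the Tempo data source. The label carrying the
        # trace ID ("trace_id") is an assumption; check your Exemplars.
        - name: trace_id
          datasourceUid: tempo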

Instrument the Java Application

As an example, we implemented a simple Hello World REST service using Spring Boot 2. You can create the example as follows:

  • Go to start.spring.io
  • Select Spring Boot version 2.7.11.
  • Select Java version 11.
  • Add “Spring Web” as a dependency. This will add spring-boot-starter-web to the project.
  • Click “Generate” to download demo.zip.
  • Replace the contents of src/main/java/com/example/demo/DemoApplication.java with the code below.
  • Build with ./gradlew build.
  • The application can be found in build/libs/demo-0.0.1-SNAPSHOT.jar.

The Java code does not include any explicit instrumentation:

java
package com.example.demo;

import java.util.Random;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DemoApplication {

    private final Random random = new Random(1);

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }

    @GetMapping("/")
    public String sayHello() throws InterruptedException {
        Thread.sleep(90);
        if (random.nextInt(10) < 3) {
            throw new RuntimeException("simulating an error");
        }
        return "Hello, World!\n";
    }
}

In order to instrument our Java application, we attach the OpenTelemetry Java instrumentation agent. Download opentelemetry-javaagent.jar from the GitHub releases:

shell
curl -OL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v1.27.0/opentelemetry-javaagent.jar

Configure the OpenTelemetry Java instrumentation via OpenTelemetry’s standard environment variables and run the demo service with the agent attached:

shell
export OTEL_SERVICE_NAME=hello-world-service
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=none
export OTEL_LOGS_EXPORTER=none
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317

java -javaagent:opentelemetry-javaagent.jar -jar ./build/libs/demo-0.0.1-SNAPSHOT.jar

Now access http://localhost:8080 a couple of times to produce some example telemetry.

Verify that Data is Available

Let’s use Grafana’s Explore view to verify that data is available, starting with traces.

  • Log in to Grafana at http://localhost:3000. Default user is admin with password admin.
  • Navigate to Explore in the menu on the left.
  • Select Tempo as a data source.
  • Use the Search form and select hello-world-service as the service name.
  • Click the Run query button.

You should see example traces representing your test calls to http://localhost:8080.

Screenshot of the Tempo Explore View

Now let’s verify that Tempo successfully generated span metrics.

  • Remain in the Explore view, but select Prometheus as the data source.
  • In the query window, type traces_spanmetrics_latency_count.
  • Click the Run query button.

You should now see a metric representing the total number of your test calls to http://localhost:8080.

Screenshot of the Prometheus Explore View

Note that there are plans to change the names of Tempo’s span metrics. If you cannot find the metric named traces_spanmetrics_latency_count, take a look at Tempo’s metrics generator documentation and verify the metric name.

Create a Dashboard for Monitoring the HTTP service

The most common methodology for creating a dashboard for an HTTP service is the RED method: the dashboard shows the request rate, the error rate, and the request duration.

In Grafana’s menu on the left, click on “Dashboards”, create a new dashboard, and start adding visualizations. The following sections show which visualizations and which queries to use.

Request Rate

The request rate per second can be calculated from the traces_spanmetrics_latency_count metric with the following Prometheus query:

sum by (span_name) (rate(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))

Note that a single request may create multiple spans, so we filter by span_kind="SPAN_KIND_SERVER" to count each request exactly once. For showing the request rate on a dashboard, use the time series visualization and configure “requests/sec” as the unit.

Screenshot of the Request Rate Visualization

Note that requests to unknown URLs are represented as GET /** by the OpenTelemetry instrumentation. You’ll see this if you point your Web browser to localhost:8080, because the browser will also attempt to access localhost:8080/favicon.ico, which is an unknown URL in our REST service.

Error Rate

The basic query for the error rate looks like this:

sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[15m]))
/
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))

This takes the number of erroneous requests in the past 15 minutes (the requests with status_code="STATUS_CODE_ERROR"), and divides it by the total number of requests in the past 15 minutes. The result is the error rate, represented as a ratio between 0 and 1.

However, there’s a caveat: if there are no errors, the filter status_code="STATUS_CODE_ERROR" won’t match anything, and the query won’t produce results. That means endpoints with a 100% success rate will be missing.

The following is a PromQL trick to work around this:

(
    sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[15m]))
    or
    sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]) * 0)
) /
sum by (span_name) (increase(traces_spanmetrics_latency_count{span_kind="SPAN_KIND_SERVER"}[15m]))

The or in the numerator means that if there is no match with status_code="STATUS_CODE_ERROR", we use 0 (the total number of requests times 0) instead. That way, even endpoints with a 100% success rate (0% error rate) will be included.

Note that the query will return NaN if there was no traffic in the past 15 minutes, because in that case it is a division by zero. This is expected: if there is no traffic, you cannot calculate a meaningful error rate, and NaN is a reasonable representation.

To visualize the error rate on a dashboard, choose the time series visualization with unit Percent (0.0-1.0).

Screenshot of the Error Rate Visualization

Duration

For the duration, we use the histogram_quantile() function to calculate the 95th percentile latency. The query looks like this:

histogram_quantile(0.95, sum by (span_name, le) (rate(traces_spanmetrics_latency_bucket{span_kind="SPAN_KIND_SERVER"}[15m])))

To visualize the 95th percentile, choose the time series visualization with unit seconds.

Screenshot of the Duration Visualization

The dots on the duration visualization are Exemplars. Clicking on an Exemplar provides a link for querying an example trace in Tempo.

Screenshot of an Exemplar

Summary

This tutorial showed a complete example of how to set up a dashboard for monitoring HTTP services purely based on metrics generated from trace data. As distributed tracing is the most mature signal in OpenTelemetry, this approach will work across a wide range of SDKs and frameworks.