Request, error, and duration (RED)
Learn how the knowledge graph maps requests, errors, and duration metrics.
Request Rate
In the knowledge graph, the asserts:request:total metric records the total count of requests.
In Spring Boot, the metrics for incoming requests are available through the http_server_requests_seconds histogram, and the metrics for
outgoing calls are available through the http_client_requests_seconds histogram. These metrics are mapped
to asserts:request:total for incoming and outgoing requests respectively.
# Incoming requests
- record: asserts:request:total
  expr: |-
    label_replace(http_server_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_request: total
    asserts_request_type: inbound
# Outgoing requests made through Spring classes like RestTemplate
- record: asserts:request:total
  expr: |-
    label_replace(http_client_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_request: total
    asserts_request_type: outbound

After these rules are added, the following happens:
- The Request Rate is computed and shown in the Service KPI Dashboard.
- The Request Rate is observed for anomalies, and the RequestRateAnomaly is triggered when there are anomalies.
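For reference, the Request Rate shown in the dashboard corresponds to a per-second rate over the recorded metric, conceptually along these lines (the label set here is illustrative and mirrors the aggregation used for the Error Ratio below):

# Illustrative query: request rate per request context
sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context)
  (rate(asserts:request:total[5m]))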
Note
In the previous example, the source metric is available as a counter, so it is mapped to
asserts:request:total. If the source metric were a gauge, it should instead be mapped to asserts:request:gauge with the label asserts_metric_request: gauge.
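A minimal sketch of that gauge case, assuming a hypothetical gauge metric named my_app_inflight_requests (the metric name and the asserts_source value are illustrative only, not part of the Spring Boot example above):

# Hypothetical gauge source metric; metric name and source label are illustrative
- record: asserts:request:gauge
  expr: |-
    label_replace(my_app_inflight_requests, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: my_app
    asserts_metric_request: gauge
    asserts_request_type: inbound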
Error Ratio
In the knowledge graph, the asserts:error:total metric records the total count of errors, broken down by different error types.
Let’s add these rules for Spring Boot inbound and outbound requests:
# Inbound request errors
- record: asserts:client:error:total
  expr: |
    label_replace(http_server_requests_seconds_count{status=~"4.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: client_total
    asserts_request_type: inbound
- record: asserts:error:total
  expr: |
    label_replace(http_server_requests_seconds_count{status=~"5.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: total
    asserts_request_type: inbound
    asserts_error_type: server_errors
# Outbound request errors
- record: asserts:error:total
  expr: |
    label_replace(http_client_requests_seconds_count{status=~"4.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: total
    asserts_request_type: outbound
    asserts_error_type: client_errors
- record: asserts:error:total
  expr: |
    label_replace(http_client_requests_seconds_count{status=~"5.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: total
    asserts_request_type: outbound
    asserts_error_type: server_errors

After these rules are added, the following happens:
- The Error Ratio is computed for all request contexts and shown in the Service KPI Dashboard. The ratio is computed as:

  sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context, asserts_error_type)
    (rate(asserts:error:total[5m]))
  / ignoring(asserts_error_type) group_left
  sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context)
    (rate(asserts:request:total[5m]))

  Note that the labels used in the aggregation for the asserts:request:total and asserts:error:total metrics must match for the ratio to be recorded.
- The ErrorRatioBreach is triggered if the ratio breaches a certain threshold.
- The ErrorBuildup (multi-burn, multi-window) alert is triggered if the error budget is breached.
- The Error Ratio is observed for anomalies, and ErrorRatioAnomaly is triggered when there are anomalies.
Note
In the previous example, the source metric is available as a counter, so it is mapped to
asserts:error:total. If the source metric were a gauge, it should instead be mapped to asserts:error:gauge with the label asserts_metric_error: gauge, or asserts_metric_error: client_gauge in the case of inbound client errors.
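A minimal sketch of that gauge case, assuming a hypothetical gauge metric named my_app_server_errors (the metric name and the asserts_source value are illustrative only; the other labels simply mirror the corresponding counter rule above):

# Hypothetical gauge source metric; metric name and source label are illustrative
- record: asserts:error:gauge
  expr: |
    label_replace(my_app_server_errors, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: my_app
    asserts_metric_error: gauge
    asserts_request_type: inbound
    asserts_error_type: server_errors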
Latency Average
The knowledge graph computes the latency average using the following metrics:
- asserts:latency:total - the latency total time in seconds
- asserts:latency:count - the total number of requests
Add recording rules for these two metrics from the respective histogram metrics, using their _sum and _count series.
# Inbound latency
- record: asserts:latency:total
  expr: |
    label_replace(http_server_requests_seconds_sum, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: seconds_sum
    asserts_request_type: inbound
- record: asserts:latency:count
  expr: |
    label_replace(http_server_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: count
    asserts_request_type: inbound
# Outbound latency
- record: asserts:latency:total
  expr: |
    label_replace(http_client_requests_seconds_sum, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: seconds_sum
    asserts_request_type: outbound
- record: asserts:latency:count
  expr: |
    label_replace(http_client_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: count
    asserts_request_type: outbound

After these rules are added, the following occurs:
- Latency Average is computed for all requests and shown in the Service KPI Dashboards.
- The Latency Average is observed for anomalies, and LatencyAverageAnomaly is triggered when there are anomalies.
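For reference, the Latency Average corresponds to a ratio of rates over the two recorded metrics, conceptually along these lines (the label set here is illustrative and mirrors the Error Ratio aggregation above):

# Illustrative query: average latency per request context
sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context)
  (rate(asserts:latency:total[5m]))
/
sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context)
  (rate(asserts:latency:count[5m]))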
Note
In the previous example, the source metric is available as a counter, so it is mapped to
asserts:latency:total and asserts:latency:count. If the source metric were a gauge, it should be mapped directly to asserts:latency:average. While doing this, be mindful of the labels in the source metric: when the source is a counter, Asserts aggregates internally and retains only the key labels, which reduces the cardinality of the metrics it records. With a direct mapping, this is not the case.
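A minimal sketch of that direct mapping, assuming a hypothetical gauge my_app_latency_average_seconds that already reports the average latency in seconds (the metric name and the asserts_source value are illustrative only):

# Hypothetical gauge that already reports average latency in seconds
- record: asserts:latency:average
  expr: |
    label_replace(my_app_latency_average_seconds, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: my_app
    asserts_request_type: inbound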
Latency P99
Similarly, we can record the latency p99 for the requests as follows:
# Inbound requests latency P99
- record: asserts:latency:p99
  expr: >
    label_replace(
      histogram_quantile(
        0.99,
        sum(rate(http_server_requests_seconds_bucket[5m]) > 0) by (le, namespace, job, service, workload, uri, asserts_env, asserts_site)
      ),
      "asserts_request_context", "$1", "uri", "(.+)"
    )
  labels:
    asserts_source: spring_boot
    asserts_entity_type: Service
    asserts_request_type: inbound
# Outbound requests latency P99
- record: asserts:latency:p99
  expr: >
    label_replace(
      histogram_quantile(
        0.99,
        sum(rate(http_client_requests_seconds_bucket[5m]) > 0) by (le, namespace, job, service, workload, uri, asserts_env, asserts_site)
      ),
      "asserts_request_context", "$1", "uri", "(.+)"
    )
  labels:
    asserts_source: spring_boot
    asserts_entity_type: Service
    asserts_request_type: outbound

After this is recorded, the knowledge graph shows this metric in the Service KPI Dashboard and begins observing the clock minutes in which the Latency P99 exceeds a threshold. These minutes are tracked through a total bad-minutes counter.
Based on the ratio of bad minutes to total minutes in a given time window, the LatencyP99ErrorBuildup is triggered. This is a multi-burn, multi-window error budget-based alert.
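Purely as an illustration of the bad-minutes idea (this is not the exact expression Asserts uses internally, and the 0.5-second threshold is an arbitrary placeholder), the fraction of bad minutes over the last hour could be expressed as:

# Illustrative only: fraction of minutes in the last hour where the recorded P99 exceeded 0.5s
sum_over_time((asserts:latency:p99 > bool 0.5)[1h:1m]) / 60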
Latency P99 across all requests of a Service
The Latency P99 for the entire service, regardless of different request contexts, can be recorded as follows:
- record: asserts:latency:service:p99
  expr: >
    histogram_quantile(
      0.99,
      sum(rate(http_server_requests_seconds_bucket[5m]) > 0)
        by (le, namespace, job, service, workload, asserts_env, asserts_site)
    )
  labels:
    asserts_entity_type: Service
    asserts_request_type: inbound
    asserts_source: spring_boot

This metric is useful while creating a Latency SLO for the entire service.
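As a usage sketch, a latency SLO or alert condition over this metric might compare it against a target threshold; the 0.5-second value below is an arbitrary placeholder, not a recommended target:

# Illustrative check: service-wide P99 latency above 0.5s for inbound requests
asserts:latency:service:p99{asserts_request_type="inbound"} > 0.5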



