Request, error, and duration (RED)

Request Rate

In Asserts, the asserts:request:total metric records the total count of requests.

In Spring Boot, metrics for incoming requests are available through the http_server_requests_seconds histogram, and metrics for outgoing calls through the http_client_requests_seconds histogram. These metrics are mapped to asserts:request:total for incoming and outgoing requests.

# Incoming requests
- record: asserts:request:total
  expr: |-
    label_replace(http_server_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_request: total
    asserts_request_type: inbound

# Outgoing requests made through Spring classes like RestTemplate
- record: asserts:request:total
  expr: |-
    label_replace(http_client_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_request: total
    asserts_request_type: outbound
Asserts meta labels:

  • asserts_source: Used by Asserts to identify which framework/instrumentation captured the metric.
  • asserts_metric_request: Used by Asserts to identify this as a request metric. Valid values are total when the source metric is a counter and gauge when the source metric is a gauge.
  • asserts_request_type: Used by Asserts to categorize requests into different kinds. By default, for all supported HTTP-based frameworks, Asserts categorizes requests as inbound for incoming requests and outbound for outgoing HTTP calls. These can also be arbitrary names used to group APIs, for example, timer_task or query.
  • asserts_request_context: Used by Asserts to identify a unique request. For HTTP requests, whether inbound or outbound, this typically maps to the relative part of the request URI with high-cardinality parameters stripped off, for example, /track/order/{} where {} is a placeholder for an order ID. Frameworks like Spring Boot expose this path in a label such as uri, and the label_replace function maps uri to asserts_request_context.
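
As an illustration (hypothetical series, with most labels elided), the inbound rule above rewrites a source series as follows; label_replace copies the uri value into asserts_request_context and keeps the original uri label:

```text
# Source series exposed by the Spring Boot actuator:
http_server_requests_seconds_count{uri="/track/order/{orderId}", status="200"}  1024

# Series recorded by the rule:
asserts:request:total{asserts_request_context="/track/order/{orderId}",
                      uri="/track/order/{orderId}", status="200",
                      asserts_source="spring_boot", asserts_metric_request="total",
                      asserts_request_type="inbound"}  1024
```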

Once these rules are added, the following will happen:

  • The Request Rate is computed and shown in the Service KPI Dashboard.
  • The Request Rate is observed for anomalies, and the RequestRateAnomaly is triggered when there are anomalies.

Note

In the previous example, the source metric is available as a counter, so it was mapped to asserts:request:total. If the source metric were a gauge, it should instead be mapped to asserts:request:gauge with asserts_metric_request: gauge.
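
A sketch of the gauge case (the metric and label names below are hypothetical):

```yaml
# Hypothetical gauge source metric mapped to asserts:request:gauge
- record: asserts:request:gauge
  expr: |
    label_replace(my_app_requests_in_flight, "asserts_request_context", "$1", "endpoint", "(.+)")
  labels:
    asserts_source: my_app
    asserts_metric_request: gauge
    asserts_request_type: inbound
```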

Error Ratio

In Asserts, the asserts:error:total metric records the total count of errors, broken down by error type. Add the following rules for Spring Boot inbound and outbound requests:

# Inbound request errors
- record: asserts:client:error:total
  expr: |
    label_replace(http_server_requests_seconds_count{status=~"4.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: client_total
    asserts_request_type: inbound

- record: asserts:error:total
  expr: |
    label_replace(http_server_requests_seconds_count{status=~"5.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: total
    asserts_request_type: inbound
    asserts_error_type: server_errors

# Outbound request errors
- record: asserts:error:total
  expr: |
    label_replace(http_client_requests_seconds_count{status=~"4.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: total
    asserts_request_type: outbound
    asserts_error_type: client_errors

- record: asserts:error:total
  expr: |
    label_replace(http_client_requests_seconds_count{status=~"5.."}, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_error: total
    asserts_request_type: outbound
    asserts_error_type: server_errors
Asserts meta labels:

  • asserts_metric_error: Used by Asserts to identify this as an error metric of type counter. Valid values are total, gauge, client_total, and client_gauge.
  • asserts_error_type: Used by Asserts to categorize errors into different kinds. Commonly useful types are server_errors and client_errors. In this example, a condition on the status label defines these types. Note that client errors on inbound calls are mapped using the special type client_total, because inbound client errors tend to be noisy. Asserts still observes them, but the captured signals surface only when anomalies occur: a steady stream of client errors generates no signal, while a sudden change in their rate generates an anomaly signal.

Once these rules are added, the following will happen:

  • The Error Ratio is computed for all request contexts and shown in the Service KPI Dashboard. The ratio is computed as sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context, asserts_error_type)(rate(asserts:error:total[5m])) / ignoring(asserts_error_type) group_left sum by(asserts_env, asserts_site, namespace, workload, service, job, asserts_request_type, asserts_request_context)(rate(asserts:request:total[5m])). Note that the labels used in the aggregation of the asserts:request:total and asserts:error:total metrics must match for the ratio to be recorded.
  • The ErrorRatioBreach is triggered if the ratio breaches a certain threshold.
  • The ErrorBuildup (multi-burn, multi-window) is triggered if the error budget is breached.
  • The Error Ratio is observed for anomalies, and ErrorRatioAnomaly is triggered when there are anomalies.

Note

In the previous example, the source metric is available as a counter, so it was mapped to asserts:error:total. If the source metric were a gauge, it should be mapped to asserts:error:gauge with asserts_metric_error: gauge, or asserts_metric_error: client_gauge in the case of inbound client errors.
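
A sketch of the gauge case (the metric and label names below are hypothetical):

```yaml
# Hypothetical gauge source metric mapped to asserts:error:gauge
- record: asserts:error:gauge
  expr: |
    label_replace(my_app_failed_requests, "asserts_request_context", "$1", "endpoint", "(.+)")
  labels:
    asserts_source: my_app
    asserts_metric_error: gauge
    asserts_request_type: inbound
    asserts_error_type: server_errors
```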

Latency Average

Asserts computes the latency average using the following metrics:

  • asserts:latency:total - the total latency time in seconds
  • asserts:latency:count - the total number of requests
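
Conceptually (a sketch, not necessarily the exact expression Asserts evaluates internally), the average follows from the ratio of the two rates:

```promql
# Average latency over 5 minutes: time spent / number of requests
rate(asserts:latency:total[5m]) / rate(asserts:latency:count[5m])
```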

Add recording rules that derive these two metrics from the respective histogram's _sum and _count series.

# Inbound Latency
- record: asserts:latency:total
  expr: |
    label_replace(http_server_requests_seconds_sum, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: seconds_sum
    asserts_request_type: inbound

- record: asserts:latency:count
  expr: |
    label_replace(http_server_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: count
    asserts_request_type: inbound

# Outbound latency
- record: asserts:latency:total
  expr: |
    label_replace(http_client_requests_seconds_sum, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: seconds_sum
    asserts_request_type: outbound

- record: asserts:latency:count
  expr: |
    label_replace(http_client_requests_seconds_count, "asserts_request_context", "$1", "uri", "(.+)")
  labels:
    asserts_source: spring_boot
    asserts_metric_latency: count
    asserts_request_type: outbound
Asserts Meta LabelDescription
asserts_metric_latencyUsed by Asserts to identify the numerator and denominator to compute the latency average along with the unit of the source latency metric. Valid values for latency (the numerator) are seconds_sum, milliseconds_sum, and microseconds_sum. For the latency count (the denominator), the valid value is count.
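
For example, if a framework exposed its latency histogram in milliseconds (the metric and label names below are hypothetical), the numerator rule would set asserts_metric_latency: milliseconds_sum instead:

```yaml
# Hypothetical source histogram whose _sum is in milliseconds
- record: asserts:latency:total
  expr: |
    label_replace(my_framework_request_duration_milliseconds_sum, "asserts_request_context", "$1", "path", "(.+)")
  labels:
    asserts_source: my_framework
    asserts_metric_latency: milliseconds_sum
    asserts_request_type: inbound
```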

Once these rules are added, the following will happen:

  • Latency Average is computed for all requests and shown in the Service KPI Dashboards.
  • The Latency Average is observed for anomalies, and LatencyAverageAnomaly is triggered when there are anomalies.

Note

In the previous example, the source metric is available as a counter, therefore, it was mapped to asserts:latency:total and asserts:latency:count. If the source metric was a gauge, then it should be directly mapped to asserts:latency:average. While doing this, be mindful of the labels in the source metric. When the source is a counter, Asserts does some aggregation internally, and only the key labels are retained, which reduces the cardinality in the metrics it records. In the direct mapping, this is not the case.
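
A sketch of the direct mapping (the gauge metric name below is hypothetical; note that with no internal aggregation, every label on the source series survives into the recorded series):

```yaml
# Hypothetical gauge that already reports average latency in seconds
- record: asserts:latency:average
  expr: |
    label_replace(my_app_request_latency_avg_seconds, "asserts_request_context", "$1", "endpoint", "(.+)")
  labels:
    asserts_source: my_app
    asserts_request_type: inbound
```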

Latency P99

Similarly, we can record the latency P99 for requests as follows:

# Inbound requests latency P99
- record: asserts:latency:p99
  expr: >
    label_replace(
      histogram_quantile(
        0.99,
        sum(rate(http_server_requests_seconds_bucket[5m]) > 0) by (le, namespace, job, service, workload, uri, asserts_env, asserts_site)
      )
      , "asserts_request_context", "$1", "uri", "(.+)"
    )
  labels:
    asserts_source: spring_boot
    asserts_entity_type: Service
    asserts_request_type: inbound

# Outbound requests latency P99
- record: asserts:latency:p99
  expr: >
    label_replace(
      histogram_quantile(
        0.99,
        sum(rate(http_client_requests_seconds_bucket[5m]) > 0) by (le, namespace, job, service, workload, uri, asserts_env, asserts_site)
      )
      , "asserts_request_context", "$1", "uri", "(.+)"
    )
  labels:
    asserts_source: spring_boot
    asserts_entity_type: Service
    asserts_request_type: outbound
Asserts meta labels:

  • asserts_env: Used by Asserts to identify the environment. All discovered entities and observed metrics are automatically scoped to an environment.
  • asserts_site: Used by Asserts to identify the region/site within an environment. For example, you could have a prod environment spanning multiple regions, such as us-east-1 and us-west-2; this label captures the region information. Note that this depends on how environment information is encoded in the metrics. Sometimes both the environment and the region are encoded in a single label value; in such cases, the asserts_env label contains that value, and this label may not be present.
  • asserts_entity_type: Used by Asserts to identify the level at which the metric is being observed. The workload, service, and job labels are special labels that Asserts uses to identify the Service; they are also used to discover the Service entity in the Asserts entity model. In this example, these labels are retained while aggregating, so this metric is observed for the corresponding Service entity.

After this rule is added, Asserts shows this metric in the Service KPI Dashboard and begins observing the clock minutes in which the Latency P99 exceeds a threshold. These minutes are tracked through a total bad-minutes counter. Based on the ratio of bad minutes to total minutes in a given time window, the LatencyP99ErrorBuildup is triggered. This is a multi-burn, multi-window error budget-based alert.
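
A conceptual sketch of the bad-minutes idea (not Asserts' actual implementation; the 1-second threshold is arbitrary):

```promql
# Number of minutes in the last hour in which P99 latency exceeded 1s,
# using a subquery evaluated at 1-minute resolution
sum_over_time((asserts:latency:p99 > bool 1)[1h:1m])
```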

Latency P99 across all requests of a Service

The Latency P99 for the entire service, regardless of request context, can be recorded as follows:

- record: asserts:latency:service:p99
  expr: >
    histogram_quantile(
      0.99,
      sum(rate(http_server_requests_seconds_bucket[5m]) > 0)
        by (le, namespace, job, service, workload, asserts_env, asserts_site)
    )
  labels:
    asserts_entity_type: Service
    asserts_request_type: inbound
    asserts_source: spring_boot

This metric is useful when creating a Latency SLO for the entire service.