Failure
Failure alerts indicate a deviation of the system’s configuration from its desired state. For example, when replicas are configured in a Redis database, there must be at least one master instance. When there are none, Redis is not operating as configured. These kinds of problems are reported as failure alerts.
Here’s an example of an alert rule to report this failure:
# Redis Master Missing
# Note this covers both cluster mode and HA mode, thus we are counting by redis_mode
- alert: RedisMissingMaster
  expr: |-
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    ) == 0
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure
Asserts Meta Label | Description |
---|---|
asserts_env | Used by Asserts to identify the environment. All discovered entities and observed metrics are automatically scoped to an environment. |
asserts_site | Used by Asserts to identify the region/site within an environment. For example, you could have a prod environment but multiple regions, such as us-east-1, us-west-2, etc. This label is used to capture the region information. Note that this depends on how environment information is encoded in the metrics. Sometimes, both the environment and the region information may be encoded in a single label value; in such cases, the asserts_env label will contain that value, and this label may not be present. |
asserts_entity_type | Used by Asserts to identify the level at which the metric is being observed. The workload, service, and job are special labels that Asserts uses to identify the Service. These labels are also used to discover the Service entity in the Asserts entity model. In this example, while aggregating, these labels are retained, so this metric will be observed for the corresponding Service entity. |
asserts_severity | Indicates the severity of the problem as either warning (yellow) or critical (red). |
asserts_alert_category | Asserts categorizes all alerts into the SAAFE model: Saturation, Amend (configuration changes to the system), Anomaly, Failure, and Error. In this example, the label asserts_alert_category is used to categorize this alert as a Failure. |
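
One caveat is worth checking before relying on the RedisMissingMaster expression as written. In standard PromQL, count over a selector that matches no series returns an empty result, not 0, so a comparison with == 0 may never become true once every master has disappeared. Whether Asserts evaluates or rewrites the rule differently is not stated here, so the following is only a hedged sketch of an alternative and not the shipped rule. It assumes that redis_instance_info is still reported by the remaining non-master instances; the alert name RedisMissingMasterSketch is hypothetical.

```yaml
# Hypothetical variant of the rule above: not the rule shipped with Asserts.
# Assumption: non-master Redis instances keep exporting redis_instance_info.
- alert: RedisMissingMasterSketch
  expr: |-
    # Left side: every group that has at least one Redis instance of any role.
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info
    )
    # "unless" drops each group that also appears on the right side,
    # leaving only the groups that have no master at all.
    unless
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    )
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure
```

Because both sides aggregate by the same labels, unless matches them group by group, so the result retains the job, service, and asserts_env labels described in the table. If no Redis instance of any role is reporting for a group, this sketch returns nothing for it, so a separate availability alert would still be needed for that case.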