Custom failure alert mapping

Failure alerts indicate that the system has entered an invalid, undesired, or inconsistent state. Unlike saturation or error alerts, which report operational symptoms, failure alerts describe incorrect configuration or topology, such as:

  • Mismatched replica counts
  • Incorrect leader/master assignment
  • Missing nodes
  • Resource configuration violations
  • Broken invariants or cluster state inconsistencies

Failure alerts contribute directly to entity health scoring and appear in RCA workbench timelines.

When to create a failure alert

Create a failure alert when:

  • Desired and actual state must match (for example, replicas, scaling targets, node roles)
  • A known invariant is violated
  • A configuration setting makes the system functionally incorrect
  • A system component is missing or in the wrong state
  • A resource is used incorrectly relative to its design (not merely exhausted)

Required labels

A failure alert must include the following labels:

Label                          | Purpose
asserts_alert_category=failure | Identifies the alert as a system-state failure
asserts_entity_type            | Identifies the type of entity receiving the insight
asserts_severity               | Indicates the impact level (info, warning, critical)

Recommended:

Label        | Purpose
asserts_env  | Enables accurate entity resolution across environments
asserts_site | Identifies region or cluster alignment
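
Taken together, the labels block of a failure alert rule might look like the following sketch (the values are illustrative; use the entity type, severity, environment, and site that match your system):

YAML
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
  asserts_site: us-east-1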

Best practices to write failure alerts

Use the following best practices to help you write custom failure alerts.

Compare desired vs actual state

promql
desired_replicas - actual_replicas > 0

Use a for: clause to reduce flapping

YAML
for: 2m

Preserve scoping labels when aggregating

Failure alerts must retain entity-identifying labels, such as namespace, service, asserts_env, and asserts_site, even when the expression aggregates across series.
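
For example, a minimal sketch that aggregates a replica-mismatch check while keeping the labels that identify the entity (the kube-state-metrics metric names follow the replica mismatch example later in this guide, and it assumes asserts_env and asserts_site are already present on the series):

promql
# Aggregate, but keep the labels that identify the affected entity
sum by (namespace, deployment, asserts_env, asserts_site) (
  kube_deployment_spec_replicas - kube_deployment_status_replicas
) > 0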

Handle missing data explicitly

  • Use absent() when metric disappearance is a failure
  • Combine with up{} when metric disappearance should be ignored
  • Avoid firing solely due to scrape failures (see the sketch after this list)
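
A minimal sketch combining these points, assuming the redis_instance_info metric from the example below and a job label of redis (adjust both to your setup): it fires when the master series disappears, but only while at least one Redis target is still being scraped.

promql
# Fire when there is no master series at all ...
absent(redis_instance_info{role="master"})
# ... but only while at least one Redis target is still up,
# so a scrape outage alone does not trigger the alert.
and on() (count(up{job="redis"} == 1) > 0)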

Example: Redis master missing

YAML
# Redis Master Missing
# This covers both cluster mode and HA mode, which is why redis_mode is part of the grouping.
# A count(...) == 0 comparison never matches an empty vector, so use `unless` to find
# groups that have Redis instances but no master.
- alert: RedisMissingMaster
  expr: |-
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info
    )
    unless
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    )
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure

Example: Replica mismatch

YAML
alert: DeploymentReplicaMismatch
expr: |
  kube_deployment_spec_replicas{deployment="checkout"} 
    != kube_deployment_status_replicas{deployment="checkout"}
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
annotations:
  summary: 'Replica count mismatch'
  description: 'The checkout deployment has mismatched desired/actual replicas.'

Example: Incorrect database connection configuration

YAML
alert: PostgreSQLHighConnectionsConfigFailure
# Aggregate both sides by the same labels so the > comparison matches one-to-one
expr: |
  sum(pg_stat_activity_count{asserts_env!=""}) by (asserts_env, namespace, service)
    > (
        avg(pg_settings_max_connections{asserts_env!=""}) by (asserts_env, namespace, service)
        - avg(pg_settings_superuser_reserved_connections{asserts_env!=""}) by (asserts_env, namespace, service)
      ) * 0.7
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
annotations:
  summary: 'PostgreSQL configuration failure'
  description: 'Active connections are nearing max minus reserved admin slots.'

How failure alerts appear in the knowledge graph

When a failure alert fires:

  • The affected entity shows a critical or degraded health state
  • The alert appears in the RCA workbench timeline as a failure insight
  • Clearing the condition returns the entity to a healthy state

Failure alerts combine with saturation, anomaly, and error insights to create a full picture of system behavior.

Next steps