Custom failure alert mapping
Failure alerts indicate that the system has entered an invalid, undesired, or inconsistent state. Unlike saturation or error alerts, which report operational symptoms, failure alerts describe incorrect configuration or topology, such as:
- Mismatched replica counts
- Incorrect leader/master assignment
- Missing nodes
- Resource configuration violations
- Broken invariants or cluster state inconsistencies
Failure alerts contribute directly to entity health scoring and appear in RCA workbench timelines.
When to create a failure alert
Create a failure alert when:
- Desired and actual state must match (for example, replicas, scaling targets, node roles)
- A known invariant is violated
- A configuration setting makes the system functionally incorrect
- A system component is missing or in the wrong state
- A resource is used incorrectly relative to its design (not merely exhausted)
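For example, an invariant such as "the cluster must have an elected leader" maps directly onto a failure alert. The following sketch assumes etcd is monitored and exposes its standard `etcd_server_has_leader` metric; the alert name and label values are illustrative.

```yaml
# Sketch: broken invariant - an etcd member reports that no leader is elected.
# Assumes the standard etcd metric etcd_server_has_leader; labels are illustrative.
alert: EtcdNoLeader
expr: |
  etcd_server_has_leader == 0
for: 1m
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
annotations:
  summary: 'etcd cluster has no leader'
  description: 'An etcd member reports that the cluster has no elected leader.'
```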
Required labels
A failure alert must include the following labels:
- `asserts_alert_category: failure`, which categorizes the alert as a failure alert
- `asserts_entity_type`, which names the type of entity the alert maps to (for example, `Service`)
- `asserts_severity`, which sets the severity, such as `warning` or `critical`
Recommended:
- `asserts_env` and `asserts_site`, which scope the alert to the correct environment and site
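Taken together, the labels section of a failure alert rule typically looks like the following sketch; the entity type, severity, and scoping values are placeholders to adjust for your own rule.

```yaml
labels:
  asserts_alert_category: failure   # categorizes the alert as a failure alert
  asserts_entity_type: Service      # entity type the alert maps to
  asserts_severity: critical        # warning or critical
  asserts_env: prod                 # recommended scoping label (illustrative value)
  asserts_site: us-east-1           # recommended scoping label (illustrative value)
```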
Best practices to write failure alerts
Use the following best practices to help you write custom failure alerts.
Compare desired vs actual state
Express the condition as a comparison of desired and actual state, for example `desired_replicas - actual_replicas > 0`.
Use for: to reduce flapping
Add a `for:` duration, for example `for: 2m`, so that transient mismatches do not fire immediately.
Preserve scoping labels
To aggregate correctly, failure alerts must retain entity-identifying labels such as `asserts_env`, `asserts_site`, `namespace`, and `service`.
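Putting these practices together, a generic replica-mismatch rule might look like the following sketch. The metric names come from kube-state-metrics (`kube_statefulset_replicas` and `kube_statefulset_status_replicas_ready`), and the grouping labels mirror the Redis example later in this section; the alert name and severity are illustrative.

```yaml
# Sketch: compares desired vs. actual state, uses for: to absorb brief rollouts,
# and preserves entity-identifying labels in the aggregation.
alert: StatefulSetReplicaMismatch
expr: |
  sum by (namespace, statefulset, asserts_env, asserts_site) (
    kube_statefulset_replicas
  )
  !=
  sum by (namespace, statefulset, asserts_env, asserts_site) (
    kube_statefulset_status_replicas_ready
  )
for: 2m
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
```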
Handle missing data explicitly
- Use `absent()` when metric disappearance is a failure
- Combine with `up{}` when metric disappearance should be ignored
- Avoid firing solely due to scrape failures
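As a sketch of the `absent()` plus `up{}` pattern, the following rule fires only when the expected master metric disappears while the Redis exporter itself is still being scraped, so a plain scrape outage does not trigger it. The job name is an assumption; adjust it to match your scrape configuration.

```yaml
# Sketch: metric disappearance is treated as a failure only while the exporter is up,
# so scrape failures alone do not fire this alert. The job label is illustrative.
alert: RedisMasterMetricMissing
expr: |
  absent(redis_instance_info{role="master"})
    and on() (sum(up{job="redis"}) > 0)
for: 2m
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
```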
Example: Redis master missing
```yaml
# Redis Master Missing
# Note this covers both cluster mode and HA mode, thus we are counting by redis_mode
- alert: RedisMissingMaster
  expr: |-
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    ) == 0
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure
```

Example: Replica mismatch
```yaml
alert: DeploymentReplicaMismatch
expr: |
  kube_deployment_spec_replicas{deployment="checkout"}
    != kube_deployment_status_replicas{deployment="checkout"}
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
annotations:
  summary: 'Replica count mismatch'
  description: 'The checkout deployment has mismatched desired/actual replicas.'
```

Example: Incorrect database connection configuration
```yaml
alert: PostgreSQLHighConnectionsConfigFailure
expr: |
  sum(pg_stat_activity_count{asserts_env!=""}) by (asserts_env, namespace, service)
  > (
      avg(pg_settings_max_connections{asserts_env!=""})
      - avg(pg_settings_superuser_reserved_connections{asserts_env!=""})
    ) * 0.7
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
annotations:
  summary: 'PostgreSQL configuration failure'
  description: 'Active connections are nearing max minus reserved admin slots.'
```

How failure alerts appear in the knowledge graph
When a failure alert fires:
- The affected entity shows a critical or degraded health state
- The alert appears in the RCA workbench timeline as a failure insight
- Clearing the condition returns the entity to a healthy state
Failure alerts combine with saturation, anomaly, and error insights to create a full picture of system behavior.
Next steps
- To learn how to create alerts, refer to Configure alert rules
- To learn how to import a YAML file for alert creation, refer to Import to Grafana-managed rules



