Custom failure alert mapping

Failure alerts indicate that the system has entered an invalid, undesired, or inconsistent state. Unlike saturation or error alerts, which report operational symptoms, failure alerts describe incorrect configuration or topology, such as:

  • Mismatched replica counts
  • Incorrect leader/master assignment
  • Missing nodes
  • Resource configuration violations
  • Broken invariants or cluster state inconsistencies

Failure alerts contribute directly to entity health scoring and appear in RCA workbench timelines.

When to create a failure alert

Create a failure alert when:

  • Desired and actual state must match (for example, replicas, scaling targets, node roles)
  • A known invariant is violated
  • A configuration setting makes the system functionally incorrect
  • A system component is missing or in the wrong state
  • A resource is used incorrectly relative to its design (not merely exhausted)

Required labels

A failure alert must include the following labels:

Label                          | Purpose
asserts_alert_category=failure | Identifies the alert as a system-state failure
asserts_entity_type            | Identifies the type of entity receiving the insight
asserts_severity               | Indicates the impact level (info, warning, critical)

Recommended:

Label        | Purpose
asserts_env  | Enables accurate entity resolution across environments
asserts_site | Identifies region or cluster alignment
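
Taken together, the labels block of a failure alert rule might look like the following sketch (the values are illustrative; use the entity type, severity, environment, and site that match your system):

YAML
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
  asserts_site: us-east-1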

Best practices to write failure alerts

Use the following best practices to help you write custom failure alerts.

Compare desired vs actual state

promql
desired_replicas - actual_replicas > 0

Use a for: clause to reduce flapping

YAML
for: 2m

Preserve scoping labels when aggregating

Failure alerts must retain entity-identifying labels, such as namespace, service, asserts_env, and asserts_site, even when the expression aggregates across series.
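
For example, a minimal sketch that aggregates a replica-mismatch check while keeping the labels that identify the entity (the kube-state-metrics metric names follow the replica mismatch example later in this guide, and it assumes asserts_env and asserts_site are already present on the series):

promql
# Aggregate, but keep the labels that identify the affected entity
sum by (namespace, deployment, asserts_env, asserts_site) (
  kube_deployment_spec_replicas - kube_deployment_status_replicas
) > 0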

Handle missing data explicitly

  • Use absent() when metric disappearance is a failure
  • Combine with up{} when metric disappearance should be ignored
  • Avoid firing solely due to scrape failures (see the sketch after this list)
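
A minimal sketch combining these points, assuming the redis_instance_info metric from the example below and a job label of redis (adjust both to your setup): it fires when the master series disappears, but only while at least one Redis target is still being scraped.

promql
# Fire when there is no master series at all ...
absent(redis_instance_info{role="master"})
# ... but only while at least one Redis target is still up,
# so a scrape outage alone does not trigger the alert.
and on() (count(up{job="redis"} == 1) > 0)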

Example: Redis master missing

YAML
# Redis Master Missing
# This covers both cluster mode and HA mode, which is why redis_mode is part of the grouping.
# A count(...) == 0 comparison never matches an empty vector, so use `unless` to find
# groups that have Redis instances but no master.
- alert: RedisMissingMaster
  expr: |-
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info
    )
    unless
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    )
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure

Example: Replica mismatch

YAML
alert: DeploymentReplicaMismatch
expr: |
  kube_deployment_spec_replicas{deployment="checkout"} 
    != kube_deployment_status_replicas{deployment="checkout"}
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
annotations:
  summary: 'Replica count mismatch'
  description: 'The checkout deployment has mismatched desired/actual replicas.'

Example: Incorrect database connection configuration

YAML
alert: PostgreSQLHighConnectionsConfigFailure
# Aggregate both sides by the same labels so the > comparison matches one-to-one
expr: |
  sum(pg_stat_activity_count{asserts_env!=""}) by (asserts_env, namespace, service)
    > (
        avg(pg_settings_max_connections{asserts_env!=""}) by (asserts_env, namespace, service)
        - avg(pg_settings_superuser_reserved_connections{asserts_env!=""}) by (asserts_env, namespace, service)
      ) * 0.7
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
annotations:
  summary: 'PostgreSQL configuration failure'
  description: 'Active connections are nearing max minus reserved admin slots.'

How failure alerts appear in the knowledge graph

When a failure alert fires:

  • The affected entity shows a critical or degraded health state
  • The alert appears in the RCA workbench timeline as a failure insight
  • Clearing the condition returns the entity to a healthy state

Failure alerts combine with saturation, anomaly, and error insights to create a full picture of system behavior.

Next steps