---
title: "Custom failure alert mapping | Grafana Cloud documentation"
description: "Create and map failure alerts to detect invalid or undesired system states using the knowledge graph"
---

# Custom failure alert mapping

Failure alerts indicate that the system has entered an invalid, undesired, or inconsistent state. Unlike saturation or error alerts, which report operational symptoms, failure alerts describe incorrect configuration or topology, such as:

- Mismatched replica counts
- Incorrect leader/master assignment
- Missing nodes
- Resource configuration violations
- Broken invariants or cluster state inconsistencies

Failure alerts contribute directly to entity health scoring and appear in the RCA workbench timeline.

## When to create a failure alert

Create a failure alert when:

- Desired and actual state must match (for example, replicas, scaling targets, node roles)
- A known invariant is violated
- A configuration setting makes the system functionally incorrect
- A system component is missing or in the wrong state
- A resource is used incorrectly relative to its design (not merely exhausted)

## Required labels

A failure alert must include the following labels:


| Label                            | Purpose                                              |
|----------------------------------|------------------------------------------------------|
| `asserts_alert_category=failure` | Identifies the alert as a system-state failure       |
| `asserts_entity_type`            | Identifies the type of entity receiving the insight  |
| `asserts_severity`               | Indicates the impact level (info, warning, critical) |

Recommended:


| Label          | Purpose                                                |
|----------------|--------------------------------------------------------|
| `asserts_env`  | Enables accurate entity resolution across environments |
| `asserts_site` | Identifies region or cluster alignment                 |
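
Together, these labels appear in a rule's `labels` block. The following sketch is illustrative only; the entity type, environment, and site values depend on your setup:

```yaml
labels:
  asserts_alert_category: failure   # required: marks this as a failure alert
  asserts_entity_type: Service      # required: type of entity receiving the insight
  asserts_severity: warning         # required: info, warning, or critical
  asserts_env: prod                 # recommended: environment for entity resolution
  asserts_site: us-east-1           # recommended: region or cluster (illustrative value)
```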

## Best practices for writing failure alerts

Use the following best practices to help you write custom failure alerts.

### Compare desired vs actual state

Alert when desired and actual state diverge, for example:

```promql
desired_replicas - actual_replicas > 0
```

### Use `for:` to reduce flapping

Require the condition to hold for a sustained period before firing:

```yaml
for: 2m
```
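
In context, `for:` delays firing until the expression has been true continuously for the whole window, which filters out transient mismatches. A sketch (the rule name and expression are illustrative):

```yaml
- alert: ReplicaMismatch
  expr: desired_replicas - actual_replicas > 0
  for: 2m   # the mismatch must persist for 2 minutes before the alert fires
  labels:
    asserts_alert_category: failure
    asserts_entity_type: Service
    asserts_severity: warning
```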

### Preserve scoping labels when aggregating

Failure alerts must retain entity-identifying labels, such as `service`, `namespace`, `asserts_env`, and `asserts_site`, so that each alert can be resolved to the correct entity in the knowledge graph.
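
For example, when counting series, aggregate with `by (...)` rather than collapsing all labels (the metric name is illustrative):

```promql
# Drops entity-identifying labels; the alert cannot be mapped to an entity:
count(redis_instance_info{role="master"}) == 0

# Keeps scoping labels, so the insight resolves to the correct Service:
count by (job, service, namespace, asserts_env, asserts_site) (
  redis_instance_info{role="master"}
) == 0
```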

### Handle missing data explicitly

- Use `absent()` when metric disappearance is a failure
- Combine with `up{}` when metric disappearance should be ignored
- Avoid firing solely due to scrape failures
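
For example (metric and job names are illustrative, not prescriptive):

```promql
# Fire when the metric has disappeared entirely, because its absence
# is itself the failure:
absent(kube_deployment_spec_replicas{deployment="checkout"})

# Suppress a mismatch alert while the exporter target is down, so a
# scrape failure alone does not fire:
(
  kube_deployment_spec_replicas{deployment="checkout"}
    != kube_deployment_status_replicas{deployment="checkout"}
)
and on () up{job="kube-state-metrics"} == 1
```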

### Example: Redis master missing


```yaml
# Redis master missing
# Note: this covers both cluster mode and HA mode, so we count by redis_mode
- alert: RedisMissingMaster
  expr: |-
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    ) == 0
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure
```

### Example: Replica mismatch


```yaml
alert: DeploymentReplicaMismatch
expr: |
  kube_deployment_spec_replicas{deployment="checkout"} 
    != kube_deployment_status_replicas{deployment="checkout"}
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
annotations:
  summary: 'Replica count mismatch'
  description: 'The checkout deployment has mismatched desired/actual replicas.'
```

### Example: Incorrect database connection configuration


```yaml
alert: PostgreSQLHighConnectionsConfigFailure
expr: |
  sum(pg_stat_activity_count{asserts_env!=""}) by (asserts_env, namespace, service)
    > (
        avg(pg_settings_max_connections{asserts_env!=""})
        - avg(pg_settings_superuser_reserved_connections{asserts_env!=""})
      ) * 0.7
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
annotations:
  summary: 'PostgreSQL configuration failure'
  description: 'Active connections are nearing max minus reserved admin slots.'
```

## How failure alerts appear in the knowledge graph

When a failure alert fires:

- The affected entity shows a critical or degraded health state
- The alert appears in the RCA workbench timeline as a failure insight
- Clearing the condition returns the entity to a healthy state

Failure alerts combine with saturation, anomaly, and error insights to create a full picture of system behavior.

## Next steps

- To learn how to create alerts, refer to [Configure alert rules](/docs/grafana/latest/alerting/alerting-rules/)
- To learn how to import a YAML file for alert creation, refer to [Import to Grafana-managed rules](/docs/grafana/latest/alerting/alerting-rules/alerting-migration/)
