---
title: "Configure high availability | Grafana documentation"
description: "Configure High Availability"
---

# Configure high availability

Grafana Alerting uses the Prometheus model of separating the evaluation of alert rules from the delivery of notifications. In this model, alert rules are evaluated by the alert generator and notifications are delivered by the alert receiver. In Grafana Alerting, the alert generator is the Scheduler and the receiver is the Alertmanager.

![High availability](/static/img/docs/alerting/unified/high-availability-ua.png)

When running multiple instances of Grafana, all alert rules are evaluated on all instances by default. You can think of the evaluation of alert rules as being duplicated by the number of running Grafana instances. This is how Grafana Alerting ensures that as long as at least one Grafana instance is working, alert rules are still evaluated and notifications for alerts are still sent.

If you want to reduce this duplication, you can enable [single-node evaluation mode](#single-node-evaluation-mode) so that only one instance evaluates alert rules.

This duplication is visible in state history, which provides a good way to [verify your high availability setup](#verify-your-high-availability-setup).

While the alert generator evaluates all alert rules on all instances, the alert receiver makes a best-effort attempt to avoid duplicate notifications. The Alertmanagers use a gossip protocol to share state with each other and prevent duplicate notifications from being sent.

Alertmanager chooses availability over consistency, which may result in occasional duplicated or out-of-order notifications. It takes the opinion that duplicate or out-of-order notifications are better than no notifications.

Alertmanagers also gossip silences, which means a silence created on one Grafana instance is replicated to all other Grafana instances. Both notifications and silences are persisted to the database periodically, and during a graceful shutdown.

## Enable alerting high availability using Memberlist

**Before you begin**

Since gossiping of notifications and silences uses both TCP and UDP port `9094`, ensure that each Grafana instance is able to accept incoming connections on these ports.

**To enable high availability support:**

1. In your custom configuration file (`$WORKING_DIR/conf/custom.ini`), go to the `[unified_alerting]` section.
2. Set `ha_peers` to the addresses of the Grafana instances in the cluster (using the format `host:port`), for example, `ha_peers=10.0.0.5:9094,10.0.0.6:9094,10.0.0.7:9094`. You must add at least one (1) Grafana instance to `ha_peers`.
3. Set `ha_listen_address` to the instance IP address using the format `host:port` (or the [Pod's](https://kubernetes.io/docs/concepts/workloads/pods/) IP when using Kubernetes). By default, it is set to listen on all interfaces (`0.0.0.0`).
4. Optional: Set `ha_advertise_address` to the instance's hostname or IP address in the format `host:port`. Use this setting when the instance is behind NAT (Network Address Translation), such as Docker Swarm or a Kubernetes service, where external and internal addresses differ. This address helps other cluster instances communicate with it.
5. Set `ha_peer_timeout` to specify the time to wait for an instance to send a notification via the Alertmanager. The default value is `15s`, but you may need to increase it if Grafana servers are located in different geographic regions or if network latency between them is high.
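
Putting the steps above together, a minimal `custom.ini` for a three-instance cluster could look like the following sketch. The IP addresses and the advertise address are placeholders; substitute your own instance addresses.

```ini
[unified_alerting]
enabled = true
# Addresses of all Grafana instances in the cluster (Memberlist peers).
ha_peers = "10.0.0.5:9094,10.0.0.6:9094,10.0.0.7:9094"
# Listen on all interfaces on the gossip port.
ha_listen_address = "0.0.0.0:9094"
# Optional: the address other instances use to reach this one (needed behind NAT).
ha_advertise_address = "10.0.0.5:9094"
# Time to wait for a peer before taking over notification delivery.
ha_peer_timeout = "15s"
```

Each instance uses the same `ha_peers` list; only `ha_advertise_address` differs per instance.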

For a demo, see this [example using Docker Compose](https://github.com/grafana/alerting-ha-docker-examples/tree/main/memberlist).

## Enable alerting high availability using Redis

As an alternative to Memberlist, you can configure Redis to enable high availability. Redis standalone, Redis Cluster, and Redis Sentinel modes are supported.

> Note
> 
> Memberlist is the preferred option for high availability. Use Redis only in environments where direct communication between Grafana servers is not possible, such as when TCP or UDP ports are blocked.

01. Make sure you have a Redis server that supports pub/sub. If you use a proxy in front of your Redis cluster, make sure the proxy supports pub/sub.
02. In your custom configuration file (`$WORKING_DIR/conf/custom.ini`), go to the `[unified_alerting]` section.
03. Set `ha_redis_address` to the Redis server address or addresses Grafana should connect to. It can be a single Redis address if using Redis standalone, or a list of comma-separated addresses if using Redis Cluster or Sentinel.
04. Optional: Set `ha_redis_cluster_mode_enabled` to `true` if you are using Redis Cluster.
05. Optional: Set `ha_redis_sentinel_mode_enabled` to `true` if you are using Redis Sentinel. Also set `ha_redis_sentinel_master_name` to the Redis Sentinel master name.
06. Optional: Set the username and password if authentication is enabled on the Redis server using `ha_redis_username` and `ha_redis_password`.
07. Optional: Set the username and password if authentication is enabled on Redis Sentinel using `ha_redis_sentinel_username` and `ha_redis_sentinel_password`.
08. Optional: Set `ha_redis_prefix` to something unique if you plan to share the Redis server with multiple Grafana instances.
09. Optional: Set `ha_redis_tls_enabled` to `true` and configure the corresponding `ha_redis_tls_*` fields to secure communications between Grafana and Redis with Transport Layer Security (TLS).
10. Set `ha_advertise_address`, for example, `ha_advertise_address = "${POD_IP}:9094"`. This is required if the instance doesn't have an IP address that is part of RFC 6890 with a default route.
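
Combining the steps above, a Redis standalone configuration could look like the following sketch. The address, credentials, and prefix are placeholder values:

```ini
[unified_alerting]
enabled = true
# Address of the Redis server used for HA coordination.
ha_redis_address = "redis:6379"
# Optional: credentials, if authentication is enabled on Redis.
ha_redis_username = "grafana"
ha_redis_password = "changeme"
# Optional: a unique prefix when sharing the Redis server with other Grafana clusters.
ha_redis_prefix = "grafana-prod"
# Required if the instance doesn't have a routable RFC 6890 address.
ha_advertise_address = "${POD_IP}:9094"
```

For Redis Cluster or Redis Sentinel, additionally set `ha_redis_cluster_mode_enabled` or `ha_redis_sentinel_mode_enabled`, as described in the steps above.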

For a demo, see this [example using Docker Compose](https://github.com/grafana/alerting-ha-docker-examples/tree/main/redis).

## Enable alerting high availability using Kubernetes

1. Expose the Pod IP [through an environment variable](https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/) in the container definition:
   
   ```yaml
   env:
     - name: POD_IP
       valueFrom:
         fieldRef:
           fieldPath: status.podIP
   ```
2. Add the port 9094 to the Grafana deployment:
   
   ```yaml
   ports:
     - name: grafana
       containerPort: 3000
       protocol: TCP
     - name: gossip-tcp
       containerPort: 9094
       protocol: TCP
     - name: gossip-udp
       containerPort: 9094
       protocol: UDP
   ```
4. Create a headless service that returns the Pod IP instead of the service IP, which is what `ha_peers` needs:
   
   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: grafana-alerting
     namespace: grafana
     labels:
       app.kubernetes.io/name: grafana-alerting
       app.kubernetes.io/part-of: grafana
   spec:
     type: ClusterIP
     clusterIP: 'None'
     ports:
       - port: 9094
     selector:
       app: grafana
   ```
5. Make sure your Grafana deployment has a label matching the selector, for example, `app: grafana`.
6. Add the following to `grafana.ini`:
   
   ```ini
   [unified_alerting]
   enabled = true
   ha_listen_address = "${POD_IP}:9094"
   ha_peers = "grafana-alerting.grafana:9094"
   ha_advertise_address = "${POD_IP}:9094"
   ha_peer_timeout = 15s
   ha_reconnect_timeout = 2m
   ```
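
The headless service selects Pods by label, so the Deployment's Pod template must carry a matching label. A minimal sketch of the relevant part of the Deployment (names match the service example above; adjust them to your setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: grafana
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        # Must match the selector in the grafana-alerting headless service.
        app: grafana
```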

## Single-node evaluation mode

> Note
> 
> Single-node evaluation mode is currently in [public preview](/docs/release-life-cycle/). Grafana Labs offers limited support, and breaking changes might occur prior to the feature being made generally available.

By default, all Grafana instances in a high-availability cluster evaluate all alert rules. This means query load on data sources is multiplied by the number of Grafana instances. Single-node evaluation mode changes this so that only one instance evaluates alert rules, reducing query load from N times to 1.

**To enable single-node evaluation mode**, add the following to your `[unified_alerting]` section:

```ini
[unified_alerting]
ha_single_node_evaluation = true
```

This setting requires high availability clustering to be configured (either Memberlist or Redis).

### How it works

The Grafana cluster automatically chooses a primary instance that is responsible for evaluating all alert rules. Other instances skip evaluation entirely.

- **Alert broadcasting:** The primary instance broadcasts fired alerts to all other instances through the cluster communication channel. This ensures that every instance’s embedded Alertmanager has the current alerts, which is needed for failure recovery and for the Alertmanager API to return correct data on all instances.
- **Automatic failure recovery:** When the primary instance becomes unavailable, the cluster reassigns positions and a new instance becomes responsible for evaluation. During failure recovery, there is a brief gap in evaluations. Existing alert states remain in the database.

### Tradeoffs

| Default HA (all instances evaluate)         | Single-node evaluation mode                  |
|---------------------------------------------|----------------------------------------------|
| Redundant evaluation on all nodes           | Only one node evaluates                      |
| Higher query load on data sources (N times) | Reduced query load (1 time)                  |
| No evaluation gap on instance failure       | Brief evaluation gap during failure recovery |

### Monitor single-node evaluation mode

You can verify that single-node evaluation mode is working correctly by monitoring the following metrics.

| Metric                                                 | Description                                                                                                                          |
|--------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| `grafana_alertmanager_peer_position`                   | The position of each instance in the cluster. The instance at position 0 is the primary and evaluates all alert rules.               |
| `grafana_alerting_alerts_received_total`               | Total number of alerts received by each instance. Non-primary instances should receive alerts through the cluster broadcast channel. |
| `grafana_alerting_alertmanager_alerts{state="active"}` | Number of active alerts on each instance. This value should be the same across all instances.                                        |
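
As a sketch, a Prometheus alerting rule (the rule and label names here are assumptions) could use `grafana_alertmanager_peer_position` to warn when no instance currently holds position 0:

```yaml
groups:
  - name: grafana-ha
    rules:
      - alert: GrafanaAlertingNoPrimary
        # Fires if no Grafana instance reports peer position 0 (the primary).
        expr: absent(grafana_alertmanager_peer_position == 0)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: No Grafana instance is acting as the alert evaluation primary.
```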

The following metrics are specific to the HA backend you are using:

**Memberlist (gossip)**

| Metric                                                                                  | Description                                                                                                |
|-----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| `grafana_alertmanager_oversized_gossip_message_dropped_total{key="alerts:broadcast"}`   | Number of broadcast messages dropped due to a full message queue. A non-zero value indicates message loss. |
| `grafana_alertmanager_oversized_gossip_message_failure_total{key="alerts:broadcast"}`   | Number of broadcast messages that failed to send to a peer.                                                |
| `grafana_alertmanager_oversized_gossip_message_sent_total{key="alerts:broadcast"}`      | Number of broadcast messages sent to peers.                                                                |
| `grafana_alertmanager_oversize_gossip_message_duration_seconds{key="alerts:broadcast"}` | Duration of broadcast message sends. Useful for detecting network latency between peers.                   |

**Redis**

| Metric                                                                                                     | Description                                                                                                                                                                                                                        |
|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `grafana_alertmanager_cluster_messages_publish_failures_total{msg_type="update",reason="buffer_overflow"}` | Number of state sync messages dropped due to a full message queue. A non-zero value indicates message loss. These metrics are shared across all HA state channels (alerts, silences, notification log), not only alert broadcasts. |
| `grafana_alertmanager_cluster_messages_publish_failures_total{msg_type="update",reason="redis_issue"}`     | Number of state sync messages that failed due to a Redis error.                                                                                                                                                                    |
| `grafana_alertmanager_cluster_messages_sent_total{msg_type="update"}`                                      | Total number of state sync messages sent to Redis. Includes all HA state channels, not only alert broadcasts.                                                                                                                      |

### Tune alert broadcast queue size

The primary instance uses a message queue to broadcast alerts to other instances. By default, the queue holds up to 200 messages. If you have a large number of alert rules, the queue may fill up, causing messages to be dropped. You can detect this by monitoring the drop metric for your HA backend (see metrics tables above).

To increase the queue size, add the following to your `[unified_alerting]` section:

```ini
[unified_alerting]
ha_single_evaluation_alert_broadcast_queue_size = 500
```

The default value is `200`. This setting applies to both Memberlist and Redis HA backends.

## Verify your high availability setup

When running multiple Grafana instances, all alert rules are evaluated on every instance by default. This multiple evaluation of alert rules is visible in the [state history](/docs/grafana-cloud/alerting-and-irm/alerting/monitor-status/view-alert-state-history/) and provides a straightforward way to verify that your high availability configuration is working correctly.

> Note
> 
> If you use a mix of `execute_alerts=false` and `execute_alerts=true` on the HA nodes, the instances with `execute_alerts=false` do not show any alert status, because the alert state is not shared amongst the Grafana instances.
> 
> The HA settings (`ha_peers`, etc.) apply only to communication between alertmanagers, synchronizing silences and attempting to avoid duplicate notifications, as described in the introduction.

You can also confirm your high availability setup by monitoring Alertmanager metrics exposed by Grafana.

> Note
> 
> Starting in Grafana v12.4, these metrics are prefixed with `grafana_` (for example, `grafana_alertmanager_cluster_members`). If you are upgrading from an earlier version, update your dashboards and alert rules accordingly.

| Metric                                                         | Description                                                                                                                                                |
|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `grafana_alertmanager_cluster_members`                         | Current number of members in the cluster.                                                                                                                  |
| `grafana_alertmanager_cluster_messages_received_total`         | Total number of cluster messages received.                                                                                                                 |
| `grafana_alertmanager_cluster_messages_received_size_total`    | Total size of cluster messages received.                                                                                                                   |
| `grafana_alertmanager_cluster_messages_sent_total`             | Total number of cluster messages sent.                                                                                                                     |
| `grafana_alertmanager_cluster_messages_sent_size_total`        | Total size of cluster messages sent.                                                                                                                       |
| `grafana_alertmanager_cluster_messages_publish_failures_total` | Total number of messages that failed to be published.                                                                                                      |
| `grafana_alertmanager_cluster_pings_seconds`                   | Histogram of latencies for ping messages.                                                                                                                  |
| `grafana_alertmanager_cluster_pings_failures_total`            | Total number of failed pings.                                                                                                                              |
| `grafana_alertmanager_peer_position`                           | The position an Alertmanager instance believes it holds, which defines its role in the cluster. Peers should be numbered sequentially, starting from zero. |

You can confirm the number of Grafana instances in your alerting high availability setup by querying the `grafana_alertmanager_cluster_members` and `grafana_alertmanager_peer_position` metrics.

Note that these alerting high availability metrics are exposed via the `/metrics` endpoint in Grafana, and are not automatically collected or displayed. If you have a Prometheus instance connected to Grafana, add a `scrape_config` to scrape Grafana metrics and then query these metrics in Explore.

```yaml
- job_name: grafana
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  static_configs:
    - targets:
        - grafana:3000
```

For more information on monitoring alerting metrics, refer to [Alerting meta-monitoring](/docs/grafana-cloud/alerting-and-irm/alerting/monitor/). For a demo, see [alerting high availability examples using Docker Compose](https://github.com/grafana/alerting-ha-docker-examples/).

## Prevent duplicate notifications

In high-availability mode, each Grafana instance runs its own pre-configured Alertmanager to handle alert notifications.

When multiple Grafana instances are running, all alert rules are evaluated on each instance by default. Each instance sends firing alerts to its respective Alertmanager. This results in notification handling being duplicated across all running Grafana instances.

Alertmanagers in HA mode communicate with each other to coordinate notification delivery. However, this setup can sometimes lead to duplicated or out-of-order notifications. By design, HA prioritizes sending duplicate notifications over the risk of missing notifications.

To avoid duplicate notifications, you can configure a shared alertmanager to manage notifications for all Grafana instances. For more information, refer to [add an external alertmanager](/docs/grafana-cloud/alerting-and-irm/alerting/set-up/configure-alertmanager/).
