Meta monitoring for Cloud

Monitor your alerting metrics to ensure you identify potential issues before they become critical.

Meta monitoring is the process of monitoring your monitoring system and alerting when your monitoring is not working as it should.

To help you meta monitor, Grafana provides predefined metrics.

Identify which metrics are critical to your monitoring system (that is, Grafana), and then decide how you want to monitor them.

You can use meta-monitoring metrics to understand the health of your alerting system in the following ways:

  1. [Optional] Create a dashboard in Grafana that uses these metrics in a panel (just as you would for any other kind of metric).
  2. [Optional] Create an alert rule in Grafana that checks these metrics regularly (just as you would for any other kind of alert rule); see the sketch after this list.
  3. [Optional] Use Explore in Grafana to run ad-hoc queries against them.
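
For example, the same query expression can back all three uses: a dashboard panel, an alert rule condition, or an Explore query. The following is a minimal PromQL sketch using one of the Alertmanager metrics described later on this page; adjust the time range and threshold to suit your needs.

```promql
# Fires (or plots a non-zero value) when any notification failed to send
# in the last five minutes, summed across all integrations.
sum(increase(alertmanager_notifications_failed_total[5m])) > 0
```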

Metrics for Mimir-managed alerts

There may be more metrics available than listed here.

Use the grafanacloud-usage data source and its Metrics browser to view all usage metrics currently available for Mimir-managed alerts in Grafana Cloud.

grafanacloud_instance_rule_evaluation_failures_total:rate5m

This metric is derived from a counter and shows the rate of rule evaluation failures over the last 5 minutes.
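
A minimal alert rule condition could fire whenever any rule evaluations are failing (a sketch; the threshold of zero assumes you want to know about every failure):

```promql
# Fire when the 5-minute rule evaluation failure rate is above zero.
grafanacloud_instance_rule_evaluation_failures_total:rate5m > 0
```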

Metrics for Alertmanager

The names of Alertmanager metrics may vary and there may be more metrics available than listed here.

Use the grafanacloud-usage data source and its Metrics browser to view all usage metrics currently available for Alertmanager in Grafana Cloud.

These are some of the metrics for Alertmanager:

alertmanager_alerts

This metric is a gauge that shows you the number of active, suppressed, and unprocessed alerts in Alertmanager. Suppressed alerts are silenced alerts, and unprocessed alerts are alerts that have been sent to the Alertmanager but have not yet been processed.
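
For a quick view in Explore or a dashboard panel, you can break the alerts out by state (a sketch, assuming the metric carries Alertmanager's usual state label):

```promql
# Current number of alerts in each state (active, suppressed, unprocessed).
sum by (state) (alertmanager_alerts)
```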

alertmanager_alerts_invalid_total

This metric is a counter that shows you the number of invalid alerts that were sent to Alertmanager. This counter should not increase, so in most cases you will want to create an alert rule that fires whenever it does.
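
A minimal sketch of such an alert rule condition:

```promql
# Fire if any invalid alerts were received in the last five minutes.
increase(alertmanager_alerts_invalid_total[5m]) > 0
```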

alertmanager_notifications_total

This metric is a counter that shows you how many notifications have been sent by Alertmanager. The metric uses a label “integration” to show the number of notifications sent by integration, such as email.
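
To see notification throughput per integration, a rate query is usually more useful than the raw counter (a sketch):

```promql
# Per-second notification send rate, broken out by integration (for example, email).
sum by (integration) (rate(alertmanager_notifications_total[5m]))
```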

alertmanager_notifications_failed_total

This metric is a counter that shows you how many notifications have failed in total. This metric also uses a label “integration” to show the number of failed notifications by integration, such as failed emails. In most cases you will want to use the rate function to understand how often notifications are failing to be sent.
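
For alerting, a failure ratio is often more meaningful than the raw failure rate. A minimal sketch, assuming both counters carry the same integration label:

```promql
# Fraction of notifications that failed, per integration, over the last five minutes.
sum by (integration) (rate(alertmanager_notifications_failed_total[5m]))
  /
sum by (integration) (rate(alertmanager_notifications_total[5m]))
```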

alertmanager_notification_latency_seconds_bucket

This metric is a histogram that shows you the amount of time it takes Alertmanager to send notifications and for those notifications to be accepted by the receiving service. This metric uses a label “integration” to show the amount of time by integration. For example, you can use this metric to show the 95th percentile latency of sending emails.
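
For example, a sketch of the 95th percentile notification latency per integration:

```promql
# 95th percentile notification latency per integration over the last five minutes.
histogram_quantile(
  0.95,
  sum by (le, integration) (rate(alertmanager_notification_latency_seconds_bucket[5m]))
)
```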