You should know about... these useful Prometheus alerting rules

• 1 Apr, 2021 • 5 min

Setting up Prometheus to scrape your targets for metrics is usually just one part of your larger observability strategy. The other piece of the equation is figuring out what you want your metrics to tell you, and when and how often you should hear about it. Thankfully, Prometheus makes it really easy for you to define alerting rules using PromQL, so you know when things are going north, south, or in no direction at all.

Deciding what you should be alerted on, and when, takes a practiced combination of understanding your systems’ KPIs and relying on your own experience and instinct. To provide some inspiration for you to get started right away, the Grafana Labs Solutions Engineering team put together this quick blog post of some of our favorite alerting rules.

The ‘up’ query

From Eldin:

One of the top benefits of having a pull-based monitoring system is that you have a better insight into knowing if services are up or down. The first rule that I think is important to know about is the up query. It’s so incredibly simple, yet so powerful. Note: I mostly use this for exporters.

  # Alert for any target that is unreachable for >1 minute.
  - alert: TargetMissing
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus Target Missing (instance {{ $labels.instance }})"
      description: "A Prometheus Target has gone missing, please investigate"
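
The fragment above is a single rule; in a rules file it has to sit inside a group. Here is a minimal sketch of a complete file. The file name, the group name, and the job name `node` are assumptions for illustration only. The second rule is an optional companion: if a target disappears from service discovery altogether, `up` has no series for it, so `up == 0` returns nothing and the first rule stays silent.

```yaml
# alerts.yml (hypothetical file name), loaded through rule_files in prometheus.yml
groups:
  - name: availability            # hypothetical group name
    rules:
      - alert: TargetMissing
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus Target Missing (instance {{ $labels.instance }})"
      # Assumption: a scrape job named "node" exists.
      # absent() returns 1 when no series match the selector,
      # so this fires when the whole job has vanished.
      - alert: JobAbsent          # hypothetical alert name
        expr: absent(up{job="node"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No targets found for job node"
```

You can validate a file like this with `promtool check rules alerts.yml` before reloading Prometheus.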

Alerts for USE and RED

From Mike Johnson:

Many people familiar with monitoring are concerned about creating yet another alert sprawl generator when migrating to a new platform such as Prometheus. Some even think that instead of alerting on infrastructure metrics, they should alert on application or service metrics only. It is our belief that you still want to set up metrics-based alerting in some capacity… though maybe not for every blip of CPU or memory.

Instead, we recommend you focus on your USE (utilization, saturation, errors on infrastructure) and RED (rates, errors, duration/latency on services) key performance indicators.

For your USE metrics, node_exporter generates these metrics on your behalf. Alerts where you are approaching physical limits on infrastructure are key, such as memory:

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory (instance {{ $labels.instance }})"
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
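
If you want to convince yourself that the threshold expression behaves as intended before it pages anyone, promtool can evaluate it against synthetic series. This is a minimal sketch; the file name, the `host1` instance, and the byte values are made up for illustration.

```yaml
# memory_test.yml (hypothetical file name)
# Run with: promtool test rules memory_test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Synthetic host with 512 of 8192 bytes available, i.e. 6.25%.
      - series: 'node_memory_MemAvailable_bytes{instance="host1"}'
        values: '512+0x10'
      - series: 'node_memory_MemTotal_bytes{instance="host1"}'
        values: '8192+0x10'
    promql_expr_test:
      # Same expression as the HostOutOfMemory rule.
      - expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        eval_time: 5m
        exp_samples:
          # Division drops the metric name; only the instance label remains.
          - labels: '{instance="host1"}'
            value: 6.25
```

Testing the expression rather than the full alert keeps the sketch independent of the exact annotation text.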

Or for disk:

  # Please add ignored mount points in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host disk will fill in 24 hours (instance {{ $labels.instance }})"
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Not all alerts are created equal, so it’s important to be selective (and your local Grafana Labs Solutions Engineer can help!).

Regarding the application layer: With Prometheus, it is common that developers will generate RED metrics using the Prometheus client libraries to get a better understanding of application state and performance — and be able to slice that data by metadata fields (URL, host, data center, and so on) — for much faster problem detection versus troubleshooting via logging alone.

For RED metrics, tracking 95th percentile latency could look like this:

  # Written in the YAML rule format used by Prometheus 2.x.
  - alert: High95thResponseTime
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[1m])) by (le, service, route, method)) > 500
    for: 60s
    annotations:
      summary: "95th percentile response time exceeded on {{ $labels.service }} and {{ $labels.method }} {{ $labels.route }}"
      description: "{{ $labels.service }}, {{ $labels.method }} {{ $labels.route }} has a 95th percentile response time above 500ms (current value: {{ $value }}ms)"
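
The latency rule covers the "duration" part of RED. For the "errors" part, one possible sketch is an error-ratio alert. The counter `http_requests_total`, its `service` and `status` labels, the alert name, and the 5% threshold are all assumptions here; substitute whatever your client library instrumentation actually exposes.

```yaml
  - alert: HighErrorRatio         # hypothetical alert name
    # Assumption: a counter http_requests_total with "service" and "status" labels.
    # Share of requests answered with a 5xx status over the last 5 minutes.
    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum by (service) (rate(http_requests_total[5m]))
        > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "More than 5% of requests to {{ $labels.service }} are failing"
```

Alerting on a ratio rather than a raw error count keeps the rule meaningful for both low-traffic and high-traffic services.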

Alert for pod restarts

From Heds Simons:

Originally: Summit ain’t deployed right, init. Or your node is fried.

Edited: Had I known Eldin was going to publish our conversation verbatim, I would not have invoked my native Birmingham, U.K., accent. :D

This particular alert is a pared-down example of using the changes() function to see how many times the restart counter of the specified pods or containers has changed in a time period (in this case, more than twice in the last 5 minutes). There’s a case for bumping the threshold up slightly and lengthening the time period so you don’t get a “false alert” on new deployments (although generally you’d know when you’re deploying). There’s also an argument for adding a for clause, although really, we do want to know if pods are restarting regardless of whether it occurs frequently in a time period.

  # Replace <container/pod> and <name> with your own label name and value.
  - alert: ETOOMANYRESTARTS
    expr: changes(kube_pod_container_status_restarts_total{<container/pod>=~"<name>|.*"}[5m]) > 2
    labels:
      severity: warning
    annotations:
      summary: "Pod restarts are occurring frequently"
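
If new deployments do trip this alert for you, the more tolerant variant described above might look like the following sketch. The alert name, the 15-minute window, the threshold of 3, and the 5-minute for clause are illustrative choices, not recommendations from the team.

```yaml
  - alert: PodRestartingRepeatedly   # hypothetical alert name
    # Wider window and higher threshold, so a restart or two
    # during a rollout does not trigger the alert.
    expr: changes(kube_pod_container_status_restarts_total[15m]) > 3
    # The condition must hold for 5 minutes before the alert fires.
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

The trade-off is slower detection: a pod that crash-loops will take longer to surface than with the 5-minute rule.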

Alert when something is too slow

From Éamon Ryan:

It’s all very well to alert when something is missing, or in the wrong state, but it’s important to also alert when things are behaving much more slowly than expected. What good is it for your load balancer to be online if every request takes 10 seconds? The way to write this into a rule will vary depending on the available metrics, but here is one example for HAProxy:

  - alert: HaproxyHttpSlowingDown
    expr: avg by (proxy) (haproxy_backend_max_total_time_seconds) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      # avg by (proxy) keeps only the proxy label, so reference it here.
      summary: "HAProxy HTTP slowing down (proxy {{ $labels.proxy }})"
      description: "Average request time is increasing\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Some final notes

For our Grafana Cloud customers, we would like to remind you that we have a Prometheus-style UI for your alerts, recording rules, and Alertmanager. If you are unsure where to get started, we also have snazzy integrations, which offer you the right scrape configs, rules/alerts, and dashboards. And if you’re not already using Grafana Cloud, we have free and paid plans to suit every use case — sign up for free now.

Shout out to the best SE team for sharing their favorites. We’ve now covered Grafana transformations and Prometheus alerting rules. Let us know what you’d like us to talk about next!
