Prometheus alerting
You can use Grafana Alerting with Prometheus to create alerts based on your time-series data. This allows you to monitor metrics, detect anomalies, and receive notifications when specific conditions are met.
For general information about Grafana Alerting, refer to Grafana Alerting.
Before you begin
Before creating alerts with Prometheus, ensure you have:
- A Prometheus data source configured in Grafana
- Appropriate permissions to create alert rules
- Understanding of the PromQL metrics you want to monitor
Alert rule types
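Prometheus supports two alerting workflows in Grafana: Grafana-managed alert rules, which you create and evaluate in Grafana, and data source-managed rules, which are defined in Prometheus and displayed in Grafana.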
Create a Grafana-managed alert rule
To create a Grafana-managed alert rule using Prometheus:
- Navigate to Alerting > Alert rules.
- Click New alert rule.
- Enter a name for the alert rule.
- Select your Prometheus data source.
- Write a PromQL query in the query editor.
- Configure the alert condition (for example, when the last value is above a threshold).
- Set the evaluation interval and pending period.
- Configure notifications and labels.
- Click Save rule.
For detailed instructions, refer to Create a Grafana-managed alert rule.
View data source-managed rules
When Manage alerts via Alerting UI is enabled in the Prometheus data source configuration, Grafana fetches and displays alerting rules defined in Prometheus. These appear in the Alerting UI alongside Grafana-managed rules but are marked as data source-managed.
For Prometheus data sources, this view is read-only. To modify these rules, update your Prometheus rule files directly.
Note
For Mimir and Cortex data sources, the Alerting UI supports both viewing and creating data source-managed rules. Prometheus only supports viewing.
Evaluation groups and intervals
Alert rules are organized into evaluation groups. Each group has an evaluation interval that determines how frequently the rules in that group are evaluated. For example, an evaluation interval of 1m means the alert query runs every 60 seconds.
The pending period determines how long a condition must be continuously true before the alert fires. For example, with a 1-minute evaluation interval and a 5-minute pending period, the condition must be true for 5 consecutive evaluations before firing.
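The interaction between the two settings can be sketched in a few lines of Python. This is a simplified model that follows the description above: the alert fires once the condition has held for (pending period / evaluation interval) consecutive evaluations. The function name and the list-of-booleans input are illustrative assumptions, and Grafana's actual state machine may differ by one evaluation at the boundary.

```python
import math


def evaluations_until_firing(results, interval_s, pending_s):
    """Return the 1-based evaluation at which the alert fires, or None.

    results    -- one boolean per evaluation (True = condition breached)
    interval_s -- evaluation interval in seconds
    pending_s  -- pending period in seconds
    """
    # Number of consecutive breached evaluations needed (at least one).
    required = max(1, math.ceil(pending_s / interval_s))
    streak = 0
    for index, breached in enumerate(results, start=1):
        # A single healthy evaluation resets the pending streak.
        streak = streak + 1 if breached else 0
        if streak >= required:
            return index
    return None
```

With a 60-second interval and a 300-second pending period, five consecutive breaches are needed, and one healthy evaluation in between restarts the count.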
Choose evaluation intervals based on your use case:
- 15s–30s — Critical infrastructure alerts where fast detection matters.
- 1m — Standard monitoring alerts (recommended default).
- 5m — Non-urgent or noisy metrics where you want to reduce evaluation load.
Example alert queries
The following examples show common alerting scenarios with Prometheus. Each example shows the PromQL query and how to configure the alert condition.
Alert on high CPU usage
Monitor CPU usage and alert when it exceeds 90%:
Query A:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)
Condition: Set the threshold to alert when the last value of A is above 90.
This approach separates the metric query from the threshold, making it easier to adjust the threshold later without editing the PromQL.
Alert on high memory usage
Monitor memory usage across nodes:
Query A:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Condition: Alert when the last value of A exceeds 85.
Alert on high error rate
Monitor HTTP error rates per service:
Query A:
sum(rate(http_requests_total{status=~"5.."}[$__rate_interval])) by (job)
/
sum(rate(http_requests_total[$__rate_interval])) by (job)
* 100
Condition: Alert when the last value of A exceeds 5 (meaning error rate above 5%).
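As a worked example of what this query computes, the following Python sketch mirrors the arithmetic for a single job. The function names are hypothetical. One case worth noting is a job with no traffic: PromQL evaluates 0 / 0 to NaN, and in IEEE 754 arithmetic a NaN never compares greater than a threshold. How Grafana treats such a sample is not covered here.

```python
import math


def error_rate_percent(error_rps, total_rps):
    """Percentage of failed requests, mirroring errors / total * 100."""
    if total_rps == 0:
        # With no traffic both rates are 0; PromQL yields NaN for 0 / 0.
        return math.nan
    return error_rps / total_rps * 100


def breaches(value, threshold=5.0):
    """True when the value is above the threshold; always False for NaN."""
    return value > threshold
```

For example, 1 failed request per second out of 8 gives 12.5%, which is above the 5% threshold, while 1 out of 32 gives 3.125%, which is not.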
Alert on target down
Monitor whether Prometheus scrape targets are reachable:
Query A:
up{job="myservice"}
Condition: Alert when the last value of A is below 1.
Alert when a metric disappears
Use absent() to detect when a metric stops being scraped entirely — for example, when a service crashes and no longer reports metrics:
Query A:
absent(up{job="myservice"})
Condition: Alert when the last value of A equals 1 (the absent() function returns 1 when the metric is missing, and nothing when the metric exists).
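The difference between this check and the target-down check can be sketched in Python. The sketch assumes that a target whose scrape fails still produces an up sample with value 0 (which is how Prometheus records a failed scrape), while a target that is no longer in the scrape configuration produces no sample at all. The helper names are hypothetical.

```python
def absent(samples):
    """Mimic PromQL absent(): one sample of 1 for an empty input, else nothing."""
    return [] if samples else [1.0]


def classify_target(up_samples):
    """Classify a job from the current values of its `up` series."""
    if absent(up_samples):
        # No series at all: only the absent() rule catches this case.
        return "missing"
    if any(value < 1 for value in up_samples):
        # Series exists but a scrape failed: the "below 1" rule catches this.
        return "down"
    return "up"
```

Under these assumptions the two rules complement each other, so it may be worth keeping both for the same job.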
For detecting staleness over a time window (metric exists but hasn’t reported recently):
absent_over_time(up{job="myservice"}[5m])
Multi-condition alert (high latency AND high traffic)
Use multiple queries and expressions to alert only when multiple conditions are true simultaneously. This reduces noise by avoiding alerts during low-traffic periods.
Query A (P95 latency):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[$__rate_interval])) by (le))
Query B (request rate):
sum(rate(http_requests_total{job="api"}[$__rate_interval]))
Expression C (Math; both conditions must be true):
$A > 2 && $B > 100
Condition: Alert on C, which evaluates to a non-zero value only when latency exceeds 2 seconds AND request rate exceeds 100 req/s.
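The logic of expression C can be sketched in Python. The sketch assumes that comparison and logical operators in a Math expression yield 1 for true and 0 for false, and that the alert condition is met when the result is non-zero. The function name is hypothetical.

```python
def math_expression_c(p95_latency_s, request_rate_rps):
    """Mimic `$A > 2 && $B > 100` for a single pair of values."""
    latency_high = 1 if p95_latency_s > 2 else 0
    traffic_high = 1 if request_rate_rps > 100 else 0
    # Logical AND: 1 only when both operands are non-zero.
    return 1 if (latency_high and traffic_high) else 0
```

High latency during a quiet period (for example, 3 seconds at 20 req/s) yields 0, so the rule stays quiet until traffic is also high.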
Recording rules
Grafana-managed recording rules pre-compute expensive PromQL expressions on a schedule and write the results back to a Prometheus-compatible data source as a new metric. This lets you query the pre-aggregated metric instead of re-evaluating the expensive expression on every dashboard load or alert evaluation.
When to use recording rules
- Dashboard panels that query the same expensive expression repeatedly — pre-compute it once and query the result.
- Alert rules on complex expressions — simplify the alert query by alerting on the pre-aggregated metric.
- High-cardinality aggregations — reduce thousands of series to a handful of pre-computed series.
Set up Grafana-managed recording rules for Prometheus
- Enable the data source as a recording rules target: In the Prometheus data source configuration, verify that Allow as recording rules target is toggled on (it’s on by default). This allows Grafana to write recording rule results back to this instance.
- Verify write access: The Prometheus-compatible backend must support remote write. Grafana Cloud Metrics (Mimir) and self-hosted Mimir support this natively. Standard Prometheus requires the --web.enable-remote-write-receiver flag (Prometheus 2.33+).
- Create a recording rule:
  - Navigate to Alerting > Alert rules.
  - Click New alert rule.
  - Select Recording rule as the rule type (under the Grafana-managed section).
  - Enter the PromQL expression you want to pre-compute, using a fixed range so that it matches the recorded metric name:
    sum(rate(http_requests_total[5m])) by (service)
  - Enter a metric name for the result (for example, service:http_requests:rate5m). Follow the Prometheus recording rule naming convention: level:metric:operations.
  - Select the Target data source, which is the Prometheus instance where results will be written.
  - Set the evaluation interval (for example, every 1 minute).
  - Click Save rule.
- Query the recorded metric: After the first evaluation, the new metric is available for dashboards and alerts:
  service:http_requests:rate5m{service="api"}
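The level:metric:operations naming convention mentioned in these steps can be checked with a small helper before you save a rule. This is a hypothetical lint sketch, not part of Grafana or Prometheus, and it is deliberately stricter than Prometheus itself, which accepts any valid metric name.

```python
import re

# One name component: letters, digits, and underscores, not starting with a digit.
_COMPONENT = r"[a-zA-Z_][a-zA-Z0-9_]*"
_RECORDED_NAME = re.compile(rf"({_COMPONENT}):({_COMPONENT}):({_COMPONENT})")


def split_recorded_name(name):
    """Split a recorded metric name into level, metric, and operations."""
    match = _RECORDED_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow level:metric:operations")
    level, metric, operations = match.groups()
    return {"level": level, "metric": metric, "operations": operations}
```

For the example above, the level is service, the metric is http_requests, and the operations part is rate5m.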
Recording rules with PDC
If your Prometheus instance is behind Private data source connect (PDC), Grafana can write recording rule results through the PDC tunnel. No additional configuration is needed — PDC supports both reads and writes.
Limitations
- Recording rules require the target data source to support remote write. Standard Prometheus needs the --web.enable-remote-write-receiver flag.
- Thanos does not support recording rules as a write target (refer to the Prometheus type comparison).
- Recording rule evaluation uses the configured evaluation interval, not dashboard time ranges. Use a fixed range vector (for example, [5m]) rather than $__rate_interval.
For more information, refer to Create Grafana-managed recording rules.
Limitations
When using Prometheus with Grafana Alerting, be aware of the following limitations.
Template variables not supported
Alert queries don’t support template variables. Grafana evaluates alert rules on the backend without dashboard context, so variables like $instance or $job aren’t resolved.
If your dashboard query uses template variables, create a separate query for alerting with hard-coded values or use label matchers directly.
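When copying a dashboard query into an alert rule, a quick scan for leftover template variables can help. The following Python sketch is a hypothetical helper that assumes the $name and ${name} variable syntaxes and treats names starting with a double underscore (such as $__rate_interval) as built-in variables rather than dashboard variables. It does not cover the older [[name]] syntax.

```python
import re

# Matches ${name} (optionally with a :format suffix) or $name.
_VARIABLE = re.compile(r"\$(?:\{(\w+)(?::\w+)?\}|(\w+))")


def find_template_variables(query):
    """Return dashboard variable names referenced in a query, in order."""
    found = []
    for braced, bare in _VARIABLE.findall(query):
        name = braced or bare
        if name.startswith("__"):
            # Built-in variable such as $__rate_interval; not a dashboard variable.
            continue
        if name not in found:
            found.append(name)
    return found
```

Any name the helper returns is a candidate to replace with a hard-coded value or a label matcher.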
Query complexity
Complex queries with many nested functions or large result sets may timeout or fail to evaluate. Simplify queries for alerting by:
- Reducing the time range used in range vectors
- Using appropriate aggregation to limit the number of returned series
- Adding label filters to narrow the data scanned
- Using recording rules to pre-compute expensive expressions
OAuth token handling differs between Explore and Alerting
When using OAuth-authenticated Prometheus endpoints (Google Managed Prometheus, Azure Managed Prometheus), queries may succeed in Explore and dashboards but fail intermittently during alert evaluation. This happens because the alerting backend handles token refresh differently from the interactive query path.
If you’re using GCP, consider the datasource-syncer pattern — a sidecar process that refreshes OAuth tokens and updates the data source credentials on a schedule shorter than the token lifetime.
For detailed troubleshooting steps, refer to OAuth token expiration errors.
Data source-managed rules are read-only
Grafana can display Prometheus alerting rules but can’t create or modify them through the UI. To manage Prometheus-native alerting rules, edit your Prometheus rule files directly and reload the configuration.
Configure alert state for execution errors
By default, when a Grafana-managed alert rule encounters an execution error or timeout (such as a network blip, i/o timeout, or a transient 502 from Prometheus), the rule enters an Error state — which fires the alert. This can cause false alarms and spam on-call teams when the underlying issue is a brief connectivity interruption rather than a genuine threshold breach.
To prevent false positives from transient errors, configure the Alert state if execution error or timeout setting on each alert rule:
- Open the alert rule for editing.
- In the alert conditions section, locate Alert state if execution error or timeout.
- Change the value from Alerting (default) to one of:
  - Keep Last State: The alert retains its previous state (firing or normal) until a successful evaluation occurs. This is the recommended setting for most Prometheus alert rules.
  - OK: The alert is set to normal during the error, preventing it from firing.
- Click Save rule.
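The effect of each option can be sketched as a small state function in Python. This is a simplified model under stated assumptions: it uses the option names from the steps above, represents states as the strings "Normal" and "Alerting", and ignores pending periods and no-data handling. The function and argument names are hypothetical.

```python
def next_state(previous, outcome, on_error="Alerting"):
    """Return the alert state after one evaluation.

    previous -- state before this evaluation ("Normal" or "Alerting")
    outcome  -- "ok", "breached", or "error" (execution error or timeout)
    on_error -- the configured option: "Alerting", "Keep Last State", or "OK"
    """
    if outcome == "breached":
        return "Alerting"
    if outcome == "ok":
        return "Normal"
    if outcome != "error":
        raise ValueError(f"unknown outcome: {outcome!r}")
    if on_error == "Keep Last State":
        # A transient failure does not change what the rule last observed.
        return previous
    if on_error == "OK":
        return "Normal"
    # "Alerting": the error itself causes the rule to fire.
    return "Alerting"
```

In this model, a healthy rule that hits a timeout stays "Normal" with Keep Last State, while a rule that was already firing keeps firing until an evaluation succeeds.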
Note
If your alert rules frequently enter an error state, investigate the root cause (network stability, Prometheus resource limits, query timeout settings) rather than relying solely on this setting to suppress notifications.
Common transient errors that trigger this behavior include:
- sse.dependencyError or sse.dataQueryError in alert state history
- “context deadline exceeded” or “i/o timeout” messages
- HTTP 502 or 500 responses from the Prometheus server
For more details on troubleshooting these errors, refer to Troubleshoot Prometheus data source issues.
Best practices
Follow these best practices when creating Prometheus alerts:
- Configure error state handling: Set Alert state if execution error or timeout to Keep Last State to prevent transient backend errors from triggering false alarms.
- Use $__rate_interval: When using rate() or increase() in alert queries, use $__rate_interval to ensure the range window is always large enough relative to the scrape interval. Grafana resolves this variable based on the evaluation interval and scrape interval configuration.
- Add label filters: Include specific label matchers to focus on relevant data and improve query performance.
- Set realistic pending periods: Use the pending period to avoid alerting on brief spikes. For example, set a 5-minute pending period so the condition must persist before firing.
- Test queries first: Verify your query returns expected results in Explore before creating an alert.
- Use meaningful names: Give alert rules descriptive names that indicate what they monitor and the severity.
- Pre-aggregate with recording rules: For complex or frequently evaluated expressions, create recording rules and alert on the pre-aggregated metric.
- Use absent() for availability monitoring: Detect when metrics stop being reported, which often indicates a crashed or unresponsive service.
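To illustrate why $__rate_interval keeps the range window large enough, the following Python sketch applies the formula Grafana documents for this variable: the larger of (interval + scrape interval) and (4 × scrape interval). The function name is hypothetical, and which interval Grafana feeds into the formula during alert evaluation is an assumption here.

```python
def rate_interval_seconds(interval_s, scrape_interval_s=15):
    """Approximate $__rate_interval in seconds.

    interval_s        -- the query interval ($__interval) in seconds
    scrape_interval_s -- the data source's scrape interval in seconds
    """
    # The window never drops below four scrape intervals, so rate() has
    # several samples to work with even at short query intervals.
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)
```

With a 15-second scrape interval, a 60-second query interval resolves to 75 seconds, and a 15-second query interval resolves to the 60-second floor.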
If you encounter errors when creating or evaluating alert rules, refer to Troubleshoot Prometheus data source issues.

