---
title: "Respond to an alert | Grafana Cloud documentation"
description: "Triage an alert and route to the right investigation workflow."
---


# Respond to an alert

When an alert fires, your first task is to confirm what’s happening and identify what type of issue you’re dealing with. This page helps you triage quickly and route to the right investigation workflow.

## Alerts and telemetry signals

Metrics are the most common signal for triggering alerts. They’re designed for efficient, continuous evaluation. Unlike logs or traces, metrics are pre-aggregated numerical time series that can be quickly queried to check thresholds, like `CPU > 80%` or `error rate > 1%`. Alert rules typically evaluate every 30-60 seconds, and metrics handle this constant querying with minimal overhead.
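For example, a threshold condition like `CPU > 80%` can be expressed as a PromQL query. The sketch below assumes standard `node_exporter` metric names; adapt it to your own instrumentation:

```promql
# CPU usage above 80%, averaged per instance:
# subtract idle time from 100% to get busy time
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100) > 80
```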

For logs, you can alert on LogQL queries that aggregate log data, for example, `count_over_time()` to alert when error log volume exceeds a threshold, or `rate()` to monitor log rates. For traces, you can use TraceQL metrics queries to alert on span error rates or latency percentiles. SQL data sources let you alert on database query results. This is useful for business metrics or application-level conditions that aren’t captured in time-series metrics.
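As a minimal sketch, a Loki alert condition on error log volume could look like the following LogQL query. The `service_name` label matches the examples later on this page; the threshold of 100 lines is illustrative:

```logql
# Fire when more than 100 error-level log lines arrive within 5 minutes
sum(count_over_time({service_name="api_server"} | detected_level="error" [5m])) > 100
```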

The query must return numeric values that can be evaluated against thresholds. That’s why you typically use aggregation functions rather than querying raw logs or individual traces.
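For traces, a TraceQL metrics query aggregates matching spans into a numeric series you can alert on. This sketch reuses the span filter from the scenario later on this page:

```traceql
# Rate of errored spans for the api_server service
{resource.service.name="api_server" && status=error} | rate()
```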

### Example: Create an alert for high error rates

If you want to alert on high error rates in your application, you can create a Grafana alert rule that evaluates the error rate percentage. Refer to [Introduction to Grafana Alerting](/docs/grafana-cloud/alerting-and-irm/alerting/) for more information about how alerts work in Grafana Cloud.

To set up the alert, create a Grafana alert rule that queries your Prometheus data source using a PromQL expression like this:


```promql
100 * sum(rate(http_requests_total{service="api_server",status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="api_server"}[$__rate_interval]))
```

Configure the alert rule to trigger when the query result exceeds your threshold (for example, 5 for a 5% error rate). You also define how often Grafana checks this condition (for example, every 1m) and how long the condition must be true before alerting (for example, 2m to prevent alerts from brief spikes). Behind the scenes, Grafana reduces the time series to a single value and compares it against your threshold.
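If you manage alert rules as code, the query, reduce, and threshold steps map onto Grafana's alerting file provisioning format. The following is a simplified sketch, not a complete rule definition: exact fields vary by Grafana version, and the datasource UID is a placeholder.

```yaml
# Sketch of a Grafana-managed alert rule via file provisioning (assumed/simplified)
apiVersion: 1
groups:
  - orgId: 1
    name: api-server-alerts
    folder: Application Alerts
    interval: 1m                  # how often the rule is evaluated
    rules:
      - uid: api-high-error-rate
        title: High Error Rate - API Server
        condition: C              # the refId whose result decides firing
        for: 2m                   # condition must hold this long before firing
        data:
          - refId: A              # the PromQL error-rate query
            datasourceUid: my-prometheus   # placeholder UID
            relativeTimeRange: { from: 600, to: 0 }
            model:
              expr: >-
                100 * sum(rate(http_requests_total{service="api_server",status=~"5.."}[$__rate_interval]))
                / sum(rate(http_requests_total{service="api_server"}[$__rate_interval]))
          - refId: B              # reduce the series to a single value
            datasourceUid: __expr__
            model: { type: reduce, expression: A, reducer: last }
          - refId: C              # compare against the 5% threshold
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator: { type: gt, params: [5] }
```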

Finally, route the alert to a contact point (for example, “On-Call Team”) that sends notifications to Slack and PagerDuty when the threshold is breached. When the error rate drops back to normal, the alert automatically resolves and notifies the team.

For step-by-step instructions on creating alert rules, refer to [Configure Grafana-managed alert rules](/docs/grafana-cloud/alerting-and-irm/alerting/alerting-rules/create-grafana-managed-rule/).

## Scenario: Alert on high error rate

Here’s an example scenario for an alert on a high error rate.

Alert name: “High Error Rate - API Server”. The alert is triggered by a Prometheus query that evaluates every minute:


```promql
sum(rate(http_requests_total{service="api_server",status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="api_server"}[$__rate_interval]))
> 0.05
```

At 8:47 PM, the alert fires because the error rate jumped from the normal `0.2%` to `6.3%`. The notification includes the current value `6.3%` and labels like `service="api_server"` and `env="production"`.

Investigation flow using different signals:

1. Metrics: Shows error rate spiked at 8:45 PM.
2. Logs: Query `{service_name="api_server"} | detected_level="error"` for 8:45-8:50 PM, which reveals “database connection timeout” errors.
3. Traces: Search TraceQL `{resource.service.name="api_server" && status=error}`, which shows failed requests are hitting the `/users` endpoint.
4. Profiles: CPU profiling for `api_server` during 8:45-8:50 PM, which shows database connection pool exhausted.

This lets you go from “error rate too high” to “the database connection pool is exhausted on the users endpoint” in minutes.

## Confirm the alert

Before diving into investigation, verify the alert is real:

1. Check the alert details for the affected service, metric, and threshold.
2. Open the alerting metric in **Drilldown** > **Metrics** or **Explore**.
3. Confirm: Is the metric actually elevated? When did it start? Is it still happening?

If the alert is a false positive or has already resolved, acknowledge it and document why.

### Example: Confirm the alert using Metrics Drilldown

For the above scenario, here’s how you would confirm the alert using Metrics Drilldown.

1. Start with the service filter. Navigate to **Drilldown** > **Metrics** and filter by `service="api_server"` to scope all metrics to that service.
2. Examine request rates. Select the `http_requests_total` metric to see the overall request rate timeline showing the spike at 8:45 PM.
3. Break down by status code. Group by the `status` label to split the visualization by status codes. You see 5xx errors (`500`, `502`, `503`) increased sharply while 2xx responses stayed flat or dropped.
4. Break down by endpoint. Group further by the `endpoint` or `path` label to identify that the `/users` endpoint specifically shows the error spike, while other endpoints like `/health` and `/metrics` remain stable.
5. Check specific instances. Break down by the `instance` or `pod` label to see whether all `api_server` instances are affected or just specific ones.

This visual exploration confirms the alert and narrows the scope from “`api_server` has high errors” to “the `/users` endpoint on `api_server` is throwing `503` errors across all instances starting at 8:45 PM.” This points you toward a dependency issue like the database connection pool problem.

## Try the workflow

Want to try the workflow yourself? Use the public demo environment on [play.grafana.org](https://play.grafana.org) or Grafana Assistant in your own Grafana Cloud instance.

### Quick triage with Grafana Assistant

If you have Grafana Cloud with [Grafana Assistant](/docs/grafana-cloud/machine-learning/assistant/), you can triage quickly with natural language:

1. Click the **sparkle icon** in the top navigation bar to open **Grafana Assistant**.
2. Describe the alert:
   
   > “Is the checkout service having errors right now?”
   > 
   > “Show latency for api-server in the last hour”
   > 
   > “What’s the CPU usage for frontend pods?”

Assistant queries the right data sources and helps you identify the issue type faster.

### Practice on play.grafana.org

Use the public demo environment to practice alert investigation with Metrics Drilldown.

> Note
> 
> The demo environment doesn’t have the scenario’s HTTP request metrics, but you can practice the investigation workflow using synthetic monitoring metrics.

1. Open [play.grafana.org](https://play.grafana.org) and navigate to **Drilldown** > **Metrics**.
2. Search for `probe_success` and select **probe\_success**.
3. On the **Breakdown** tab, click **Select** on the **probe** label to see success rates for each probe location.
4. Look for probe locations with lower success rates; these would be candidates for investigation.
5. Click **Add to filters** on a specific probe location to drill down further.
6. Navigate to **Drilldown** > **Logs** to see logs from services in the environment. Use the **Filter by label values** dropdown or the **Add label** tab to filter logs by service or other labels.

## Identify the issue type

Based on what you observe, determine the issue type and route to the corresponding workflow.


| What you observe                              | Issue type      | Workflow                                                                                                                                                                                                                                                     |
|-----------------------------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Error rate increased, 5xx responses, failures | Errors          | [Troubleshoot an error](/docs/grafana-cloud/telemetry-signals/workflows/troubleshoot-error/)                                                                                                                                                                 |
| Latency percentiles elevated, slow responses  | Performance     | [Investigate slow performance](/docs/grafana-cloud/telemetry-signals/workflows/investigate-slow-performance/)                                                                                                                                                |
| CPU/memory spike, resource exhaustion         | Resource issue  | [Troubleshoot an error](/docs/grafana-cloud/telemetry-signals/workflows/troubleshoot-error/) (check for runaway processes) or [Investigate slow performance](/docs/grafana-cloud/telemetry-signals/workflows/investigate-slow-performance/) (check for load) |
| Already have a slow trace to investigate      | Code bottleneck | [Find slow code from a trace](/docs/grafana-cloud/telemetry-signals/workflows/find-slow-code-from-trace/)                                                                                                                                                    |

> Note
> 
> If you’re unsure which type of issue you’re dealing with, start with [Troubleshoot an error](/docs/grafana-cloud/telemetry-signals/workflows/troubleshoot-error/). Errors are usually the fastest to confirm or rule out.

## After investigation

After you’ve identified the root cause:

1. Immediate: Address the symptom (scale resources, restart pods, block traffic).
2. Short-term: Fix the trigger (add rate limiting, fix the bug, optimize the query).
3. Long-term: Prevent recurrence (tune alert thresholds, update runbooks, improve instrumentation).
