Respond to an alert

When an alert fires, your first task is to confirm what’s happening and identify what type of issue you’re dealing with. This page helps you triage quickly and route to the right investigation workflow.

Alerts and telemetry signals

Metrics are the most common signal for triggering alerts. They’re designed for efficient, continuous evaluation. Unlike logs or traces, metrics are pre-aggregated numerical time series that can be quickly queried to check thresholds, like CPU > 80% or error rate > 1%. Alert rules typically evaluate every 30-60 seconds, and metrics handle this constant querying with minimal overhead.
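
For example, here's a minimal PromQL sketch of the "CPU > 80%" style of threshold mentioned above. It assumes node_exporter's node_cpu_seconds_total metric is available; adjust the metric and labels to match your environment:

promql
# Percentage of CPU busy per instance, compared against an 80% threshold.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80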

For logs, you can alert on LogQL queries that aggregate log data, for example, count_over_time() to alert when error log volume exceeds a threshold, or rate() to monitor log rates. For traces, you can use TraceQL metrics queries to alert on span error rates or latency percentiles. SQL data sources let you alert on database query results. This is useful for business metrics or application-level conditions that aren’t captured in time-series metrics.

The query must return numeric values that can be evaluated against thresholds. That’s why you typically use aggregation functions rather than querying raw logs or individual traces.
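
As a hedged sketch, this is what the count_over_time() approach could look like in LogQL. The service_name and detected_level labels follow the scenario later on this page, and the threshold of 100 errors is illustrative; your labels and threshold will differ:

logql
sum(count_over_time({service_name="api_server"} | detected_level="error" [5m])) > 100

Similarly, a TraceQL metrics sketch (where TraceQL metrics are available) returns the rate of errored spans for the same service, which you can then evaluate against a threshold in the alert rule:

traceql
{resource.service.name="api_server" && status=error} | rate()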

Example: Create an alert for high error rates

If you want to alert on high error rates in your application, you can create a Grafana alert rule that evaluates the error rate percentage. Refer to Introduction to Grafana Alerting for more information about how alerts work in Grafana Cloud.

To set up the alert, create a Grafana alert rule that queries your Prometheus data source using a PromQL expression like this:

promql
100 * sum(rate(http_requests_total{service="api_server",status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="api_server"}[$__rate_interval]))

Configure the alert rule to trigger when the query result exceeds your threshold (for example, 5 for a 5% error rate). You also define how often Grafana checks this condition (for example, every 1m) and how long the condition must be true before alerting (for example, 2m to prevent alerts from brief spikes). Behind the scenes, Grafana reduces the time series to a single value and compares it against your threshold.
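
If you manage rules as code instead of in the UI, the same logic can also be expressed as a data-source-managed (Prometheus-style) rule. This is a sketch only: the group name, labels, annotations, and 2m pending period are illustrative, and a fixed range like [5m] replaces $__rate_interval, which isn't available in rule files:

yaml
groups:
  - name: api-server-alerts
    rules:
      - alert: HighErrorRateAPIServer
        # Error rate as a percentage of all requests, compared against a 5% threshold.
        expr: |
          100 * sum(rate(http_requests_total{service="api_server",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="api_server"}[5m])) > 5
        # The condition must hold for 2 minutes before the alert fires.
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: API server 5xx error rate is above 5%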

Finally, route the alert to a contact point (for example, “On-Call Team”) that sends notifications to Slack and PagerDuty when the threshold is breached. When the error rate drops back to normal, the alert automatically resolves and notifies the team.
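
As a rough sketch, a contact point like this can also be created through Grafana's alerting file provisioning. The field names below are assumptions based on typical provisioning files and the webhook URL is a placeholder, so check the provisioning reference for your Grafana version:

yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: On-Call Team
    receivers:
      - uid: oncall-slack
        type: slack
        settings:
          # Placeholder incoming-webhook URL; replace with your own.
          url: https://hooks.slack.com/services/REPLACE/ME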

For step-by-step instructions on creating alert rules, refer to Configure Grafana-managed alert rules.

Scenario: Alert on high error rate

Here’s an example scenario for an alert on a high error rate.

Alert name: “High Error Rate - API Server”

Triggered by a Prometheus query that evaluates every minute:

promql
sum(rate(http_requests_total{service="api_server",status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="api_server"}[$__rate_interval]))
> 0.05

At 8:47 PM, the alert fires because the error rate jumped from its normal 0.2% to 6.3%. The notification includes the current value (6.3%) and labels such as service="api_server" and env="production".

Investigation flow using different signals:

  1. Metrics: Shows error rate spiked at 8:45 PM.
  2. Logs: Query {service_name="api_server"} | detected_level="error" for 8:45-8:50 PM, which reveals “database connection timeout” errors.
  3. Traces: Search TraceQL {resource.service.name="api_server" && status=error}, which shows failed requests are hitting the /users endpoint.
  4. Profiles: CPU profiling for api_server during 8:45-8:50 PM, which shows the database connection pool is exhausted.

This lets you go from “error rate too high” to “the database connection pool is exhausted on the users endpoint” in minutes.

Confirm the alert

Before diving into investigation, verify the alert is real:

  1. Check the alert details for the affected service, metric, and threshold.
  2. Open the alerting metric in Drilldown > Metrics or Explore.
  3. Confirm: Is the metric actually elevated? When did it start? Is it still happening?

If the alert is a false positive or has already resolved, acknowledge it and document why.

Example: Confirm the alert using Metrics Drilldown

For the above scenario, here’s how you would confirm the alert using Metrics Drilldown.

  1. Start with the service filter. Navigate to Drilldown > Metrics and filter by service="api_server" to scope all metrics to that service.

  2. Examine request rates. Select the http_requests_total metric to see the overall request rate timeline showing the spike at 8:45 PM.

  3. Break down by status code. Group by the status label to split the visualization by status codes. You see 5xx errors (500, 502, 503) increased sharply while 2xx responses stayed flat or dropped.

  4. Break down by endpoint. Group further by the endpoint or path label to identify that the /users endpoint specifically shows the error spike, while other endpoints like /health and /metrics remain stable.

  5. Check specific instances. Break down by instance or pod to see if all api_server instances are affected or just specific ones.

This visual exploration confirms the alert and narrows the scope from “api_server has high errors” to “the /users endpoint on api_server is throwing 503 errors across all instances starting at 8:45 PM.” This points you toward a dependency issue like the database connection pool problem.
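
If you'd rather confirm the same thing in Explore, these PromQL sketches mirror steps 3 and 4 (run each query separately). The status and endpoint label names are assumptions from the scenario and may differ in your instrumentation:

promql
# Requests per second broken down by status code.
sum by (status) (rate(http_requests_total{service="api_server"}[$__rate_interval]))

# 5xx requests per second broken down by endpoint.
sum by (endpoint) (rate(http_requests_total{service="api_server",status=~"5.."}[$__rate_interval]))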

Try the workflow

Want to try the workflow yourself? Use the public demo environment on play.grafana.org or Grafana Assistant in your own Grafana Cloud instance.

Quick triage with Grafana Assistant

If you have Grafana Cloud with Grafana Assistant, you can triage quickly with natural language:

  1. Open Grafana Assistant (Ctrl+I or Cmd+I).

  2. Describe the alert:

    “Is the checkout service having errors right now?”

    “Show latency for api-server in the last hour”

    “What’s the CPU usage for frontend pods?”

Assistant queries the right data sources and helps you identify the issue type faster.

Practice on play.grafana.org

Use the public demo environment to practice alert investigation with Metrics Drilldown.

Note

The demo environment doesn’t have the scenario’s HTTP request metrics, but you can practice the investigation workflow using synthetic monitoring metrics.

  1. Open play.grafana.org and navigate to Drilldown > Metrics.
  2. Search for probe_success and select it.
  3. On the Breakdown tab, click Select on the probe label to see success rates for each probe location. (The PromQL sketch after this list shows the equivalent query.)
  4. Look for probe locations with lower success rates; these are candidates for investigation.
  5. Click Add to filters on a specific probe location to drill down further.
  6. Navigate to Drilldown > Logs to see logs from services in the environment. Use the Filter by label values dropdown or the Add label tab to filter logs by service or other labels.
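
If you want to run the same per-location breakdown as a query in Explore, here's a minimal PromQL sketch. probe_success and the probe label are standard synthetic monitoring series, but check the metric browser in the demo environment if yours differ:

promql
# Average success rate per probe location over the last hour.
avg by (probe) (avg_over_time(probe_success[1h]))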

Identify the issue type

Use what you observe to determine the issue type and the investigation workflow to follow.

What you observe | Issue type | Workflow
Error rate increased, 5xx responses, failures | Errors | Troubleshoot an error
Latency percentiles elevated, slow responses | Performance | Investigate slow performance
CPU/memory spike, resource exhaustion | Resource issue | Troubleshoot an error (check for runaway processes) or Investigate slow performance (check for load)
Already have a slow trace to investigate | Code bottleneck | Find slow code from a trace

Note

If you’re unsure which type of issue you’re dealing with, start with Troubleshoot an error. Errors are usually the fastest to confirm or rule out.

After investigation

After you’ve identified the root cause:

  1. Immediate: Address the symptom (scale resources, restart pods, block traffic).
  2. Short-term: Fix the trigger (add rate limiting, fix the bug, optimize the query).
  3. Long-term: Prevent recurrence (tune alert thresholds, update runbooks, improve instrumentation).