Alerting with the AWS Application Signals data source
Use Grafana Alerting with the AWS Application Signals data source to notify your team about fault rates, latency regressions, throttling, and error spikes before they become incidents.
The primary alerting target is the Trace Statistics query type, which returns per-bucket counts for errors, faults, throttled responses, successes, and totals, plus a computed Average Response Time. This page covers which query types can drive alerts, copy-paste-ready recipes for the four most common patterns (fault rate, error-and-fault count, throttling, latency), how to scope alerts to a linked AWS account, and the gotchas to watch for when you put rules into production.
Before you begin
- Configure the AWS Application Signals data source.
- Read the AWS Application Signals query editor documentation and confirm Trace Statistics queries return the data you expect.
- Understand the fundamentals of Grafana Alerting and how to create an alert rule.
Alerting-compatible query types
Not every query type returns data Grafana can alert on. Use this table to choose the right query type.
Build an alert rule on Trace Statistics
The standard pattern is:
- Write a Trace Statistics query that returns the metric you want to alert on.
- Add a Reduce or Math expression that collapses the series into a single number.
- Set a threshold condition.
- Configure evaluation, labels, and notifications.
To create an alert:
- Click Alerts & IRM > Alerting > Alert rules in the left-side menu.
- Click New alert rule.
- Under Define query and alert condition, select your AWS Application Signals data source.
- Set the query:
- Query Type: Trace Statistics
- Region: the region where the services run
- Query: a filter expression that scopes the population (for example, service("frontend"))
- Group: optional; attach a pre-defined X-Ray group
- Resolution: 60s or 300s (match to your evaluation interval)
- Columns: select only the columns you need (for example, Error Count, Success Count, Total Count)
- Add a Reduce expression that reduces the Trace Statistics series to last, mean, or sum.
- Add a Threshold expression and set your condition.
- Configure Folder, Evaluation group and interval, and Labels and notifications.
- Click Save rule and exit.
Note
Trace Statistics buckets are aligned to the selected Resolution. Pick a Resolution that divides evenly into your alert evaluation interval so you don’t alert on partially-filled buckets.
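The alignment rule in this note can be checked with simple arithmetic. The following is a minimal Python sketch, not part of Grafana or the plugin; the function name resolution_fits is hypothetical and both durations are assumed to be whole seconds.

```python
def resolution_fits(eval_interval_s: int, resolution_s: int) -> bool:
    """Return True when whole Resolution buckets tile the evaluation interval.

    Hypothetical helper: both arguments are durations in seconds.
    """
    if eval_interval_s <= 0 or resolution_s <= 0:
        raise ValueError("durations must be positive")
    # A remainder means the newest bucket is only partially filled
    # at evaluation time.
    return eval_interval_s % resolution_s == 0


if __name__ == "__main__":
    print(resolution_fits(300, 60))  # 5m interval, 60s buckets
    print(resolution_fits(90, 60))   # 90s interval, 60s buckets
```

Under this check, a 300s Resolution does not fit a 1m evaluation interval, because a single bucket spans several evaluations.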
Example: fault rate alert on a service
Alert when more than 1% of requests to the frontend service fault over the last five minutes.
A Trace Statistics query returns one series per selected column, so a Reduce expression against a multi-column query collapses every column at once and can’t pick out fault vs. total. The reliable pattern is two single-column queries that each reduce to one labeled value, then a Math expression that divides them.
Query A — fault count
Query B — total count
Expression C — reduce fault count
Expression D — reduce total count
Expression E — fault ratio with divide-by-zero guard
Grafana Math expressions don’t support if/else or the ternary operator (?:), so the guard uses the boolean-multiplication trick: relational operators return 1 for true and 0 for false, which lets you write the ratio as $C / ($D + ($D == 0)). When $D == 0, the divisor becomes 0 + 1 = 1 and the result is $C / 1. Because $C (fault count) can’t exceed $D (total count), $C is also 0 in that window, so the expression evaluates to 0 instead of NaN. When $D is non-zero, the ($D == 0) term is 0 and the expression reduces to $C / $D.
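To see the guard's arithmetic outside Grafana, here is a minimal Python sketch. It assumes the Math expression has the form $C / ($D + ($D == 0)), as the description above implies; the function name is hypothetical. Python booleans behave as 1 and 0 in arithmetic, which mirrors how the relational operator is used.

```python
def guarded_fault_ratio(fault_count: float, total_count: float) -> float:
    """Fault ratio that evaluates to 0 instead of dividing by zero.

    (total_count == 0) is True (1) only in an empty window, so the divisor
    becomes 1 there; otherwise the term adds 0 and the divisor is unchanged.
    """
    return fault_count / (total_count + (total_count == 0))


if __name__ == "__main__":
    print(guarded_fault_ratio(0, 0))    # quiet window, no traffic
    print(guarded_fault_ratio(5, 500))  # 5 faults out of 500 requests
```

The sketch relies on the same precondition as the expression: the fault count is 0 whenever the total count is 0.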
Expression F — threshold
Set the Evaluation interval to 1m and the Pending period to 5m so the rule fires after five consecutive minutes above the 1% threshold. Without the divide-by-zero guard, quiet traffic windows can put the rule into Error state. Refer to Fault-rate alert fires when there’s no traffic for background.
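The interaction between the evaluation interval and the pending period can be reasoned about with a simplified model. The Python sketch below is an illustration only and is not Grafana's state machine. It assumes a rule becomes pending at the first breaching evaluation and fires once the breach has lasted at least the pending period. The function name, the 0.01 threshold, and the sample data are hypothetical.

```python
def first_firing_time(samples, threshold=0.01, pending_s=300):
    """Return the timestamp at which the simplified rule would fire, or None.

    samples: iterable of (timestamp_seconds, ratio) in evaluation order.
    """
    breach_start = None
    for ts, ratio in samples:
        if ratio > threshold:
            if breach_start is None:
                breach_start = ts  # rule enters the pending state
            if ts - breach_start >= pending_s:
                return ts          # breach has lasted the full pending period
        else:
            breach_start = None    # any healthy evaluation resets the clock
    return None


if __name__ == "__main__":
    # One evaluation per minute, every one of them at a 2% fault ratio.
    steady = [(i * 60, 0.02) for i in range(10)]
    print(first_firing_time(steady))
```

In this model a single healthy evaluation resets the clock, which is why short bursts never reach the firing state.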
Example: error-and-fault count alert
Alert when the combined error and fault count for the checkout-api service exceeds 50 over a five-minute window. Because Trace Statistics doesn’t expose a combined error+fault column, sum the two columns in a Math expression.
Query A — error count
Query B — fault count
Expression C — reduce error count
| Input | A |
| Function | sum |
| Mode | Replace non-numeric values with zero |
Expression D — reduce fault count
| Input | B |
| Function | sum |
| Mode | Replace non-numeric values with zero |
Expression E — sum error + fault
| Type | Math |
| Expression | $C + $D |
Expression F — threshold
| Input | E |
| Condition | IS ABOVE 50 |
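The reduce, sum, and threshold chain above can be traced by hand with a short Python sketch. It assumes that "Replace non-numeric values with zero" treats null, NaN, and infinite samples as 0. The function names and the sample series are hypothetical.

```python
import math


def reduce_sum_replace_non_numeric(values, replacement=0.0):
    """Sum a series after swapping None, NaN, and +/-Inf for `replacement`."""
    total = 0.0
    for v in values:
        if v is None or math.isnan(v) or math.isinf(v):
            v = replacement
        total += v
    return total


def combined_count_breaches(error_series, fault_series, limit=50):
    """Mirror expressions C, D, E, and F: reduce both series, add, compare."""
    c = reduce_sum_replace_non_numeric(error_series)  # Expression C
    d = reduce_sum_replace_non_numeric(fault_series)  # Expression D
    e = c + d                                         # Expression E
    return e > limit                                  # Expression F


if __name__ == "__main__":
    errors = [10, None, 15]           # one empty bucket
    faults = [20, float("nan"), 8]    # one NaN bucket
    print(combined_count_breaches(errors, faults))
```

Replacing empty buckets with zero keeps the sum defined when one of the two columns has no data for part of the window.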
Example: throttling alert
Alert when throttled responses exceed 10 per minute on the payments service. A sustained rise in throttles usually means a downstream quota (for example, DynamoDB or a third-party API) is saturated.
Query A — Trace Statistics
Expression B — reduce
| Input | A |
| Function | last |
| Mode | Replace non-numeric values with zero |
Expression C — threshold
| Input | B |
| Condition | IS ABOVE 10 |
Throttling alerts are often paired with a secondary condition that requires traffic. For example, only fire when Total Count is also above some floor, so a dormant service doesn’t keep firing on a stale throttle burst.
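A traffic floor amounts to a logical AND between two comparisons. The Python sketch below shows the combined condition; the function name and the floor of 100 requests are hypothetical, and only the limit of 10 throttles comes from the example above.

```python
def throttle_alert(throttled_last: float, total_last: float,
                   throttle_limit: float = 10, traffic_floor: float = 100) -> int:
    """Return 1 when the rule should fire and 0 otherwise.

    Both conditions must hold: throttles above the limit AND enough traffic
    for the throttle count to be meaningful. traffic_floor is an assumed value.
    """
    return int(throttled_last > throttle_limit and total_last > traffic_floor)


if __name__ == "__main__":
    print(throttle_alert(25, 4000))  # busy service, many throttles
    print(throttle_alert(25, 30))    # near-dormant service
```

The 1/0 return value matches how a threshold condition is usually treated: non-zero means the rule is breaching.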
Example: latency regression on average response time
Alert when the search-api service’s average response time exceeds 500 ms over a five-minute window.
Query A — Trace Statistics
Expression B — reduce
| Input | A |
| Function | mean |
| Mode | Drop non-numeric values |
Expression C — threshold
| Input | B |
| Condition | IS ABOVE 0.5 |
Note
Average Response Time is returned in seconds, so use 0.5 for a 500 ms threshold, not 500. For short-duration regressions, lower the Pending period; for noisy traffic, use Mean of last 5 in a secondary Reduce stage to smooth spikes.
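The unit conversion and the smoothing step in this note can be sketched in a few lines of Python. The helper names are hypothetical. The second function is one possible reading of "mean of last 5": the arithmetic mean of the five most recent samples.

```python
def ms_to_seconds(ms: float) -> float:
    """Convert a latency budget in milliseconds to threshold units (seconds)."""
    return ms / 1000.0


def mean_of_last(values, n=5):
    """Arithmetic mean of the n most recent samples (fewer if the series is short)."""
    tail = list(values)[-n:]
    if not tail:
        raise ValueError("series is empty")
    return sum(tail) / len(tail)


if __name__ == "__main__":
    threshold = ms_to_seconds(500)
    # Hypothetical per-bucket average response times, in seconds.
    series = [0.2, 0.3, 0.4, 0.6, 0.7, 0.8]
    print(mean_of_last(series) > threshold)
```

Averaging the last few buckets trades detection speed for stability, because one slow bucket no longer crosses the threshold by itself.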
Example: group-driven alert
When you maintain a shared X-Ray group such as critical-paths in AWS, you can alert on that group’s traffic without hard-coding a filter expression in Grafana. Changing the group’s filter expression in AWS automatically changes the population the alert evaluates on.
Query A — fault count in the group
Query B — total count in the group
Then reduce each query and divide them exactly as in the fault rate example.
Example: cross-account fault rate
When cross-account observability is configured, scope an alert to a specific linked account with the id(account.id: "...") selector.
Query A — fault count in a linked account
Query B — total count in a linked account
Then reduce each series and divide them exactly as in the fault rate example. To get one alert rule per linked account without duplicating rules, add labels such as account_id="123456789012" to the rule and route through a notification policy that matches on account_id.
Use template variables in alert queries
Grafana Alerting supports limited variable interpolation. To parameterize alerts:
- Use Constant or Text box dashboard variables, not query-driven variables, because alert rules are evaluated without a dashboard context.
- Prefer hard-coded service or account IDs in the alert filter expression when you need deterministic evaluation.
- If you must parameterize an alert across multiple services, create one alert rule per service and use consistent label naming so a single notification policy can route them.
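The one-rule-per-service approach from the list above can be scripted so that titles, filters, and labels stay consistent. The Python sketch below only builds plain dictionaries. The field names are hypothetical and do not follow any Grafana provisioning schema, so you would need to map them onto whatever tooling you use to create rules.

```python
def build_rule_specs(services, team):
    """Return one hypothetical rule description per service.

    Every spec uses the same label keys so a single notification policy
    can match on them.
    """
    specs = []
    for svc in services:
        specs.append({
            "title": f"{svc} fault rate",
            "filter": f'service("{svc}")',  # hard-coded filter per rule
            "labels": {"team": team, "service": svc},
        })
    return specs


if __name__ == "__main__":
    for spec in build_rule_specs(["frontend", "checkout-api"], team="web"):
        print(spec["title"], spec["filter"], spec["labels"])
```

Hard-coding the filter in each generated rule keeps evaluation deterministic, which is what the guidance above recommends.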
Alerting best practices
- Alert on rates, not raw counts, wherever possible. Raw counts scale with traffic volume and cause noisy alerts.
- Choose resolution carefully. Higher resolution (60s) gives faster detection but increases AWS X-Ray API calls. Use 300s for long-horizon alerts.
- Use groups to stabilize filters. An X-Ray group lets you share a filter expression across multiple alerts and dashboards. If the group definition changes in AWS, all alerts update automatically.
- Tune the pending period. A short pending period catches transient issues; a longer pending period avoids noise from brief blips.
- Label alerts with ownership metadata. Add team and service labels so notification policies can route to the right channel.
- Mind X-Ray API throttling. If you create many frequent alert rules, you can exceed X-Ray API limits. Consolidate rules or raise Resolution to reduce call frequency.
Caution
Trace Statistics data can arrive late. Traces are typically indexed within a minute, but AWS may take longer during high-volume periods. Set the alert For duration (pending period) to at least two evaluation intervals to avoid firing on a temporarily empty bucket.
Alerting on Application Signals SLOs
The plugin’s List Service Level Objectives (SLO) query returns SLO metadata — name, operation, creation time, key attributes — not a numeric attainment or burn-rate value, so you can’t alert on it directly in Grafana.
For production SLO alerting, use one of these options instead:
- Native CloudWatch alarms on the SLO metrics Application Signals publishes. Configure the alarms in AWS (in the Application Signals or CloudWatch console) and let them fire into your existing incident pipeline. This is the tightest integration with the AWS SLO dashboards.
- Grafana alerts on the CloudWatch SLO metrics. Query the same metrics through the CloudWatch data source and build Grafana-managed alert rules on them — useful when you want SLO alerts to route through the same notification policies as the rest of your Grafana alerting stack.
- Burn-rate proxies from Trace Statistics. If you can’t query the CloudWatch SLO metrics, approximate the same signal with a Trace Statistics fault-rate alert on the service + operation the SLO covers. This won’t match AWS’s SLO evaluation exactly but catches the same underlying failures.


