Last reviewed: April 16, 2026

Alerting with the AWS Application Signals data source

Use Grafana Alerting with the AWS Application Signals data source to notify your team about fault rates, latency regressions, throttling, and error spikes before they become incidents.

The primary alerting target is the Trace Statistics query type, which returns per-bucket counts for errors, faults, throttled responses, successes, and totals, plus a computed Average Response Time. This page covers which query types can drive alerts, copy-paste-ready recipes for the four most common patterns (fault rate, error-and-fault count, throttling, latency), how to scope alerts to a linked AWS account, and the gotchas to watch for when you put rules into production.

Before you begin

Alerting-compatible query types

Not every query type returns data Grafana can alert on. Use this table to choose the right query type.

| Query type | Returns numeric time series | Supports alerting | Notes |
| --- | --- | --- | --- |
| Trace Statistics | Yes | Yes | Primary alerting target. Returns Throttle Count, Error Count, Fault Count, Success Count, Total Count, and Average Response Time per time bucket. |
| Trace List | No | No | Returns a trace table, not numeric series. Use a Trace Statistics query with the same filter expression instead. |
| Trace Analytics | No | No | Returns root-cause summary tables. Use a Trace Statistics query if you want to alert on the same trace population. |
| Insights | No | No | Returns an insight summary table. Insights are correlated anomalies, not metrics, so there’s no equivalent Trace Statistics conversion. To be notified when X-Ray detects new insights, configure an X-Ray insights notification in AWS instead. |
| Service Map | No | No | Returns a graph visualization. |
| List services / operations / dependencies / SLOs | No | No | These queries return CloudWatch metric references (metric name, namespace, dimensions), not metric values. To alert on the underlying numbers, use the CloudWatch data source to query each metric reference, or set up native CloudWatch alarms on the Application Signals metrics directly. |

Build an alert rule on Trace Statistics

The standard pattern is:

  1. Write a Trace Statistics query that returns the metric you want to alert on.
  2. Add a Reduce or Math expression that collapses the series into a single number.
  3. Set a threshold condition.
  4. Configure evaluation, labels, and notifications.

To create an alert:

  1. Click Alerts & IRM > Alerting > Alert rules in the left-side menu.
  2. Click New alert rule.
  3. Under Define query and alert condition, select your AWS Application Signals data source.
  4. Set the query:
    • Query Type: Trace Statistics
    • Region: the region where the services run
    • Query: a filter expression that scopes the population (for example, service("frontend"))
    • Group: optional; attach a pre-defined X-Ray group
    • Resolution: 60s or 300s (match to your evaluation interval)
    • Columns: select only the columns you need (for example, Error Count, Success Count, Total Count)
  5. Add a Reduce expression that reduces the Trace Statistics series to last, mean, or sum.
  6. Add a Threshold expression and set your condition.
  7. Configure Folder, Evaluation group and interval, and Labels and notifications.
  8. Click Save rule and exit.

Note

Trace Statistics buckets are aligned to the selected Resolution. Pick a Resolution that divides evenly into your alert evaluation interval so you don’t alert on partially-filled buckets.
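
The alignment rule in this note can be checked with a few lines of Python. This is a minimal sketch: the function name `resolution_fits_interval` is hypothetical and not part of Grafana or the plugin, and both values are assumed to be expressed in seconds.

```python
def resolution_fits_interval(resolution_s: int, evaluation_interval_s: int) -> bool:
    """Return True when the evaluation interval is a whole multiple of the resolution."""
    if resolution_s <= 0 or evaluation_interval_s <= 0:
        raise ValueError("both values must be positive numbers of seconds")
    # A remainder of zero means every evaluation lines up with a bucket boundary.
    return evaluation_interval_s % resolution_s == 0


# 60s buckets with a 5m evaluation interval line up cleanly.
print(resolution_fits_interval(60, 300))
# 300s buckets with a 1m evaluation interval do not.
print(resolution_fits_interval(300, 60))
```

A check like this can run in a provisioning script before rules are created, so a mismatched Resolution is caught before it produces alerts on partially filled buckets.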

Example: fault rate alert on a service

Alert when more than 1% of requests to the frontend service fault over the last five minutes.

A Trace Statistics query returns one series per selected column, so a Reduce expression against a multi-column query collapses every column at once and can’t pick out fault vs. total. The reliable pattern is two single-column queries that each reduce to one labeled value, then a Math expression that divides them.

Query A — fault count

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Region | us-east-1 |
| Query | service("frontend") |
| Resolution | 60s |
| Columns | Fault Count |

Query B — total count

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Region | us-east-1 |
| Query | service("frontend") |
| Resolution | 60s |
| Columns | Total Count |

Expression C — reduce fault count

| Field | Value |
| --- | --- |
| Input | A |
| Function | sum |
| Mode | Replace non-numeric values with zero |

Expression D — reduce total count

| Field | Value |
| --- | --- |
| Input | B |
| Function | sum |
| Mode | Replace non-numeric values with zero |

Expression E — fault ratio with divide-by-zero guard

| Field | Value |
| --- | --- |
| Type | Math |
| Expression | $C / ($D + ($D == 0)) |

Grafana Math expressions don’t support if/else or the ternary operator (?:), so the guard relies on boolean arithmetic: relational operators return 1 for true and 0 for false, and that result is added to the divisor. When $D == 0, the divisor becomes 0 + 1 = 1 and the result is $C / 1. Because $C (fault count) can’t exceed $D (total count), $C is also 0 in that window, so the expression evaluates to 0 instead of NaN. When $D is non-zero, the ($D == 0) term is 0 and the expression reduces to $C / $D.
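
The same guard can be reproduced outside Grafana to see how it behaves. The following Python sketch is only an emulation of the Math expression, not Grafana code; `guarded_fault_ratio` is a hypothetical name. It works because Python, like the Math expression, treats a true comparison as 1 and a false one as 0 in arithmetic.

```python
def guarded_fault_ratio(fault_count: float, total_count: float) -> float:
    """Emulate $C / ($D + ($D == 0)) with plain Python arithmetic."""
    # (total_count == 0) is a bool; in arithmetic True counts as 1 and False as 0,
    # so the divisor becomes 1 only when there was no traffic.
    return fault_count / (total_count + (total_count == 0))


# No traffic: the divisor becomes 1 and the ratio is 0 instead of a division error.
print(guarded_fault_ratio(0, 0))
# Normal traffic: the guard term is 0 and the result is the plain ratio.
print(guarded_fault_ratio(3, 200))
```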

Expression F — threshold

| Field | Value |
| --- | --- |
| Input | E |
| Condition | IS ABOVE 0.01 |

Set the Evaluation interval to 1m and the Pending period to 5m so the rule fires after five consecutive minutes above the 1% threshold. Without the divide-by-zero guard, quiet traffic windows can put the rule into Error state. Refer to Fault-rate alert fires when there’s no traffic for background.

Example: error-and-fault count alert

Alert when the combined error and fault count for the checkout-api service exceeds 50 over a five-minute window. Because Trace Statistics doesn’t expose a combined error+fault column, sum the two columns in a Math expression.

Query A — error count

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Query | service("checkout-api") |
| Resolution | 60s |
| Columns | Error Count |

Query B — fault count

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Query | service("checkout-api") |
| Resolution | 60s |
| Columns | Fault Count |

Expression C — reduce error count

| Field | Value |
| --- | --- |
| Input | A |
| Function | sum |
| Mode | Replace non-numeric values with zero |

Expression D — reduce fault count

| Field | Value |
| --- | --- |
| Input | B |
| Function | sum |
| Mode | Replace non-numeric values with zero |

Expression E — sum error + fault

| Field | Value |
| --- | --- |
| Type | Math |
| Expression | $C + $D |

Expression F — threshold

| Field | Value |
| --- | --- |
| Input | E |
| Condition | IS ABOVE 50 |
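
To see what the reduce, sum, and threshold stages of this example do to the data, the pipeline can be emulated in Python. This is a sketch under assumptions: the bucket values are made up, and `reduce_sum_replace_non_numeric` is a hypothetical helper that approximates the sum function with the "Replace non-numeric values with zero" mode by ignoring missing, NaN, and infinite values.

```python
import math
from typing import Optional, Sequence


def reduce_sum_replace_non_numeric(values: Sequence[Optional[float]]) -> float:
    """Sum a series, counting None, NaN, and infinite values as zero."""
    total = 0.0
    for value in values:
        if value is None or not math.isfinite(value):
            continue  # replacing with zero is the same as skipping in a sum
        total += value
    return total


# Made-up per-minute buckets for a five-minute window.
error_counts = [12, None, 9, 15, float("nan")]
fault_counts = [4, 6, None, 8, 2]

# Expression E: sum of the two reduced values.
combined = reduce_sum_replace_non_numeric(error_counts) + reduce_sum_replace_non_numeric(fault_counts)
# Expression F: IS ABOVE 50.
is_firing = combined > 50
print(combined, is_firing)
```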

Example: throttling alert

Alert when throttled responses exceed 10 per minute on the payments service. A sustained rise in throttles usually means a downstream quota (for example, DynamoDB or a third-party API) is saturated.

Query A — Trace Statistics

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Query | service("payments") |
| Resolution | 60s |
| Columns | Throttle Count |

Expression B — reduce

| Field | Value |
| --- | --- |
| Input | A |
| Function | last |
| Mode | Replace non-numeric values with zero |

Expression C — threshold

| Field | Value |
| --- | --- |
| Input | B |
| Condition | IS ABOVE 10 |

Throttling alerts are often paired with a secondary condition that requires traffic. For example, fire only when Total Count is also above some floor, so that a mostly dormant service doesn’t fire on a stale throttle burst.
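
The paired condition amounts to a logical AND of two thresholds. The Python sketch below illustrates the logic only; the function name and the `traffic_floor` default of 100 requests are assumptions chosen for the example, not values recommended by Grafana or AWS.

```python
def throttle_alert_should_fire(
    throttle_count: float,
    total_count: float,
    throttle_limit: float = 10,   # threshold from the example above
    traffic_floor: float = 100,   # hypothetical minimum traffic, tune per service
) -> bool:
    """Fire only when throttles are high AND the service is seeing real traffic."""
    return throttle_count > throttle_limit and total_count > traffic_floor


# Busy service with many throttles: fires.
print(throttle_alert_should_fire(throttle_count=25, total_count=4000))
# Nearly idle service with a small burst of throttles: stays quiet.
print(throttle_alert_should_fire(throttle_count=25, total_count=30))
```

In an alert rule, the equivalent would be a second Trace Statistics query for Total Count, reduced the same way, combined with the throttle condition in a Math expression.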

Example: latency regression on average response time

Alert when the search-api service’s average response time exceeds 500 ms over a five-minute window.

Query A — Trace Statistics

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Query | service("search-api") |
| Resolution | 60s |
| Columns | Average Response Time |

Expression B — reduce

| Field | Value |
| --- | --- |
| Input | A |
| Function | mean |
| Mode | Drop non-numeric values |

Expression C — threshold

| Field | Value |
| --- | --- |
| Input | B |
| Condition | IS ABOVE 0.5 |

Note

Average Response Time is returned in seconds, so use 0.5 for a 500 ms threshold, not 500. For short-duration regressions, lower the Pending period; for noisy traffic, use Mean of last 5 in a secondary Reduce stage to smooth spikes.
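
The unit conversion and the mean reduction can be sanity-checked with a short Python sketch. The sample values are made up, and `reduce_mean_drop_non_numeric` is a hypothetical helper that approximates the mean function with the "Drop non-numeric values" mode; returning None for an empty result is a choice made for this sketch, not a statement about what Grafana does.

```python
import math
from statistics import fmean
from typing import Optional, Sequence


def ms_to_seconds(milliseconds: float) -> float:
    """Convert a threshold written in milliseconds to the seconds the query returns."""
    return milliseconds / 1000.0


def reduce_mean_drop_non_numeric(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average only the finite values; return None when nothing numeric remains."""
    numeric = [v for v in values if v is not None and math.isfinite(v)]
    return fmean(numeric) if numeric else None


# Made-up Average Response Time buckets, in seconds, with one empty bucket.
samples = [0.42, 0.61, None, 0.58, 0.47]

threshold = ms_to_seconds(500)
mean_latency = reduce_mean_drop_non_numeric(samples)
is_firing = mean_latency is not None and mean_latency > threshold
print(threshold, mean_latency, is_firing)
```

Dropping the empty bucket matters here: replacing it with zero would pull the mean down and could hide a real regression.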

Example: group-driven alert

When you maintain a shared X-Ray group such as critical-paths in AWS, you can alert on that group’s traffic without hard-coding a filter expression in Grafana. Changing the group’s filter expression in AWS automatically changes the population the alert evaluates on.

Query A — fault count in the group

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Region | us-east-1 |
| Query | (leave empty) |
| Group | critical-paths |
| Resolution | 60s |
| Columns | Fault Count |

Query B — total count in the group

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Region | us-east-1 |
| Query | (leave empty) |
| Group | critical-paths |
| Resolution | 60s |
| Columns | Total Count |

Then reduce each query and divide them exactly as in the fault rate example.

Example: cross-account fault rate

When cross-account observability is configured, scope an alert to a specific linked account with the id(account.id: "...") selector.

Query A — fault count in a linked account

| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Region | us-east-1 |
| Query | service(id(account.id: "123456789012")) |
| Resolution | 60s |
| Columns | Fault Count |

Query B — total count in a linked account

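| Field | Value |
| --- | --- |
| Query Type | Trace Statistics |
| Region | us-east-1 |
| Query | service(id(account.id: "123456789012")) |
| Resolution | 60s |
| Columns | Total Count |

Both queries must use the same filter expression so that the fault count and the total count describe the same trace population.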

Then reduce each series and divide them exactly as in the fault rate example. To route each linked account’s alerts separately, add a label such as account_id="123456789012" to the rule and use a notification policy that matches on account_id.
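
When several linked accounts need the same rule, the filter expression and the routing label can be generated from the account ID so the two never drift apart. This Python sketch only builds strings and dictionaries; `account_filter` and `account_rule_labels` are hypothetical helpers, the `team` label is an assumed example, and nothing here calls a Grafana API.

```python
from typing import Dict


def account_filter(account_id: str) -> str:
    """Build the linked-account filter expression used in this section."""
    # AWS account IDs are 12 ASCII digits; reject anything else early.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    return f'service(id(account.id: "{account_id}"))'


def account_rule_labels(account_id: str, team: str) -> Dict[str, str]:
    """Labels a notification policy could match on (label names are assumptions)."""
    return {"account_id": account_id, "team": team}


for account in ["123456789012", "210987654321"]:
    print(account_filter(account), account_rule_labels(account, team="platform"))
```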

Use template variables in alert queries

Grafana Alerting supports limited variable interpolation. To parameterize alerts:

  • Use Constant or Text box dashboard variables, not query-driven variables, because alert rules are evaluated without a dashboard context.
  • Prefer hard-coded service or account IDs in the alert filter expression when you need deterministic evaluation.
  • If you must parameterize an alert across multiple services, create one alert rule per service and use consistent label naming so a single notification policy can route them.

Alerting best practices

  • Alert on rates, not raw counts, wherever possible. Raw counts scale with traffic volume and cause noisy alerts.
  • Choose resolution carefully. Higher resolution (60s) gives faster detection but increases AWS X-Ray API calls. Use 300s for long-horizon alerts.
  • Use groups to stabilize filters. An X-Ray group lets you share a filter expression across multiple alerts and dashboards. If the group definition changes in AWS, all alerts update automatically.
  • Tune the pending period. A short pending period catches transient issues; a longer pending period avoids noise from brief blips.
  • Label alerts with ownership metadata. Add team and service labels so notification policies can route to the right channel.
  • Mind X-Ray API throttling. If you create many frequent alert rules, you can exceed X-Ray API limits. Consolidate rules, lengthen the evaluation interval, or use a coarser Resolution (for example, 300s instead of 60s) to reduce call frequency.
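
To get a rough sense of the load a set of rules generates, count query evaluations per hour. The sketch below is an estimate under a stated assumption: each query evaluation triggers at least one X-Ray API request, so the true number of requests may be higher depending on pagination and Resolution. The function name is hypothetical.

```python
def query_evaluations_per_hour(
    rule_count: int,
    queries_per_rule: int,
    evaluation_interval_s: int,
) -> float:
    """Lower-bound estimate of data source queries issued per hour."""
    if evaluation_interval_s <= 0:
        raise ValueError("evaluation interval must be a positive number of seconds")
    evaluations_per_hour = 3600 / evaluation_interval_s
    return rule_count * queries_per_rule * evaluations_per_hour


# 40 two-query fault-rate rules evaluated every minute.
print(query_evaluations_per_hour(40, 2, 60))
# The same rules evaluated every five minutes.
print(query_evaluations_per_hour(40, 2, 300))
```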

Caution

Trace Statistics data can arrive late. Traces are typically indexed within a minute, but AWS may take longer during high-volume periods. Set the alert For duration (pending period) to at least two evaluation intervals to avoid firing on a temporarily empty bucket.

Alerting on Application Signals SLOs

The plugin’s List Service Level Objectives (SLO) query returns SLO metadata — name, operation, creation time, key attributes — not a numeric attainment or burn-rate value, so you can’t alert on it directly in Grafana.

For production SLO alerting, use one of these options instead:

  • Native CloudWatch alarms on the SLO metrics Application Signals publishes. Configure the alarms in AWS (in the Application Signals or CloudWatch console) and let them fire into your existing incident pipeline. This is the tightest integration with the AWS SLO dashboards.
  • Grafana alerts on the CloudWatch SLO metrics. Query the same metrics through the CloudWatch data source and build Grafana-managed alert rules on them — useful when you want SLO alerts to route through the same notification policies as the rest of your Grafana alerting stack.
  • Burn-rate proxies from Trace Statistics. If you can’t query the CloudWatch SLO metrics, approximate the same signal with a Trace Statistics fault-rate alert on the service + operation the SLO covers. This won’t match AWS’s SLO evaluation exactly but catches the same underlying failures.
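
For the Trace Statistics proxy, one common way to turn a fault rate into a burn-rate figure is to divide it by the error budget (1 minus the SLO target). The Python sketch below uses that general definition as an assumption; it is not how AWS evaluates Application Signals SLOs, and the 99.9% target is only an example.

```python
def burn_rate(fault_rate: float, slo_target: float) -> float:
    """Observed fault rate divided by the error budget implied by the SLO target."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1 - slo_target
    return fault_rate / error_budget


# A 0.5% fault rate against a 99.9% target consumes budget about five times too fast.
print(burn_rate(fault_rate=0.005, slo_target=0.999))
```

A value near 1 means the budget is being consumed at exactly the sustainable pace, so a proxy alert would typically threshold on a multiple above 1.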

Next steps