AWS / CloudWatch / OpenSearch

AWS OpenSearch Service monitoring focused on a selected domain. Compact fleet overview at the top helps you pick a domain; sections below show cluster health, resources, storage, search/indexing throughput, and throttling errors for the selected domain.


AWS OpenSearch Service Monitoring Dashboard for Grafana - CloudWatch Metrics

A Grafana dashboard for monitoring AWS OpenSearch Service (formerly Amazon Elasticsearch Service) domains using native CloudWatch metrics. Track cluster health (green/yellow/red), shard states, JVM memory pressure, garbage collection, search and indexing latency, thread-pool rejections, and snapshot failures - all without installing the OpenSearch Prometheus exporter.

Why monitor AWS OpenSearch Service?

OpenSearch failure modes are predictable: a node falls behind, replicas go unassigned (yellow), then a primary fails (red, writes blocked); or JVM memory pressure stays above 92% long enough that GC eats CPU and queries time out; or the write thread pool saturates, queues fill, and indexing rejections start. Each of these has clear leading indicators in CloudWatch - this dashboard puts them all on one screen so you can act before a domain goes red.

It works equally well for OpenSearch and the legacy Elasticsearch service since both publish to the same AWS/ES namespace.

Features

  • Fleet overview - domain count in the region, count of yellow domains, count of red domains, average cluster CPU, top 10 domains by storage and search rate
  • Cluster health - current status (green/yellow/red mapped to colored label), node count, active shards, unassigned shards
  • Shard states over time - active, unassigned, initializing, relocating, delayed-unassigned
  • Resources - CPU and JVM memory pressure (avg and max), JVM garbage collection counts and time (young and old gen)
  • Storage - free storage on the worst node, total cluster used space, storage utilization %, growth trends
  • Search and indexing - search rate, search latency (avg + p99), indexing rate, indexing latency (avg + p99)
  • Throttling and errors - thread-pool search/write rejections per second, search/write queue depth, HTTP response codes (2xx/3xx/4xx/5xx), automated snapshot failures

Key CloudWatch Metrics Used

All metrics are from the AWS/ES namespace, dimensioned by ClientId and DomainName (plus NodeId for per-node metrics).

ClusterStatus.green / yellow / red

Three binary metrics - exactly one is 1, the others are 0. The dashboard maps them to a single colored label panel.
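The mapping can be pictured with a minimal Python sketch. The function name `cluster_status_label` and the "unknown" fallback are assumptions made for illustration; they are not taken from the dashboard JSON.

```python
def cluster_status_label(green: float, yellow: float, red: float) -> str:
    """Collapse the three ClusterStatus.* values into one label.

    Red is checked first so the worst state wins if more than one
    metric happens to be non-zero within the same period.
    """
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    # None of the three metrics reported 1 (for example, missing data).
    return "unknown"
```

Checking red before yellow before green is a design choice for the sketch: if aggregation over a long period ever yields more than one non-zero value, the most severe status is the one shown.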

Nodes, Shards.active, Shards.unassigned, Shards.initializing, Shards.relocating

Cluster topology. Sustained Shards.unassigned > 0 usually means a node failed, the disk-low watermark was breached, or replicas have nowhere to land.

CPUUtilization, JVMMemoryPressure

The two resource metrics that actually predict outages. AWS recommends alarming JVM memory pressure at sustained >80%; above 92% means GC thrash is imminent.
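"Sustained" is the key word in those thresholds. As a rough sketch, the check below flags a series only when several consecutive datapoints exceed a threshold. The 80 and 92 values come from the text above; the default of three consecutive periods and both function names are assumptions for illustration.

```python
from typing import Sequence


def sustained_above(values: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True if `periods` consecutive datapoints exceed `threshold`."""
    run = 0
    for value in values:
        # Extend the current run, or reset it when a datapoint drops back.
        run = run + 1 if value > threshold else 0
        if run >= periods:
            return True
    return False


def jvm_pressure_state(values: Sequence[float],
                       warn: float = 80.0,
                       crit: float = 92.0,
                       periods: int = 3) -> str:
    """Classify a JVMMemoryPressure series as ok, warning, or critical."""
    if sustained_above(values, crit, periods):
        return "critical"
    if sustained_above(values, warn, periods):
        return "warning"
    return "ok"
```

A single spike above 80% resets to "ok" on the next lower datapoint, which keeps short bursts from raising a warning.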

JVMGCYoungCollectionCount, JVMGCYoungCollectionTime, JVMGCOldCollectionCount, JVMGCOldCollectionTime

Garbage collection counts and total time. Frequent old-gen GCs with rising time = heap pressure; scale up or add nodes.
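One way to read the count and time metrics together is the average pause per collection over a window. The sketch below assumes both metrics behave as cumulative counters, so it works on the difference between two samples; the function name and the millisecond unit are assumptions for illustration.

```python
from typing import Optional


def avg_gc_pause_ms(count_start: float, count_end: float,
                    time_start_ms: float, time_end_ms: float) -> Optional[float]:
    """Average time per collection between two samples.

    Assumes the count and time metrics are cumulative, so the window's
    activity is the difference between the end and start samples.
    """
    collections = count_end - count_start
    if collections <= 0:
        # No collections in the window (or a counter reset): nothing to average.
        return None
    return (time_end_ms - time_start_ms) / collections
```

A rising result for the old-gen pair across successive windows is the "rising time" signal described above.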

FreeStorageSpace, ClusterUsedSpace

Free space on the worst node (Min stat across nodes) vs total cluster used. OpenSearch blocks writes once a node runs critically low on disk, so alert well before then.

SearchRate, SearchLatency, IndexingRate, IndexingLatency

Throughput and latency for queries and ingestion. CloudWatch supports p99 natively on the latency metrics - the dashboard uses both avg and p99 to expose tail latency.
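To show what "avg and p99" looks like outside Grafana, here is a hedged sketch that builds the query structures accepted by the CloudWatch GetMetricData API. The function name, the query ids, and the placeholder account and domain values in the usage note are assumptions; the dashboard's own queries may be shaped differently.

```python
from typing import Dict, List


def latency_queries(client_id: str, domain: str,
                    metric: str = "SearchLatency",
                    period: int = 60) -> List[Dict]:
    """Build GetMetricData queries for the average and p99 of one metric."""
    dimensions = [
        {"Name": "ClientId", "Value": client_id},
        {"Name": "DomainName", "Value": domain},
    ]
    queries = []
    # One query per statistic; ids must start with a lowercase letter.
    for query_id, stat in (("avg", "Average"), ("p99", "p99")):
        queries.append({
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ES",
                    "MetricName": metric,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries
```

With boto3 installed, the result could be passed as `MetricDataQueries` to a CloudWatch client's `get_metric_data` call along with `StartTime` and `EndTime`, for example `latency_queries("123456789012", "my-domain")` where both arguments are placeholders.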

ThreadpoolSearchRejected, ThreadpoolWriteRejected, ThreadpoolSearchQueue, ThreadpoolWriteQueue

Rejections are only emitted when non-zero, so any signal here is a real problem. Expect the queue lines to lead the rejection lines: when queues fill up, rejections follow.

AutomatedSnapshotFailure

1 when the last automated snapshot failed. Any non-zero value needs immediate investigation; backups are your safety net.

2xx / 3xx / 4xx / 5xx

HTTP response code counts from the domain's REST API. Sudden 4xx surges often mean a client started sending bad queries; 5xx surges = the cluster is in trouble.
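Raw counts are easier to judge as a share of total traffic. The sketch below computes the 4xx and 5xx fractions from the four counters for one period; the function name and the returned dictionary shape are assumptions for illustration.

```python
from typing import Dict


def http_error_share(c2xx: float, c3xx: float,
                     c4xx: float, c5xx: float) -> Dict[str, float]:
    """Return the fraction of responses that were 4xx and 5xx."""
    total = c2xx + c3xx + c4xx + c5xx
    if total == 0:
        # No traffic in the period: report zero rather than divide by zero.
        return {"4xx": 0.0, "5xx": 0.0}
    return {"4xx": c4xx / total, "5xx": c5xx / total}
```

A fraction stays comparable across quiet and busy periods, which a raw count does not.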

Prerequisites

  • Grafana 10.0 or later
  • AWS CloudWatch datasource plugin configured in Grafana
  • IAM permissions on the role/user backing the datasource:
    • cloudwatch:GetMetricData
    • cloudwatch:ListMetrics
    • cloudwatch:GetMetricStatistics
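
As a starting point, the three permissions above could be expressed as an IAM policy document. The sketch below builds one in Python and prints it as JSON; the `Sid` value and the function name are hypothetical, and `"Resource": "*"` is an assumption you may want to review against your own security requirements.

```python
import json


def cloudwatch_read_policy() -> dict:
    """Build a minimal IAM policy granting the CloudWatch read actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrafanaCloudWatchRead",  # hypothetical statement id
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:GetMetricData",
                    "cloudwatch:ListMetrics",
                    "cloudwatch:GetMetricStatistics",
                ],
                "Resource": "*",  # assumption: not scoped to specific resources
            }
        ],
    }


# Print the policy as JSON, ready to paste into the IAM console or a template.
print(json.dumps(cloudwatch_read_policy(), indent=2))
```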

No OpenSearch domain access policy changes are required since metrics flow through CloudWatch.

Installation

  1. Download the dashboard JSON.
  2. In Grafana, go to Dashboards → New → Import.
  3. Paste the JSON or upload the file.
  4. When prompted, select your AWS CloudWatch datasource.
  5. Click Import.

Variables

  • Region - AWS region of your OpenSearch domain
  • Domain - auto-populated from CloudWatch with every domain that publishes the Nodes metric
  • Period - CloudWatch aggregation period (60s, 300s, or 3600s)
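
To check outside Grafana which domains the Domain variable should find, a minimal sketch is to list the Nodes metric and collect the DomainName dimension values. The helper below only parses a response in the shape returned by the CloudWatch ListMetrics API; the function name is an assumption, and this is not necessarily how the dashboard's variable query is written.

```python
from typing import Dict, List


def domains_from_list_metrics(response: Dict) -> List[str]:
    """Extract unique DomainName values from a ListMetrics-style response."""
    names = set()
    for metric in response.get("Metrics", []):
        for dimension in metric.get("Dimensions", []):
            if dimension.get("Name") == "DomainName":
                names.add(dimension["Value"])
    # Sorted so the output is stable from call to call.
    return sorted(names)
```

With boto3 installed, the input could come from a CloudWatch client's `list_metrics(Namespace="AWS/ES", MetricName="Nodes")` call; large accounts may need to follow `NextToken` to page through all results.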

Troubleshooting

Q: The Domains count says 0 even though I have domains.
A: The fleet overview counts only domains currently publishing the Nodes metric. Newly created or paused domains may not appear for a few minutes.

Q: Search/Write Rejected panels are always empty.
A: That's healthy - AWS only emits these metrics when non-zero. Empty means no rejections happened in the time range.

Q: Healthy/Unhealthy host panels don't match the console.
A: The dashboard sums HealthyHostCount / UnHealthyHostCount across (TargetGroup, AvailabilityZone) pairs because CloudWatch doesn't emit these without both dimensions. The total matches the console.

Q: Will this work with the older Elasticsearch service?
A: Yes - both legacy Amazon Elasticsearch Service and OpenSearch Service publish to the AWS/ES namespace with the same metric names.

Tags

aws opensearch elasticsearch search cloudwatch monitoring aws-monitoring opensearch-service aws-opensearch cluster-health jvm grafana-dashboard

