AWS / CloudWatch / OpenSearch

AWS OpenSearch Service monitoring focused on a selected domain. A compact fleet overview at the top helps you pick a domain; the sections below show cluster health, resources, storage, search/indexing throughput, and throttling errors for that domain.

AWS OpenSearch Service Monitoring Dashboard for Grafana - CloudWatch Metrics

A Grafana dashboard for monitoring AWS OpenSearch Service (formerly Amazon Elasticsearch Service) domains using native CloudWatch metrics. Track cluster health (green/yellow/red), shard states, JVM memory pressure, garbage collection, search and indexing latency, thread-pool rejections, and snapshot failures - all without installing the OpenSearch Prometheus exporter.

Why monitor AWS OpenSearch Service?

OpenSearch failure modes are predictable: a node falls behind, replicas go unassigned (yellow), then a primary fails (red, writes blocked); or JVM memory pressure stays above 92% long enough that GC eats CPU and queries time out; or the write thread pool saturates, queues fill, and indexing rejections start. Each of these has clear leading indicators in CloudWatch - this dashboard puts them all on one screen so you can act before a domain goes red.

It works equally well for OpenSearch and the legacy Elasticsearch service since both publish to the same AWS/ES namespace.

Features

Fleet overview - domain count in the region, count of yellow domains, count of red domains, average cluster CPU, top 10 domains by storage and search rate
Cluster health - current status (green/yellow/red mapped to colored label), node count, active shards, unassigned shards
Shard states over time - active, unassigned, initializing, relocating, delayed-unassigned
Resources - CPU and JVM memory pressure (avg and max), JVM garbage collection counts and time (young and old gen)
Storage - free storage on the worst node, total cluster used space, storage utilization %, growth trends
Search and indexing - search rate, search latency (avg + p99), indexing rate, indexing latency (avg + p99)
Throttling and errors - thread-pool search/write rejections per second, search/write queue depth, HTTP response codes (2xx/3xx/4xx/5xx), automated snapshot failures

Key CloudWatch Metrics Used

All metrics are from the AWS/ES namespace, dimensioned by ClientId and DomainName (and NodeId or TargetGroup where applicable).

ClusterStatus.green / yellow / red

Three binary metrics - exactly one is 1, the others are 0. The dashboard maps them to a single colored label panel.
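As a rough illustration of that mapping, the sketch below collapses the three flags into one label. The function name `cluster_status` and the "unknown" fallback for missing data are assumptions made for this example; the dashboard itself does the mapping with Grafana value mappings, not Python.

```python
def cluster_status(green: float, yellow: float, red: float) -> str:
    """Collapse the three ClusterStatus.* flags into a single label.

    Red is checked first so the worst state wins if more than one flag
    is set (for example when datapoints from adjacent periods overlap).
    "unknown" covers the case where no flag is set at all.
    """
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    return "unknown"
```

Checking red before yellow before green is a deliberate choice: if the inputs ever disagree, the label errs on the side of the more serious state.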

Nodes, Shards.active, Shards.unassigned, Shards.initializing, Shards.relocating

Cluster topology. Sustained Shards.unassigned > 0 usually means a node failed, the disk-low watermark was breached, or replicas have nowhere to land.

CPUUtilization, JVMMemoryPressure

The two resource metrics that actually predict outages. AWS recommends alarming JVM memory pressure at sustained >80%; above 92% means GC thrash is imminent.
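To make "sustained" concrete, here is a minimal sketch that classifies a series of JVMMemoryPressure datapoints against the 80% and 92% figures quoted above. The function name, the state labels, and the default of three consecutive datapoints are assumptions for illustration, not values taken from the dashboard.

```python
from typing import Sequence

WARN_PCT = 80.0      # alarm threshold quoted in the text above
CRITICAL_PCT = 92.0  # GC-thrash threshold quoted in the text above


def jvm_pressure_state(samples: Sequence[float], sustain: int = 3) -> str:
    """Classify JVMMemoryPressure from the most recent `sustain` datapoints.

    A level only counts when every one of the last `sustain` samples is
    above it, which filters out short spikes.
    """
    recent = list(samples)[-sustain:]
    if len(recent) < sustain:
        return "insufficient-data"
    lowest = min(recent)
    if lowest > CRITICAL_PCT:
        return "critical"
    if lowest > WARN_PCT:
        return "warning"
    return "ok"
```

Using the minimum of the window means a single dip below the threshold resets the state, which mirrors how a CloudWatch alarm with "N out of N datapoints" behaves.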

JVMGCYoungCollectionCount, JVMGCYoungCollectionTime, JVMGCOldCollectionCount, JVMGCOldCollectionTime

Garbage collection counts and total time. Frequent old-gen GCs with rising time = heap pressure; scale up or add nodes.

FreeStorageSpace, ClusterUsedSpace

Free space on the worst node (Min stat across nodes) vs total cluster used space. OpenSearch stops allocating shards to a node that crosses the low disk watermark and blocks writes once disk space runs critically low, so alert well before then.
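One plausible way to derive a utilization percentage from these two metrics is sketched below. The formula (used space divided by used plus total free space) is an assumption for illustration; the dashboard JSON may compute it differently. Both inputs are assumed to be in the same unit, with `total_free` being the Sum stat of FreeStorageSpace across nodes, not the Min stat.

```python
def storage_utilization_pct(cluster_used: float, total_free: float) -> float:
    """Return used space as a percentage of (used + free).

    Both arguments must be in the same unit. Raises ValueError when the
    total is not positive, for example when no datapoints were returned.
    """
    total = cluster_used + total_free
    if total <= 0:
        raise ValueError("used + free must be positive")
    return 100.0 * cluster_used / total
```

For example, `storage_utilization_pct(750, 250)` gives 75.0.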

SearchRate, SearchLatency, IndexingRate, IndexingLatency

Throughput and latency for queries and ingestion. CloudWatch supports p99 natively on the latency metrics - the dashboard uses both avg and p99 to expose tail latency.
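For readers who want to pull the same two series outside Grafana, the sketch below builds a pair of GetMetricData queries for SearchLatency, one with the Average stat and one with p99. The helper name `latency_queries` and the query IDs are made up for this example; the dictionary layout follows the CloudWatch `MetricDataQueries` request shape.

```python
from typing import Dict, List


def latency_queries(domain: str, client_id: str, period: int = 300) -> List[Dict]:
    """Build GetMetricData queries for average and p99 SearchLatency."""
    metric = {
        "Namespace": "AWS/ES",
        "MetricName": "SearchLatency",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": client_id},  # the AWS account ID
        ],
    }
    return [
        {
            "Id": f"search_latency_{label}",
            "MetricStat": {"Metric": metric, "Period": period, "Stat": stat},
            "ReturnData": True,
        }
        for label, stat in (("avg", "Average"), ("p99", "p99"))
    ]
```

The returned list can be passed as the `MetricDataQueries` argument of boto3's `get_metric_data`, together with `StartTime` and `EndTime`. Swapping the metric name for IndexingLatency gives the ingestion equivalent.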

ThreadpoolSearchRejected, ThreadpoolWriteRejected, ThreadpoolSearchQueue, ThreadpoolWriteQueue

Rejections only emit when non-zero, so any signal here is a real problem. Watch the queue lines lead the rejection lines - when queues fill up, rejections follow.
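Because periods with no rejections produce no datapoint at all, anything that post-processes these series has to treat gaps as zero, not as missing data. The sketch below shows one way to do that on an already-fetched series; the function name and the dict-of-timestamps input shape are assumptions for illustration. In CloudWatch metric math, `FILL(m1, 0)` serves a similar purpose.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def fill_missing_with_zero(
    points: Dict[datetime, float],
    start: datetime,
    end: datetime,
    period_s: int,
) -> List[Tuple[datetime, float]]:
    """Return one (timestamp, value) pair per period in [start, end).

    Periods that have no datapoint in `points` are reported as 0.0,
    matching the "no datapoint means no rejections" behavior.
    """
    step = timedelta(seconds=period_s)
    series = []
    t = start
    while t < end:
        series.append((t, points.get(t, 0.0)))
        t += step
    return series
```

This assumes the timestamps in `points` are aligned to the same period boundaries as `start`; unaligned datapoints would be silently dropped.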

AutomatedSnapshotFailure

1 when the last automatic snapshot failed. Non-zero needs investigation immediately; backups are your safety net.

2xx / 3xx / 4xx / 5xx

HTTP response code counts from the domain's REST API. Sudden 4xx surges often mean a client started sending bad queries; 5xx surges = the cluster is in trouble.

Prerequisites

Grafana 10.0 or later
AWS CloudWatch datasource plugin configured in Grafana
IAM permissions on the role/user backing the datasource:
- cloudwatch:GetMetricData
- cloudwatch:ListMetrics
- cloudwatch:GetMetricStatistics

No OpenSearch domain access policy changes are required since metrics flow through CloudWatch.
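The three permissions listed above could be expressed as an IAM policy document along these lines. This is a minimal sketch: the `Sid` value and the function name are made up for illustration, and `"Resource": "*"` is used on the assumption that these CloudWatch read actions are not scoped to individual resources. Your Grafana setup may need further permissions (for example for alarms or cross-account access) that are outside the scope of this dashboard.

```python
import json


def cloudwatch_read_policy() -> str:
    """Render a read-only CloudWatch metrics policy as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrafanaCloudWatchMetricsRead",  # hypothetical name
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:GetMetricData",
                    "cloudwatch:ListMetrics",
                    "cloudwatch:GetMetricStatistics",
                ],
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(cloudwatch_read_policy())
```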

Installation

Download the dashboard JSON.
In Grafana, go to Dashboards → New → Import.
Paste the JSON or upload the file.
When prompted, select your AWS CloudWatch datasource.
Click Import.

Variables

Region - AWS region of your OpenSearch domain
Domain - auto-populated from CloudWatch with every domain that publishes the Nodes metric
Period - CloudWatch aggregation period (60s, 300s, or 3600s)
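To show roughly how the Domain variable can be populated, the sketch below extracts the distinct DomainName values from a CloudWatch ListMetrics response. The helper name is hypothetical, and the exact variable query inside the dashboard JSON is not shown here; in Grafana it is presumably a `dimension_values(...)` query on the Nodes metric.

```python
from typing import Dict, List


def domains_from_list_metrics(response: Dict) -> List[str]:
    """Return the sorted, de-duplicated DomainName values in a
    ListMetrics response (the dict shape returned by boto3)."""
    names = set()
    for metric in response.get("Metrics", []):
        for dim in metric.get("Dimensions", []):
            if dim.get("Name") == "DomainName":
                names.add(dim["Value"])
    return sorted(names)
```

With boto3, the input would come from `list_metrics(Namespace="AWS/ES", MetricName="Nodes")`. A domain that has stopped publishing Nodes eventually drops out of that response, so it disappears from the variable as well.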

Troubleshooting

Q: The Domains count says 0 even though I have domains.
A: The fleet overview counts only domains currently publishing the Nodes metric. Newly created or paused domains may not appear for a few minutes.

Q: Search/Write Rejected panels are always empty.
A: That's healthy - AWS only emits these metrics when non-zero. Empty means no rejections happened in the time range.

Q: Healthy/Unhealthy host panels don't match the console.
A: The dashboard sums HealthyHostCount / UnHealthyHostCount across (TargetGroup, AvailabilityZone) pairs because CloudWatch doesn't emit these without both dimensions. The total matches the console.

Q: Will this work with the older Elasticsearch service?
A: Yes - both legacy Amazon Elasticsearch Service and OpenSearch Service publish to the AWS/ES namespace with the same metric names.

