Monitoring compactor health
Grafana Enterprise Metrics emits several metrics related to compactor health.
The following queries give a high-level view of compactor activity. If you
have self-monitoring enabled, see the GEM system monitoring / compactor
dashboard, which includes panels built from these queries.
Successful compactor jobs run per hour
This value should be relatively stable when viewed over a long enough period of time, for example hours or days.
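As a minimal sketch, the hourly count of successful runs could be charted with a query like the one below. It assumes a counter named cortex_compactor_runs_completed_total, which is not named elsewhere on this page; confirm the metric name against your own deployment before relying on it.

```promql
# Assumption: cortex_compactor_runs_completed_total counts completed compaction runs.
# increase() over a 1h window approximates the number of runs in the last hour;
# sum() aggregates across all compactor replicas.
sum(increase(cortex_compactor_runs_completed_total[1h]))
```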
Failed compactor jobs run per hour
Note: Restarting the compactor process interrupts in-progress compaction jobs. This increases the value of
cortex_compactor_runs_failed_total, but it is not a cause for concern as long as the restarts are expected. If the compactor crashes, this metric is not incremented, so monitor compactor process crashes separately.
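A query along the following lines could chart failed runs per hour. It is a sketch built on the cortex_compactor_runs_failed_total counter named above; the 1h window and the sum across replicas are illustrative choices.

```promql
# Approximate number of failed compaction runs in the last hour,
# summed across all compactor replicas.
sum(increase(cortex_compactor_runs_failed_total[1h]))
```

Expected restarts show up here as short spikes, so it can help to compare this series against your deployment history before treating a spike as a problem.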
Number of blocks per tenant
sum by (user) (cortex_bucket_blocks_count - cortex_bucket_blocks_marked_for_deletion_count)
This value should be relatively stable over a long enough period of time, for example several days. If the compactor is lagging behind, the number of blocks increases over time.
Monitoring bucket index health
After enabling the bucket index, you can verify the index health by
monitoring the
cortex_bucket_index_last_successful_update_timestamp_seconds metric. This
metric tracks the last successful bucket index update per tenant. The following
query returns the index age, in seconds, for each tenant:
time() - cortex_bucket_index_last_successful_update_timestamp_seconds
The maximum index age should generally line up with the value of the
compactor.cleanup-interval setting.
Note: Some jitter is added to the cleanup interval to prevent all compactor replicas from running at the same moment every time the interval elapses. Additionally, the cleanup itself takes some time to perform. Because of this, the index age may be slightly greater than the cleanup interval. This is not a cause for concern. We recommend configuring an alert that fires when the index age exceeds (2 * compactor.cleanup-interval) + 5 minutes.
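The recommended threshold can be sketched as an alert expression. The example assumes a compactor.cleanup-interval of 15 minutes (900 seconds), which is an illustrative value only; substitute the interval configured in your cluster.

```promql
# Assumption: compactor.cleanup-interval = 15m = 900 seconds (replace with your value).
# Threshold: (2 * 900) + 300 = 2100 seconds.
# Returns one series per tenant whose bucket index is older than the threshold.
(time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > (2 * 900) + 300
```

Because the expression keeps the per-tenant labels, an alert built on it can identify which tenant's index is stale.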