Monitor compactor health

Grafana Enterprise Metrics emits several metrics related to compactor health. The following queries give a high-level view of compactor activity. If you have self-monitoring enabled, see the GEM system monitoring / compactor dashboard, which includes panels built from these queries.

Successful compactor jobs run per hour

```promql
sum(increase(cortex_compactor_runs_completed_total[1h]))
```

This value should be relatively stable when viewed over a long enough period of time, for example hours or days.

Failed compactor jobs run per hour

```promql
sum(increase(cortex_compactor_runs_failed_total[1h]))
```

Note: Restarting the compactor process interrupts in-progress compaction jobs. This increases the value of cortex_compactor_runs_failed_total, but it is not a cause for concern as long as the restarts are expected. If the compactor crashes, this metric is not incremented, so monitor compactor process crashes separately.
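
One possible way to watch for compactor restarts, including crashes, is to count changes in the process start time. This is a minimal sketch, not a query taken from the GEM dashboards. It assumes that the compactor exposes the standard process_start_time_seconds metric and that its series carry a job label matching "compactor"; the label matcher is hypothetical and must be adapted to your scrape configuration. It also assumes the series identity is stable across restarts.

```promql
# Number of compactor process restarts in the last hour, per instance.
# Assumption: the job label contains "compactor"; adjust the matcher to your setup.
sum by (instance) (
  changes(process_start_time_seconds{job=~".*compactor.*"}[1h])
)
```

This query counts restarts of any kind, expected or not, so it complements the failed runs metric rather than replacing it.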

Number of blocks per tenant

```promql
sum by (user) (cortex_bucket_blocks_count - cortex_bucket_blocks_marked_for_deletion_count)
```

This value should be relatively stable over a long enough period of time, for example several days. If the compactor is lagging behind, the number of blocks per tenant will increase over time.
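
To make a growing block count easier to spot, the per-tenant count can be compared with its value one day earlier. This is a sketch only: the one-day offset is an illustrative choice, not a recommended setting.

```promql
# Change in the per-tenant block count compared with 24 hours ago.
# Assumption: a 1d offset; choose a window that suits your compaction schedule.
sum by (user) (cortex_bucket_blocks_count - cortex_bucket_blocks_marked_for_deletion_count)
-
sum by (user) (cortex_bucket_blocks_count offset 1d - cortex_bucket_blocks_marked_for_deletion_count offset 1d)
```

A result that stays positive over several days may indicate that the compactor is falling behind for that tenant.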

Monitoring bucket index health

Before enabling the bucket index, you can verify the index health by monitoring the cortex_bucket_index_last_successful_update_timestamp_seconds metric, which tracks the time of the last successful bucket index update per tenant. The following query returns the index age, in seconds, for each tenant:

```promql
time() - cortex_bucket_index_last_successful_update_timestamp_seconds
```

The maximum index age should generally line up with the value of the -compactor.cleanup-interval flag.

Note: Some jitter is added to the cleanup interval so that compactor replicas do not all run at the same moment each time the interval elapses. The cleanup also takes some time to perform. As a result, the index age may be slightly greater than the cleanup interval, which is not a cause for concern. We recommend configuring an alert that fires when the index age exceeds (2 * compactor.cleanup-interval) + 5 minutes.
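
The recommended threshold can be written as an alert expression. The sketch below assumes a cleanup interval of 15 minutes, which is an illustrative value; substitute the value you have configured for -compactor.cleanup-interval. With that assumption, the threshold is (2 * 900) + 300 = 2100 seconds, or 35 minutes.

```promql
# Returns tenants whose bucket index age exceeds (2 * cleanup interval) + 5 minutes.
# Assumption: -compactor.cleanup-interval is 15m (900 seconds).
(time() - cortex_bucket_index_last_successful_update_timestamp_seconds)
  > (2 * 15 * 60) + (5 * 60)
```

The expression returns a series only for tenants whose index is older than the threshold, so it can be used directly as the condition of an alerting rule.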