Key metrics for monitoring Loki
Loki exposes many metrics, and each component behaves differently under load. This page focuses on the highest-signal metrics for detecting negative trends early.
Note
The example queries on this page are PromQL. Run them against the Prometheus-compatible data source where your Loki metrics are stored (for example, Prometheus, Mimir, or Grafana Cloud Metrics).
For setup instructions and prebuilt dashboards and alerts, refer to the Loki mixin.
Request error rate
Watch request failures first. A sustained increase in 5xx responses is usually the earliest sign of user-visible impact.
Key metric:
`loki_request_duration_seconds_count` (a counter with labels including `status_code`, `job`, and `route`)
Example query:
```
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
  /
sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
```
Abnormal behavior:
- Any sustained increase in 5xx ratio.
- The Loki mixin alert `LokiRequestErrors` fires when this ratio is greater than 10% for 15 minutes.
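The same signal can be written as a standalone alert condition. The following is a sketch that mirrors the documented 10% threshold; the exact mixin rule may differ in label handling:

```
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
  /
sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
  > 10
```

In an alerting rule, pair this expression with a `for: 15m` clause so that short spikes do not page.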
Request latency (p99)
Latency degradation can appear before hard failures. Track p99 for read and write routes.
Key metric:
`loki_request_duration_seconds_bucket` (histogram buckets)
Example query:
```
histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route))
```
Abnormal behavior:
- Rising p99 over time, especially in query-frontend and distributor paths.
- The Loki mixin alert `LokiRequestLatency` fires when p99 exceeds 1 second for 15 minutes.
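As a sketch of an alert condition matching the documented 1-second threshold (the actual mixin rule may exclude some routes):

```
histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route)) > 1
```

Combine this with `for: 15m` in the alerting rule to match the mixin's behavior.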
Panics
Panics are high-severity faults and should stay at zero.
Key metric:
`loki_panic_total`
Example query:
```
sum(increase(loki_panic_total[10m])) by (cluster, namespace, job)
```
Abnormal behavior:
- Any value above zero. The Loki mixin alert `LokiRequestPanics` treats this as critical.
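Because the expected value is zero, the alert condition is a simple comparison. A minimal sketch:

```
sum(increase(loki_panic_total[10m])) by (cluster, namespace, job) > 0
```

No `for` clause is needed here; any panic is worth investigating immediately.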
Discarded samples
Discarded samples indicate data that Loki rejected or dropped. This is one of the most important ingestion-quality signals.
Key metric:
`loki_discarded_samples_total`
Example query:
```
topk(10, sum by (tenant, reason) (rate(loki_discarded_samples_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))
```
Abnormal behavior:
- Increasing discard rate.
- New or growing `reason` values (for example, tenant limits or stream limits).
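To judge whether a discard rate is significant, it can help to compare it against ingested volume. A sketch, assuming both metrics carry a `tenant` label in your Loki version:

```
sum by (tenant) (rate(loki_discarded_samples_total[5m]))
  /
sum by (tenant) (rate(loki_distributor_lines_received_total[5m]))
```

A persistently high ratio for a single tenant usually points at misconfigured limits or a misbehaving sender.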
Compaction health
Compaction issues can silently degrade read performance and retention behavior over time.
Note
The compaction and retention metrics use a `loki_boltdb_shipper_` prefix for historical reasons. The compactor emits these metrics regardless of which index type you use, including TSDB.
Key metrics:
- `loki_boltdb_shipper_compactor_running`
- `loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds`
- `loki_boltdb_shipper_compact_tables_operation_total`
- `loki_boltdb_shipper_compact_tables_operation_duration_seconds`
Example queries:
```
sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace)
```

```
time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 0)
```
Abnormal behavior:
- More than one compactor running.
- No successful compaction for multiple hours (the mixin alert threshold is 3 hours).
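The second example query above returns the number of seconds since the last successful compaction, so the 3-hour threshold can be sketched as an alert condition like:

```
time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 0) > 3 * 3600
```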
Ingester health and flush behavior
Ingester pressure often appears as memory growth, poor chunk utilization, or flush backlog.
Key metrics:
- `loki_ingester_memory_streams`
- `loki_ingester_memory_chunks`
- `loki_ingester_flush_queue_length`
- `loki_ingester_chunk_utilization`
- `loki_ingester_chunks_flushed_total`
Example queries:
```
sum(loki_ingester_memory_streams{cluster="$cluster", namespace="$namespace"})
```

```
sum(loki_ingester_flush_queue_length{cluster="$cluster", namespace="$namespace"})
```
Abnormal behavior:
- Persistent growth in in-memory streams or chunks.
- Increasing flush queue length.
- Low chunk utilization for long periods.
Distributor throughput
Throughput changes help identify upstream sender issues, sudden traffic shifts, or ingestion bottlenecks.
Key metrics:
- `loki_distributor_bytes_received_total`
- `loki_distributor_lines_received_total`
Example queries:
```
sum(rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```

```
sum(rate(loki_distributor_lines_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```
Abnormal behavior:
- Sharp drops (possible data path interruption).
- Unexpected spikes (possible overload or noisy tenants).
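One way to catch sharp drops is to compare current throughput against a recent baseline using `offset`. The 50% threshold and one-hour baseline below are illustrative, not mixin defaults:

```
sum(rate(loki_distributor_bytes_received_total[10m]))
  <
0.5 * sum(rate(loki_distributor_bytes_received_total[10m] offset 1h))
```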
Object store operations
Object store latency and failures directly impact query and retention workflows.
Key metrics:
- `loki_objstore_bucket_operations_total`
- `loki_objstore_bucket_operation_failures_total`
- `loki_objstore_bucket_operation_duration_seconds`
Example queries:
```
sum by (operation) (rate(loki_objstore_bucket_operation_failures_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```

```
histogram_quantile(0.99, sum(rate(loki_objstore_bucket_operation_duration_seconds_bucket{cluster="$cluster", namespace="$namespace"}[$__rate_interval])) by (le, operation))
```
Abnormal behavior:
- Growing failure rates by operation.
- Increasing p99 latency for `get`, `get_range`, or `upload`.
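Raw failure counts are easier to interpret as a fraction of total operations. A sketch:

```
sum by (operation) (rate(loki_objstore_bucket_operation_failures_total[5m]))
  /
sum by (operation) (rate(loki_objstore_bucket_operations_total[5m]))
```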
Resource and runtime health
Resource pressure can explain or predict service degradation before alert thresholds are crossed.
Common signals to track:
- Container CPU usage
- Container memory working set
- Go heap in use
- Disk read and write rates
- Container restarts
Abnormal behavior:
- Repeated restart spikes.
- Sustained CPU saturation.
- Memory growth without recovery.
Loki Canary (end-to-end data verification)
If you run Loki Canary, use it as an end-to-end correctness signal, not only a performance signal.
Key metrics:
- `loki_canary_missing_entries_total`
- `loki_canary_spot_check_missing_entries_total`
- `loki_canary_response_latency_seconds_bucket`
Example query:
```
sum(increase(loki_canary_missing_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
  /
sum(increase(loki_canary_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
* 100
```
Abnormal behavior:
- Any non-zero missing rate sustained over time.
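Because the expected number of missing entries is zero, an alert condition can be as simple as the sketch below; the 15-minute window is illustrative:

```
sum(increase(loki_canary_missing_entries_total[15m])) > 0
```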
Internal error log rate
Internal logs provide fast context when metrics indicate degradation.
Key metric:
`loki_internal_log_messages_total`
Use this metric with component logs to correlate where failures begin.
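To spot which component's internal error logging is accelerating, a per-job breakdown can help. A sketch, assuming your Loki version attaches a `level` label to this metric:

```
sum by (cluster, namespace, job) (rate(loki_internal_log_messages_total{level="error"}[5m]))
```

A rising error-level rate in one component narrows down where to read logs first.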
Retention and sweeper progress
Retention and sweeper lag can cause storage growth and delayed data lifecycle actions.
Key metrics:
- `loki_compactor_apply_retention_last_successful_run_timestamp_seconds`
- `loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time`
- `loki_boltdb_shipper_retention_sweeper_chunk_deleted_duration_seconds_count`
Example query:
```
time() - (loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time{cluster="$cluster", namespace="$namespace"} > 0)
```
Abnormal behavior:
- Increasing sweeper lag.
- Falling delete throughput or sustained delete failures.
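The example query above returns sweeper lag in seconds, so an alert condition is a threshold on that value. The 6-hour cutoff here is illustrative; tune it to your retention configuration:

```
time() - (loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time > 0) > 6 * 3600
```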
Next steps
- Install and keep the latest Loki mixin dashboards and alerts.
- Build component-specific alerts on top of these baseline signals.
- Pair metrics with Loki component logs, including `metrics.go` lines, during incident response.


