Key metrics for monitoring Loki

Loki exposes many metrics, and each component behaves differently under load. This page focuses on the highest-signal metrics for detecting negative trends early.

Note
The example queries on this page are PromQL. Run them against the Prometheus-compatible data source where your Loki metrics are stored (for example, Prometheus, Mimir, or Grafana Cloud Metrics).

For setup and prebuilt dashboards and alerts, refer to the Loki mixin.

Request error rate

Watch request failures first. A sustained increase in 5xx responses is usually the earliest sign of user-visible impact.

Key metric:

loki_request_duration_seconds_count (counter with labels including status_code, job, and route)

Example query:

100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
/
sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)

Abnormal behavior:

Any sustained increase in the 5xx ratio.
The Loki mixin alert LokiRequestErrors fires when this ratio is greater than 10% for 15 minutes.
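To see only the routes that currently breach the 10% threshold described above, the example query can be turned into a filter. This is a minimal sketch and not the mixin's exact rule. The 15-minute duration would normally be set with the for field of a Prometheus alerting rule, not in the expression itself.

```promql
# Keep only the job/route combinations whose 5xx ratio is above 10%.
# Comparison operators bind more loosely than * and /, so the ratio
# is evaluated first and then compared with 10.
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
/
sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
> 10
```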

Request latency (p99)

Latency degradation can appear before hard failures. Track p99 for read and write routes.

Key metric:

loki_request_duration_seconds_bucket (histogram buckets)

Example query:

histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route))

Abnormal behavior:

Rising p99 over time, especially in query-frontend and distributor paths.
The Loki mixin alert LokiRequestLatency fires when p99 exceeds 1 second for 15 minutes.
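Long-lived streaming requests can dominate p99 and hide regressions on other routes. The following sketch excludes them and keeps only series above the 1-second threshold described above. The route regular expression is an assumption about how tail routes are named; check the route label values in your own deployment before relying on it.

```promql
# p99 per route, excluding routes whose name contains "tail" (assumed naming),
# filtered to series slower than 1 second.
histogram_quantile(
  0.99,
  sum by (le, cluster, namespace, job, route) (
    rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[1m])
  )
) > 1
```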

Panics

Panics are high-severity faults and should stay at zero.

Key metric:

loki_panic_total

Example query:

sum(increase(loki_panic_total[10m])) by (cluster, namespace, job)

Abnormal behavior:

Any value above zero. The Loki mixin alert LokiRequestPanics treats this as critical.

Discarded samples

Discarded samples indicate data that Loki rejected or dropped. This is one of the most important ingestion-quality signals.

Key metric:

loki_discarded_samples_total

Example query:

topk(10, sum by (tenant, reason) (rate(loki_discarded_samples_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))

Abnormal behavior:

Increasing discard rate.
New or growing reason values (for example, tenant limits or stream limits).
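A raw discard rate is easier to judge when compared with ingest volume. The sketch below expresses discards as a percentage of received lines per tenant. It assumes that both metrics carry a tenant label and that loki_distributor_lines_received_total is a reasonable denominator in your Loki version. Treat the result as a rough indicator, not as an exact rejection ratio.

```promql
# Discarded lines relative to received lines, per tenant (approximate).
# Tenants that have no series in the denominator drop out of the result.
100 * sum by (tenant) (rate(loki_discarded_samples_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
/
sum by (tenant) (rate(loki_distributor_lines_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```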

Compaction health

Compaction issues can silently degrade read performance and retention behavior over time.

Note
The compaction and retention metrics use a loki_boltdb_shipper_ prefix for historical reasons. The compactor emits these metrics regardless of which index type you use, including TSDB.

Key metrics:

loki_boltdb_shipper_compactor_running
loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds
loki_boltdb_shipper_compact_tables_operation_total
loki_boltdb_shipper_compact_tables_operation_duration_seconds

Example queries:

sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace)

time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 0)

Abnormal behavior:

More than one compactor running.
No successful compaction for multiple hours (the mixin alert threshold is 3 hours).
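The two abnormal conditions above can be written as threshold expressions. This is a sketch built from the example queries on this page, not a copy of the mixin rules. Run each expression separately.

```promql
# More than one compactor reporting as running in the same cluster and namespace.
sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace) > 1

# No successful compaction in the last 3 hours (3 * 60 * 60 seconds).
# The inner "> 0" filter drops series that have not recorded a successful run.
time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 0) > 3 * 60 * 60
```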

Ingester health and flush behavior

Ingester pressure often appears as memory growth, poor chunk utilization, or flush backlog.

Key metrics:

loki_ingester_memory_streams
loki_ingester_memory_chunks
loki_ingester_flush_queue_length
loki_ingester_chunk_utilization
loki_ingester_chunks_flushed_total

Example queries:

sum(loki_ingester_memory_streams{cluster="$cluster", namespace="$namespace"})

sum(loki_ingester_flush_queue_length{cluster="$cluster", namespace="$namespace"})

Abnormal behavior:

Persistent growth in in-memory streams or chunks.
Increasing flush queue length.
Low chunk utilization for long periods.
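Chunk utilization and flush activity can be queried in the same style as the examples above. This sketch assumes that loki_ingester_chunk_utilization is exposed as a histogram (with a _bucket series) and that loki_ingester_chunks_flushed_total carries a reason label. Verify both against the metrics your ingesters actually expose. Run each expression separately.

```promql
# Median chunk utilization at flush time (assumes a histogram with _bucket series).
histogram_quantile(0.5, sum by (le) (rate(loki_ingester_chunk_utilization_bucket{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))

# Chunk flush rate broken down by flush reason (assumes a "reason" label).
sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```

A low median combined with frequent flushes for reasons other than full chunks suggests that many small chunks are being written.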

Distributor throughput

Throughput changes help identify upstream sender issues, sudden traffic shifts, or ingestion bottlenecks.

Key metrics:

loki_distributor_bytes_received_total
loki_distributor_lines_received_total

Example queries:

sum(rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))

sum(rate(loki_distributor_lines_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))

Abnormal behavior:

Sharp drops (possible data path interruption).
Unexpected spikes (possible overload or noisy tenants).
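To tell a noisy tenant from a cluster-wide shift, it can help to break throughput down by tenant. The sketch below assumes both distributor metrics carry a tenant label. Run each expression separately.

```promql
# Ten tenants with the highest ingest rate in bytes per second.
topk(10, sum by (tenant) (rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))

# Average line size in bytes; a sudden change often points to a sender-side change.
sum(rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
/
sum(rate(loki_distributor_lines_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```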

Object store operations

Object store latency and failures directly impact query and retention workflows.

Key metrics:

loki_objstore_bucket_operations_total
loki_objstore_bucket_operation_failures_total
loki_objstore_bucket_operation_duration_seconds

Example queries:

sum by (operation) (rate(loki_objstore_bucket_operation_failures_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))

histogram_quantile(0.99, sum(rate(loki_objstore_bucket_operation_duration_seconds_bucket{cluster="$cluster", namespace="$namespace"}[$__rate_interval])) by (le, operation))

Abnormal behavior:

Growing failure rates by operation.
Increasing p99 latency for get, get_range, or upload.
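An absolute failure rate is hard to interpret without the request volume behind it. The following sketch divides failures by total operations using the two counters listed above, giving a failure percentage per operation.

```promql
# Percentage of object store operations that failed, per operation type.
# Operations with no traffic in the window evaluate to NaN (0 / 0).
100 * sum by (operation) (rate(loki_objstore_bucket_operation_failures_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
/
sum by (operation) (rate(loki_objstore_bucket_operations_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```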

Resource and runtime health

Resource pressure can explain or predict service degradation before alert thresholds are crossed.

Common signals to track:

Container CPU usage
Container memory working set
Go heap in use
Disk read and write rates
Container restarts

Abnormal behavior:

Repeated restart spikes.
Sustained CPU saturation.
Memory growth without recovery.
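These signals do not come from Loki itself. The sketch below assumes a Kubernetes deployment where cAdvisor and kube-state-metrics are scraped and where their series carry the same cluster and namespace labels used elsewhere on this page. Metric and label names may differ in other environments. Run each expression separately.

```promql
# CPU cores used per pod (cAdvisor). container!="" skips pod-level aggregate series.
sum by (pod) (rate(container_cpu_usage_seconds_total{cluster="$cluster", namespace="$namespace", container!=""}[$__rate_interval]))

# Memory working set per pod, in bytes (cAdvisor).
sum by (pod) (container_memory_working_set_bytes{cluster="$cluster", namespace="$namespace", container!=""})

# Go heap in use per job, in bytes (Go runtime metrics exposed by each Loki process).
sum by (job) (go_memstats_heap_inuse_bytes{cluster="$cluster", namespace="$namespace"})

# Container restarts per pod over the last hour (kube-state-metrics).
sum by (pod) (increase(kube_pod_container_status_restarts_total{cluster="$cluster", namespace="$namespace"}[1h]))
```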

Loki Canary (end-to-end data verification)

If you run Loki Canary, use it as an end-to-end correctness signal, not only a performance signal.

Key metrics:

loki_canary_missing_entries_total
loki_canary_spot_check_missing_entries_total
loki_canary_response_latency_seconds_bucket

Example query:

sum(increase(loki_canary_missing_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
/
sum(increase(loki_canary_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
* 100

Abnormal behavior:

Any non-zero missing rate sustained over time.
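The other two canary metrics listed above can be queried in the same way. This is a sketch that reuses the label selectors from the example query. Run each expression separately.

```promql
# p99 latency of canary queries against Loki.
histogram_quantile(0.99, sum by (le) (rate(loki_canary_response_latency_seconds_bucket{cluster=~"$cluster", namespace=~"$namespace"}[$__rate_interval])))

# Entries found missing by spot checks over the selected dashboard time range.
sum(increase(loki_canary_spot_check_missing_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
```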

Internal error log rate

Internal logs provide fast context when metrics indicate degradation.

Key metric:

loki_internal_log_messages_total

Use this metric with component logs to correlate where failures begin.
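A possible starting query is shown below. It assumes the metric has a level label with values such as error and warn. Confirm the label name and values in your environment before building alerts on it.

```promql
# Rate of internal error and warning log messages per job (label name and values are assumed).
sum by (job, level) (rate(loki_internal_log_messages_total{cluster="$cluster", namespace="$namespace", level=~"error|warn"}[$__rate_interval]))
```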

Retention and sweeper progress

Retention and sweeper lag can cause storage growth and delayed data lifecycle actions.

Key metrics:

loki_compactor_apply_retention_last_successful_run_timestamp_seconds
loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time
loki_boltdb_shipper_retention_sweeper_chunk_deleted_duration_seconds_count

Example query:

time() - (loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time{cluster="$cluster", namespace="$namespace"} > 0)

Abnormal behavior:

Increasing sweeper lag.
Falling delete throughput or sustained delete failures.
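The remaining two metrics can be queried with the same patterns used elsewhere on this page. This is a sketch. The delete throughput query uses the _count series of the duration metric as a proxy for the number of chunk deletions. Run each expression separately.

```promql
# Seconds since retention was last applied successfully.
# The inner "> 0" filter drops series that have not recorded a successful run.
time() - (loki_compactor_apply_retention_last_successful_run_timestamp_seconds{cluster="$cluster", namespace="$namespace"} > 0)

# Chunk delete operations per second performed by the sweeper.
sum(rate(loki_boltdb_shipper_retention_sweeper_chunk_deleted_duration_seconds_count{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```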

Next steps

Install the Loki mixin dashboards and alerts, and keep them up to date.
Build component-specific alerts on top of these baseline signals.
Pair metrics with Loki component logs, including the query statistics lines logged by metrics.go, during incident response.