
Troubleshoot GEM

You might encounter these issues while operating a GEM cluster. Follow these steps to troubleshoot.

Write path

Use the Writes and Writes Resources GEM system monitoring dashboards for insight into the performance of the write path.

Read the Writes dashboard from top to bottom. Each row represents a step in write path processing. You can isolate high write latency to a specific component by finding the dashboard row with increased latency. After isolating the component, use the per-instance graph panel to narrow down the issue to specific instances.

Typically, 99th percentile latency (P99 latency) for distributors ranges from 50 to 100 ms. If this value is higher, you might need to scale up distributors.

For ingesters, P99 latency typically ranges from 5 to 50 ms. If this value is higher, investigate the root cause before scaling up ingesters.

Increased latency can have a number of causes, including compute or disk resource starvation. Use the Writes Resources dashboard to investigate the compute and disk resources in use by each component of the cluster involved in the write path.

Out-of-order sample errors

Unless you’ve configured experimental out-of-order sample ingestion, GEM must ingest samples of each series in order. If this requirement isn’t met, GEM returns the out-of-order sample error. To learn more about how to configure experimental out-of-order sample ingestion, refer to Configure out-of-order samples ingestion in the Mimir documentation.

You can query the rate of out-of-order sample errors, as shown:

promql
sum by (reason) (rate(cortex_discarded_samples_total{reason="sample-out-of-order"}[$__rate_interval]))

Common reasons for samples sent out-of-order include:

  • Multiple Prometheus servers or Grafana agents sending the same data.
  • Non-configured, or misconfigured, high-availability (HA) tracking. HA tracking is the specific configuration used to deduplicate writes from Prometheus server HA pairs.

When multiple clients send the same data, you might see logs containing the message "sample with repeated timestamp but different value". To identify the clients involved, enable source IP logging by setting the -server.log-source-ips-enabled flag to true, or by setting the corresponding log_source_ips_enabled option in the server block of the configuration YAML file. For details about configuring source IP logging, refer to the server_config section of the reference configuration page.

HA tracking uses labels to deduplicate writes from HA Prometheus servers or Grafana agents scraping the same targets. Ensure that all samples are sent with a specific cluster and replica label. By default, these use the label names cluster and __replica__, but you can set these values individually for each tenant. During deduplication with HA tracking enabled, the replica label is removed from the samples. However, if clients are misconfigured, these labels might not be present on all samples, and some samples might be ingested with the replica label still attached.

You can identify series without the replica label, for example __replica__, with the following query:

promql
count({__name__=~".+", __replica__=""})

Note

Only the first sample in a remote-write batch is checked for deduplication. It’s important to configure the correct external labels for all samples.
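
The label requirement described above can also be checked offline against a sample of label sets, for example ones collected from a client's relabeling output. The following Python sketch is illustrative only: the function name missing_ha_labels and the input shape (a list of label dictionaries) are assumptions made for this example, and the default label names cluster and __replica__ are the defaults mentioned above.

```python
def missing_ha_labels(series, cluster_label="cluster", replica_label="__replica__"):
    """Return (index, missing_labels) for every label set lacking an HA label.

    `series` is a list of dicts mapping label names to values (an assumed,
    illustrative input shape). An empty value counts as missing, in the same
    way that the PromQL matcher label="" selects series without that label.
    """
    problems = []
    for i, labels in enumerate(series):
        missing = [name for name in (cluster_label, replica_label) if not labels.get(name)]
        if missing:
            problems.append((i, missing))
    return problems


if __name__ == "__main__":
    sample = [
        {"__name__": "up", "cluster": "prod", "__replica__": "prom-0"},
        {"__name__": "up", "cluster": "prod"},  # replica label missing
    ]
    for index, missing in missing_ha_labels(sample):
        print(f"series {index} is missing: {', '.join(missing)}")
```

If a tenant uses non-default label names, pass them through the cluster_label and replica_label parameters.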

Timestamp-too-old errors

When a sample is older than what the GEM time-series database (TSDB) accepts, GEM returns a timestamp-too-old error. This limit is generally 1-2 hours back, depending on when the last block was cut. These bounds are relative to the timestamps sent and stored in the TSDB, rather than to the GEM server’s wall-clock. TSDBs are separate for each GEM tenant, and samples sent to one tenant don’t affect another.

You can query the rate of timestamp-too-old errors, as shown:

promql
sum by (reason) (rate(cortex_discarded_samples_total{reason="sample-timestamp-too-old"}[$__rate_interval]))

One possible cause of the error is a client with a skewed wall-clock that sends samples with timestamps ahead of all other clients sending to the same tenant. You can check the wall-clock of clients against GEM’s wall-clock with the following query:

promql
abs(node_time_seconds - timestamp(node_time_seconds))
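
To see why a single client with a fast clock can cause rejections for every other client of the same tenant, consider a deliberately simplified model in which the TSDB accepts a sample only if it's no older than a fixed window behind the newest timestamp already stored. The Python sketch below is not GEM's implementation: the one-hour window and the function name ingest are assumptions chosen to fall within the 1-2 hour range described above.

```python
ACCEPT_WINDOW_MS = 60 * 60 * 1000  # assumed 1 hour; the real bound varies (1-2 hours)


def ingest(timestamps_ms, window_ms=ACCEPT_WINDOW_MS):
    """Toy model: return the timestamps that would be rejected as too old.

    A sample is rejected when it is more than `window_ms` behind the newest
    timestamp accepted so far for the tenant. Wall-clock time plays no role.
    """
    newest = None
    rejected = []
    for ts in timestamps_ms:
        if newest is not None and ts < newest - window_ms:
            rejected.append(ts)
            continue
        newest = ts if newest is None else max(newest, ts)
    return rejected


if __name__ == "__main__":
    now = 1_700_000_000_000  # arbitrary "correct" time in milliseconds
    skewed = now + 3 * 60 * 60 * 1000  # a client whose clock runs 3 hours ahead
    # The skewed sample moves the tenant's newest timestamp forward, so the
    # correctly timestamped sample that follows it is rejected.
    print(ingest([now, skewed, now + 15_000]))
```

In this model, the rejected sample comes from a client with a correct clock, which is why the error often appears on clients other than the misconfigured one.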

Verify ingester ring status

Note

GEM versions 1.6 and earlier don’t expose the ring page on the ingester microservice. In those versions, connect to a distributor, querier, or ruler instead.

The following examples assume that you’re using kubectl port-forward to forward a GEM component that serves the ingester ring page, and that it’s listening on localhost port 8080.

For a list of ring members, run this command:

$ curl -s -H "Accept: application/json" http://localhost:8080/ingester/ring | jq '.shards[] | del(.tokens)'
{
  "id": "ingester-0",
  "state": "ACTIVE",
  "address": "127.0.0.1:9095",
  "timestamp": "2021-12-22 09:43:06 +0000 GMT",
  "registered_timestamp": "2021-12-22 09:39:36 +0000 GMT",
  "zone": ""
}
{
  "id": "ingester-1",
  "state": "Unhealthy",
  "address": "127.0.0.1:9095",
  "timestamp": "2021-12-22 09:39:20 +0000 GMT",
  "registered_timestamp": "2021-12-22 09:33:44 +0000 GMT",
  "zone": ""
}
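
When the ring has many members, it can help to list only the members that aren't active. The following Python sketch does that for a saved copy of the JSON response. It assumes the response is an object with a shards array whose entries carry id and state fields, as implied by the jq filter and output above; the function name non_active_members is hypothetical.

```python
import json


def non_active_members(ring_json):
    """Return (id, state) for every ring member whose state is not ACTIVE."""
    ring = json.loads(ring_json)
    return [
        (shard["id"], shard["state"])
        for shard in ring.get("shards", [])
        if shard.get("state") != "ACTIVE"
    ]


if __name__ == "__main__":
    # Trimmed sample shaped like the output above; a real response also
    # carries per-member tokens and other fields.
    sample = json.dumps({"shards": [
        {"id": "ingester-0", "state": "ACTIVE"},
        {"id": "ingester-1", "state": "Unhealthy"},
    ]})
    for member_id, state in non_active_members(sample):
        print(f"{member_id}: {state}")
```

The returned IDs are the values you would pass as the forget parameter when removing a member from the ring.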

To remove an unhealthy instance from the ring, run this command:

Note

The response code of the endpoint is 302, regardless of whether the request succeeds or fails. Monitor the logs for potential errors.

$ curl -v -d forget=ingester-1 http://localhost:8080/ingester/ring
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /ingester/ring HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.79.1
> Accept: */*
> Content-Length: 19
> Content-Type: application/x-www-form-urlencoded
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found
< Location: #
< Date: Wed, 22 Dec 2021 09:47:30 GMT
< Content-Length: 0
<

Too-many-inflight-push-requests errors

A too-many-inflight-push-requests error can occur in both the distributor and the ingester.

Run these queries to determine if there’s a single ingester or distributor causing the issue:

promql
sum(cortex_distributor_inflight_push_requests) by (namespace, cluster, pod)

sum(cortex_ingester_inflight_push_requests) by (namespace, cluster, pod)

One possible cause of too many inflight push requests on ingesters is throttled or underperforming disk I/O. This also manifests as increased TSDB commit durations, which you can check with this query:

promql
histogram_quantile(0.9, (sum(rate(cortex_ingester_tsdb_appender_commit_duration_seconds_bucket[5m])) by (namespace, cluster, le)))

Read path

Use the Reads and Reads Resources GEM system monitoring dashboards for insight into the performance of the read path.

Read the Reads dashboard from top to bottom. Each row represents a step in processing the read path.

Read path latency is more variable than write path latency because it depends on the kinds of queries you run. Increased latency can have a number of causes, including compute or disk resource starvation. Use the Reads Resources dashboard to investigate the compute and disk resources in use by each component of the cluster involved in the read path.

Compactor

Use the Compactor and Compactor Resources GEM system monitoring dashboards for insight into the performance of the compactor.

Compactions failing

Use the following PromQL expression to determine which instances of the compactor haven’t completed a successful compaction in the last 24 hours:

promql
(time() - cortex_compactor_last_successful_run_timestamp_seconds > 60 * 60 * 24)
and
(cortex_compactor_last_successful_run_timestamp_seconds > 0)

To investigate the cause of compaction failures, view the logs of the affected compactor instance.

Block corruption

Corrupted blocks can cause failed compactions. Use the following LogQL expression to identify logs that point to corrupted blocks:

<compactor label matchers> |= "level=error" |= "not healthy index found"

A full log line has an err key that contains more information for resolving this error. You can use the compaction level to understand whether it’s safe to move the block away and allow the compactor to proceed. In a cluster with an ingester replication factor of three and a single not healthy index error, it’s safe to move a block of Compaction level 1 out of the tenant’s bucket directory. No data loss occurs because the replicated blocks haven’t yet been vertically compacted and still exist in object storage. For example:

level=error ts=2020-07-12T17:35:05.516823471Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=REDACTED-TENANT err="compaction: group 0@6672437747845546250: block with not healthy index found /data/compact/0@6672437747845546250/REDACTED-BLOCK; Compaction level 1; Labels: map[__org_id__:REDACTED]: 1/1183085 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
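
The three values you need from such a line are the tenant, the block ID, and the compaction level. The following Python sketch extracts them with a regular expression. The pattern is written against the example line above only, so treat it as an assumption that might need adjusting for other GEM versions; the function name parse_unhealthy_block is hypothetical.

```python
import re

# Matches the tenant, the block path, and the compaction level in a log line
# formatted like the example above.
_PATTERN = re.compile(
    r"user=(?P<tenant>\S+).*"
    r"block with not healthy index found (?P<path>[^;\s]+); "
    r"Compaction level (?P<level>\d+)"
)


def parse_unhealthy_block(line):
    """Extract tenant, block ID, and compaction level from an error log line.

    Returns None if the line doesn't match the format of the example above.
    """
    match = _PATTERN.search(line)
    if match is None:
        return None
    return {
        "tenant": match["tenant"],
        "block": match["path"].rsplit("/", 1)[-1],  # last path component
        "compaction_level": int(match["level"]),
    }


if __name__ == "__main__":
    line = (
        'msg="failed to compact user blocks" user=REDACTED-TENANT '
        'err="compaction: group 0@6672437747845546250: block with not healthy '
        "index found /data/compact/0@6672437747845546250/REDACTED-BLOCK; "
        'Compaction level 1; Labels: map[__org_id__:REDACTED]"'
    )
    print(parse_unhealthy_block(line))
```

A compaction level of 1 corresponds to the case described above. For higher levels, review the err details before moving anything.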

Move a corrupt block using MinIO Client

To allow the compactor to resume operation, move the corrupted block to a different path in the bucket using MinIO Client or an alternative tool:

shell
# Set up an alias for the object store
mc alias set my-object-store https://<ENDPOINT> <ACCESS-KEY> <SECRET-ACCESS-KEY> [--insecure]

# Move the block so that it's ignored during compaction
mc mv --recursive my-object-store/<BUCKET>/<TENANT>/<BLOCK> my-object-store/<BUCKET>/<TENANT>/corrupted-<BLOCK>

Where:

  • BUCKET is the bucket name the compactor is using.
  • TENANT is the tenant ID reported in the example error message as REDACTED-TENANT.
  • BLOCK is the last part of the path reported as REDACTED-BLOCK in the example error message.

Ingester

Restarts

Review the logs of the affected ingester to understand the reason for the restart. An ingester might restart because it was killed for running out of memory (OOMKilled).

In Kubernetes, you can confirm this reason with kubectl:

console
$ kubectl get pod ${POD} -o json | jq -r '.status.containerStatuses[] | { name, reason: .lastState.terminated.reason }'
{
  "name": "ingester",
  "reason": "OOMKilled"
}

On a Linux server, run:

console
$ grep oom /var/log/*

If the ingester is OOMKilled, check for increased load. If there has been an increase in the number of active series, there might not be enough memory provisioned for each ingester. After an outage, lagging clients could send samples at a higher rate, which can temporarily increase the load on the system, including the ingesters.

Out-of-memory errors on start up

When a GEM ingester crashes, it must process its Write Ahead Log (WAL) on start up to recover any data that was in-memory and not yet written out to disk as TSDB blocks (to be later uploaded to object storage). Depending on the reason the ingester crashed, it could crash again due to running out of memory while trying to process the WAL. If, after several attempts, an ingester isn’t able to finish processing its WAL, you can move or remove the data in the WAL. This requires access to the directory used by the ingester for the WAL.

This example shows removing the WAL for the default tenant, fake.

console
$ rm -r /data/tsdb/fake/wal

If you’re using multiple tenants, attempt to remove the WAL for only the largest tenant first. For example:

console
$ du -ms /data/tsdb/* | sort -n
840     /data/tsdb/__system__
10637   /data/tsdb/team-a
28425   /data/tsdb/team-b

$ rm -r /data/tsdb/team-b/wal

Caution

This could cause data loss for the tenant in question. We recommend running GEM with a replication factor of three, meaning that data on any one ingester exists on two other ingesters. However, if you need to remove the WAL of more than a single ingester, you could lose data. Even removing the WAL on a single ingester increases the chance that a hardware failure affecting other ingesters causes data loss. Only do this as a last resort to make a GEM cluster stable.
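
The du output above measures each tenant's whole TSDB directory, which includes more than the WAL. To rank tenants by WAL size alone before deciding which WAL to remove, you could use a read-only Python sketch like the following. It assumes the directory layout shown in the examples above, with one directory per tenant containing a wal subdirectory; the function name wal_sizes is hypothetical, and the script doesn't delete anything.

```python
import os


def wal_sizes(tsdb_dir):
    """Return [(tenant, wal_bytes)] sorted from largest to smallest WAL."""
    sizes = []
    for tenant in os.listdir(tsdb_dir):
        wal_dir = os.path.join(tsdb_dir, tenant, "wal")
        if not os.path.isdir(wal_dir):
            continue  # tenant directory without a WAL
        total = 0
        for root, _dirs, files in os.walk(wal_dir):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes.append((tenant, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    tsdb_dir = "/data/tsdb"  # path used in the examples above
    if os.path.isdir(tsdb_dir):
        for tenant, size in wal_sizes(tsdb_dir):
            print(f"{size / 1024 / 1024:10.1f} MiB  {tenant}")
```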

Blocks

Map a block identifier to a date and time range

Each block has a Universally Unique Lexicographically Sortable Identifier (ULID). For additional metadata information about the block, refer to the meta.json file inside the block directory.

To determine the start and end timestamp from a block’s meta.json file, run the following:

console
$ jq '{ "start": (.minTime / 1000 | todate), "end": (.maxTime / 1000 | todate) }' 01FBBE5RQV8WT7D81NYYSPYHTH.json
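
If the meta.json file isn't available, the ULID itself still carries some time information: by the ULID format, the first 10 characters encode a 48-bit millisecond timestamp in Crockford base32. The following Python sketch decodes it. The function names are hypothetical, and the example assumes that the file name used above is a block ULID. The decoded value is the time at which the identifier was generated, not the time range of the samples in the block, so meta.json remains the authoritative source for minTime and maxTime.

```python
from datetime import datetime, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford base32, as used by ULIDs


def ulid_timestamp_ms(ulid):
    """Decode the 48-bit millisecond timestamp from the first 10 ULID characters."""
    if len(ulid) != 26:
        raise ValueError("a ULID is 26 characters long")
    value = 0
    for char in ulid[:10].upper():
        value = value * 32 + CROCKFORD.index(char)  # raises ValueError on bad input
    return value


def ulid_time(ulid):
    """Return the ULID timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ulid_timestamp_ms(ulid) / 1000, tz=timezone.utc)


if __name__ == "__main__":
    print(ulid_time("01FBBE5RQV8WT7D81NYYSPYHTH").isoformat())
```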

Caching

“memcache: connect timeout” errors

Latency when establishing a connection to a Memcached server, including any required DNS lookups, can result in a timeout error. GEM maintains a pool of connections to Memcached servers and reuses connections from that pool. While running, the expected rate of newly created connections from GEM is near zero.

If you’re running the memcached_exporter, you can query the rate of new connections, as shown:

promql
rate(memcached_connections_total[$__rate_interval])

This error is logged when a single GEM server attempts a large number of parallel connections. By default, the maximum number of concurrent connections is 100, governed by the -*.memcached.max-get-multi-concurrency flags, and the default connection pool size is 16. If GEM reaches the maximum number of concurrent connections, it opens 84 (100 - 16) new connections in parallel, which might exceed the connection timeout.

To mitigate this error, tune the connection pool using the -*.memcached.max-idle-connections flags.
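
The arithmetic behind this tuning can be written as a small Python sketch. It models only the worst case described above, in which every concurrent request needs its own connection and the idle pool is already full. The function name is hypothetical, and its parameters mirror the two flags rather than any GEM API.

```python
def connections_opened_in_parallel(max_concurrency=100, max_idle_connections=16):
    """Worst-case number of new Memcached connections opened at once.

    Connections beyond the idle pool must be newly established, so the result
    is the concurrency limit minus the pool size, floored at zero.
    """
    return max(0, max_concurrency - max_idle_connections)


if __name__ == "__main__":
    # Compare the defaults described above with a larger idle pool.
    for pool_size in (16, 50, 100):
        opened = connections_opened_in_parallel(100, pool_size)
        print(f"pool={pool_size:3d} -> up to {opened} new connections in parallel")
```

Raising the idle connection limit toward the concurrency limit reduces the number of connections that must be established at the same moment.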