Troubleshooting

You might run into some common issues when running the Graphite Proxy in a production environment. To learn how to react to each issue, keep reading.

Metrics ingestion

This section covers common metrics ingestion issues, such as high error rates and high latency.

High error rate on ingest API endpoint

The query that follows shows the error rate of the ingest requests. To run the query, replace <job> with the string that corresponds to the Grafana Mimir gateway job or the Graphite write proxy job.

sum by (status_code) (
  rate(
    cortex_request_duration_seconds_bucket{job="<job>", route="graphite_metrics"}[5m]
  )
)

  1. Check if the GEM distributors are returning errors to ingest requests submitted by the Graphite write proxy.

If so, these errors are forwarded by the Graphite write proxy. To check the distributor error rate directly, see the example query after this list.

  2. Check if Graphite write proxy instances are running out of memory.

If so, either increase their memory limits or increase the number of instances.

  3. Look at the logs to find more specific errors.
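
For step 1, the following query is a sketch of how you might check the distributor-side error rate directly. It assumes that the GEM distributors expose the same cortex_request_duration_seconds histogram and that their push endpoint is registered under the route api_v1_push; the exact job and route label values depend on your GEM version and deployment, so adjust them as needed.

sum by (status_code) (
  rate(
    cortex_request_duration_seconds_count{job="<distributor job>", route="api_v1_push"}[5m]
  )
)

If the distributors return a significant rate of 4xx or 5xx responses here, the errors reported by the Graphite write proxy are most likely only forwarded from them.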

High latency on ingest requests

The query that follows displays the latency of the ingest requests in percentiles. In the query, replace <job> with the string that corresponds to the Grafana Mimir gateway job or the Graphite write proxy job. Also, replace <percentile> with the percentile that you want to see.

histogram_quantile(
  <percentile>,
  sum by (le) (
    rate(
      cortex_request_duration_seconds_bucket{job="<job>", route="graphite_metrics"}[5m]
    )
  )
)

  1. Check if the increased latency comes from the GEM distributors to which the Graphite write proxy is sending its ingest requests.

If the distributor’s ingest latency is high, the ingest latency of the Graphite write proxy rises as well. Therefore, you should try to reduce the distributor latency. To measure the distributor ingest latency directly, see the example query after this list.

  2. Scale the Graphite write proxy to a higher number of replicas.

Doing so might reduce the latency in some situations, but it will not help if the distributors are slow at handling the ingest requests coming from the Graphite write proxy.
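
For step 1, the following query is a sketch of how you might measure the distributor ingest latency directly. As with the error-rate example above, it assumes that the distributors expose the cortex_request_duration_seconds histogram and that their push endpoint uses the route api_v1_push; replace <distributor job> and <percentile> with values that match your deployment.

histogram_quantile(
  <percentile>,
  sum by (le) (
    rate(
      cortex_request_duration_seconds_bucket{job="<distributor job>", route="api_v1_push"}[5m]
    )
  )
)

If this latency closely tracks the ingest latency of the Graphite write proxy, scaling the proxy alone will not help.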

Metrics querying

This section covers issues related to metrics querying.

High error rate on queries

The following query shows the error rate of queries. Replace <job> with either the cortex-gw job or the Graphite querier job.

sum by (status_code) (
  rate(
    cortex_request_duration_seconds_bucket{job="<job>", route="graphite_render"}[5m]
  )
)

  1. Check if the GEM queriers are returning errors to queries submitted by the Graphite querier.

If so, these errors result in errors being returned to the user.

  2. Check if Graphite querier instances are running out of memory (see the example query after this list). If so, you have the following options:
  • Give the Graphite queriers more memory, either by increasing the number of instances or by raising the memory limit of each instance.

Keep in mind that each query is executed on a single Graphite querier instance. This means that if the rate of queries is low but each query consumes a lot of memory, increasing the number of Graphite querier instances might not help, because each instance might already be processing only a single query at a time.

In this situation it would be better to give each instance of the Graphite querier more memory instead of increasing their number.

  • Reduce the concurrency at which sub-queries are processed in the Graphite querier. To adjust the concurrency, update the value of the flag -graphite.querier.query-handling-concurrency.

  • Reduce the number of points to which each result gets aggregated. This might reduce the peak memory usage of the query handling process, but its effectiveness depends on the specific queries. To reduce the number of points that the Graphite querier generates, use the limit defined via the flag -graphite.querier.max-points-per-req-soft.

  • Reject heavy queries. This results in an error being returned to the user if the submitted query is too heavy, but it can prevent out-of-memory conditions, which means that other queries can still be processed. To reject heavy queries, you have the following options:

    • Reduce the maximum number of points each query may produce by using the flag -graphite.querier.max-points-per-req-hard.

    • Leverage the limits which GEM provides, such as max_fetched_series_per_query.

  3. If a stock Graphite deployment is used for queries that can’t be handled by the Graphite querier, check if Graphite runs out of memory.

If so, you can either increase its memory limit or increase the number of Graphite instances. Just like with the Graphite querier, each query is executed on a single Graphite instance. This means that if you have already scaled the number of instances to a level where no two queries are executed concurrently on the same instance, then further increasing their number won’t help, because the additional instances will be idle.

  4. Look at the logs to find more specific errors.
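
For step 2, the following query is a sketch of how you might watch the memory usage of the Graphite querier instances. It assumes that the Graphite querier exposes the standard process_resident_memory_bytes metric of the Prometheus client and that <job> is its job label; compare the result against the configured memory limit of the instances.

max by (instance) (
  process_resident_memory_bytes{job="<job>"}
)

Instances that repeatedly approach their memory limit and then restart are a strong indication of out-of-memory conditions.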

High latency on queries

The following query displays the latency of queries in percentiles. As in the previous error-rate query, replace <job> with the appropriate job and <percentile> with the percentile that you want to see.

histogram_quantile(
  <percentile>,
  sum by (le) (
    rate(
      cortex_request_duration_seconds_bucket{job="<job>", route="graphite_render"}[5m]
    )
  )
)

  1. Check if the increased latency comes from the GEM querier which is used by the Graphite querier.

If the querier is slow, this also raises the latency of the query handling in the Graphite querier. In this case, try to improve the querier latency.

  2. Check if there is a change in the properties of the queries sent to the Graphite querier, which could lead to the queries being slower to process.

The following query shows the lengths of the time ranges that are being queried. Replace <job> with the job of the Graphite querier and <percentile> with the percentile that you want to see:

histogram_quantile(
  <percentile>,
  avg by (le) (
    rate(
      graphite_time_range_length_seconds_bucket{job="<job>"}[5m]
    )
  )
)

If there is an increase in this metric, users are querying for longer time ranges. In that case:

a. Check the Memcached eviction rate in the metric name cache and the aggregation cache to ensure that the cache efficiency is not reduced due to evictions (see the example query at the end of this section).

If there are many evictions, increasing the available memory in Memcached might help.

It might also help to increase the Graphite querier concurrency, because this allows it to process the data required to serve queries in more threads concurrently. Keep in mind that this might increase memory usage as well, so watch out for restarts due to out-of-memory conditions. To adjust the concurrency, use the flag -graphite.querier.query-handling-concurrency.

b. The following query shows how many series each Graphite query involves before the function processing. Replace <job> with the job of the Graphite querier and <percentile> with the percentile that you want to see:

histogram_quantile(
  <percentile>,
  avg by (le) (
    rate(
      graphite_series_per_query_by_phase_bucket{phase="passed_into_graphite_functions", job="<job>"}[5m]
    )
  )
)

If there is an increase in this metric, the queries require more series to be processed per query. It may help to adjust the GEM limit max_series_per_query to reject the heavy queries.

  3. Check if more queries are being processed by Graphite instead of the native query engine.

Generally, Graphite tends to be much slower than the native query engine, so queries that get processed by Graphite can lead to a significant latency increase. To see how many queries get processed by Graphite, use the following query. Replace <job> with the Graphite querier’s job:

rate(graphite_proxied_to_graphite{job="<job>"}[1m])

The following query shows the latency of the Graphite query processing. Replace <job> with the job of the Graphite querier and <percentile> with the percentile that you want to see:

histogram_quantile(
  <percentile>,
  rate(
    graphite_proxied_to_graphite_duration_seconds_bucket{job="<job>"}[1m]
  )
)

In this situation, it might help to scale Graphite to a higher number of processes. There are two things to keep in mind though:

  • Each query is handled by only one instance of the Graphite process, so once you have scaled it to a level where no Graphite instance handles more than one query concurrently, further increasing the number of instances will not help, because any additional Graphite instances would just be idle.

  • Increasing the number of available cores per Graphite process tends not to benefit the latency much because of how Graphite schedules its worker threads. Instead of increasing the cores, it is better to increase the number of Graphite instances.
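
For step 2a, the following query is a sketch of how you might check the Memcached eviction rate. It assumes that the Memcached instances backing the metric name cache and the aggregation cache are scraped with the memcached_exporter, and that <memcached job> is the job label under which they are scraped; the exact metric name can differ between exporter versions.

sum by (instance) (
  rate(
    memcached_items_evicted_total{job="<memcached job>"}[5m]
  )
)

A sustained, non-zero eviction rate suggests that the caches are undersized for the longer time ranges being queried.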

Query results look different from what you would expect

A user reports that they submit a query and they get a result back, but the result looks different from what they’d expect. This could happen due to a wide variety of issues, but one common issue is that the storage-schema or storage-aggregation configurations are incorrect.

  1. A wrong storage-aggregation configuration can lead to results that look incorrect. A very common mistake is that the aggregation method used is not appropriate for the type of data that gets horizontally aggregated in the query handling.

For example, a metric of type gauge should never be summed when horizontally aggregated, because the resulting values will not be useful.

To read more about the storage-aggregation configuration refer to storage aggregations.

  2. If the target interval that is determined based on the storage-schema configuration is shorter than the interval of the raw data stored in Grafana Mimir, this can also lead to results that may be wrong in various ways, depending on the Graphite functions used in the query.

For more information about the storage-schema configuration refer to storage schemas.