How to keep Ingress NGINX Controller metric volumes manageable and still meaningful
The Ingress NGINX Controller is a widely used Kubernetes component for managing HTTP and HTTPS traffic routing. While it provides powerful observability through Prometheus metrics, it’s also notorious for generating an excessively high number of time series.
The root cause lies in how the controller labels its metrics—tracking requests across multiple dimensions such as ingress name, host, path, status code, and upstream response times. When combined with high-traffic workloads, multi-tenant environments, or large-scale Kubernetes clusters, this granularity leads to an explosion of unique metric series.
In this blog post, we’ll explore the key factors contributing to this metric growth and practical strategies to mitigate the issue while retaining essential observability signals. We’ll also look at the tools available in Grafana Cloud that can help reduce metrics clutter and costs.
Factors contributing to a high volume of metrics
Let’s start by reviewing the aspects of the Ingress-NGINX metrics design that potentially lead to high cardinality.
Multiple labels per request
Request metrics include labels for attributes like HTTP path, status code, method, host, and ingress name. Each label that has many possible values multiplies the total number of metric series. For example, the controller labels HTTP request counters and histograms by path, status, and method. This is done by default, which significantly affects the total number of time series.
In practice, any real environment has many distinct URLs, status codes, and methods. This can cause histogram metrics like `nginx_ingress_controller_request_duration_seconds` to explode into thousands of series, because each bucket for each path+status+method combination becomes a separate time series.
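Before changing anything, it helps to measure where the series actually come from. A couple of ad hoc PromQL queries can do that; the label names below (`status`, `method`) follow the controller's defaults and may differ in your setup.

```promql
# How many active series each ingress-nginx metric contributes
sort_desc(count by (__name__) ({__name__=~"nginx_ingress_controller_.*"}))

# How many unique status/method combinations exist for the request counter
count(count by (status, method) (nginx_ingress_controller_requests))
```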
Per-host metrics
By default, the controller labels request metrics with the hostname of the request. In multi-tenant clusters or those serving many domains, each host adds a new label value.
For wildcard host ingresses (when an Ingress is configured to catch all subdomains), there is an intentional safeguard to prevent the metrics from exploding in cardinality: the controller avoids emitting per-host metrics, since the potential host label values are unbounded. However, if per-host metrics are enabled for such traffic, a wildcard ingress could generate metrics for every unique host seen, potentially leading to tens of thousands of series if clients use varying subdomains.
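A quick way to check whether the host label is already a problem in your cluster is to count its distinct values (assuming the default `host` label name):

```promql
# Number of distinct host values currently reported by the request counter
count(count by (host) (nginx_ingress_controller_requests))
```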
Path-label granularity
Historically, the controller included the exact request path as a metric label, but this was changed due to cardinality concerns. Today, Ingress-NGINX only uses the configured ingress path rules (not the full request URL) for metrics, and by default it disables per-path metrics beyond the root path. As a result, unless you explicitly configure multiple path rules in an Ingress (which, in practice, is not uncommon), all requests might be reported under a single generic path label (e.g., path="/"). This is why many users only see "/" as the path in metrics, regardless of the actual request URL. It's a safety trade-off: limiting path labels prevents a metric explosion at the cost of losing visibility into individual sub-paths.
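If you do need visibility into a handful of well-known sub-paths, one option is to declare them as explicit path rules, since each configured rule shows up as its own path label value. Below is a minimal sketch with hypothetical host and service names; the point is to keep the list of rules short and deliberate.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app            # hypothetical name
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com    # hypothetical host
      http:
        paths:
          # Each configured path rule becomes one "path" label value in the
          # request metrics, so keep this list short and intentional.
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-svc  # hypothetical Service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc  # hypothetical Service
                port:
                  number: 80
```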
Duplicate metrics from replicas
The Ingress controller is almost always run with multiple replicas for high availability. Each controller pod produces its own set of metrics, typically distinguished by a label like `controller_pod`. This means that if you scale the deployment to five pods, you will see roughly five times the number of metrics: one set per pod.
All controllers report on all ingresses (in the default deployment model), so the same counters are repeated for each instance. While the `controller_pod` label lets you break down metrics by instance, it also multiplies the total number of time series stored in Prometheus. Until the architecture changes to centralize metric collection, horizontal scaling inherently increases the metric count linearly with the number of controller pods.
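If you mostly chart per-ingress traffic, you can aggregate the replica dimension away at query time. A minimal sketch, assuming the default `controller_pod` label name:

```promql
# Per-ingress request rate, summed across controller replicas so that scaling
# the Deployment does not multiply the series you actually chart
sum without (controller_pod, instance) (rate(nginx_ingress_controller_requests[5m]))
```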
In summary, the metric count grows multiplicatively with the number of unique label values. As your cluster and ingress rules scale up, this quickly balloons.
Strategies to mitigate metrics explosion (while retaining observability)
Thankfully, there are several approaches to reduce the number of metrics produced by Ingress-NGINX without losing all useful visibility. Here are some recommended strategies:
Utilize command line arguments
Ingress-NGINX offers several command-line flags to turn off or coarsen high-cardinality labels:
- Use `--metrics-per-host=false` to disable per-host metrics. This stops labeling metrics by the request's hostname. You'll still get metrics per Ingress object (and overall totals), but multiple hosts on the same Ingress are aggregated. This drastically cuts the series count in multi-host environments, at the cost of not distinguishing traffic by host.
- Keep the path label disabled. This is the default behavior; unless you have a strong need to see metrics split by URL path, avoid enabling any deprecated option for per-path metrics. The controller deliberately reports all traffic on an ingress rule as one path to prevent Prometheus overload. If you truly need path-level insight, consider defining only a limited set of known paths as separate ingress rules, or use logging or tracing for more granular URL data instead.
- Aggregate status codes by class. In newer versions, you can run the controller with `--report-status-classes=true` so that it reports HTTP responses as 2xx, 3xx, 4xx, and 5xx groups instead of every individual code. You get one time series each for "success," "redirection," "client error," and "server error" per ingress (four categories), rather than potentially dozens of series (200, 201, 204, 301, 302, 404, 500, 502, etc.). This dramatically reduces cardinality on metrics like `nginx_ingress_controller_requests` without losing high-level error rate visibility.
- Control bucket configuration for histograms. Several built-in options let you control histogram buckets: `--bucket-factor` and `--max-buckets` for native histograms, or `--time-buckets`, `--length-buckets`, and `--size-buckets` for the "classic" ones. (If you install the controller with Helm, see the sample values after this list for one way to set these flags.)
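If you deploy the controller with the official ingress-nginx Helm chart, these flags can be passed through `controller.extraArgs`. The sketch below is illustrative and assumes a recent chart and controller version; flag availability and accepted values depend on your release, so verify them against the controller's `--help` output before applying.

```yaml
# values.yaml for the ingress-nginx Helm chart (illustrative; verify flags
# against your controller version)
controller:
  metrics:
    enabled: true
  extraArgs:
    # Stop labeling request metrics by hostname
    metrics-per-host: "false"
    # Report 2xx/3xx/4xx/5xx classes instead of individual status codes
    report-status-classes: "true"
    # Optionally coarsen the duration histogram (comma-separated seconds;
    # pick buckets that match your latency SLOs)
    time-buckets: "0.1,0.5,1,5,10"
```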
Aggregate metrics to reduce cardinality
A powerful approach to keep metric growth in check is to use Prometheus recording rules or external tools to pre-aggregate high-cardinality data. For example, you might record per-ingress totals by summing over hosts and paths, and use those in dashboards instead of raw metrics. The idea is to decompose metrics by fewer dimensions. Grafana provides multiple tools to work with such aggregations, which we’ll cover in the next section.
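As a sketch of what such pre-aggregation can look like, here are two hypothetical Prometheus recording rules that keep only the ingress, namespace, and status dimensions; adjust the label names to whatever your scrape configuration actually produces.

```yaml
groups:
  - name: ingress-nginx-aggregations   # hypothetical rule group
    rules:
      # Per-ingress request rate, with host, path, and replica labels dropped
      - record: ingress:nginx_ingress_controller_requests:rate5m
        expr: |
          sum by (ingress, namespace, status) (
            rate(nginx_ingress_controller_requests[5m])
          )
      # Per-ingress p95 latency derived from the (large) duration histogram
      - record: ingress:nginx_ingress_controller_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (ingress, namespace, le) (
              rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])
            )
          )
```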
Focus on key metrics
Lastly, it may sound very obvious, but it’s worth reviewing which metrics are truly essential for your observability goals. You might not need every single metric the controller provides. Community best practices suggest focusing on a core set (e.g., request rate, success/error rate, latency percentiles, connection count) and tolerating less visibility on rarely-used labels.
If certain labels like method or detailed status are not useful for your monitoring or alerting, drop them. Similarly, you might disable a subset of metrics entirely, such as response size histograms, if you never chart them. By pruning the metric set down to the vital signals, you retain observability where it counts and avoid overloading your systems with low-value data.
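At the scrape level, whole metric families you never use can be dropped before they are ever stored. A minimal sketch using Prometheus `metric_relabel_configs`; the job name and the metrics chosen here are assumptions, and note that removing individual labels at scrape time can collide series, so label reduction is usually better done with the controller flags or with aggregation.

```yaml
scrape_configs:
  - job_name: ingress-nginx-metrics     # assumed job name
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Drop histogram families you never chart, e.g. request/response sizes
      - source_labels: [__name__]
        regex: nginx_ingress_controller_(request|response)_size_.*
        action: drop
```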
By applying a combination of the above strategies, you can drastically reduce the number of time series generated by the controller while still capturing the essential performance and traffic data. For instance, you might run the controller with host metrics off and status codes aggregated, and set up a Prometheus recording rule to sum per-ingress traffic.
The outcome is a slimmed-down metrics footprint that covers overall throughput, error rates, and latencies for each ingress—sufficient for alerting and dashboards—without the combinatorial explosion of every host/path/status.
How to use Grafana Cloud to control metric growth
While all the strategies we just discussed could be implemented on the controller or Prometheus level, Grafana Cloud also offers purpose-built tools that are highly effective in mitigating cardinality issues from the NGINX ingress controller and keeping ingestion costs under control.
Cardinality management dashboards
If you want to optimize metric cardinality, you first need to understand which metrics are generating most of the volume. Grafana Cloud provides a set of cardinality management dashboards that help you analyze how metrics and labels are distributed across the time series data you send to Grafana Cloud Metrics. The dashboards also surface usage information, so you can see which of the metrics you're storing in Grafana Cloud are actually being used.
Adaptive Metrics
Adaptive Metrics is a cardinality optimization feature that lets you identify unused time series and eliminate them through aggregation. Recommended rules identify which metrics to aggregate based on usage within your cloud environment. Using the information surfaced by the cardinality management dashboards, you can also add rules manually, even for active series that are still used in your Grafana stack.
Kubernetes Monitoring Helm chart
Kubernetes Monitoring in Grafana Cloud uses a plug-and-play Helm chart that provides multiple features to optimize the data scraping process. For instance, with the autoDiscovery feature, you can control which Kubernetes objects to scrape. You can also increase the scrape interval for particular targets to reduce the number of data points collected.
Find the right balance for managing metric cardinality
High metric cardinality is a known challenge when monitoring Kubernetes ingress controllers. Ingress-NGINX provides rich metrics, but if left unchecked, they can overwhelm Prometheus and impact performance. Understanding how these metrics are generated (and which labels drive cardinality) is the first step. From there, you can make informed decisions to disable unnecessary labels, enable built-in optimizations (like status code classes), and implement aggregation to control the data volume.
The key is to strike a balance between observability and cost/complexity. You need to retain the metrics that give you insight into your system’s health and SLIs, while trimming away or consolidating those that are too granular. By following recommendations and best practices like the ones outlined in this blog post, you can keep your metrics manageable and meaningful. This will ensure your monitoring remains effective and your Ingress-NGINX controller runs efficiently, even as your cluster grows.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!