A cardinality explosion can cause problems: it comes as a surprise, it adds noise, and it can increase your costs or degrade the performance of your systems.
Over the past year, we’ve improved our time series storage systems so that under normal use, high cardinality is no longer an issue. But as the operator of an observability platform, you should have the tools you need to protect that infrastructure.
That’s why Grafana Labs has created our cardinality management dashboards: a set of 3 dashboards to help Grafana Cloud users (Pro and Advanced) keep track of their metrics cardinality.
Our cardinality management dashboards give you the ability to analyze your data from a broad perspective to a more targeted view. The idea behind these dashboards is to start with the overview dashboard and then drill down to more detailed information regarding specific metrics or labels. We have three dashboards for this tool:
- Overview dashboard
- Metrics dashboard
- Labels dashboard
“A dashboard says more than 1,000 words, and that is truly the case with cardinality management dashboards,” says Nestlé Tech Lead Reza Farshkaran. “These dashboards not only provide a great way to monitor and manage cardinality but also help my team explain cardinality to our colleagues.”
The overview dashboard shows you cardinality information across metrics and labels for a single selected data source. Use this dashboard to find out whether a data source has a high number of series and to get a good idea of where to start looking for the origin of high cardinality. You can click on the metric names and the label names in this dashboard to go directly to the metric-specific and label-specific dashboards, respectively.
Start thinking about your top 10 metrics and your top 10 labels and ask yourself if you could drop some labels because they provide you with information you are not actively using (i.e., you’re not grouping or filtering by these labels and don’t plan on doing so in the future). Next, ask yourself the same question for the entire metric. Does the metric spark joy or is it gathering dust in your closet?
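To make the effect of dropping a label concrete, here is a minimal sketch that counts how many distinct series remain once one or more labels are removed. The sample series, the metric names, and the helper `series_count_without` are hypothetical and only illustrate the idea; they are not part of the dashboards.

```python
# Hypothetical sample data: each series is a dict of label name -> value,
# with the metric name stored under "__name__" as in Prometheus.
series = [
    {"__name__": "http_requests_total", "status": "200", "pod": "a"},
    {"__name__": "http_requests_total", "status": "200", "pod": "b"},
    {"__name__": "http_requests_total", "status": "500", "pod": "a"},
    {"__name__": "http_requests_total", "status": "500", "pod": "b"},
    {"__name__": "queue_depth", "pod": "a"},
]


def series_count_without(series, dropped_labels):
    """Count the distinct series left after removing the given label names."""
    dropped = set(dropped_labels)
    remaining = {
        # A frozenset of (name, value) pairs identifies one series.
        frozenset((name, value) for name, value in s.items() if name not in dropped)
        for s in series
    }
    return len(remaining)


before = series_count_without(series, [])
after = series_count_without(series, ["pod"])
print(f"series before: {before}, after dropping 'pod': {after}")
```

Series that differ only in the dropped label collapse into one, which is why removing an unused label reduces the series count.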
From a label perspective, it can be worth reducing the number of values per label. For example, imagine you have a web service that returns all kinds of status codes. There are around 50. You will probably never encounter all of them, and you might not even need the full detail: for your specific use case, you could decide to store only the class of the status code, which brings you down to 5 possible values. Or you could store only the 6 most common specific status codes you care about and group all the others under a single catch-all label value.
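Both reductions can be sketched in a few lines. This is only an illustration: the set of kept codes, the fallback value `"other"`, and the function names are assumptions chosen for the example, not values prescribed by the dashboards.

```python
def status_class(code: int) -> str:
    """Collapse a status code to its class, for example 404 -> "4xx"."""
    return f"{code // 100}xx"


# Hypothetical choice of the six specific codes worth keeping.
KEPT_CODES = frozenset({200, 301, 400, 404, 500, 503})


def status_or_other(code: int, kept=KEPT_CODES, fallback: str = "other") -> str:
    """Keep the codes you care about; map every other code to one fallback value."""
    return str(code) if code in kept else fallback


# All codes from 100 to 599 collapse into five classes.
classes = {status_class(code) for code in range(100, 600)}
print(sorted(classes))
```

With the second approach, the label can take at most seven values: the six kept codes plus the fallback.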
An even better solution for reducing the number of values per label or removing the full label could be storing the data of this example as logs by sending the log lines to Grafana Loki. This allows you to reduce the cardinality by dropping the label or the entire metric completely, while still being able to correlate the information to other metrics and aggregate the information to create metrics dashboards with the full information in a much more cost-efficient way.
To summarize, you have 3 options:
- Keep the metric/label because you care about it.
- Remove the metric/label either because you don’t need it or because the information is available in logs or could be made available in logs.
- Reduce the number of values of a certain label.
The metrics dashboard helps you understand the cardinality of an individual metric. At the top of the dashboard, you can select a data source and the metric you want to explore.
Considering the example above with the HTTP status codes, you might find that the label tracking the status code has an especially high number of values and decide to act on it as outlined above: keep, remove, or reduce.
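The kind of per-metric breakdown described here can be approximated as follows. This is a sketch under assumptions: the sample series and the helper `label_value_counts` are hypothetical, and the real dashboard may compute its figures differently.

```python
# Hypothetical sample data: each series is a dict of label name -> value,
# with the metric name stored under "__name__" as in Prometheus.
series = [
    {"__name__": "http_requests_total", "status": "200", "method": "GET"},
    {"__name__": "http_requests_total", "status": "404", "method": "GET"},
    {"__name__": "http_requests_total", "status": "500", "method": "POST"},
    {"__name__": "queue_depth", "queue": "emails"},
]


def label_value_counts(series, metric):
    """Map each label name used by `metric` to its number of distinct values."""
    values = {}
    for s in series:
        if s.get("__name__") != metric:
            continue
        for name, value in s.items():
            if name != "__name__":
                values.setdefault(name, set()).add(value)
    return {name: len(vals) for name, vals in values.items()}


counts = label_value_counts(series, "http_requests_total")
# List the labels with the most distinct values first.
for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(name, count)
```

The label at the top of this list is the first candidate for the keep, remove, or reduce decision.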
The labels dashboard shows a cardinality report for the selected label. For a given data source and label name, it shows you which label values are attached to the most series. It also shows you the highest cardinality metrics for a given label-value pair.
This view is especially useful for showing you the most common or the most important label values. It can also help you identify imbalances in how a label value is distributed across the metrics that use it. You might have decided to keep a certain label without reducing the number of values because it is important to you for specific use cases. However, in the labels dashboard you might identify metrics that use this label but do not provide value to you. Consider removing the entire metric, or the label for that specific metric, to reduce cardinality.
Another use for this dashboard is helping you understand at a high level where your series are coming from. Let’s say that you use the common label environment across all your metrics to denote whether they’re coming from production, test, or development environments. If you use the labels dashboard to explore the environment label, you’ll be able to see what fraction of series are being generated by your prod, test, and dev environments. Maybe you notice that 50% of your series are coming from dev, which you did not expect. So you may focus on trimming metrics coming from dev.
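The fraction of series per label value can be sketched like this. The sample series and the helper `series_share_by_value` are hypothetical; they only mirror the kind of breakdown described above.

```python
from collections import Counter

# Hypothetical sample data: four series tagged with an environment label.
series = [
    {"__name__": "http_requests_total", "environment": "dev"},
    {"__name__": "queue_depth", "environment": "dev"},
    {"__name__": "http_requests_total", "environment": "prod"},
    {"__name__": "http_requests_total", "environment": "test"},
]


def series_share_by_value(series, label):
    """Return the fraction of series attached to each value of `label`."""
    # Series that do not carry the label at all are ignored.
    counts = Counter(s[label] for s in series if label in s)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


shares = series_share_by_value(series, "environment")
for value, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{value}: {share:.0%}")
```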
Other organizations may use labels like job to identify the source of a metric. Again, using this dashboard to look at these labels can help you understand your distribution of series. Maybe you realize that application-1 is sending 90% of your series. Or that team-a is sending 5x more series than team-b even though they’re half the size.
Keeping cardinality under control
To keep your cardinality under control, the most important thing is to understand which metrics and labels are useful for you and your teams. Don’t store more information than you need. If you find yourself in a situation where you need more information, you can always introduce a new label or a new label value later. The three cardinality dashboards help you to easily dig into that information and help you get the data you need to make the decision of keeping, removing, or reducing.
Let’s do a final example combining everything we have learned so far. Assume we have a metric with 3 labels. In real-life scenarios, it’s often the case that a specific label value only allows specific values of another label. For the sake of keeping this example simple, let’s assume that each label has 10 values, and each combination of these 3 times 10 label values can occur. This leaves us with potentially 10 x 10 x 10 = 1,000 series. Using the cardinality dashboards methodically as described in this article, we could do a couple of things. The table below shows some examples of actions taken, their impact on cardinality, their business impact, and the cost savings of this metric associated with the actions taken.
While this is a rather generic example, you can still see that small changes can make a fairly big impact.
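The arithmetic behind this example is simple to reproduce: the maximum number of series for a metric is the product of the number of values of each label. The two actions below (dropping one label, and halving the values of one label) are hypothetical illustrations and not taken from the article's own figures.

```python
from math import prod


def max_series(values_per_label):
    """Upper bound on series for one metric: product of values per label."""
    return prod(values_per_label)


baseline = max_series([10, 10, 10])         # three labels, ten values each
drop_one_label = max_series([10, 10])       # hypothetical: remove one label
halve_one_label = max_series([10, 10, 5])   # hypothetical: reduce one label to 5 values

for name, count in [
    ("baseline", baseline),
    ("drop one label", drop_one_label),
    ("reduce one label to 5 values", halve_one_label),
]:
    reduction = 1 - count / baseline
    print(f"{name}: {count} series ({reduction:.0%} fewer than baseline)")
```

Because the label counts multiply, removing or shrinking a single label reduces the total by a constant factor rather than by a fixed amount.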
We hope that Grafana’s new cardinality dashboards are useful, and we are looking forward to getting your feedback on how you use them and how you think they can improve to help keep your cardinality under control.
If you’re not already using Grafana Cloud — the easiest way to get started with observability — sign up now for a free 14-day trial of Grafana Cloud Pro, with unlimited metrics, logs, traces, and users, long-term retention, and premium team collaboration features.