What recent optimizations in the Prometheus storage engine (TSDB) will enable in the future
At the recent PromCon Online, I gave a review of developments in the Prometheus storage engine, TSDB. In this blog post I am going to recap parts of the talk and add more insight into what these developments will enable us to do in the future. While the talk covered some of the near-future features, here I will be diving even further ahead.
You can watch the talk here:
Present and near future
The TSDB was already well optimized, or at least we thought it was. The past year included a whole lot of work on this front, including cutting memory consumption to less than half through a series of improvements to loaded blocks and the memory-mapping of head chunks, and reducing restart times by more than half. We have realized that there is still a lot of room for improvement.
One piece of work leads to another, sometimes by reusing the same idea and sometimes by enabling another idea to be implemented.
Take, for example, the reduction in memory used by the postings offset table. Applying the same concept to other places, such as the memory occupied by symbols in a block, led to a reduction in the memory used by loaded blocks. This one is particularly useful because, as time passes and data is retained for longer, Prometheus would otherwise consume more and more memory for the blocks; with these optimizations, that growth is no longer an issue. A sketch of the underlying idea is below.
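To illustrate the general technique (a simplified sketch of the concept, not the actual Prometheus implementation; all names here are hypothetical): instead of keeping every entry of a sorted on-disk table in memory, keep only every Nth entry, binary-search that sample, and scan the on-disk (in practice, memory-mapped) table for the rest.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sampledIndex keeps only every Nth key of a sorted table in memory.
// Lookups binary-search the in-memory sample, then scan the table
// from that position. This trades a short sequential scan per lookup
// for holding only a fraction of the table on the heap.
type sampledIndex struct {
	sampleKeys []string // every Nth key, kept in memory
	sampleOffs []int    // position of each sampled key in the full table
	table      []string // stands in for the memory-mapped on-disk table
}

func newSampledIndex(table []string, n int) *sampledIndex {
	idx := &sampledIndex{table: table}
	for i := 0; i < len(table); i += n {
		idx.sampleKeys = append(idx.sampleKeys, table[i])
		idx.sampleOffs = append(idx.sampleOffs, i)
	}
	return idx
}

func (idx *sampledIndex) lookup(key string) (int, bool) {
	// Find the last sampled key <= key.
	i := sort.SearchStrings(idx.sampleKeys, key)
	if i == len(idx.sampleKeys) || idx.sampleKeys[i] != key {
		if i == 0 {
			return 0, false // key sorts before the whole table
		}
		i--
	}
	// Scan forward from the sampled position (at most N-1 extra reads).
	for j := idx.sampleOffs[i]; j < len(idx.table); j++ {
		switch strings.Compare(idx.table[j], key) {
		case 0:
			return j, true
		case 1: // passed where the key would be; it is not present
			return 0, false
		}
	}
	return 0, false
}

func main() {
	table := []string{"a", "b", "c", "d", "e", "f", "g"}
	idx := newSampledIndex(table, 3) // keep only every 3rd entry in memory
	pos, ok := idx.lookup("e")
	fmt.Println(pos, ok) // 4 true
}
```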
Continuing the same theme, snapshotting of in-memory chunks for faster restarts would have been inefficient if not for the memory-mapping of head chunks to disk. A chunk is a compressed pack of up to 120 samples of a series. With memory-mapping, every time a chunk is full it is flushed to disk, and only a reference to it is stored in memory to access the chunk when needed, reducing memory usage. Since this leaves only one open chunk per series in memory, taking snapshots of the in-memory data during shutdown becomes much more efficient, and we have decided to add it. This is bound to reduce the entire restart turnaround time by 80% or more.
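Here is a minimal sketch of that flush-on-full idea, assuming a plain chunks file and raw sample bytes for simplicity. The types and helpers are hypothetical, not the actual Prometheus ones, and the real implementation reads flushed chunks back through a memory map rather than explicit reads.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// chunkRef is all that stays in memory for a completed chunk:
// where it lives in the chunks file and how long it is.
type chunkRef struct {
	offset int64
	length int
}

// memSeries keeps exactly one open chunk in memory per series;
// completed chunks are flushed to disk and replaced by references.
type memSeries struct {
	head []byte     // the open in-memory chunk (here: raw sample bytes)
	refs []chunkRef // flushed chunks, read back on demand
}

const samplesPerChunk = 120 // a full chunk packs 120 samples

// append adds a sample; when the chunk is full it is written to the
// chunks file and only a small reference is retained in memory.
func (s *memSeries) append(f *os.File, sample byte) error {
	s.head = append(s.head, sample)
	if len(s.head) < samplesPerChunk {
		return nil
	}
	off, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if _, err := f.Write(s.head); err != nil {
		return err
	}
	s.refs = append(s.refs, chunkRef{offset: off, length: len(s.head)})
	s.head = s.head[:0] // start a fresh in-memory chunk
	return nil
}

// readChunk fetches a flushed chunk back from disk (a real
// implementation would read through a memory-mapped region instead).
func readChunk(f *os.File, ref chunkRef) ([]byte, error) {
	buf := make([]byte, ref.length)
	_, err := f.ReadAt(buf, ref.offset)
	return buf, err
}

func main() {
	f, _ := os.CreateTemp("", "chunks")
	defer os.Remove(f.Name())

	var s memSeries
	for i := 0; i < 300; i++ { // 300 samples: 2 flushed chunks + 60 in memory
		if err := s.append(f, byte(i)); err != nil {
			panic(err)
		}
	}
	fmt.Println(len(s.refs), "chunks on disk,", len(s.head), "samples in memory")
}
```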
Another important topic is the lack of a tool for bulk-importing data. Yes, you heard that right. Work on this was started by Dipack Panjabi a year ago, and many iterations have been made on it since. We finally reached a consensus during the recent dev summit on how to allow bulk imports and which formats to support. Bartek Plotka, a Prometheus maintainer, has taken the lead on finishing the tool, which would support OpenMetrics and a well-defined CSV format.
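For a sense of the input involved, data to be bulk-imported in the OpenMetrics text format would look something like this (timestamps are Unix seconds, and the spec requires the trailing # EOF marker):

```
# HELP http_requests Total number of HTTP requests.
# TYPE http_requests counter
http_requests_total{code="200"} 1027 1595590800
http_requests_total{code="500"} 3 1595590800
http_requests_total{code="200"} 1033 1595590860
# EOF
```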
You can see the trend here: the work on the TSDB is in full force, and there are no signs of it slowing down. With the momentum the TSDB has gained in adding optimizations and features, we are opening up to experimenting more. Snapshotting of in-memory chunks for faster restarts is one example.
Further into the future
Let’s dive further into the future now and look at some of the ways the TSDB could grow.
Histograms. While Prometheus supports histograms, they are far from awesome. The cost per bucket is so high that users have to pick an appropriate bucket layout carefully in advance: every bucket is its own time series, so a histogram with 20 buckets across 50 label combinations already amounts to over 1,000 series. Even then, the accuracy of quantile estimations is often insufficient, and partitioning a histogram along certain label dimensions can quickly become prohibitively expensive. Since histograms are Björn Rabenstein’s favorite topic, he is currently working on a master plan to greatly improve them in Prometheus.
Talking about chunk encodings: Prometheus 1.x supported various encodings tailored to different data patterns. In Prometheus 2.x all of those encodings were removed, and it runs solely on a variant of Facebook’s Gorilla compression. As one compression scheme does not work best for all patterns, it is time to start experimenting with more chunk encodings in Prometheus 2.x. Though this has not yet been widely discussed with the community or the Prometheus team, it’s definitely under consideration. New chunk encodings would open up a whole new set of disk-space optimizations, which would trickle down into query optimizations. Interestingly, Ben Kochie mentioned supporting int64/uint64 sample values (the TSDB uses only float64), and supporting multiple chunk encodings would enable this easily.
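To give a flavour of why no single encoding fits all data: Gorilla-style compression XORs each value’s float bits against the previous value, so constant or slowly-moving series compress to almost nothing while noisy series barely compress at all. A rough sketch of that effect (not the real chunk encoder; xorCost is a made-up helper):

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorCost returns the number of meaningful bits left after XOR-ing a
// value against its predecessor, a rough proxy for how many bits a
// Gorilla-style encoder would have to store for this sample.
func xorCost(prev, cur float64) int {
	x := math.Float64bits(prev) ^ math.Float64bits(cur)
	if x == 0 {
		return 0 // identical value: a single control bit in practice
	}
	// Gorilla stores only the block between the leading and trailing zeros.
	return 64 - bits.LeadingZeros64(x) - bits.TrailingZeros64(x)
}

func main() {
	flat := []float64{1, 1, 1, 1}           // e.g. a constant gauge
	noisy := []float64{1.0, 3.7, 2.1, 9.42} // irregular values

	for _, series := range [][]float64{flat, noisy} {
		total := 0
		for i := 1; i < len(series); i++ {
			total += xorCost(series[i-1], series[i])
		}
		fmt.Printf("%v -> ~%d meaningful bits\n", series, total)
	}
}
```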
A little more into the future, shall we?
How about constant memory usage by TSDB? And compactions that happen in a few seconds?
Techniques exist for constant memory usage: for example, compact the head block once it reaches some X MB. But that is not feasible with the existing compaction and indexing method, which requires recalculating the entire index from scratch when we merge multiple blocks. If we had a streaming index and streaming chunks, compacting two blocks together would be as simple as concatenating the two index files and the two chunk files without much computation, as the toy example below illustrates. But this is not as simple as it looks and would require some research. It is something we can consider tackling in the future.
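A toy illustration of why such concatenation would be cheap when chunk references are plain file offsets: merging two chunk “files” only requires rebasing the second block’s offsets by the first file’s length, with no re-encoding of chunk data. This glosses over the genuinely hard parts, such as series that appear in both blocks, and every type here is made up.

```go
package main

import "fmt"

// block is a toy stand-in: a flat chunk "file" plus an index mapping
// series names to offsets into that file.
type block struct {
	chunks []byte
	index  map[string]int64 // series -> offset of its chunk data
}

// concatCompact merges b2 into b1 by appending the chunk file and
// rebasing b2's offsets by len(b1.chunks): no chunk is re-encoded
// and no index is rebuilt from scratch.
func concatCompact(b1, b2 block) block {
	out := block{
		chunks: append(append([]byte{}, b1.chunks...), b2.chunks...),
		index:  map[string]int64{},
	}
	for s, off := range b1.index {
		out.index[s] = off
	}
	rebase := int64(len(b1.chunks))
	for s, off := range b2.index {
		out.index[s] = off + rebase // same bytes, shifted position
	}
	return out
}

func main() {
	b1 := block{chunks: []byte("AAAA"), index: map[string]int64{"up": 0}}
	b2 := block{chunks: []byte("BBBB"), index: map[string]int64{"up2": 0}}
	merged := concatCompact(b1, b2)
	fmt.Println(string(merged.chunks), merged.index) // AAAABBBB map[up:0 up2:4]
}
```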
Epilogue
The improvements made in the past year are just the beginning. There’s much more to come, both for TSDB and Prometheus in general. Before PromCon Online, the Prometheus team had a dev summit; Richard “RichiH” Hartmann blogged about what was discussed and decided. So we’ll have various other enhancements and features to share in the near future.
I will also be speaking about the memory-mapping of head chunks to disk and the snapshotting of chunks for faster restarts at KubeCon EU (Virtual) 2020. Make sure you attend!