The concise guide to Loki: How to work with out-of-order and older logs
For this week’s installment of “The concise guide to Loki,” I’d like to focus on an interesting topic in Grafana Loki’s history: ingesting out-of-order logs.
Those who’ve been with the project a while may remember a time when Loki would reject any logs that were older than a log line it had already received. It was certainly a nice simplification to Loki’s internals, but it was also a big inconvenience for a lot of real-world use cases.
It was such an inconvenience that we decided to solve this problem a few years ago. At the time, Owen Diehl wrote a nice blog post about this in which he detailed the engineering considerations that went into the solution. It’s worth a read if you’d like to know the nitty-gritty of how the solution was built, but the high-level summary goes something like this: Engineering is always full of tradeoffs, and we tried to balance performance, operational cost, and functionality to produce a solution we thought worked best for Loki and its users.
In today’s post I’d like to talk about those tradeoffs a bit — not so much why we made them, but how they impact the way you use Loki. I’ll use some animations to help visualize how this works and explain what you should expect when ingesting older logs into Loki. So buckle up and prepare yourself for: “The concise guide to ordering!”
How out-of-order ingestion works
As I described in the first post in this series, Loki’s design is focused on being fast and low cost at ingestion time. It also generally expects to be working with recent data (i.e., data timestamped within a few seconds to maybe a few minutes of the current time). As a result, our primary goal with the out-of-order solution was to make sure that Loki accepted and stored delayed logs within a reasonable amount of time.
We defined “reasonable amount of time” to be one hour. Technically it’s defined as max_chunk_age/2, but you should be running a max_chunk_age of two hours like we do at Grafana Labs (more on this later). Loki’s rule for out-of-order log ingestion works like this:
Any log with a timestamp within one hour of the most recent log received for a stream will be accepted and stored. Any log more than one hour older than the most recent log received for a stream will be rejected with an error that reads: “Entry too far behind.”
There are two really important concepts in the statement above that cause the most confusion around how Loki accepts older logs: “most recent log” and “for a stream.”
As the name implies, “most recent log” refers to the log received by Loki with the most recent timestamp. Sounds simple enough, right? Well, the confusion comes in when you pair that with “for a stream.” That’s because Loki determines the most recent log on a stream-by-stream basis, so you can’t automatically assume your streams are all aligned to the same time.
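To make the per-stream rule concrete, here is a minimal Python sketch of it. This is not Loki’s actual implementation: the `StreamTracker` class and its `push` method are hypothetical names invented for illustration, and the one-hour window assumes a max_chunk_age of two hours.

```python
from datetime import datetime, timedelta

# Assumed values from this post: max_chunk_age of two hours,
# so the out-of-order window is max_chunk_age / 2 = one hour.
MAX_CHUNK_AGE = timedelta(hours=2)
OUT_OF_ORDER_WINDOW = MAX_CHUNK_AGE / 2


class StreamTracker:
    """Hypothetical helper: tracks the most recent timestamp per stream."""

    def __init__(self, window=OUT_OF_ORDER_WINDOW):
        self.window = window
        self.most_recent = {}  # stream labels -> newest timestamp seen so far

    def push(self, stream, ts):
        """Return True if the entry would be accepted, False if too far behind."""
        newest = self.most_recent.get(stream)
        if newest is not None and ts < newest - self.window:
            return False  # corresponds to the "Entry too far behind" error
        # Accepted: only move the stream's newest timestamp forward, never back.
        if newest is None or ts > newest:
            self.most_recent[stream] = ts
        return True


tracker = StreamTracker()
noon = datetime(2024, 1, 1, 12, 0)
app_a = '{app="a"}'
app_b = '{app="b"}'

results = [
    tracker.push(app_a, noon),                          # first entry for stream a
    tracker.push(app_a, noon - timedelta(minutes=30)),  # 30 minutes behind: accepted
    tracker.push(app_a, noon - timedelta(minutes=90)),  # 90 minutes behind: rejected
    tracker.push(app_b, noon - timedelta(minutes=90)),  # separate stream: accepted
]
print(results)  # [True, True, False, True]
```

Note that the same 90-minute-old entry is rejected for one stream and accepted for the other, because each stream keeps its own “most recent log.”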
Perhaps some more poorly built animations could be helpful here! First, let’s look at a single stream:
Now let’s extend that example to include multiple streams:
Hopefully this helps clarify how Loki receives older logs!
There are a few important configurations related to how Loki handles ordering of log entries, so let’s take a look at those next.
Ingesting older data
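In the limits_config section of Loki’s configuration, there are two settings that control the ingestion of older data: reject_old_samples, which defaults to true, and reject_old_samples_max_age, which defaults to one week (1w).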
Interestingly these default values are largely tied to Loki’s legacy of being able to use Amazon DynamoDB as an index type. With DynamoDB, Loki used to change the throughput provisioning on older tables to save cost, which meant it could no longer accept data for older time ranges.
Since Loki 2.0, which was released several years ago, we have moved away from these external index types. If you want to send Loki logs older than a week, or from any time period for that matter, feel free to set reject_old_samples: false or make the window larger.
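As a rough illustration of what this age check amounts to, here is a small Python sketch. The function name `accepts_old_sample` and its parameters are hypothetical, and the one-week default is an assumption based on the “older than a week” behavior described above; Loki’s real validation code is more involved.

```python
from datetime import datetime, timedelta, timezone


def accepts_old_sample(ts, now, reject_old_samples=True,
                       max_age=timedelta(weeks=1)):
    """Hypothetical check: is an entry with timestamp ts recent enough to ingest?"""
    if not reject_old_samples:
        return True  # age check disabled: logs from any time period pass
    return ts >= now - max_age


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
three_days_old = now - timedelta(days=3)
one_month_old = now - timedelta(days=30)

print(accepts_old_sample(three_days_old, now))  # True
print(accepts_old_sample(one_month_old, now))   # False
print(accepts_old_sample(one_month_old, now, reject_old_samples=False))  # True
print(accepts_old_sample(one_month_old, now, max_age=timedelta(days=60)))  # True
```

This check is separate from the out-of-order window: an entry that passes the age check can still be rejected if it is too far behind the most recent log in its stream.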
In the opening I mentioned that the window for out-of-order data is max_chunk_age / 2. The max_chunk_age setting can be found in the ingester config.
The default here is two hours. I strongly discourage increasing this value to create a larger out-of-order window. At Grafana Labs we do not run Loki this way so I can’t easily tell you what challenges you may face. I recommend trying to find a way to use labels to separate the streams so they can be ingested separately.
If you do decide to ignore my advice, be sure to increase the value of query_ingesters_within to match your max_chunk_age to make sure this data is queryable.
Querying caveats for ingesting older data
There is an important consideration to be aware of when ingesting older data into Loki. Specifically, if you want Loki to ingest data older than two hours from the current time, this data will not be immediately queryable.
This is because of the configuration we mentioned in the last section: query_ingesters_within. Perhaps it’s a somewhat confusingly named configuration, but it essentially means “only query ingesters within this time window from the current time.” The default here is three hours; any query for data within three hours of the current time will be sent to the ingesters, anything outside that window will not.
Loki is built this way because we know the max_chunk_age an ingester will allow is two hours. Therefore, we can save significant wear and tear on the ingesters by not asking them for data we know they won’t have. In other words, if you run a query for logs from yesterday, we don’t ask the ingester for them because it shouldn’t have logs from yesterday.
The problem you may see here is that if you are sending Loki logs from yesterday, then those logs are in the ingester until they are flushed to storage (which is at most two hours). Until they are flushed, they will not return in a query result.
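The decision described here can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Loki’s querier code: `should_query_ingesters` is a hypothetical function, and it assumes the only thing that matters is whether the end of the query range falls inside the query_ingesters_within window.

```python
from datetime import datetime, timedelta

QUERY_INGESTERS_WITHIN = timedelta(hours=3)  # default mentioned in this post


def should_query_ingesters(query_start, query_end, now,
                           within=QUERY_INGESTERS_WITHIN):
    """Hypothetical model: ask ingesters only if the range reaches the window."""
    # query_start is kept for readability; only the end of the range decides.
    return query_end >= now - within


now = datetime(2024, 1, 2, 12, 0)

# A query over the last hour reaches into the window, so ingesters are asked.
recent = should_query_ingesters(now - timedelta(hours=1), now, now)

# A query for all of yesterday ends 12 hours ago, so ingesters are skipped,
# even if yesterday's logs were pushed a minute ago and are not yet flushed.
yesterday = should_query_ingesters(datetime(2024, 1, 1, 0, 0),
                                   datetime(2024, 1, 2, 0, 0), now)

# Widening the window to 48 hours sends that same query to the ingesters.
widened = should_query_ingesters(datetime(2024, 1, 1, 0, 0),
                                 datetime(2024, 1, 2, 0, 0), now,
                                 within=timedelta(hours=48))

print(recent, yesterday, widened)  # True False True
```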
You could change this configuration value to something larger, like 48 hours, so you can query your logs from yesterday as they are ingested today. But I really advise against this as you will be forcing Loki to query the ingesters for all queries within 48 hours. This forces your ingesters to do much more work, making them more expensive to operate. It also hurts performance in the normal scenario where they won’t have any data beyond two hours.
If you are ingesting old data for backfill operations, I suggest just waiting the two hours for everything to flush. If you have a normal operation where you are ingesting older logs and don’t like waiting, then you might consider changing query_ingesters_within, but realize this may have some negative operational tradeoffs on cost and performance.
I hope this helped explain how both out-of-order and older log ingestion work in Loki. We really tried to strike the right balance between supporting Loki’s typical use case and accommodating other use cases, all without increasing cost or sacrificing performance, at least as best we could.
Come back again next week for another installment in our Loki guide!