How labels in Loki can make log queries faster and easier

Published: 21 Apr 2020

For the majority of the first year that we worked on the Loki project, the questions and feedback seemed to come from people who were familiar with Prometheus. After all, Loki is like Prometheus – but for logs!

Recently, however, we are seeing more people trying out Loki who have no Prometheus experience, and many are coming from systems with very different strategies for working with logs. This brings with it a lot of questions about one very important Loki concept that even Prometheus experts will want to learn more about: Labels!

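This post is going to cover a lot of ground to help everyone who is new to Loki and anyone who wants a refresher course. We will dig into the following topics: what a label is, how Loki uses labels, cardinality, optimal Loki performance with parallelization, and best practices.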

What is a label?

Labels are key-value pairs and can be defined as anything! We like to refer to them as metadata to describe a log stream. If you are familiar with Prometheus, there are a few labels you are used to seeing like job and instance, and I will use those in the coming examples.

The scrape configs we provide with Loki define these labels, too. If you are using Prometheus, having consistent labels between Loki and Prometheus is one of Loki’s superpowers, making it incredibly easy to correlate your application metrics with your log data.

How Loki uses labels

Labels in Loki perform a very important task: They define a stream. More specifically, the combination of every label key and value defines the stream. If just one label value changes, this creates a new stream.

If you are familiar with Prometheus, the term used there is series; however, Prometheus has an additional dimension: metric name. Loki simplifies this in that there are no metric names, just labels, and we decided to use streams instead of series.

Let’s take an example:

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: syslog
      __path__: /var/log/syslog

This config will tail one file and assign one label: job=syslog. You could query it like this:

{job="syslog"}

This will create one stream in Loki.

Now let’s expand the example a little:

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: syslog
      __path__: /var/log/syslog
 - job_name: apache
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      __path__: /var/log/apache.log

Now we are tailing two files. Each file gets just one label with one value, so Loki will now be storing two streams.

We can query these streams in a few ways:

{job="apache"} <- show me logs where the job label is apache
{job="syslog"} <- show me logs where the job label is syslog
{job=~"apache|syslog"} <- show me logs where the job is apache **OR** syslog

In that last example, we used a regex label matcher to select the log streams whose job label has either of two values. Now consider how an additional label could also be used:

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: syslog
      env: dev
      __path__: /var/log/syslog
 - job_name: apache
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      env: dev
      __path__: /var/log/apache.log

Now instead of a regex, we could do this:

{env="dev"} <- will return all logs with env=dev, in this case this includes both log streams

Hopefully now you are starting to see the power of labels. By using a single label, you can query many streams. By combining several different labels, you can create very flexible log queries.
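To make the matching behavior concrete, here is a minimal Python sketch of how a selector could pick streams. It is not Loki's implementation: the names STREAMS and select are hypothetical, and it assumes regex matchers must match the whole label value, as they do in Prometheus-style selectors.

```python
import re

# Each stream is identified by its full set of label key/value pairs.
STREAMS = [
    {"job": "syslog", "env": "dev"},
    {"job": "apache", "env": "dev"},
]

def select(streams, equals=None, regex=None):
    """Return the streams whose labels satisfy every matcher.

    equals: {label: exact value}  (like job="apache")
    regex:  {label: pattern}      (like job=~"apache|syslog")
    """
    equals = equals or {}
    regex = regex or {}
    matched = []
    for labels in streams:
        ok_eq = all(labels.get(k) == v for k, v in equals.items())
        # fullmatch models the assumption that the pattern is fully anchored.
        ok_re = all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items())
        if ok_eq and ok_re:
            matched.append(labels)
    return matched

apache_only = select(STREAMS, equals={"job": "apache"})      # one stream
either_job = select(STREAMS, regex={"job": "apache|syslog"})  # both streams
all_dev = select(STREAMS, equals={"env": "dev"})              # both streams
```

The last two calls return the same streams, which is the point of the env label: one equality matcher replaces the regex.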

Labels are the index to Loki’s log data. They are used to find the compressed log content, which is stored separately as chunks. Every unique combination of labels and values defines a stream, and logs for a stream are batched up, compressed, and stored as chunks.

For Loki to be efficient and cost-effective, we have to use labels responsibly. The next section will explore this in more detail.

Cardinality

The two previous examples use statically defined labels with a single value; however, there are ways to dynamically define labels. Let’s take a look using the Apache log and a massive regex you could use to parse such a log line:

11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"

 - job_name: system
   pipeline_stages:
      - regex:
          expression: "^(?P<ip>\\S+) (?P<identd>\\S+) (?P<user>\\S+) \\[(?P<timestamp>[\\w:/]+\\s[+\\-]\\d{4})\\] \"(?P<action>\\S+)\\s?(?P<path>\\S+)?\\s?(?P<protocol>\\S+)?\" (?P<status_code>\\d{3}|-) (?P<size>\\d+|-)\\s?\"?(?P<referer>[^\"]*)\"?\\s?\"?(?P<useragent>[^\"]*)?\"?$"
      - labels:
          action:
          status_code:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      env: dev
      __path__: /var/log/apache.log

This regex matches every component of the log line and extracts the value of each component into a capture group. Inside the pipeline code, this data is placed in a temporary data structure so it can be used for several purposes while that log line is processed; once processing finishes, the temporary data is discarded. Much more detail about this can be found in the Promtail pipelines documentation.

From that regex, we will be using two of the capture groups to dynamically set two labels based on content from the log line itself:

  • action (e.g. action="GET", action="POST")
  • status_code (e.g. status_code="200", status_code="400")

And now let’s walk through a few example lines:

11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"

In Loki the following streams would be created:

{job="apache",env="dev",action="GET",status_code="200"} 11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
{job="apache",env="dev",action="POST",status_code="200"} 11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
{job="apache",env="dev",action="GET",status_code="400"} 11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
{job="apache",env="dev",action="POST",status_code="400"} 11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"

Those four log lines would become four separate streams and start filling four separate chunks.

Any additional log lines that match those combinations of label/values would be added to the existing stream. If another unique combination of labels comes in (e.g. status_code="500"), another new stream is created.
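The way dynamic labels fan lines out into streams can be sketched in a few lines of Python. This is an illustration only, not Promtail's code: LINE_RE is a cut-down stand-in for the full regex that captures just the two label fields, the user agents are shortened, the fifth sample line is made up to show a repeat combination, and falling back to the static labels for unmatched lines is a choice made for this sketch.

```python
import re
from collections import defaultdict

# Simplified pattern: capture only the two fields that become labels.
LINE_RE = re.compile(r'"(?P<action>\S+) \S+ \S+" (?P<status_code>\d{3}) ')

STATIC_LABELS = {"job": "apache", "env": "dev"}

def group_into_streams(lines):
    """Map each unique label set to the list of lines it receives."""
    streams = defaultdict(list)
    for line in lines:
        labels = dict(STATIC_LABELS)
        m = LINE_RE.search(line)
        if m is not None:
            labels.update(m.groupdict())
        # A sorted tuple of pairs is hashable, so it can act as the stream key.
        streams[tuple(sorted(labels.items()))].append(line)
    return streams

LINES = [
    '11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0"',
    '11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0"',
    '11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0"',
    '11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0"',
    # Hypothetical extra line: same labels as the first, so no new stream.
    '11.11.11.15 - frank [25/Jan/2000:14:00:05 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0"',
]

streams = group_into_streams(LINES)
print(len(streams))  # 4 streams from 5 lines
```

Five lines produce four streams here because the last line repeats the GET/200 combination and joins the stream that already exists.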

Imagine now if you set a label for ip. Not only does every request from a user become a unique stream, but every request with a different action or status_code from the same user will also get its own stream.

Doing some quick math, if there are maybe four common actions (GET, PUT, POST, DELETE) and maybe four common status codes (although there could be more than four!), this would be 16 streams and 16 separate chunks. Now multiply this by every user if we use a label for ip. You can quickly have thousands or tens of thousands of streams.
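The arithmetic is just a product of the number of values each label can take. A small Python sketch, where the figure of 1,000 distinct client IPs is an assumption picked purely for illustration:

```python
import math

def max_streams(*values_per_label):
    """Upper bound on streams: the product of distinct values per label."""
    return math.prod(values_per_label)

actions = 4        # GET, PUT, POST, DELETE
status_codes = 4   # four common status codes
unique_ips = 1000  # assumed number of distinct client IPs

without_ip = max_streams(actions, status_codes)
with_ip = max_streams(actions, status_codes, unique_ips)
print(without_ip, with_ip)  # 16 16000
```

This is an upper bound, since not every combination will necessarily appear in real traffic, but it shows how quickly one unbounded label multiplies everything else.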

This is high cardinality. This can kill Loki.

When we talk about cardinality we are referring to the combination of labels and values and the number of streams they create. High cardinality is using labels with a large range of possible values, such as ip, or combining many labels, even if they have a small and finite set of values, such as using status_code and action.

High cardinality causes Loki to build a huge index (read: $$$$) and to flush thousands of tiny chunks to the object store (read: slow). Loki currently performs very poorly in this configuration and will be the least cost-effective and least fun to run and use.

Optimal Loki performance with parallelization

Now you may be asking: If using lots of labels or labels with lots of values is bad, how am I supposed to query my logs? If none of the data is indexed, won’t queries be really slow?

As we see people using Loki who are accustomed to other index-heavy solutions, it seems like they feel obligated to define a lot of labels in order to query their logs effectively. After all, many other logging solutions are all about the index, and this is the common way of thinking.

When using Loki, you may need to forget what you know and look to see how the problem can be solved differently with parallelization. Loki’s superpower is breaking up queries into small pieces and dispatching them in parallel so that you can query huge amounts of log data in small amounts of time.

This kind of brute force approach might not sound ideal, but let me explain why it is.

Large indexes are complicated and expensive. Often a full-text index of your log data is the same size or bigger than the log data itself. To query your log data, you need this index loaded, and for performance, it should probably be in memory. This is difficult to scale, and as you ingest more logs, your index gets larger quickly.

Now let’s talk about Loki, where the index is typically an order of magnitude smaller than your ingested log volume. So if you are doing a good job of keeping your streams and stream churn to a minimum, the index grows very slowly compared to the ingested logs.

Loki will effectively keep your static costs as low as possible (index size and memory requirements as well as static log storage) and make the query performance something you can control at runtime with horizontal scaling.

To see how this works, let’s look back at our example of querying your access log data for a specific IP address. We don’t want to use a label to store the IP. Instead we use a filter expression to query for it:

{job="apache"} |= "11.11.11.11"

Behind the scenes, Loki will break up that query into smaller pieces (shards), and open up each chunk for the streams matched by the labels and start looking for this IP address.

The size of those shards and the amount of parallelization is configurable and based on the resources you provision. If you want to, you can configure the shard interval down to 5m, deploy 20 queriers, and process gigabytes of logs in seconds. Or you can go crazy and provision 200 queriers and process terabytes of logs!
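As a rough mental model of the brute-force approach, the Python sketch below compresses batches of lines into "chunks" and then scans them concurrently for a substring. Everything in it is a stand-in: make_chunk, scan_chunk, and parallel_filter are hypothetical names, zlib is used only because it ships with Python, and a thread pool on one machine replaces the separate querier processes a real deployment would use.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def make_chunk(lines):
    """Compress a batch of log lines, standing in for a stored chunk."""
    return zlib.compress("\n".join(lines).encode("utf-8"))

def scan_chunk(chunk, needle):
    """Decompress one chunk and return the lines containing needle."""
    text = zlib.decompress(chunk).decode("utf-8")
    return [line for line in text.split("\n") if needle in line]

def parallel_filter(chunks, needle, workers=4):
    """Scan every chunk concurrently and merge the matching lines."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input chunks.
        results = pool.map(scan_chunk, chunks, [needle] * len(chunks))
        return [line for matches in results for line in matches]

chunks = [
    make_chunk(["11.11.11.11 GET /a", "11.11.11.12 GET /b"]),
    make_chunk(["11.11.11.11 POST /c"]),
]
hits = parallel_filter(chunks, "11.11.11.11")
```

Nothing in this model is indexed beyond the choice of which chunks to open; adding workers is the only knob for making the scan faster, which mirrors the trade-off described above.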

This trade-off of smaller index and parallel brute force querying vs. a larger/faster full-text index is what allows Loki to save on costs versus other systems. The cost and complexity of operating a large index is high and is typically fixed – you pay for it 24 hours a day if you are querying it or not.

The benefits of this design mean you can make the decision about how much query power you want to have, and you can change that on demand. Query performance becomes a function of how much money you want to spend on it. Meanwhile, the data is heavily compressed and stored in low-cost object stores like S3 and GCS. This drives the fixed operating costs to a minimum while still allowing for incredibly fast query capability!

Best practices

Loki is under active development, and we are constantly working to improve performance. But here are some of the most current best practices for labels that will give you the best experience with Loki.

1. Static labels are good

Things like host, application, and environment are great labels. They will be fixed for a given system/app and have bounded values. Use static labels to make it easier to query your logs in a logical sense (e.g. show me all the logs for a given application and specific environment, or show me all the logs for all the apps on a specific host).

2. Use dynamic labels sparingly

Too many label value combinations lead to too many streams. The penalties for that in Loki are a large index and small chunks in the store, which in turn can actually reduce performance.

To avoid those issues, don’t add a label for something until you know you need it! Use filter expressions (|= "text", |~ "regex", …) and brute force those logs. It works – and it’s fast.

From early on, we have set a label dynamically using Promtail pipelines for level. This seemed intuitive for us, as we often wanted to only show logs for level="error"; however, we are re-evaluating this now, as writing the query {app="loki"} |= "level=error" is proving to be just as fast for many of our applications as {app="loki",level="error"}.

This may seem surprising, but if applications have medium to low volume, that label causes one application’s logs to be split into up to five streams, which means 5x chunks being stored. And loading chunks has an overhead associated with it. Imagine now if that query were {app="loki",level!="debug"}. That would have to load way more chunks than {app="loki"} != "level=debug".

Above, we mentioned not to add labels until you need them, so when would you need labels? A little farther down is a section on chunk_target_size. If you set this to 1MB (which is reasonable), this will try to cut chunks at 1MB compressed size, which is about 5MB-ish of uncompressed logs (might be as much as 10MB depending on compression). If your logs have sufficient volume to write 5MB in less time than max_chunk_age, or many chunks in that timeframe, you might want to consider splitting it into separate streams with a dynamic label.

What you want to avoid is splitting a log file into streams that result in chunks getting flushed because the stream is idle or hits the max age before being full. As of Loki 1.4.0, there is a metric that can help you understand why chunks are flushed:

sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="dev"}[1m]))

It’s not critical that every chunk be full when flushed, but it will improve many aspects of operation. As such, our current guidance here is to avoid dynamic labels as much as possible and instead favor filter expressions. For example, don’t add a level dynamic label; just use |= "level=debug" instead.

3. Label values must always be bounded

If you are dynamically setting labels, never use a label which can have unbounded or infinite values. This will always result in big problems for Loki.

Try to keep values bounded to as small a set as possible. We don’t have perfect guidance as to what Loki can handle, but think single digits, or maybe tens of values for a dynamic label. This is less critical for static labels. For example, if you have 1,000 hosts in your environment it’s going to be just fine to have a host label with 1,000 values.

4. Be aware of dynamic labels applied by clients

Loki has several client options: Promtail (which also supports systemd journal ingestion and TCP-based syslog ingestion), Fluentd, Fluent Bit, a Docker plugin, and more!

Each of these comes with ways to configure what labels are applied to create log streams. But be aware of what dynamic labels might be applied. Use the Loki series API to get an idea of what your log streams look like and see if there might be ways to reduce streams and cardinality. Details of the series API can be found in the Loki HTTP API documentation, or you can use logcli to query Loki for series information.
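Once you have a list of streams, a quick per-label count shows where the cardinality comes from. The Python sketch below assumes the series information has already been fetched and parsed into a list of label maps, one per stream; the sample data and the name label_cardinality are hypothetical.

```python
from collections import defaultdict

def label_cardinality(series):
    """Count distinct values per label across a list of stream label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

# Hypothetical sample: one dict of labels per stream.
sample = [
    {"job": "apache", "env": "dev", "status_code": "200"},
    {"job": "apache", "env": "dev", "status_code": "400"},
    {"job": "syslog", "env": "dev"},
]

report = label_cardinality(sample)
# Labels with the most distinct values are the first candidates to drop.
worst_first = sorted(report.items(), key=lambda item: item[1], reverse=True)
```

A label whose count keeps growing over time is usually one that a client is setting dynamically from unbounded data.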

5. Configure caching

Loki can cache data at many levels, which can drastically improve performance. Details of this will be in a future post.

6. Logs must be in increasing time order per stream

One issue many people have with Loki is their client receiving errors for out of order log entries. This happens because of this hard and fast rule within Loki:

  • For any single log stream, logs must always be sent in increasing time order. If a log is received with a timestamp older than the most recent log received for that stream, that log will be dropped.

There are a few things to dissect from that statement. The first is this restriction is per stream. Let’s look at an example:

{job="syslog"} 00:00:00 i’m a syslog!
{job="syslog"} 00:00:01 i’m a syslog!

If Loki received these two lines which are for the same stream, everything would be fine. But what about this case:

{job="syslog"} 00:00:00 i’m a syslog!
{job="syslog"} 00:00:02 i’m a syslog!
{job="syslog"} 00:00:01 i’m a syslog!  <- Rejected out of order!

Uh-oh … but what can we do about this? What if this was because the sources of these logs were different systems? We can solve this with an additional label which is unique per system:

{job="syslog", instance="host1"} 00:00:00 i’m a syslog!
{job="syslog", instance="host1"} 00:00:02 i’m a syslog!
{job="syslog", instance="host2"} 00:00:01 i’m a syslog!  <- Accepted, this is a new stream!
{job="syslog", instance="host1"} 00:00:03 i’m a syslog!  <- Accepted, still in order for stream 1
{job="syslog", instance="host2"} 00:00:02 i’m a syslog!  <- Accepted, still in order for stream 2
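The per-stream nature of the rule is easy to model. The toy Python class below is a sketch of the rule as stated above, not Loki's ingester code: ToyIngester and push are hypothetical names, and timestamps are plain integers (seconds) for simplicity.

```python
class ToyIngester:
    """Toy model of the per-stream ordering rule."""

    def __init__(self):
        self.newest = {}  # stream key -> newest accepted timestamp

    def push(self, labels, ts):
        """Return True if the entry is accepted, False if it is dropped."""
        key = tuple(sorted(labels.items()))
        last = self.newest.get(key)
        if last is not None and ts < last:
            return False  # older than the newest entry in this stream
        self.newest[key] = ts
        return True

# One shared stream: the third entry arrives out of order.
single = ToyIngester()
single_results = [single.push({"job": "syslog"}, ts) for ts in (0, 2, 1)]

# An instance label gives each host its own stream and its own ordering.
split = ToyIngester()
split_results = [
    split.push({"job": "syslog", "instance": host}, ts)
    for host, ts in (("host1", 0), ("host1", 2), ("host2", 1), ("host1", 3), ("host2", 2))
]
```

In the first run the entry with timestamp 1 is dropped; in the second every entry is accepted, because each host only has to be in order relative to itself.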

But what if the application itself generated logs that were out of order? Well, I’m afraid this is a problem. If you are extracting the timestamp from the log line with something like the Promtail timestamp pipeline stage, you could instead skip that step and let Promtail assign a timestamp to the log lines. Or you can hopefully fix it in the application itself.

But I want Loki to fix this! Why can’t you buffer streams and re-order them for me?! To be honest, because this would add a lot of memory overhead and complication to Loki, and as has been a common thread in this post, we want Loki to be simple and cost-effective. Ideally we would want to improve our clients to do some basic buffering and sorting as this seems a better place to solve this problem.

It’s also worth noting that the batching nature of the Loki push API can lead to some out-of-order errors that are really false positives. For example, if a batch partially succeeded and is then retried, the entries that were already accepted come back as out-of-order errors, while anything new is accepted.

7. Use chunk_target_size

This was added earlier this year when we released v1.3.0 of Loki, and we’ve been experimenting with it for several months. We have chunk_target_size: 1536000 in all our environments now. This instructs Loki to try to fill all chunks to a target compressed size of 1.5MB. These larger chunks are more efficient for Loki to process.

A couple other config variables affect how full a chunk can get. Loki has a default max_chunk_age of 1h and chunk_idle_period of 30m to limit the amount of memory used as well as the exposure of lost logs if the process crashes.

Depending on the compression used (we have been using snappy, which has less compressibility but faster performance), you need 5-10x the chunk size, or 7.5-15MB of raw log data, to fill a 1.5MB chunk. Remembering that a chunk is per stream, the more streams you break up your log files into, the more chunks that sit in memory, and the higher the likelihood they get flushed by hitting one of those timeouts mentioned above before they are filled.
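A back-of-the-envelope calculation shows whether a stream is likely to fill its chunk before max_chunk_age expires. The Python sketch below uses the 1536000-byte target and the 1h max age from the text; the compression ratio and the per-stream log rates are assumptions for illustration, and chunk_idle_period is not modeled.

```python
CHUNK_TARGET_SIZE = 1536000  # bytes, compressed
MAX_CHUNK_AGE = 3600         # seconds (1h default)

def seconds_to_fill(raw_bytes_per_second, compression_ratio,
                    target_bytes=CHUNK_TARGET_SIZE):
    """Time for one stream to write enough raw data to fill a chunk."""
    raw_needed = target_bytes * compression_ratio
    return raw_needed / raw_bytes_per_second

def fills_before_max_age(raw_bytes_per_second, compression_ratio,
                         max_age=MAX_CHUNK_AGE):
    """True if the chunk reaches its target size before max_chunk_age."""
    return seconds_to_fill(raw_bytes_per_second, compression_ratio) <= max_age

# Assumed example streams at 5x compression.
slow = seconds_to_fill(1000, 5)    # 1 kB/s of raw logs
fast = seconds_to_fill(10000, 5)   # 10 kB/s of raw logs
```

With these assumed numbers the slow stream needs 7,680 seconds, so it is flushed at max age while still under target size, and the fast stream fills its chunk in 768 seconds. Splitting the fast stream into ten with a dynamic label would put each piece back in the slow category.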

Lots of small, unfilled chunks are currently kryptonite for Loki. We are always working to improve this and may consider a compactor to improve this in some situations. But, in general, the guidance should stay about the same: Try your best to fill chunks!

If you have an application that can log fast enough to fill these chunks quickly (much less than max_chunk_age), then it becomes more reasonable to use dynamic labels to break that up into separate streams.

Summary

I’m going to close by beating this dead horse one last time!

Favor parallelization for performance, not labels and the index

Be stringent with labels. Static labels are generally fine, but dynamic labels should be used sparingly. (Or not at all!) If your log streams are writing at 5-10MB a minute, then consider how a dynamic label could split that into two or three streams, which can improve query performance. If your volume is less, stick to filter expressions.

The index is not necessarily the path to performance in Loki! Prioritize parallelization and LogQL query filtering first.

Remember: Loki requires a different way of thinking when compared to other log storage solutions. We are optimizing Loki for fewer streams and a smaller index which helps fill larger chunks that are easier to query via parallelization.

We are actively improving Loki and investigating ways to do so. Be sure to keep checking back in as the Loki story unfolds, and we all figure out how to make the best of this really effective tool!

Want to learn more about Loki?

Sign up for our Intro to Loki webinar scheduled for Wednesday, April 22 at 9:30am PT/16:30 UTC. The agenda includes an overview of how Loki works, basic configs and setup to run Loki and test it out, how to use Loki from Grafana, an introduction to querying, and a Q&A with Loki team members.
