An (only slightly technical) introduction to Loki, the Prometheus-inspired open source logging system
Every application creates logs. Web servers, firewalls, services on your Kubernetes clusters, public cloud services, and more. For companies, being able to collect and analyze these logs is crucial. And the growing popularity of microservices, IoT, cybersecurity, and cloud has brought an explosion of new types of log data. That’s why log management is a huge $2-billion-plus market that’s growing 14% YoY.
Key log analysis use cases
There are a lot of important questions that companies want to ask their logs. These questions can be roughly grouped into 5 key use cases.
Debugging and troubleshooting. Helping development, DevOps, and SRE teams get quick answers to questions like:
Why did my application crash?
Why can’t customers reach my service?
Why is my Kubernetes pod restarting constantly?
Monitoring. A lot of folks use Prometheus and Graphite metrics for monitoring, but you can also create metrics from your logs so you can monitor the error rate of a website and get alerted if it goes above a threshold, for example.
Cybersecurity. Hardly a day goes by without news of a large breach. Real-time log analysis can detect hacks in progress, and logs can help in forensics to understand which servers were compromised.
Compliance. Regulated industries need to keep audit logs for 3-5 years to comply with industry regulations. Also, local law enforcement can require retention and access to company logs.
Business intelligence. Creating actionable insights from business data is an evergreen use case. For example, logs can help you understand conversion rates from advertisement channels to your company’s website.
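To make the monitoring use case above a bit more concrete, here is a minimal Python sketch of turning log lines into an error-rate metric and checking it against a threshold. It assumes access log lines in which the HTTP status code directly follows the quoted request; the function names and the 5% default threshold are hypothetical choices for illustration, not part of any particular tool.

```python
import re

# Assumption: the status code is the 3-digit number right after the quoted request,
# e.g. ... "GET /path HTTP/1.1" 500 17 ...
STATUS_RE = re.compile(r'" (\d{3}) ')


def error_rate(lines):
    """Return the fraction of parseable lines with a 5xx status (0.0 if none parse)."""
    total = 0
    errors = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access log entries
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0


def should_alert(lines, threshold=0.05):
    """Return True when the 5xx error rate is strictly above the threshold."""
    return error_rate(lines) > threshold
```

In practice you would run a check like this over a sliding time window of recent log lines rather than over a fixed list.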
Current log management solutions
In this huge $2B+ market, there are a lot of vendors. The most popular are Splunk (closed source with a huge amount of functionality out of the box) and Elasticsearch (mainly open source with a lot of features, although some assembly is required). Public cloud solutions like Amazon CloudWatch and Google Stackdriver are sensible defaults provided by your cloud provider. And there are a lot of other players like Graylog, Sumo Logic, and Loggly.
But a huge field of existing players doesn’t mean that log management is a solved problem.
Top 3 challenges with traditional log management tools
1. Hard to operate at scale
As companies store an increasing amount of logs, traditional solutions that were not originally designed for petabyte scale tend to become hard to operate at such high volumes. A good example is Elasticsearch, which started as a full-text search index to support sub-second, complex query results. Although there have been a lot of improvements over time, it’s still nontrivial for most companies to operate at scale. And that impacts the TCO negatively, as you need a larger operational team to keep things running smoothly.
2. Expensive to run
Aside from the people cost, traditional log management tools tend to be resource hungry. Think fast CPU, a lot of RAM, and plenty of SSDs. That also adds to the TCO. And to top it off, the licensing costs of proprietary solutions like Splunk can be truly eye-watering.
3. Doesn’t correlate well with Prometheus metrics
Traditional solutions were created before the rise of Prometheus as the standard open source metrics monitoring solution for cloud native and Kubernetes. As the Prometheus query language and data model are very different from, for example, the Elasticsearch or Splunk query languages, this makes correlating metrics and logs harder. A lot of Prometheus users are using Grafana as their visualization and exploration tool, and these differences require them to rebuild query context every time.
Loki is different
These key challenges with traditional log aggregation solutions led Tom Wilkie, Grafana Labs VP of Product, to create Grafana Loki in early 2018. This 100% open source log aggregation tool takes a unique approach to log management, as it only indexes a small bit of metadata from every logline. By design, the Loki data model and query language are also extremely similar to those of Prometheus. These characteristics result in the following benefits for its users:
(Several) orders of magnitude cheaper than traditional solutions to operate and run. (At Grafana Labs, our Hosted Logs solution powered by Loki is priced at 0.50 USD per GB with 1-month retention.)
Horizontally scalable – think petabyte scale.
Great fit with Kubernetes, Prometheus, and visualization within Grafana.
These properties make it a great fit for the debugging and troubleshooting use case. And that resonated within the open source community. Since its first release, Loki has received an enormous amount of attention and a huge influx of production deployments.
Does Loki only index the metadata?
So what does “only indexing the metadata” mean? Let’s compare how Elasticsearch and Loki typically index a single logline from a web server. Here’s our example logline:
10.185.248.71 - - [09/Jan/2015:19:12:06 +0000] 808840 "GET /inventoryService/purchaseItem?userId=20253471&itemId=23434300 HTTP/1.1" 500 17 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
Elasticsearch parses the full string, including high cardinality fields like userId, and stores all field values in a big index. The benefit of this approach is that you can execute complex queries very fast, but the downside is that the index is sometimes even bigger than the original data. That’s inherent to the architecture of Elasticsearch: it is built on the Lucene search engine project, which is designed for low write / high read scenarios. That is, it puts a lot of effort into writing to make reading efficient. However, that’s the opposite of logs, which are generally high write / low read. Making Elasticsearch scale for high volume log aggregation is doable, but requires a lot of expensive computing resources and a high level of operational complexity.
Example of how Elasticsearch typically indexes the logline:
Timestamp = 09/Jan/2015:19:12:06 +0000
Client_ip = 10.185.248.71
URI = /inventoryService/purchaseItem
UserId = 20253471
itemId = 23434300
Method = GET
Size = 808840
Duration = 17
HTTPStatus = 500
Protocol = HTTP/1.1
Agent = Apache-HttpClient/4.2.6 (java 1.5)
Msg = “10.185.248.71 - - [09/Jan/2015:19:12:06 +0000] 808840 "GET /inventoryService/purchaseItem?userId=20253471&itemId=23434300 HTTP/1.1" 500 17 "-" "Apache-HttpClient/4.2.6 (java 1.5)”
Example of how Loki would index the same logline:
Timestamp = 09/Jan/2015:19:12:06 +0000
Method = GET
HTTPStatus = 500
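The difference between the two indexing styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression is a hypothetical pattern written for the layout of the example logline, the field names are made up for this sketch, and neither function reflects how Elasticsearch or Loki actually implement ingestion.

```python
import re
from urllib.parse import parse_qsl

EXAMPLE = (
    '10.185.248.71 - - [09/Jan/2015:19:12:06 +0000] 808840 '
    '"GET /inventoryService/purchaseItem?userId=20253471&itemId=23434300 HTTP/1.1" '
    '500 17 "-" "Apache-HttpClient/4.2.6 (java 1.5)"'
)

# Hypothetical pattern matching the example logline's layout; real formats vary.
LOG_RE = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] (?P<size>\d+) '
    r'"(?P<method>\S+) (?P<uri>[^?\s]+)(?:\?(?P<query>\S*))? (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<duration>\d+) "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumption: only these low cardinality fields become labels.
LABEL_KEYS = ("method", "status")


def full_index_fields(line):
    """Full-text style: extract every field, including high cardinality ones."""
    match = LOG_RE.match(line)
    if match is None:
        raise ValueError("line does not match the expected layout")
    fields = match.groupdict()
    query = fields.pop("query") or ""
    fields.update(parse_qsl(query))  # adds userId and itemId as separate fields
    fields["msg"] = line  # the raw line is kept alongside the parsed fields
    return fields


def label_index_fields(line):
    """Label style: keep only the timestamp and a few low cardinality fields."""
    fields = full_index_fields(line)
    labels = {key: fields[key] for key in LABEL_KEYS}
    return {"timestamp": fields["timestamp"], **labels}
```

For the example logline, `full_index_fields(EXAMPLE)` yields twelve entries, while `label_index_fields(EXAMPLE)` yields three, which mirrors the two listings above.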
Loki only indexes a few low cardinality fields upfront, so the Loki index is very small. We can query the logs by filtering them by time range and indexed fields (called labels in Loki; check out this recent blog post on Loki labels best practices) and then scanning the remaining set of log lines using substring search or regular expressions. This turns out to be more efficient in the long run and perfectly suitable for the debugging and troubleshooting use case, where you need to find a needle in the proverbial haystack and correlate your Prometheus metrics with your logs.
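The query flow described above (narrow down by labels and time range first, then scan what is left) can be sketched as a toy in-memory store. This is not Loki's storage engine or API: the class name, method names, and the use of integer timestamps are all assumptions made for illustration.

```python
import re
from collections import defaultdict


class TinyLogStore:
    """Toy store: only label sets are indexed; log lines are kept unparsed."""

    def __init__(self):
        # Maps a frozen set of (label, value) pairs to a list of (timestamp, line).
        self._streams = defaultdict(list)

    def push(self, labels, timestamp, line):
        """Append a line to the stream identified by its label set."""
        self._streams[frozenset(labels.items())].append((timestamp, line))

    def query(self, selector, start, end, contains=None, regex=None):
        """Select streams by labels, then scan lines within [start, end)."""
        wanted = set(selector.items())
        pattern = re.compile(regex) if regex else None
        results = []
        for stream_labels, entries in self._streams.items():
            if not wanted <= stream_labels:
                continue  # cheap step: skip whole streams using the small index
            for timestamp, line in entries:
                if not (start <= timestamp < end):
                    continue
                if contains is not None and contains not in line:
                    continue  # brute-force substring scan over the remaining lines
                if pattern is not None and not pattern.search(line):
                    continue
                results.append((timestamp, line))
        return sorted(results)
```

A query such as `store.query({"method": "GET", "status": "500"}, start, end, contains="userId=20253471")` only touches the streams that carry both labels, so the expensive line-by-line scan runs over a much smaller set of lines.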
Getting started with Grafana Loki
If Grafana Loki’s approach to log aggregation is piquing your interest, check out my (slightly more technical) how-to videos to get you started with Grafana Loki:
Getting started with Grafana Loki - under 4 minutes
Getting started with Grafana Loki on Google Kubernetes Engine - under 5 minutes
How to ship logs to Grafana Loki with Promtail, FluentD & Fluent-bit
A closer look at the new Grafana Metrics and Logs correlations features
For an introduction to Grafana Loki that goes deeper on the architectural decisions, I recommend our latest webinar on Loki.
Big shoutout to Christine Wang, Dave Russell, Julie Dam, Malcolm Holmes, Marcus Olsson, Peter Štibraný, and David Dorman for their help with this blog post.