How Loki Correlates Metrics and Logs -- And Saves You Money
The situation is all too familiar: You get an alert. You look at your metrics and your dashboards to try and find out what the cause might be and when the incident actually started (instead of when the alert happened). Then you have to go somewhere else to look at logs because eventually you need more data.
“The main problem with this is that you don’t have one single UI,” explained Grafana Labs Software Engineer Callum Styan at Devopsdays in Vancouver. “You have to jump all over the place to find the data you actually want.”
To add more layers to an already complicated process, there are likely several cross-functional teams who manage data in different ways. “In most orgs, there are probably multiple tools for the same thing. One team might use Splunk while another employs ElasticSearch,” said Styan.
When it comes to accessing logs in ephemeral environments, “some of us are infrastructure engineers but your application engineers might not know how to do that,” said Styan. “What if they want to grep, but they don’t know where to go to get the logs?”
With existing log aggregation solutions, “it’s a pain to even find the logs for the service that I care about,” said Styan. “I would really rather dump the logs or log file to standard out and grep through it.”
To address these problems, the Grafana Labs team designed Loki around three goals:
1. Cheaper than current options.
“We wanted it to be cheaper than other options – SaaS or otherwise,” said Styan.
Unlike other aggregation systems, Loki does not build an inverted index of all log contents, an approach that can ultimately double the amount of data that needs to be stored. “We only index data that we get from service discovery,” explained Styan. “With Loki, internally we have petabytes of data, and the index is less than one percent of the log data size.”
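To make the difference concrete, here is a minimal Python sketch that contrasts a label-only index with a full-text inverted index over the same log lines. It is an illustration only, not how Loki is implemented: the sample labels, log lines, and function names are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sample data: (labels, log line) pairs. The labels stand in
# for metadata that would come from service discovery.
LOGS = [
    ({"job": "app1", "instance": "pod-a"}, "GET /users 200 12ms"),
    ({"job": "app1", "instance": "pod-a"}, "GET /users 500 lock timeout"),
    ({"job": "app2", "instance": "pod-b"}, "POST /orders 200 30ms"),
    ({"job": "app2", "instance": "pod-b"}, "POST /orders 500 too many connections"),
]


def build_label_index(logs):
    """Index only the label sets: one entry per stream, lines left unindexed."""
    index = defaultdict(list)
    for labels, line in logs:
        stream_key = tuple(sorted(labels.items()))
        index[stream_key].append(line)
    return index


def build_inverted_index(logs):
    """Index every token of every line, as a full-text system would."""
    index = defaultdict(set)
    for line_no, (_, line) in enumerate(logs):
        for token in line.split():
            index[token].add(line_no)
    return index


label_index = build_label_index(LOGS)
inverted_index = build_inverted_index(LOGS)

print("label index entries:", len(label_index))
print("inverted index entries:", len(inverted_index))
```

The label index grows with the number of distinct streams, while the inverted index grows with the vocabulary of the log contents, which is why the former stays small relative to the data it points at.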
2. Ease of operations.
“We’re going to run it as a service so we care about operations,” said Styan. The team was determined to incorporate the best of both microservices and monoliths, whether you run a single process on bare metal or microservices at hyperscale.
“Internally we run Loki, based on Cortex, as a set of microservices,” said Styan. “These microservices make the system multi-tenant and horizontally scalable. For example, there’s services for sharding and replicating data, for ingestion, querying, etc.”
3. Easier to find the information you want and correlate it with other observability data.
“We wanted to easily correlate between log data and other observability data,” said Styan.
Loki works similarly to Prometheus, a pull-based metrics and monitoring system. Prometheus uses service discovery to learn where your services are running and scrapes the metrics they expose. Because it relies on your service discovery, it attaches metadata labels, such as a Kubernetes job name or a Consul/Nomad cluster, to those metrics.
In Loki, the agent, Promtail, runs on each node and also uses service discovery to find and collect the log streams you want to store. For example, it can grab the standard output stream from every Kubernetes pod on a node. You can then use Grafana Explore for ad-hoc queries and view both metrics and logs in one place.
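As a rough sketch of what "attaching labels from service discovery" means for a log stream, the Python below builds a payload for one pod's log lines. The pod metadata, label names, and helper functions are hypothetical, and the payload shape is only loosely modeled on Loki's JSON push format; it is not a description of what Promtail actually does internally.

```python
import json


def discovered_labels(pod):
    """Turn (hypothetical) service-discovery metadata into stream labels."""
    return {
        "namespace": pod["namespace"],
        "job": pod["app"],
        "instance": pod["name"],
    }


def build_push_payload(pod, lines, start_ns):
    """Group a pod's log lines into one stream identified by its labels.

    Each value is a [timestamp, line] pair; timestamps are nanoseconds
    since the Unix epoch, encoded as strings.
    """
    labels = discovered_labels(pod)
    values = [[str(start_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


# Hypothetical pod and log lines, for illustration only.
pod = {"namespace": "default", "app": "app1", "name": "app1-7d9f"}
payload = build_push_payload(
    pod,
    ["starting up", "listening on :8080"],
    start_ns=1_560_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

Because the labels come from the same service discovery that a metrics scraper would use, a stream like this can be found later with the same label values that identify the corresponding metrics.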
Reminder, however: Loki was designed to be a tool for incident investigations. “We’re not trying to replace things like Splunk or ELK if you’re doing analytics or business intelligence purposes,” said Styan. While we have plans to have Loki parse structured logs to find more labels, we don’t recommend generating time series data from log queries with any tool.
Below is a sample investigation scenario of an alert for 99th percentile latency that shows how easily Loki correlates metrics and logs:
Here is a graph of request latency by percentile. The long-tail latency is relatively high. Also, whatever is causing the latency to spike started around 20:20.
Here is a graph of queries to our database from our application. The orange spikes are the number of 500 status codes being returned for those queries. Let’s assume these 500s result in retries in the service we got the alert for.
Here we have gone into Grafana Explore to run an ad-hoc query, drilling down to the error rate for one specific instance of the database portion of our application. In the upper right there is a red arrow pointing to a split button.
This is the view you get: a single location for all your data that makes it easier to correlate between logs and metrics. The instance label that we were looking for in our metrics is copied over, and we can then just select the Loki data source and view the logs for the exact same instance we were querying metrics for.
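The idea of carrying a label over from a metrics query to a logs query can be sketched in a few lines of Python. This is not Grafana's implementation; the metric name, label values, `shared_labels` set, and `log_selector` helper are assumptions invented for the example.

```python
import re

# Matches one label matcher such as job="app-db" or status!="200".
MATCHER = re.compile(r'(\w+)\s*(=~|!=|!~|=)\s*"([^"]*)"')


def log_selector(promql, shared_labels):
    """Build a log stream selector from the labels in a metrics query.

    Only labels that both data sources are assumed to share are kept;
    metric-only labels (for example a status code) are dropped.
    """
    inside = re.search(r"\{([^{}]*)\}", promql)
    if inside is None:
        raise ValueError("no label selector found in query")
    kept = [
        f'{name}{op}"{value}"'
        for name, op, value in MATCHER.findall(inside.group(1))
        if name in shared_labels
    ]
    return "{" + ", ".join(kept) + "}"


# Hypothetical metrics query for the database instance under investigation.
metric_query = (
    'sum(rate(db_requests_total{job="app-db", instance="db-1", status="500"}[5m]))'
)
selector = log_selector(metric_query, shared_labels={"job", "instance"})
print(selector)
```

The same label values that narrowed the metrics down to one instance now identify that instance's log stream, which is what makes the split view useful.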
We can now apply basic regex filtering to our log data on top of that. For example, when we filter down to only error-level logs, we see some lock timeouts and too many open connections, which are probably causing the issues that we’re seeing. Too many clients means more clients waiting to try and get the lock. Now we can go into the code and fix the issue.
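The filtering step amounts to a grep over the selected stream. A minimal Python sketch of that idea follows; the log lines and their `level=`/`msg=` format are made up for illustration and are not taken from the talk.

```python
import re
from collections import Counter

# Hypothetical log lines from the database instance under investigation.
LINES = [
    'level=info msg="query ok" duration=12ms',
    'level=error msg="lock timeout" table=orders',
    'level=error msg="too many open connections"',
    'level=warn msg="slow query" duration=900ms',
    'level=error msg="lock timeout" table=users',
]

ERROR = re.compile(r"level=error")
MSG = re.compile(r'msg="([^"]*)"')


def error_messages(lines):
    """Keep only error-level lines and count them by message."""
    counts = Counter()
    for line in lines:
        if ERROR.search(line):
            found = MSG.search(line)
            # Fall back to the whole line if it carries no msg field.
            counts[found.group(1) if found else line] += 1
    return counts


counts = error_messages(LINES)
for message, count in counts.most_common():
    print(count, message)
```

In Loki itself this kind of filter is expressed in the query rather than in client code; in current LogQL it would look roughly like `{instance="db-1"} |= "level=error"`, where the instance value is again only an example.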
Here is a basic deployment diagram. The blue line in the middle is Promtail talking to Kubernetes service discovery. The red lines are Promtail collecting from log streams, in this case pods, and the green line is Promtail sending those logs to Loki.
Here is a Kubernetes cluster running some services, app1 and app2. The blue line in the middle is Promtail using Kubernetes service discovery to ask which pods or services are running. The red lines on the right are Promtail collecting from the log streams it has found via service discovery. The green line is Promtail sending logs to Loki, and another blue line is Grafana running queries against the Loki data source.