What's Next for Observability

Published: 21 Oct 2019 by Michelle Tan

In the industry, the long-held theory behind observability is that a successful stack consists of three key components – metrics, logging, and tracing.

“This is a mental model for people who are often new to observability which helps them get a handle on what they need to implement to be successful,” said Grafana Labs VP of Product Tom Wilkie during a keynote presentation he delivered at KubeCon + CloudNativeCon EU in May alongside Red Hat Software Engineer Frederic Branczyk.

“We have all of this data, and we’re telling people if you have metrics, if you have logs, and if you have tracing, you’ve solved observability,” Branczyk added. “But we don’t think that’s the case. There is so much more for observability to come.”

Below is a recap of the KubeCon keynote and a look at how Wilkie and Branczyk predict observability will evolve and grow in the (very near) future.

The Three Pillars

To start, here’s a quick overview of each pillar.

Pillars Slide

1. Metrics Normally this is time series data used to track trends such as memory usage and latency.

“The CNCF has some great projects in this space,” said Wilkie. “OpenMetrics is an exposition format for exporting metrics from your application, and Prometheus is probably now the de facto monitoring system for Kubernetes and apps on Kubernetes.”

2. Logs Logs, or events, are what comes out of your containers in Kubernetes. The CNCF has Fluentd, a log collection and shipping agent.

3. Traces “This is potentially the hardest one to sum up in a single sentence,” said Wilkie. “I think of distributed traces as a way of recording and visualizing a request as it traverses through the many services in your application.”

In this space, there is OpenTelemetry as well as Jaeger, a CNCF project which “we use ourselves” at Grafana Labs, said Wilkie.
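To make Wilkie's one-sentence description of tracing concrete, here is a minimal sketch of how spans can record a single request as it moves between services. The `Span` and `Tracer` classes and the service names are hypothetical, invented for illustration; they are not the OpenTelemetry or Jaeger API, and a real tracer would ship spans to a backend rather than keep them in memory.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span belonging to one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    service: str


@dataclass
class Tracer:
    """Hypothetical in-memory tracer used only for this illustration."""
    spans: List[Span] = field(default_factory=list)

    def start_span(self, service: str, parent: Optional[Span] = None) -> Span:
        span = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            service=service,
        )
        self.spans.append(span)
        return span


def handle_request(tracer: Tracer) -> Span:
    # The request enters at the frontend, which calls auth, which calls the db.
    root = tracer.start_span("frontend")
    auth = tracer.start_span("auth", parent=root)
    tracer.start_span("db", parent=auth)
    return root


tracer = Tracer()
root = handle_request(tracer)
# Every span carrying the root's trace ID belongs to the same request.
path = [s.service for s in tracer.spans if s.trace_id == root.trace_id]
print(path)
```

The shared `trace_id` is what lets a tracing UI reassemble the request, while `parent_id` gives the call hierarchy.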

Prediction 1: More Correlation Between Pillars

“The first prediction is that there will be more correlation between the different pillars,” said Wilkie. “We think this is the year when we’re going to start breaking down the walls and we’re going to start seeing joined up workflows.”

Here are three example workflows and projects that provide automated correlation that you can do today:

Correlation Slide

1. Grafana Loki “The first system is actually a project that I work on myself called Loki,” said Wilkie, who also delivered a separate KubeCon talk about the open source log aggregation system that Grafana Labs launched six months ago. Since then “we have had an absolutely great response. Loads of people have given us really good feedback.”

Loki uses Prometheus’ service discovery to automatically find jobs within your cluster. It then takes that service discovery as well as the labels that the service discovery gives you and associates them with the log stream, preserving its context and saving you money.

“It’s this kind of systematic, consistent metadata that’s the same between your logs and your metrics that enables seamlessly switching between the two,” said Wilkie.

2. Elasticsearch & Zipkin “Elasticsearch is probably the most popular log aggregation system, even I can admit that,” said Wilkie. “And Zipkin is the original open source distributed tracing system.”

Within Kibana, the Elasticsearch UI, there is a function called field formatters which allows someone to hyperlink between tracing and logs.

“Some chap on Twitter set up his Kibana to insert a link using a field formatter so that he could instantly link to his Zipkin traces,” said Wilkie. “I think this is really cool, and I’m really looking forward to adding this kind of feature to Grafana.”

3. OpenTelemetry Recently, Google’s OpenCensus and the CNCF’s OpenTracing merged into a single open source project, OpenTelemetry, now operated by the CNCF.

Within OpenCensus, the use of exemplars illustrates another example of correlation. “Exemplars are example trace IDs associated with every bucket in a histogram,” said Wilkie. “So you can see what’s causing high latency and link straight to the trace.”

“I like that OpenTelemetry is open source,” added Wilkie. “I’m not actually aware of an open source system on the server side that has actually implemented this workflow. If I’m wrong, come and find me.” (He’s @tom_wilkie on Twitter!)
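As a rough illustration of the exemplar idea Wilkie describes, the sketch below keeps one example trace ID next to each histogram bucket. The class name, bucket bounds, and trace IDs are all hypothetical, and the counts are stored per bucket for simplicity; this is not the OpenCensus or OpenTelemetry API.

```python
import bisect
from typing import List, Optional

# Hypothetical latency bucket upper bounds, in seconds.
BOUNDS: List[float] = [0.1, 0.5, 1.0, float("inf")]


class ExemplarHistogram:
    def __init__(self, bounds: List[float]) -> None:
        self.bounds = bounds
        self.counts = [0] * len(bounds)
        self.exemplars: List[Optional[str]] = [None] * len(bounds)

    def observe(self, seconds: float, trace_id: str) -> None:
        # Pick the first bucket whose upper bound is >= the observation.
        i = bisect.bisect_left(self.bounds, seconds)
        self.counts[i] += 1
        self.exemplars[i] = trace_id  # keep the most recent trace as the example

    def slowest_exemplar(self) -> Optional[str]:
        # Walk from the highest-latency bucket down to the first non-empty one.
        for count, trace_id in zip(reversed(self.counts), reversed(self.exemplars)):
            if count:
                return trace_id
        return None


hist = ExemplarHistogram(BOUNDS)
hist.observe(0.05, "trace-a")
hist.observe(0.30, "trace-b")
hist.observe(2.40, "trace-c")
print(hist.slowest_exemplar())
```

With the trace ID stored alongside the bucket, a dashboard showing a latency spike could link directly to one request that landed in the slow bucket.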

Prediction 2: New Signals & New Analysis

Your o11y toolkit doesn’t have to include just three pillars.

“As humans we like the number three. That’s why maybe we’ve settled on the three pillars of observability, but we believe there is so much more data that can help us have insight into running systems,” said Branczyk.

The question is: How do we make all this data useful? The answer lies in the metadata.

“Metadata is what’s going to allow us to do all of this correlation but still build systems that are useful individually. Also as a whole, it will do all this great correlation across the signals,” said Branczyk.

That matters because “it’s signals and analysis that are going to bring us forward,” said Branczyk, who shared a concrete example of what could be the fourth pillar of observability.

Signals Slide1

Above is a graph of memory created using Prometheus. It shows memory usage over time, and there is a sudden drop in memory. Then there’s a new line that has a different color. In Prometheus, this means it’s a distinct time series, but this is actually the same workload here.

“What we’re seeing in this graph is what we call an OOM kill, when our application has allocated so much memory that the kernel has said, ‘Stop here; I’m going to kill this process and go on,’” explained Branczyk. “Our existing systems can show us all this data usage over time, and our logs can tell us that the OOM has happened.”

But app developers need more information to know how to fix the problem. “What we want is a memory profile of a point in time during which memory usage was at its highest so that we know which part of our code we need to fix,” said Branczyk.

Imagine if there was a Prometheus-like system that periodically took memory profiles of an application, essentially creating a time series of profiles. “Then if we had taken every 10 seconds or every 15 seconds a memory profile of our OOM-ing application maybe we would actually be able to figure out what has caused this particular incident,” said Branczyk.
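The idea of a time series of profiles can be sketched in a few lines. This example is an assumption-heavy stand-in: it uses Python's standard `tracemalloc` module inside the process being profiled, whereas the system Branczyk describes would collect profiles from the outside, Prometheus-style. The `collect_profiles` and `leaky_step` functions are hypothetical names for illustration.

```python
import time
import tracemalloc
from typing import Callable, List, Tuple

ProfileSeries = List[Tuple[float, tracemalloc.Snapshot]]


def collect_profiles(workload: Callable[[], None], samples: int,
                     interval_s: float) -> ProfileSeries:
    """Run workload() before each sample and return (timestamp, snapshot) pairs."""
    tracemalloc.start()
    series: ProfileSeries = []
    try:
        for _ in range(samples):
            workload()
            series.append((time.time(), tracemalloc.take_snapshot()))
            time.sleep(interval_s)
    finally:
        tracemalloc.stop()
    return series


leak: List[bytes] = []


def leaky_step() -> None:
    # Simulated bug: every call retains another 100 kB.
    leak.append(bytes(100_000))


series = collect_profiles(leaky_step, samples=3, interval_s=0.01)
# Total traced memory per sample, i.e. one data point per profile.
sizes = [sum(stat.size for stat in snap.statistics("lineno")) for _, snap in series]
print(sizes)
```

Because each sample is a full profile rather than a single number, the snapshot taken just before an OOM kill would still be available for inspection afterwards.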

Google has published a number of white papers on this, and there is some proprietary software that does this work, but there hasn’t been a solution in the open source space – until now. Branczyk has started a new project on GitHub called Con Prof. (“It stands for continuous profiling because I’m not a very imaginative person,” he joked.)

But how can time series profiling be more useful than just looking at the normal memory profile?

Signals Slide2

Above is a pprof profile, which is what the Go runtime provides developers to analyze running systems.

“As it turns out as I was putting together these slides, I found a memory leak in Con Prof,” admitted Branczyk. “But what if Con Prof could have told me which part of Con Prof actually has a memory leak? So if we have all of this data over time modeled as a time series, and if we look at two memory profiles in a consecutive way, Con Prof could identify which systems have allocated more memory over time and haven’t freed it. That potentially could be what we have to fix.”
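Comparing two consecutive profiles, as Branczyk suggests, might look like the sketch below. It again substitutes Python's `tracemalloc` for Go's pprof, and the `leaky` and `well_behaved` functions are hypothetical; it is an illustration of the diffing idea, not how Con Prof is implemented.

```python
import tracemalloc
from typing import List

retained: List[bytearray] = []


def leaky() -> None:
    retained.append(bytearray(500_000))  # allocated and never freed


def well_behaved() -> None:
    scratch = bytearray(500_000)  # released before the function returns
    del scratch


tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(4):
    leaky()
    well_behaved()
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# compare_to() sorts by the absolute size difference, largest first,
# so the first entry is the source line that grew the most.
growth = after.compare_to(before, "lineno")
top = growth[0]
print(top.traceback[0].lineno, top.size_diff)
```

Only the allocation inside `leaky` shows up as growth, because the memory used by `well_behaved` was freed between the two profiles.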

“I think we’re going to be seeing a lot more signals and analysis,” concluded Branczyk. “I’ve only shown you one example but I think there’s going to be a lot more out there to explore.”

Prediction 3: Rise of Index-Free Log Aggregation

Over the past six months to a year, a shared sentiment has emerged around log aggregation. “A lot of people have been saying things like just give me log files and grep,” said Wilkie.

“The systems like Splunk and Elasticsearch give us a tremendous amount of power to search and analyze our logs but all of this power comes with a lot of … I’m not going to say responsibility,” joked Wilkie. “It comes with a lot of complexity – and expense.”

Before Splunk and Elasticsearch, logs were just stored as files on disk, possibly on a centralized server, and it was easy to go in and grep them. “I think we’re starting to see the desire for simpler index-free log aggregation systems,” said Wilkie. “Effectively everything old is new again.”
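The "log files and grep" approach is simple enough to sketch directly. The example below scans every file on each query and keeps no index; the file names and log lines are made up for illustration, and the thread pool is just one possible way to scan files in parallel.

```python
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def grep_file(path: Path, pattern: str) -> List[str]:
    regex = re.compile(pattern)
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if regex.search(line)]


def grep_all(paths: List[Path], pattern: str) -> List[str]:
    # Scan every file in parallel; no index is consulted or maintained.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: grep_file(p, pattern), paths))
    return [line for lines in results for line in lines]


with tempfile.TemporaryDirectory() as tmp:
    api_log = Path(tmp, "api.log")
    pay_log = Path(tmp, "pay.log")
    api_log.write_text("GET /ok 200\nGET /boom 500\n", encoding="utf-8")
    pay_log.write_text("POST /pay 500\nPOST /pay 201\n", encoding="utf-8")
    matches = grep_all([api_log, pay_log], r" 500$")

print(matches)
```

The trade-off is that writes are cheap and there is nothing to operate beyond the files themselves, while each query costs a full scan.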

Here Wilkie gives three examples of how this works:

Logs Slide

1. OK Log Peter Bourgon started this project over a year ago but “unfortunately it’s been discontinued,” said Wilkie. “But it had some really great ideas about distributing grep over a cluster of machines and being able to basically just brute force your way through your logs. It made it a lot easier to operate and a lot cheaper to run.”

2. kubectl logs “I think if we squint, we can think of this as a log aggregation system,” said Wilkie. In kubectl logs, there’s a central place to query logs, and it stores them in a distributed way.

“For me the thing that was really missing from kube logs is being able to get logs for pods that were missing – i.e. pods that disappeared or pods that OOM-ed or failed, especially during rolling upgrades,” said Wilkie.

3. Grafana Loki The above problem is what led Wilkie to develop Loki, the index-free log aggregation system from Grafana Labs, designed to be easy to run and easy to scale.

“It doesn’t come with the power of something like Elasticsearch,” said Wilkie. “You wouldn’t use Loki for business analytics. But Loki is really there for your developer troubleshooting use case.”

“I’m really hoping that we see the rise of these index-free, developer-focused log aggregation systems,” concluded Wilkie. “And I’m hoping this means, as a developer, I’ll never be told to stop logging so much data again.”
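To show what "index-free" can mean in practice, here is a minimal sketch in which log streams are selected by their labels and the log text itself is never indexed, only scanned. `TinyLogStore`, its methods, and the label values are hypothetical; this illustrates the general approach and is not Loki's implementation or query API.

```python
import re
from typing import Dict, FrozenSet, Iterator, List, Tuple

Labels = FrozenSet[Tuple[str, str]]


class TinyLogStore:
    """Hypothetical store: streams are keyed by labels; log text is never indexed."""

    def __init__(self) -> None:
        self.streams: Dict[Labels, List[str]] = {}

    def push(self, labels: Dict[str, str], line: str) -> None:
        self.streams.setdefault(frozenset(labels.items()), []).append(line)

    def query(self, selector: Dict[str, str], pattern: str) -> Iterator[str]:
        wanted = set(selector.items())
        regex = re.compile(pattern)
        for labels, lines in self.streams.items():
            if wanted <= labels:        # cheap: match on label metadata only
                for line in lines:      # brute force: grep every line in the stream
                    if regex.search(line):
                        yield line


store = TinyLogStore()
store.push({"job": "api", "pod": "api-1"}, "level=info msg=started")
store.push({"job": "api", "pod": "api-2"}, "level=error msg=timeout")
store.push({"job": "db", "pod": "db-1"}, "level=error msg=disk full")

hits = list(store.query({"job": "api"}, r"level=error"))
print(hits)
```

Because the labels can be the same ones attached to metrics, a selector such as `{"job": "api"}` could in principle be reused when switching between a graph and the matching logs.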

How You Can Help

“The overarching theme of all of this is don’t leave it up to Tom and me. Don’t leave it up to the existing practitioners,” said Branczyk. “This is a community project. Observability was not created by a few people. It was created by people who had lack of tooling in their troubleshooting.”

So the next time you find yourself troubleshooting, think “what data are you looking at, what are you doing to troubleshoot your problem, and can we do that in a systematic way?” said Branczyk. “Hopefully we’ll have more reliable systems as a result.”

With the help of the entire community, said Wilkie, “if we’re lucky, we can watch this talk in a year or two and maybe get one out of three.”