Grafana Labs at KubeCon: What is the Future of Observability?
The three pillars of observability – monitoring, logging and tracing – are so 2018.
At KubeCon + CloudNativeCon EU last week, Grafana Labs VP Product Tom Wilkie and Red Hat Software Engineer Frederic Branczyk gave a keynote presentation about the future of observability and how this trifecta will evolve in 2019 and the years to come.
“The three pillars were really meant as a framework for people who have just gotten started on their journey in observability. In 2018 there’s been a conversation – and some critique – about this,” said Branczyk.
“We have all of this data, and we’re telling people if you have metrics, if you have logs, and if you have tracing you’ve solved observability. You have an observable system!” Branczyk explained. “But we don’t think that’s the case. There is so much more for observability to come.”
“It’s always a bit of a risky business doing predictions,” said Wilkie. “But we’re going to give it a go anyway.”
Below is a recap of their KubeCon keynote.
The Three Pillars
To start, here’s a quick overview of each pillar.
1. Metrics
Normally this is time series data, used to track trends such as memory usage and latency.
“The CNCF has some great projects in this space,” said Wilkie. “OpenMetrics is an exposition format for exporting metrics from your application, and Prometheus is probably now the de facto monitoring system for Kubernetes and apps on Kubernetes.”
2. Logs
Logs, or events, are what come out of your containers on “standard out” in Kubernetes. Think error messages, exceptions, and request logs. The CNCF has the Fluentd project, which is a log shipping agent.
3. Traces
“This is potentially the hardest one to sum up in a single sentence,” said Wilkie. “I think of distributed traces as a way of recording and visualizing a request as it traverses through the many services in your application.”
In this space, there is OpenTelemetry as well as Jaeger, a CNCF project which Grafana Labs utilizes, according to Wilkie.
Prediction #1: More Correlation Between Pillars
“The first prediction is that there will be more correlation between the different pillars,” said Wilkie. “We think this is the year when we’re going to start breaking down the walls, and we’re going to start seeing joined up workflows.”
Here are three examples of workflows and projects that provide automated correlation and that you can use today:
1. Grafana Loki
“The first system is actually a project that I work on myself called Loki,” said Wilkie, who also delivered a separate KubeCon talk about the open source log aggregation system that Grafana Labs launched six months ago. Since then “we have had an absolutely great response. Loads of people have given us really good feedback.”
Loki uses Prometheus' service discovery to automatically find jobs within your cluster. It then takes the labels that the service discovery gives you and associates them with the log stream, preserving context and saving you money.
“It’s this kind of systematic, consistent metadata that’s the same between your logs and your metrics that enables switching between the two seamlessly,” said Wilkie.
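To make the shared-metadata idea concrete, here is a minimal sketch in Python. It assumes a hypothetical label set of the kind service discovery might attach to a pod, and shows how one label selector can be reused for both a metrics query and a log query. The label values, the metric name `http_requests_total`, and the helper names are illustrative assumptions, not taken from the talk.

```python
def selector(labels: dict[str, str]) -> str:
    """Render a label set as a {key="value", ...} selector.

    Simplification: label values are not escaped, so this sketch only
    handles values without quotes or backslashes.
    """
    pairs = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + pairs + "}"


def metrics_query(labels: dict[str, str], metric: str = "http_requests_total") -> str:
    """Build a PromQL-style rate query for the given labels (metric name is hypothetical)."""
    return f"rate({metric}{selector(labels)}[5m])"


def logs_query(labels: dict[str, str], needle: str = "error") -> str:
    """Build a LogQL-style query: same selector, plus a line filter."""
    return f'{selector(labels)} |= "{needle}"'


# Hypothetical labels for one workload.
labels = {"namespace": "prod", "job": "api", "pod": "api-7d9f"}

print(metrics_query(labels))
print(logs_query(labels))
```

Because both queries are built from the same selector, moving from a graph to the matching log stream is just a matter of reusing the labels, which is the switching workflow Wilkie describes.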
2. Elasticsearch & Zipkin
“Elasticsearch is probably the most popular log aggregation system, even I can admit that,” said Wilkie. “And Zipkin is the original open source distributed tracing system.”
Within Kibana, the Elasticsearch UI, there is a feature called field formatters, which lets you hyperlink from logs to traces.
“Some chap on Twitter set up his Kibana to insert a link using a field formatter so that he could instantly link to his Zipkin traces,” said Wilkie. “I think this is really cool, and I’m really looking forward to adding this kind of feature to Grafana.”
3. OpenTelemetry
Google’s OpenCensus and CNCF’s OpenTracing recently merged into one open source project, OpenTelemetry, which is now operated by CNCF.
Within OpenCensus, the use of exemplars illustrates another example of correlation. “Exemplars are trace IDs associated with every bucket in a histogram,” said Wilkie. “So you can see what’s causing high latency and link straight to the trace.”
“I like that OpenTelemetry is open source,” added Wilkie. “I’m not actually aware of an open source system on the server side that has implemented this workflow. If I’m wrong, come and find me.” (He’s @tom_wilkie on Twitter!)
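The exemplar idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the OpenCensus or OpenTelemetry API: the class name `ExemplarHistogram`, its methods, and the trace IDs are all hypothetical. Each bucket keeps a count plus the trace ID of the most recent observation that landed in it.

```python
import bisect
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExemplarHistogram:
    """Toy latency histogram that remembers one trace ID per bucket."""

    bounds: list[float]  # ascending upper bounds; a final overflow bucket is implicit
    counts: list[int] = field(init=False)
    exemplars: list[Optional[str]] = field(init=False)

    def __post_init__(self) -> None:
        n = len(self.bounds) + 1  # +1 for values above the last bound
        self.counts = [0] * n
        self.exemplars = [None] * n

    def observe(self, value: float, trace_id: str) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1  # per-bucket (non-cumulative) count
        self.exemplars[i] = trace_id  # keep the most recent trace as the exemplar

    def slowest_exemplar(self) -> Optional[str]:
        """Return the exemplar from the highest non-empty bucket, if any."""
        for trace_id in reversed(self.exemplars):
            if trace_id is not None:
                return trace_id
        return None


hist = ExemplarHistogram(bounds=[0.1, 0.5, 1.0])
hist.observe(0.05, "trace-a")
hist.observe(0.80, "trace-b")
hist.observe(0.07, "trace-c")
print(hist.slowest_exemplar())
```

With something like this in place, a dashboard showing a high-latency bucket could offer a direct link to the trace stored alongside it.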
Prediction #2: New Signals & New Analysis
Your observability toolkit doesn’t have to include just three pillars.
Branczyk shared an example of what could be the fourth pillar of observability because “it’s signals and analysis that are going to bring us forward.”
The keynote included a graph of memory usage created using Prometheus. It shows memory usage over time, with a sudden drop followed by a new line in a different color. In Prometheus this means it’s a distinct time series, but we’re actually looking at the same workload here.
“What we’re seeing in this graph is actually what we call an OOM kill, when our application has allocated so much memory that the kernel has said, ‘Stop here; I’m going to kill this process and go on,’” explained Branczyk. “Our existing systems can show us all this data usage over time, and our logs can tell us that the OOM has happened.”
But app developers need more information to know how to fix the problem. “What we want is a memory profile of a point in time during which memory usage was at its highest so that we know which part of our code we need to fix,” said Branczyk.
Imagine if there was a Prometheus-like system that periodically took memory profiles of an application, essentially creating a time series of profiles. “Then if we had taken every 10 seconds or every 15 seconds a memory profile of our OOM-ing application maybe we would actually be able to figure out what has caused this particular incident,” said Branczyk.
Google has published a number of white papers on this topic and there are some proprietary systems that do this work, but there hasn’t been a solution in the open source space – until now. Branczyk has started a new project on GitHub called Conprof. (“It stands for continuous profiling because I’m not a very imaginative person,” he joked.)
But how can time series profiling be more useful than just looking at the normal memory profile?
The keynote then showed a pprof profile, which is what the Go runtime provides developers to analyze running systems.
“As it turns out as I was putting together these slides, I found a memory leak in Conprof,” admitted Branczyk. “But what if Conprof could have told me which part of Conprof actually has a memory leak? So if we have all of this data over time modeled as a time series, and if we look at two memory profiles in a consecutive way, Conprof could identify which systems have allocated more memory over time and haven’t freed it. That potentially could be what we have to fix.”
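The comparison Branczyk describes can be sketched roughly as follows. This is not Conprof’s implementation; it assumes, purely for illustration, that each profile has already been reduced to a mapping from function name to in-use bytes, and the function names and numbers are made up.

```python
def growth_between(earlier: dict[str, int], later: dict[str, int]) -> list[tuple[str, int]]:
    """Return (function, bytes_grown) for functions holding more memory in
    the later profile than in the earlier one, largest growth first."""
    deltas = {fn: later[fn] - earlier.get(fn, 0) for fn in later}
    grown = [(fn, delta) for fn, delta in deltas.items() if delta > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)


# Two hypothetical consecutive samples of in-use bytes per function.
profile_t0 = {"parseRequest": 1000, "cacheInsert": 4000, "renderPage": 2000}
profile_t1 = {"parseRequest": 1000, "cacheInsert": 9000, "renderPage": 1500, "newBuffer": 300}

for fn, grown in growth_between(profile_t0, profile_t1):
    print(f"{fn}: +{grown} bytes")
```

A function that keeps appearing at the top of this list across many consecutive samples is a reasonable first suspect for a leak, although a single diff on its own cannot tell a leak apart from ordinary growth.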
“I think we’re going to be seeing a lot more signals and analysis,” concluded Branczyk. “I’ve only shown you one example but I think there’s going to be a lot more out there to explore.”
Prediction #3: Rise of Index-Free Log Aggregation
Over the past six months to a year, a common sentiment about log aggregation has emerged. “A lot of people have been saying things like, ‘Just give me log files and grep,’” said Wilkie.
“The systems like Splunk and Elasticsearch give us a tremendous amount of power to search and analyze our logs but all of this power comes with a lot of … I’m not going to say responsibility. It comes with a lot of complexity – and expense,” said Wilkie.
Before Splunk and Elasticsearch, logs were just stored as files on disk, possibly on a centralized server, and it was easy to go in and grep them. “I think we’re starting to see the desire for simpler index-free log aggregation systems,” said Wilkie. “Effectively everything old is new again.”
Here Wilkie gives three examples (“Again, three is a very aesthetically pleasing number,” he joked) of how this works:
1. OK Log
Peter Bourgon started OK Log over a year ago but “unfortunately it’s been discontinued,” said Wilkie. “But it had some really great ideas about distributing grep over a cluster of machines and being able to basically just brute force your way through your logs. It made it a lot easier to operate and a lot cheaper to run.”
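The brute-force idea is simple enough to sketch. The Python below is an illustration only, not how OK Log works internally: it assumes each node’s logs are just a list of lines held in memory, and it uses a thread pool to stand in for separate machines. The node names and log lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_shard(lines: list[str], needle: str) -> list[str]:
    """Scan one node's lines with no index: a plain substring match."""
    return [line for line in lines if needle in line]


def distributed_grep(shards: dict[str, list[str]], needle: str) -> list[tuple[str, str]]:
    """Fan the same search out to every shard in parallel, then merge results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            node: pool.submit(grep_shard, lines, needle)
            for node, lines in shards.items()
        }
        # Collect in node-name order so the output is deterministic.
        return [(node, line) for node in sorted(futures) for line in futures[node].result()]


# Hypothetical logs held on two nodes.
shards = {
    "node-a": ["GET / 200", "GET /checkout 500 error"],
    "node-b": ["error: upstream timeout", "GET /health 200"],
}

for node, line in distributed_grep(shards, "error"):
    print(node, line)
```

There is nothing to build or maintain at write time here; the cost moves to query time, where it is spread across however many nodes hold the logs.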
2. kubectl logs
“I think if we squint, we can think of this as a log aggregation system,” said Wilkie. In kubectl logs, there’s a central place to query logs, and it stores them in a distributed way.
“For me, the thing that was really missing from kube logs is being able to get logs for pods that were missing – i.e. pods that disappeared or pods that OOM-ed or failed, especially during rolling upgrades,” said Wilkie.
3. Grafana Loki
The above problem is what led Wilkie to develop Loki, Grafana Labs’ index-free log aggregation system designed to be easy to run and easy to scale.
“It doesn’t come with the power of something like Elasticsearch,” said Wilkie. “You wouldn’t use Loki for business analytics. But Loki is really there for your developer troubleshooting use case.”
“I’m really hoping that in 2019 and 2020, we see the rise of these index-free, developer-focused log aggregation systems,” concluded Wilkie. “And I’m hoping this means, as a developer, I’ll never be told to stop logging so much data again.”
How You Can Help
“The overarching theme of all of this is don’t leave it up to Tom and me. Don’t leave it up to the existing practitioners,” said Branczyk. “This is a community project. Observability was not created by a few people. It was created by people who had lack of tooling in their troubleshooting.”
So the next time you find yourself troubleshooting, think “what data are you looking at, what are you doing to troubleshoot your problem, and can we do that in a systematic way?” said Branczyk. “Hopefully we’ll have more reliable systems as a result.”
With the help of the entire community, said Wilkie, “if we’re lucky, we can watch this talk in a year or two and maybe get one out of three.”