The benefits of observability
Grafana Labs cofounder and CEO Raj Dutt was a recent guest on the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), speaking to host Dan Woods about the benefits of observability. The conversation touched on several related topics – including the tactics of observability, platform approaches, and why now is a great time to be part of an open source company.
Why observability is important
Dutt began by explaining that the concept of observability comes from the control theory world. “It’s been co-opted in the software world to mean a new way of monitoring and understanding complex systems and applications infrastructure,” he said, noting that there needed to be a new word “because there has been a fundamental shift in how to deploy, monitor, and support applications.”
Etsy, he said, has been a pioneer in this way of thinking: “It’s all about being able to empower developers and operations teams to deploy their software often, safely, and preserve user experience.”
When it comes to Grafana Labs, though, “Observability practically means we bring telemetry data together under a seamless experience to help people troubleshoot, understand, and explore the data that is coming out at an increasingly rapid rate from all systems and applications. By telemetry data, we mean the fundamental building blocks of understanding these complex systems, which for us boils down to metrics, logs, and traces.”
Woods agreed that the way Etsy tracks thousands of different metrics was unlike anything he’d seen before in a data center, which confirmed to him that observability is something different from operational monitoring. As Dutt put it: “Whereas the old IT operations monitoring was about checks and statuses of things, now systems are so complex and there’s so much data, it’s really a data analytics problem.”
Woods followed up by asking how the artifacts of observability – the dashboards and all the monitoring – affect what goes on in an organization.
According to Dutt, the impact is dramatic both on a practical level (touching operational aspects and capabilities) and a cultural one (breaking down data silos and empowering teams).
On top of that, said Woods, “Observability gave confidence” – and Dutt agreed. For a company like Etsy that deploys changes all the time, this new system makes it safer to do so: if the dashboards go crazy, the team can figure out right away what happened and roll back.
“[Observability] really puts a lot of demands on how you need to observe your systems,” Dutt added. “It needs to be real time, it needs to be comprehensive, but all of the supporting capabilities around that, including how you develop and deploy and package your software and run your infrastructure, have to also be in place in order to achieve that nirvana.”
But fixing problems and reducing MTTR is just the start of the benefits of observability, said Dutt, who added that it can also yield huge cost savings.
“An example would be of people using things like auto scaling and usage-based cloud consumption models,” he explained. “And in order to do that cost effectively, you have to monitor everything in real time and correlate different metrics and logs. The cloud providers don’t make this easy – deliberately – and I think that cost savings in terms of optimizing your infrastructure, that’s another big enabler of having all this data together.”
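The loop Dutt describes can be sketched in a few lines. The following is a minimal illustration, not production autoscaling: it assumes a Prometheus server at localhost:9090, a hypothetical `http_requests_total` counter, and a scaling policy of the reader’s choosing; the actual scaling call is left to your orchestrator.

```python
import time
import requests

PROM_URL = "http://localhost:9090"  # assumption: a Prometheus server here

def instant_query(expr: str) -> float:
    """Run an instant PromQL query and return the first value (0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def desired_replicas(rps: float, rps_per_replica: float = 100.0) -> int:
    """Naive proportional policy: one replica per 100 req/s, minimum of 2."""
    return max(2, round(rps / rps_per_replica))

while True:
    # http_requests_total is a hypothetical metric; use whatever your app exports.
    rps = instant_query('sum(rate(http_requests_total[5m]))')
    print(f"load: {rps:.0f} req/s -> want {desired_replicas(rps)} replicas")
    # scale(desired_replicas(rps))  # hand off to your orchestrator's API here
    time.sleep(60)
```

The point of the sketch is the dependency Dutt calls out: the scaling decision is only as good as the real-time telemetry feeding it.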
All in the details
Woods compared the way things work to a nervous system: It detects when more traffic is coming in and triggers an auto-scaling of the infrastructure; when the traffic dies down, the scale of the infrastructure is reduced. “So the idea is that it creates a much more detailed model of activity that you can then respond to,” he said.
“Exactly,” replied Dutt. This helps companies deal with issues that are far less obvious than an outage and may affect only a handful of users, such as “a 99th percentile latency type issue or an issue that’s affecting an important query somewhere in the stack,” he said. “And being able to tie together metrics, logs, and traces in an experience that’s seamless and allows you to look at an individual request, that’s really important.” If you’re dealing with an important customer who’s having a problem, being able to offer that level of customer support is incredibly worthwhile.
The ability to look down to an individual request, said Dutt, “puts tremendous demands on the level of information that you need to collect, the level of correlation that you need to do, and you need to do that in real time.”
Woods asked why a CEO should care about that.
“So many companies have a web application or a mobile application or an API that is super important to the business,” Dutt replied. “Essentially, if you care about things like user experience, performance, availability, at the end of the day you have to have someone caring about your observability story. Because the reality is, the way we build applications today is much more complicated, much more distributed.”
In order to understand it, he said, you need to be able “to look at the whole system and correlate data and be able to analyze disparate things. Why should a CEO care? For the same reason they care about a viable software strategy.”
The observability and DevOps connection
Woods wondered why observability is so connected to DevOps and asked Dutt if it’s only usable in a DevOps context.
“I wouldn’t say it’s only usable in a DevOps context,” he replied, “but a lot of the agility that we talked about in terms of reacting to change – like the Etsy analogy that you brought up – can only be fully realized if you embrace fundamental organizational changes beyond your observability stack. Observability definitely, as a concept, is applicable to even traditional organizations and traditional software teams, but it really comes into its own in terms of realizing all of its benefits if you are more agile and if you do things like continuous deployment.”
Dutt then explained the difference between traditional monitoring and observability.
“In traditional monitoring, you have enough metrics that maybe a human could actually understand them. In observability, you have potentially thousands, hundreds of thousands, or, as you mentioned, even millions of metrics, and the picture that you’re drawing is one that you’re dealing with in aggregate. Then you may dive in deeply and use those metrics to create a very detailed model for various business purposes.”
Tactics of observability
When it comes to building up observability, Dutt said the view of both Grafana Labs and the Grafana open source project is that “it’s all about the ecosystem.”
In order for the Grafana open source project to become an observability platform and be useful, you need to select components like metrics backends, logs backends, and traces backends. And that ecosystem, he said, spans many open source projects as well as commercial and SaaS vendors.
“What we’re all about as a company is providing choice to our users and our customers so that they can compose a platform of whatever backends and vendors make sense to them,” Dutt explained.
Grafana Labs, he said, doesn’t see itself as a database vendor.
“We believe that your data will always exist in different databases and will be disparate, and your consolidation play is probably going to be ongoing for a very long time because there’s never going to be a single database to rule all this data.”
There are many open source projects in the mix, Dutt said, including Elasticsearch, InfluxDB, Prometheus, and Graphite. Grafana Labs is heavily involved in communities and projects – such as the Prometheus project and the Graphite project – “but it’s really important to us as a company that we provide first-class integrations to all these different providers.”
Dutt added that there’s something important to remember: “A lot of other vendors will make similar claims, but they’re still database vendors. In order to use their platform, you have to store all your data in their backend. Grafana allows you to use your existing backends and not move that data, not ETL that data out, not batch load that data into Grafana. Grafana will, in real time using the native Splunk APIs or the native Elasticsearch APIs, pull the relevant data when you need it for the analysis that you need, for the dashboard that you loaded, for the alert that you need to run. We don’t have a rip-and-replace mentality.”
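To make the no-rip-and-replace point concrete, here is a hedged sketch of registering an existing Elasticsearch cluster as a Grafana data source over Grafana’s HTTP API, so queries run against the data where it already lives. The endpoint follows Grafana’s data source API, but the name, URLs, index pattern, and token below are placeholders, and exact fields can vary by Grafana and Elasticsearch version.

```python
import requests

GRAFANA_URL = "http://localhost:3000"  # assumption: a local Grafana instance
API_TOKEN = "YOUR_API_TOKEN"           # assumption: an admin-scoped API key

# Describe the backend where the data already lives; nothing is copied or ETL'd.
payload = {
    "name": "existing-logs",             # hypothetical data source name
    "type": "elasticsearch",
    "url": "http://elasticsearch:9200",  # assumption: your existing cluster
    "access": "proxy",
    "database": "app-logs-*",            # hypothetical index pattern
    "jsonData": {"timeField": "@timestamp"},
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
print(resp.json().get("message", "data source added"))
```

From that point on, dashboards and alerts query Elasticsearch through its native API at read time, which is exactly the model Dutt contrasts with batch-loading data into a vendor’s backend.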
The Grafana ecosystem
According to Dutt, Grafana can sit on top of 42 different databases, including SQL databases. Grafana, he said, “is basically software to allow you to visualize and analyze your data from all of these different databases.”
Woods theorized that a company with many different departments could be so excited by Grafana’s capabilities that, over time, it could end up defining 50,000 metrics. So how does Grafana make it possible to organize them so users can dive into detail when necessary?
“It’s all domain-specific,” explained Dutt. “The raw data is pretty useless, but depending on what you’re doing, the aggregate views can be very useful. For every server you have, you would probably collect dozens of metrics. But you would generally never look at the individual CPU usage of a single Docker container unless you were troubleshooting something. So it’s all about starting with top-level stuff and creating both dashboards and exploratory views that show high-level status and allowing people to drill down lower and lower.”
Metrics alone don’t always get you there. You would typically start with an alert, Dutt said, then “look at broad metrics that are generally dashboards that are high-level status of your systems, and then you’ll generally dive down to more detailed metrics and then you’ll switch to logs. And then you’ll probably switch to traces. That whole experience of contextualized switching within these observability primitives happens within Grafana.”
This is all structured around a variety of dashboards, and Dutt said Grafana is adding more Explore functions to allow for easier impromptu analysis.
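Grafana’s Explore is built for exactly that contextual switching. Purely as an illustration of the underlying steps, here is a sketch of the same alert-to-metrics-to-logs walk done by hand against the raw backend APIs. It assumes Prometheus for metrics and Elasticsearch for logs, and every metric, index, and field name is hypothetical.

```python
import requests

PROM = "http://localhost:9090"    # assumption: Prometheus for metrics
ES = "http://elasticsearch:9200"  # assumption: Elasticsearch for logs

def prom(expr: str):
    """Run an instant PromQL query and return the result vector."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
    r.raise_for_status()
    return r.json()["data"]["result"]

# 1. Alert fires; check the high-level view: total 5xx error rate.
print(prom('sum(rate(http_requests_total{status=~"5.."}[5m]))'))

# 2. Drill down: the same rate broken out by instance, worst five first.
print(prom('topk(5, sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])))'))

# 3. Switch to logs: recent errors from the suspect instance
#    (index pattern and field names are hypothetical).
logs = requests.post(
    f"{ES}/app-logs-*/_search",
    json={
        "size": 10,
        "sort": [{"@timestamp": "desc"}],
        "query": {"bool": {"must": [
            {"match": {"level": "error"}},
            {"match": {"instance": "web-42"}},
        ]}},
    },
)
print(logs.json()["hits"]["hits"])
```

Doing this by hand shows why the contextualized hand-off matters: each step has to carry the narrowed context (time range, instance, labels) into the next tool.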
Observability as a data and analytics problem
“The scale is going through the roof,” Dutt said. “It’s no longer about an individual metric or one thing being off or something like that. It’s both because of the scale that we’re dealing with, but that scale is driven by the complexity of people’s infrastructure. People used to have a few dozen servers sitting in a rack in a colo data center somewhere. And then it went to a few hundred VMs with VMware and Xen. Then it went to thousands of containers and multiple availability zones and serverless, and so it doesn’t matter anymore what a particular metric is.”
What people care about, he said, is a deeper level of analytics, which requires asking complicated questions, such as: “Across the customers I care about, how many of them are having an elevated bad time right now in the last five minutes based on this?”
“You may have to touch thousands of metrics to answer that question,” said Dutt, “but you don’t want to see any of those metrics anymore. You just want to be able to ask the question and understand the system in the way that you want to. These systems have become like organisms that you have to look at the health and the state of in aggregate, through data analysis rather than the status of any one metric.”
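That “elevated bad time” question can be made concrete. As a hedged sketch (assuming a Prometheus backend, with hypothetical metric and label names such as `request_duration_seconds_bucket` and `customer`), a single aggregate PromQL expression answers it without ever surfacing the underlying series:

```python
import requests

# One PromQL expression that may touch thousands of per-customer series:
# count the customers whose p99 latency exceeded 500ms over the last 5 minutes.
QUERY = """
count(
  histogram_quantile(
    0.99,
    sum by (customer, le) (rate(request_duration_seconds_bucket[5m]))
  ) > 0.5
)
"""

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # assumption: a local Prometheus
    params={"query": QUERY},
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
print("customers having a bad time:", result[0]["value"][1] if result else 0)
```

The caller sees one number; the thousands of metrics it touched stay below the waterline, which is exactly the shift from metric-watching to data analysis that Dutt describes.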
The (potential) role of AI
Given that users have to create models in order to understand the story being told by the data, and then be able to judge the significance of what’s in the model before they can figure out what requires attention, Woods wondered if it’s inevitable that machine learning and AI will play a role when it comes to using the metrics at scale.
Grafana Labs is “experimenting with and playing around with things that do more predictive analytics or precog,” Dutt said, but he believes there’s a bigger issue to tackle first: Most organizations’ best practices need to be vastly improved when it comes to “establishing what’s normal, and what should be monitored, and what the high-level metrics are.”
As Woods put it, “There’s a lot needed to be done to grow the nervous system before you start making the brain better.”
Consolidated approach vs. platform approach
Woods asked Dutt to break down the difference between these two approaches – and explain why a consolidated approach is problematic.
Consolidation, said Dutt, means “you’re going to move all of your data into a ‘next gen’ database that’s going to handle all of your use cases for observability – whether that’s a SaaS vendor, whether that’s an on-prem vendor, whether that’s a vendor that’s coming from logs or metrics or traces or an open source project.”
This also could be looked at as a productized approach or what Dutt calls “a monolith – like everything in one box, the opposite of composable.”
A platform approach, on the other hand, is composable in terms of “the interoperability that we provide with your data wherever it lives,” he said. “So you can connect any and all of the several dozen data sources that work with Grafana or write your own data source since they’re open source, and basically compose an observability platform that is not a monolith.”
In the open source world, Dutt believes a monolithic, consolidated approach “is just a nonstarter.”
The reason? “There’s so much innovation happening within the open source ecosystem with projects like Prometheus and Elasticsearch,” he said, and Grafana wants to be able to “leverage all of the innovation when it comes to metrics backends, logging backends, and tracing backends” and allow people to make their own choices.
Woods asked for an example of when it might make sense to use different databases or different components, and Dutt said the “obvious example” would be using Graphite for metrics and Elasticsearch for logs. You’re using two open source projects for their individual strengths and capabilities, but you want to bring them into one experience and one view – and Grafana can help do that.
Another common scenario, said Dutt, is using Graphite and Prometheus together. A Graphite installation could be running fine for years, but if you have a team that’s playing around with Kubernetes, they’re likely using Prometheus.
“So boom, you’ve got Prometheus and Graphite running. What do you do? A lot of times you want to bring those metrics together, and so Grafana is the answer to that. Most Graphite users are already using Grafana and most Prometheus users are already using Grafana too, so it’s kind of an obvious play.”
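Grafana handles that federation natively. Just to show why a single pane of glass matters, here is a sketch of the two query interfaces it sits on top of, asking both backends the same “CPU over the last hour” question; hosts and metric names are hypothetical.

```python
import time
import requests

now = int(time.time())

# Graphite's render API: datapoints come back as [value, timestamp] pairs.
graphite = requests.get(
    "http://graphite:8080/render",  # assumption: your Graphite host
    params={"target": "servers.web01.cpu.user", "from": "-1h", "format": "json"},
).json()

# Prometheus's range-query API: values come back as [timestamp, value] pairs.
prometheus = requests.get(
    "http://prometheus:9090/api/v1/query_range",  # assumption: Prometheus host
    params={
        "query": 'rate(node_cpu_seconds_total{mode="user"}[5m])',
        "start": now - 3600,
        "end": now,
        "step": 60,
    },
).json()

print("graphite series:", len(graphite))
print("prometheus series:", len(prometheus["data"]["result"]))
```

Two different query languages, two different response shapes, even opposite datapoint ordering: stitching that together by hand for every investigation is the pain Grafana’s shared dashboards remove.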
The open source obstacle
Woods then shifted the conversation to a New York Times article that discussed how Amazon is being criticized for “strip-mining” open source software startups.
“If you’re going to do a product – a commercial product – based on open source, you have to have not only a model for creating the software, you have to have a model for capturing the value,” Woods said. And this leads to a quandary: How do you stay a legit open source company instead of just using open source as a distribution model?
“There’s no easy or glib answer,” Dutt said. “We differentiate our open source software with commercially licensed software that is not open source – it’s called Grafana Enterprise.”
As for “value capture,” he noted that “open source itself has never been about value capture; it’s been about value creation.” However, some features are held back for the enterprise version of the software. “The features that are in Enterprise will appeal to the largest companies in the world, like the top one percent of our users,” Dutt explained.
“As far as the whole strip-mining argument,” he continued, “I don’t think it’s fair to characterize it as a complete negative, as you have to acknowledge the value creation happening if it’s done in the right way. I would like to think that there’s a way that the relationship between some of the cloud vendors and some of the open source companies can become something that could work in the long term where the innovation and the community could be realigned. But for us, I would say that we have our commercially differentiated software, and we will continue to do so because we look at our open source projects as primarily about value creation, and when you try to have both, I think it really complicates things.”
Future trends
The interview concluded with Woods asking Dutt about “the positive and negative trends going on right now in open source for the enterprise and with respect to the public cloud.”
According to Dutt, it’s the “best time ever” to be an open source company in the enterprise software infrastructure space. “What’s changed is certainly within the observability tool set or ecosystem, as the cutting-edge stuff now is all happening in open source, whereas 10 years ago open source was your cheap and cheerful alternative to the commercial vendors. And with the larger vendors, it is encouraging to see that whether it’s data analytics, or observability, or open source developer tooling, they’re also becoming more involved in the projects because they see that as a way that actually helps them with their use of the projects.”