“I’ve run a lot of systems in production, and a lot of what has gone into the Kubernetes project came out of scars that came from running web search in production and running API services,” Brendan Burns, the co-creator of Kubernetes, said at the top of his keynote at GrafanaCon L.A.
Now a distinguished engineer at Microsoft, Burns noted that one of his teams runs the system that all of the APIs for Azure pass through. “Around 40, 50,000 requests a second [are] coming through that system, and if it goes down, everything goes down. So it’s a nerve-wracking experience.”
Horror stories? He’s lived a few. Burns spent the next half hour sharing some lowlights, as well as the lessons he’s learned building and monitoring cloud native systems. But first, he covered how we got to this point in monitoring.
Divide and Conquer with Containers
A recurring theme in software development is the concept of decoupling, or as it’s called in algorithms, divide and conquer. “This notion that there’s an ideal breakup of how software should look, and that we should be very crystal clear about the separation between the various pieces and have good contracts in between them is, I think, at the root of all good engineering,” Burns said.
Through the years, he pointed out, compilers decoupled us from processors, object-oriented programming decoupled libraries from applications, and containers decoupled the application from the operating system.
“We keep making these layers because the whole stack is too damn complicated for anybody to even come close to wrapping their head around,” he said. “We need one thing. We need it to have one good implementation. We need to share our work together. Sharing breeds the ability to specialize.”
The decoupling that containers enable allows for an abstraction layer for new interfaces for things like monitoring. “If I hang the Prometheus scrape API on the edge of my container, suddenly monitoring systems can light up and automatically pull my metrics without really any interaction between the two,” Burns said.
“So people who want to think about monitoring can focus on monitoring,” he added. “I as an application developer can think about the monitoring metrics that I want to expose, and we really don’t have to talk to each other. That’s hugely empowering.” Plus, the container has enabled us to ship monitoring elements as discrete components that can be open source, homegrown, or from a vendor.
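To make the idea concrete, here is a minimal sketch of what hanging a scrape endpoint on an application can look like. It is not code from the talk. It uses only the Python standard library and assumes a hypothetical counter named app_http_requests_total, a /metrics path, and an arbitrary port; a real service would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would update these as it serves traffic.
REQUEST_COUNTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(counts):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_http_requests_total Total HTTP requests handled.",
        "# TYPE app_http_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(
            f'app_http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the metrics text on /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() inside the container is the application developer's whole side of the contract. Whoever runs the monitoring system points a scraper at /metrics and never needs to read the application's code.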
“The whole idea of containerizing things, making things into discrete chunks that are decoupled from each other that we can share and reuse, and that the experts can really build, is a great move forward in terms of the monitoring and observability of our systems,” said Burns.
In the real world, though, if you let developers do their own monitoring, chances are that three different people will choose three different ways to do it (or not do it at all). “It’s very hard to standardize people,” Burns said.
At some point, companies need to choose one way. “It doesn’t really matter [which way], but what does matter is that there’s standardization, because if you standardize, you get specialization, you get expertise, you get knowledge transfer,” he said. “So that if I move from team A to team B, the query language looks the same.”
With Kubernetes, monitoring can be a cluster-level service. “Before the world of containers and container orchestrators, all your APIs were infrastructure-oriented,” Burns said. “Now we actually have application-oriented APIs, which means that a monitoring system can say, ‘Hey, could you give me all the apps that are running in my cluster? I want to start monitoring them.’ At this point, your app developer has very rich telemetry and monitoring data that is standardized across an entire company, and they have learned nothing, and they have done nothing. That’s pretty amazing.”
Beware the Snowflakes
Despite these improvements, organizations now face the rise of the snowflake cluster: a cluster that has drifted into its own one-off configuration. “We need to extend the notion of what it means to monitor something from being, ‘I want to understand how my application is going’ to ‘I want to enforce some stuff to make sure that everybody stays about the same,’” Burns said.
To that end, Burns has been working with the Open Policy Agent team on a policy controller to enforce policy for Kubernetes. “We can really monitor and lock down all of our clusters and ensure consistency of experience, so that not only is there consistency of monitoring within apps deployed to a single cluster. You can actually ensure that exactly the same version of Grafana, exactly the same version of Prometheus, exactly the same monitoring experiences are deployed across every cluster,” he said.
And now for the real-life stuff:
1. If there’s something in your data that you can’t explain, you should really probably investigate it. Once, when building a search system, Burns recounted, he had a blackbox monitor running that would post a document, search for the document, and repeat. “There was a steady 0.5% failure in the retrieval rate,” he said. “We, being good engineers, came up with all kinds of excuses for why our system was operating perfectly and still would have this little error rate. We made ourselves feel really, really good.”
The day before the launch, he got a panicked call from the VP doing a dry-run demo ahead of a big press event. The VP had posted a document, and it was appearing in and out of the index. Luckily, Burns recognized the bug (“we had some garbage collection that was kicking in too early”) and spent a long night fixing it before the launch.
“After doing all this, I went back to my monitoring, and you know that 0.5% error rate that was there? It was now zero,” Burns said.
The moral of the story? Pay attention to your monitoring. Rationalization is not OK.
2. Set up a release dashboard. At Microsoft, every team that Burns manages maintains a dashboard showing which version of its software is running in each data center where it has a presence. One reason is that if you run a service that is the middleware for an entire cloud, you often have to field questions about bug fixes and feature releases, and answering them ends up being a drain on the team.
The other reason: Once they had the dashboard, they realized that there were data centers that were weeks out of date. “We didn’t even know, because a build got stalled for long enough, then we started a new release the next week,” Burns said. “But we had never caught up to the old release, and just bad stuff happens.”
Once the team set up the dashboard and alerting (for when a data center fell more than a week behind), the data centers stayed in sync. “All of our bug fixes went out, all of our features went out,” he said. “There was none of this weirdness where somebody hit a particular region and they didn’t see a feature, even though they saw it everywhere else. This is a huge insight that I just want to share because it saved me a ton of trouble and saved my team a ton of trouble.”
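The alerting rule behind such a dashboard can be very small. The sketch below is an illustration, not Microsoft's implementation. The data center names, version strings, and dates are invented, and it assumes that "behind" means the gap between the date the newest release was cut and the date the deployed release was cut.

```python
from datetime import date, timedelta

MAX_LAG = timedelta(days=7)  # "more than a week behind", per the talk


def stale_datacenters(deployed, release_dates, max_lag=MAX_LAG):
    """Return {datacenter: lag} for sites lagging the newest release by more than max_lag.

    deployed: data center name -> version string currently running there
    release_dates: version string -> date that version was cut
    """
    newest = max(release_dates.values())
    stale = {}
    for dc, version in deployed.items():
        lag = newest - release_dates[version]
        if lag > max_lag:
            stale[dc] = lag
    return stale


# Hypothetical inventory, for illustration only.
release_dates = {
    "1.4.0": date(2019, 1, 7),
    "1.5.0": date(2019, 1, 21),
    "1.6.0": date(2019, 1, 28),
}
deployed = {"dc-east": "1.6.0", "dc-west": "1.5.0", "dc-north": "1.4.0"}

alerts = stale_datacenters(deployed, release_dates)
```

With this data, alerts contains only dc-north, which is 21 days behind. dc-west is exactly one week behind and so stays under the threshold. The hard part in practice is collecting the deployed versions reliably; once that data exists, the check itself is trivial.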
3. Blackbox all the things. “It’s the technique of basically treating your system like you’re a customer, going with your customer’s expectations, and monitoring if you’re actually meeting your customers’ expectations,” said Burns. “The trouble is that, like unit tests and like everything else, if you let your engineers build monitoring with knowledge of the system, they will not build monitoring for the places where they have blind spots. Blackbox monitoring will find that kind of stuff.”
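A blackbox probe like the post-then-search monitor from the first story can be sketched in a few lines. Everything here is hypothetical: FlakySearch stands in for the system under test and deliberately loses every 200th document so the probe has something to find. The point is that the probe touches only the same post and search calls a customer would use.

```python
import uuid


class FlakySearch:
    """Stand-in for the system under test; a hypothetical API, not a real one.

    Every `drop_every`-th posted document is silently lost, imitating the kind
    of bug that shows up as a small, steady retrieval failure rate.
    """

    def __init__(self, drop_every=200):
        self.drop_every = drop_every
        self.posted = 0
        self.index = set()

    def post(self, doc_id):
        self.posted += 1
        if self.posted % self.drop_every != 0:
            self.index.add(doc_id)

    def search(self, doc_id):
        return doc_id in self.index


def probe_once(system):
    """Act like a customer: post a fresh document, then try to find it."""
    doc_id = uuid.uuid4().hex
    system.post(doc_id)
    return system.search(doc_id)


def failure_rate(system, probes):
    """Run the probe repeatedly and return the fraction of failed retrievals."""
    failures = sum(1 for _ in range(probes) if not probe_once(system))
    return failures / probes


rate = failure_rate(FlakySearch(drop_every=200), probes=1000)
```

Here rate comes out to 0.005, the same steady 0.5% that Burns described. The probe knows nothing about garbage collection or indexing internals, which is exactly why it can catch a failure the engineers had talked themselves out of.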
4. Beware the flashy demo. “Any of us who’ve lived through ops in the middle of the night or real debugging of real systems knows that something can look really great in a PowerPoint or a demo with the right synthetic data, and in the real world, turn out to really not be that useful,” said Burns. Make sure you know how a tool holds up in real-world use as well.
The Future of Monitoring
Burns then turned to what’s next.
“We’re getting closer and closer to this idea that you need to be able to package up a cloud-based application,” he said. The Helm project, which Burns helped lead, is a part of that, and the next step is making sure that when you deploy an app with a Helm chart, monitoring is built in. As we are using more and more off-the-shelf software, he added, we need to make sure that the software comes with monitoring by default.
And returning to the idea of the necessity and benefits of sharing, Burns made a pitch for building reusable components. “There is so much work that we have to do,” he said. “We have to broaden the tent for people who are capable of building distributed systems. If we don’t build reusable componentry, if we don’t view that as part of our job, not only are we going to reinvent the wheel over and over again, we’re not going to make the industry broad enough for people to come in and be successful.”
For more from GrafanaCon 2019, check out all the talks on YouTube.