How Prometheus monitoring mixins can make effective observability strategies accessible to all

Published: 14 Jan 2021

Three years ago, Tom Wilkie and Frederic Branczyk sketched out the idea for Prometheus monitoring mixins. This is a jsonnet-based package format for grouping and distributing logically related Grafana dashboards with Prometheus alerts and rules. The premise was that the observability world needed a way for system authors to not only emit metrics, but also provide guidance on how to use those metrics to monitor their systems properly. 

The single most important thing mixins introduced in pursuit of that premise was the jsonnet-enabled capability for users to customize mixins on the fly by injecting and overriding variables. That capability lets upstream authors create useful objects without having to decide whether, say, your etcd clusters are supposed to have 3 or 5 nodes.
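As a rough illustration of that override mechanism, here is a minimal Python analogy rather than real mixin code. In jsonnet mixins the same thing is conventionally done by patching a hidden `_config` object. The function name `render_etcd_alerts`, the config keys, and the alert expression are all hypothetical and are not taken from the actual etcd mixin.

```python
# Hypothetical upstream defaults; a real mixin would keep these in jsonnet.
DEFAULT_CONFIG = {
    "etcd_cluster_size": 3,
    "etcd_selector": 'job="etcd"',
}


def render_etcd_alerts(overrides=None):
    """Build a Prometheus rule-group dict from defaults plus user overrides."""
    config = dict(DEFAULT_CONFIG)
    config.update(overrides or {})

    # A majority of members must be up for the cluster to keep quorum.
    quorum = config["etcd_cluster_size"] // 2 + 1
    expr = "count(up{%s} == 1) < %d" % (config["etcd_selector"], quorum)

    return {
        "groups": [
            {
                "name": "etcd",
                "rules": [
                    {
                        "alert": "EtcdInsufficientMembers",
                        "expr": expr,
                        "for": "3m",
                        "labels": {"severity": "critical"},
                    }
                ],
            }
        ]
    }


# The upstream author ships the default; a consumer with 5 nodes overrides it.
default_rules = render_etcd_alerts()
five_node_rules = render_etcd_alerts({"etcd_cluster_size": 5})
```

The upstream author writes the alert once, and each consumer supplies only the values that differ in their environment.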

The importance of this effort has only grown since then. Cloud native’s explosive growth has continued, and with it the steady growth in both the complexity and the number of our production systems. For all but the largest players, that growth pushes effective observability strategies, and the operational confidence they bring, further out of reach. Mixins have the potential to mitigate or reverse that trend by distributing the OSS portion of that responsibility to the OSS community itself, rather than requiring each organization to build the same operational artifacts over and over.

And mixins work! The Kubernetes mixin sets the high-water mark for mixins, demonstrating that the approach is fundamentally capable of standing up even to the towering complexity of Kubernetes. We use mixins widely at Grafana Labs. With the assistance of Tanka, we manage many of our cloud platform’s o11y objects with mixins. They also power the integrations that make it easy to get started quickly with metrics, logs, and traces on Grafana Cloud — and we’re now shipping them at a quick and accelerating pace!

But the “long tail” of mixins is… well, neither as long nor as organic as we’d like. Most of the mixins we’re aware of are written by Tom, Fred, or folks who know or work with them. If mixins are ever to realistically counterbalance the ongoing complexity explosion of cloud native, that needs to change.

So, in the spirit of Grafana Labs’ ethos of maintaining openness and honesty with the OSS world that sustains us, let’s dive into some uncertain future-talk!

The challenges with mixins

Increasing adoption is probably more art than science. It’s rarely clear what the most important limiting factor(s) are, and therefore, where effort should be directed. But we’ve been discussing this intensely for many months within Grafana Labs, and a few things are coming into focus.

First, producing mixins is neither horribly difficult, nor trivial. mixtool helps with minor scaffolding and CI checking, but jsonnet largely entails a text-centric workflow. It’s up to the user to figure out how to set up a functioning environment in which to create and test any rules, dashboards, and alerts. And if there’s any lesson to be taken from the essential importance of interactive data science tooling like Jupyter notebooks, it’s that we can’t reasonably expect people to query, manipulate, or otherwise work with data without being able to immediately observe the effects of their actions.

Consuming mixins has related challenges. Text-centrism means that we’re asking users to mentally evaluate jsonnet into dashboard JSON and rule/alert YAML, then predict the effect of applying those to Grafana and Prometheus. That’s a nontrivial cognitive load even for the most knowledgeable folks, and the rest of us really need to see the effects of applying a PR before we can be truly confident about approving it.

This isn’t unsolvable, of course — “Run a dev instance of Grafana and Prometheus to test things out!” But that exacerbates a workflow problem. Mixins tacitly assume your org has worked out some smooth, structured path for injecting appropriate variables into mixin jsonnet, and transforming the result into running Grafana and Prometheus objects. Again, we rely on Tanka to handle this at Grafana Labs, but Tanka’s scope is broader than just this, and may not fit neatly alongside other infrastructure management tool choices an organization has already made.

Finally, even under the best of circumstances, generic dashboards/alerts/rules provided from some upstream source are… well, generic. No matter how many variables the mixin lets you pass in, some things really just take local knowledge to answer. (What panels should be on this dashboard of mine? Why?) They can be useful, even essential, when you’re getting started, but folks generally reach a point where they need to tweak, if not roll their own. Jsonnet allows this, but merging and overriding config from every-which-place gets convoluted and messy, further exacerbating the aforementioned consumption challenges. And while you can always just make a copy of a dashboard, such copies are severed from the release train, and you get no upstream updates.
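To make the tweak-versus-copy tradeoff concrete, here is a small Python sketch of patch-style overriding, loosely analogous to what jsonnet’s object merging allows. The helper `deep_merge` and the dashboard fragments are hypothetical and heavily simplified; they are not how any particular mixin is implemented.

```python
def deep_merge(base, patch):
    """Return a new dict: `patch` wins on conflicts, nested dicts merge."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical upstream dashboard fragment (most fields omitted).
upstream_v1 = {
    "title": "etcd",
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"},
}

# Local knowledge lives in a small patch, not in a full copy.
local_patch = {"time": {"from": "now-6h"}}

# Simulate an upstream release that changes a default.
upstream_v2 = deep_merge(upstream_v1, {"refresh": "10s"})

# Re-applying the patch keeps the local tweak and picks up the new default.
dashboard = deep_merge(upstream_v2, local_patch)
```

A full copy of `upstream_v1` would have stayed on the old `refresh` value. The patch approach stays on the release train, but each additional patch layer makes the final result harder to predict by reading the source, and list-valued fields such as panels do not merge this cleanly.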

My intuition is that it’s this combination of day-to-day use and maintenance challenges, plus the complexity of creating a smooth workflow, that makes the ROI on mixins dubious. And because it’s so easy to just inline some JSON/YAML, or reference a dashboard in Grafana’s marketplace, that seems to be what usually happens in the wild.

Mixing things up

We want to make our process for working on this public, once it amounts to something more than just tossing people into a butter churn of ideas. To that end, and with the above challenges in mind, we’re gearing up to start some prototyping. As with all prototyping, we can’t know exactly where it will end up. But there are three overarching goals:

  • Smooth out and better manage the full lifecycle of a mixin, from production to consumption, while staying friendly to the “*-as-code” ethos.
  • Decompose the relatively big things mixins provide — e.g., whole dashboards — so that when folks move beyond the defaults provided by mixins, they have bits and pieces they can actually reuse, rather than just having to copy/paste.
  • Make mixin production as useful for internal systems as it is for OSS ones.
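To give a feel for the decomposition goal, here is a hedged Python sketch in which a dashboard is assembled from individually exported panels. It is an illustration of the idea only, not a proposed design: the helpers `timeseries_panel` and `dashboard`, the `PANELS` registry, and the metric names are assumptions made for the example, and the panel JSON is heavily simplified.

```python
def timeseries_panel(title, expr):
    """Return one reusable panel as a plain dict (simplified panel JSON)."""
    return {
        "type": "timeseries",
        "title": title,
        "targets": [{"expr": expr}],
    }


def dashboard(title, panels):
    """Assemble panels into a dashboard, assigning sequential panel ids."""
    return {
        "title": title,
        "panels": [dict(panel, id=i + 1) for i, panel in enumerate(panels)],
    }


# Hypothetical building blocks an upstream mixin could export individually.
PANELS = {
    "leader_changes": timeseries_panel(
        "Leader changes",
        "rate(etcd_server_leader_changes_seen_total[5m])",
    ),
    "wal_fsync": timeseries_panel(
        "WAL fsync p99",
        "histogram_quantile(0.99, "
        "rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))",
    ),
}

# The default dashboard the mixin ships.
upstream_dashboard = dashboard("etcd", list(PANELS.values()))

# A consumer reuses one upstream panel next to a panel of their own.
my_dashboard = dashboard(
    "Checkout service",
    [
        PANELS["leader_changes"],
        timeseries_panel(
            "Checkout p99 latency",
            "histogram_quantile(0.99, sum by (le) "
            "(rate(checkout_request_duration_seconds_bucket[5m])))",
        ),
    ],
)
```

Because the consumer references `PANELS["leader_changes"]` rather than pasting its JSON, an upstream fix to that panel would flow into `my_dashboard` the next time it is rendered.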

It’s possible that this may mean moving away from jsonnet (CUE, anyone?). Or not! We’ll learn and share as we go, and bring all of this to the wider OSS mixin community. Mixins aren’t ours alone to change, after all!

Personally, I’m pretty excited about the possibility that our new free tier of Grafana Cloud could support the full mixin development lifecycle for OSS authors: “Grafana as IDE!” We think there’s tremendous potential here to ease operational burden, as well as make the whole discipline of observability more accessible to everyone creating software. Let’s democratize those metrics!