Planet-scale dashboards at Google with Grafana

Google runs hundreds of thousands of services globally, often interdependent and with shared telemetry. At that scale, classic federated observability—a platform team providing foundations and/or building blocks for each team to assemble on their own—does not work anymore.

In this talk, Katia Giarda, Software Reliability Manager at Google, and Carl Bergquist, Principal Engineer at Grafana Labs, demonstrate how Google managed to cut toil dramatically while providing best-in-class monitoring out of the box.

The presentation covers:

The unique circumstances that contribute to Google’s scaling problems
A data model for re-usable dashboards
Impact on both configuration overhead and incident response
Looking beyond dashboards, how such re-use can be facilitated in the broader observability space

Katia draws on Google’s research paper on planet-scale dashboards (to be published) and more than a decade of experience in SRE. Carl demos how Grafana has incorporated these ideas to enable Google to replace its existing dashboarding tool with Grafana.

Carl Bergquist (00:00):

Thank you. You want to have that? Thank you. My name is Carl Bergquist. I'm a Principal Engineer at Grafana Labs.

Katia Giarda (00:08):

My name is Katia Giarda. I work at Google on the monitoring team. With this presentation, we want to show you what planet-scale dashboards mean for Google. At Google scale, it's definitely impossible to manually maintain tens of thousands of dashboards or hundreds of thousands of systems. This would immediately become cost-prohibitive, and this is the reason why Google developed this internal system, planet-scale dashboards, as a single pane of glass for observing the health of its internal systems. I will walk you through the journey that made this a reality. To do this, we're focusing on three main areas. The first one is about reuse, to reduce the toil of maintaining this huge number of dashboards. The second one is about scalability, because when you have tens of thousands of dashboards, you need to ensure that your observability system is scaling in terms of performance

(01:13):

when you are monitoring this, to ensure that you have a quick response to outages, for example. And last, an outlook: what you can do with the system. It's true that at Google we are working at large-corporation scale, but we believe that this can be useful even for smaller organizations. So, to do this, let's take a scenario you may be familiar with. Let's assume that your developer team put a new, wonderful, shiny system into production today, and they managed to put off all the monitoring stuff. So, they ignore all the persistent emails to set up the monitoring, they postpone the dashboarding work, relying on the fact that the system is stable. And then, the inevitable happens. At 3:00 AM your pager screams to life and, after the initial shock, you are thinking, "Oh my god. So, how much manual toil is ahead of me?

(02:10):

What should I do?"

(02:13):

And this is where the magic happens. Imagine that you're landing on your monitoring page and, instead of scrambling to find the right query or the right dashboard, you can select only the systems or the services that you are looking for, and immediately you get a high-signal dashboard, fully provisioned and tailored for what you're looking for. Wouldn't this be amazing? So, let's have a look at what this dashboard can look like. This is a screenshot from an internal observability dashboard at Google. This is a standard dashboard, let's say, for a frontend job showing the HTTP metrics, in which you get an overview and a list of out-of-the-box dashboards that you can navigate on the left-hand side. And on top of that, you have additional capabilities like filtering and grouping by a specific dimension. Now, this looks amazing, but it can be a problem.

(03:17):

Let me explain why, but first, let me give you some numbers. At Google, we are dealing with 190K Googlers, according to the last earnings call from 2025. That is the equivalent of a large urban center. We are dealing with a single monolithic code repository. We are speaking about billions of lines of code, with a lot of reuse and shared code. And on top of that, we are dealing with internet-scale applications and systems. So, this means that we are dealing with billions of apps distributed all around the globe. Now, keep these numbers in mind and let's go back to our dashboard. Imagine that somebody creates a dashboard for each frontend service for this specific metric. Fine. The dashboards are ready, containing all the information needed. And now, what does the engineer do? "Okay, I need this dashboard for my backend, for my middleware," or for whatever you have.

(04:19):

Then, you take the dashboard, you copy and paste the dashboard or... Then, imagine this happening at Google scale. For this metric alone, we are speaking about something on the order of 100K dashboards. This is something that we cannot sustain. It's impossible. You get lost immediately. So, this is the kind of problem that we need to solve now. How are we going to solve this kind of problem? At Google, we have something that we call dimensions. You can call them, I think in Grafana terms, variables or template variables. Imagine that you provide a fully qualified query only at run time, when you inject the variable into your dashboard. You're making your dashboard reusable because you can use the dashboard for all the services it applies to. And this is the way in which you can immediately scale in terms of reusability of the dashboard.
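
As a rough illustration of the idea just described, the sketch below keeps one query with a placeholder and fills in the dimension only at run time. It is a minimal sketch, not Google's or Grafana's implementation: the metric name `http_requests_total`, the label `job`, and the helper `render_query` are all assumptions made up for this example.

```python
from string import Template

# Hypothetical reusable query: the service is a placeholder, not a hard-coded value.
QUERY_TEMPLATE = Template('sum(rate(http_requests_total{job="$job"}[5m]))')


def render_query(job: str) -> str:
    """Fill in the dimension at run time to get a fully qualified query."""
    return QUERY_TEMPLATE.substitute(job=job)


# One dashboard definition serves every service that exposes the metric.
print(render_query("frontend"))
print(render_query("backend"))
```

The point of the design is that the dashboard definition is stored once, and only the value of the dimension changes per service.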

(05:20):

This looks great, but there is another problem behind it: the dashboard then applies to each and every system that you have. So, what if your system is running on a Java virtual machine and this dashboard applies to something else? It's not relevant, right? So, in the list of out-of-the-box dashboards that you have on the left-hand side, you want to see only the dashboards that are relevant for you. So now, we have two problems. How do we inject the scope, the job, and how do we filter the dashboards on the left-hand side?

(05:54):

And this is where planet-scale dashboards play an essential role. I told you that we have dimensions, or variables. So, we lift up one dimension to be more equal than the others, and we call this dimension the scope. The scope, in the end, is the filter for your job. Or better, it is the filter for whatever you're interested in for your investigation. It is a filter for your scope, and this is the reason why we call it scope. So now, whenever you select the scope, you get the dashboard fully provisioned, with all the navigation filled in for the scope that you have. The second problem, how we filter the dashboards relevant for your scope, we solve in a slightly different way. We ensure that your jobs expose the properties related to themselves. So, if you have a job that runs on a Java virtual machine, we make the job expose a metric, let's say, "runs on JVM", and we ensure that whenever the scope is selected, the relevant dashboards for the Java virtual machine show up.

(07:06):

And this way, you get the filter out of the box for tens of thousands of dashboards, and you have a list of canonical dashboards that you can reuse across the full organization.
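
The filtering step can be pictured roughly as follows. This is only a sketch under stated assumptions: the property names (`serves_http`, `runs_on_jvm`, `runs_on_go`), the catalog, and the `dashboards_for` helper are hypothetical, and in the system described in the talk the properties would come from metrics the jobs expose rather than from a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dashboard:
    title: str
    required_property: str  # property a job must expose for this dashboard to apply


# Hypothetical catalog of canonical dashboards shared by the whole organization.
CATALOG = [
    Dashboard("HTTP overview", "serves_http"),
    Dashboard("JVM runtime", "runs_on_jvm"),
    Dashboard("Go runtime", "runs_on_go"),
]

# Hypothetical properties that each job exposes about itself.
EXPOSED = {
    "frontend": {"serves_http", "runs_on_jvm"},
    "batch-worker": {"runs_on_go"},
}


def dashboards_for(scope: str) -> list[str]:
    """Return only the canonical dashboards relevant to the selected scope."""
    properties = EXPOSED.get(scope, set())
    return [d.title for d in CATALOG if d.required_property in properties]


print(dashboards_for("frontend"))
print(dashboards_for("batch-worker"))
```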

(07:21):

What do we get so far? First, the dashboards are reusable. So, we reuse the dashboard for any similar system and there is a single dashboard per concern. Second, the dashboards are navigable. The list of dashboards is filtered by the system that you're looking at, and we make this generic. Let me give you another couple of quick wins that you can get with navigable dashboards. If you can inject the job into a URL parameter, then when your alert fires and you get paged, you get a deterministic link to the dashboard without having to search for anything. And with the same parameter injected in the URL, you can easily navigate through all the dashboards that are related to this specific job without doing anything else. That saves time, money, and maintenance.
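
A deterministic link of this kind could be built along the following lines. This is a sketch, not the internal Google mechanism: the base URL is made up, and the parameter name `var-job` is borrowed here from Grafana's `var-<name>` convention for passing template variables in a URL, purely as an assumption for illustration.

```python
from urllib.parse import urlencode

# Hypothetical dashboard address; any stable dashboard URL would do.
BASE_URL = "https://grafana.example.com/d/http-overview"


def dashboard_link(job: str) -> str:
    """Build the same link every time for the same job, so an alert can embed it."""
    return f"{BASE_URL}?{urlencode({'var-job': job})}"


print(dashboard_link("frontend"))
```

Because the link depends only on the job name, the alerting system can generate it without knowing anything else about the dashboard.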

(08:18):

A few quick words that I have been told to tell you. First of all, I'm referencing a huge number of technologies, but it's just to make clear that this is broadly applicable. It's not a sales pitch, because this applies only to what we have for internal observability at Google. And this is a preview of a paper that is in submission, so it can be imperfect. It will be published soon, and then you can get access to the paper and get something out of it. Now, where does Grafana enter the game? Let me go back to when Google started its observability journey. At the time, there was nothing on the market and Google decided to build everything itself. So, we have the frontend, we have the backend, we have the storage, everything. But in the last few years, there was a strategic shift because the world has changed.

(09:13):

We now have a lot of industry standards, and Grafana, for example, is leading one of these industry standards. And it can be beneficial for Google to make sure that we are aligning with the industry standards, for multiple reasons. First of all, we can unify the flow and ensure that the experience from Google is taken into account, that the experience from Grafana, for example, is taken into account, and create a new, let's say, enriched platform.

(09:43):

It allows teams to cooperate easily: when people come from the same technology stack, it is easier to share experience and templates and to collaborate, and it's easier to onboard people if they're already familiar with the technology stack. How are we doing this? Right now, there's a partnership between Google and Grafana, and we are running Grafana Enterprise for our internal usage, with broader use cases and a larger user base. The reason why Grafana was well-positioned for Google is that the architecture is modular, and this is an excellent entry point. They have a lot of visualization capabilities that can be very useful for Google, and Google can shift its focus in developing the UI, for example, towards things that are more unique to Google, make sure that we are using industry standards, and share this more broadly with the community, while we focus on integrated solutions and services.

(10:46):

Right now, we are using this internally. We are collecting feedback from users, a lot of feedback, to ensure that all of it is captured and that we can deploy, together and in sync with Grafana, an even better version of the system, in order to ensure that all the experience coming from Google is shared across the community.

(11:09):

Before handing over to Carl for specific details on what we're doing together right now, let me show you something that I personally believe is the proof of why the system is needed. In this chart you can see the growth of the number of systems monitored with the traditional dashboarding system and with the planet-scale system at Google. Without this system, it would not be possible to monitor the number of systems tracked here. You see the growth is more than linear. It tends towards exponential, and there would not be any way to monitor this without a system like that. And with this, I think I will hand over to Carl.

Carl Bergquist (11:57):

Thank you. So, what I want to cover now is how we implemented the idea of scopes from Google in Grafana. So, to set the scene a little bit: let's imagine you're Google. You have one dashboarding solution, you have 50,000 engineers, hundreds of thousands of services, and that number is just increasing, as Katia just mentioned, very rapidly. And you have an ocean of metrics. You have all of this shared infrastructure and different layers of software, hardware, and other internal sources. So, wouldn't it be good if there was one dashboard for each layer, or at least one dashboard built by experts? And wouldn't it be really good if, when you look at that dashboard, you only saw the telemetry data for your service? And that is kind of what scopes enable with just one click. So, to take it to Grafana and Prometheus terms: a scope is a named set of filters.

(12:54):

It could be any set of filters in Prometheus, but for the sake of this presentation, it's usually easier to think about it as namespaces. The user journey when using scopes doesn't start with a dashboard. It starts with you, as a user, selecting the scope. You might not know what metrics you want to look at, but you do know what system or namespace or scope you care about.

(13:20):

Grafana then injects the scope into the dashboard query at runtime, before it's sent to Prometheus. And by extracting the labels that you care about right now from the dashboard and injecting them at runtime, the dashboard becomes much more reusable. And the experts who build the dashboard don't have to care about what labels the metrics they want to graph eventually have. Those are disconnected. By moving the filters outside of the dashboard query, you can look at metrics from various angles. So you might care about Go metrics at a cluster level if you're rolling out a new Kubernetes node type. You might care about them at a namespace level if you care about which application is having problems. Once you know what service within that namespace is having problems, you might want to look at the job label. And all of those scenarios are covered by the same reusable dashboard.
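
To make the injection step concrete, here is a deliberately simplified sketch. It does not parse PromQL properly; it assumes every selector in the dashboard query already carries a `{...}` matcher block (possibly empty) and that no label value contains braces, and it simply appends the scope's filters to each block with a regular expression. The function name `inject_scope` and the example scope are hypothetical.

```python
import re


def inject_scope(query: str, filters: dict[str, str]) -> str:
    """Append scope filters to every {...} matcher block in the query."""
    if not filters:
        return query
    extra = ",".join(f'{label}="{value}"' for label, value in sorted(filters.items()))

    def add_filters(match: re.Match) -> str:
        existing = match.group(1).strip()
        merged = f"{existing},{extra}" if existing else extra
        return "{" + merged + "}"

    return re.sub(r"\{([^{}]*)\}", add_filters, query)


# Hypothetical scope: a named set of filters, here a single namespace.
CHECKOUT_SCOPE = {"namespace": "checkout"}

# The dashboard query itself stays generic and knows nothing about the scope.
print(inject_scope("sum(rate(go_gc_duration_seconds_count{}[5m]))", CHECKOUT_SCOPE))
```

Swapping `CHECKOUT_SCOPE` for a cluster-level or job-level set of filters changes the angle of the same dashboard without touching its query.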

(14:21):

The scope that you selected initially in the user journey also stays when you navigate dashboards. So, if you selected this namespace, you can look at Go metrics, RPC metrics, Kubernetes metrics, with the same filters always applied. So, let's see that in action. The scope selector lives up here on the left side, so let's zoom in a little bit. And this is how you select your named set of labels. So, you as a user don't need to care about the actual labels, you just care about the name. What happens then is that the labels are added to the filter box. The dashboards themselves don't care about that. They're just generic.

(15:25):

So, exit edit mode and then... Oh, no we don't. One second. Exit edit, and discard. But if we look at the actual query sent to Prometheus, we see that the namespace is injected, and we do that by parsing the Prometheus query in Grafana and injecting the labels. The new filter box also allows you to inject filters that you as a user just want to apply. So, if we take something like... Let's take cluster, those are also injected. So this dashboard, it's much more reusable because it doesn't care about the labels used in your system.

(16:19):

We also added support for adding group by on the fly, at runtime. So, if I want to group by job, I am now able to aggregate the process memory based on the job that I'm filtering on. The menu here on the left side also adds navigation to the dashboards relevant for your scope, and when you navigate these dashboards, both the scope and the applied filters remain selected. So, this makes it very easy to drill into something, navigate different dashboards, and maybe remove filters if needed or not. But it all stays, because that's the system that you care about.
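
One simple way to picture a runtime group by is to wrap the panel's expression in a PromQL aggregation. The talk does not say how Grafana does this internally, so the sketch below is only an assumption: the helper `group_by`, the default `sum` aggregator, and the example metric are chosen for illustration.

```python
def group_by(query: str, labels: list[str], aggregator: str = "sum") -> str:
    """Wrap an expression in a PromQL aggregation over the chosen labels."""
    if not labels:
        return query
    return f"{aggregator} by ({', '.join(labels)}) ({query})"


# Process memory for the selected scope, split per job at run time.
print(group_by('process_resident_memory_bytes{namespace="checkout"}', ["job"]))
```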

(17:13):

And I know some of you maybe felt like the clicking journey here in the beginning was a little bit cumbersome, and Google engineers very much felt that too. So, we added support for finding scopes through the quick navigation item as well, so you can apply recently applied scopes or you can just navigate to a scope using the keyboard. So, this is a quick way of zooming in on the infrastructure or processes you care about, and then having the relevant dashboards show up for you. So, let's go back to the slide please. Sweet. And the dashboard navigation list here on the left side is really one of the key features of scopes because it allows you to quickly navigate different dashboards, as I said. And you might think, "What's the difference between these links and the normal dashboard links?" And the fact is that these links are managed automatically by Grafana.

(18:17):

Based on the selected scope, it figures out if it has relevant metrics for you or not. So, if you selected a namespace and there are Go metrics in that namespace, you will see Go dashboards. If there are Java metrics, you will see Java dashboards. So, you only see the dashboards relevant for the scope you selected.

(18:38):

We do that by adding metadata to dashboards that describes what metrics they are designed for. And then we check, in the background, if those metrics exist within a scope. And this is pre-computed, so getting this list of links is blazing fast, and this query is just an example of how to do it. At Google scale, this kind of starts to break down and there are other ways of doing this, but this is how we're getting started in Grafana. Managing scopes is also something you can do through automation. You do it based on metrics queries. So, if new clusters, services, or namespaces start to exist as a result of the metrics queries, new scopes are created for you. So, neither you as an engineer nor the observability platform team needs to think about it. You only need to think about the query that defines what scopes you want generated for you, which makes it quite flexible depending on how you decide to run your infrastructure.
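
The pre-computation of dashboard links might look roughly like this. It is a sketch under assumptions: the metadata format, the metric names, and the `precompute_links` helper are invented for the example, and the per-scope metric sets stand in for whatever a background query against the metrics database would return.

```python
# Hypothetical metadata: the metrics each dashboard is designed for.
DASHBOARD_METRICS = {
    "Go runtime": {"go_goroutines", "go_gc_duration_seconds"},
    "JVM runtime": {"jvm_memory_used_bytes"},
}

# Hypothetical result of a background job listing the metric names seen per scope.
METRICS_IN_SCOPE = {
    "checkout": {"go_goroutines", "http_requests_total"},
    "billing": {"jvm_memory_used_bytes", "http_requests_total"},
}


def precompute_links(dashboard_metrics, metrics_in_scope):
    """Build a scope -> dashboard titles index ahead of time."""
    index = {}
    for scope, present in metrics_in_scope.items():
        index[scope] = sorted(
            title for title, wanted in dashboard_metrics.items() if wanted & present
        )
    return index


# Looking up the links for a selected scope is then a plain dictionary read.
LINKS = precompute_links(DASHBOARD_METRICS, METRICS_IN_SCOPE)
print(LINKS["checkout"])
print(LINKS["billing"])
```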

(19:44):

We don't need to have an opinion about that at all. It's all going to be tailored to you. Once infrastructure is decommissioned, the scope will eventually be deleted as well, based on a TTL.
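
The lifecycle described here, scopes created from query results and removed after a TTL, can be sketched as a small reconciliation step. Everything in it is an assumption for illustration: the `sync_scopes` helper, the seven-day TTL, and the idea of storing a last-seen timestamp per scope are not taken from Grafana.

```python
TTL_SECONDS = 7 * 24 * 3600  # hypothetical retention for scopes no longer seen


def sync_scopes(
    scopes: dict[str, float],
    seen_now: set[str],
    now: float,
    ttl: float = TTL_SECONDS,
) -> dict[str, float]:
    """Reconcile known scopes (name -> last time seen) with fresh query results."""
    # Drop scopes whose infrastructure has not shown up within the TTL.
    updated = {name: last for name, last in scopes.items() if now - last <= ttl}
    # Create or refresh a scope for everything the metrics query returned.
    for name in seen_now:
        updated[name] = now
    return updated


known = {"old-cluster": 0.0, "checkout": 900.0}
print(sync_scopes(known, {"checkout", "billing"}, now=1000.0, ttl=500.0))
```

Running this step periodically means neither engineers nor the platform team create or delete scopes by hand; only the defining query and the TTL are configured.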

(19:59):

So, what scopes give you at scale is a way of zooming in and seeing only the metrics and dashboards relevant for you. It greatly increases the reusability of dashboards, and it allows your platform observability team to manage these as higher-order functions. They care about how to generate scopes and how to connect scopes and dashboards, but not about each of them individually. And I'll also ask the serious question: when are scopes suited, though? Because this is a feature designed for high scale. So, it's really designed for companies who have one metric database. A lot of this experience is based on the metrics, so it has to be one, otherwise it's going to be quite cumbersome. You need a central observability team that can manage this configuration at scale. Someone needs to understand enough about the infrastructure that they can configure it. They don't need to understand all of it, but enough.

(20:54):

It's also more suitable for organizations that have experts building dashboards for others. So, if your organization has all of its teams completely siloed and independent, the reusability is lower compared to if you have experts building layers or services for each other. So, if you have a database team who builds the database dashboards for others, or owns them, then this is more useful for you.

(21:21):

Scopes are currently in an experimental stage. Please sign up if you're interested. You can also find a QR code that allows you to ask us both questions later. It's quite early in the development process, but we're just too excited to not share it with you. It's also going to be an Enterprise and Cloud feature. But that said, most of the features that we're building based on feedback from Google are going into open source. So, the improved ad hoc filters, variables, group by variables, drill down per panel, and section-level variables, those are all based on feedback from Google and they're going into open source. So, this is not just exciting for us and Google, this is exciting for the community as a whole, because Google is making us up-level dashboarding in general. So, we are very excited about this, obviously. The Google book is a big part of Grafana culture, and we're very happy to take the expertise from Google and build it into Grafana and make it available to the community.

(22:29):

Scopes enable you to drill down on what's relevant just for you at high scale, and ad hoc filters and group by variables make dashboards much more usable, even for your home lab.

Katia Giarda (22:42):

We are excited too to have this collaboration with Grafana. We can level up the whole community by introducing this new concept, and we can work together to get a better observability world.

Carl Bergquist (22:56):

I think that's it. Thank you.

Speakers

Carl Bergquist
Principal Software Engineer — Grafana Labs
Katia Giarda
Software Reliability Manager — Google
