
The golden path: Everything you need to reach observability maturity

From helping thousands of customers with their observability strategies – and hardening our own while running Grafana Cloud at massive scale – Grafana Labs has a comprehensive vision for how to turn signals into action and what observability maturity looks like. But we understand that achieving maturity across your entire estate may seem daunting. The No. 1 request our Observability Architects get from teams is for best practices: tried and true paths to successful deployment, laid out step by step, and examples of usable automation.

In this session, Grafana Labs experts will walk through real stories and strategies from our customers who have achieved observability maturity at both small and massive scale: how they deploy, how they automate, how they enforce, and how they measure value. They'll also share some new initiatives within our product and educational programs to make realizing maturity much less daunting.

Brandy Smith (00:00):

Hey everybody, I'm Brandy Smith. I'm one of the leaders on our observability architecture team, and I'm joined by Jeff.

Jeff Freeman (00:05):

Hi, I'm Jeff Freeman, VP of Customer Solutions, which is a totally made up name, but it's basically our professional services team, our SE team, our tech support team, and our observability architecture team.

Brandy Smith (00:20):

Awesome. So let's set the scene here. Picture this: it's lunchtime and a customer is on your app trying to order food, but the order fails. Here's the kicker. On the back end, one team's metrics are green, another's logs are delayed, and the alert that should have fired didn't. Frustrated, the customer opens your competitor's app and places an order there instead. To the customer, it just looks broken. Now it might seem like one lost transaction, but in reality it is thousands of lost sales and a hit to your brand's reputation, all because observability was fragmented. And that's the reality for a lot of teams today. Observability is everywhere, but it's fragmented. You've got siloed data, the ownership is unclear, and the context lives in tribal knowledge. We've normalized this chaos, but the truth is, if your strategy is fragmented, your response will be too. And teams have different definitions of what good looks like.

(01:20):

They might be asking, what are we instrumenting? Are our dashboards any good? Are we doing too much or not enough? Now while one team measures uptime, another logs everything, and another ignores alerts entirely. Alert fatigue is real. We've seen this with platform teams, SREs, product owners, and even executives. Everyone's answer is different, and that's a problem. And the cost is around misalignment, alert fatigue, bloated telemetry, and unclear return on investment. Excuse me. The answer here is to create a unified center of excellence to continually drive maturity, and we'll talk about what that means. So in organizations that have a mature observability strategy or centralized observability, 79% have saved time or money by doing so. You've probably heard a ton about our observability survey and that's where this comes from.

Jeff Freeman (02:13):

All right, all right, so let's talk a bit about the Golden Path framework. So this is kind of the center of today's discussion. Last year, just a little bit of history, we had a customer that came up to us and they had this large global company, this really broad observability platform, 79 different countries, and came to us with a really simple question, which was how good are we at observability? Are we doing the right things? And we were a little bit stunned by that. We didn't really know how to answer it. So that really encouraged us to create this whole thing around the observability maturity model, which is now live on our website. It's really actually quite simple. You can go in there, and it will basically ask you a bunch of questions and then pinpoint you along this observability journey as either reactive, proactive, or systematic, right?

(03:07):

So that was a really interesting step. We can plot you along the journey. And then that same person came back four months later and goes like, great, I'm in the red zone, apparently we're really reactive. How do we get to green? I was like, well, that's also a really good question we should probably have the answer to. And that's inspired us to build out what we're calling the golden path, which is real simple. Contextually, all that is is a set of best practices that we're going to harness together to give you insights as to how to improve your observability platform. So that's currently our work in progress, and we're starting with this one key area around center of excellence. And this was an easy way for us to start because a lot of questions came out more on the operational layer versus the technical layer. And this one customer, I'll let you speed read this quote, but came to us and said basically they were having really big problems in organizational structure, productivity, and velocity, and by going through and taking time with them to build a center of excellence, they had marked improvements in all areas. So Brandy, what exactly is a center of excellence?

Brandy Smith (04:20):

So a center of excellence essentially centralizes observability efforts and helps you get real-time, end-to-end visibility into your systems. We'll talk a little bit about what different centers of excellence look like, but it doesn't have to be one dedicated team. It could be a federated team or a hybrid team. It really depends on what works best for your business. It serves as a strategic function. It helps enable better decision making, and gives faster incident response and continuous improvement. It's not just about improving operations either. It's about driving the right culture. So you'll hear me mention culture a few times today. It's super important to get people, process, and technology right in this.

(05:03):

And so what does it actually look like in practice? Some of the functions of a center of excellence are obvious things like standards and best practices, education and training, looking at how to make the most of your tools, and consulting and supporting teams. But there are other functions that are not so obvious, but they are equally if not more important. So this will be things like tooling, templates, and guides: building and maintaining some of the tooling used by the teams. Maybe you're making dashboard templates or library panels. Then admin and vendor management. It is huge to have a centralized team to do this. It's not just administering the system itself, but looking at things like licenses, renewals, and identity and access management as well. And then community. Again, I told you this is going to come up a lot today. It's important to build a Grafana community, allowing us to exchange ideas and share information as well as success stories. Many of the customers that I've worked with in the past have a centralized team that then runs trainings for the broader folks using the platform, whether or not they're on the same team.

(06:12):

And then lastly, in large organizations, there may be a large rollout plan to replace existing tooling with Grafana, and someone needs to plan and track this. That's another place where the center of excellence comes in. And so bringing it all together, this is a high-level view into who does what. Like I said before, it can be a centralized, federated, or hybrid team. It really is about what works best for your organization. Different personas can be shared in a smaller team, so maybe wearing different hats as you see here. And the size truly depends on the organization. Sometimes the project manager or PM role may only be applicable in larger orgs. Smaller orgs just don't have those resources. And you could also have multiple SMEs in a large group. And so by show of hands, how many of you would consider yourselves to have an observability center of excellence?

Jeff Freeman (07:06):

Brandy? Brandy did a lot of work in actually establishing what a strong COE is at Grafana. So if you have an opportunity to talk to her later in the conference, definitely stop her. She's got a lot of great information on how to do it. So the golden path, needless to say, our marketing team is having a lot of fun designing our T-shirts around this. But from a rollout perspective, it is really about best practices and how we bring 'em all together. And so in order for us to do this effectively, certainly we want it to be outcome driven. We don't want to have the talk about the talk. What are you actually providing? What are you presenting? And how can we actually instrument this in our company? And we want them to be real. So the authors of all these different best practices, which will be shown out through white papers, how-to videos, demos, all different kinds of vehicles, but they're authored by our professional services team.

(08:04):

They're authored by our observability architects, like folks on Brandy's team. They're authored by our solutions engineers and our tech support teams, but also authored by our partners and customers who, in effect, actually have harnessed some really great best practices and will participate with this in collecting all this information. It's going to be available on our website. And the nice thing about it too, it's not going to be this long boring novel that you have to read page to page. It's going to be very modular. So for the areas that have specific interest or relevance, if you want to really focus in on tracing or if you really want to focus on COE, you'll be able to do so. You'll be able to pinpoint the areas that are most efficient for you and relevant. So earlier in the year, I asked my team a simple question.

(08:53):

And again, this is professional services, our tech support teams, our observability architects. I said, where do our customers get stuck in adopting Grafana Cloud? Where do they fall down? Where do things stall out? It was really interesting. They sent in just all the different topics and areas that they felt like their customers were experiencing. And it was interesting also that we could actually fit them all in these three nice little boxes, like one, certainly technology. So an example of technology best practice that we're working on is optimizing data telemetry pipelines. So from various different endpoints and applications from instrumentation into cloud, egress, ingress, add a cloud across VPCs. There's a lot there to know how to optimize things like fleet management with Alloy. That's a great example of a technological best practice white paper that we're building and we're going to be releasing soon. But operational excellence was another big topic.

(09:57):

So the things that Brandy was talking about for center of excellence, nothing to do with product at all, but really about how do we organize effectively internally. So Center of Excellence is a great example of that. Building power users is a great example. Interfacing with InfoSec was a really popular one, especially for things like log management and security access. These are things that have nothing to do with product, but we have to actually bring these things forth because you're going to run into these pitfalls invariably in your observability journey. It's our job to help you through it. And then this is one of the ones I found most interesting, which ties into our strategy in terms of how we're moving toward business observability was business insights. So you saw from Ocado, they were talking about in the end here, kind of the core benefits of moving to Grafana.

(10:50):

But where a lot of customers stall out or struggle is like, how do you actually materialize the metrics of that? So what does that actually mean for your company? What does it mean for your business? And it turns out it's actually pretty hard to do, right? Depending on how large a company is, how you've already reestablished these metrics. So some of the efforts that we're working on from a business insights perspective are to arm our customers with dashboards. Like, tactically, we can address things like, okay, how available is our observability platform? What are some of the cost savings we're yielding from things like adaptive telemetry year upon year? How are we enabling our teams and our users, and how effective are we at that? What are some of our MTTX timeframes across these applications? Those actually are fairly simple. But then if you take an even greater leap and say, okay, well, what does this actually mean to the business? So that's an even higher level of abstraction we can work with you on, saying, okay, we are shipping orders faster because of our observability platform. We're able to mitigate churn because of our observability platform and the insights that we're getting, and we're able to actually increase our customer velocity. So those are examples that will have very industry-specific focus areas that we can start to bring to light.

(12:15):

And again, as we talk about it, a certain customer we just consulted with from a COE perspective, just by having this more advanced structure, this more kind of standardized approach, they had a real impact. They were able to obviously improve their MTTR as demonstrated here, but there was also a true financial impact that they received. And in fact, they built a dashboard that they show back to the business. So again, the business insights are really important, especially when you think about things that happen that are beyond your control. When you have leadership changes, when you have things like acquisitions that occur, it's always going to be important to have these metrics of success at your fingertips. So how do we get started, Brandy?

Brandy Smith (13:01):

Yeah, so we are going to have a Golden Path Hub. It's coming soon, and it's where you'll be able to see all of these resources as we launch the different modules of the Golden Path. In the meantime, you can find an observability architect here today. If you have questions, find one of us. And some of you may already be working with our observability architecture teams, so definitely engage them if you've got an OA that you're working with, or your Grafana account team. And so the first module of this that we're going to be releasing, as I mentioned, is center of excellence. We've got a few resources for you already available. That's going to be things like a tactical center of excellence checklist: how do I go and build this thing? We've also got our maturity model, which you can take today. Some of you did and had a session with our OAs. And there are some other resources here.
