How Delivery Hero migrated 2,000+ users to Grafana Cloud and reshaped observability at scale

Company: Delivery Hero

Industry: Retail

Delivery Hero is one of the world’s largest online food ordering and delivery platforms, operating across 70 countries with 11 brands, 3 million riders, and over 1 million restaurants. From Berlin to Seoul to Buenos Aires, the company manages complex, independent tech stacks and teams across its global brands.

Challenge

With multiple brands using different observability tools, such as New Relic, Datadog, and Grafana, Delivery Hero lacked a common diagnostic language. Each brand operated in its own silo, with disconnected data and unpredictable costs. Engineering teams struggled to collaborate or share capabilities globally. The company needed a unified, cost-predictable observability platform that could scale across 3,000+ engineers and enable cross-brand collaboration.

Solution

Grafana Cloud became the foundation for a global observability migration anchored on OpenTelemetry. The project unified metrics, logs, and traces while making observability sustainable and developer-friendly.

Grafana Cloud provided a single pane of glass for all 11 global brands, connecting teams across regions.
OpenTelemetry replaced vendor-specific SDKs, simplifying data collection and ensuring vendor neutrality.
Grafana Labs’ Professional Services supported automated data and dashboard conversion from legacy systems.
Grafana Mimir enabled scalable metrics ingestion for 20+ petabytes of time series and logs.
Application Observability, Adaptive Telemetry, and Cost Attribution dashboards helped optimize usage, prevent runaway costs, and empower local teams to manage their data efficiently.

“We did accomplish a lot with this migration. We were able to get everyone into a single pane of glass. Engineers in South America can now talk to engineers in Berlin or Korea and so on. Everybody uses the same system.”
– Andy Howden, Engineering Director

Impact

The migration to Grafana Cloud transformed Delivery Hero’s observability across technical, operational, and financial dimensions.

Reduced observability costs from 15–25% of its total cloud spend to ~6%, creating predictable budgets.
Unified observability across 11 brands and 70 countries, fostering cross-team collaboration.
Improved developer efficiency through shared dashboards, alerts, and automations.
Simplified management of petabytes of logs and metrics through OpenTelemetry-based standardization.
Increased innovation speed as teams now reuse best practices and observability patterns globally.

“The core punchline: at the start of this project, we were looking at around 15 to 25% of our observability cost relative to cloud cost spend, which is quite a large amount if you think about the amount you're spending on AWS...and we're now looking at around 6%. We’re overwhelmingly in a better position than when we started. We were able to deliver on the project goals and meaningfully reduce our cost.”
– Andy Howden, Engineering Director

Andy Howden, Delivery Hero (00:03):

Oh, good morning everyone. Can we hear me? Yes, we can. The microphones work. This is going well so far. Hi. Hi. I'm here to tell you a bit of a story. It's a story about how we migrated a whole bunch of people from a range of these different observability providers to Grafana Cloud. It says 2000. We ended up going a little bit further. This was around 3,100 people so far. This is what I was checking in the dashboards this morning, and I'm here to tell you the story of the good parts and the bad parts and the tricky parts, how all of this stuff worked in practice. Clicker works, we're good. There's two things that you need to know in order to make sense of this talk. The first is who I am. I am the engineering director of an organization called Site Reliability Engineering.

(00:50):

There's a lot of that going around today. Basically we take care of things that make us more reliable. I live in Berlin. I was raised in Australia, so I have a very strange accent. That's okay. If you don't understand me, just sort of take a deep breath. It will be recorded. The second and far more important thing is about Delivery Hero. Now everyone's heard of Delivery... actually, hands up. Who's heard of Delivery Hero or Glovo? PedidosYa, Food Panda, any of these other brands? Fine. Okay. It seems like it's not that popular in this country or with these lovely people, but Delivery Hero is a very large company. It operates across the world. It has 70 different countries, 40,000 employees, 3 million riders (the people who actually deliver the food to your house), a million restaurants, 800 Dmarts, and we do north of 10 million orders per day.

(01:40):

But the most important part of this slide is that little section there that says 11 brands. We operate 11 of these different kind of companies across the world. Each of these companies has their own management stack, their own tech stack, their own language (either the language actually spoken in that country or the computing language), their own way of working, their own style of doing that kind of work. And what we have to do from a central perspective is deal with each one of these companies. That makes this incredibly complicated as a migration. We had a problem. There is a reason that we moved to Grafana Cloud. We didn't just kind of spontaneously wake up one day and decide it was a good idea. The core challenge that we have is that, of these companies that operate across the world, some are building these truly world-class, these really exciting capabilities, and we want to take these capabilities from one of these companies and we want to provide it to the rest of the companies that exist around the world.

(02:31):

So if somebody does something amazing in South America with the Pedidos team, then we can provide that to the Glovo team or we can provide that to the Talabat team. Or if we're able to build some central capability like Grafana, we're able to expose that to the rest of the world. But we had a fundamental challenge because each of these are separate companies and because each of these kind of operate very independently, they didn't have a common way of looking at their production system. So a kind of common diagnostic language. We have New Relic and Datadog and Grafana. Each of those are separate Datadog accounts that never talk to each other. So if you're thinking about trying to take a capability from one part of the organization and expose it to the other, it's essentially like working with a separate company and it just makes these things much harder than they need to be.

(03:18):

So we wanted to get everybody onto the same page. Now I'm going to cheat a little bit. I'm going to give you the punchline of this whole talk up front. I know there are many management people in here, so I'm going to make this sort of fairly numeric. We did accomplish a lot with this migration. We were able to get everybody into that single pane of glass. Everybody now uses the same system to observe their stuff and they're able to talk to each other across these organizational boundaries: the South Americans can talk to the people in Berlin, people in Berlin can talk to the people in Korea, so on and so forth. The other thing that we aimed for as part of this project was financial sustainability. There have been many hints at some of the challenges that we have had with other observability providers in making these things sustainable. We would suddenly find that people had turned on a capability that they found very exciting and they'd left that capability running for a couple of months and they'd suddenly get this bill and it would be problematic, or there'd be sort of surprising ways in which the pricing worked, or other bits and pieces.

(04:19):

It became very hard to make the financial aspect of observability predictable, which is the most important part: how much it costs, and whether you can budget for it. Predictability, so that you can fit within the 12-month budgets that companies work to, is an absolute must. We were able to get everything on OpenTelemetry. OpenTelemetry is good. That's part of what I will talk about. We were able to do an enormous amount of cleanup. Turns out that developers create an enormous number of dashboards and alerts and things like that, and they look at them for a good two weeks and then never look at them again. So we just cleaned most of that stuff up. And lastly, most critically, because this is a large integrated company, we're looking at distributed tracing as our next key technology for integrating across these things, to make working with this software so much easier.

(05:05):

The core punchline: at the start of this project, we were looking at around 15 to 25% of our observability cost relative to cloud cost spend, which is quite a large amount if you think about the amount you're spending on AWS and Google Cloud and whatever, and we're now looking at around 6%. Now to be clear, this number cheats a little bit. They're not directly comparable, but it gives you an idea. We're now in a position where things are much cheaper and much more sustainable, much more predictable after this migration to Grafana Cloud and after the move from these proprietary SDKs towards OpenTelemetry. But while that's the punchline, I'm here to tell you about the journey. I'm doing this storytelling thing all backwards. That's fine.

(05:48):

The thing about this is it's not going to be a talk where I stand up here and I tell you all of the beautiful ways in which this went absolutely flawlessly and things worked wonderfully from the start and we had no problems. That's not what happened, and I would feel disingenuous standing up here telling you it's going to be very easy. It's not. I'm going to talk about some of the core challenges that we had along the way. The thing that I need to convey to you is that, because Delivery Hero is such a large and complex organization, most of the problems that we had weren't technical problems. I'm going to annoy all of my engineering colleagues by saying this, but from a management perspective a technical problem can be solved by applying talent and time, and they're usually fairly predictable.

(06:36):

Either things work or they don't. Project problems cannot be solved in the same way. They tend to reoccur over and over again, and people problems can compound enormously quickly as you're working with large organizations. So I'm going to tell you mostly about the project and the people lessons, and the technical lessons we can absolutely chat about. I'm a huge nerd. I would really enjoy that, but in another conversation. We ran this project in three phases, I would say. We had this project organized from the start. We were sitting there in, I don't know, it was February 2023 or something, February 2024. God, it was only last year, and we had a plan. We were going to fully automate this migration. It was going to be perfect. We were going to take everyone's data and their dashboards and their alerts and we're going to lift them from one system.

(07:21):

We're going to move them to the new system. It's all going to work wonderfully. Developers will love us. It's going to be great. In order to do this, we had to solve two problems. We had to solve the data problem. We had to solve the dashboard and the alert problem. The data problem: it was too expensive for us to do the instrumentation upfront. Instead, what we did is we took the instrumentation that was provided by these vendor-specific SDKs and we converted it from that vendor's proprietary format into OpenTelemetry. We then sort of further convert it from a delta series to a cumulative series, enrich it a little bit in the same way that the previous providers did in their backend, and then we write that into Grafana Cloud, thereby sort of seamlessly migrating everybody from their previous observability provider into Grafana Cloud.
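As a rough illustration of the delta-to-cumulative step described above: a delta series reports only what changed in each interval, while a cumulative series reports a running total, so the conversion has to keep state per series. The sketch below is not Delivery Hero's pipeline or the OpenTelemetry Collector's implementation; the class name `DeltaToCumulative`, the metric name `orders_total`, and the label values are hypothetical, and it ignores resets, out-of-order points, and timestamps.

```python
from collections import defaultdict


class DeltaToCumulative:
    """Keeps a running total per series so delta points become cumulative ones."""

    def __init__(self):
        # One running total per (metric name, label set) pair.
        self._totals = defaultdict(float)

    def add(self, name, labels, delta):
        # Sort labels so the same label set always maps to the same key.
        key = (name, tuple(sorted(labels.items())))
        self._totals[key] += delta
        return self._totals[key]


converter = DeltaToCumulative()
# Three delta points for one hypothetical series: 3, then 5, then 2 new orders.
points = [converter.add("orders_total", {"brand": "example"}, d) for d in (3, 5, 2)]
print(points)  # [3.0, 8.0, 10.0]
```

Keying the state on the full label set matters: two series that share a metric name but differ in any label must accumulate independently.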

(08:10):

This is complex, but definitely doable. What is surprisingly more complex is the conversion of dashboards and alerts into something that is usable in Grafana Cloud. And the reason this is more complex is not only do you have to think about the construction of the data itself, which is fairly predictable, but the dashboards include the data. They include the queries to the way the data is represented in the previous system. They include the widgets that visualize that data in a specific way and that is composited into a dashboard that renders these things in a fairly constructive way. Alerts are less complex, I would say, but you still need to understand how the data was represented in its previous system and then how it can be represented in the same way in Grafana Cloud. So the absolute complexity of this was quite difficult. We managed to do most of it, and I have to say a thank you to professional services at Grafana Cloud.

(09:05):

This was a team we worked with at Grafana in order to make this whole thing a lot more achievable than it would've otherwise been. Here's what this looked like in practice. We had this beautiful timeline. It's a Gantt chart. Yes, I know, I'm management. We had the data conversion work stream and the dashboard conversion work stream running in parallel, and then we'd have this integration period where we'd start using the data conversion and the dashboard conversion to check themselves against each other, and then we'd pass these things on to users, where users would then check these dashboards, and we'd be sort of away and free: things are successfully migrated.

(09:38):

That is not what happened. What happened instead is what you see on your screen. We had these two work streams running. We had the data conversion work stream running and the dashboard conversion work stream running, and they started to run into challenges. We did actually manage to make this all work. Part of the technology that we use to make this all work is part of the OpenTelemetry project, actually built by some people at Grafana, among others. It is the Datadog to OpenTelemetry conversion pipeline. So if you're looking at this yourself and you're like, ah, how do I do this? Then don't worry, a whole bunch of the lessons are already open source.

(10:15):

But the core challenge we had is as we were talking to the teams doing the migration, as we were actually having the discussions, look, are we on track and are you delivering things correctly and sort of is this all going to work? The teams would get back to us and say, of course things are going to be fine. Don't worry about it. We are working and things are happening as they should. We're making progress with the data conversion. Some of the dashboards look really good, everything's going to be fine. And we believed this right up until it was time to start integrating these things against each other where we checked and we started to find that there were problems, there were problems in the data conversion and there were many more problems in the dashboard conversion itself, and it was only at the moment where we started to try and compare these things against each other where we were starting to be in a position where, look, this project is starting to run late and then this project is getting very off track, and then we're in a position where this is getting very uncomfortable and here's where the first lessons start to apply.

(11:11):

This applies if you are running any large technical project. To be clear, this has nothing to do with migration itself as a project. Every large and complex technical project will run into risks, and the most important thing you can do from a project management perspective is to understand if that project is on track. Sounds easy, practically a little bit more complicated. The way in which we do this now is very simple. Essentially we assign someone to figure out if the project is on track. They represent the success of the project and we ask them, are you in red, amber, or green? If they are in green, the project is good. It's going to be delivered on time. Everything is okay. If they are in amber, there are some problems with the project. That's also okay, you believe that you can start to fix this. If you are in red, the project's not going to work.

(11:57):

That's also okay. It is just the case that we need to change something about this project in order to be successful. Now, there's two patterns of projects that are going to fail. A project that is always green is the project that is least likely to be successful. The reason for this is that every technical project runs into problems, and if a project is always green, what has happened is you haven't figured out what the problems are yet, so you need to go digging. The second and much easier case is a project that goes from green to amber to red. Fine. Some things turn out to be more complicated than we expected. You can start to intervene and start to address things. Where a project is off track, you have to change what you expected to deliver within the project and you have to be far more aggressive than might be comfortable.

(12:42):

The thing about running these large projects is as you're moving through the project, something will start to go wrong and the developers will think to themselves, okay, great, this went wrong, but don't worry if everything else goes right, we'll be on time, and then the next thing goes wrong and then they think to themselves, oh God, okay, it's fine. If everything else goes right and then we get a miracle over here, then we'll all be fine. And then the next thing goes wrong and you end up in this position where in order to complete the project, you need a stack of miracles, which is just probabilistically unlikely to happen. So you need to jump in there. You need to change deadlines, you need to change what you're going to deliver. You need to change something in order to make this successful. And we did. We changed what we would deliver.
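The "stack of miracles" point can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each remaining piece of work goes right independently with the same probability; the 80% figure and the function name `chance_everything_goes_right` are made up, not numbers from the project.

```python
def chance_everything_goes_right(p_each, remaining_items):
    """Probability that every remaining item goes right, assuming independence."""
    return p_each ** remaining_items


# With a hypothetical 80% chance per item, the odds shrink quickly as the
# number of things that all have to go right grows.
for n in (1, 3, 5, 10):
    print(n, round(chance_everything_goes_right(0.8, n), 3))
```

Under these assumptions, five items that each have a good chance of working still leave only about a one-in-three chance that all of them do, which is why changing scope or deadlines early tends to beat waiting.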

(13:21):

Essentially, we had to declare a bit of bankruptcy on the dashboards and alerts. Sounds bad, not so bad in the end, I will tell you why, but we delivered as much as we could. We were able to absolutely complete the data conversion stuff. The Datadog to OpenTelemetry conversion works fine. It's running in production for many of these companies around the world today. The dashboard conversion, we got pretty far. We got most of the time series converted. We didn't get the dashboards, the logs or the tracing converted successfully, and we ended up at this point where we started handing these things to users, and a user would get this dashboard that we had completed and they would open this dashboard and they would see some things that were mostly correct. They would start to see their panels. Some of the panels would have information, some of that information would be missing.

(14:07):

There's a range of reasons for these things to be missing. Things like data submitted directly to the API or not integrated against the SDK, or things derived server side in the previous vendor, or things deployed in some sort of strange place that we hadn't discovered yet. This is an amazing project to discover your shadow IT organization, by the way. Or there'd just be query differences between the tooling, minor bugs in the conversion. But users don't care. A user will open the dashboards, they'll see the dashboard, they'll look at it and they'll say, it's broke. And you'll say, yes, but you have to fix it. And they'll say, and here's a ticket. And it's very frustrating. We really tried to coach users on how to make this migration successful. We did these trainings, we did these ask-us-anything sessions. We worked with Grafana to get people around the world to sit in a room with 'em and actually coach them through this stuff.

(14:59):

None of it was super effective. The only thing that was meaningfully effective was in the converted artifacts themselves. In these dashboards, in these alerts and things, we provide the help directly in the dashboard. A user opens their dashboard, they see their half completed dashboard and they see instead of nothing, look, we know this dashboard has problems. Here is a list of the problems that it has. If you have this problem, don't worry, we're going to fix it. Here's the ticket. You can track it over here. If you have this problem and you're under this condition, then you need to check X, Y, and Z. You need to integrate your data source so it's successful and then the dashboard will work correctly. One of the things I was thinking about as I was sitting over there watching the early morning talk, this would've been much more effective if we were able to deploy some sort of magical AI tool in which we could enrich the context to say, look, here's a bunch of things that may be missing.

(15:50):

Here are the things that you need to check. Please go and do a bunch of stuff. But we didn't. Instead, we just kind of handcrafted this help. One of the lessons out of this migration: developers don't use most dashboards. We cut about 80% of the dashboards out of our migration scope, and it was fine. There was a series of core dashboards that everybody checks on a regular basis, but that's about it. And one of the surprising things is the out-of-the-box capabilities from Grafana are really very good. They're able to replace an enormous chunk of what people were doing manually in their previous observability tools. So thank you very much to the App O11y team. That tool has evolved enormously over the last 12 months. Through that, it was much easier for us to make this migration successful. The second thing that happened is we rolled these dashboards, the alerts, everything else out to the users and we said, here is the help.

(16:38):

Here is the training, here is the guidance, but don't worry if you get stuck, you can reach out to us. You can lodge a ticket. We'll make this work. Now, if you offer this to developers, they will. We're talking about thousands of people, not just developers but the rest of the engineering community. And they opened these dashboards and they looked at them and they said, great, we have questions. And we said, no worries, but we were immediately overwhelmed with the sheer amount of support that we had to deal with, and this support varied quite a lot in its quality. Some of it was deeply technical. We had some very smart people who understood the details of the migration and the internals of Mimir and how this transforms, and it was amazing. They were able to point out bugs that we had, but we had a lot more feedback which essentially said, my migration doesn't work.

(17:24):

Please fix it for me. The problem is the amount of tickets that we had just overloaded the support capability that we have. As users started to get no answer for their project, they started to get even less confidence in this migration, and we were in this position where we'd lost the confidence of the community to make this all work. Obviously, this turns out okay, don't worry, I'm getting there. Thank you very much, by the way: we have people at Grafana that we rely on in order to make this work. There's one of them called Seamus, and he does an amazing job of making sense of our entire complexity and bringing us the support that we need. So here we are. We're in the darkest hour of our migration and we're trying to make this whole thing work, and one of the organizations that we were working with turned around to us and said, look, you need this migration to work.

(18:10):

We said yes, and they said, we need this migration to work. And we said yes. And they said, what we can do is we can supply some people from our organization, so people who deeply understand the context of our organization, and we'll tie them together with you and then together you can see what you can do in order to make this migration work. Essentially, we built a kind of tiger team or specialist team for this organization and this migration. This was fundamentally transformative for us. What it was able to do, it was able to take the enormity of feedback, all of this user context and these user problems, and it was able to boil it down to a much smaller set of issues. Additionally, they were able to take the kind of context of that organization and they were able to start to address, just themselves, some of the problems of the migration, so that users would start to see things work and this would start to become more successful. When we were having conversations with users, we were no longer just a central team saying, please, for God's sake, adopt our tooling.

(19:03):

We had a friend, we had an ally within that organization who would be part of that conversation, who was staked on the success of this migration and who was able to overcome some of those less technically deep user transitions or user objections. Special thanks, by the way: this is an internal team. There's a guy called Sebastian, another guy called Gabe. They were the people who put together this team, and this was the redemption arc for us. This was a way in which this migration transformed successfully, and at this point we were pretty much complete. We get to the end. We've converted all the dashboards and alerts, we've converted all the data. We have to actually hand this to users, and then users need to review these artifacts and accept them so that we can sort of switch off their old observability providers, make this whole thing work.

(19:50):

This one is pretty funny. Grafana is an amazing tool for bringing all of the data together in one view. So immediately developers took advantage of this and they took their old observability provider and they took their new observability provider, which is Grafana Cloud, and they put them together in the same graph and they overlaid that data and it looked mostly correct. It looked just slightly different, and the reasons for these slight differences are very legitimate, things like just the way in which rate is calculated is slightly different. There's a small delay in the way in which we convert all of the data. There are very minor differences in these tools, but again, we ended up in this conversation with developers about whether or not this was working and whether or not this would be successful. We had to, again, shift the conversation. We did go really deep on understanding the internals of how Mimir works in order to understand whether or not the problem actually existed or whether or not this was just a case of tools are different because tools are different, and I want to say very much thanks.
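One way to frame the "slightly different" overlay is to ask whether two aligned series agree within a tolerance rather than exactly. The sketch below is an assumption about how such a check could look, not something described in the talk; the function name `max_relative_difference`, the sample values, and the 2% tolerance are all hypothetical, and it assumes both series have already been aligned to the same timestamps.

```python
def max_relative_difference(old_series, new_series):
    """Largest point-by-point relative gap between two aligned series."""
    if len(old_series) != len(new_series):
        raise ValueError("series must be aligned to the same timestamps")
    worst = 0.0
    for old, new in zip(old_series, new_series):
        scale = max(abs(old), abs(new))
        if scale > 0:  # skip points where both values are zero
            worst = max(worst, abs(old - new) / scale)
    return worst


# Hypothetical request rates from the old tool and from the new tool.
old_tool = [100.0, 110.0, 120.0]
new_tool = [101.0, 109.0, 121.0]
TOLERANCE = 0.02  # accept up to 2% disagreement per point (illustrative)
print(max_relative_difference(old_tool, new_tool) <= TOLERANCE)  # True
```

A check like this only answers whether the numbers are close; it does not answer the question that matters more in practice, which is whether a team can still debug production with the new tool.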

(20:47):

That goes to the Mimir engineering team, some of whom we spoke to at length about a bunch of these topics. But we were asking users the wrong question. The question that we'd asked users when we started this migration is, look, let's take your dashboards, your alerts, everything else that you were using in your previous tool, and let's put them in a new tool, and can you compare them and see if they look the same? Obviously they don't. They are in fact different tools, but the core question you want to ask if you're doing these migrations is not, do these tools look exactly the same, but instead, can you debug your production applications? Can you reason through the failure states of what you're talking about? Can you identify a bad release? Can you understand when there's a memory leak? Can you understand when your error rate spikes? Those are the questions that actually matter in order for you to be able to reason through your diagnostic tooling.

(21:34):

Migrating the dashboards, migrating the alerts is interesting, but it is not the measure of whether or not these migrations actually work. Only once we'd anchored on this conversation were we able to make progress and say, look, as long as you can debug your production systems, you're good. Things are working well enough, as they should, and that was it. We're done. As I mentioned, we have successfully migrated these people, these 3,100 engineers, this multi-billion dollar business that delivers a bunch of food around the world, into Grafana Cloud well enough that we still deliver a whole bunch of food, but it was a journey. It was a very painful journey in which we had to learn some of these core lessons the hard way, and I think the obvious question that you should be asking yourself at the moment, if you are doing the same thing, if you're looking at this journey, is: if we were to do the same thing, would we do it again?

(22:27):

I mean, it would be weird if I said no. It would be odd if I stood up here at ObservabilityCon, I was about to say GrafanaCon and said, no, it was not worth it. Of course, it was worth it. It was absolutely worth it. It was difficult. There were some challenges along the way, but we are overwhelmingly in a better position than we were at the start of this project. What I'm trying to do is convey to you the core challenges that we had along the way so that if you're looking at the same project from your own perspective, you can make this a lot less painful than we had ourselves.

(22:59):

We were able to deliver on the project goals. We got all of these entities in the one place. All of these different brands can now talk to each other about their production systems, and we're actually having people learn and develop these new dashboards and shared alerts and automations that we're sharing across the company. We're compounding value in a way that was just not possible before. We were able to get all 3,100 people so far (there's about 5,000 total) into Grafana Cloud. We have OpenTelemetry everywhere. If anybody has any doubts about the fidelity of the OpenTelemetry project, we are using it to ship millions or hundreds of millions of time series, 20 petabytes of logs, 20 petabytes of traces, another two petabytes of logs. It works fine. It's just technology. It's good, and obviously we were able to very meaningfully reduce our cost, which makes the financial people very happy.

(23:51):

We also found a bunch of things that we just frankly hadn't expected. Grafana Cloud is evolving just extremely quickly. The last 12 months have been transformative for what the company is doing, and I'm really looking forward to seeing what they do over the next 12. We saw App O11y, which has become kind of the central thing that developers will use to understand most of their production systems. Drilldown, for the things that they don't have dashboards for, or for the metrics that they've kind of produced but forgotten about, or things that are just outside the standard flow. Cost attribution makes my life so much easier from a central perspective, to be able to say, dear team, you have spent a lot of money on logs, please pay for it or stop doing it. I don't mind which, but I'm able to now say, engineering manager, you need to address this.

(24:37):

And Adaptive Telemetry is really useful, especially for intervening where a developer makes a very well-intentioned but very expensive decision to suddenly publish all of their debug logs in a production system or suddenly multiply their time series by a thousand or something equal, which made it very difficult to control costs on the other providers. Grafana themselves have been really, really very collaborative. We do a lot of work on product development. Obviously I'm standing here in front of you, which is an expression of that collaboration, and their support team is just excellent, especially given the amount of stress and complexity in this project. So would I do it again? Absolutely, yes. However, there are two things that drove the fundamental complexity of this migration. The first is the deep embedding of SDKs of the previous observability provider in our production systems. This is absolutely what required those initial translation work streams, what introduced the risk to the project, and I can't stress this enough: if you're looking at the move to OpenTelemetry, start looking at it now.
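To see how a single change can "multiply their time series by a thousand", it helps to recall that the number of series behind one metric is bounded by the product of the distinct values of each of its labels. The sketch below is illustrative only; the label names and cardinalities are hypothetical, and the function `max_series_count` gives an upper bound that assumes every label combination actually occurs.

```python
from math import prod


def max_series_count(label_cardinalities):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical metric with two modest labels.
before = {"service": 50, "status_code": 5}
# The same metric after someone adds a label with 1,000 distinct values.
after = {**before, "user_id": 1000}

print(max_series_count(before))  # 250
print(max_series_count(after))   # 250000
```

One high-cardinality label is enough to turn a few hundred series into a few hundred thousand, which is the kind of jump that per-team cost attribution and aggregation rules are meant to catch early.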

(25:42):

Start looking at it directly. The technology's fine. It works with all your observability providers. We deployed it with Datadog and New Relic and Prometheus and everything else. It's good. You can start using it directly, and regardless of if you're thinking about doing this migration or not, it will save you enormous pain in the future. The second thing is an absolute nod to my software engineering forebears. If you're looking at this project and you're a software engineer of some years, the first thing that should have stood out is we had a large requirements period, and then we had a fixed deadline after 12 or 18 months, and then we had these work streams and a Gantt chart and everything else, and if this looks exactly like a waterfall project, that's because it's exactly like a waterfall project. Our initial assumptions about the migration were not correct. Developers didn't actually need most of their dashboards.

(26:29):

A lot of the data that we thought they fundamentally required, either they didn't, or it needed to be transformed in a different way. If we'd done this migration with just a small section of the business end to end, we would've been able to learn a whole bunch of these lessons a whole bunch more cheaply than by migrating the whole organization at once, and this would've made the whole experience far less painful, and that's it. Migration success, eventually. I want to thank a few people at Grafana who helped make all of this possible. We have Linda, we have Stefan, we have Dee, and the team here have been just transformative, just amazing in their ability to help us overcome some of these core challenges. If you have any questions for me, I encourage you to come find me afterwards. Unfortunately, I don't have time to do it now, but I will hand across to our lovely MC and leave you all to engage in the next talk. Hopefully this was useful for some of you. Thank you very much.

Speakers

Andy Howden
Engineering Director — Delivery Hero