Faster incident response through distributed tracing: Inside Glovo's use of Traces Drilldown
It’s almost 1 p.m. on a Monday afternoon and you’re hungry. You pull up your meal delivery app and select your favorite restaurant and dish. Then you go to check out and nothing happens.
Your frustration mounts as you get hungrier by the minute. But there’s frustration on the other side of that transaction as well—engineers are scrambling to figure out what’s wrong as orders drop and revenue losses rise.
This is the type of scenario you’re trying to avoid if you’re on the SRE team at Glovo, a subsidiary of Delivery Hero and an on-demand food and grocery delivery platform operating in 23 countries across Europe, Africa, and Asia.
To get to the bottom of a recent incident and prevent it from happening again, Glovo turned to Grafana Traces Drilldown (previously Explore Traces), an application that lets you quickly investigate and visualize your tracing data through a simplified, queryless experience. And in a GrafanaCON 2025 talk, Staff Software Engineer Deepika Muthudeenathayalan and Senior Software Engineer Alex Simion shared how they used the app to help find the root cause of an incident with multiple, cascading failures.
You can check out the full video below, or keep reading to learn more.
The role of tracing, and why metrics and logs only go so far
Before we dig into Glovo’s specific use case, let’s walk through a hypothetical scenario to illustrate the importance of traces.
It’s the middle of the night and you just got an alert that your users can’t check out. First, you check your metrics, which likely triggered the alert. Metrics tell you something happened, but they probably can’t tell you what happened. Next, you check your logs, which can tell you what happened—maybe an exception was thrown?—but finding that needle in the haystack can be challenging. To complicate things further, your team is using microservices, and a single backend service is just one piece of a larger puzzle.
This is where traces come in.
What is tracing?
Tracing is essentially logging enriched with context that is shared across HTTP boundaries, which allows you to see how a request flows through the various components of your system.
Taking a step back, a trace represents an entire request across multiple services and visualizes its flow from start to finish so you can identify bottlenecks and dependencies. Traces are often high volume and high cardinality, and they can contain logs or events as well. A span represents a single operation within the request flow and records where it occurred. Spans usually include timing, success or failure data, and relevant tags or metadata, which we call attributes. Attributes are basically key-value pairs.
Spans can also contain events that record significant occurrences, such as an exception, or they can contain links to other spans or other traces, which can be helpful when you have really large traces.
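To make those concepts concrete, here's a minimal sketch using the OpenTelemetry Python SDK. The service, span, attribute, and event names are illustrative, not Glovo's actual instrumentation:

```python
# Minimal tracing sketch with the OpenTelemetry Python SDK.
# Names here are illustrative, not Glovo's real instrumentation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Set up a tracer that prints spans to the console
# (in practice you'd swap in an OTLP exporter pointing at your tracing backend).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("checkout") as span:
    # Attributes are key-value pairs attached to the span.
    span.set_attribute("http.request.method", "POST")
    span.set_attribute("app.cart.items", 3)

    # Events record significant occurrences inside the span.
    span.add_event("payment.authorized", {"app.payment.provider": "example-pay"})

    try:
        raise TimeoutError("product catalog did not respond")
    except TimeoutError as exc:
        # Exceptions can be recorded as span events, and the span marked as failed.
        span.record_exception(exc)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
```

In a distributed system, that same trace context would be propagated across HTTP boundaries, so the spans emitted by downstream services land in the same trace.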
Traces provide a structure to visualize what would otherwise be a pretty chaotic environment. This can be incredibly useful in distributed systems, where you need to pinpoint errors in the context of a larger system.
However, this doesn’t necessarily mean that one telemetry signal is better than another. They really work best together: Metrics tell us that something might be happening; traces tell us where it’s happening; logs tell us what is happening.
And that combined power is why we built Traces Drilldown. It takes away the complexity of digging through each signal and writing queries. Instead, you have a point-and-click experience that lets you get value from your traces faster.
How Glovo got to the root cause of an incident with Traces Drilldown
One of Glovo’s most important metrics tracks orders created. The volume changes throughout the day, with more orders around lunch and dinner times. And as you can imagine, incidents that occur during those busier times have an outsized impact. So going back to our example from the beginning, this is not what you want to see as you close in on 1 p.m.:

“As SREs, we are kind of in the dark when such incidents happen because any part of our distributed system could fail causing the order loss,” Deepika said. “So we built many custom dashboards and metrics in Grafana, which helps us during such incidents.”
In this example, engineers dug deeper into a RED metrics dashboard and learned it was a downstream dependency that failed because it was rate limiting checkouts. From there, they learned that it was tied to a code change and quickly rolled back the deployment. In under five minutes, they were back to normal.
But an SRE’s job doesn’t end there. Next, they needed to find the root cause and make sure it didn’t happen again.
What followed was a lengthy process of searching for clues, with some false starts along the way. First, they noticed a spike in application errors. They also noticed high checkout latency, which was surprising given the rate limiting. Next, they went to their logs, and after a bit more digging, they suspected the issue was tied to a database.
But after checking the database in question, they saw it was performing normally, which meant the database issue was the effect, not the cause. From there, they tried checking their traces, but that didn’t help either.
Getting answers with Traces Drilldown
With few options left, they turned to Traces Drilldown, which, at the time, was a new feature they had early access to. They ultimately opted to work backward from the product-catalog API traces, which was as easy as selecting the service name and filtering from there. And since they wanted to focus their search on rejections from the product catalog, they also filtered by HTTP status code.
“Drilldown shows you very nicely that you can select one of the status codes that appears in the spans,” Alex said. “The next step was very useful because we didn’t know exactly what attributes to look for, so the comparison tab gives you insight on what attributes are having different values from the baseline.”
Next, they identified a client with much higher values than the rest, and filtered by that as well.

“Here we knew that this was a cascading failure. We knew that the database issue was caused by something else, so we wanted to check some exemplars from the beginning of the incident to see what caused this. So that’s exactly what we did here. We checked one of the first seconds of the incident, opened a trace, and looked inside the trace.”
From there, they were able to go deeper, find longer-running spans tied to request times, and start looking at how it all connected to their database issue. Ultimately, this led them to quickly discover the root cause: the checkout API was calling the product catalog, which was rate limiting it. Those calls were retried with long wait times between retries, and all the while the service was holding both database connections from the thread pool and new connections to the checkout API. As a result, new requests could not be processed because those connections were being held.
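To illustrate that failure mode, here's a hypothetical sketch (not Glovo's actual code, and the pool size, retry count, and function names are invented): when a request holds a pooled database connection while it retries a rate-limited downstream call with long waits, the pool drains and new requests stall behind it.

```python
# Hypothetical sketch of the failure mode: holding a pooled DB connection
# while retrying a rate-limited downstream call starves other requests.
import queue
import time

DB_POOL = queue.Queue(maxsize=5)          # small, fixed-size connection pool
for i in range(5):
    DB_POOL.put(f"db-conn-{i}")

def call_product_catalog():
    """Stand-in for the downstream call; pretend it is being rate limited."""
    raise RuntimeError("429 Too Many Requests")

def handle_checkout():
    conn = DB_POOL.get()                  # connection is checked out here...
    try:
        for attempt in range(3):
            try:
                return call_product_catalog()
            except RuntimeError:
                time.sleep(2 ** attempt)  # ...and held through every long retry wait
        return "failed"
    finally:
        DB_POOL.put(conn)                 # only released after all retries finish

# A safer shape is to avoid holding the connection across the retry loop and to
# cap total retry time, so the pool keeps serving new requests during an outage.
```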
Thanks to Traces Drilldown, they were able to resolve the issue and make sure it didn’t cause more problems with customer orders in the future.
And today, Traces Drilldown is part of the SRE team’s workflows as they look for errors and latency issues.
“I mainly use the error panel because it’s very helpful to pick exemplars,” Deepika said. “If you have a deployment, you can pick from a particular version of the service. If one of your pods is crashing because of memory or something, you can pick traces from those pods. And there are various resources and span attributes you can use.”
Check out the full video to learn more, including how OpenTelemetry and Traces Drilldown are making Glovo rethink their tracing policies.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!