How a cooking platform whipped up a new observability plan with Grafana Cloud
As any good cook knows, if you want to create a top-notch dish, you have to use the best ingredients. So when the engineering team for Cookidoo (an online platform and app that features more than 80,000 guided recipes for the Thermomix, an all-in-one small kitchen appliance) realized the observability tool they were using to monitor the platform wasn’t delivering what they needed, they decided to switch to Grafana Cloud and OpenTelemetry.
The German company Vorwerk, which makes the Thermomix, serves up Cookidoo to more than 5 million paying subscribers and around 10 million connected devices worldwide. “You can imagine Cookidoo as something like a streaming service or Netflix or Spotify, but for recipes,” Dennis Rippinger, a principal software engineer at Vorwerk, explained in a recent ObservabilityCON On the Road talk. Each day, Cookidoo developers come up with about 3,000 commits, 1,600 pipelines, and around 60 deployments, and they’re working with an infrastructure made up of 16 Kubernetes clusters, 600 EC2 instances, 800 TB of data, and 1,600 Lambdas.
In the presentation, Rippinger and his colleague Tim Schrumpf, a senior site reliability engineer, discussed the tasty payoffs that came with their migration to Grafana Cloud, including automated dashboards, improved incident response, a reduction in costs, and the ability to make the most of business metrics.
Note: Vorwerk’s Cookidoo session from ObservabilityCON on the Road is now available to watch on demand. You can check out the full session on YouTube.
A simmering problem
Cookidoo’s relationship with Grafana dates back to 2015, when the team used Grafana OSS to monitor AWS.
Although they moved to Instana for almost all of their needs, they were still able to integrate Instana data into Grafana dashboards.
With Instana, Vorwerk had what Rippinger called “complicated and fancy” charts to use in sales pitches, but he noted they actually weren’t very helpful. He said it was also difficult to integrate Instana with other tools they were using from different vendors and do things like have alerts displayed somewhere else, or get metrics from GitLab.
Although some people at the company loved Instana’s fixed dashboards for observability, Rippinger said they weren’t always ideal. “Sometimes, if you are really in a crisis situation, it can get difficult.” The tipping point, he said, came in 2021, when the team had to spend several hours addressing a Priority 1 incident, which kicked off with an error message about a lack of connection to the database.
The data they needed was visible, he said, but not easy to find. “The actual problem was hidden somewhere below, in a perspective of a perspective of a perspective,” he recalled. “It took some time to get there and inspect that to see that the problem was actually something else.”
Adding Grafana OSS to the mix
In 2022, the SRE team began to look for a tool to replace Instana. Although OpenTelemetry was still in its early stages, they thought it looked promising and felt that open source was the way to go. After all, when it comes to documentation, he said, “No vendor can produce [more] information for you and for your developers on how to solve things than the open source community.”
Over about six months in 2023, Vorwerk began transitioning more than a dozen teams to Grafana OSS. They started with their platform teams, getting the Kubernetes integrations and all the default logging needs taken care of first. The early adopters came next, and Schrumpf said they did hands-on workshops and tried to figure out things such as how to get the most out of their metrics and which exporters they could use for Java or Python.
“That also helped them to create templates for the other teams, which were not so eager to migrate,” he said. The team members who were already familiar with Grafana and knew how to write PromQL, meanwhile, were saying, “Give it to me right now. I need it!”
Out of consideration for their engineers’ time, the Vorwerk team then decided to migrate to Grafana Cloud. “We were getting to be experts in maintaining a Loki stack,” Schrumpf said, “but we shouldn’t be experts in maintaining Loki stack — we should be experts in providing recipes to customers.”
One year in, Vorwerk had about 350 dashboards, 600 alarms, 2.8 million metrics, 7.07 TiB of logs per day, and 9.90 TiB of traces per day.
“And now our observability team can help developer teams to get the most out of Grafana Cloud instead of maintaining the Loki Stack,” Schrumpf said.
Making the most delectable Grafana dashboards
To help Cookidoo’s developers, Rippinger’s team came up with a set of default dashboards. They started with RED (rate, errors, duration) metrics, and also created a standardized way to know where in the world information was coming from. For instance, if the error log line is coming from Australia, the developers will see a kangaroo next to it.
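As a rough illustration of what a RED panel summarizes, the sketch below computes a request rate, an error ratio, and a 95th-percentile duration from a list of request records. The `Request` type, its field names, and the "5xx counts as an error" rule are assumptions made for this example, not Vorwerk's dashboards; in Grafana these numbers would normally come from queries over instrumented metrics rather than from application code.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Request:
    """One handled request (hypothetical shape, for illustration only)."""
    status: int        # HTTP status code
    duration_s: float  # time spent handling the request, in seconds


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize rate, errors, and duration for one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # quantiles() needs at least two points; fall back to the single value.
    if len(durations) >= 2:
        p95 = quantiles(durations, n=100, method="inclusive")[94]
    else:
        p95 = durations[0]
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_duration_s": p95,
    }
```

Keeping all three numbers side by side is the point of the default dashboard: a healthy rate with a rising error ratio or a growing p95 tells a different story than any one of them alone.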
Baking in business data
Rippinger recommends including business data in your metrics, but noted it isn’t something you simply get with an “automatic magical augmentation.”
In his use case, he works on a component that sells subscriptions. “We are making a sale or we are not making a sale. This is something that you can easily count. When you start to instrument this and make your dashboards, you start to get a better understanding over time of your domain: How much are we selling? In what countries? — something like that.”
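To make the "sale or no sale" idea concrete, here is a minimal sketch of a counter labelled by country and outcome. `SalesMetrics` is a hypothetical in-memory stand-in written for this example; a real service would more likely record the same labels through a metrics library, and the label names here are assumptions.

```python
from collections import Counter


class SalesMetrics:
    """In-memory stand-in for a labelled counter (hypothetical helper).

    It only shows which labels might be worth recording; it does not
    export anything to a metrics backend.
    """

    def __init__(self) -> None:
        self._counts: Counter[tuple[str, str]] = Counter()

    def record_attempt(self, country: str, sold: bool) -> None:
        """Count one subscription attempt, labelled by country and outcome."""
        outcome = "sale" if sold else "no_sale"
        self._counts[(country, outcome)] += 1

    def sales_by_country(self) -> dict[str, int]:
        """Answer 'how much are we selling, and in what countries?'"""
        return {
            country: n
            for (country, outcome), n in self._counts.items()
            if outcome == "sale"
        }

    def conversion(self, country: str) -> float:
        """Share of attempts in a country that ended in a sale."""
        sold = self._counts[(country, "sale")]
        total = sold + self._counts[(country, "no_sale")]
        return sold / total if total else 0.0
```

Counting both outcomes, not just sales, is what lets a dashboard show a conversion ratio per country instead of a raw total.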
The dashboard can be tweaked constantly, he explained, and it also makes it easier to see when something is just a minuscule case that happens rarely.
Another time when having the business data can be helpful is when you’re deploying code. The metrics may indicate traffic is good and everything looks great, but if you see sales are slow, for example, “it tells you something is really wrong,” Rippinger said.
He also discussed the common scenario in which something isn’t working for managers in one country, and calls start being made. Eventually, an engineer ends up on an incident call, and having business dashboards makes it easy to see what’s actually going on and whether or not it might be a local incident.
Layering in traces
The Cookidoo team is being encouraged to include traces in their dashboards. Each week, one developer acts as a sheriff for everything that runs with operations, and having traces invites them to look into and identify issues, such as why a call is slow.
“I’ve had this before, where we’re sending Kinesis events for some reason,” Rippinger recalled. “Sometimes Kinesis takes two seconds to initialize, but this is nothing that a customer needs to be waiting for.” They realized it could be an asynchronous task and turned it into one, an idea that only came out of looking at the traces.
“Working with traces came really handy over time, and we also learned that in the end, it’s just a string and some kind of type you can put in there. If you have a scenario where you have multiple teams working on this, they can come to the conclusion that they want to add some information to their trace — the same semantic — but maybe find out a different name,” Rippinger said.
Within tracing, there are different types that are important, he added. For instance, “true” can be a string true, or it can be the type true. To address this, they started creating documentation that compiles itself into a Java library. “That allows you to come up with standardized names also for tracing, so you can make a TraceQL query and let the auto completion do its thing. Everybody knows this is the key to search for something super specific.”
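The team's approach compiles documentation into a Java library. The same idea can be sketched in a few lines of Python: define each attribute name once, together with its agreed type, and reject values of the wrong type. The key names and the `set_attribute` helper below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AttributeKey:
    """A shared, documented span attribute name with one agreed type."""
    name: str
    value_type: type
    description: str


# Hypothetical keys; the names are illustrative, not Vorwerk's conventions.
SUBSCRIPTION_TRIAL = AttributeKey(
    "subscription.trial", bool, "True when the purchase starts a free trial."
)
CUSTOMER_COUNTRY = AttributeKey(
    "customer.country", str, "ISO 3166-1 alpha-2 country code."
)


def set_attribute(attributes: dict[str, Any], key: AttributeKey, value: Any) -> None:
    """Store a value under the shared name, rejecting the wrong type."""
    if type(value) is not key.value_type:
        raise TypeError(
            f"{key.name} expects {key.value_type.__name__}, "
            f"got {type(value).__name__}"
        )
    attributes[key.name] = value
```

With this in place, `set_attribute(attrs, SUBSCRIPTION_TRIAL, "true")` fails immediately instead of silently producing a string that a boolean query would never match.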
He also highlighted other key features of their setup:
- Logging markers: Teams are encouraged to use these to know how many logs they have and the appropriate number of errors to expect.
- Clocks: Because Cookidoo has team members distributed around the world, it’s helpful to have clocks in their dashboards so they better understand if it’s a major issue at a peak usage time or if it’s occurring in the middle of the night with limited impact.
- Aggregate error markers: The team uses bar graphs and a simple Loki rule to understand whether errors are isolated, recur regularly over time, or point to a larger issue that’s worth investigating.
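The aggregate error view can be approximated outside Grafana as well. The sketch below buckets error timestamps into fixed windows (the bars of a bar graph) and applies a very rough classification. The bucket size, the thresholds, and the three labels are assumptions for illustration; this is not the team's Loki rule.

```python
from collections import Counter


def bucket_errors(timestamps_s: list[float], bucket_s: int = 300) -> dict[int, int]:
    """Count errors per fixed-size time bucket, keyed by bucket start time."""
    counts = Counter(int(t // bucket_s) * bucket_s for t in timestamps_s)
    return dict(sorted(counts.items()))


def classify(buckets: dict[int, int], total_buckets: int, spike_threshold: int) -> str:
    """Rough heuristic over the buckets; all thresholds are assumptions."""
    if not buckets:
        return "none"
    if max(buckets.values()) >= spike_threshold:
        return "larger issue"      # one bar towers over the rest
    if len(buckets) >= total_buckets / 2:
        return "regular"           # errors show up in most windows
    return "isolated"              # a few scattered errors
```

The useful part is the bucketing: once errors are counted per window, the difference between a steady background level and a sudden spike is visible at a glance.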
The icing on the cake: value optimizations
Having a lot of data is great, but not when your company is paying per metric. Vorwerk was able to manage its costs thanks to Adaptive Metrics in Grafana Cloud.
Schrumpf referenced an active metric series dashboard from 2023, and explained how in the beginning, they enabled more clusters and more integrations and had around 2.4 million active series. At one point, he spent an hour or two clicking to enable Adaptive Metrics, and he significantly reduced the company’s metrics — without anybody noticing.
With Adaptive Metrics, they don’t need to talk to every team and say, “Do we really need that detail level or is that okay? You don’t use it in your dashboard,” he explained. “Now we could just take that off the team’s shoulders and do it in a central place.”
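The general idea behind this kind of reduction is aggregation: if nobody queries a label, series that differ only in that label can be combined into one. The sketch below shows that idea on a toy data set. It is not Grafana's implementation, the label names are hypothetical, and summing is only appropriate for metrics where a sum is meaningful, such as counters.

```python
from collections import defaultdict

# A series is identified by its sorted (label name, label value) pairs.
Labels = tuple[tuple[str, str], ...]


def aggregate_away(series: dict[Labels, float], drop: set[str]) -> dict[Labels, float]:
    """Sum series that become identical once the `drop` labels are removed."""
    out: dict[Labels, float] = defaultdict(float)
    for labels, value in series.items():
        kept = tuple((k, v) for k, v in labels if k not in drop)
        out[kept] += value
    return dict(out)
```

Dropping a high-cardinality label such as a per-pod identifier can collapse many series into a handful, which is why the reduction can go unnoticed by teams whose dashboards never used that label.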
At the beginning of 2024, there was no Adaptive Logs, so they had to build something themselves. “We started with a showback, of course, and created another Grafana dashboard to see which service is sending how much logs,” he recalled. With that new information, they could talk to a team about why, perhaps, they had double the logs of another team, and recommend turning off debug logging in production. “That helped to have at least a 40% reduction of log volume,” Schrumpf said. “I’m interested to see what we can do with Adaptive Logs in the future.”
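A showback of this kind boils down to summing log volume per service and ranking the result. The sketch below assumes the input is a list of `(service, size_bytes)` pairs, which is a hypothetical shape chosen for the example; the team built theirs as a Grafana dashboard, not as application code.

```python
from collections import defaultdict


def log_volume_by_service(records: list[tuple[str, int]]) -> list[tuple[str, int, float]]:
    """Return (service, bytes, share of total), largest sender first."""
    totals: dict[str, int] = defaultdict(int)
    for service, size_bytes in records:
        totals[service] += size_bytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(svc, size, size / grand_total) for svc, size in ranked]
```

Showing each service's share of the total, not just raw bytes, makes the conversation easier: a team sending twice the volume of a comparable service stands out immediately.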