Why Clearco switched to Grafana Alerting, Grafana OnCall, and Grafana Incident
Working with technology means dealing with incidents or outages from time to time, so staying on top of problems is essential. Back in the spring of 2022, Clearco, the world’s largest e-commerce investor, had an alerting system set up to catch issues, but there was one problem: Clearco’s Customer Success team would learn of an issue before a notification even went off.
“As a growing technology company, outages are all too common in the industry,” says Clearco Senior Infrastructure Engineer Paul Shahid, who is part of the observability team. “Incidents should be caught early and with something that we can measure, which directly impacts our customers.”
For Clearco, the solution was Grafana Cloud. The company had already been using Grafana for dashboards, but the team realized Grafana Cloud could also give them a new way to handle alerting and incident management thanks to the Grafana Cloud Incident Response and Management (IRM) suite, which includes Grafana Alerting, Grafana OnCall, and Grafana Incident. Beginning in late summer 2022, Shahid and his team helped orchestrate Clearco’s migration, making the company one of the first Grafana Cloud customers to use the IRM suite. Six months later, the entire 40-person engineering team has logged into Grafana Cloud, and about 50% of them are using features such as Alerting, OnCall, and metrics.
In the ObservabilityCON 2022 session “Incident response made easier with Grafana Alerting, OnCall, and Incident,” which is available on demand, Clearco shared the success its team has had with the Grafana Cloud IRM suite so far. We were interested in hearing more about the switch, so we chatted with Shahid about his decision-making process and the benefits of the IRM tools. Here are some highlights from our conversation.
This interview has been edited for length and clarity.
You joined Clearco in March 2022. What was the state of the company’s observability at the time?
The company was using Grafana’s open source dashboarding platform, but not for alerting. We were in early stages of discussions with Grafana Labs, and the mandate I was given was to make sure Grafana is the right tool or come up with alternatives. The more I looked into it, the more I was like, “Yeah, absolutely.” Grafana does a fantastic job — especially with Grafana Incident and Grafana OnCall — of synthesizing the information you need in one place.
Why did you decide to switch to Grafana Cloud?
Originally, we used Grafana OSS for dashboarding and graphing, so we wanted to use Grafana Cloud for the same reasons, but as a single source of truth. In addition to those features, we could also take advantage of Grafana Cloud enterprise offerings, like Alerting and OnCall. Then we started talking with folks at Grafana Labs about OnCall and Incident, and it became more apparent that it was the right solution for us based on these built-in features. On top of that, it made sense in terms of effort and complexity: Moving to Grafana Cloud allowed us to direct our effort at getting more value from our metrics.
How have the past few months been, using only Grafana Cloud?
We’ve had to use Grafana Incident a couple of times, but it happens in technology. We like it and have gotten a lot of good feedback about it from our engineering team. We’ve added to our automated alerting through OnCall and added to our incident response process through Incident, and we’re looking to do more. Things are still being caught without automation in different parts of the business. It’s something we’re always improving on and getting better at. OnCall and Grafana Alerting are going to help with that. Also, we shouldn’t be alerting too much, as that causes alert fatigue and hits to developer satisfaction, and we shouldn’t be waking people up when we don’t need to. Grafana has a lot of different features that are going to help us avoid doing those things.
How have you evolved your alerting thus far?
One of the biggest things we’re able to do is synthesize logs and metrics together via Kubernetes labels. It’s a small feature, but when you’re in the Explore tab, you can split out two different views of logs and metrics, then sync them up on a timeline. That’s a necessary feature for us, and it helps with debugging. The other big thing is team-based alert routing via labels and Kubernetes. We’ve labeled a bunch of our microservices with team names and with service names, and then we’ve transformed that into an actual label in our metrics backend. That way, you can literally query on any metric and route alerts to a particular team. That’s a killer feature. I don’t know of any other tool that can route alerts based on Kubernetes labels from a single unified platform – not even Grafana’s competitors – so that’s pretty awesome.
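To make the team-based routing idea concrete, here is a minimal Python sketch of the concept, not Clearco’s setup and not Grafana’s API. In Grafana Alerting, this kind of routing is configured through notification policies that match on labels; the code below only illustrates the underlying lookup. The label key `team`, the team names, and the contact point names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from the value of a "team" label to a contact point.
# None of these names come from Clearco or Grafana; they are illustrative only.
TEAM_CONTACT_POINTS = {
    "payments": "oncall-payments",
    "platform": "oncall-platform",
}

# Hypothetical fallback for alerts that carry no recognized team label.
DEFAULT_CONTACT_POINT = "oncall-catchall"


@dataclass
class Alert:
    """A firing alert with the labels inherited from its metric."""
    name: str
    labels: dict = field(default_factory=dict)


def route_alert(alert, routes=TEAM_CONTACT_POINTS, default=DEFAULT_CONTACT_POINT):
    """Pick a contact point from the alert's "team" label, else the default."""
    team = alert.labels.get("team")
    return routes.get(team, default)


# An alert on a metric labeled with a team goes to that team's rotation.
labeled = Alert("HighErrorRate", {"team": "payments", "service": "checkout"})
print(route_alert(labeled))

# An alert with no team label falls through to the catch-all rotation.
unlabeled = Alert("DiskAlmostFull", {"service": "batch-worker"})
print(route_alert(unlabeled))
```

Because the team label travels with every metric, any new alert rule gets routed without extra per-rule configuration; only the label-to-team mapping needs to be maintained.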
When it comes to incident response time, your postmortem time, and your overall performance metrics, are those better now?
For sure. I think the thing that was unexpected is that Grafana Incident makes the incident response process very clear. Before we had Grafana Cloud and before we used Grafana Incident, it was unclear what had to happen when someone noticed that there was an actual problem. Now what typically happens is that an outage or an incident starts to happen, and then right away someone goes and creates a Grafana Incident page. That creates a Slack channel and a Google Meet, and it automatically generates a postmortem. It makes it clear what the process is, and it’s very clear where to go and what to do. It cuts down on confusion during a time when stress levels can be high, and it gives everyone a clear, common process to gather information while working towards a resolution.