Pro Tips: How to Decrease MTTR and Increase Uptime with Grafana and VictorOps

Published: 2 Jul 2019 by Michelle Tan

We can sift through oceans of data. Alert on predetermined parameters. Deliver multiple commits a day.

But as organizations leverage these layered, complex monitoring systems, “we also have to start practicing observability to enrich the actions that we take to solve problems as they occur and drive continual improvement,” said VictorOps Product Marketing Manager Melanie Postma.

VictorOps is one tool that can help accomplish that. Acquired by Splunk in 2018, VictorOps is an automated alerting system that routes each alert to the right person, which reduces alert fatigue.

At GrafanaCon 2019, Postma outlined four steps to decreasing MTTR and increasing uptime using VictorOps and Grafana.

Utilize Robust Monitoring Solutions

It’s great to have a good baseline of information and understand what your organization’s infrastructure looks like when everything is green, said Postma. “However, we’re human and we don’t catch everything,” she said. “It’s impossible to predict every single negative impact to your infrastructure … especially as [teams] commit multiple times per day.”

When problems do arise, “alerts are really only useful if they’re getting to the right person at the right time,” said Postma. “They can’t just die in an email inbox.”

Enter VictorOps, which can be used to direct alerts to the person who is on call or the expert who can step in and address an incident. “We’ve seen this in action with many customers,” she pointed out. “The latest of which is [credit union service] PSCU. They actually reduced their MTTR from four hours to two minutes and gained a ton of accountability.”

Understand Impact of Deploys

“A deploy is really understanding what’s going on in that incident,” said Postma. In other words, “don’t just set it and forget it.”

As organizations move faster than ever, “testing in production is normal now,” she said. As a result, “we may be causing these alerts to fire.”

“We really have to observe how [deploys] impact our infrastructure and then get those alerts to the right person at the right time to reduce MTTR,” said Postma.

With VictorOps, engineers can observe what deploys they and their peers have worked on that may have triggered an alert. “You have a little bit more context and data to really get to the bottom of it quickly,” said Postma.

Providing More Context

When alerts inevitably happen, the goal is to provide the most content and context to first responders so that they can act fast.

“One massive way to reduce mean time to mobilization and mean time to resolution is by providing the most context possible,” said Postma.

“VictorOps ingests alerts from Grafana, but it also allows you to append Grafana graphs to specific alerts,” said Postma. “So at 3 a.m. you can quickly glance at the metrics and get to work.”

Attaching references such as runbooks, annotations, and Jira tickets to the alert also positions those on call to troubleshoot successfully.

“Whether you receive an alert that you remember troubleshooting three months ago but you can’t remember exactly what you did, or you’re a first-time on-call user in your new company, having runbooks, annotations, and Grafana graphs helps reduce mean time to mobilization drastically and allows you to work together to come to a resolution much more quickly.”
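To make the idea of a context-rich alert concrete, here is a minimal sketch of an alert body that carries a runbook link and a Grafana graph alongside the message. The `build_alert_payload` helper is hypothetical, the URLs are placeholders, and the field names (including the `vo_annotate.*` keys) are assumptions modeled on the VictorOps generic REST integration; check them against the current integration documentation before relying on them.

```python
import json
from typing import Optional


def build_alert_payload(
    entity_id: str,
    message: str,
    runbook_url: Optional[str] = None,
    graph_url: Optional[str] = None,
    note: Optional[str] = None,
) -> dict:
    """Build a JSON-serializable alert body with optional context links.

    Hypothetical helper: field names are assumptions, not a confirmed API.
    """
    payload = {
        "message_type": "CRITICAL",
        "entity_id": entity_id,
        "state_message": message,
    }
    # Assumed annotation keys: .u. for a URL, .i. for an image, .s. for a note.
    if runbook_url:
        payload["vo_annotate.u.Runbook"] = runbook_url
    if graph_url:
        payload["vo_annotate.i.Graph"] = graph_url
    if note:
        payload["vo_annotate.s.Note"] = note
    return payload


# Placeholder values for illustration only.
example = build_alert_payload(
    entity_id="api/latency",
    message="p99 latency above 2s for 5 minutes",
    runbook_url="https://wiki.example.com/runbooks/api-latency",
    graph_url="https://grafana.example.com/render/d-solo/abc123/api?panelId=2",
)
print(json.dumps(example, indent=2))
```

The sketch only builds the payload and makes no network call. Context fields are added only when a value is supplied, so an alert without a runbook stays free of empty annotations.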

Data-Driven Improvements

Observability is not just about asking questions of your system. It’s also about reviewing processes and taking actions to improve how teams are putting out fires moving forward.

“The only way that we can really improve our processes is to understand what happened at every single step of that incident lifecycle and then make data-driven improvements to either reduce the chances of this incident happening again or learn from it,” said Postma.
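One way to ground those data-driven improvements is to compute the response metrics directly from incident timestamps. The sketch below is illustrative only: the `Incident` record and its fields are hypothetical, the sample data is made up, and acknowledgment time is used as a stand-in for "mobilization", which is an assumption rather than a definition from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with three lifecycle timestamps."""
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mean_duration(durations: List[timedelta]) -> timedelta:
    """Average a list of durations."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


def mean_time_to_acknowledge(incidents: List[Incident]) -> timedelta:
    """Mean gap between an alert firing and someone picking it up."""
    return mean_duration([i.acknowledged - i.triggered for i in incidents])


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Mean gap between an alert firing and the incident being resolved."""
    return mean_duration([i.resolved - i.triggered for i in incidents])


# Made-up sample data: two incidents that fire at the same time.
t0 = datetime(2019, 7, 1, 3, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=10)),
]
mtta = mean_time_to_acknowledge(incidents)
mttr = mean_time_to_resolve(incidents)
print(f"MTTA: {mtta}, MTTR: {mttr}")
```

Tracking both numbers over time shows whether a change such as a better graph or an updated runbook actually shortened the response, rather than relying on impressions from a single incident.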

Post-incident reviews allow your team to blamelessly walk through every step that happened in that incident lifecycle and plan for ways to improve continuously. “Any tweaks will help the next user move forward in the incident faster,” said Postma. “Maybe there’s a more effective graph. Maybe you can update a runbook for your team.”

Or just maybe “you can even find a way to make this incident never happen again,” suggested Postma. “Who knows?”

For more from GrafanaCon 2019, check out all the talks on YouTube.