There is a lot of talk about graphing all the things, but have you ever considered graphing all the people, and in particular their on call shifts, as well?
“Not letting people burnout on call is something that is being talked about in the industry,” said Jordan J. Hamel, Design Engineer at the biotech company Amgen. In fact, earlier this year, Grafana Labs' Director of UX, David Kaltschmidt, shared foolproof Kubernetes dashboards that improve the quality of on call shifts for engineers.
But overall “being on call is something that doesn’t get enough attention from a measurement perspective,” said Hamel.
At Amgen, a values-based organization that transforms new ideas and discoveries into medicines for patients with serious illnesses, Hamel uses Grafana to track on call shifts for his team, and he encouraged other organizations to do the same during his talk at GrafanaCon in L.A.
“There’s a lot of value in collecting the data, no matter the size of your team or the size of your organization,” said Hamel. “What if we know that, in the course of a year, everyone shared their equal time on call, but the on call experience wasn’t exactly even?”
Here, Hamel outlines two major improvements that have resulted from using Grafana to manage on call rotations.
“From an operations perspective, you want to connect people that have questions to the people who can give them answers quickly,” said Hamel.
Oftentimes, organizations have external customers or teams who don’t have regular access to on call schedules. So when a problem arises, they don’t know who to reach out to for assistance with an application, especially during off-hours.
For example, if an important client notes that an app error is causing user dissatisfaction, they need a way to reach someone within the organization to resolve the issue. Or, in some organizations, there is a separate 24-hour NOC or SRE team on duty that users need to reach urgently.
To solve this problem, Hamel has set up SLO dashboards in Grafana with a collapsible row at the bottom that users can expand to see who is on call.
“If you guys don’t use rows in Grafana, it’s great because the queries don’t actually fire until you click on the collapse panel,” said Hamel.
Within the row, you can provide information such as the current on call contact, additional contact info, and the time the on call rotation starts and ends. Data such as the escalation level of a particular alert can also be tracked and communicated to users.
“We have dozens of teams, but we just find a tag identifier and then make this all templated so that you get the right on call person for the right application,” said Hamel.
Now when there is an issue, Hamel said, the dashboards help “to connect people faster and reduce the friction to context.”
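As a rough illustration of the lookup behind such a templated row, here is a minimal Python sketch. It is not Amgen's implementation: the `Shift` record, the `team_tag` field, the `current_on_call` function, and the sample data are all hypothetical, and a real dashboard would query the schedule from a data source rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Shift:
    # All field names are hypothetical; they mirror the details the row displays.
    team_tag: str      # tag identifying the application or team
    engineer: str      # current on call contact
    contact: str       # additional contact info
    start: datetime    # when the rotation starts (inclusive)
    end: datetime      # when the rotation ends (exclusive)


def current_on_call(shifts: List[Shift], team_tag: str, now: datetime) -> Optional[Shift]:
    """Return the shift for `team_tag` that covers `now`, or None if nobody is scheduled."""
    for shift in shifts:
        if shift.team_tag == team_tag and shift.start <= now < shift.end:
            return shift
    return None


# Made-up schedule for two applications.
SHIFTS = [
    Shift("billing", "alice", "alice@example.com", datetime(2024, 1, 1), datetime(2024, 1, 8)),
    Shift("billing", "bob", "bob@example.com", datetime(2024, 1, 8), datetime(2024, 1, 15)),
    Shift("search", "carol", "carol@example.com", datetime(2024, 1, 1), datetime(2024, 1, 15)),
]

if __name__ == "__main__":
    shift = current_on_call(SHIFTS, "billing", datetime(2024, 1, 9, 3, 0))
    if shift is not None:
        print(f"{shift.engineer} ({shift.contact}) is on call until {shift.end:%Y-%m-%d %H:%M}")
```

Keying the lookup on a single tag is what lets one templated dashboard serve many teams: only the tag value changes from one application to the next.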
Better Managers and Better Team Morale
Despite a manager’s best efforts, not all on call shifts are created equal. Engineers may spend the same amount of time on call as their peers, but the workload is often unbalanced because of release schedules, application updates, or unexpected outages.
Engineers are often left to determine for themselves whether they have reached a tipping point at work, but Hamel says that managers can help reduce burnout on their teams simply by tracking the on call process in Grafana.
“People are part of your system, so you should measure the metrics to support them,” he said. “And Grafana is the right place to do it without overcomplicating it.”
Hamel showcased how a dashboard can help track the number of days each engineer serves on call as well as the number of incidents they responded to during their shifts. Now managers are more aware of how balanced their team workload is (or isn’t), and this information has also empowered engineers.
“For those [engineers] that have had the phantom buzz on your cell phone in the middle of the night, this gives you a chance to have a conversation with your manager,” said Hamel. “Instead of feeling like you’re always being called in the middle of the night, now you will have the data to show it.”
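The two numbers Hamel describes, days on call and incidents handled per engineer, can be derived from very little data. The Python sketch below shows one way to compute them; the tuple layouts, the `on_call_load` function, and the sample records are assumptions made for illustration, not the queries behind Hamel's dashboard.

```python
from collections import defaultdict
from datetime import date


def on_call_load(shifts, incidents):
    """Summarize on call load per engineer.

    shifts:    iterable of (engineer, start_date, end_date), end date exclusive
    incidents: iterable of (engineer, incident_date)
    Returns {engineer: {"days": int, "incidents": int}}.
    """
    load = defaultdict(lambda: {"days": 0, "incidents": 0})
    for engineer, start, end in shifts:
        load[engineer]["days"] += (end - start).days
    for engineer, _incident_date in incidents:
        load[engineer]["incidents"] += 1
    return dict(load)


# Made-up data: both engineers serve one week, but the weeks are not alike.
SAMPLE_SHIFTS = [
    ("alice", date(2024, 1, 1), date(2024, 1, 8)),
    ("bob", date(2024, 1, 8), date(2024, 1, 15)),
]
SAMPLE_INCIDENTS = [
    ("alice", date(2024, 1, 2)),
    ("alice", date(2024, 1, 3)),
    ("alice", date(2024, 1, 3)),
    ("alice", date(2024, 1, 5)),
    ("alice", date(2024, 1, 6)),
    ("bob", date(2024, 1, 10)),
]

if __name__ == "__main__":
    for engineer, stats in sorted(on_call_load(SAMPLE_SHIFTS, SAMPLE_INCIDENTS).items()):
        print(f"{engineer}: {stats['days']} days on call, {stats['incidents']} incidents")
```

In this sample both engineers spend seven days on call, yet one handles five incidents and the other handles one, which is the kind of imbalance that equal time on the schedule can hide.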
And in the end, said Hamel, everyone will “end up with a happier on call experience.”
To watch Hamel walk through his best tips for collecting on call data for Grafana, click here.