How Grafana Cloud is enabling HotSchedules to develop next-generation applications

From the cooks to the wait staff to the manager and beyond, it takes a village to run a restaurant. HotSchedules was founded in 1999 to provide the software support to keep those operations—shift schedules, sales receipts, recipes, inventory, perhaps even the kitchen sink—humming along. Today, the company serves more than 2.8 million users in 160,000 locations around the world, from mom-and-pop joints to global companies like Yum! Brands.

Four years ago, Denise Stockman joined HotSchedules to build a new platform for the development of next-generation applications. "It was like a startup within a much larger company," says Stockman, now the Senior Director, Infrastructure at HotSchedules. "We were coming up with new ways to ingest data from various sources and build applications to interact with that data." One of the company's marquee developments on this new system is Clarifi™, the first cloud-based, intelligent operating platform for restaurants.

Stockman and her infrastructure engineering team are tasked with building the core operational foundations for the Clarifi Platform by focusing on automation, observability, scalability, and reliability. From the beginning, she understood the role that metrics and ongoing monitoring would play in supporting that mission. "Four years ago, the APM [Application Performance Monitoring] market was primarily targeted towards mature applications," says Stockman. "So it was kind of an afterthought, a bolt-on."

A belief in instrumentation

Stockman had a different opinion: She wanted to invest in deep system, application and infrastructure instrumentation. "We should be aware of what our applications are doing," she says. "We should be able to add or remove metrics. If we're curious about running some kind of an experiment and observing the outcome of that experiment, we ought to be able to do it without being constrained by any APM framework or additional operational overhead."

Embracing standard open source technology, the team used collectd and StatsD to collect metrics and send them to Stackdriver. Stackdriver performed well until it was acquired by Google in 2014 and began following a roadmap optimized for Google Cloud. Since HotSchedules was on AWS, "it wasn't a good fit," says Stockman. So she and her team set out to find a better-suited solution.
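For readers unfamiliar with StatsD, the sketch below shows roughly what emitting a metric looks like at the wire level: a short plain-text line of the form name:value|type sent over UDP, conventionally to port 8125. This is an illustration only, not HotSchedules' actual instrumentation; the metric names and helper functions are hypothetical.

```python
import socket


def format_statsd(name: str, value: float, metric_type: str) -> bytes:
    """Build one StatsD datagram in the form <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer) and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a datagram to a StatsD daemon (8125 is the usual default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, chosen only for illustration.
counter = format_statsd("orders.submitted", 1, "c")
timer = format_statsd("orders.latency", 320, "ms")
gauge = format_statsd("kitchen.queue_depth", 12, "g")
```

Calling send_metric(counter) would hand the datagram to a local StatsD daemon if one is listening. Because the transport is UDP, the application does not block or fail when the collector is down, which is part of why adding or removing a metric carries so little overhead.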

Using a hosted monitoring service wasn't always her first instinct. "A million years ago, I was of the opinion that monitoring is one of the most critical aspects to delivering services to your customers, so you want to run it yourself. How could you trust somebody else to do it?" she says. "Then I had this 'aha' moment that this [running our own] isn't our core competency. If our monitoring services are really to be that reliable, then it needs to be even more robust than the systems that we're building. And there's just a scale of economy where that doesn't fit."

Enter Grafana Cloud

The goal for Stockman and her team was clear: find a solution that "provides portability but doesn't require our teams to run large scale metrics sources and collection systems within our own infrastructure." After a full review of the available solutions – including Datadog, SignalFx, and Sysdig – Stockman chose Grafana Cloud.

Grafana's offering aligned well with HotSchedules' wishlist: A full API. An open system. Open clients and agents. Data retention. Enabling of developer self-service. Configurable metric intervals. And visualization that's "more than just line graphs."

"We wanted to be masters of our own domain," she says. "We wanted something that allowed us to easily pour into it as well as the option for portability. If there's a feature that we want to use, either the Grafana folks can do it for us, or we can build our own plugin for it. Or, maybe somebody else [from the community] develops it."

Perhaps the most important factor? "The holy grail for us is to have lots of different data sources. We are able to bring in data from MySQL, Elasticsearch, Graphite, InfluxDB, Prometheus, and any future systems to effectively federate all of that information into one place so that we can have a single pane of glass," says Stockman. "We can all be looking at the same thing from any point in the company, like a business metric overlaid with engineering. You can see the context and identify and solve problems faster."

Monitoring in action

HotSchedules migrated to Grafana in December 2017 and now has nine data sources integrated in its Grafana dashboards. With its current usage at over 1.3 million data points per minute and 260,000 active series in a seven-day period, the company has fully embraced its monitoring solution. "It's now the first place that most people go when things start going sideways," says Stockman.

Before logging into systems or digging through log files, people are looking at their dashboards. "It gives our teams a way to quickly identify where to start looking to form a theory about why this incident or this issue is happening. It gives them a place to be able to dig in further," she says. Visibility into all these metrics enables the service delivery teams to quickly iterate on new features, observe their behavior, and respond to changes in their systems before customers are impacted.

In the spirit of transparency, Stockman gives access to the metrics and the dashboards to anyone on staff who wants it. "Customer support, implementation, product management, other people that are not in engineering have access to this," she says. "We're now starting to learn how to have a similar way of talking about a problem and making observations across the entire freaking company, which is great."

Although engineers were always responsible for their own alerting and metrics in the previous Stackdriver service, back then "it was not an easy task," says Stockman. "And it's still ongoing. We manage it, help support them, give people pointers. But at the end of the day, we're committed to freedom and responsibility. We can give them recommendations, but we also have to be gracious and just be like, 'Oh, you'll find out.' And it's your home. You can rearrange the furniture however you want."

In the effort to win over hearts and minds, Stockman says empathy is key: "You need to understand where your users are coming from and try to find ways to solve points of friction or just lack of understanding."

She also relied on the people who stepped up as champions for metrics. "When we were looking for our next monitoring solution, we included them [developers] in the process as we were going through it," she says. "They were our beta users. They drove the requirements based on what they loved or disliked about the previous Stackdriver service. When we switched over, there was a lot of positivity and enthusiasm. We addressed a lot of the points of friction that our previous service had which went a long way to improving the day to day for our users."

With that buy-in, the culture at the company has changed too. "In the earlier days of the new platform, people would go into Slack and mention, 'Hey, I can't do this thing, your service is broken,'" says Stockman. "Now we're having more engaged conversations, 'Is it me or is it the service? Is this API call right or not? Should I be able to do this?' And you know, if they want to assert that the system is broken, they tend to first go to the metrics dashboards that we have for various services, and realize, 'Oh no, it's me.' Which is super cool."

A valued partnership

HotSchedules is currently rolling out the first phase of Clarifi applications that were developed on the Platform. "The market momentum looks huge – it looks like it's going to be a very successful product within the market," says Stockman.

All of this development has been done with Stockman's five-member infrastructure engineering team supporting all of HotSchedules' 200 developers, and Grafana's robust, hosted metrics solution plays a big role in enabling that. "Grafana Cloud enables us to achieve observability bliss at HotSchedules," she says. "We don't have to worry about scaling and maintaining the service, so it frees us up to focus on the most crucial aspect of our service delivery – publishing the right metrics, observing them, making decisions about how to deliver our service and ultimately delivering Clarifi with the best experience to our customers. It's become a central point for all developers, support staff, and product owners to understand how our services are performing and leveled up our internal conversations so we're now all talking from the same page."

For that, Stockman is more than satisfied with the decisions they made along the way. "Our relationship with Grafana feels a lot more like a partnership than just a business relationship," she says. "When we raise issues, they get attention, they get priority – and they get resolved. I have a trust in that."

Moreover, as a self-confessed "open-source hippie," Stockman says she's "absolutely elated to be paying money to a business that is helping to fund open source software. That is the beauty of open source and that's how it's supposed to work."