How SpotOn overhauled its observability strategy with standardized tagging and Grafana Cloud

Kristin Knapp • 2025-04-12 • 9 min

Many engineers would agree: migrating to a new observability platform can be a serious undertaking. But it’s also the perfect opportunity to step back, revisit the foundational practices that drive your observability strategy, and reap some major benefits as a result.

This was the case at SpotOn, a provider of restaurant point-of-sale systems and business software, which recently migrated from four disparate observability tools and consolidated on Grafana Cloud. As part of that project, the engineering team implemented a standardized tagging taxonomy that, alongside Grafana Cloud, helped them streamline alerting and incident response, cut costs, and provide data-driven insights to key stakeholders.

“Ultimately, in our opinion, the goal of an observability team is to provide useful data to your organization to make quality decisions,” said Sai Soundararajan, Engineering Manager at SpotOn, at ObservabilityCON on the Road in the San Francisco Bay Area earlier this year.

Read on to learn more about SpotOn’s migration story, or watch their full ObservabilityCON on the Road talk on demand.

Cutting through complexity: why SpotOn needed a change

SpotOn has a diverse and fairly complex product and technology landscape, ranging from the handheld POS and kitchen display systems you commonly see in restaurants to the backend services that underpin online ordering and loyalty programs.

Different teams at the company are responsible for different product and technology areas, so if something went wrong, it was tough to know who to reach out to for answers and a resolution. In many cases, engineering would simply resort to using the git-blame tool to determine who last modified a file.

“One of the biggest challenges we had was figuring out who owns a service,” explained Jeremy White, VP of Engineering at SpotOn. “This is not great when you’re in an emergency setting. We needed a way to consistently get the correct answer, not just find someone who wrote that system several years ago.”

The second big challenge was a lack of clarity around which underlying resources, from logs to databases, were supporting each product or service.

“We found that a lot of the teams were not familiar with what resources or what different pieces make up their systems,” White said. “They’re focused primarily on the code, and don’t necessarily understand all the other pieces that come together.”

Given these challenges, the team knew it was time for a change. “Our goal was to try to get some transparency around how we’ve organized all of our systems, and who those systems and components belong to,” White said.

Tag, you’re it: an overview of SpotOn’s standardized tagging taxonomy

The solution, the team found, was to migrate off their multiple observability tools and onto Grafana Cloud, establishing a robust and standardized tagging system across their infrastructure and application environments along the way.

At its highest level, their taxonomy was built around:

Products/services: The products and services that customers purchase and use.
Domain/teams: Groups of employees within the product and engineering teams.
System/components: Engineering assets that are built, hosted, and deployed. These include software components, such as microservices; infrastructure resources, such as databases; and APIs.

Leveraging Backstage, an open source framework for building developer portals, the engineering team started to implement the tagging system. They made it a requirement to define domains in a catalog-info.yaml file within their GitHub repos.

“When we implemented this, we had over 1,800 repositories and we made it a requirement where every single one of our GitHub repositories required this metadata file for catalog info,” White said. “In that file, we defined a lot of other things, but the key part is that we specify the domain. That means we know who owns that repository from a domain perspective, so we know their leader and then can drill down from there if we need more information.”
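A requirement like this is typically enforced with an automated check. The sketch below is illustrative only and is not SpotOn's tooling. It assumes each repository's catalog-info.yaml has already been parsed into a dictionary, and that the domain lives under a hypothetical `spec.domain` key; where the domain is actually recorded depends on how a team models its Backstage catalog.

```python
from __future__ import annotations

from typing import Any

# Hypothetical required fields. The article only confirms that the
# metadata file specifies the owning domain.
REQUIRED_SPEC_FIELDS = ("domain",)


def missing_catalog_fields(catalog: dict[str, Any]) -> list[str]:
    """Return the required spec fields that are absent or blank."""
    spec = catalog.get("spec") or {}
    return [
        field
        for field in REQUIRED_SPEC_FIELDS
        if not str(spec.get(field) or "").strip()
    ]


def repos_without_owner(repos: dict[str, dict[str, Any] | None]) -> list[str]:
    """List repositories with no metadata file (None) or no domain."""
    return sorted(
        name
        for name, catalog in repos.items()
        if catalog is None or missing_catalog_fields(catalog)
    )


# Made-up repositories, for illustration only.
example_repos = {
    "payments-api": {"kind": "Component", "spec": {"domain": "payments"}},
    "legacy-tool": {"kind": "Component", "spec": {}},
    "no-metadata": None,
}
print(repos_without_owner(example_repos))
```

Run in CI or as a scheduled audit, a check of this shape reports every repository that still needs an owner, which is what makes an "every repository" rule enforceable across well over a thousand repos.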

Once these tags are applied at the infrastructure layer, they become Prometheus labels that Grafana Cloud can scrape directly.

To apply tags at the application layer, the engineering team embraced OpenTelemetry as a means of standardization, along with custom SDKs for their developers.

“To extend the standardization technique into the application-level layers… our answer was twofold,” said Soundararajan. “OpenTelemetry is the foundation, and then we have custom SDKs for developers, which are absolutely crucial from a centralized observability standpoint because you can use them as an opportunity to expedite OpenTelemetry adoption within the company.”

Through the use of zero-code instrumentation, the team significantly reduced the “cognitive load” for developers to instrument applications, allowing them to focus instead on business telemetry and metrics. What’s more, with Grafana Cloud Application Observability, an out-of-the-box solution to monitor applications and minimize MTTR, they can quickly and easily track app performance in real time.

“All of our SDKs across four different languages injected the same set of labels like service name, service version, and deployment environment,” Soundararajan said. “And the good incentive for any developer to actually start instrumenting application observability is that, if you use Grafana Cloud, the moment you instrument your application and deploy it in a cluster, you get a RED metrics dashboard out-of-the-box.”
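To make the idea of an SDK injecting one shared label set concrete, here is a minimal sketch. It is not SpotOn's SDK. It assumes the labels map to the OpenTelemetry resource attributes `service.name`, `service.version`, and `deployment.environment` (the exact environment key varies between semantic-convention versions), and it formats them for the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable. The function names are hypothetical.

```python
from __future__ import annotations

from urllib.parse import quote


def standard_resource_attributes(
    service_name: str, service_version: str, environment: str
) -> dict[str, str]:
    """Build the label set every service is expected to carry."""
    return {
        "service.name": service_name,
        "service.version": service_version,
        # Assumed key; newer semantic conventions use a different name.
        "deployment.environment": environment,
    }


def to_otel_env_value(attributes: dict[str, str]) -> str:
    """Format attributes as a comma-separated key=value list.

    Values are percent-encoded so that commas and equals signs inside
    a value cannot break the list.
    """
    return ",".join(
        f"{key}={quote(value, safe='')}" for key, value in attributes.items()
    )


# Made-up service details, for illustration only.
attrs = standard_resource_attributes("checkout", "1.4.2", "production")
print(to_otel_env_value(attrs))
```

Centralizing this in one helper per language means individual teams never choose attribute names themselves, which is how the same labels end up on every service.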

A screenshot of a dashboard in Grafana Cloud Application Observability.

Lastly, the engineering team wanted to extend this level of standardization to their telemetry pipeline to avoid having any missing labels or inconsistent tagging. Fortunately, using Grafana Alloy as their programmable telemetry pipeline made this piece fairly straightforward, the team noted. Essentially, the pipeline takes any raw or unstructured data and passes it through a processing layer to perform the extraction, transformation, and standardization process. Then, a label is applied before it reaches Grafana Cloud.

“With Grafana Alloy service discovery, we tap the pod domain and sub-domain labels and use them for all the signals,” said Soundararajan. “So all your signals — metrics, traces, and logs — will have the same set of labels.”
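The standardization step can be pictured as a small transformation applied to every signal's labels. The sketch below is plain Python rather than Alloy configuration, and it is not SpotOn's pipeline. The alias table, the `unknown` fallback, and the function name are all assumptions made for illustration.

```python
from __future__ import annotations

# Hypothetical aliases that different teams might have used for the
# same concept before standardization.
LABEL_ALIASES = {
    "team": "domain",
    "business_domain": "domain",
    "sub_domain": "subdomain",
}

# Labels every signal should carry, per the article.
REQUIRED_LABELS = ("domain", "subdomain")


def standardize_labels(raw: dict[str, str]) -> dict[str, str]:
    """Normalize label names and guarantee the required labels exist."""
    labels: dict[str, str] = {}
    for key, value in raw.items():
        name = key.strip().lower().replace("-", "_")
        name = LABEL_ALIASES.get(name, name)
        # If two raw keys map to the same name, the first one wins.
        labels.setdefault(name, value.strip())
    for required in REQUIRED_LABELS:
        # Make gaps visible instead of silently dropping the label.
        labels.setdefault(required, "unknown")
    return labels


print(standardize_labels({"Team": "payments", "sub-domain": "refunds", "pod": "api-1"}))
```

Applying one function of this kind to metrics, traces, and logs alike is what lets a single label filter work across all three signal types.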

The results: faster troubleshooting, cost savings, and an observability culture shift

After migrating to Grafana Cloud and implementing their standardized tagging system, SpotOn saw a number of benefits, ranging from streamlined troubleshooting to an overall observability culture shift that extends far beyond the engineering team itself.

‘Foundational’ dashboards for faster troubleshooting

Now, when an issue arises, the engineering team no longer has to rely on the git-blame game to find the stakeholders and answers they need.

Instead, they rely on “foundational” Grafana Cloud dashboards to quickly and easily filter by business domain (for example, guests or payments). Because the labels are consistent, teams don’t have to build unique dashboards per service; everyone sees the same labeled data, making cross-team troubleshooting faster and easier.

A screenshot of a foundational Grafana Cloud dashboard used at SpotOn.

“Since we have solved the problem of labeling a source, all we then had to do was consolidate and create a dashboard based on the domains. You can now filter the business domains instead of worrying about technical details,” Soundararajan said, adding: “You have this bird’s eye view of all infrastructure pieces from one single context.”

Cost optimization

Prior to their migration, the SpotOn engineering team also struggled with cost allocation due to that lack of visibility into which teams were responsible for which projects or resources. As a result, monthly cost overages were common.

“We would have to go on a treasure hunt trying to figure out which team turned on a feature, added extra logs, or added extra metrics,” White said. “It was very difficult.”

Now, the team is easily able to identify sources of high resource utilization, and make informed decisions around right-sizing their AWS resources — a change that has led to millions of dollars in annual cost savings.

“This has fast tracked a lot of [cost] discussions we had before, where we would simply hit a wall trying to get to conclusions,” White said.
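Once every resource carries a domain tag, attributing spend becomes a simple aggregation. The following is a minimal sketch with invented numbers, not SpotOn's data or tooling. The `domain` and `monthly_cost_usd` field names are assumptions.

```python
from __future__ import annotations

from collections import defaultdict


def cost_by_domain(resources: list[dict]) -> dict[str, float]:
    """Sum monthly cost per domain, highest spend first."""
    totals: defaultdict[str, float] = defaultdict(float)
    for resource in resources:
        # Untagged resources get their own bucket so they stay visible.
        domain = resource.get("domain") or "untagged"
        totals[domain] += resource["monthly_cost_usd"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Invented figures, for illustration only.
example_resources = [
    {"name": "orders-db", "domain": "payments", "monthly_cost_usd": 1200.0},
    {"name": "orders-cache", "domain": "payments", "monthly_cost_usd": 300.0},
    {"name": "loyalty-api", "domain": "guests", "monthly_cost_usd": 450.0},
    {"name": "old-bucket", "monthly_cost_usd": 75.0},
]
print(cost_by_domain(example_resources))
```

The "untagged" bucket matters as much as the named ones: it shows how much spend still has no owner, which is the gap the tagging taxonomy is meant to close.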

Streamlined alerting

By using Grafana Cloud IRM, and leveraging their resource tags and labels as the routing intelligence behind their alerting system, the SpotOn engineering team was also able to significantly streamline alerting.

Before Grafana Cloud, the team had more than 870 alerts with no clear ownership. Those alerts would often be routed to the wrong team member (or even former employees), while alert fatigue and noise hindered incident response. Now, they apply infrastructure as code principles to alert management, and benefit from centralized alerting policies, consistent labeling, and team-level customization.

“Grafana makes it extremely easy to create an alert from a dashboard,” Soundararajan said. “The moment the alert fires, all the alert instances will have the domains and the subdomain information, which actually was configured through code as the notification policy.”
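Label-based routing of this kind can be modeled as matching an alert's labels against an ordered list of routes. The sketch below is a deliberately simplified model and not how Grafana evaluates notification policies internally. The `Route` type, the contact point names, and the first-match rule are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Route:
    """A routing rule: every matcher must equal the alert's label value."""

    matchers: dict[str, str]
    contact_point: str


def route_alert(labels: dict[str, str], routes: list[Route], default: str) -> str:
    """Return the contact point of the first matching route, else the default."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route.matchers.items()):
            return route.contact_point
    return default


# Hypothetical routes, ordered from most to least specific.
example_routes = [
    Route({"domain": "payments", "subdomain": "refunds"}, "payments-refunds-oncall"),
    Route({"domain": "payments"}, "payments-oncall"),
]

alert_labels = {"domain": "payments", "subdomain": "refunds", "severity": "critical"}
print(route_alert(alert_labels, example_routes, default="platform-oncall"))
```

Because the routes are data rather than manual configuration, they can be kept in version control and reviewed like any other code change, in line with the infrastructure-as-code approach described above.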

Incident response, as a result, is now controlled rather than chaotic.

“No more trying to pull in everyone all at once,” White said. “We can now be focused and pull in the team that really is relevant for the particular issue.”

A ‘big’ cultural shift

Overall, the move to Grafana Cloud — in conjunction with the new, standardized tagging practices — has helped the SpotOn engineering team focus more on business outcomes and providing powerful, data-driven insights to stakeholders outside their team.

“Because of those foundational dashboards, we stopped asking engineers to create dashboards that included CPU memory and things like that. Those [technical details] became just baked in,” White explained. “We were able to shift to focus on the metrics that really matter, because dashboards should be telling a story,” he said, adding: “Now, product and sales teams can look at these dashboards and quickly answer questions around what’s working and what’s not.”

At the end of the day, SpotOn now sees observability as a “team sport.”

“It’s like a powwow where everybody chimes in, and everybody has some rewards to reap,” Soundararajan said. “For us, it was about the support of your R&D team, your infrastructure team, and having Grafana being a part of that same team.”

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!
