Cutting complexity and costs: Why a leading financial infrastructure and blockchain company migrated to Grafana Cloud for centralized observability

Sometimes, the old saying is true: less really is more.

This was a key lesson learned by the platform engineering team at one leading financial infrastructure and blockchain technology company when they migrated from a number of observability tools, including Datadog, Sumo Logic, and Sentry, and consolidated on Grafana Cloud.

“Before Grafana Cloud, we had this battle of everyone doing their own thing,” said one Senior Staff Platform Engineer at the company. “And one of the things we were trying to do as an organization was to get everybody working the same way and on the same tooling, within this mindset of, ‘what are we currently using and what makes sense long-term and cost-wise?’”

What made sense, ultimately, was a migration to Grafana Cloud — a move that helped the company not only gain control over their observability costs, but also streamline troubleshooting and empower their platform engineering team in ways that simply weren’t possible before.

Why a siloed approach wasn’t cutting it

Largely as a result of a series of acquisitions, the company’s observability strategy before Grafana Cloud was siloed. Teams were using disparate tools, which slowed down incident response and resolution, and the developer experience as a whole suffered because observability data was scattered across systems.

At the time, teams were primarily using Datadog for APM and metrics, and Sumo Logic and Sentry for logs. Grafana Enterprise and Prometheus were also part of their monitoring stack.

“We had all this telemetry in different places,” explained a former Senior Platform Engineer at the company. “You had to jump between different tools to figure out what happened. To correlate traces to logs, you’d have to learn three different tools. So it was hard to investigate an issue and ramping up took people a long time.”

In addition to these inefficiencies, a siloed approach made it difficult for the team to track, let alone optimize, their observability spend.

“These tools were being used without a lot of thought put into the actual usage and what we were getting out of it,” the Senior Staff Platform Engineer said. “And because it was so spread out, we had three different billing sources, and it was tough to understand what was most important. So, one of the main things that we were looking for was a better way to control costs.”

Why Grafana Cloud: a centralized, cost-effective solution

In their search for a new observability platform, the firm had three primary objectives:

  1. Create a unified view of telemetry data
  2. Find a solution that’s highly customizable and rooted in open source
  3. Gain the ability to rein in costs

Grafana Cloud checked all those boxes, and Adaptive Metrics — a feature that aggregates unused and partially used metrics into lower cardinality versions of themselves to reduce costs — was an especially big draw for their platform engineering team, allowing them to easily identify “problem metrics.”
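To make the idea of aggregating a metric into a lower-cardinality version of itself concrete, here is a minimal Python sketch. It is an illustration only, not how Adaptive Metrics is implemented: the label names, sample values, and the aggregate_away helper are all hypothetical, and it assumes counter-style values that can safely be summed.

```python
from collections import defaultdict


def aggregate_away(series, drop_labels):
    """Sum series that become identical once the given labels are dropped."""
    totals = defaultdict(float)
    for labels, value in series:
        # Keep only the labels that are still wanted, in a hashable form.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return [(dict(kept), value) for kept, value in totals.items()]


# Hypothetical request counter with one series per pod.
raw = [
    ({"service": "api", "status": "200", "pod": "api-1"}, 120.0),
    ({"service": "api", "status": "200", "pod": "api-2"}, 80.0),
    ({"service": "api", "status": "500", "pod": "api-1"}, 3.0),
    ({"service": "api", "status": "500", "pod": "api-2"}, 1.0),
]

# Assume nobody queries by pod, so that label can be aggregated away.
reduced = aggregate_away(raw, drop_labels={"pod"})
print(len(raw), "series ->", len(reduced), "series")
```

In this toy example, dropping a label that no dashboard or alert uses halves the series count while the per-service, per-status totals stay intact.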

“Datadog wasn’t very good at surfacing the most expensive metrics,” added the former Senior Platform Engineer, “whereas with Grafana Cloud, there’s a prebuilt dashboard that gives you all that information and it’s very, very clear.”

The team also chose Grafana Cloud because of its roots in open source, and, in particular, its close alignment with the Prometheus and OpenTelemetry open source projects. Given that Prometheus is the standard for metrics collection across the organization, and the engineering team wanted to minimize vendor lock-in risks, an open source approach was “critically important” in their search for an observability platform.

“We asked, ‘what would make sense if we eventually needed to move away from someone hosting our data to us hosting our data?’ If we went with Datadog, it was closed source, so we wouldn’t be able to do anything,” said the Senior Staff Platform Engineer. “If we went with Grafana Cloud, it’s based on open source, so we could bring these tools inside or find alternatives.”

Plus, the team took note of Grafana Labs’ active involvement in Prometheus and OpenTelemetry and their respective communities.

“We noticed a lot of the Grafana engineers are contributors to Prometheus, and they’re contributors to OpenTelemetry, where you just don’t see that as much in Datadog,” the former Senior Platform Engineer said.

A closer look at the company’s migration to Grafana Cloud

The platform engineering team migrated about 40 dashboards and 50 alerts from Datadog to Grafana Cloud. The full process took roughly a year, but most teams had migrated their alerts and dashboards, and were sending data to Grafana, within about 8 months. The team noted that Grafana Cloud’s native support for Prometheus did a lot to ease the move.

With more than 10 million metrics series, the engineering team sharded Prometheus and ran it in high-availability mode, which Grafana Mimir — the open source, horizontally scalable, highly available, multi-tenant TSDB that powers Grafana Cloud Metrics — “supported seamlessly.”

Of course, any major migration comes with challenges. When the team encountered a roadblock or needed guidance, they found the support they needed in Grafana Labs’ Professional Services. They set up regular office hours so engineers and developers could connect directly with Grafana Labs teams and get answers.

“With Grafana, we had a dedicated solutions engineer and we had a lot of people helping us out. Whenever we had issues, we knew the responsiveness was very fast.”

Benefits of migrating to Grafana Cloud: ‘A lot of control over our costs’

Ultimately, the team’s move to Grafana Cloud has paid off in more ways than one.

First, through the use of Adaptive Metrics, they’ve been able to significantly reduce their metrics volume — from about 20 million to 13 million — and save roughly 30% annually. The team also uses the Exemptions feature in Adaptive Metrics to exclude certain metrics from aggregations and preserve critical data they know their team will need.

“Overall, Adaptive Metrics has led to a lot of control over our costs,” the Senior Staff Platform Engineer said. The team also generates their own APM metrics using the OpenTelemetry Collector to control the cost-precision tradeoff of high-cardinality histogram metrics.
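The cost-precision tradeoff of histogram metrics can be sketched with some quick arithmetic. The Python snippet below assumes classic Prometheus-style histograms, where each label combination produces one series per explicit bucket bound, one for the implicit +Inf bucket, and the _sum and _count series. The label counts and bucket boundaries are hypothetical and do not reflect the company’s actual Collector configuration.

```python
def histogram_series(label_combinations, bucket_bounds):
    """Series produced by a classic Prometheus histogram.

    Per label combination: one series per explicit bucket bound,
    one for the implicit +Inf bucket, plus _sum and _count.
    """
    return label_combinations * (len(bucket_bounds) + 1 + 2)


# Hypothetical dimensions: 50 services x 20 routes x 5 status codes.
combos = 50 * 20 * 5

# Finer buckets give more precise latency quantiles but more series.
fine = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
coarse = [0.1, 0.5, 1, 5]

fine_total = histogram_series(combos, fine)
coarse_total = histogram_series(combos, coarse)
print(fine_total, coarse_total)
```

Under these assumptions, moving from 11 bucket bounds to 4 halves the series count, at the price of coarser latency percentiles.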

What’s more, by consolidating on Grafana Cloud, the team has been able to unify logs, metrics, traces, and alerts to get deeper visibility into application performance and resolve issues faster. For example, by standardizing on Grafana Tempo — the open source, highly scalable distributed tracing backend that powers Grafana Cloud Traces — engineers can “finally” see the full picture of communication between microservices in their environment.

“Now, we share tracing that spans the initial request through the entire stack, and it’s very, very detailed,” the Senior Staff Platform Engineer continued. “We see every call and every checkpoint. We were recently debugging an issue where it looked like our proxy was receiving the request, but never handing it off to the workload underneath it. We could see the actual gap. We never had that kind of visibility throughout our application process before.”

Grafana Cloud has also simplified the process of configuring and sending alerts.

“What is quite unique to Grafana is the ability to create an alert once,” said the former Senior Platform Engineer. “We can create an alert as the platform team, make sure it’s correct, and then route that to every other team automatically, whereas in Datadog, each team would have to know to create that alert and create it properly, so it just never got done. Now, we can become the experts for alerting, and any of the other teams can just automatically pick it up for free.”
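The “create once, route everywhere” pattern described here can be sketched in a few lines of Python. This is a conceptual illustration, not Grafana’s alerting API: the AlertRule class, the team names, the channel names, and the route helper are hypothetical, and the query string is only an example of a rule that carries a team label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    expr: str  # one query shared by every team


# Hypothetical mapping from a "team" label value to a notification channel.
ROUTES = {"payments": "#payments-oncall", "custody": "#custody-oncall"}
DEFAULT_ROUTE = "#platform-oncall"


def route(rule, firing_labels):
    """Pick the notification channel from the firing alert's team label."""
    channel = ROUTES.get(firing_labels.get("team"), DEFAULT_ROUTE)
    return f"{rule.name} -> {channel}"


# The platform team defines the rule a single time.
high_error_rate = AlertRule(
    name="HighErrorRate",
    expr='sum by (team, service) (rate(http_requests_total{status=~"5.."}[5m])) > 1',
)

print(route(high_error_rate, {"team": "payments", "service": "ledger"}))
print(route(high_error_rate, {"service": "unknown"}))
```

Because routing depends only on labels attached to the firing alert, a new team picks up the rule by labeling its services rather than by writing its own copy.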

For the company, this theme of ease-of-use and enablement, more broadly, has been another big benefit of moving to Grafana Cloud.

“I feel like Grafana is a bit more of a platform engineering tool than Datadog is. As part of our migration, my colleague made this example dashboard that you could just copy. It had latency for SQL queries, request rates, and error rates, and it would just work for anybody that integrated with how we were doing OpenTelemetry.”

What’s next

Looking ahead, the company plans to build out its Adaptive Telemetry strategy by implementing Adaptive Logs, another Grafana Cloud feature that helps reduce costs and cut through the noise by identifying commonly ingested log patterns and creating customized recommendations for dropping unused telemetry.

They also anticipate doing more with continuous profiling with Grafana Cloud Profiles, and ramping up with some of the AI/ML features in Grafana Cloud. Ultimately, they see the platform as being a “one-stop shop” for all their observability needs.

“At this point, we’re layering in all the tools and looking forward to how, in the future, AI can be laid on top of that and easily integrate with all the data we have,” the Senior Staff Platform Engineer concluded.

Importantly, they’re also confident that as they continue to build out their observability strategy, the Grafana Labs team will be there every step of the way.

“We’ve opened support tickets on the portal and it seems like we always get a response within 30 minutes,” said the former Senior Platform Engineer. “It’s been above and beyond what we’d expect.”