Grafana at scale on Azure: How Backbase deploys customer stacks in minutes and empowers users to turn business data into faster decisions

Company: Backbase

Industry: Financial Services

Backbase is a global leader in digital and AI-powered banking software, serving banks, credit unions, and financial institutions worldwide. Its Engagement Banking Platform unifies customer and employee experiences from digital onboarding and retail banking to lending, investing, and assisted service, helping banks deliver seamless, customer-centric journeys. Headquartered in the Netherlands, Backbase supports financial institutions across North America, Europe, the Middle East, and Asia-Pacific.

Challenge

Backbase launched its managed hosting product line to offer banks a fully hosted, secure, and compliant version of its engagement banking platform on Microsoft Azure. As the business scaled to 70+ hosted banks, the team needed deep visibility across thousands of Kubernetes workloads and Azure services to maintain 99.9% uptime SLAs and compliance requirements. Their previous observability setup was complex, manual, and costly with limited automation for single sign-on, log filtering, or cross-environment visibility, and poor support for multi-tenant isolation and customer-specific configurations.

Solution

Backbase unified its observability on Grafana Cloud, integrating monitoring, alerting, cost optimization, and analytics into one automated platform.

  • Grafana Alloy + Kubernetes Monitoring Helm Charts: Simplified observability setup and scaling across 70+ customer environments, with modular configurations for application and infrastructure monitoring.
  • Azure Integration (Cloud Provider Observability): Automatically ingests Azure Monitor metrics and logs into Grafana Cloud, converted to Prometheus format for unified visualization.
  • Grafana Cloud Logs + Event Hub Pipeline: Centralized, filtered log ingestion through the Loki forwarder, removing unnecessary logs while preserving compliance.
  • Grafana Cloud Adaptive Metrics: Optimized telemetry and resource costs, automating right-sizing decisions via Terraform.
  • Centralized Grafana Stack: Created a single pane of glass to monitor all customer environments in real time, essential during global cloud outages or incidents.
  • Granular access controls (RBAC + label-based): Enabled secure collaboration between Backbase and its banking customers while maintaining strict isolation.

“Grafana Cloud fits perfectly into our ‘automate everything’ strategy. It gives us control over costs, avoids surprises, and integrates seamlessly into our workflows.”

– Andrei Drumov, Systems Engineer

Impact

Migrating to Grafana Cloud has transformed how Backbase manages and scales observability for its managed hosting customers.

  • Achieved full-stack visibility across infrastructure, apps, and Azure services, supporting 24×7 uptime and DR readiness.
  • Reduced operational complexity and manual effort through automation with Terraform, Helm, and serverless telemetry flows.
  • Improved cost efficiency via log quotas, Adaptive Metrics, and OpenCost insights, preventing runaway telemetry expenses.
  • Enhanced customer trust and transparency with data-driven dashboards and centralized status views across global banks.
  • Strengthened partnership with Grafana Labs, benefiting from rapid feature delivery and ongoing technical collaboration.

“We can’t deliver 99.9% SLAs for 70 banks without proper observability. Grafana has been exceptional, both the product and the partnership.”

– Manu Chadha, Product Director

Manu Chadha, Backbase (00:00):

Hello. Hi everyone. I'm Manu; we'll introduce ourselves in the upcoming slides and walk you through today's agenda. We work at Backbase, which sells digital banking software to banks. What we are going to speak about today is a bit of introduction about ourselves, and then what Backbase does. We'll talk about managed hosting from a technical perspective; that's the product line with which we offer banks the capability to host the Backbase software for them on the cloud. Then we'll talk a bit about Backbase observability, and of course the reason for that is that we run our observability platform with Grafana. We'll share a few takeaways, and at the end, if you have any questions, we are happy to take those as well. Let's start with the introductions. Very happy to be here; nice to meet you all. I'm Manu, the product director for the Backbase managed hosting product line. I've been working with Backbase for close to four years now, and I live in the Netherlands with my wife and kids.

Andrei Drumov, Backbase (01:16):

Thanks Manu. And I'm Andrei. I'm a systems engineer working in the observability team.

Manu Chadha, Backbase (01:25):

Like we saw on the first slide, right? An AI-powered banking platform. This was in the US, at the last Backbase Engage conference, where we launched our AI product line. As everyone is shifting to AI-powered products, Backbase is similarly transforming from digital banking platforms to AI-powered banking. Okay, let me walk you through the core value proposition of what Backbase does. It's a unified banking suite, an end-to-end platform, which starts with onboarding and origination; these are the standard bank journeys with which you acquire customers. Then comes the digital banking side of things, which can be retail banking, business banking, lending, investing, all these different solutions. Then comes a human-assisted touch, a bit on the AI side, and then activation and expansion. So a customer could start with one product line, but later on they would expand into multiple product lines on the same platform, and that is what we call the unified banking suite.

(02:47):

So the whole journey starts with acquiring a customer and activating them, because once a customer signs on with Backbase, it takes time to get them completely live on production; connecting the end-to-end systems to the core banking of different banks also takes some time. Then comes the retention and expansion part of the journey. Let's talk about the Engagement Banking Platform. What we say here is that traditional banking will not survive, so we need to re-architect the entire bank around the customer, make it customer-centric, and then orchestrate multiple different things around the whole digital banking suite. It all starts with enhanced security, because we are talking about financial institutions here. It covers different workflows: transactions, payments, approvals, and things like that. Centered around the customer and covering all their needs in a seamless experience, it helps banks excel and grow at every step of the customer lifecycle, from acquiring, servicing, and retaining to cross-selling, all in one platform that allows banks to future-proof their operating model.

(04:20):

One platform, any journey: what I was explaining here is that it all starts with delighting the customer, on the left-hand side, with digital onboarding, retail banking, and lending solutions, through to, on the right side, the digital assist and engage solutions, which empower the employees of the bank to cater well to the end customer's needs. So we bring together all the customer-facing journeys and the employee-facing journeys, and we continue to expand with our north star in mind: one platform, any journey. It's an end-to-end solution that seamlessly connects with our various ecosystem layers, ensuring banks' digital transformation journey is future-proof and smooth. Backbase Managed Hosting unburdens the end customer, the banks and credit unions, from deploying the Engagement Banking Platform, which is the core proposition, on a battle-tested Backbase reference architecture in a Microsoft Azure environment. Managed hosting ensures there is 24x7 monitoring, alerting, and maintenance on the platform, with which we can deliver value to the customers.

(05:38):

So what do we get with a managed hosting installation? It's a dedicated installation for every bank. Isolation is key and of paramount importance for us, because we want to ensure complete segregation between different customers. We host close to 70 customers on the Backbase managed hosting platform, and when I say customers, these are banks, credit unions, financial institutions. We guarantee 99.9% uptime; that means all our deployments are spread across all three availability zones provided by the cloud provider, and we also have a replica always running on the DR side. The infrastructure is completely scalable: no matter how many active users there are or how the behavior changes every second, our infrastructure is designed to adjust itself and respond automatically to unexpected events that might occur. Like I was mentioning, security is very important to us, as these are banks, and we follow the principle of zero trust architecture.

(06:50):

It's embedded in our way of working: instead of treating certain sources as trusted or not trusted, we never trust anything; we verify anything and everything. Compliance is a very important part for our customers, our banks, because as part of our contractual obligations we provide them with a SOC 2 attestation every year. Database backups are always there, running in the DR region across all three availability zones, and we provide connectivity to on-prem, because we connect to the legacy systems, the core banking systems, via site-to-site VPN connectivity. Talking about the geographical deployment locations: in the primary location, all three availability zones run active-active, and in the disaster recovery region we always have the databases running, so it's only a matter of pointing your application there in case of a DR event. And to be honest, I can tell you that in order to adhere to our RTO and RPO, it is crucial and essential that all eyes are on the observability platform. You need thorough monitoring and alerting; as soon as there is an event, you need to make the choice then and there to point the application to the DR side. And now I'll hand over to Andrei to talk about the managed hosting technical aspects.

Andrei Drumov, Backbase (08:19):

Thank you, Manu. So we talked about managed hosting as a product, but what is managed hosting, in fact, from the technical point of view? From a technical perspective, managed hosting is just a bunch of infrastructure-as-code modules surrounded by some additional automation, and this recipe allows us to provision customer environments in a matter of minutes (double digits). At the core of our managed hosting solution we have Terraform; we manage everything we can with Terraform. Most of our code base is developed in Terraform, and recently we migrated to OpenTofu, for obvious reasons. Our infrastructure is hosted on Azure, and the Backbase services and applications are deployed and hosted on Kubernetes. We are using GitHub Actions to orchestrate the managed hosting Terraform module deployments across all the customers, and once the infrastructure part is deployed, we are using Argo CD to deploy the Backbase services stack. PagerDuty is our IRM system, and Grafana Cloud is at the core of our observability platform. So let's talk a bit more about observability.
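As a rough illustration of the recipe Andrei describes, a per-customer environment could be expressed as a single Terraform/OpenTofu module call that the orchestration layer stamps out for each bank. The module name, variables, and values below are hypothetical, not Backbase's actual code:

```hcl
# Hypothetical sketch: one managed-hosting customer environment as a
# reusable module instantiation. Each bank gets its own isolated copy.
module "customer_bank_a" {
  source = "./modules/managed-hosting-environment" # illustrative path

  customer_name   = "bank-a"
  azure_region    = "westeurope"  # primary, active-active across zones
  dr_azure_region = "northeurope" # DR region with databases always running

  # Spread the AKS node pools across all three availability zones.
  aks_availability_zones = ["1", "2", "3"]

  # Per-customer Grafana Cloud stack keeps observability isolated, too.
  grafana_cloud_stack_slug = "bank-a-obs"
}
```

A CI workflow (GitHub Actions, in their case) can then loop over the customer list and run `tofu plan`/`tofu apply` per module, which is what makes double-digit-minute provisioning feasible.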

Manu Chadha, Backbase (09:35):

Yeah, observability as a platform is very, very crucial to the Backbase managed hosting offering, because the uptime and the SLAs we are providing to our customers always require that we have deep visibility into application performance and infrastructure health (we are providing infrastructure as a service to our customers) and into the user experience. I would break the entire observability into two different segments: one is the technical aspect, which is the infrastructure monitoring, and the second is the business side of things. On the business side, I'll try to explain it with an example. A bank CTO wants insights about their end user base. What they would like to understand, maybe on a daily basis, is how many users are actively logging in and through which channel, whether it is web or mobile, and if it is mobile, iOS or Android.

(10:44):

Are there login failures happening because of biometrics, something wrong there? That will give them key insights to invest in the right journey, the right step of the product offering, to find out where they're lacking, and that will also help them expand their business. Let's take a very small, simple scenario. All of you will have experience with some banking platform: you go to become a customer or apply for a loan, and at some point, at different stages, across multiple questions, you lose interest or you drop off. The bank really wants to know at which stage the end user dropped off, because that was a customer they could have retained. So these are a few of the business dashboards, with efficient metrics, which we create and offer to our customers, and that helps our customers, the banks, expand their customer base. And of course we are also running 24x7 incident management, and we rely heavily on Grafana as a product.

Andrei Drumov, Backbase (11:58):

So let's talk even more about observability at Backbase. In our scenario, we are dealing with two major sources of telemetry signals. The first one is the Kubernetes workloads running on AKS, and the second one is the logs and metrics generated by the Azure services that we are using. On Kubernetes, we are collecting the telemetry signals using Grafana Alloy, and we are using the Kubernetes Monitoring Helm charts, developed and maintained by Grafana, to manage the Grafana Alloy deployments. We chose this path because of its simplicity, flexibility, and modularity. It's relatively easy to manage, it gives us the flexibility to extend the configurations in case we need some custom integrations, and it also allows us to conditionally enable or disable certain features based on customer needs, because some customers may need application availability, some other customers may need continuous profiling, and some customers may need both. With the Kubernetes Monitoring Helm charts, it's really easy to juggle these features.
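A deployment along these lines might look like the following hedged sketch, which installs Grafana's k8s-monitoring Helm chart through Terraform's helm provider. The per-customer feature toggles mirror the flexibility Andrei mentions, but the exact values keys differ between chart versions, so treat them as illustrative rather than a working configuration:

```hcl
# Illustrative only: Grafana's k8s-monitoring chart deploys and manages
# the Grafana Alloy collectors. Values keys vary by chart version; check
# the chart's values reference before use.
resource "helm_release" "k8s_monitoring" {
  name       = "grafana-k8s-monitoring"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "k8s-monitoring"
  namespace  = "monitoring"

  values = [yamlencode({
    cluster = { name = var.customer_name } # hypothetical input variable

    # Feature toggles: enable only what a given customer needs.
    metrics  = { enabled = true }
    logs     = { enabled = true }
    profiles = { enabled = var.enable_continuous_profiling } # e.g. Pyroscope
  })]
}
```

Driving the chart from Terraform keeps the "conditionally enable per customer" logic in the same code base as the rest of the environment definition.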

(13:21):

On the Azure telemetry side of things, the situation is a bit different. We are collecting the Azure services' metrics using the Azure integration of Grafana Cloud's Cloud Provider Observability. It scrapes the Azure Monitor metrics (and probably some other stuff from Azure Resource Graph), transforms them into Prometheus format, and stores them in the Prometheus data source. For log ingestion, we are using a pipeline composed of diagnostic settings configured at the level of each resource. The diagnostic settings push the logs to Event Hub, and from Event Hub these logs are picked up by the Loki forwarder function, which is also developed by Grafana and is open source. It takes care of sending the logs to the Grafana stack's Logs endpoint of the specific customer. It's probably important to mention that we are using the serverless versions of both metrics scraping and log streaming, because we want to decouple these flows from our physical infrastructure.
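The first hop of that log pipeline, a diagnostic setting streaming a resource's logs to an Event Hub, can be sketched with the azurerm provider roughly as follows. The target resource, Event Hub references, and log category are placeholders:

```hcl
# Sketch of one per-resource diagnostic setting that streams logs to an
# Event Hub, where a serverless forwarder can pick them up for Loki.
# Target resource and log category are illustrative placeholders.
resource "azurerm_monitor_diagnostic_setting" "to_event_hub" {
  name               = "ship-logs-to-eventhub"
  target_resource_id = azurerm_postgresql_flexible_server.example.id

  eventhub_name                  = azurerm_eventhub.logs.name
  eventhub_authorization_rule_id = azurerm_eventhub_namespace_authorization_rule.send.id

  enabled_log {
    category = "PostgreSQLLogs" # pick only the categories you must keep
  }
}
```

Because only selected log categories are enabled per resource, some filtering already happens at the source, before the forwarder applies its own filtering.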

(14:44):

As Manu mentioned, security is of high importance for us, as for any other organization, and most of the security concerns are mitigated by completely isolating the environments of each customer. But we still feel it's important to share some highlights of our experience addressing security aspects with Grafana Cloud. The first one is single sign-on. It's a well-known concept; probably every one of you is using single sign-on to access your Grafana stacks. Yet with our previous observability provider, it was a pain to automate the setup of single sign-on. We had to raise support tickets and ask them to implement changes that would serve as prerequisites for our single sign-on configuration automation, and it sometimes took weeks for them to implement those changes. With Grafana, we don't have these limitations at all. Another aspect is user access. In our scenario, we are dealing with different stakeholders.

(16:03):

Each Grafana Cloud stack is accessed both by Backbase employees who are supporting the applications and by people from the customer organization, and each of those stakeholders may require a different level of privileges. We are managing these privileges with a combination of role-based access control and label-based access control. When running a platform at scale, resource optimization is of even higher importance, and we're using Grafana Kubernetes Monitoring to get insights about the resource allocation of the Backbase services workloads. It helps us identify cases of severe overprovisioning, and we use these insights for further optimization. This is an ongoing process, because there is no one-size-fits-all configuration: we have customers of different volumes, and some customers may require fewer resources, some more. This is where this feature helps us. Another dimension, or another source of insights that drives the optimization, is the OpenCost integration with Azure billing, which gives us an accurate understanding of the cost of running our workloads. And sometimes, seeing currency instead of CPU or megabytes can be sobering.
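A hedged sketch of automating the role-based part of this with the Grafana Terraform provider might look as follows. The team, role, and folder names are invented for illustration, and label-based access control would be configured separately on top:

```hcl
# Illustrative only: a read-only role scoped to one customer's dashboard
# folder, assigned to a team of users from that customer organization.
resource "grafana_team" "bank_a_readers" {
  name = "bank-a-customer-readers" # hypothetical team name
}

resource "grafana_role" "customer_dashboards_viewer" {
  name    = "customer-dashboards-viewer"
  uid     = "customerdashviewer"
  version = 1

  permissions {
    action = "dashboards:read"
    scope  = "folders:uid:bank-a-business" # hypothetical folder UID
  }
}

resource "grafana_role_assignment" "bank_a" {
  role_uid = grafana_role.customer_dashboards_viewer.uid
  teams    = [grafana_team.bank_a_readers.id]
}
```

Label-based access control then narrows which telemetry (by label selector) each team can query, so even shared data sources stay segregated.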

(17:49):

Another dimension for optimization is cost control, or better said, cost right-sizing. Historically, we've been dealing with quite expensive telemetry signals, and one of the most notorious is the logs generated by Azure services. For different reasons, including compliance, we have to collect quite high volumes of Azure logs, and previously we didn't have a way to drop everything we don't need, because for our use cases we often need only a fraction of the logs that we collect. With Grafana Cloud, we are using the Loki Event Hub forwarder to optimize this flow to some extent; it allows us to filter out unnecessary logs.

(18:59):

Another approach we use is Adaptive Metrics. We like this feature a lot. The biggest thing we like about it, besides the cost optimization, is the fact that it has a Terraform provider, so for us it's pretty easy to automate applying the recommendations provided by Adaptive Metrics and also to apply some custom rules. And the last feature that we use, which maybe some of you are not aware of, is log quotas, or log ingestion policies. For us it is important to set hard caps on log ingestion for different environments, for instance for lower environments, to prevent sudden bursts of logs, for example when someone forgets to disable debug logs and leaves them running for a few days while running some tests. To mitigate these risks, we are using this functionality to set a hard cap on log ingestion per environment. And it's actually quite flexible, much more flexible than our previous observability provider, since it allows us to set up ingestion policies on a per-label basis.
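As an illustration of that automation, an Adaptive Metrics aggregation rule managed through Terraform might look roughly like this. The resource and attribute names below are written from memory of the grafana-adaptive-metrics provider and should be verified against its current documentation before use:

```hcl
# Illustrative only: aggregate away a high-cardinality metric's unused
# labels so only the aggregated series is stored. Names are hypothetical
# and must be checked against the grafana-adaptive-metrics provider docs.
resource "grafana-adaptive-metrics_rule" "kube_pod_info" {
  metric       = "kube_pod_info"   # hypothetical target metric
  drop_labels  = ["uid", "pod_ip"] # labels never used in queries/alerts
  aggregations = ["count"]         # keep only an aggregated view
}
```

Because rules live in Terraform, the recommendations Adaptive Metrics generates can be reviewed and rolled out through the same pull-request workflow as the rest of the infrastructure.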

(20:24):

As the managed hosting team, it is important for us to be able to quickly understand and see the current state of the platform across all of our customers. As previously mentioned, we are running quite a lot of individual stacks, one for each customer, and it's simply not feasible to log into each one of them to retrieve information. For this, we are actively implementing the concept of a centralized Grafana stack, with the really simple idea of reading the metrics from the spoke Grafana stacks and using them to build single-pane-of-glass dashboards. Even today, in case you weren't aware, Azure had quite a nasty outage on the Front Door service, and we actually used our centralized Grafana stack to validate the status of our customers.
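One plausible way to wire up such a hub-and-spoke setup, assuming Terraform as elsewhere in the stack, is to register each spoke stack's Prometheus query endpoint as a data source on the central stack. Everything below (variable shapes, endpoint URLs, credential handling) is illustrative:

```hcl
# Sketch: one Prometheus data source on the central "hub" stack per
# customer "spoke" stack, created from a map of spoke definitions.
resource "grafana_data_source" "spoke_prometheus" {
  for_each = var.spoke_stacks # hypothetical, e.g. { "bank-a" = {...} }

  name = "spoke-${each.key}"
  type = "prometheus"
  url  = each.value.url # the spoke stack's Prometheus query endpoint

  basic_auth_enabled  = true
  basic_auth_username = each.value.user
  secure_json_data_encoded = jsonencode({
    basicAuthPassword = var.spoke_api_tokens[each.key] # read-only token
  })
}
```

Central dashboards can then query every spoke with the same panel, templated over the data source, which matches the "no re-ingestion, no extra cost" point Andrei makes next.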

Manu Chadha, Backbase (21:33):

One of the major reasons I missed most of the sessions today is that I was explaining to each and every customer, every bank, about an outage that came from the cloud provider. It is very difficult to explain, because for Azure it also takes time before they do a first round of analysis and update the status page. So this really helps, because we have a centralized view to check the latency for every customer. And our customers are deployed in different regions: we host in the US, Europe, the Middle East, Asia-Pacific, Australia, everywhere.

Andrei Drumov, Backbase (22:07):

Yeah, it's not a silver bullet, and it's not super flexible, but it already delivers quite a lot of value for us. And the nice part about it: when working with metrics, it comes at no additional cost. We don't need to re-ingest any telemetry; we are just reusing what we already have. To conclude, our experience with Grafana Cloud, especially in the context of migrating from another vendor, proved to us that it fits really well into our "automate everything" scenarios. It provides ways to control the costs and avoid surprises. And of course, none of this would have been possible without the exceptional technical support that we receive on a daily basis from the Grafana team. So thank you, from the observability team to the Grafana support team.

Manu Chadha, Backbase (23:03):

Yeah, overall, I'll try to sum up the partnership with Grafana. Andrei has been quite modest in explaining the whole stack, but we are talking about 150 organizations in Grafana with around 1,500 users, so at any time anything can really explode and shoot up the bill. But to be honest, the partnership with Grafana has been nothing short of exceptional. We started our engagement in 2024, where we ran a POV for seven months, and a few of the things that were necessary for us to move to Grafana were actually delivered ahead of time. As I call it, the pre-sales discussion with any vendor is like a honeymoon period, and then comes the post-sales experience. But I can really call out that the post-sales experience with Grafana has been great. They have engaged with us at every step and made the entire journey quite seamless, all done in close collaboration. And to be honest, it's a very crucial product for us, because running the entire incident management and keeping eyes on 70 banks is not easy. We really cannot deliver that commitment of SLAs and uptime without proper observability, and we need to architect that around a good product. Grafana has delivered that. So, a good partnership in our opinion. Thanks. Thanks, everyone.