When you’re a sports betting technology company and you realize your in-house, on-prem Graphite solution for monitoring metrics is no longer a sure thing, what do you do? That was the dilemma at Kambi, a fast-growing business with a passion for open source technology, about 500 different microservices in production, and around 200,000 incoming metrics messages per second.
During a presentation at ObservabilityCON 2021, Kambi site reliability engineer Frank Stengård recounted the story of how his company outgrew Graphite, created a small open source Graphite firewall tool called Hadrianus (named after the famous emperor) to handle some vexing issues, and successfully migrated “a horrible amount of data” to Grafana Cloud.
The past and the problem
Stengård began by outlining the “pretty standard” setup for Graphite that Kambi had used historically. It was based on Python, with around 500 services feeding into an HAProxy that divided the load between six instances of carbon-relay (a number picked for CPU scalability purposes). The carbon-relay nodes then forwarded the metrics to the carbon-cache nodes, which stored the actual data as Whisper files.
After a while, issues began popping up. Not only was Kambi dropping metrics, but their disk space, CPU, and even a little bit of RAM were running out. “I basically could not log on to the carbon-cache nodes sometimes because they were just not responsive enough,” he said.
Even after replacing carbon-cache with go-carbon, most of the problems remained and were joined by a few new ones, including running out of disk I/O. The time and effort it took to keep everything up and running basically amounted to a half-time position.
Stengård’s team discovered that in Graphite, many metrics were being sent at more frequent intervals than they were actually stored at, and the values of the metrics were zeros or mostly zeros.
Finding a solution
Scaling out Graphite wasn’t an option as Kambi was running it in an on-prem data center and didn’t have the resources. “Also, expanding a classic Graphite cluster without having a long downtime — especially if your cluster is really over-burdened — is hard,” Stengård said. “We could not figure out the easy way to do it without affecting our [50+] development teams.”
But they couldn’t give up, either, since the issue affected how Kambi could view its systems and how its environments were doing. As Stengård put it, “The house was burning now. We needed to fix it.”
Kambi created its own software solution: Hadrianus, an open source “application-aware firewall load-balancer.” One of its key features was the ability to mirror traffic to multiple Graphite clusters; it also had an allow list that let critical metrics through even if they violated Kambi’s filtering logic.
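To make the mirroring and allow-list idea concrete, here is a minimal Python sketch. It is not Hadrianus code: the names `is_allowed` and `mirror`, the glob-style allow-list patterns, and the `passes_filter` callback are all hypothetical stand-ins chosen for illustration, and the real tool may express these rules quite differently.

```python
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List


def is_allowed(path: str, allow_patterns: Iterable[str]) -> bool:
    """Return True if the metric path matches any allow-list pattern.

    Glob-style patterns are an assumption made for this sketch.
    """
    return any(fnmatchcase(path, pattern) for pattern in allow_patterns)


def mirror(
    line: str,
    allow_patterns: Iterable[str],
    passes_filter: Callable[[str], bool],
    senders: List[Callable[[str], None]],
) -> int:
    """Forward one metric line to every cluster, or drop it.

    A line is forwarded if its path is on the allow list, or if the
    (hypothetical) filter callback accepts it. Returns the number of
    clusters the line was sent to.
    """
    # The metric path is the first whitespace-separated field of the line.
    fields = line.split()
    if not fields:
        return 0
    path = fields[0]

    if not (is_allowed(path, allow_patterns) or passes_filter(line)):
        return 0

    # Mirroring: the same line goes to every configured destination.
    for send in senders:
        send(line)
    return len(senders)
```

In this sketch each sender is just a callable, so the same function could feed an on-prem cluster and a second destination at once; an allow-listed path bypasses the filter entirely, which mirrors the behavior described above.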
Hadrianus’ most important functionality, however, was that it could drop incoming metrics that didn’t conform to the Graphite line protocol and handle metrics that arrived too fast or were meaningless in some way.
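The filtering described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Hadrianus implementation: the Graphite plaintext protocol does use lines of the form `<metric path> <value> <timestamp>`, but the `MetricFilter` class, its `min_interval` and `max_zero_repeats` parameters, and the choice to treat long runs of zeros as "meaningless" are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, float, float]]:
    """Parse '<path> <value> <timestamp>'; return None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    path, raw_value, raw_ts = parts
    try:
        value = float(raw_value)
        ts = float(raw_ts)
    except ValueError:
        return None
    # Assumption for this sketch: NaN and infinities count as malformed.
    if not (math.isfinite(value) and math.isfinite(ts)):
        return None
    return path, value, ts


class MetricFilter:
    """Drop malformed, too-frequent, or repeatedly-zero metric lines."""

    def __init__(self, min_interval: float, max_zero_repeats: int) -> None:
        self.min_interval = min_interval          # seconds between accepted points
        self.max_zero_repeats = max_zero_repeats  # consecutive zeros tolerated
        self._last_ts: Dict[str, float] = {}
        self._zero_runs: Dict[str, int] = {}

    def accept(self, line: str) -> bool:
        parsed = parse_line(line)
        if parsed is None:
            return False  # does not conform to the line protocol
        path, value, ts = parsed

        # Too fast: the point arrives before the storage interval has elapsed.
        last = self._last_ts.get(path)
        if last is not None and ts - last < self.min_interval:
            return False

        # "Meaningless" is modeled here as a long run of zeros.
        if value == 0.0:
            run = self._zero_runs.get(path, 0) + 1
            self._zero_runs[path] = run
            if run > self.max_zero_repeats:
                return False
        else:
            self._zero_runs[path] = 0

        self._last_ts[path] = ts
        return True
```

With `min_interval=60`, for example, a service that reports every 10 seconds into a series stored at one-minute resolution would have most of its points dropped before they ever reach the storage nodes, which is the kind of load reduction the talk describes.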
With the new software in place, Stengård said Kambi achieved an 80% reduction in load. Disk I/O was still high but it was more manageable. Memory utilization was still high, but, importantly, he noted that “things didn’t just randomly die because they ran out of memory.”
Still, with Kambi continually growing and generating more metrics all the time, the existing on-prem setup, while a nice band-aid, wasn’t going to be enough in the future.
Since the company couldn’t scale Graphite, Stengård and his team decided to modify Hadrianus so that it could also mirror the data to a third-party provider like Grafana Cloud. The company also wanted to start using Prometheus instead of Graphite, he said, because it is more widely supported and works slightly better in Kubernetes.
Kambi tested Grafana Cloud’s metrics capabilities by sending essentially its whole production data load straight to the service to make sure it could handle it, and it did!
Success with Grafana Cloud
Stengård explained that Grafana Cloud ended up being a good fit for Kambi because it supports Graphite, and it has a good migration path to Prometheus, which can be done gradually. The company already uses Grafana, “so putting it in the cloud would be almost an identical user experience for our dev teams,” he noted. He added that in the future, Kambi’s site reliability engineering team is hoping to replace their existing Elasticsearch solution in AWS with Grafana Loki.