
How CERN uses Grafana and Mimir to monitor the world's largest computer grid

2025-01-02 · 5 min read

The European Organization for Nuclear Research (CERN) is famous for operating the world’s largest particle accelerator, but did you know that CERN is also at the heart of the world’s largest computing grid? And with such unprecedented computing demands comes some serious observability needs.

CERN operates two data centers at its facility near Geneva, Switzerland, which together house 11,000 servers, 470,000 cores, and 380 PB of archived data. But CERN also wants to make that data available for scientists around the world to analyze, which is why it coordinates the Worldwide Large Hadron Collider Computing Grid (WLCG). The WLCG extends to 170 institutes and universities across 42 countries, bringing the combined infrastructure to 1.4 million computing cores and 1.5 exabytes of available storage, running 2 million tasks per day at a global transfer rate of 260 GB/s.

In other words, it’s a lot of data.

“Of course, such infrastructure requires monitoring, and this is where the monitoring service itself comes into place with our main mandate to monitor the data center, WLCG, but also our services that are running in our IT department,” said Nikolay Tsvetkov, Staff Computing Engineer at CERN.

CERN uses Grafana and Grafana Mimir as part of a scalable, fault-tolerant stack that makes sure the 12,000 physicists who rely on the WLCG have access to the data they need.

“Grafana is crucial for the monitoring service at CERN, providing us with many, many useful dashboards, but also with alerting functionality,” Tsvetkov said, noting that the recent addition of Grafana Mimir “provides the missing bit” in their Prometheus architecture.

Tsvetkov works on the IT Monitoring and Data Streaming services at CERN and also serves as the service manager for their monitoring visualization tools. He shared his experience at GrafanaCON 2024, highlighting how Grafana and Grafana Mimir help CERN and the WLCG operate at a truly massive scale.

Want to tell your story at GrafanaCON, our largest community conference of the year? We’re looking for speakers to share their real-world experiences with Grafana, custom-built plugins, cool dashboards, and more. This is a community-driven conference focused on your favorite visualization tool, its big tent of data source plugins, and the surrounding open source ecosystem — Prometheus, Loki, OpenTelemetry, Mimir, and more.

How CERN uses Grafana to visualize its data

Tsvetkov’s team monitors roughly 15,000 hosts and receives around 85,000 documents every second, which translates to 3.3 TB of data stored every day.

“Having such a wide range of services to monitor, of course provides some challenges for us,” Tsvetkov said. “The first one is that we get data from heterogeneous data sources. We need to actually unify the way we get this data so we can cope with this.”

In 2016, CERN adopted Grafana as its primary monitoring interface to address that issue. Today, with more than 5,000 Grafana users across 70 organizations creating more than 2,700 dashboards and 680 alert rules, it’s safe to say, “our users, they love it,” said Tsvetkov. “Of course, they’re creating a lot of dashboards, but we found that Grafana is quite useful as a unified data source.”

That’s because Grafana provides a proxy, so some of CERN’s systems access data from Grafana data sources through the Grafana APIs. “It actually hides the complexity of the different databases that are behind the data sources,” Tsvetkov said.
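To make the idea concrete, here is a minimal sketch of what reading through Grafana’s data source proxy can look like. The endpoint shape (`/api/datasources/proxy/uid/...`) comes from Grafana’s HTTP API; the Grafana URL, data source UID, and token are hypothetical, and the exact setup at CERN may differ.

```python
# Sketch: query a Prometheus-style data source *through* Grafana's proxy,
# so the client never needs credentials for the backing database itself.
import json
import urllib.parse
import urllib.request

GRAFANA_URL = "https://grafana.example.cern/api"  # hypothetical endpoint


def proxy_url(base: str, datasource_uid: str, path: str) -> str:
    """Build a data source proxy URL: Grafana forwards the request to the
    underlying database and injects its own stored credentials."""
    return f"{base}/datasources/proxy/uid/{datasource_uid}/{path.lstrip('/')}"


def query_prometheus(token: str, uid: str, promql: str) -> dict:
    # An instant PromQL query, proxied through the Grafana API.
    url = (proxy_url(GRAFANA_URL, uid, "api/v1/query")
           + "?" + urllib.parse.urlencode({"query": promql}))
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

A caller only ever sees Grafana’s URL and its own token, which is exactly the “hides the complexity of the different databases” property Tsvetkov describes.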

A Grafana dashboard for ETF tests

Their Grafana deployment is split between a public instance that’s available to anyone (hosted on two VMs behind load balancers in different availability zones), and private instances that sit behind CERN’s SSO (multi-organization, with six servers behind a DNS load balancer and VMs in three availability zones).

Managing all those dashboards can present some operational challenges, so they rely on the organizations feature in Grafana. Public organizations need to be available to all Grafana users, so Tsvetkov’s team runs some scripts to synchronize them regularly. The private organizations are mainly used by service managers, so those managers are given the flexibility to decide who should have access.
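The kind of synchronization script the team describes can be sketched as a diff against Grafana’s organization users API (`GET`/`POST`/`DELETE` on `/api/orgs/{orgId}/users`). Everything else here — the base URL, token, and user lists — is hypothetical, not CERN’s actual tooling.

```python
# Sketch: keep a public Grafana org's membership in sync with a desired list.
import json
import urllib.request


def membership_diff(desired_logins, current_logins):
    """Return (to_add, to_remove) so that current matches desired."""
    desired, current = set(desired_logins), set(current_logins)
    return sorted(desired - current), sorted(current - desired)


def sync_org(base_url, token, org_id, desired_logins):
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    # Fetch the organization's current members.
    req = urllib.request.Request(f"{base_url}/api/orgs/{org_id}/users",
                                 headers=headers)
    with urllib.request.urlopen(req) as resp:
        users = json.load(resp)
    to_add, to_remove = membership_diff(desired_logins,
                                        [u["login"] for u in users])
    for login in to_add:  # add missing users with a default Viewer role
        body = json.dumps({"loginOrEmail": login, "role": "Viewer"}).encode()
        urllib.request.urlopen(urllib.request.Request(
            f"{base_url}/api/orgs/{org_id}/users", data=body, headers=headers))
    by_login = {u["login"]: u["userId"] for u in users}
    for login in to_remove:  # drop users no longer in the desired list
        urllib.request.urlopen(urllib.request.Request(
            f"{base_url}/api/orgs/{org_id}/users/{by_login[login]}",
            headers=headers, method="DELETE"))
```

Running something like this on a schedule is one plausible way to keep public organizations “available to all Grafana users” without manual membership management.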

Grafana dashboard for monitoring CERN's OpenStack environment

Even managing plugins became easier: while physicists often created custom plugins for Grafana in the past, in recent years this hasn’t been necessary. “Grafana has become a more and more complete product, and we are getting anything that we need almost out of the box,” Tsvetkov said.

Grafana Mimir: the missing piece

From the beginning, the team was aware of a lack of support in their stack for users who wanted to integrate long-term storage for Prometheus metrics. They began working on a solution in 2020 using InfluxDB as a backend, but it was clear that it wouldn’t scale well in the future, Tsvetkov said. They ultimately turned their attention to Mimir, our horizontally scalable, highly available, multi-tenant TSDB for long-term storage, and began testing how it would suit their use case.

With 80 million active series, they found Mimir could scale to meet their needs.

“Also, we liked the fact that you can scale the different components independently, either on the ingestion side or the query side,” Tsvetkov said. “Mimir also provides a lot of flexibility, with multi-tenancy and getting data from different users, so this would also allow us to integrate all the data into a single storage.”
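The multi-tenancy Tsvetkov mentions is typically wired up on the write path: each Prometheus instance pushes to the same Mimir cluster under its own tenant ID via the `X-Scope-OrgID` header. A minimal `remote_write` sketch (the Mimir URL and tenant name below are hypothetical):

```yaml
# Prometheus remote_write to a shared, multi-tenant Mimir cluster.
remote_write:
  - url: http://mimir.example.cern/api/v1/push
    headers:
      X-Scope-OrgID: accelerator-team   # this team's tenant ID (hypothetical)
```

Because every team writes under a distinct tenant, all the data lands in a single store while remaining isolated per tenant for limits and queries.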

They began a pilot service in 2023 running on their Kubernetes deployment, which had 49 nodes, including 46 worker nodes that provide more than 1.2 TB of memory and 730 CPU cores, as well as different node groups to address the different hardware requirements. They also used Amazon S3 as their object storage, with 40-day default retention.
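A Mimir configuration matching that description — blocks in S3, 40-day default retention — could look roughly like the fragment below. The bucket name is hypothetical; the keys follow Mimir’s configuration reference, but CERN’s actual config is not shown in the talk.

```yaml
# Sketch of a Mimir config: S3-backed block storage, 40-day retention.
blocks_storage:
  backend: s3
  s3:
    endpoint: s3.amazonaws.com
    bucket_name: cern-mimir-blocks   # hypothetical bucket
limits:
  compactor_blocks_retention_period: 40d   # default retention per tenant
```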

Kubernetes monitoring with a Grafana dashboard

In Mimir, they were running 29 tenants, producing 30 million active series. By their calculations, the deployment could have accommodated 150 million active series and processed around 2,000 queries per second.

As a result, “now Mimir is well integrated within the pipeline,” Tsvetkov said.

Grafana dashboard for CERN's distributors

Going forward, more open source tools are under consideration at CERN, including OpenTelemetry and Grafana Tempo for tracing.

But no matter how big the computing grid becomes, the foundation for their infrastructure is firmly rooted in Grafana.

“It has been with us since the beginning of the architecture,” Tsvetkov said, “and we have always been very happy using Grafana.”

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!