How Grafana helps PingCAP troubleshoot TiDB deployments

Founded in 2015 by three infrastructure engineers, PingCAP is the company behind TiDB, a hybrid transactional and analytical processing (HTAP) database designed for horizontal scalability, strong consistency, and high availability.

Growing pains

With more than 13,000 stars and almost 200 contributors on GitHub, TiDB is one of the most popular open source database projects in the world – and it’s continuing to grow.

In fact, about two years ago, PingCAP found that its system had grown so much that it needed a strong monitoring solution. “We had hundreds of metrics to monitor our system and we needed to make sense of them all in order to effectively diagnose issues in the cluster,” says Queeny Jin, Head of Internationalization at PingCAP.

The team experimented with StatsD, but “gave up quickly” and switched to Prometheus to collect its monitoring and performance data. “Prometheus is powerful and simple to use, and it’s written in Go, so we can contribute to it easily,” says Queeny. As Prometheus’ preferred visualization layer, Grafana was the easy choice for displaying the system’s metrics. “It’s natural and simple to use because this Prometheus-Grafana combination is a popular monitoring stack,” says Queeny. “We use Prometheus to pull our metrics from the TiDB database cluster and then display them in Grafana.”
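The pull-then-display flow Queeny describes can be illustrated with a minimal sketch against Prometheus' HTTP API, which is the same interface Grafana queries. The base URL and the `job="tidb"` label are assumptions made for illustration; `up` is the health series Prometheus records for every target it scrapes. The sample response is hand-written in the shape the API returns, not real PingCAP data.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Return the instant-query URL for Prometheus' HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_instances(response_text):
    """List instances whose `up` sample is 0 in an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return [
        series["metric"].get("instance", "<unknown>")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0
    ]


# Hand-written sample in the shape of a vector result (addresses are made up).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "tidb", "instance": "10.0.0.1:10080"},
             "value": [1435781451.781, "1"]},
            {"metric": {"__name__": "up", "job": "tidb", "instance": "10.0.0.2:10080"},
             "value": [1435781451.781, "0"]},
        ],
    },
})

if __name__ == "__main__":
    # Assumed local Prometheus address and job label.
    print(build_query_url("http://localhost:9090", 'up{job="tidb"}'))
    print(down_instances(SAMPLE_RESPONSE))
```

In a real setup the URL would be fetched over HTTP and the response passed to `down_instances`. Grafana does the equivalent work behind each dashboard panel.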

Getting to the heart of the problem quickly

With Grafana in place, Queeny says, troubleshooting during deployment has become much simpler. “It is helping us every day to diagnose and troubleshoot issues of TiDB clusters for our users,” she says. “Basically, we collect all the important metrics into an overview dashboard, so we can easily pin down any issues that arise. In most cases, the users rely on Grafana’s intuitive and informative dashboard for insights instead of looking at detailed logs.”

And that’s a huge advantage, because a typical TiDB cluster exposes more than 1,100 metrics across its components, including the TiDB server, the TiKV server (TiKV is the open source distributed transactional key-value store that powers TiDB), and the Placement Driver server (the metadata layer that manages and schedules TiKV nodes). A single cluster of 89 servers running 91 services in PingCAP’s testing environment generates 179 GB of metric logs over 15 days. “It would be impossible for us to go to each server and look through all these logs,” says Queeny. “And that’s where Grafana is critical to us and works like a charm, because it makes sense of all these logs.”
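To put the testing-environment numbers in perspective, here is a back-of-the-envelope calculation. It assumes the 179 figure is in gigabytes, that 1 GB is 1,000 MB, and that the volume is spread evenly across days and servers. None of these assumptions is stated in the original figures.

```python
TOTAL_LOG_GB = 179   # metric logs generated by the test cluster
DAYS = 15            # observation window
SERVERS = 89         # servers in the cluster


def daily_volume_gb(total_gb, days):
    """Average metric-log volume per day, in GB."""
    return total_gb / days


def per_server_daily_mb(total_gb, days, servers):
    """Average metric-log volume per server per day, in MB (1 GB = 1000 MB)."""
    return total_gb * 1000 / days / servers


if __name__ == "__main__":
    print(f"{daily_volume_gb(TOTAL_LOG_GB, DAYS):.1f} GB per day")
    print(f"{per_server_daily_mb(TOTAL_LOG_GB, DAYS, SERVERS):.0f} MB per server per day")
```

Under those assumptions the cluster produces roughly 11.9 GB of metric logs per day, or about 134 MB per server per day, which is far more than anyone could read through by hand.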

In addition to the easy-to-use dashboards, PingCAP values Grafana’s versatile API. “It allows us to build our own tools,” says Queeny. Grafana’s plug-in architecture makes it possible for anyone to build custom panels and data sources to extend Grafana, or even improve upon pre-existing dashboards. In fact, Queeny says, “inspired by Grafana reporter, we built our own tool to generate a PDF report from Grafana.”
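PingCAP's reporting tool is not shown here, so the following is only a sketch of the kind of first step such a tool might take: fetching a dashboard definition through Grafana's HTTP API and listing the panels a report would need to cover. The base URL, dashboard UID, and token are placeholders, and the sample response is hand-written in the general shape of a dashboard payload.

```python
import json
from urllib.request import Request


def dashboard_request(base_url, uid, token):
    """Build (but do not send) a GET request for /api/dashboards/uid/<uid>."""
    return Request(
        f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def panel_titles(response_text):
    """Return the titles of all non-row panels, including those nested in collapsed rows."""
    dashboard = json.loads(response_text)["dashboard"]
    titles = []
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            # Collapsed rows keep their child panels in a nested "panels" list.
            titles.extend(child.get("title", "") for child in panel.get("panels", []))
        else:
            titles.append(panel.get("title", ""))
    return titles


# Hand-written sample payload; the titles are illustrative only.
SAMPLE_DASHBOARD = json.dumps({
    "meta": {"slug": "overview"},
    "dashboard": {
        "title": "Overview",
        "panels": [
            {"type": "timeseries", "title": "QPS"},
            {"type": "row", "title": "Errors", "collapsed": True,
             "panels": [{"type": "timeseries", "title": "Error count"}]},
        ],
    },
})

if __name__ == "__main__":
    request = dashboard_request("http://localhost:3000", "example-uid", "PLACEHOLDER_TOKEN")
    print(request.get_method(), request.full_url)
    print(panel_titles(SAMPLE_DASHBOARD))
```

The request would be sent with `urllib.request.urlopen(request)` against a running Grafana instance. A report generator would then render each listed panel and assemble the results into a document.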

The team’s favorite feature is Row, which it uses to group metrics. “In TiKV, we built an Error Row and summarized all the important error metrics into this Row so we can identify the issues very quickly,” says Queeny.
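As a rough illustration of grouping error metrics under a Row, the sketch below generates a dashboard fragment in which a row panel is followed by one panel per query. It assumes the panel-based row layout used by recent Grafana versions (a panel of type `row` on a 24-column grid), which may differ from the Grafana version PingCAP ran. The metric names are hypothetical and are not taken from TiKV.

```python
import json


def build_error_row(title, queries, start_y=0):
    """Return a row panel followed by one time series panel per (title, expr) pair."""
    panels = [{
        "id": 1,
        "type": "row",
        "title": title,
        "collapsed": False,
        "gridPos": {"h": 1, "w": 24, "x": 0, "y": start_y},
        "panels": [],
    }]
    for index, (panel_title, expr) in enumerate(queries):
        panels.append({
            "id": index + 2,
            "type": "timeseries",
            "title": panel_title,
            # Two half-width panels per line, stacked below the row header.
            "gridPos": {"h": 8, "w": 12,
                        "x": 12 * (index % 2),
                        "y": start_y + 1 + 8 * (index // 2)},
            "targets": [{"refId": "A", "expr": expr}],
        })
    return panels


if __name__ == "__main__":
    # Hypothetical metric names, used only to show the structure.
    fragment = build_error_row("Errors", [
        ("Request errors", "rate(example_request_errors_total[1m])"),
        ("Storage errors", "rate(example_storage_errors_total[1m])"),
    ])
    print(json.dumps(fragment, indent=2))
```

Keeping every error panel directly under one row means a single glance at that section of the dashboard shows whether anything is failing.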

For those reasons, Queeny says, “Grafana is our first and only choice as the GUI display of our system’s metrics. It works great for our users.”