There are transparent companies – and then there’s GitLab.
“GitLab is a ridiculously transparent company,” said Ben Kochie, a Staff Backend Engineer for Monitoring at GitLab. “When GitLab has a database outage, we live stream the recovery on YouTube.”
GitLab worked with Grafana well before Kochie joined the company. “We used it internally – but part of our transparency is we want to be able to show our internal performance metrics to all of our users and customers.”
Starting with Ruby
In the beginning, GitLab set up a public version of its dashboard with a simple synchronization Ruby script that used the internal Grafana API. “We dumped all the dashboards and then loaded them to the public facing one,” said Kochie.
By implementing “really ugly, simple Ruby,” the team created two independent Grafana servers – one with internal authentication and one with no authentication. At the start, “that worked pretty well,” said Kochie. “It gave us all the ability to synchronize dashboards.”
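GitLab's original Ruby script is not shown in the talk, so the following is only a minimal Python sketch of the same dump-and-load idea. It assumes Grafana's HTTP API endpoints `/api/search`, `/api/dashboards/uid/<uid>`, and `/api/dashboards/db`; the function names, URLs, and tokens are hypothetical, and folder handling is left out.

```python
import json
import urllib.request


def to_import_payload(exported):
    """Turn a GET /api/dashboards/uid/<uid> response into a POST /api/dashboards/db body."""
    dashboard = dict(exported["dashboard"])  # shallow copy, leave the input untouched
    # Numeric ids are local to one Grafana server; the uid is what identifies
    # the dashboard on both sides, so the id is cleared before import.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": True}


def _request(url, token, payload=None):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def sync_dashboards(internal_url, internal_token, public_url, public_token):
    """Copy every dashboard from the internal server to the public one (hypothetical helper)."""
    for item in _request(f"{internal_url}/api/search?type=dash-db", internal_token):
        exported = _request(
            f"{internal_url}/api/dashboards/uid/{item['uid']}", internal_token
        )
        _request(
            f"{public_url}/api/dashboards/db", public_token, to_import_payload(exported)
        )
```

Even if the public server allows anonymous viewing, writing dashboards to it would still need a credential, which is why the sketch takes a token for both sides.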
To isolate its public dashboards, GitLab ran an internal Prometheus and an external Prometheus, and the external one used the federation endpoint to pull all of the data across.
“If anybody’s ever done that, you know it doesn’t work very well. It doesn’t scale and it kind of sucked,” admitted Kochie. “We ended up saying that’s too slow.”
The team’s solution was to set up a third Prometheus replica, alongside the existing HA pair, that scraped the same targets.
“That worked pretty well until that Prometheus got too big,” said Kochie. “We did the typical thing that you do to scale Prometheus: We started some horizontal sharding. So now we have our internal main Prometheus and an internal application Prometheus. Now we’re getting more complicated, and this is getting really annoying.”
The next thing Kochie’s team introduced was the Thanos query layer, which provides a simple, single pane of glass to query all of the company’s Prometheus servers.
“We also started limiting the queries because we were now using direct access to our internal Prometheus servers,” said Kochie. “We had to set significantly shorter query timeouts to avoid abuse.” The team set a 30-second limit in Grafana and a 5-second limit in Thanos.
“That was still not super great because people could still DDoS our Prometheus servers with super long and complicated queries, and it would just use up all the memory and OOM our internal Prometheus,” said Kochie.
But with the release of Prometheus 2.5 came the ability to limit the number of samples per query. While the default is roughly 50 million samples per query, GitLab cut that down to 10 million and significantly reduced the damage a bad query can do to its Prometheus server.
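To get a feel for why a sample limit helps, here is a deliberately crude back-of-the-envelope sketch in Python. Prometheus exposes the limit as the `--query.max-samples` flag; the estimate below (series count times points per series) is an assumption for illustration and is not how Prometheus does its internal accounting. The series counts, scrape interval, and the 10 million comparison limit are hypothetical inputs.

```python
def estimate_samples(series, range_seconds, scrape_interval_seconds):
    """Rough upper bound: every series contributes one sample per scrape."""
    return series * (range_seconds // scrape_interval_seconds)


def would_be_rejected(estimated_samples, max_samples):
    """True if the estimate exceeds the configured per-query limit."""
    return estimated_samples > max_samples


ONE_DAY = 24 * 60 * 60

# Hypothetical query: 2,000 series over one day at a 15-second scrape interval.
estimate = estimate_samples(series=2_000, range_seconds=ONE_DAY, scrape_interval_seconds=15)

print(estimate)
print(would_be_rejected(estimate, max_samples=50_000_000))  # default limit
print(would_be_rejected(estimate, max_samples=10_000_000))  # a lower, hypothetical limit
```

Under these assumptions the same query passes the default limit but is cut off by the lower one, before it can consume much memory.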
Add a Little Trickster
GitLab’s monitoring stack also includes Trickster, a Prometheus-specific reverse proxy caching server that time aligns queries to the nearest minute, five minutes, or 30 minutes in order to make them more cacheable.
So if you query for the last hour’s worth of data, and five minutes later you query for the last hour again, Trickster knows it already has 55 minutes of cached data. It sends a query for only the last five minutes, then adds that data to the cache.
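The idea can be sketched in a few lines of Python. This is not Trickster's actual code; it is a simplified illustration that assumes a single cached range per query and only handles the common case of a time window sliding forward. The function names are hypothetical, and timestamps are in seconds.

```python
def align(ts, step):
    """Round a timestamp down to the nearest multiple of step."""
    return ts - ts % step


def missing_range(start, end, step, cached):
    """Return the (start, end) range that still has to be fetched, or None.

    cached is the (start, end) range already held in the cache, or None.
    """
    start, end = align(start, step), align(end, step)
    if cached is None:
        return (start, end)
    cached_start, cached_end = cached
    if cached_start <= start and end <= cached_end:
        return None  # everything requested is already cached
    if cached_start <= start <= cached_end:
        # The window slid forward: fetch only from the end of the cache.
        # The boundary point is fetched again, which keeps the sketch simple.
        return (cached_end, end)
    return (start, end)  # no useful overlap, fetch the whole range


FIVE_MINUTES = 300

# First query: the last hour, nothing cached yet.
print(missing_range(17, 3617, FIVE_MINUTES, None))
# Five minutes later: the last hour again, with the first result cached.
print(missing_range(317, 3917, FIVE_MINUTES, (0, 3600)))
```

Because both queries are aligned to five-minute boundaries before the cache is consulted, two users asking at slightly different moments end up sharing the same cached range.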
“It really improves the performance of the dashboards over time,” said Kochie. At first, “the code was not so stable. But because it’s open source – thanks, Comcast! – they were super friendly, and we contributed a bunch of upstream bug fixes to make it more stable and faster.”
Cardinality: A Cautionary Tale
Kochie warns not to forget about cardinality because “there are things that can go wrong.”
Within GitLab, there is a feature flag that allows users to get metrics on a per-project basis – every user, every group, every project. “It’s a useful label – but not on a public server where we’re hosting people’s private data,” said Kochie. “You could leak that data into your metrics and end up getting private information in labels.”
While the feature is off by default, “we had an over-eager developer turn that feature flag on, and it was a huge cardinality explosion,” admitted Kochie. “We have millions of repos on GitLab.com and we had to turn that off. It blew up our Prometheus server and was potentially leaking private information through our public metrics.”
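A quick arithmetic sketch in Python shows why one label can do this much damage. The worst-case number of time series for a metric is the product of the number of distinct values of each label. The label names and counts below are hypothetical, chosen only to illustrate the scale; real series counts are lower because not every combination occurs.

```python
from math import prod


def max_series(label_value_counts):
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical labels on a single request metric.
before = {"method": 5, "status": 6, "instance": 20}
# The same metric once a per-project label is switched on.
after = dict(before, project=1_000_000)

print(max_series(before))
print(max_series(after))
```

With these made-up numbers, a metric that tops out at a few hundred series grows to a worst case in the hundreds of millions as soon as the per-project label appears.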
Thankfully the error was caught within an hour of that feature getting turned on, and the team leveraged the delete endpoint to get rid of all the data. “Nobody in the public knew that this was available, and I was able to delete all the data before it got queried,” said Kochie.
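For readers unfamiliar with the delete endpoint, the Python sketch below builds the kind of request involved. It assumes Prometheus's TSDB admin API (`/api/v1/admin/tsdb/delete_series` and `/api/v1/admin/tsdb/clean_tombstones`), which is only available when the server runs with `--web.enable-admin-api`. The host name and the `project` label are hypothetical; the talk does not say which selector GitLab actually used.

```python
from urllib.parse import urlencode


def delete_series_url(base_url, matcher):
    """URL to POST to in order to mark every series matching the selector as deleted."""
    query = urlencode({"match[]": matcher})
    return f"{base_url.rstrip('/')}/api/v1/admin/tsdb/delete_series?{query}"


def clean_tombstones_url(base_url):
    """URL to POST to afterwards so the deleted data is removed from disk."""
    return f"{base_url.rstrip('/')}/api/v1/admin/tsdb/clean_tombstones"


# Hypothetical server and selector: every series that carries a project label.
base = "http://prometheus.internal:9090/"
print(delete_series_url(base, '{project!=""}'))
print(clean_tombstones_url(base))
```

Both URLs are meant to be sent as POST requests. Deleting only marks the data with tombstones, so the second call is what actually frees the space.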
Dashboards for All
“Now we have users who just love to pound on our dashboards, especially when we have database outages and somebody posts a link to a dashboard on Hacker News,” said Kochie. “Or even more fun, a very large corporation buys our biggest competitor, and there’s a graph in our dashboard that shows the number of imports that happened from our competitor. That dashboard just gets completely slammed.”
So how does the GitLab team plan to improve the performance of its public dashboards?
“To improve efficiency, I want to add the Thanos storage engine and add downsampling. That’ll make things even faster for everybody,” said Kochie. “Then you’ll be able to do super big queries on our public dashboard without crushing our backends.”
Not that Kochie minds the attention. “It’s super fun to be able to share all of our public dashboards,” he said. “It’s awesome.”
Check out GitLab’s public dashboards here.