When Patrick O’Brien interviewed to become a Site Reliability Engineer at The Trade Desk™, it was clear that taking the company’s monitoring system to the next level was the priority.
“A chunk of my interview was about The Trade Desk’s previous monitoring system and how to scale it,” says O’Brien, who joined The Trade Desk more than two years ago. “I had a good feeling that would be an early task.”
“It was an area of our infrastructure that I immediately identified as needing extra attention,” adds Carl Johnson, who joined the Engineering Department as the Director of Infrastructure and SRE six months before O’Brien. “Improving it was one of Patrick’s core goals when he was hired.”
With good reason. The Trade Desk is a Software as a Service company that operates a demand-side platform, enabling advertisers to run campaigns across all forms of digital media on the internet: traditional display, mobile, audio, and connected TV. Since its founding in 2009, The Trade Desk has grown into a publicly traded company with more than 1,100 employees and a market cap of $8.89 billion.
To maintain its massive success, “we have a global infrastructure that runs in both physical data centers and the cloud,” says Johnson. “We operate at a very high scale, dealing with request rates that are often measured in the millions per second.”
Previously, The Trade Desk “hosted everything for the storage layer of our monitoring system,” says O’Brien. “We had all our hosts pointing directly at various EC2 instances, and we had a high requirement for the disk storage layer.”
“The homegrown, self-managed, and hosted storage system that The Trade Desk previously used was extremely labor intensive and difficult to scale,” adds Johnson. “Often, individual nodes would run out of storage or, due to the technology’s single-threaded nature, would get overloaded. Developers and people at the company were just exasperated and annoyed with the unreliability of getting queries to complete or with missing metrics.”
“Things would fall over on a somewhat regular basis so our old system needed a lot of hand-holding,” says O’Brien. His goal was to alleviate that by making monitoring at The Trade Desk “easier, more reliable, faster, and cheaper.”
Turning on the Firehose
O’Brien spent time playing with open-source alternatives and also looked at other SaaS providers for hosting the backend. But Grafana Labs landed on his radar early in the process.
The Trade Desk was already using Grafana for data visualization because “what we need for our monitoring is flexible visualization available to not only all engineers at The Trade Desk, but also our entire company,” says Johnson. “We not only track traditional technical and engineering metrics in Grafana. We also present much-needed operational data that various business teams use to get a pulse on the day-to-day health of the business.”
“What Grafana has allowed us to do is be agile in how we manage these visualizations – whether it’s the scope of a single person working on a technology project or a Grafana dashboard that the whole company may be in the habit of viewing on a regular basis,” says Johnson.
O’Brien was familiar with Grafana Labs from the conference circuit and knew that it offered backend storage through Grafana Cloud, a fully managed SaaS metrics platform.
“I had a little concern about whether or not they would be able to handle the volume of metrics we were sending and querying,” admits O’Brien, who was transparent with Grafana Labs about his hesitation.
As a trial run, O’Brien says that the Grafana team agreed to “let us turn the firehose on” for a week. “That was a very attractive POC they allowed us to do.”
The Grafana team assisted in setting up an environment in which The Trade Desk could have the initial stream of metrics forked off into two different streams – one internally and one to the Grafana Cloud. “We found a decent number of areas in some of the code base that needed some tweaking,” adds O’Brien, “and everyone on the Grafana Labs side was super happy to help out with that and get changes committed to help us proceed with the POC.”
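The forked-stream setup described above can be sketched in miniature: each metric line is duplicated to every backend, and a failure in one backend (say, the new cloud endpoint during the POC) must not stop delivery to the other. This is a hypothetical Python illustration; the article does not name the actual relay software The Trade Desk used.

```python
# Minimal sketch of dual-writing one metrics stream to two backends.
# Backend names and the plaintext "name value timestamp" line format
# are illustrative assumptions, not details from the article.

def fork_metrics(lines, backends):
    """Send every metric line to every backend; an error in one
    backend must not block delivery to the others."""
    delivered = 0
    for line in lines:
        for send in backends:
            try:
                send(line)
            except Exception:
                # A real relay would buffer and retry here.
                continue
        delivered += 1
    return delivered

def flaky(line):
    # Simulates one backend being temporarily unreachable.
    raise IOError("backend unavailable")

internal, cloud = [], []
sent = fork_metrics(
    ["cpu.load 0.42 1575000000", "mem.used 812 1575000000"],
    [internal.append, flaky, cloud.append],
)
```

Both healthy backends receive the full stream even though the middle one fails, which is the property that made a side-by-side POC safe to run against production traffic.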
The benefits of Grafana Cloud were almost instantaneous. “Query time immediately improved and many, many developers seemed to notice. Also, our reliability improved quite a bit,” says O’Brien.
Today, “we have zero storage nodes, which were the most expensive piece of that stack,” says O’Brien. “Now we just have three nodes and everything feeds back to Grafana Labs.”
Not only did the migration save the company money; it also spared the engineering department the headaches of troubleshooting. “Metrics usage frustration improved nearly overnight once we went with the hosted platform,” says Johnson. “The reason we know it was a success is those complaints and frustrations internally stopped.”
And, to their surprise, the compliments started coming in. “The person who originally set up our monitoring stack at The Trade Desk messaged me just to say how much faster everything is now and how much happier he was with it,” says O’Brien.
“He spent a lot of time on managing that system,” adds Johnson. “When you just add up the aggregate time savings if we had continued down that path, I think most of the ROI is really coming from time and labor savings. We can all say that what was once a time-sink was removed from our radar altogether.”
The Trade Desk’s Got a Brand New Stack
Now that engineers no longer had to focus on troubleshooting, they could home in on building up The Trade Desk’s monitoring platform.
“By freeing up the capacity in our project load and staffing, it allowed us to think about raising the bar and being proactive about implementing a next-generation monitoring, metrics, and alerting system, rather than just maintaining the same system that had been in place for years and simply had momentum,” explains Johnson.
With the newly available resources, O’Brien refocused the team last year on streamlining the company’s stack into a more modern system. “2018 was the year of Prometheus,” says O’Brien.
“One of our goals was to be able to make metrics and alerting much easier to ramp up,” says O’Brien. “It’s nice that in Prometheus, your query language for dashboards is essentially the query language you write for alerts. And it’s super easy to embed a lot of context and a lot of helpful information into your Prometheus alerts, which for us was huge because we had to come up with some clever solutions in our old system to enrich the alerts themselves.”
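That symmetry is easy to see in a Prometheus rule file: the expression driving an alert is the same PromQL you would paste into a Grafana panel, and annotations let you template context directly into the alert. The following fragment is a schematic example; the metric name, threshold, and label set are illustrative, not taken from The Trade Desk’s configuration.

```yaml
groups:
  - name: example
    rules:
      - alert: HighRequestErrorRate
        # The same PromQL expression could back a Grafana dashboard panel.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error ratio: {{ $value | humanizePercentage }}"
```

The `{{ $labels.job }}` and `{{ $value }}` template variables are how context gets embedded into the alert itself, rather than bolted on by an external enrichment step.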
Around the time that The Trade Desk decided to go with Prometheus, O’Brien met with the Grafana Labs team at GrafanaCon in Amsterdam and discussed “the one big question mark about our future: What do we do with long-term storage for metrics?”
Serendipitously, The Trade Desk project coincided with the launch of Grafana Cloud’s native Prometheus integration in 2018, so the two companies collaborated again – but this time the partnership had “hurdles on both sides,” says O’Brien.
From the Grafana side, “we sent the most metrics per second that the Cortex backend had ever seen,” says O’Brien. “We probably took a year of VP of Product Tom Wilkie’s life away with the stress from the firehose we sent to Grafana Labs!”
From The Trade Desk’s perspective, they struggled with a problem that plagues many companies: how to implement processes around a new stack. “There was a decent learning curve and a lot of lessons we had to learn about how to structure our metrics, how to write our metrics, and how to collect metrics,” says O’Brien.
At GrafanaCon 2019, O’Brien gave a talk that outlined the six key lessons The Trade Desk learned from migrating its homegrown monitoring system to Grafana Cloud’s hosted Prometheus. (Read more in this blog post.)
Overall “there was a parallel effort at The Trade Desk and at Grafana Labs to help each other meet our expectations,” says O’Brien. “The folks at Grafana Labs were immensely helpful with many different things outside of the long-term backend storage. They were also super helpful with Prometheus in general, often fielding questions, discussing and helping with bugs we ran into, or triaging issues.”
Now, thanks to templating in Prometheus and Grafana, all alerts must contain a link to a dashboard that provides context so that “if 30 hosts are alerting on something, it’s much easier to link directly to a dashboard that displays those 30 hosts and the past 24 hours of history to see if something funky had happened,” explains O’Brien. “We’re also starting to get into graphing when deploys happen so we have that context around it as well.”
Troubleshooting has become more unified as well now that The Trade Desk has been able to enforce linking a runbook to each alert. In the past, “sometimes that runbook would contain a link to a Grafana dashboard for more context, sometimes it wouldn’t,” says O’Brien. “Now that linking runbooks is required for every alert, we can better enforce well-written alerts, which helps everybody.”
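A requirement like “every alert must link to a runbook” can be enforced mechanically, for example with a small check in CI that rejects rule files whose alerts lack the annotation. This is a hypothetical Python sketch, not The Trade Desk’s actual tooling; `runbook_url` is a common Prometheus annotation convention, assumed here.

```python
# Hypothetical CI lint: every alerting rule must define certain
# annotations. The data structure mirrors a Prometheus rule file
# after YAML parsing (a list of groups, each with a list of rules).

REQUIRED_ANNOTATIONS = ("runbook_url",)

def missing_annotations(rule_groups):
    """Return (alert_name, annotation) pairs for every required
    annotation that an alerting rule fails to define."""
    problems = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules don't need runbooks
            annotations = rule.get("annotations", {})
            for key in REQUIRED_ANNOTATIONS:
                if not annotations.get(key):
                    problems.append((rule["alert"], key))
    return problems

# Illustrative input: one compliant alert, one non-compliant alert,
# and one recording rule that the check should ignore.
groups = [{
    "name": "example",
    "rules": [
        {"alert": "DiskFull",
         "annotations": {"runbook_url": "https://wiki/runbooks/disk"}},
        {"alert": "HighLatency", "annotations": {}},
        {"record": "job:http_errors:rate5m"},
    ],
}]
```

Failing the build on a non-empty result is what turns “please add a runbook” from a convention into a guarantee.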
Engineering a Happy Team
While the team continues to work on retiring the old stack and iterating on the new one, the biggest gain from the Grafana Cloud migration has been the increased efficacy – and excitement – of the engineering team.
“Our engineers used to spend too much time fighting fires caused by our legacy platform. It was a huge win to give everyone their time back,” says O’Brien.
“I’ll say one indicator of success was that I heard a number of people say, ‘We’re working on Prometheus’ and giving an overview, and the response is, ‘This is really cool!’ I feel like you don’t hear this that often, especially on infrastructure teams,” O’Brien adds.
“It’s hard to please engineers,” says Johnson, “and our engineers have been quite pleased.”
“I have to give a shout-out to all The Trade Desk engineers who pitched in to make everything work. It takes a village, and, by and large, everybody is excited about the direction we’re going in,” says O’Brien. “People really noticed that performance has increased, and getting humans to tell you something has improved without prodding them; that’s a good indicator of a win.”