The Trade Desk: Lessons We Learned Migrating from Homegrown Monitoring to Prometheus

Published: 24 Apr 2019 RSS

The Trade Desk provides a self-service, cloud-based platform for buyers of online advertising. Since its founding in 2009, TTD has grown into a publicly traded company with more than 900 employees and a market cap of $8.89 billion.

The company recently moved from an old monitoring system based on Nagios, Graphite, and a number of homegrown pieces of software, to something more standard, based on Prometheus. SRE Patrick O’Brien gave a talk at GrafanaCon about the lessons they learned along the way to processing 11 million requests per second with Prometheus.

1. Think about your (hard) alerts.

When migrating alerts defined in a legacy alerting system into a new system, O’Brien said, “90% of those alerts will be insanely easy to move over. It’s the remaining 10% that will be difficult.” O’Brien’s advice: Spend time figuring out which ones will still be useful in the new system, and how you’ll actually migrate them. “Oftentimes, especially coming from Nagios, we’ll have Python scripts that do many different things in that single script to kind of figure out if there is an issue,” he said. “Those are the hard ones, and that’s where your longest tail of the project will be.”

2. Prometheus documentation is clinical.

“I’m super happy to now hear that we can contribute better documentation,” O’Brien said. “You will get a lot of PromQL questions when you start rolling up Prometheus, and it’s best to kind of become an expert in that as much as possible.”

3. Do maths.

“We immediately hit cardinality issues because we have a lot of hosts,” O’Brien explained. Users were told to make metric names generic and not embed any metadata into them, but add labels instead. “We hit 2 million metrics in the single namespace in like 30 seconds,” he said. “It was terrible and it was very painful… so maybe embed some metadata in that metric name.”

4. Find a few internal evangelists.

O’Brien gave a shout-out to one TTD engineer, Nathan, who “knew many more developers than I knew, and so he was able to kind of work with them, show them in code how it works, show them the benefits, and was able to reach much further than I was able to reach. It was fantastic.”

5. Create a dedicated team.

“The more opinions on how to do something, the better,” he said.

6. Get involved in the community.

“This one kind of speaks for itself,” O’Brien said. “You learn more about the product, you learn more about the project, and you’re able to help everybody else out.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Related Posts

SevOne, a 14-year-old company in the network monitoring space, started using Grafana for self-monitoring and fell in love with it. Now they've developed a feature to live stream data sources.
The KubeCon + CloudNativeCon caravan heads back to Europe this month, bringing an expected 10,000 cloud native enthusiasts to Barcelona’s Fira Gran Via. Already registered and packed your bags? Here’s where you will find Grafana Labs team members during the conference.
In this installment of the grafana-polystat-panel plugin tutorial, we look at rolling up multiple Cassandra clusters and tying together multiple dashboards.

Related Case Studies

DigitalOcean gains new insight with Grafana visualizations

The company relies on Grafana to be the consolidated data visualization and dashboard solution for sharing data.

"Grafana produces beautiful graphs we can send to our customers, works with our Chef deployment process, and is all hosted in-house."
– David Byrd, Product Manager, DigitalOcean

Hiya migrated to Grafana Cloud to cut costs and gain control over its metrics

To scale Prometheus, says Senior Software Engineer Jake Utley, Grafana Cloud was ‘the most in line with what we wanted to accomplish.’

"We wanted the ability to look at our own information and understand it from top to bottom."
– Dan Sabath, Senior Software Engineer, Hiya

How Cortex helped REWE digital ensure stability while scaling grocery delivery services during the COVID-19 pandemic

Cortex’s horizontal scaling has been crucial; reads and writes increased significantly, and the platform was able to handle the added load.

"We wanted a software-as-a-service approach, with just one team that provides Cortex, which can be used by all the teams within the company."
– Martin Schneppenheim, Cloud Platform Engineer, REWE digital