How Dropbox rebuilt its logging stack with Grafana Loki after a data center went dark

2025-06-27 5 min

Two years ago, a power outage knocked a Dropbox data center offline.

It wasn’t just any data center. It was the only one where Dropbox hosted Grafana Loki, meaning engineers couldn’t access their log data.

“We had considered a data center outage when we were rolling out Loki, but it had just never risen up in priority enough to get put into multiple data centers,” said Chris Hodges, an infrastructure software engineer at the cloud storage company. “And now we were paying the price.”

The incident became an inflection point for Dropbox, which evolved that single distributed Loki cluster into a reliable, petabyte-scale logging platform, all while balancing developer needs and operational realities. As a result, Dropbox can now ingest up to 6 GB of logs per second, with as much as 5 PB in storage at any given time under its expanded 30-day retention policy.

“It was a good learning experience for us,” said Hodges, who discussed that journey and shared strategies for operationalizing Loki at scale during his GrafanaCON 2025 talk last month. “I wish we’d been better prepared for it, but it really gave us an idea of what we needed to do to make this better in the future.”

‘A single place for observability’

Dropbox originally selected Loki for multi-tenant log aggregation during its migration to Kubernetes. Developers had previously relied on manual, host-by-host methods like SSHing into machines and reading logs with vi, or transferring files off hosts. That wasn’t going to scale with Kubernetes, where logs are far more ephemeral; pod logs, in particular, are deleted when pods are rescheduled.

Loki is based on object storage, so it offered a more persistent solution. Price, multi-tenancy support, privacy controls, and reliability were additional selling points, as was Dropbox’s longtime use of Grafana Labs’ projects and products in other areas.

“We definitely wanted to have a single place for observability: logs, metrics, and traces,” Hodges said. “And Grafana just seemed like it was going to continue to be the best solution for that.”

Planning and design 

The outage made it clear that Loki needed to run in multiple regions to stay available even through a complete data center failure. Dropbox developed a failover strategy in which DNS would redirect the log-shipping agent Promtail to a Loki cluster in a different region in the event of a disaster.
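In practice, that kind of failover hinges on Promtail targeting a stable DNS name rather than a hard-coded regional endpoint. Here is a minimal sketch of the idea, using a hypothetical internal hostname and tenant; Dropbox’s actual configuration isn’t public:

```yaml
# Illustrative Promtail client config. The hostname is a placeholder;
# the point is that Promtail pushes to a DNS name that operators can
# repoint at a healthy region during a disaster.
clients:
  - url: http://loki-write.example.internal:3100/loki/api/v1/push
    tenant_id: example-tenant   # multi-tenant isolation per team or service
```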

The system also used a shared object storage layer across regions, with Amazon S3 offering availability zone redundancy and Dropbox’s Magic Pocket providing on-premises, cross-region storage replication. This combination eliminated the need for duplicate logs, which would have been cost-prohibitive at petabyte scale.
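Because every region reads and writes the same object store, a Loki cluster in the surviving region sees the same chunks after a failover. A sketch of what the shared storage section of a Loki config might look like, with placeholder bucket and region names:

```yaml
# Illustrative Loki storage_config backed by a shared S3 bucket.
# Bucket and region names are placeholders, not Dropbox's setup.
storage_config:
  aws:
    bucketnames: loki-chunks   # the same bucket is reachable from both regions
    region: us-east-1
```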

Dropbox designed a playbook for simulating this failover without disrupting users. During these tests, Hodges and team discovered that logs buffered in memory didn’t immediately flush to storage, creating the potential for data loss. Engineers manually triggered a flush, but that caused a disruptive torrent of writes. The experience reinforced the importance of cross-team coordination, rate-limiting safeguards, and testing before deployment.

“I’d rather the first time we run this not be when a data center is down,” Hodges said.

A gradual rollout

From there, Dropbox took a measured approach to deploying Loki more widely, rolling Promtail out to 10,000 to 20,000 hosts at a time while the team watched for ingestion and disk usage spikes.

One problem that emerged was high cardinality. Each log line carried a set of labels, such as the service name and host, whose unique combination defined a stream. Too many label combinations on a host threatened to clog the ingestion pipeline and cause out-of-memory crashes. Dropbox addressed this issue by imposing tight controls and standardization on its label schema.
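A common way to enforce that kind of schema is to pin each scrape target to a small, fixed label set and keep anything high-cardinality, such as request or user IDs, in the log line itself. A sketch of a Promtail scrape config along those lines, with made-up job and label values:

```yaml
# Illustrative Promtail scrape config with a deliberately small label set.
# Each unique combination of labels becomes a separate stream in Loki, so
# high-cardinality values belong in the log line, never in labels.
scrape_configs:
  - job_name: service-logs
    static_configs:
      - targets: [localhost]
        labels:
          service: example-service   # standardized, low-cardinality
          env: prod                   # small, fixed set of values
          __path__: /var/log/example-service/*.log
```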

Strict limits, including seven-day retention, six-hour query windows, and both global and per-stream rate caps, further helped maintain stability as adoption increased. Still, some services generated so many logs that they risked overwhelming the system. Dropbox isolated these workloads using stream-level controls and exponential backoff in Promtail. Additionally, custom pull requests made it possible to heavily throttle one service while others continued sending logs, which solved the problem of head-of-line blocking.

“A core tenet of multi-tenancy is that one badly behaving tenant doesn’t ruin the experience for everybody,” Hodges said. “This was an issue that very nearly kept us from rolling out Loki to production.”
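Loki and Promtail both expose knobs that map onto these safeguards. The values below are illustrative rather than Dropbox’s actual settings, but they show where each control lives:

```yaml
# Loki server side (limits_config): retention, query window, and rate caps.
limits_config:
  retention_period: 168h            # seven-day retention during the rollout
  max_query_length: 6h              # cap each query's time window
  ingestion_rate_mb: 10             # global per-tenant rate cap
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 3MB        # contain a single noisy stream
  per_stream_rate_limit_burst: 15MB
---
# Promtail client side: exponential backoff so a throttled sender retries
# gently instead of hammering the write path.
clients:
  - url: http://loki-write.example.internal:3100/loki/api/v1/push
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10
```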

A trusted system

As the system proved itself, Dropbox relaxed some restrictions, such as increasing retention to 30 days. The company also worked with Grafana Labs to switch Loki’s index back end from BoltDB to a Prometheus-style time series database (TSDB).

“They saved our bacon,” Hodges said. “We saw an order of magnitude improvement in our performance on label queries.”
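In Loki, that kind of migration shows up in schema_config: new index periods can use the TSDB store while older data remains queryable under the previous scheme. A sketch with made-up dates and a typical layout:

```yaml
# Illustrative schema_config migration; dates and versions are examples only.
schema_config:
  configs:
    - from: "2022-01-01"      # older data stays on the BoltDB-based index
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
    - from: "2024-04-01"      # new data uses the Prometheus-style TSDB index
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: tsdb_index_
        period: 24h
```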

A Grafana dashboard shows how Dropbox has grown its log ingestion rate over time

Today, Loki has become a trusted system across the entire company—so much so that Dropbox shut down its legacy logging system. Its shared storage and DNS-based region switching enable fast failover for high availability. And it operates at a massive scale as a core part of Dropbox’s observability stack. It now ingests 5 GB to 6 GB of logs per second, with 30-day retention totaling 4 PB to 5 PB at any given time.

That steady growth—from a single-region cluster to a resilient platform—is one of the main reasons why “we’re excited for what’s coming forward with Loki,” Hodges said.

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!