How a production outage in Grafana Cloud's Hosted Prometheus service was caused by a bad etcd client setup

Published: 7 Apr 2020

On March 16, Grafana Cloud’s Hosted Prometheus service experienced a ~12-minute partial outage on the write path in our London region, resulting in delayed data storage but no data loss. To our customers who were affected by the incident, I apologize. It’s our job to provide you with the monitoring tools you need, and when they are not available, we make your life harder. We take this outage very seriously. This blog post explains what happened, how we responded to it, and what we’re doing to ensure it doesn’t happen again.

Background

The Grafana Cloud Hosted Prometheus service is based on Cortex, a CNCF project to build a horizontally scalable, highly available, multi-tenant Prometheus service.

In January 2019, we added the ability to send metrics from pairs of HA Prometheus replicas to the same Hosted Prometheus instance and have those metrics be deduplicated on ingestion. To enable this, we track the identity of the replica from which we have “elected” to accept writes. If we don’t see metrics from the elected replica for 15 seconds, we change the election: the next replica we receive a write from becomes the new “elected” replica. To track these identities we use etcd, a Raft-based distributed key-value store that is also used by Kubernetes to store runtime and configuration data.

The incident

At 11:19:00 UTC on March 16, the node running the etcd leader was abruptly terminated by a scheduled Kubernetes version upgrade. By design, another etcd replica automatically became the leader ~10 seconds later.

About 60 seconds later, two replicas of the Cortex distributor (the service on the write-path that is responsible for deduplicating samples) started logging “context deadline exceeded” when trying to fetch keys from our etcd cluster, and writes for some customers via those replicas started failing.

The issue was caused by a stuck TCP connection from the distributors to the old etcd leader, which can happen when the underlying node dies and the TCP connection is not gracefully closed.

Detection and resolution

We use SLO-based alerting, which paged our on-call engineer at 11:31, a full 11 minutes after the problem started. Because this problem manifested as an error rate of only ~20%, it would have taken ~18 hours at that rate to breach our monthly SLA, which is why the alert did not fire sooner.

The affected distributors were restarted at 11:32, and the errors stopped, ending the incident.

Takeaway

It is important that we learn from this outage and put in place steps to ensure it does not happen again.

The total length of the incident was 12 minutes. Because Prometheus’s remote_write treats timeout errors as recoverable, the sending side retried the failed writes, and the retries succeeded when they hit other distributor replicas. The incident therefore resulted in no data loss, only delayed ingestion.

The etcd client supports gRPC keepalive probes, which were not correctly configured in Cortex. We reproduced the incident in our dev environment with iptables rules that dropped packets between a distributor instance and the etcd instance it was connected to. Enabling the keepalive probes in the etcd client was shown to prevent the problem:

cli, err := clientv3.New(clientv3.Config{
	Endpoints:   cfg.Endpoints,
	DialTimeout: cfg.DialTimeout,
+	// Send a gRPC keepalive ping every 10s; tear down the connection
+	// if the server does not respond within the timeout, forcing a
+	// reconnect instead of hanging on a dead peer.
+	DialKeepAliveTime:    10 * time.Second,
+	DialKeepAliveTimeout: 2 * cfg.DialTimeout,
+	// Send pings even when there are no active RPC streams.
+	PermitWithoutStream:  true,
})

We have also followed up with our infrastructure team to work out why the machine was terminated so abruptly: because the etcd Pods are managed by the etcd operator, our upgrade scripts didn’t know how to gracefully reschedule them. What’s more, the API call our scripts used to terminate an instance gave it only around 90 seconds to shut down cleanly. We are working to ensure our Kubernetes upgrade process gracefully terminates Pods and machines going forward.

This outage is not all bad news: we relied heavily on Grafana Loki, our new log aggregation system, to quickly dig into logs during the outage, the recovery, and the post mortem. We wouldn’t have been able to do that work as quickly and precisely without Loki. This meant reduced time-to-recovery for our users and less mental overhead for us during a stressful phase.
