How we responded to a 2-hour outage in our Grafana Cloud Hosted Prometheus service
On Thursday, March 11, we experienced a ~2-hour outage in our Grafana Cloud Hosted Prometheus service in the us-central region. To our customers who were affected by the incident, we sincerely apologize. It’s our job to provide you with the monitoring tools you need, and simply put, when they are not available, we make your life harder.
We take this outage very seriously. After an extensive incident review, we put together this blog post to explain what happened, how we responded to it, and what we’re doing to ensure it doesn’t happen again.
We run ~15 large clusters of Grafana Enterprise Metrics (GEM) on various Kubernetes clusters around the world. These clusters are multi-tenanted; they are securely shared between many users of Grafana Cloud using the data and performance isolation features of GEM.
In particular, we set limits on how much data any single tenant can send to the cluster and how quickly the data can be sent. These limits are designed to protect other tenants and the cluster as a whole from the possibility of the cluster getting “overloaded” if a single tenant writes too quickly. The clusters are provisioned/sized based on these limits, among other things.
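To make the two kinds of limit concrete, here is a minimal sketch of per-tenant admission control: a cap on how much data a tenant can hold (active series) and a token bucket governing how quickly it can write. This is not how GEM or Cortex implements limits. The class names, fields, and rejection messages are all hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TenantLimits:
    """Hypothetical per-tenant limits; names and values are illustrative."""
    max_active_series: int      # cap on distinct series held for the tenant
    samples_per_second: float   # sustained ingestion rate
    burst_size: float           # samples allowed in a short burst


@dataclass
class TenantState:
    limits: TenantLimits
    tokens: float = 0.0
    last_refill: float = 0.0
    active_series: set = field(default_factory=set)


class Limiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the behavior can be tested
        self._tenants = {}

    def register(self, tenant_id, limits):
        # Each tenant starts with a full bucket.
        self._tenants[tenant_id] = TenantState(
            limits=limits, tokens=limits.burst_size, last_refill=self._clock()
        )

    def admit(self, tenant_id, series_ids):
        """Return (accepted, reason) for a write of one sample per series."""
        state = self._tenants[tenant_id]
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - state.last_refill
        state.tokens = min(
            state.limits.burst_size,
            state.tokens + elapsed * state.limits.samples_per_second,
        )
        state.last_refill = now

        # Rate check: how quickly the tenant is sending.
        if len(series_ids) > state.tokens:
            return False, "rate limit exceeded"

        # Volume check: how many series the tenant would hold.
        new_series = set(series_ids) - state.active_series
        if len(state.active_series) + len(new_series) > state.limits.max_active_series:
            return False, "active series limit exceeded"

        state.tokens -= len(series_ids)
        state.active_series |= new_series
        return True, "ok"
```

Both checks run before any state is changed, so a rejected write costs the tenant nothing and leaves the other tenants' state untouched.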
On Wednesday, March 10, we onboarded a new customer as a tenant on one of our us-central GEM clusters. We increased their active series limits based on their forecasted usage. The next day, without realizing it, the new tenant started sending more data than expected. The configured limits for this customer should have protected the cluster. However, a bug in how limits were enforced meant the tenant was able to overwhelm the cluster, through no fault of their own.
This overload caused a cascading failure throughout our stack. We discovered that our error-handling paths are more expensive (CPU-wise) than our success path. Our internal authentication gateways became overloaded and failed their health checks, subsequently causing our load balancers to remove them from the pool.
Initially we scaled up the cluster to attempt to cope with the load, but as the many Prometheus servers sending data to us began to recover, the ensuing waves of load caused a series of Out-of-Memory errors (OOMs) and other failures throughout the stack.
We then identified which customer was causing the overload, placed even stricter limits on that customer’s usage, and continued to scale up the cluster until it could handle the recovering load. Once the stricter limits on that tenant were in place, the cluster recovered automatically with no additional intervention. Service was fully restored by 23:46 UTC on March 11.
We subsequently identified and fixed the underlying bug on March 12.
What we have learned and are improving
It is of the utmost importance that we learn from this outage and put in place necessary changes to ensure it does not happen again. These are the specific steps we are taking:
The change to how limits were handled has been reverted, and the revert has been deployed (see https://github.com/cortexproject/cortex/pull/3948). Furthermore, new limits are being introduced to add multiple layers of protection (see https://github.com/cortexproject/cortex/pull/3992). In combination, these two changes will prevent individual customers from being able to overwhelm the cluster and any future bugs from interfering with the enforcement of the limits.
The engineering team is improving our per-tenant limit management, introducing tighter limits and more automated management of limits. We are also improving the management of cluster scaling and provisioning, making the process more automated. This should ensure we never get into a situation where a single tenant can overwhelm a cluster.
We are in the process of recreating the overload scenario in our dev environment and investigating ways to ensure the entire stack fails more predictably (for example, by rejecting API calls instead of crashing). In the unlikely event that a similar issue arises, this will allow us to recover more quickly.
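As an illustration of what "rejecting API calls instead of crashing" can look like, the sketch below caps the number of requests in flight and turns away the excess before doing any expensive work. It is a simplified example under stated assumptions, not a description of our stack: the LoadShedder class, the max_in_flight parameter, and the choice of a 429 status for rejections are all hypothetical.

```python
import threading


class LoadShedder:
    """Hypothetical in-flight request cap; rejects excess work cheaply."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, request_fn):
        """Run request_fn if capacity allows; otherwise reject immediately."""
        with self._lock:
            if self._in_flight >= self._max:
                # Reject before allocating memory or burning CPU on the request.
                # The status code is an assumption made for this sketch.
                return 429, None
            self._in_flight += 1
        try:
            # The lock is released here, so other requests can be admitted
            # or rejected while this one runs.
            return 200, request_fn()
        finally:
            with self._lock:
                self._in_flight -= 1
```

The key design point is that the rejection path does almost no work, so a flood of excess requests costs far less than serving them would and the process keeps enough headroom to stay healthy.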
We are also investing in incident response training and automation to improve our response time. We expect these investigations and resulting changes to be applied within the next six weeks.
Finally, we’re continuing to work with the wider community on improving the behavior of Prometheus when replaying data after an outage.