Litmus Insights: Diagnosing Human Fail

Published: 17 Nov 2015 by Matt Toback

Symptoms: Gradual global meltdown that’s not DNS.

A few months ago, just as we’d started boarding the first handful of users onto the initial alpha of our platform, we received a critical alert early on a Sunday morning: our website was offline.

The strange thing was that some of us could still get to it just fine.

We believe in eating our own dog food where appropriate, so we immediately logged into Litmus and pulled up the Grafana summary dashboard that shows performance and availability from dozens of locations around the world.

Here’s what the overall health of the website looked like: (These are interactive embedded Grafana panels. Read more about them.)

This didn’t look good.

Seeing larger and larger amounts of red told a story: the problem was spreading, and getting worse.

We could see it happening, in real time.

We were slowly going dark all over the world.

We immediately thought that it must be a DNS problem. Expiring TTLs on DNS resolvers around the world would cause a gradual outage just like this.
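To illustrate why expiring TTLs produce a gradual outage rather than an instant one, here is a minimal sketch. It assumes a simplified model in which every resolver cached the same record at a different moment before the record went bad; the function name and the numbers are hypothetical and are not taken from our actual setup.

```python
def failing_fraction(cache_ages, ttl, elapsed):
    """Return the share of resolvers whose cached record has expired.

    cache_ages: seconds each resolver had already held the record at the
                moment the upstream record broke (hypothetical inputs).
    ttl:        time-to-live of the record, in seconds.
    elapsed:    seconds since the upstream record broke.
    """
    if not cache_ages:
        return 0.0
    # A resolver keeps answering from cache until its remaining TTL runs out.
    expired = sum(1 for age in cache_ages if elapsed >= ttl - age)
    return expired / len(cache_ages)


# Four imaginary resolvers that cached the record at staggered times.
ages = [0, 900, 1800, 2700]
for elapsed in (0, 1800, 3600):
    print(elapsed, failing_fraction(ages, ttl=3600, elapsed=elapsed))
```

In this toy model, locations drop out one by one as their caches run dry, which is the same creeping pattern of red the dashboard was showing.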

But the Dashboard disagreed: DNS was just fine. We were seeing great global performance and 100.00% global availability and correctness of our DNS.

Just another normal day for our awesome DNS host NS1.

(^DNS availability over time)

(^DNS performance over time [each Location])

So, it wasn’t a DNS problem.

There’s something that could explain this exact picture…

Something that would cause name resolution to fail gradually around the world, independent of the health of our DNS service. A problem at the registrar…

A quick whois check on the console confirmed our fears.

Our domain had expired.
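For anyone who wants to script the same check instead of running whois by hand, here is a rough sketch in Python. It is not what we ran that morning. It assumes the WHOIS server hostname is supplied by the caller, and that the response contains an ISO 8601 expiry line under one of a few common labels; registries format this field differently, so the pattern is a guess that will need adjusting. All function names are hypothetical.

```python
import re
import socket
from datetime import datetime, timezone

# Assumption: the expiry line uses one of these labels. Registries vary.
EXPIRY_PATTERN = re.compile(
    r"^\s*(?:Registry Expiry Date|Registrar Registration Expiration Date"
    r"|Expiry Date|Expiration Date)\s*:\s*(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def query_whois(domain, server, port=43, timeout=10.0):
    """Send a plain WHOIS query (TCP port 43) and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_expiry(whois_text):
    """Return the expiry as an aware datetime, or None if it is not found."""
    match = EXPIRY_PATTERN.search(whois_text)
    if match is None:
        return None
    raw = match.group(1)
    if raw.endswith("Z"):  # older Pythons reject a trailing "Z"
        raw = raw[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC if unstated
    return parsed


def days_until_expiry(expiry, now=None):
    """Whole days left before expiry; negative once the domain has lapsed."""
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days


# Example usage (needs network access; "whois.nic.io" is an assumed server):
# expiry = parse_expiry(query_whois("raintank.io", "whois.nic.io"))
# if expiry is not None:
#     print(days_until_expiry(expiry))
```

A negative result, or a value smaller than your renewal lead time, is the cue to go and find out where the renewal notices are being sent.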

We weren’t the first company to make this mistake and certainly won’t be the last. We registered raintank.io last summer as just one of many potential company names, using a now-defunct email address. A year passed in the blink of an eye, and the renewal notices were going into the ether.

A dashboard can be worth a thousand sleepily written RFOs.

The screenshot (or better yet, the snapshot) below illustrates the entire debacle better than this blog post ever could.

Domain expiration dashboard

Needless to say, we were embarrassed, but the silver lining was seeing some of the value of the Grafana Dashboard that Litmus provided. It helped us become aware of and diagnose this mishap in record time. We benefited from what we’re building.

Next steps include monitoring expiration times of certificates and domains as a feature on Litmus. It’s definitely something we’re thinking about, especially as we build alerting in Grafana.
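As a sketch of what the certificate half of such a check could look like, the snippet below uses only Python's standard library. It does not describe how Litmus does or will implement this; the function names are hypothetical, and the host you pass in is up to you.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the certificate's 'notAfter' string for a TLS endpoint.

    Note: with the default context, the handshake itself raises an error
    if the certificate has already expired or is otherwise invalid.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def cert_days_remaining(not_after, now=None):
    """Whole days until the 'notAfter' timestamp; negative if it has passed."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires_at - now) // 86400)


# Example usage (needs network access):
# print(cert_days_remaining(fetch_not_after("raintank.io")))
```

Paired with a domain expiry check, a threshold on this number (say, alert below 30 days) would catch the kind of slow-motion surprise described in this post well before users do.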

If you’d like to get free visualizations like this about your own infrastructure, sign up for early access below. We’re still hard at work building v1 of our platform, and the only thing we ask for is your feedback to help make it better.
