Why optimizing for MTTR over MTBF is better for business

Published: 1 Jul 2020

The classic debate when running a software-as-a-service (SaaS) business is release frequency vs. stability and availability. In other words, are you Team MTTR (mean time to recovery) or Team MTBF (mean time between failures)?
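
To make the two metrics concrete, here is a minimal Python sketch that computes both from an incident log. The function names and the log format (a list of start/end pairs) are assumptions for illustration, and MTBF is taken here as total uptime in the observation window divided by the number of failures, which is one common convention. Under that convention, availability works out to MTBF / (MTBF + MTTR), which shows why shrinking recovery time can offset more frequent failures.

```python
from datetime import datetime, timedelta


def total_downtime(incidents):
    """Sum the duration of every (start, end) outage in the log."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents):
    """Mean time to recovery: average outage duration."""
    if not incidents:
        raise ValueError("need at least one incident")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window / number of failures."""
    if not incidents:
        raise ValueError("need at least one incident")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value, mttr_value):
    """Fraction of time the service is up, as a float between 0 and 1."""
    return mtbf_value / (mtbf_value + mttr_value)


# Hypothetical log: two 30-minute outages in a 30-day window (made-up data).
example_incidents = [
    (datetime(2020, 6, 5, 10, 0), datetime(2020, 6, 5, 10, 30)),
    (datetime(2020, 6, 20, 14, 0), datetime(2020, 6, 20, 14, 30)),
]
example_mttr = mttr(example_incidents)
example_mtbf = mtbf(example_incidents, datetime(2020, 6, 1), datetime(2020, 7, 1))
example_availability = availability(example_mtbf, example_mttr)
```

Halving the outage duration in this sketch raises availability by the same amount as halving the number of outages, which is the trade-off the rest of this post leans on.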

In this blog post, I argue for MTTR, which encourages you to push more frequently, embrace the instability this may introduce, and invest in training and tooling to deal with the ensuing outages.

At its core, this approach means continually pushing out minimum viable products (MVPs), testing in production, and embracing failure. Here’s why that works.

More testing leads to better quality

The whole idea of optimizing for MTTR can be counterintuitive for engineering teams because it can be stressful when things go wrong. But that mentality is exactly why people who follow an MTBF strategy find it laborious to fix problems when they occur. When failures happen infrequently, it’s hard to optimize your response process.

With a constant stream of releases, there is more testing and more code review because there is the expectation, and even the encouragement, to fail. With that in mind, the team is ready to optimize for failure recovery and to iterate on the code, which in the end gives them greater familiarity with the product and yields better reliability.

For those who minimize how often code is deployed, changes also take longer to implement. The 80/20 principle applied to engineering says that the last 20 percent of the work takes 80 percent of the time. With SaaS, the last 20 percent is typically getting your feature deployed through staging and into production. If you only deploy infrequently (say, weekly or monthly), the “quantization” size for each feature can’t be smaller than your release interval.

While this conservative approach leads to a more stable site on paper, in practice it results in a stale product. Oftentimes the product that wins in the marketplace is not the best one. It’s the one that responds to customer needs more quickly.

With an MTTR approach, we are not necessarily investing all of our effort in building the most available product. We’re simply investing our effort in building the minimal product and tightening the feedback cycle as much as possible. When the product doesn’t behave as expected (and that will happen often), we can quickly stand up a new, and even better, service that reflects the customer’s changing needs.

Embracing uncertainty leads to stability

Unlike traditional enterprise software, SaaS can be released very frequently, as often as multiple times per day. This allows SaaS businesses to respond to changing customer demands quickly, all while putting zero burden on their users (i.e., they don’t need to upgrade their software constantly).

That being said, there are typically no “releases” or updates to a platform during the holiday season. (See “Google’s Big December Code Freeze.”) This period is a) the most profitable time of year, with lots of customers, and b) a time when lots of key staff are taking time off. Hence, many companies don’t want to deliver releases that might jeopardize sales and pull people away from their holiday plans.

So e-commerce websites have code freezes. Many products stop doing releases at this time. All these practices show that to optimize for MTBF, you minimize change.

For a SaaS vendor focused on MTTR, the holiday season is the most stable time of year – as are all holidays – because of the high frequency at which the team deploys updates and the familiarity every developer has with the code base.

This phenomenon is one of the big reasons why SaaS, and no longer just software, is “eating the world,” as Marc Andreessen famously declared in his 2011 essay in the Wall Street Journal.

With each new release, however, comes the risk of introducing new bugs and outages.

Continuous deployment, blue-green deployments, and canarying are all examples of techniques used to reduce that risk. The idea is that by making releases more frequent, you can minimize changes between any two deployments. As a result, there is a reduced risk of erratic interactions and a greater chance of quickly pinpointing which release, and therefore which change, caused a problem.
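
As a rough illustration of canarying, the sketch below compares the error rate of a small canary deployment against the current baseline and decides whether to promote the release, roll it back, or keep waiting for more traffic. The function name, parameters, and thresholds are all hypothetical, and a plain ratio check like this is a simplification; a real canary analysis would usually pull these counts from your monitoring system and apply a proper statistical comparison.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=2.0, floor=0.001, min_requests=100):
    """Return "wait", "promote", or "rollback" for a canary release.

    All thresholds are illustrative assumptions:
      max_ratio    - how many times worse than baseline the canary may be
      floor        - minimum allowed error rate, so a zero-error baseline
                     does not fail every canary on its first error
      min_requests - traffic needed before making any decision
    """
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    allowed_rate = max(baseline_rate * max_ratio, floor)
    return "rollback" if canary_rate > allowed_rate else "promote"


# Hypothetical counts: the baseline fails 0.1% of requests, the canary 5%.
verdict = canary_verdict(baseline_errors=10, baseline_requests=10_000,
                         canary_errors=50, canary_requests=1_000)
```

Because each release carries only a small change, a "rollback" verdict points at a short list of suspects, which is what keeps recovery fast.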

Strengthening the team

Finally, MTTR helps build a more resilient on-call team that doesn’t flinch when there is an outage. When the team regularly practices resolving failures, getting paged is far less stressful, and parts of the response process can be automated, which means faster fixes.

Another reason instability is welcome is that new versions of the code are likely to fail in unexpected and exciting ways.

These unpredictable problems will help train new members of the on-call team. It’s hard to instill confidence in engineers with drills and training alone; sometimes you just need real incidents.

One approach is to artificially introduce problems à la Netflix’s Chaos Monkey. This helps you find things like single points of failure, but it only works well for systems that aren’t experiencing a high rate of change to begin with.

Another approach is to deploy new software more often, which can introduce real problems to fix. Let’s not forget Steve Jobs’s famous words: Real artists ship.

Here’s how to optimize for MTTR

To recap, if you choose to optimize for MTBF, you will release less frequently, which results in a “stale” product that can’t respond to changing customer demands. You will also have an on-call team that doesn’t get paged frequently, so each new alert is a high-stress situation.

By optimizing for MTTR, you’re leaning into a team that knows how to respond to and fix failures quickly. You’ll also implement a high release/deploy cadence that allows you to quickly respond to customers’ needs and ship features they want.

So, how do you do this?

  • Adopt tools and technologies such as Kubernetes, which help you automate releases and deployments so you can do them frequently
  • Ensure your application is well-instrumented and that you have a solid observability strategy that includes Grafana, Prometheus, and Loki. With the right monitoring tools, your team will have the confidence to solve issues in production.
  • Track release cadence and on-call load (incidents per shift) and balance the two. Too many incidents? Push back on features and focus on tech debt. Too few? Push the team to take more risks.
  • Encourage iteration. What can you do to release more frequently? What pages most often?
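
The balancing act between release cadence and on-call load can be sketched in a few lines of Python. The function name and the two thresholds below are made-up assumptions, not recommended values; the right numbers depend on your team and should come from your own incident history.

```python
def cadence_advice(incidents, shifts, max_per_shift=2.0, min_per_shift=0.5):
    """Suggest a direction based on on-call load (incidents per shift).

    max_per_shift and min_per_shift are illustrative thresholds only.
    """
    if shifts <= 0:
        raise ValueError("shifts must be positive")

    per_shift = incidents / shifts
    if per_shift > max_per_shift:
        return "push back on features and focus on tech debt"
    if per_shift < min_per_shift:
        return "take more risks and release more often"
    return "hold the current cadence"


# Hypothetical month: 45 incidents across 15 on-call shifts (3 per shift).
advice = cadence_advice(incidents=45, shifts=15)
```

Reviewing this number alongside the release count for the same period makes it easier to see whether a change in cadence is what moved the on-call load.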

Conclusion

Some people use the “worse is better” theory as a criticism. I wear it as a badge. In the end, it’s all about being agile and getting something out there so people can test it even though the product may not be the best in class yet.

A lot of people put their efforts into building massively available services with no customers. By contrast, when you start from zero and build a service, you can move quickly, even if that means it may be unreliable at first. As outages occur or bugs surface, your team will adapt and improve on the system, learning from the problems that arise and pivoting as needed. Or they’ll just blow everything away and stand up a new product really quickly. Anything is possible.

As they say in F1: It’s easier to make a fast car reliable than to make a reliable car fast.
