What does the future hold for Site Reliability Engineering?

1 Dec, 2020 4 min

Site Reliability Engineering, or SRE for short, has become quite the buzzword. I wasn’t there in 2004, when Ben Treynor started it at Google, but I claim bragging rights based on the fact that the very same Ben Treynor interviewed me for an SRE role in 2005. (I also got the job after the interview, in case that wasn’t obvious…) When SREcon EMEA 2019 came along, I thought it was just about time to publicly speculate about the future of our profession. Thus, I gave a talk titled SRE in the Third Age. At least two people liked it: Emil Stolarsky and Jaime Woo. In fact, they liked it so much that they encouraged me to write an essay version of my talk for their upcoming O’Reilly book 97 Things Every SRE Should Know. If everything goes as planned, the print version will hit the shelves in mid-December, while the online version should be available even as you are reading this. In any case, if you cannot wait, here is a sneak preview of my essay, as you will find it at the very end of the book, after 96 other concise and useful tips from across the industry. Enjoy!

In the first age, SRE was proprietary to Google, and knowledge about it left the company only by diffusion.

In the second age, SRE was set free. The Site Reliability Engineering book in 2016 made it blatantly obvious that a fundamental change was happening: from a weirdly named department within Google to a generally known profession. The fittingly named SREcon has happened regularly and increasingly successfully since 2014. In the job market, SRE is a downright buzzword, appearing in resumes and job descriptions everywhere.

The incredible popularity of SRE makes me believe we have reached the late stage of the second age, and its conclusion will be marked by an interesting inversion of the current hiring hype: the end of the dedicated SRE role as we know it.

How? During the second age, many organizations quickly realized that their much smaller size prevented them from performing SRE exactly like Google. Even an organization large enough to maintain a dedicated SRE team—which most couldn’t—usually came to the conclusion that they couldn’t just hire a certain number of SREs to do “that SRE thing.” Instead, every engineer had to become a part-time SRE. Seeking SRE documents a number of those stories, including (shameless plug) my own witness account as a Production Engineer at SoundCloud.

From that perspective, the high demand for SREs on the job market is mostly driven by the desire to find someone to spread SRE knowledge among the other engineers. An organization that has truly arrived in the third age is one where that has already happened. All engineers can wear an SRE hat as part of their job, and at least smaller organizations will then stop hiring dedicated SREs. Instead, an SRE mindset will be an important hiring requirement for every engineering role.

What necessitates those transitions between ages, you may ask? For the second age, it was the democratization and proliferation of cloud-native technologies. Looking at the very insightful definition published by the CNCF, cloud-native technologies allowed even small organizations to quickly reach a level of complexity and scale where SRE becomes a necessity.

For the third age, it will be the optimization of the share of engineers who work in a dedicated SRE role rather than directly on the actual product. An organization at second-age maturity, where most engineers act as part-time SREs, will realize that most of the tasks of the remaining dedicated SREs could be handed over to service providers, including but not limited to traditional infrastructure-focused cloud providers. In fact, the increasing selection of “higher order” services, which run on top of other cloud services, will drive most of the opportunity growth.

There is a trade-off, of course. The larger an organization, the more efficient it is for it to run a larger part of its stack on its own. But with the steady innovation of the service providers, the bar is moving up here.

Will we soon enjoy “SRE as a service” so we can completely forget about operational concerns? On the contrary. In a second-age scenario, it is actually easier for engineers to get away with a certain amount of operational ignorance by relying on SREs within the organization. In the third age, most engineers will be very close to production, enabled by the SRE-inspired tools and services at their fingertips. To use those effectively, they will require an SRE mindset.

The unimaginable power of SRE in the third age is that it will (and has to) be in everyone’s head. The moment universities include SRE classes in their computer science programs will be a sure sign that the third age has begun.