Site Reliability Engineering by Betsy Beyer

September 13, 2020

Site Reliability Engineering explains how Google runs production systems at scale. It is a compilation of articles which balance theory and practice. It is a companion to generalized DevOps theory, showing what a digital native DevOps implementation looks like.

Introduction

Site Reliability Engineering, by Betsy Beyer et al., takes us inside Google, a company that faces the challenge of running technology at global scale. Many companies (particularly digital natives like Facebook, Uber, Netflix, and Airbnb) face the same challenges as Google and can learn from its approach. Although I would have liked more explicit treatment of how the SRE approach makes tradeoffs versus other DevOps implementations, the book is overall an excellent read, particularly for digital native practitioners at scale.

Principles and Practices of SRE

The foreword by Mark Burgess explains that the value of the book lies in the principles it expresses. The practices and case studies help to illustrate these principles.

There's a brazen culture of "just show me the code." [...] where community rather than expertise is championed [...] Google is a company that dared to think about the problems from first principles [...] Implementations are ephemeral, but the documented reasoning is priceless.

--Mark Burgess, author of In Search of Certainty

The text defines SRE (the function and the team):
- SREs are Engineers. People who "apply the principles of computer science and engineering to the design and development of computing systems".
- SREs are concerned with reliability. Quoting Ben Treynor Sloss, originator of the term SRE, "reliability is the most fundamental feature of any product: a system isn't very useful if nobody can use it!" So SREs are concerned with ensuring people can use their services.
- This is not altogether remarkable until you see the approach they take, which emphasizes engineering reliable solutions to existing (or anticipated) problems. "SRE is what happens when you ask engineers to design an operations function."
- This is in contrast to what they term "conventional IT industry practices" that tend to exercise static patterns that rely on manual intervention.
- SREs prioritize engineering systems that replace the manual intervention conventional IT depends on, which the authors term Toil.
It presents the case that Google's SRE approach has significant benefits over conventional IT practices:
- Attracting Top Talent: these people are attracted to solving new and interesting problems. A company that advertises a role involving significant toil will have difficulty competing with Google for talent.
- Lower Cost @ Scale: SRE systems tend to scale sub-linearly while conventional IT maintenance and support systems tend to scale linearly or exponentially.
- Reliability: engineering systems scale more reliably than human interventions, especially for predictable routines and events.

SRE is a DevOps Implementation

DevOps emerged as a movement opposed to the functional divides that arise in large organizations. The book explains how SRE and the DevOps movement are related. The author suggests "One could view SRE as a specific implementation of DevOps with some idiosyncratic extensions".

This point is made simply and succinctly, but given the initial proposal of principles over practice, I would have liked more elaboration on what principles SRE has contributed to the DevOps movement, and where it may diverge. The book avoids contrasting the approach with other DevOps implementations and so leaves out what (if any) tradeoff decision points were reached in the development of the approach.

Is the book a contribution to the field or discipline?

As you might expect, the book is very well written and edited. Although several sections dive exclusively into specific case studies, overall there is a good balance of theory and practical advice.

The book is a window into the nuts and bolts of performing DevOps at scale. As the authors point out, it is most useful as a concrete example of how good principles can be expressed in the context of a scaled digital native. The book's practices and case studies are most immediately useful to people operating in a similar context of large distributed systems with high demand and need for change.

Although many of the topics covered are standard in the DevOps community (e.g. prioritizing recovery metrics over failure metrics), the book also presents some new (to me at least) concepts that are tractable and transferable, such as:

A definition of Toil as well as ways of measuring and managing it

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

Service Level Objectives
Error Budgets
Four Golden Signals
- Latency
- Traffic
- Errors
- Saturation
Game Days (formal practice workshops)
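
To make the relationship between Service Level Objectives and Error Budgets concrete, here is a minimal Python sketch. It assumes a request-based availability SLO, where the error budget is simply the share of requests allowed to fail within a measurement window. The names Window, allowed_failures, and budget_remaining are hypothetical and do not come from the book, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one SLO measurement window (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def allowed_failures(slo_target: float, total_requests: int) -> int:
    """Error budget as a request count: failures tolerated while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # round() absorbs floating-point noise in (1 - slo_target).
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, window: Window) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = allowed_failures(slo_target, window.total_requests)
    if budget == 0:
        # Degenerate case: too few requests for the SLO to tolerate any failure.
        return 0.0 if window.failed_requests else 1.0
    return 1.0 - window.failed_requests / budget


# Illustrative numbers only: a 99.9% SLO over one million requests.
window = Window(total_requests=1_000_000, failed_requests=250)
print(allowed_failures(0.999, window.total_requests))  # 1000
print(budget_remaining(0.999, window))                 # 0.75
```

A negative result would mean the budget is overspent, which, as I read the book, is the signal to slow down releases and put the effort into reliability instead.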

The authors also offer valuable opinions on logging, monitoring, and alerting, which they spend several chapters discussing. My takeaways from these discussions are:

More data is not necessarily better information and often can be counter-productive
Alerting should prompt responders to take an action that requires intelligence
Data should be observed in context (e.g. events over time) and assist us in performing specific analytical tasks
Troubleshooting problems effectively requires a deliberate approach and sufficient preparation and training

I especially liked a section on "Designing at the right level" that described a design approach it terms "agnosticism - writing the software to be generalized to allow myriad data sources as input." It is part of an overall pattern the author suggests, in which cloud software evolves towards abstraction as it scales. This is because reliability can be better guaranteed if the system doesn't rely on concrete predictions about future inputs.

Another section that stood out to me was the "General Principles of SRE as Applied to Data Integrity":

Beginner's Mind
Trust but Verify
Hope is Not a Strategy
Defense in Depth (multi-tiered strategies are better than single-tier strategies)

Conclusion

Site Reliability Engineering is a good read for practitioners and anyone interested in how DevOps is implemented at Google. While the specific practices may become dated quickly, the principles are likely to have broad application for the foreseeable future.