DevOps Handbook by Gene Kim

April 18, 2018

The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, & John Willis presents the goals, principles, culture, and high-priority technical practices of DevOps.

Introduction

The book challenges the notion that DevOps is primarily about automation.

DevOps isn’t about automation, just as astronomy isn’t about telescopes.

The book suggests that DevOps is about culture as much as it is about technology. It is "a manifestation of creating dynamic, learning organizations that continually reinforce high-trust cultural norms".

The Three Ways: Flow, Feedback, and Learning

The book opens with a brief introduction to the history and principles of DevOps. The remainder focuses on the technical, operational, and cultural practices that underpin each of the Three Ways: (1) Flow, (2) Feedback, and (3) Continual Learning and Experimentation.

Here is the authors' definition of each of the three ways:

The First Way enables fast left-to-right flow of work from Development to Operations to the customer. In order to maximize flow, we need to make work visible, reduce our batch sizes and intervals of work, build in quality by preventing defects from being passed to downstream work centers, and constantly optimize for the global goals.

The Second Way enables the fast and constant flow of feedback from right to left at all stages of our value stream. It requires that we amplify feedback to prevent problems from happening again, or enable faster detection and recovery.

The Third Way enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures.

The authors suggest that the "core, chronic conflict between 'doing it right' and 'doing it fast' can be broken with DevOps".

Transformation

The authors suggest that a DevOps Transformation has three ideal phases. Change agents build and expand their coalition and base of support for DevOps by:

Finding Innovators and Early Adopters
Building Critical Mass and Silent Majority
Identifying the Holdouts

In value streams of any complexity, no one person knows all the work that must be performed in order to create value for the customer - especially since the required work must be performed by many different teams, often far removed from each other on the organization charts, geographically, or by incentives.

First Way: Fast Flow (and the Foundations of DevOps)

The authors start by outlining the benefits of Flow (the First Way) and the technical practices that support it.

Foundations of Deployment Pipeline

The book outlines the concept of a Deployment Pipeline, how it differs from the original definition of Continuous Delivery, and the practices that support it:

Enable on-demand Creation of Dev, Test, and Production Environments
Create a single Repository of Truth for the Entire System (all project artifacts needed for creating the product including infrastructure because "there are orders of magnitude more configurable settings in our environment than in our code")
Make Infrastructure Easier to Rebuild than to Repair
Modify our Definition of Development "Done" to Include Running in Production-Like Environments

Test Automatically on Each Small Change, Continuously

The book explains how test automation eliminates queues, reduces batch size, and promotes learning. The authors explain that integration testing is done continuously so that there is an observable connection between cause and effect.

The goal of the deployment pipeline is to provide everyone in the value stream the fastest possible feedback that a change has taken us out of a deployable state.

If we were to batch up changes, say nightly, we would have more difficulty picking out the cause of any failing test. Is it configuration? Code? If multiple teams contribute to the batch, we also have cross-team coordination challenges. All this adds confusion and waste, and creates back-pressure on the flow of working software to production.

A complement to the technical system is a culture of teamwork and trust.

What enables this system to work at Google is engineering professionalism and a high-trust culture that assumes everyone wants to do a good job, as well as the ability to detect and correct issues quickly.

Prioritize Speed Over Robustness

The authors promote testing practices that prioritize speed over robustness, including virtualizing remote services and promoting unit tests over acceptance tests.

Whenever we find an error with an acceptance test, we should create a unit test that could find the error faster, earlier, and cheaper. (p. 132)

A healthy testing pyramid emphasizes fast-running unit tests. Stub out external services for speed and test reliability.
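As a minimal sketch of stubbing out an external service, assume a hypothetical price_with_tax function that looks up a tax rate through an injected client (these names are illustrative and do not come from the book). Python's standard unittest.mock can stand in for the remote call so the unit test stays fast and repeatable:

```python
from unittest.mock import Mock


def price_with_tax(net_price, country, tax_client):
    """Return the gross price, using a (hypothetical) remote tax-rate service."""
    rate = tax_client.get_rate(country)  # in production this would be a network call
    return round(net_price * (1 + rate), 2)


def test_price_with_tax_uses_stubbed_rate():
    # Stub the remote service: no network, fixed answer, fast and repeatable.
    tax_client = Mock()
    tax_client.get_rate.return_value = 0.25

    assert price_with_tax(100.0, "SE", tax_client) == 125.0
    tax_client.get_rate.assert_called_once_with("SE")


test_price_with_tax_uses_stubbed_rate()
```

Passing the client in as a parameter is one design choice among several; it keeps the stub local to the test and avoids patching module globals.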

Perform Exploratory Tests on the Latest Build

The authors note that, in contrast to teams that wait for a code freeze, exploratory testing is best done frequently on the latest build.

We make any build that passes all our automated tests available to use for exploratory testing, as well as for other forms of manual or resource-intensive testing (such as performance testing). We want to do all such testing as frequently as possible and practical, either continually or on a schedule. (p. 134)

Prioritize a Few Reliable Tests Over Many Unreliable Tests

Unreliable tests that generate false positives create significant problems.

A small number of reliable, automated tests are always preferable over a large number of manual or unreliable automated tests.

We start with a small number of reliable automated tests and add to them over time, creating an ever-increasing level of assurance that we will quickly detect any changes to the system that take us out of a deployable state. (p. 136)

Trunk-Based Development

The authors spend a good deal of time discussing and advocating Trunk-Based Development.

The data from Puppet Labs 2015 State of DevOps Report is clear: trunk-based development predicts higher throughput and better stability, and even higher job satisfaction and lower rates of burnout. (p. 151)

Continuous Integration makes Trunk-Based Development a realistic part of every developer's daily work.

The longer developers are allowed to work in their branches in isolation, the more difficult it becomes to integrate and merge everyone's changes back into trunk. In fact, integrating those changes becomes exponentially more difficult as we increase the number of branches and the number of changes in each code branch. (p. 143)

Technical Debt

The authors believe that moving pain leftwards in the process forces developers to confront it. A culture of continuous refactoring pays down technical debt so that it doesn't become a bigger problem.

Technical Debt: when we do not aggressively refactor our codebase, it becomes more difficult to make changes and to maintain over time, slowing down the rate at which we can add new features. Solving this problem was one of the primary reasons behind the creation of continuous integration and trunk-based development practices, to optimize for team productivity over individual productivity. (p. 148)

Deploy Fast

The book then focuses on improving another Lean metric: lead time. The authors emphasize that rapid deployment is one of the key indicators of high-performing teams.

In Puppet Labs 2014 State of DevOps Report, the data showed that high performers had deployment lead times measured in minutes or hours, while the lowest performers had deployment lead times measured in months. (p. 161)
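To make the metric concrete, here is a rough sketch of computing deployment lead time, assuming it is measured from the moment a change is committed to the moment it is running in production. The timestamps are made-up sample data, not figures from the report:

```python
from datetime import datetime
from statistics import median


def lead_times_hours(changes):
    """Return the lead time in hours for each (committed_at, deployed_at) pair."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]


# Hypothetical sample data: (commit time, time running in production).
changes = [
    (datetime(2018, 4, 2, 9, 0), datetime(2018, 4, 2, 10, 30)),
    (datetime(2018, 4, 2, 13, 0), datetime(2018, 4, 2, 13, 45)),
    (datetime(2018, 4, 3, 8, 0), datetime(2018, 4, 3, 12, 0)),
]

hours = lead_times_hours(changes)
print(f"median deployment lead time: {median(hours):.2f} h")
```

The median is used here because a single slow change would otherwise dominate an average; that is a choice of this sketch, not a recommendation taken from the book.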

Decouple Deployment from Releases

We should decouple Deployments from Releases: the former is a technical capability, the latter is a business decision. When we conflate deployment with release, we create unnecessary back-pressure on the fast flow of work, which leads to lower quality and worse business outcomes.

There are multiple ways to do this.

Environment-Based Patterns

One is to use environment-based patterns, which deploy newer versions of code to separate environments. Sending traffic to these newer environments is a business decision (e.g. Canary Releases, Blue-Green Deployments). Leveraging the "Immutable Environment" pattern has the added benefit of making releases much more controlled.
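A canary release can be sketched as a routing decision made per request. The example below is a simplified illustration, assuming two already-deployed environments named stable and canary and a hypothetical choose_environment function; real traffic shifting usually happens in a load balancer or service mesh rather than in application code:

```python
import random


def choose_environment(canary_fraction, rng=random):
    """Route one request: 'canary' with probability canary_fraction, else 'stable'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    return "canary" if rng.random() < canary_fraction else "stable"


# Both versions are already deployed; the fraction is the release decision.
rng = random.Random(42)  # seeded only to make the sketch repeatable
routed = [choose_environment(0.05, rng) for _ in range(10_000)]
print(routed.count("canary") / len(routed))  # close to 0.05
```

Raising the fraction from 0.05 toward 1.0 completes the release, and dropping it to 0.0 rolls it back, in both cases without a new deployment.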

Application-Based Release Patterns

Application-based release patterns, such as Feature Toggles, are an alternative or possible complement to environment-based patterns. The challenge with application-based approaches is added code complexity. The benefit is that they can help get larger features out the door more quickly and avoid long-lived feature branches.
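A feature toggle can be as small as a lookup that guards the new code path. This is a minimal in-memory sketch; the FeatureToggles class, the checkout_total function, and the toggle name are all hypothetical, and a real system would typically read toggles from configuration at runtime:

```python
class FeatureToggles:
    """In-memory toggle store; a real one would read configuration at runtime."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        # Unknown toggles default to off, so unfinished code stays dark.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled


def checkout_total(cart, toggles):
    total = sum(cart)
    if toggles.is_enabled("new_discount_engine"):  # hypothetical toggle name
        total *= 0.9  # new code path: deployed, but released only when toggled on
    return round(total, 2)


toggles = FeatureToggles()
before = checkout_total([20.0, 30.0], toggles)  # feature deployed but dark
toggles.set("new_discount_engine", True)        # the release: no redeploy needed
after = checkout_total([20.0, 30.0], toggles)
```

The extra branch in checkout_total is the added code complexity mentioned above; toggles that have served their purpose should be removed.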

Three Architecture Archetypes

The book outlines three Architecture Archetypes:

Monolithic v1 (all functionality in one application)
Monolithic v2 (sets of monolithic tiers)
Microservice (modular, independent, graph relationship vs tiers, isolated persistence)

The benefits of the Monolithic architecture are faster startup time and initial simplicity. However, these monolithic architecture types have significant drawbacks as soon as the product needs to scale or change.

The consequences of overly tight architectures are easy to spot: every time we attempt to commit code into trunk or release code into production, we risk creating global failures. (p. 180)

The alternative is the Microservice Architecture.

In contrast to a tightly-coupled architecture that can impede everyone's productivity and ability to safely make changes, a loosely-coupled architecture with well-defined interfaces that enforce how modules connect with each other promotes productivity and safety.

Either style, monolithic or microservice, works best in a context where its benefits outweigh its drawbacks.

Monolithic architecture that supports a startup (e.g. rapid prototyping of new features, and potential pivots or large changes in strategies) is very different from an architecture that needs hundreds of teams of developers, each of whom must be able to independently deliver value to the customer. (p. 183)

The authors also note the benefits of aligning teams to services as a product scales on a Microservice architecture.

Strangler Application Pattern

The book explores how teams transform from a legacy monolithic architecture pattern to a new Microservice pattern.

We abstract the old system behind APIs and put new functionality in front of it, built to conform to the new architecture. This pattern ensures a departure from the old architecture by enforcing a loosely coupled relationship between the new services and the old application.

The result is that the old system slowly shrinks in functionality until it eventually disappears.
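The routing idea behind the pattern can be sketched as a facade that sends each request either to a new service or to the legacy application. The class names, path prefixes, and handlers below are hypothetical and only illustrate the shape of the approach:

```python
class LegacyApp:
    """Stand-in for the old monolith, reached only through this narrow API."""

    def handle(self, path):
        return f"legacy:{path}"


class StranglerFacade:
    """Routes each request to a new service if one has taken over the path."""

    def __init__(self, legacy):
        self._legacy = legacy
        self._migrated = {}  # path prefix -> handler for the new service

    def migrate(self, prefix, handler):
        self._migrated[prefix] = handler

    def handle(self, path):
        for prefix, handler in self._migrated.items():
            if path.startswith(prefix):
                return handler(path)
        return self._legacy.handle(path)  # not yet migrated


facade = StranglerFacade(LegacyApp())
facade.migrate("/orders", lambda path: f"orders-service:{path}")

print(facade.handle("/orders/17"))   # served by the new service
print(facade.handle("/invoices/3"))  # still served by the legacy application
```

Each call to migrate moves one more slice of functionality behind a new service, which is how the legacy application shrinks over time.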

The Second Way

The Second Way is about creating a system (and a supporting culture) of feedback loops.

The best performing organizations were much better at diagnosing and fixing service incidents, what Kevin Behr, Gene Kim, and George Spafford called a "culture of causality". (p. 195)

Telemetry

It's not just about having data; it's about having people who are interested in the data and act upon it.

The top two technical practices that enabled fast MTTR:

The use of Version Control by Operations
Having Telemetry and Proactive Monitoring in the Production Environment

The architecture of an effective telemetry model includes these components:

Data collection at the business logic, application, and environment layers
An Event Router Responsible for Storing our Events and Metrics
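The two components above can be illustrated with a toy in-process event router. This is only a sketch under simplifying assumptions: the EventRouter class and the event names are hypothetical, and a production setup would forward events to dedicated logging and metrics systems instead of keeping them in memory:

```python
from collections import defaultdict


class EventRouter:
    """Collects events from every layer and keeps simple per-metric counters."""

    def __init__(self):
        self.events = []                  # raw event log
        self.counters = defaultdict(int)  # metric name -> count

    def emit(self, layer, name, **fields):
        self.events.append({"layer": layer, "name": name, **fields})
        self.counters[f"{layer}.{name}"] += 1


router = EventRouter()
router.emit("business", "order_placed", amount=42.0)
router.emit("application", "login_failed", user="alice")
router.emit("application", "login_failed", user="bob")
router.emit("environment", "disk_usage_high", host="web-1")

print(dict(router.counters))
```

Keeping both the raw events and the derived counters means the same emit call can feed incident investigation and dashboards.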

Learning from Incidents

Telemetry gives us the tools to take a scientific approach to solving production incidents. The book offers sample questions:

What evidence do we have from our monitoring that a problem is actually occurring?
What are the relevant events and changes in our applications and environments that could have contributed to the problem?
What hypotheses can we formulate to confirm the link between the proposed causes and effects?
How can we prove which of these hypotheses are correct and successfully effect a fix?

The value of fact-based problem-solving lies not only in significantly faster MTTR (and better customer outcomes), but also in its reinforcement of the perception of a win/win relationship between Development and Operations. (p. 204)

Creating Launch Guidance

By creating launch guidance, we help ensure that every product team benefits from the cumulative and collective experience of the entire organization, especially Operations.

Launch guidance and requirements will likely include the following:

Defect counts and severity
Type/frequency of pager alerts
Monitoring coverage
System architecture
Deployment process
Production hygiene

Questions we might ask before we launch a feature include:

Does the service generate a significant amount of revenue?
Does the service have high user traffic or have high outage/impairment costs? (i.e. do operational issues risk creating availability or reputation risk?)
Does the service store high-risk data such as Personally Identifiable Information (PII), credit card data, or health records?
Does the service have any other regulatory or contractual compliance requirements associated with it, such as US export regulations, PCI-DSS, HIPAA?

Hypothesis Driven Design

By running A/B experiments and monitoring key performance indicators, we can develop a progressively better understanding of our customers' needs and how our product can satisfy them. This reduces the risk that comes from untested assumptions and maximizes the value of our work.

The period when experimentation has the highest value is during peak traffic seasons. (p. 242)

A template for Hypothesis definition

We Believe...
Will Result in...
We will have Confidence to Proceed When...

The organizational learning that comes from experimentation also gives employees ownership of business objectives and customer satisfaction. (p. 248)

Peer Reviews

Using GitHub Flow:

Descriptively Named Branch
Commits to Branch Locally
Issue a Pull Request (when feedback is needed or feature is ready)
Pull Request is Approved after a peer review and Branch is merged to Main
Deploy to Production

Change and testing controls (like gated approvals) have counter-intuitive results: controls usually produce longer lead times, which weaken the relationship between cause and effect. These controls lead to lower quality, higher risk, and worse results - precisely the opposite of what they set out to achieve. Here, our common-sense approach causes a negative reinforcing loop.

Our goal is to ensure that Development, Operations, and InfoSec are continuously collaborating so that changes we make to our systems will operate reliably, securely, safely, and as designed. (p. 251)

To protect against technical failures, we make our system more fault-tolerant through:

Redundancy
Failover
Comprehensive Testing
Simulation

The Principle of small batch sizes also applies to code reviews. (p. 255)

Guidelines for Code Reviews include:

Everyone has someone to review their changes
Everyone should monitor the commit stream
Define which changes qualify as high risk
Split up changes that are too large to reason about easily

The best code reviews provide feedback in real time. Forms of code review include:

Pair Programming
"Over the Shoulder"
Email Pass-Around
Tool-assisted Code Review (Gerrit, Pull Requests, Crucible)

Pair Programming doesn't allow people to passively opt out of reviewing code. To pair program, you need to understand the code because you are actively contributing to it. Pair Programming results in higher quality code, and because the code review happens in real time, there is no delay.

Eliminate Bureaucracy

There are often processes and meetings that add delay and effort to getting a release to production. Our goal is to eliminate as many of these processes as possible so that we can get our products to customers quickly, and get feedback from those customers to ensure our product is providing increasing value.

Quoting John Allspaw, "Did you have someone review your change? Do you know who the best person to ask is for changes of this type? Did you do everything you absolutely could to assure yourself that this change operates in production as designed? If you did, then don't ask me - just make the change!"

Creating the conditions that enable change implementers to fully own the quality of their changes is an essential part of the high-trust, generative culture we are striving to build.

Third Way: Technical Practices of Learning

This part is about the Third Way (establishing a Learning Culture), and specifically the "institutional rituals that increase safety, continuous improvement, and learning". It recommends four of these practices:

Establish a Just Culture
Inject Production Failures to Create Resilience
Convert Local Discoveries into Global Improvements
Reserve Time to Create Organizational Improvements