Ever wonder why you almost never hear about Facebook, Amazon, Netflix, or Google suffering from major outages or incidents, even though they deploy astronomically more code than other organizations? It’s an especially good question when you consider that we’re at an inflection point where enterprises are struggling to move quickly, build new apps, and keep up with this leading class of next-gen companies, all the while trying to cope with a constant fear of downtime. It’s no secret that a single bad code push can cripple an entire organization and that even seconds of downtime can hurt both the brand and the bottom line.
So what separates the Facebooks of the world from everyone else? While DevOps and agile methodologies play a role in enabling these web-scale companies to move fast, the real key to maintaining quality is a lesser-known strategy called “Reversibility.” Fundamentally, Reversibility removes the spectre of a bad customer experience and frees engineers to develop and deploy applications more aggressively. In 2015, Kent Beck, a Facebook programmer, published a post, “Taming Complexity with Reversibility,” that sheds light on this crucial secret and the principles behind it.
What is Reversibility?
Reversibility is the ethos of a safety net for modern development and operations. To update Mark Zuckerberg’s famous quote, today’s product teams have to move fast without breaking things. I like to think of it as a magic undo button. Most next-gen organizations have it baked into their software development processes in some way or another, even if they don’t necessarily have a name for it or implement it end-to-end. Every time you hear about microservices design, continuous integration and deployment, feature flags, or canary releases, think “Reversibility.”
As Beck notes in his post on the topic, Reversibility serves as a way to combat one form of complexity, and complexity is the enemy of scaling systems of any kind. While Beck does justice to the various forms of complexity in his post, for our purposes, we’re going to look specifically at dealing with irreversibility, which he describes as: “When the effects of decisions can’t be predicted and they can’t be easily undone, decisions grow prohibitively expensive.”
The main idea behind Reversibility is that big, immutable projects and pushes carry less risk when they’re sliced into smaller pieces that can be rolled back, changed, fixed, and even redeployed before they inflict damage. It’s a philosophy and a cultural decision that should be reflected in everything from the teams you build to the software you create, from how you develop your monitoring systems to how you design and implement your APIs.
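To make the “magic undo button” concrete, here is a minimal sketch of a feature flag in Python. Everything in it is hypothetical and for illustration only: the `FeatureFlags` class, the `new_banner` flag name, and the two render functions are assumptions, and a real system would typically keep its flags in a shared configuration service rather than in memory. The point is only that the new code path ships dark and can be switched off again without a redeploy.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store (assumption: real flags live in a config service)."""
    enabled: Set[str] = field(default_factory=set)

    def turn_on(self, name: str) -> None:
        self.enabled.add(name)

    def turn_off(self, name: str) -> None:
        # The "undo button": switching the flag off reverts behavior
        # without rolling back or redeploying any code.
        self.enabled.discard(name)

    def is_on(self, name: str) -> bool:
        return name in self.enabled


def render_old_banner() -> str:
    return "Welcome back!"


def render_new_banner() -> str:
    return "Welcome back! Check out what's new."


def render_banner(flags: FeatureFlags) -> str:
    # Both code paths are deployed; the flag decides which one users see.
    if flags.is_on("new_banner"):
        return render_new_banner()
    return render_old_banner()


flags = FeatureFlags()
before = render_banner(flags)      # old path, flag is off by default
flags.turn_on("new_banner")
during = render_banner(flags)      # new path is live
flags.turn_off("new_banner")       # something looks wrong: undo
after = render_banner(flags)       # back to the old path
```

The design choice worth noticing is that the risky change is decoupled from the deployment itself: the push becomes a small, reversible decision rather than an all-or-nothing event.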
Visibility Reduces Risk
Most traditional software development processes are focused on limiting or managing complexity, with an overall goal of identifying as much risk up front as possible. That is a time-consuming and painstaking process, and it still doesn’t cover all your bases. Moreover, the highly elastic, widely distributed, and scale-out nature of modern applications makes it nearly impossible to know the impact dependencies will have on performance prior to deployment. And, even if you do know the probability of an issue prior to release, chances are that probability can only be applied to part of the infrastructure, architecture, or software, not your entire service. Reversibility gives organizations the confidence to get creative and try new things with less fear of error or harming the customer’s experience.
In this way, developing methods that support Reversibility becomes more important than ever for any organization building or scaling a cloud application. This is doubly true for architectures that incorporate containers and microservices, whose instances often live for only hours or even minutes. If you want the speed and agility that come with a modern runtime environment, a Reversibility strategy serves as a buffer and a safeguard.
We should also recognize that the Reversibility safety net is not only in service of moving faster and becoming more flexible. Reversibility, paired with the right level of visibility, is also a requirement for supporting today’s elastic cloud environments. The scalable, ephemeral nature of the cloud means that problems will inevitably happen sooner rather than later, and a plan for identifying, isolating, and acting on performance issues is key to maintaining your operations. Specifically, visibility without the option to reverse a troublesome deployment means that you’ll suffer in the knowledge that a mistake is wreaking havoc on your end users’ experience. Reversibility without timely and relevant insight means that, when you hear from a customer or your supervisor that there’s a problem, it may take some trial and error before you know what to roll back.
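The pairing of visibility and Reversibility described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a description of how any particular company does it: the `Deployment` class, the `check_and_revert` helper, and the 5% error-rate threshold are all hypothetical. The monitoring signal (visibility) tells you that the new version is misbehaving, and the rollback (Reversibility) gives you something to do about it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    """Hypothetical record of which version is live and which one came before it."""
    live_version: str
    previous_version: str

    def roll_back(self) -> None:
        # Swap back to the last known-good version.
        self.live_version, self.previous_version = (
            self.previous_version,
            self.live_version,
        )


def error_rate(outcomes: List[bool]) -> float:
    """Fraction of failed requests; each entry is True if the request succeeded."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def check_and_revert(
    deployment: Deployment,
    outcomes: List[bool],
    max_error_rate: float = 0.05,  # assumed threshold, chosen only for illustration
) -> bool:
    """Roll back if the observed error rate exceeds the threshold.

    Returns True if a rollback happened.
    """
    if error_rate(outcomes) > max_error_rate:
        deployment.roll_back()
        return True
    return False


deployment = Deployment(live_version="v2", previous_version="v1")

# Healthy window: 1 failure in 100 requests, so the new version stays live.
healthy = [True] * 99 + [False]
reverted_when_healthy = check_and_revert(deployment, healthy)

# Unhealthy window: 2 failures in 10 requests, so the deployment is reversed.
unhealthy = [True] * 8 + [False] * 2
reverted_when_unhealthy = check_and_revert(deployment, unhealthy)
```

Take away either half and the sketch stops being useful: without `error_rate` there is no trigger for the rollback, and without `roll_back` the error rate is just bad news.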
Resolving the Agility-Stability Paradox
Reversibility isn’t so much a concrete practice as an ideology supported by the right tools and processes, and it goes hand in hand with the goals many enterprises are pursuing: becoming more user-centric, innovating to create value, and responding faster to the market. If you’re an organization that’s embracing DevOps, building and orchestrating a containerized architecture with Docker or Mesos, or developing applications with services like Elasticsearch and Kafka, you’ve hopefully already subscribed to a Reversibility ethos.
In so many ways, Reversibility is the answer to overcoming the agility-stability paradox. We all want to move at breakneck speed to stay competitive and identify new revenue streams. But we also don’t want to crash and burn, leaving a trail of lost customers and profits along the way. Reversibility enables the latest development practices and facilitates digital transformation. It makes modern technology easier to adopt and helps enterprises implement new strategies with minimal operational risk.