If you want to start a good discussion or argument about reliability at work, ask a colleague this question:
"When is architecture more important for the reliability of a service, product, or application? Before it is deployed to production, or afterward?"
“Well, surely,” you say, “if we don’t build the service with reliability in mind, it may not have the right components included to increase stability. It may not have redundancy to improve fault tolerance. Perhaps we will have left out robust retry logic, circuit breakers, or other known patterns for reliable systems.”
But maybe your colleague counters, “Well, I can’t deny that it is important to try to build things right from the beginning. But one thing I’ve learned about reliability is that it is almost never achieved on the first go-around. Even if you have done a phenomenal job at the whiteboard, designing with failure in mind, there are still going to be outages. And while nobody likes outages, if we handle them and the subsequent post-incident review correctly, we can learn a great deal that helps us make a service more reliable in the long term. On top of this, wouldn’t you agree that observability is an iterative process that involves changing what we measure and monitor as we learn more about the system while it is running? All these things would fall under the mantle John Reese and Niall Murphy called 'the wisdom of Production'. And all of these things surely need us to bring to bear all the architecture skills we have to do this right.”
If you are having a really good discussion, this goes back and forth across the table at least a few times. One side notes that “bolting on reliability after the fact” works about as well as “bolting on security after the fact” (that is to say, not well at all). The other side might bring up the lessons we’ve learned from chaos engineering: experiments in a dev or staging environment can be very useful, but they don’t always yield the unique results we get from testing in production.
“But what about the value of continuous integration and continuous delivery (CI/CD) to reliability, trying to catch reliability issues before they get to production?” someone asks. Then, in response: “CI/CD is tremendously useful, but it didn’t catch our last issue because tests for large distributed systems are notoriously hard to get right.” And so on, and so on.
By now you’ve probably come to the same conclusion the people in this argument are bound to reach. Architecture is important in both the pre-production and post-production lifecycle stages. But that conclusion still leaves us in a peculiar spot because we don’t normally think about architecture or the role of an architect after something has been built. We don’t expect the architect who helped us build our house to show up at the doorstep a year later to say “OK, let’s do some more architecting.”
With the applications we build (or purchase) to run, things are different. There we have an expectation that the software will be changed at a much more rapid pace. It will be refactored, it will be enhanced, it will be upgraded. At each of these points, we must apply everything we know from the realm of architecture if we expect the result to be reliable. So let me tell you about one way to settle the debate we’ve been discussing, and also show you a tool that can help with your reliability even as we are squaring that circle.
The Azure Well-Architected Framework
The Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a workload. The framework consists of five pillars of architecture excellence: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security. Incorporating these pillars helps produce high-quality, stable, and efficient cloud architecture.
But there’s that word “architecture” again, basically sitting right in the middle of the name and taunting us with an image of an architect who only participates at the beginning of the lifecycle.
Here’s the key to unlocking this conundrum: For reliability (and the other four pillars) the goal is to work towards and remain in a “well-architected state.”
That’s a state that strives to embody and make use of the best practices and all the accumulated knowledge from architecture meticulously embedded in the Well-Architected Framework. This guidance is meant to be useful to you at all stages of a cloud solution. It is useful to you in the beginning when you are designing your workloads. It is useful to you when you begin your periodic review of the workload as part of the refactoring, scaling, enhancing, or upgrading process. And finally, it can help when the cycle starts anew for the next major version of your workload.
How to get there
Anyone who has worked in the reliability space, even for a short while, knows that while a large body of guidance like the Well-Architected Framework is great, the tricky part is applying that knowledge to your specific workloads and efforts in flight. Just navigating a large document set like the Well-Architected Framework and determining where to start can be a challenge. I’d like to introduce you to a tool that I believe can bridge your ground truth and the guidance we offer. It can serve as our compass to this material.