“For two years now, my ongoing Advancing Reliability blog series has highlighted initiatives and investments underway to keep improving the reliability of our Azure platform and services. But what about your own Azure applications that run on top of these—how should you think about establishing and improving the reliability of your own architecture? We created the Azure Well-Architected Framework to help improve the quality of your workloads, and reliability is one of its five core pillars so for the latest post in our series, I have asked Cloud Advocate David Blank-Edelman to run through how best to approach using the framework to guide your conversations and design decisions in this space.”—Mark Russinovich, CTO, Azure
If you want to start a good discussion or argument about reliability at work, ask a colleague this question:
“When is architecture more important for the reliability of a service, product, or application? Before it is deployed to production, or afterward?”
Well, “surely”—you say—“if we don’t build the service with reliability in mind, it may not have the right components included to increase stability. It may not have redundancy to improve fault tolerance. Perhaps we will have left out robust retry logic, circuit breakers, or other known patterns for reliable systems.”
But maybe your colleague counters, “Well, I can’t deny that it is important to try to build things right from the beginning. But one thing I’ve learned about reliability is that it is almost never achieved on the first go-around. Even if you have done a phenomenal job at the whiteboard, designing with failure in mind, there are still going to be outages. And while nobody likes outages, if we handle them and a subsequent post-incident review correctly, we can learn a great deal that helps us make a service more reliable in the long term. On top of this, wouldn’t you agree that observability is an iterative process that involves changing what we measure and monitor as we learn more about the system while it is running? All these things would fall under the mantle John Reese and Niall Murphy called ‘the wisdom of Production’. And all of these things surely need us to bring to bear all the architecture skills we have to do this right.”
If you are having a really good discussion, this goes back and forth across the table at least a few times. One side notes that “bolting on reliability after the fact” works about as well as “bolting on security after the fact” (that is to say, not well at all). The other side might bring up the lessons we’ve learned from chaos engineering showing us that experiments on a dev or staging environment can be very useful, but they don’t always yield some of the unique results we get from testing in production.
Someone asks, “But what about the value of continuous integration and continuous delivery (CI/CD) to reliability, trying to catch reliability issues before they get to production?” Then comes the response: “CI/CD is tremendously useful, but it didn’t catch our last issue because tests for large distributed systems are notoriously hard to get right.” And so on, and so on.
By now you’ve probably come to the same conclusion the people in this argument are bound to reach. Architecture is important in both the pre-production and post-production lifecycle stages. But that conclusion still leaves us in a peculiar spot because we don’t normally think about architecture or the role of an architect after something has been built. We don’t expect the architect who helped us build our house to show up at the doorstep a year later to say “OK, let’s do some more architecting.”
With the applications we build (or purchase) to run, things are different. There we have an expectation that the software will be changed at a much more rapid pace. It will be refactored, it will be enhanced, it will be upgraded. At each of these points, we must apply everything we know from the realm of architecture if we expect the result to be reliable. So let me tell you about one way to settle the debate we’ve been discussing, and also show you a tool that can help with your reliability even as we are squaring that circle.
The Azure Well-Architected Framework
The Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a workload. The framework consists of five pillars of architecture excellence: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security. Incorporating these pillars helps produce high-quality, stable, and efficient cloud architecture.
But there’s that word “architecture” again, basically sitting right in the middle of the name and taunting us with an image of an architect who only participates at the beginning of the lifecycle.
Here’s the key to unlocking this conundrum: For reliability (and the other four pillars) the goal is to work towards and remain in a “well-architected state.”
That’s a state that strives to embody and make use of the best practices and all the accumulated knowledge from architecture meticulously embedded in the Well-Architected Framework. This guidance is meant to be useful to you at all stages of a cloud solution. It is useful to you in the beginning when you are designing your workloads. It is useful to you when you begin your periodic review of the workload as part of the refactoring, scaling, enhancing, or upgrading process. And finally, it can help when the cycle starts anew for the next major version of your workload.
How to get there
Anyone who has worked in the reliability space, even for a short while, knows that while a large body of guidance like the Well-Architected Framework is great, the tricky part is applying that knowledge to your specific workloads and efforts in flight. Just navigating a large document set like the Well-Architected Framework and determining where to start can be a challenge. I’d like to introduce you to a tool that I believe can bridge your ground truth and the guidance we offer. It can serve as our compass to this material.
That tool is the Azure Well-Architected Review.
The Well-Architected Review is a self-guided assessment tool that will walk you through the Well-Architected Framework reliability pillar and the other four Well-Architected Framework pillars. This is a great process to go through by yourself, with your friendly neighborhood Cloud Solutions Architect, or with supporting partners. It will ask you a set of questions about your reliability efforts. Then, based on your responses, it offers suggestions on areas to focus on, with direct links to our Well-Architected Framework documentation on those areas.
[Screenshot: an example set of Well-Architected Review results]
Let me offer a few tips for getting the most out of the Well-Architected Review that might not be obvious at first look:
1. Pay attention to the questions: You might think the results of the review are the biggest reward, but I’m here to tell you that the most valuable thing you may be able to take away from the review is the questions. Reliability can be a tricky area to tackle because there are so many possible ways to begin working on it, so many different places to start. Just knowing which questions to ask can be difficult. The Well-Architected Review can give you those questions.
2. Return to the review again and again: If you sign in to the review platform with your Microsoft credentials, you can save the results. This means that in six months, or whenever you feel ready to conduct another review, you will be able to compare your new results with your previous ones. This can be tremendously helpful for judging your progress across each pillar.
3. Share the results with your team: One thing many people don’t know about the Well-Architected Review is that if you have signed in (see tip number two above), it will allow you to export your results as a Microsoft PowerPoint presentation. Take this draft, customize it, and you now have a ready-made presentation to take to your next team meeting so everyone can get behind your reliability efforts.
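If you keep your own notes from each review, the comparison described in tip two can be as simple as a few lines of Python. The sketch below is only an illustration: it assumes you have written down a numeric score per pillar by hand, the scores shown are made up, and names such as `compare_reviews` and `PillarChange` are hypothetical. It does not use or represent any interface of the Well-Architected Review itself.

```python
from dataclasses import dataclass

# The five pillars named in the Well-Architected Framework.
PILLARS = (
    "Cost Optimization",
    "Operational Excellence",
    "Performance Efficiency",
    "Reliability",
    "Security",
)


@dataclass(frozen=True)
class PillarChange:
    """Score movement for one pillar between two reviews (hypothetical helper)."""
    pillar: str
    previous: int
    current: int

    @property
    def delta(self) -> int:
        return self.current - self.previous


def compare_reviews(previous: dict[str, int], current: dict[str, int]) -> list[PillarChange]:
    """Return one PillarChange per pillar, largest regression first."""
    changes = [PillarChange(p, previous[p], current[p]) for p in PILLARS]
    return sorted(changes, key=lambda change: change.delta)


# Made-up scores recorded by hand after two reviews, six months apart.
earlier = {
    "Cost Optimization": 60,
    "Operational Excellence": 45,
    "Performance Efficiency": 70,
    "Reliability": 40,
    "Security": 65,
}
later = {
    "Cost Optimization": 62,
    "Operational Excellence": 55,
    "Performance Efficiency": 66,
    "Reliability": 58,
    "Security": 65,
}

for change in compare_reviews(earlier, later):
    print(f"{change.pillar:<24} {change.previous:>3} -> {change.current:>3} ({change.delta:+d})")
```

Sorting by the change in score puts any pillar that slipped at the top of the list, which is usually the first thing worth raising with the team.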
The Well-Architected Framework in action
If you would like to see some examples of the Well-Architected Framework in action, including some excellent sessions about reliability, I encourage you to check out the videos in the Well-Architected series of our Azure Enablement show. There’s some good online course material about the subject in Microsoft Learn, and guiding principles in our documentation. If you want to dive deeper into the architecture side of Well-Architected, I recommend checking out the Azure Architecture Center.
I look forward to hearing how your discussions about reliability and architecture go.