This is part three of the “Well-Architected in Serverless” series where we decipher the AWS Well-Architected Framework (WAF) pillars and translate them into real-life actions. In this article, we will focus on the AWS WAF Reliability (REL) pillar.



The Reliability Pillar

Unlike the Operational Excellence (OPS) and Security (SEC) pillars, the REL pillar is tradable. You can trade some of its goals to get more out of the remaining two pillars: Cost Optimization (COST) and Performance Efficiency (PERF).

Trading means that you don’t have to go all-in on every one of these pillars. Maybe you want to save money, so you skip global replication, which would make your system more reliable but also more expensive.

The same goes for the PERF pillar. Maybe you want to be as reliable as possible, which can mean waiting for an eventually consistent data store to catch up before you respond to a client. This makes your system more resilient to crashes, but also slower.

The REL pillar is made up of three parts: foundations, change management, and failure management.



Foundations

The foundation of the REL pillar is knowing the quotas and constraints of the services you use. If you make a system unreliable because of a bug, that’s one thing, but if you didn’t know that a service is eventually consistent, you have a greater problem. The same is true if you forget that a service only accepts a certain number of requests per time frame.

Luckily, AWS offers tools that help with this. The AWS Service Quotas console gives you insight into the quotas of each AWS service and can notify you when your systems reach the limits of the services they’re using. AWS Trusted Advisor can also show how much of a service’s quota you have already used.
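To make the idea of watching quotas concrete, here is a minimal Python sketch of the kind of check such tools perform: compare current usage against a quota and flag anything above a warning threshold. The quota names, the numbers, and the 80% threshold are made up for illustration; this is not the Service Quotas API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    name: str      # hypothetical label, not an official quota code
    used: float    # current usage, in the quota's own unit
    limit: float   # the quota value that applies to the account


def utilization(q: QuotaUsage) -> float:
    """Return usage as a fraction of the quota (1.0 means the quota is reached)."""
    if q.limit <= 0:
        raise ValueError(f"quota {q.name!r} must have a positive limit")
    return q.used / q.limit


def quotas_at_risk(quotas, threshold=0.8):
    """Return the names of quotas whose utilization meets or exceeds the threshold."""
    return [q.name for q in quotas if utilization(q) >= threshold]


# Made-up numbers for illustration only.
sample = [
    QuotaUsage("lambda-concurrent-executions", used=850, limit=1000),
    QuotaUsage("api-requests-per-second", used=1200, limit=10000),
]
at_risk = quotas_at_risk(sample)
print(at_risk)
```

In a real setup the usage figures would come from your monitoring data rather than a hard-coded list, and the threshold would depend on how quickly you can react or request a quota increase.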



Managing Foundations

Dashbird integrates with the majority of the popular managed services in AWS to provide alerts and warnings when the usage of a service runs into limits, such as timeouts, throttling, or running out of memory. In addition, developers can implement custom alarms and policies for use cases specific to their environment. Moreover, the platform visualizes service limits so you can easily grasp the state of resource usage and understand capacity and long-term threats to the system.



Change Management

Change management is about anticipating changes to your serverless system. This covers both how customers change their usage patterns and how you change the system’s code.

One example is traffic spikes, which a serverless system usually handles on its own because it scales out automatically. Change management also includes new features you want to deploy and migrations when you change databases.



Staying on top of Change Management

Dashbird gives engineering teams confidence and the ability to iterate quickly. A large factor in this is the reduction of the time it takes to detect and respond to incidents. Dashbird also provides real-time visibility into the inner workings of serverless applications. Developers can use this functionality to monitor the service at critical times and measure the performance, cost, and quality impact of system changes.



Failure Management

Failure management is about what you do when things fail, and they will fail because nothing is forever. Serverless services, especially managed ones, handle much of the low-level failure management out of the box, but this doesn’t mean that everything will keep working indefinitely.

Serverless systems are often event-based and rely heavily on asynchronous communication. In essence, this means that if you send a request to an API, it might not respond with the actual result but only tell you that it accepted your request and will now start to process it. If something goes wrong along the way, you have no direct way of telling the client that sent the request.

To make sure nothing gets lost, you need to keep track of your events. Implement retry logic for your Lambda functions with dead-letter queues and log what went wrong.
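The retry-then-dead-letter idea can be sketched in a few lines of Python. This is a simplified, in-memory illustration: the list of dead letters stands in for a real dead-letter queue, the handler is hypothetical, and there is no backoff between attempts, which a production setup would normally add.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("events")


def process_with_retries(events, handler, max_attempts=3):
    """Run handler on each event; after max_attempts failures, park it in a DLQ.

    Returns (processed, dead_letters). The dead_letters list stands in for a
    real dead-letter queue.
    """
    processed = []
    dead_letters = []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(event))
                break  # success, stop retrying this event
            except Exception as exc:
                # Log what went wrong so the failure can be investigated later.
                log.warning("event %r failed on attempt %d: %s", event, attempt, exc)
        else:
            # Every attempt raised, so keep the event instead of dropping it.
            dead_letters.append(event)
    return processed, dead_letters


def flaky_handler(event):
    """Hypothetical handler: rejects negative numbers, doubles the rest."""
    if event < 0:
        raise ValueError("negative input")
    return event * 2


processed, dead = process_with_retries([1, -2, 3], flaky_handler)
print(processed, dead)
```

The important property is that a failing event ends up somewhere you can inspect and replay it, together with a log entry for each failed attempt, rather than disappearing silently.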



Staying on top of failures

Dashbird helps you monitor SQS queues and provides functionality to set alarms for DLQs.

Dashbird SQS Retry Screenshot



Maintaining reliability

A serverless developer needs a tool that automatically monitors for known and unknown failures across all managed services. The Dashbird platform provides engineering organizations with end-to-end visibility into all monitoring data across cloud-native services (logs, metrics, and traces in one place), combined with automatic failure detection that identifies known and unknown failures as soon as they happen.

Dashbird failure monitoring screenshot



SAL Questions for the Reliability Pillar

There are two serverless-related questions about the REL pillar in the Serverless Applications Lens (SAL). Let’s look into them.



REL 1: How are you regulating inbound request rates?

Your serverless applications will have some kind of entry point, a front door, so to say, where all external data comes into your system. AWS offers different services for this; API Gateway is one, and AppSync is another.

These services, like all the other services you’ll be using downstream, have their limits. Relying on these limits alone can lead to reliability issues. Once your system gets sufficiently complex, it’s not easy to calculate which service will fold first.

That’s why you should set up adequate throttling for API Gateway and AppSync. API Gateway also lets you define usage plans for issued API keys; that way, you can clearly communicate how much a customer can expect from your system.
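To illustrate what throttling at the front door does, here is a minimal token bucket in Python. API Gateway describes its throttling in terms of a steady rate and a burst size, which is the token bucket model; the class below is only a sketch of that model with made-up numbers, not how the service is implemented or configured.

```python
class TokenBucket:
    """Token bucket: `rate` tokens are added per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (in seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, but never above the burst size.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a front door would answer HTTP 429 Too Many Requests


bucket = TokenBucket(rate=2, burst=3)
# Five requests arrive at t=0, then one more a second later.
results = [bucket.allow(0.0) for _ in range(5)] + [bucket.allow(1.0)]
print(results)  # [True, True, True, False, False, True]
```

The burst size absorbs short spikes, while the rate caps sustained load; both values should be chosen with the weakest downstream service in mind.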

It’s also crucial to use Lambda’s concurrency controls because Lambda can scale faster than most services. If you integrate with a non-serverless service and your Lambda function suddenly scales up to thousands of concurrent invocations, the effect on that service will be like a distributed denial-of-service (DDoS) attack.
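The effect of a concurrency cap can be shown with a small Python sketch. In Lambda you would typically set this as reserved concurrency on the function rather than in code; the semaphore-based limiter and the `call_downstream` function below are hypothetical stand-ins that only demonstrate the principle of never letting more than a fixed number of calls reach a slower service at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Allow at most `limit` calls to run at the same time."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous calls observed

    def run(self, fn, *args):
        with self._slots:  # blocks while `limit` calls are already in flight
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._active -= 1


def call_downstream(n):
    """Hypothetical stand-in for a call to a slow, non-serverless service."""
    time.sleep(0.01)
    return n * n


limiter = ConcurrencyLimiter(limit=2)
# Eight workers compete, but only two calls reach the downstream at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda n: limiter.run(call_downstream, n), range(10)))
print(results, limiter.peak)
```

Callers beyond the limit wait in this sketch; a managed service would instead throttle or queue the excess invocations, which is why throttling at the entry point and retries with dead-letter queues belong together.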



REL 2: How are you building resiliency into your serverless application?

The main lever for increasing resiliency is decoupling of logic and responsibility between resources and designing the system to handle failures on its own. In most use cases, as much as possible should be made asynchronous. This is a great post outlining the design principles for building resilience into serverless applications.

In addition to system design, it’s important to have tools and processes to measure and track system activity and to get notified of unexpected events within reasonable time windows. No system will be 100% resilient or able to recover from every failure. Engineering teams building on serverless should test their systems against different failure scenarios, make continuous improvements, learn from past incidents, and strive to develop the best processes and tools to respond to incidents.



Summary

The REL pillar is all about designing your system in a way that won’t break down. Learn about service quotas and limits. Sometimes a service sounds like just what you need, until you read that it can’t handle more than 1000 requests per second. Throttle your system’s entry points so clients can’t overload downstream services, and give customers clear answers on what they can expect from your system.

Also, keep everything monitored. The inherent asynchronicity of serverless systems makes them less straightforward to debug when something has gone wrong; this means you need a way to get notified when things go out of bounds so you can react quickly. This also means you need logging data to evaluate what has gone wrong after an incident.
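One common way to make asynchronous flows easier to debug is to attach a correlation ID to each event and include it in every log line. The following Python sketch shows the idea; the field names and the `accept` and `charge` steps are hypothetical and only serve as an illustration.

```python
import json
import uuid


def new_event(payload):
    """Wrap a payload with a correlation ID that travels with the event."""
    return {"correlation_id": str(uuid.uuid4()), "payload": payload}


def log_line(event, step, status, **details):
    """Build one structured (JSON) log line for a processing step."""
    record = {
        "correlation_id": event["correlation_id"],
        "step": step,
        "status": status,
        **details,
    }
    return json.dumps(record, sort_keys=True)


event = new_event({"order": 42})
lines = [
    log_line(event, step="accept", status="ok"),
    log_line(event, step="charge", status="error", reason="timeout"),
]
for line in lines:
    print(line)
```

Because every line carries the same ID, you can search your logs for it after an incident and reconstruct what happened to a single request across several functions and queues.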

You can find out more about building complex, Well-Architected serverless architectures in our recent webinar with Tim Robinson (AWS):