# Reliability

All the other ISE Engineering Fundamentals work towards a more reliable infrastructure. Automated integration and deployment ensure code is properly tested, and help remove human error, while slow releases build confidence in the code. Observability helps more quickly pinpoint errors when they arise to get back to a stable state, and so on.

However, there are some additional steps we can take that don't neatly fit into the previous categories to help ensure a more reliable solution. We'll explore these below.

## Remove "Foot-Guns"

Prevent your dev team from shooting themselves in the foot. People make mistakes; any mistake made in production is not the fault of that person; it's the collective fault of the system for not preventing that mistake from happening.

Check out the list below for some common tooling to remove these foot-guns:

* In Kubernetes, leverage [Admission Controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) to prevent "bad things" from happening.
    * You can create custom controllers using the Webhook Admission controller (see the sketch below).
    * [Gatekeeper](https://github.com/open-policy-agent/gatekeeper) is a pre-built Webhook Admission controller, leveraging [OPA](https://github.com/open-policy-agent/opa) under the hood, with support for some [out-of-the-box protections](https://github.com/open-policy-agent/gatekeeper-library/tree/master/library).
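
To make the custom-webhook option concrete, here is a minimal sketch of a validating admission webhook written as an ASP.NET Core minimal API that rejects Pods whose containers use the `:latest` image tag. The `/validate` route, the `:latest` rule, and the hosting details are illustrative assumptions; only the `admission.k8s.io/v1` AdmissionReview request/response shape comes from Kubernetes.

```csharp
// Sketch of a validating admission webhook (ASP.NET Core minimal API, .NET 6+).
// Kubernetes POSTs an admission.k8s.io/v1 AdmissionReview to the webhook and
// expects an AdmissionReview back with response.allowed set. The ":latest"
// rule and the /validate route are illustrative.
using System.Linq;
using System.Text.Json.Nodes;

var app = WebApplication.CreateBuilder(args).Build();

app.MapPost("/validate", async (HttpRequest request) =>
{
    var review = JsonNode.Parse(await new StreamReader(request.Body).ReadToEndAsync());
    var admissionRequest = review?["request"];

    // Inspect the Pod spec and flag any container image pinned to ":latest".
    var containers = admissionRequest?["object"]?["spec"]?["containers"]?.AsArray();
    var usesLatest = containers?.Any(c =>
        c?["image"]?.GetValue<string>()?.EndsWith(":latest") == true) ?? false;

    return Results.Json(new
    {
        apiVersion = "admission.k8s.io/v1",
        kind = "AdmissionReview",
        response = new
        {
            uid = admissionRequest?["uid"]?.GetValue<string>(),
            allowed = !usesLatest,
            status = new { message = usesLatest ? "Container images must not use the :latest tag." : "OK" }
        }
    });
});

app.Run();
```

In a real cluster this sits behind TLS and is registered via a `ValidatingWebhookConfiguration`; a pre-built controller like Gatekeeper saves you from writing and operating that plumbing yourself.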

If a user ever makes a mistake, don't ask "How could somebody possibly do that?" Instead, ask "How can we prevent this from happening in the future?"

## Autoscaling

Whenever possible, leverage autoscaling for your deployments. Vertical autoscaling can scale your VMs by tuning parameters like CPU, disk, and RAM, while horizontal autoscaling can tune the number of running images backing your deployments. Autoscaling can help your system respond to inorganic growth in traffic, and prevent failing requests due to resource starvation.

> Note: In environments like K8s, both horizontal and vertical autoscaling are offered as a native solution. The VMs backing each Pod, however, may also need autoscaling to handle an increase in the number of Pods.

It should also be noted that the parameters that affect autoscaling can be difficult to tune. Typical metrics like CPU or RAM utilization, or request rate, may not be enough. Sometimes you might want to consider custom metrics, like cache eviction rate.
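
For intuition, most horizontal autoscalers (the Kubernetes HPA included) apply a simple proportional rule: scale the replica count by the ratio of the observed metric to its target. The sketch below applies that rule to a hypothetical custom metric, cache evictions per second per replica; the metric and the target value of 100 are assumptions for illustration.

```csharp
// Proportional scaling rule used by most horizontal autoscalers:
//   desiredReplicas = ceil(currentReplicas * observedMetric / targetMetric)
// The "cache evictions per second" metric and its target are illustrative.
Console.WriteLine(DesiredReplicas(currentReplicas: 4, observedEvictionRate: 150, targetEvictionRate: 100)); // 6

static int DesiredReplicas(int currentReplicas, double observedEvictionRate, double targetEvictionRate)
    => (int)Math.Ceiling(currentReplicas * observedEvictionRate / targetEvictionRate);
```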

## Load Shedding & DoS Protection

Often, we think of Denial of Service (DoS) attacks as an act from a malicious actor, so we place some load shedding at the gates to our system and call it a day. In reality, many DoS attacks are unintentional and self-inflicted. A bad deployment that takes down a cache results in hammering downstream services. Polling from a distributed system synchronizes and results in a [thundering herd](https://en.wikipedia.org/wiki/Thundering_herd_problem). A misconfiguration results in an error, which triggers clients to retry uncontrollably. Requests are appended to a stored object until it is so big that future reads crash the server. The list goes on.

Follow these steps to protect yourself:

* Add jitter (a random delay) to any action that occurs from a non-user-triggered flow (i.e., add a random duration to the sleep in a cron or job that continuously polls a downstream service).
* Implement [exponential backoff retry policies](https://en.wikipedia.org/wiki/Exponential_backoff) in your client code (see the sketch after this list).
* Add load shedding to your servers (yes, your internal microservices too).
    * This can be configured easily when leveraging a sidecar like Envoy.
* Be careful when deserializing user requests, and use buffer limits.
    * i.e., HTTP/gRPC servers can set limits on how much data will be read from the socket.
* Set alerts for utilization, servers restarting, or going offline to detect when your system may be failing.
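
As a sketch of the backoff and buffer-limit advice, the snippet below uses [Polly](https://github.com/App-vNext/Polly) (which this playbook also recommends below for circuit breaking) with an exponential delay plus a random jitter, and caps how much response data the client will buffer. The retry count, delays, buffer size, and downstream URL are illustrative assumptions.

```csharp
using Polly;

// Exponential backoff plus jitter: synchronized clients back off at slightly
// different times instead of retrying a struggling dependency in lockstep.
// Retry count, base delay, jitter range, and buffer size are illustrative.
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: attempt =>
            TimeSpan.FromSeconds(Math.Pow(2, attempt))                  // 2s, 4s, 8s, ...
            + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1_000))); // jitter

using var http = new HttpClient
{
    // Buffer limit: cap how much response data this client will read (bytes).
    MaxResponseContentBufferSize = 1_000_000
};

var response = await retryPolicy.ExecuteAsync(
    () => http.GetAsync("https://downstream.internal.example/status"));
```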

These types of errors can result in Cascading Failures, where a non-critical portion of your system takes down the entire service. Plan accordingly, and make sure to put extra thought into how your system might degrade during failures.
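
Load shedding on the server side can be as simple as a concurrency gate that fails fast instead of queueing requests until the process tips over. The snippet below is a hand-rolled ASP.NET Core middleware sketch to illustrate the idea (the limit of 100 concurrent requests is an assumption); a sidecar like Envoy or a dedicated rate-limiting component is usually the better place for this in production.

```csharp
// Hand-rolled load shedding (ASP.NET Core, .NET 6+ minimal hosting): once more
// than `maxConcurrent` requests are in flight, shed new ones with 503 instead
// of queueing until the service falls over. The limit is illustrative.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

const int maxConcurrent = 100;
var gate = new SemaphoreSlim(maxConcurrent);

app.Use(async (context, next) =>
{
    if (!await gate.WaitAsync(TimeSpan.Zero))            // don't queue; fail fast
    {
        context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
        context.Response.Headers["Retry-After"] = "1";   // hint clients to back off
        return;
    }

    try { await next(context); }
    finally { gate.Release(); }
});

app.MapGet("/", () => "ok");
app.Run();
```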

## Backup Data

Data gets lost, corrupted, or accidentally deleted. It happens. Take data backups to help get your system back up online as soon as possible. It can happen in the application stack, with code deleting or corrupting data, or at the storage layer by losing the volumes or losing encryption keys.

Consider things like:

* How long will it take to restore data?
* How much data loss can you tolerate?
* How long will it take you to notice there is data loss?

Look into the difference between **snapshot** and **incremental** backups. A good policy might be to take incremental backups on a period of N, and a snapshot backup on a period of M (where N < M).
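
A minimal sketch of what that policy means at restore time: recovery starts from the most recent snapshot taken before the failure and then replays every incremental captured after it, so the worst-case data loss is roughly one incremental period N. The hourly/daily cadence in the example is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Restore = newest snapshot at or before the failure, plus every incremental
// taken after that snapshot, replayed in order. With incrementals every N
// (e.g. hourly) and snapshots every M (e.g. daily), worst-case loss is ~N.
var backups = new List<Backup>
{
    new(DateTimeOffset.Parse("2024-01-01T00:00:00Z"), IsSnapshot: true),   // daily snapshot
    new(DateTimeOffset.Parse("2024-01-01T01:00:00Z"), IsSnapshot: false),  // hourly incrementals
    new(DateTimeOffset.Parse("2024-01-01T02:00:00Z"), IsSnapshot: false),
};

foreach (var step in RestorePlan(backups, failureTime: DateTimeOffset.Parse("2024-01-01T02:30:00Z")))
    Console.WriteLine($"{(step.IsSnapshot ? "snapshot" : "incremental")} @ {step.TakenAt:u}");

static List<Backup> RestorePlan(List<Backup> backups, DateTimeOffset failureTime)
{
    var usable = backups.Where(b => b.TakenAt <= failureTime)
                        .OrderBy(b => b.TakenAt)
                        .ToList();
    var baseline = usable.LastOrDefault(b => b.IsSnapshot)
        ?? throw new InvalidOperationException("No snapshot available to restore from.");

    var plan = new List<Backup> { baseline };
    plan.AddRange(usable.Where(b => !b.IsSnapshot && b.TakenAt > baseline.TakenAt));
    return plan;
}

record Backup(DateTimeOffset TakenAt, bool IsSnapshot);
```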

## Target Uptime & Failing Gracefully

It's a known fact that systems cannot target 100% uptime. There are too many factors in today's software systems to achieve this, many outside of our control. Even a service that never gets updated and is 100% bug-free will fail. Upstream DNS servers have issues all the time. Hardware breaks. Power goes out, and backup generators fail. The world is chaotic. Good services target a number of "9's" of uptime. For example, 99.99% uptime means that the system has a "budget" of 4 minutes and 22 seconds of downtime each month. Some months might achieve 100% uptime, which means that the budget gets rolled over to the next month. What uptime means is different for everybody, and it is up to the service to define.
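
The arithmetic behind the budget is just the complement of the target applied to the period, as this small sketch shows (the 30-day month is an assumption; calendar months vary, which is why quoted figures differ by a few seconds):

```csharp
// Downtime budget = (1 - SLO) * period.
static TimeSpan DowntimeBudget(double slo, TimeSpan period)
    => TimeSpan.FromTicks((long)(period.Ticks * (1 - slo)));

Console.WriteLine(DowntimeBudget(0.9999, TimeSpan.FromDays(30))); // ~4 minutes 19 seconds
Console.WriteLine(DowntimeBudget(0.999,  TimeSpan.FromDays(30))); // ~43 minutes
```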

A good practice is to use any leftover budget at the end of the period (i.e., year, quarter) to intentionally take that service down and ensure that the rest of your systems fail as expected. Often, other engineers and services come to rely on that additional achieved availability, and it can be healthy to ensure that systems fail gracefully.

We can build graceful failure (or graceful degradation) into our software stack by anticipating failures. Some tactics include:

* Failover to healthy services
    * [Leader Election](https://en.wikipedia.org/wiki/Leader_election) can be used to keep healthy services on standby in case the leader experiences issues.
    * An entire-cluster failover can redirect traffic to another region or availability zone.
* Propagate downstream failures of **dependent services** up the stack via health checks, so that your ingress points can re-route to healthy services.
* **Circuit breakers** can bail early on requests vs. propagating errors throughout the system.
Consider using a well-known, tested library such as [Polly](https://github.com/App-vNext/Polly) (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns. See Microsoft’s [Resilient Coding Patterns](https://github.com/microsoft/resilient-coding-patterns) for concrete implementations of patterns like retries, timeouts, circuit breakers, bulkheads, and load shedding.
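
For example, a minimal sketch of a retry policy wrapped around a circuit breaker with Polly might look like the following; the thresholds, break duration, and downstream URL are illustrative assumptions.

```csharp
using Polly;

// Break the circuit after 5 consecutive handled failures and bail early for
// 30 seconds, so callers stop hammering a dependency while it recovers.
var circuitBreaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30));

// Retry transient failures with exponential backoff around the breaker.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Outermost policy runs first: retry wraps the circuit breaker.
var resilient = Policy.WrapAsync(retry, circuitBreaker);

using var http = new HttpClient();
var response = await resilient.ExecuteAsync(
    () => http.GetAsync("https://orders.internal.example/api/orders"));
```

Because the retry policy does not handle `BrokenCircuitException`, calls bail out immediately while the circuit is open rather than piling more load onto the failing dependency.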

## Practice

No software service is complete without playbooks to navigate the developers through a failure.

### Run Maintenance Exercises

Take the time to fabricate scenarios and run a D&D-style campaign to solve your issues. This can be as elaborate as spinning up a new environment and injecting errors, or as simple as asking the "players" to navigate to a dashboard and describing what they would see in the fabricated scenario (small amounts of imagination required). The playbooks should **easily** navigate the user to the correct solution/mitigation. If not, update your playbooks.

### Chaos Testing

Leverage automated chaos testing to see how things break. You can read this playbook's [article on fault injection testing](../automated-testing/fault-injection-testing/README.md) for more information on developing a hypothesis-driven suite of automated chaos tests. The following list of chaos testing tools, as well as [this section in the article linked above](../automated-testing/fault-injection-testing/README.md#chaos), has more details on available platforms and tooling for this purpose:

* [Azure Chaos Studio](https://learn.microsoft.com/en-US/azure/chaos-studio/chaos-studio-overview) - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.
* [Chaos toolkit](https://chaostoolkit.org/) - A declarative, modular chaos platform with many extensions, including the [Azure actions and probes kit](https://github.com/chaostoolkit-incubator/chaostoolkit-azure).
* [Kraken](https://github.com/openshift-scale/kraken) - An OpenShift-specific chaos tool, maintained by Red Hat.
* [Chaos Monkey](https://github.com/netflix/chaosmonkey) - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).
* Many service meshes, like [Linkerd](https://linkerd.io/2/features/fault-injection/), offer fault injection tooling through the use of their sidecars.
* [Chaos Mesh](https://github.com/chaos-mesh/chaos-mesh) - A chaos engineering platform for Kubernetes.
* [Simmy](https://github.com/Polly-Contrib/Simmy) - A .NET library for chaos testing and fault injection integrated with the [Polly](https://github.com/App-vNext/Polly) library for resilience engineering.
[This ISE dev blog post](https://devblogs.microsoft.com/ise/build-test-resilience-dotnet-functions/) provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.
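
If you want a feel for what fault injection looks like in code before adopting one of the platforms above, the snippet below is a hand-rolled sketch (deliberately not Simmy's API) of an `HttpClient` handler that fails a configurable fraction of outgoing calls; Simmy and the blog post above show the Polly-integrated way to do this properly.

```csharp
// Hand-rolled fault injection sketch (.NET 6+ console app, implicit usings assumed).
using System.Net;
using System.Net.Http;

// Fail ~10% of calls made through this client to observe how callers degrade.
var chaoticClient = new HttpClient(new FaultInjectionHandler(0.10, new HttpClientHandler()));

// A sketch of the idea only; prefer Simmy/Polly (linked above) for real chaos experiments.
public sealed class FaultInjectionHandler : DelegatingHandler
{
    private readonly double _injectionRate;

    public FaultInjectionHandler(double injectionRate, HttpMessageHandler inner)
        : base(inner) => _injectionRate = injectionRate;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Random.Shared.NextDouble() < _injectionRate)
        {
            // Inject a fault instead of calling the real dependency.
            return new HttpResponseMessage(HttpStatusCode.InternalServerError)
            {
                RequestMessage = request
            };
        }

        return await base.SendAsync(request, cancellationToken);
    }
}
```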

## Analyze All Failures

Writing up a [post-mortem](https://en.wikipedia.org/wiki/Postmortem_documentation) is a great way to document the root causes and action items for your failures. They're also a great way to track recurring issues and create a strong case for prioritizing fixes.

This can even be tied into your regular Agile [retrospectives](../agile-development/ceremonies.md#retrospectives).