Cloud platforms run millions of operations simultaneously, so one small glitch can easily set off a domino effect of problems. It is easy to understand why
cloud failures can feel impossible to resolve.
Are cloud failures really inevitable?
The Scale Problem
Clouds are massive, and that scale brings risk:
- Customers share infrastructure, so a problem in one shared component can ripple out to every customer that depends on it.
- Large, complex systems can hide many of their underlying issues until something breaks.
- Automation operates at high velocity: fixes roll out rapidly, but unfortunately so do breakages.
Sounds scary, doesn't it?
Are Cloud Failures Really Inevitable?
I do not believe every failure is unpreventable! But when systems come under stress (e.g. traffic surges, configuration errors, or patches made under duress), they become susceptible to incidents. Have you ever seen (or caused) a ‘minor change’ that led to a major catastrophe? I have!
Making Mistakes: The Human Element
Humans design, develop, and maintain cloud systems. AI and machine learning cannot completely protect a system from a human-induced failure. Therefore, expecting no unplanned downtime is unrealistic.
How to Reduce the Risk of Cloud Outages
Cloud service teams that prioritize resilience over perfection handle cloud outages more effectively. Here are some simple yet effective practices:
- Run services in multiple geographic locations so that a failure in one location does not take everything down.
- Test frequently to find weak points in the system before they cause an outage.
- Develop comprehensive incident response plans to help teams stay levelheaded during an outage.
None of the practices listed will eliminate outages completely; however, these methods can lessen the impact of outages and make the recovery process easier.
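To make the first two practices a little more concrete, here is a minimal Python sketch of trying a request against several locations in order and falling back when one fails. Everything in it is an assumption made for illustration: the region names, the `call_with_failover` helper, and the `fake_request` stand-in are hypothetical and do not come from any real cloud provider's API.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when every region in the list has failed."""


def call_with_failover(
    regions: Sequence[str], request: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each region in order; return (region, response) from the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:  # a real client would catch narrower error types
            errors[region] = exc  # remember why this region failed, then move on
    raise AllRegionsFailed(f"no region answered: {errors}")


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a network call; pretends region-a is down."""
    if region == "region-a":
        raise ConnectionError("region-a is unreachable")
    return f"ok from {region}"


served_by, response = call_with_failover(["region-a", "region-b"], fake_request)
print(served_by, response)  # region-b ok from region-b
```

Deliberately breaking one location, as `fake_request` does here, is also a small-scale version of frequent testing: it checks that the fallback path actually works before a real outage forces the question.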
Summary
Significant cloud failures can and will occur; however, these major events do not need to result in catastrophe. Designing systems for fast recovery minimizes the impact of failures. Ultimately, the real measure of a cloud service is how well it recovers from a failure.