Are Systemic Cloud Failures Inevitable? Understanding Cloud Outages and Risks

johny899

Cloud platforms run millions of operations simultaneously, so one small glitch can easily set off a domino effect of problems. It is easy to see why cloud failures can feel impossible to avoid. But are cloud failures really inevitable?

The Scale Problem

Clouds are massive, and that scale is itself a risk:
  • Customers share infrastructure, so a problem in that shared infrastructure ripples out to every customer that depends on it.
  • Large, complex systems can hide underlying issues until something actually breaks.
  • Automation operates at high velocity. Fixes roll out quickly, but so do breakages.
Sounds scary, doesn't it?

Are Cloud Failures Really Inevitable?

I do not believe every failure can be prevented. When systems are under stress (e.g. traffic surges, configuration errors, or patches made under duress), they become far more likely to fail. Have you ever seen (or caused) a ‘minor change’ that turned into a major catastrophe? I have!

Making Mistakes: The Human Element

Humans design, develop, and maintain cloud systems. AI and machine learning cannot completely protect a system from a human-induced failure. Therefore, expecting no unplanned downtime is unrealistic.

How to Reduce the Risk of Cloud Outages

Cloud service teams that prioritize resilience over perfection handle outages more effectively. Here are some simple yet effective best practices:
  • Run workloads in multiple geographic locations so that a failure in one location does not take down the whole service.
  • Test frequently to find weak points in the system before they cause an outage.
  • Develop comprehensive incident response plans so that teams stay levelheaded during an outage.
None of these practices will eliminate outages completely, but they can lessen the impact and make recovery easier.
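
To make the multi-location point a bit more concrete, here is a minimal Python sketch of choosing a healthy location to serve from. Everything in it is an assumption for illustration: the region names, the `first_healthy_region` helper, and the simulated health check are made up, and a real setup would lean on whatever health checks and traffic routing your provider actually offers.

```python
from typing import Callable, Iterable, Optional


def first_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose health check passes, or None if all fail."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated like an unhealthy region.
            continue
    return None


# Hypothetical region names, listed in order of preference.
REGIONS = ["region-a", "region-b", "region-c"]

# Simulated outage: pretend the preferred region is down.
DOWN = {"region-a"}


def simulated_health_check(region: str) -> bool:
    """Stand-in for a real probe (HTTP check, heartbeat, and so on)."""
    return region not in DOWN


active = first_healthy_region(REGIONS, simulated_health_check)
print(active)  # region-b
```

The point is not the loop itself but the shape of the design: traffic has somewhere else to go when one location fails, and a `None` result is the signal to fall back on the incident response plan.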

Summary

Significant cloud failures can and will occur, but they do not have to end in catastrophe. Designing for fast recovery minimizes the impact of a failure. In the end, the real measure of a cloud service is how well it recovers when something fails.
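
A rough way to see why recovery speed matters so much is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The sketch below is only an illustration: the hour figures are invented, not measurements from any real provider.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up numbers: the same failure rate (one failure per 720 hours),
# but two different recovery times.
slow_recovery = availability(mtbf_hours=720, mttr_hours=4)
fast_recovery = availability(mtbf_hours=720, mttr_hours=0.5)

print(f"4 h recovery:    {slow_recovery:.3%}")
print(f"30 min recovery: {fast_recovery:.3%}")
```

With the failure rate held constant, cutting recovery time is what raises availability, which is the same point as above: you cannot stop every failure, but you can control how long it hurts.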