AWS Went Down, Black Swans, Antifragility, Backup, and Continuity
[Preface: This is the first in a four-part series that uses the AWS outage as a springboard to discuss our technology vision for backup and continuity. Just as we talked in earlier posts about Unitrends' vision, mission, and values – this series of posts describes the technology vision of Unitrends.]
AWS went down for four hours on the last day of February (2017). And all hell broke loose. But should all hell have broken loose?
AWS had an outage due to human error on February 28. The outage was centered on its S3 object storage; however, other AWS services such as EC2, EBS, Lambda, encryption, and content delivery were also affected. Beyond the technology itself, many companies that build on AWS were impacted. The most famous of these included Slack, Trello, Sprinklr, and Venmo, but it's estimated that tens of thousands of less well-known companies were impacted as well. We learned some surprising things – Nest, acquired by Google, uses AWS, and both the company and its customers were impacted by the outage.
At the core of the problem was, unsurprisingly, human error. Some poor engineer, we'll call him Joe, was tasked with entering a command to shut down some storage sub-systems. On a typical day this doesn't cause any issue whatsoever. It's a routine kind of task, but on that Tuesday something went terribly wrong.
Joe was an authorized user, and he entered the command according to procedure based on what Amazon calls “an established playbook.” The problem was that Joe was supposed to issue a command to take down a small number of servers on an S3 sub-system, but he made a mistake, and instead of taking down just that small set of servers, Joe took down a much larger set.
In layman’s terms, that’s when all hell broke loose.
So should all hell have broken loose?
AWS had a five-hour outage in 2015 that people still talk about. AWS will have future outages. Regardless of how much is invested, regardless of how "perfect" the technology is, regardless of how well trained the personnel are, AWS will have future outages. I was about to write that the only way AWS would not have an outage in the future would be if Amazon went bankrupt or if the world ended – but of course, these would constitute the ultimate outage.
So why are we so bad at understanding and planning for these types of events? More on this tomorrow.