[Preface: This is the second in a four-part series that uses the AWS outage as a springboard to discuss our technology vision for backup and continuity. Part 1 discussed the AWS outage of February 28, 2017.]
AWS black swans? So AWS had a major outage in February 2017, another in 2015, and quite a few both in between and before. But what does that have to do with swans – black swans at that? And what does that have to do with backup and continuity? For more on this, read on…
Author Nassim Nicholas Taleb has done more than anyone to explore the human tendency to be blind to unpredictable, relatively rare events and then, in hindsight, to inappropriately rationalize them. Taleb calls these “black swan events.” (Author’s note: I’m a big fan of Taleb and strongly recommend all of his work – I think that his work lays the foundation for the entire backup and disaster recovery industry with respect to IT infrastructure.) Not all AWS disruption events may legitimately be called black swan events, but the 2015 and now February 2017 outages appear to have all of the characteristics of AWS black swans.
AWS responded to its black swan event responsibly, with a published post-mortem that includes the following promised fixes:
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.

We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
This is typically how organizations deal with black swans. AWS identified what went wrong, took steps to prevent a recurrence, and committed to faster recovery times (shorter RTOs, or Recovery Time Objectives). The charmingly named “blast radius” reduction is, in essence, a promise to decrease coupling between systems – always a good idea in resilient systems.
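The cell-based partitioning AWS describes can be illustrated with a toy sketch. Nothing below reflects S3’s actual implementation – the cell count, the hash-based routing, and all function names are hypothetical – but it shows the arithmetic behind “blast radius”: with N independent cells, losing one cell strands roughly 1/N of the workload instead of all of it.

```python
import hashlib

NUM_CELLS = 8  # hypothetical; a real service sizes this per subsystem

def cell_for(key: str) -> int:
    """Deterministically map an object key to one cell."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CELLS

def affected_fraction(keys, failed_cell: int) -> float:
    """Fraction of keys stranded when one cell is down -- the 'blast radius'."""
    hit = sum(1 for k in keys if cell_for(k) == failed_cell)
    return hit / len(keys)

keys = [f"object-{i}" for i in range(10_000)]
# With 8 independent cells, a single cell failure affects roughly 1/8
# of the keys, versus 100% in a monolithic design.
print(round(affected_fraction(keys, failed_cell=3), 3))
```

The same structure also speeds recovery: each cell is small enough that its restart and failover procedures can actually be tested, which is the second half of the post-mortem’s argument for further partitioning the index subsystem.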
Is there a better way to handle AWS black swans – and data center and IT infrastructure black swans in general? More on this tomorrow.