AWS Went Down, Black Swans, Antifragility, Backup and Continuity [Part 3]
[Preface: This is the third in a four-part series that uses the AWS outage as a springboard to discuss our technology vision for backup and continuity. Part 1 discussed the AWS outage of February 28, 2017. Part 2 discussed the concept of black swans and this AWS black swan in particular.]
What is antifragility, and what is antifragile IT infrastructure? Antifragile systems gain in capability, resilience, or robustness as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. Author Nassim Nicholas Taleb, whom I discussed in my last blog post, developed the concept of antifragility. Antifragility is neither resiliency nor robustness. Resiliency is the ability to recover from failure; robustness is the ability to resist failure. Antifragility transcends both: an antifragile system does not merely survive stress, it improves because of it.
To put this as simply as possible:
- Under stress, fragile systems break while antifragile systems get better.
- Antifragile systems typically consist of fragile subsystems.
- Antifragile systems build extra capacity when under stress.
Examples of antifragile systems:
- Our immune systems strengthen when exposed to germs.
- Our skeletal and muscular systems strengthen when stressed with weight or resistance training.
- Airlines are stronger after a plane crash because the industry and vendors generally learn and adapt.
The trouble with these examples is that all of them are long-term antifragile systems; they take a long time to exhibit antifragile behavior. In the broadest possible context, AWS will behave as an antifragile system in the long term, in that AWS has promised to improve both the fragile S3 components that failed and the system administration tools associated with the outage. But this is a relatively uninteresting view of antifragility, since any company can claim to be antifragile over the long term.
A more interesting take on IT systems antifragility comes from Vincenzo De Florio’s paper “Antifragility = Elasticity + Resilience + Machine Learning,” presented at the First International Workshop “From Dependable to Resilient, from Resilient to Antifragile Ambients and Systems” (ANTIFRAGILE 2014). The three components named in the title (elasticity, resilience, and machine learning) seem to embody the core concepts of antifragile systems, and of antifragile IT infrastructure in particular.
Antifragile IT infrastructure must have some degree of elasticity, meaning additional capability that can be deployed when the infrastructure is under stress. The robustness we talked about at the beginning of this article is manifested through this elasticity. Antifragile IT infrastructure must also have core resilience, which enables recovery from subsystem failure. Finally, machine learning is important because it allows the IT infrastructure to adapt faster and at greater scale under stress; in essence, machine learning augments the human learning from failure.
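To make the three ingredients a little more concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: `ToyCluster`, its fields, and the sizing rule are hypothetical and do not come from De Florio's paper or from any AWS API. An exponential moving average stands in for machine learning, which is a much simpler technique than real ML. The only point is that the reserve of spare nodes ends up larger after a shock than before it.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical model for illustration; not a real AWS or vendor API."""
    healthy: int = 4            # nodes currently serving traffic
    spare: int = 2              # idle nodes held in reserve (elasticity)
    failure_rate: float = 0.0   # learned estimate of node failures per step
    alpha: float = 0.5          # smoothing factor for the moving average

    def step(self, demand: int, failures: int) -> bool:
        """Apply one time step; return True if demand was met."""
        # Faults hit the fragile subsystems (individual nodes).
        self.healthy = max(0, self.healthy - failures)

        # Elasticity and resilience: promote spares to cover the shortfall.
        shortfall = max(0, demand - self.healthy)
        promoted = min(self.spare, shortfall)
        self.healthy += promoted
        self.spare -= promoted

        # Learning (a moving average standing in for machine learning):
        # update the failure estimate from what was just observed.
        self.failure_rate = (
            self.alpha * failures + (1 - self.alpha) * self.failure_rate
        )

        # Adaptation: rebuild the reserve, sized to the learned failure rate.
        target_spare = 2 + round(2 * self.failure_rate)
        self.spare = max(self.spare, target_spare)

        return self.healthy >= demand


cluster = ToyCluster()
print(cluster.step(demand=4, failures=2), cluster.spare)  # True 4
print(cluster.step(demand=4, failures=3), cluster.spare)  # True 6
```

A fresh `ToyCluster()` hit with three failures in its first step cannot meet a demand of 4, because it has only two spares. The cluster above survives the same shock because the earlier, smaller failure had already taught it to hold a larger reserve.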
What does any of this have to do with backup and continuity? More on this in the last installment of this series, which I’ll publish tomorrow.