AWS Went Down, Black Swans, Antifragility, Backup and Continuity [Part 4]
[Preface: This is the fourth and last in a four-part series that uses the AWS outage as a springboard to discuss our technology vision for backup and continuity. Part 1 discussed the AWS outage of February 28, 2017. Part 2 discussed the concept of black swans and this AWS black swan in particular. Part 3 explored the concept of antifragile IT infrastructure.]
What is antifragile backup and continuity? And what does it have to do with black swans and AWS outages? In this concluding post of the series, I’ll tie all these concepts together to argue that there is a new method for building antifragile IT infrastructure that has backup and continuity at its heart. To understand that, let’s go back to our last post and the equation: Antifragility = Elasticity + Resilience + Machine Learning.
Backup began solely as a mechanism to increase the resiliency of IT infrastructure. Remember from yesterday’s post: resiliency is the ability to recover from failure; robustness is the ability to resist failure. While backup has evolved tremendously over the years, most of this evolution focused on increasing the efficacy of resiliency.
This efficacy of resiliency is typically expressed as the RPO (Recovery Point Objective) and RTO (Recovery Time Objective) of the backup system. The RPO is the amount of data, measured in time, that one is willing to lose; the RTO is the amount of downtime that one is willing to tolerate. Both RPOs and RTOs have improved over the last few years, so backup has had a positive impact on IT infrastructure resiliency. In addition, disaster recovery has further improved resiliency by enabling businesses to survive the loss of entire data centers. DRaaS (Disaster Recovery as a Service) offers disaster recovery for companies that don’t have multiple sites or for multi-site companies that want to lower their capital and operating expenses.
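To make the two definitions concrete, here is a minimal sketch of how one might check a single incident against an RPO and an RTO. The function names and the timestamps are hypothetical, invented purely for illustration; they do not come from any product or real outage.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_backup: datetime, failure: datetime) -> timedelta:
    """Data lost, measured in time: the gap between the last good backup and the failure."""
    return failure - last_backup


def achieved_rto(failure: datetime, restored: datetime) -> timedelta:
    """Downtime: the gap between the failure and the moment service is restored."""
    return restored - failure


def meets_objectives(last_backup: datetime, failure: datetime, restored: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within their objectives."""
    return (achieved_rpo(last_backup, failure) <= rpo
            and achieved_rto(failure, restored) <= rto)


# Hypothetical incident (made-up timestamps).
last_backup = datetime(2024, 1, 1, 9, 0)
failure = datetime(2024, 1, 1, 9, 45)
restored = datetime(2024, 1, 1, 10, 15)

data_loss = achieved_rpo(last_backup, failure)   # 45 minutes of data lost
downtime = achieved_rto(failure, restored)       # 30 minutes of downtime

# A 1-hour RPO is met, but a 15-minute RTO is not.
strict = meets_objectives(last_backup, failure, restored,
                          rpo=timedelta(hours=1), rto=timedelta(minutes=15))
# Relaxing the RTO to 1 hour makes the same incident acceptable.
relaxed = meets_objectives(last_backup, failure, restored,
                           rpo=timedelta(hours=1), rto=timedelta(hours=1))
```

The point of the sketch is that the two objectives are independent: an incident can lose an acceptable amount of data and still take too long to recover from, or the reverse.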
Note: What about robustness? See the machine learning discussion below for more on robustness; but as RPOs and RTOs have fallen, backup and continuity have increased overall business robustness in that the “blast radius,” measured in time, has been contained. At the subsystem level, robustness remains the same; but for the business as a whole, backup and continuity technologies such as “instant recovery” have been transformative. Antifragile backup and continuity rely on increased resilience and robustness technology as their underlying foundation.
Elasticity generally refers to the ability of a system to dynamically add the resources it needs to cope with load. It is typically associated with cloud computing. In cloud computing, a third party purchases the IT infrastructure necessary for scaling a system up or down dynamically (see scalability below), and the business that consumes the cloud resources uses a pay-as-you-go model. Note that elasticity has heavy marketing connotations as well, since most of the hyperscale cloud providers give strong economic incentives to lock in (that is, to decrease your ability to scale down dynamically).
It’s easy to get elasticity confused with scalability. A scalable system has traditionally been defined as one that you can continue to invest in to scale up (add more resources like CPU, memory, or disk to a single system for increased capability) or scale out (add more fully functional subsystems to a functioning system). Elastic systems are a superset of scalable systems in that you can dynamically scale up and scale out as well as dynamically scale down when capacity is no longer needed.
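The distinction can be sketched in a few lines. The toy model below is an assumption-laden illustration, not a description of any real cloud or backup product: `nodes_needed`, `scale_out_only`, `elastic`, the per-node capacity, and the load figures are all hypothetical.

```python
import math


def nodes_needed(load: float, capacity_per_node: float) -> int:
    """Smallest number of nodes (at least one) that can carry the given load."""
    return max(1, math.ceil(load / capacity_per_node))


def scale_out_only(current_nodes: int, load: float, capacity_per_node: float) -> int:
    """Scalable but not elastic: capacity can be added, but is never released."""
    return max(current_nodes, nodes_needed(load, capacity_per_node))


def elastic(current_nodes: int, load: float, capacity_per_node: float) -> int:
    """Elastic: capacity follows the load in both directions."""
    return nodes_needed(load, capacity_per_node)


# Hypothetical load that rises to a peak and then falls away.
loads = [100, 450, 900, 300, 50]
capacity_per_node = 100

scalable_history, elastic_history = [], []
scalable_nodes = elastic_nodes = 1
for load in loads:
    scalable_nodes = scale_out_only(scalable_nodes, load, capacity_per_node)
    elastic_nodes = elastic(elastic_nodes, load, capacity_per_node)
    scalable_history.append(scalable_nodes)
    elastic_history.append(elastic_nodes)

# scalable_history stays at its peak after the load drops;
# elastic_history shrinks back down with the load.
```

After the peak, the scale-out-only system is still paying for nine nodes while the elastic system has dropped back to one, which is exactly the "scale down when capacity is no longer needed" part of the definition.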
Traditional backup has always talked about scale-up and scale-out architectures; however, there has been little discussion of elasticity. Evolving services such as DRaaS are platforms for elasticity within backup and continuity via dynamic spin-up, recovery assurance, and of course retention. Antifragile backup and continuity relies not only upon local scale-up and scale-out for scalability but, even more importantly, upon purpose-built backup and continuity clouds for true elasticity.
What is machine learning and how does antifragile backup and continuity rely upon it? Let’s start with a clear definition of machine learning beyond the marketing hype exploding around the term these days. Machine learning is building computer systems that have the ability to learn without being explicitly programmed. Machine learning differs from predictive analytics in that machine learning is a superset of it. At Unitrends, we’ve been using predictive analytics for years with our appliances to predict things like disk failures, so that we can call customers before a disk fails and ask if we can send them a replacement. It has a high “Wow!” factor. But predictive analytics is just the beginning.
Using machine learning enables us to anticipate and remediate potential failures within our systems before they occur. We analyze log and other data, find anomalies, and make both human-based and machine-based changes in response to advanced security threats, IT performance problems, and business disruptions. Leveraging machine learning anomaly detection, we automate and orchestrate the analysis of massive data sets, reducing manual effort and human error.
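As a rough illustration of what "finding anomalies" in log data can mean, here is a deliberately simple sketch. It uses a plain z-score outlier test against a baseline window, which is a basic statistical stand-in and not the machine learning approach described above or anything Unitrends ships. The function name, the threshold, and the error counts are hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=5, threshold=3.0):
    """Flag values that sit more than `threshold` standard deviations
    away from the mean of the first `baseline_size` observations."""
    baseline = values[:baseline_size]
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of the baseline
    anomalies = []
    for index, value in enumerate(values[baseline_size:], start=baseline_size):
        # Skip the test if the baseline is perfectly flat (sigma == 0).
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            anomalies.append((index, value))
    return anomalies


# Hypothetical hourly error counts parsed from an appliance log.
error_counts = [10, 12, 11, 9, 13, 11, 10, 48, 12]

anomalies = find_anomalies(error_counts)
# Only the spike of 48 errors at index 7 is flagged.
```

A real system would learn the baseline continuously and from many signals at once, but the shape is the same: establish what normal looks like, flag departures from it, and hand those to a person or an automated remediation step.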
Antifragile backup and continuity is the heart of our technological vision, one that enables our customers to spend less time on backup and continuity and thus be more productive. It represents a shift in thinking about backup that is far more profound than just enabling customers to be more highly available via “instant recovery” technologies or even more advanced “recovery assurance” automation and orchestration capabilities (both of which Unitrends offers today).
As always, I would love to hear your thoughts.