[This is a more detailed post that explores some of the concepts first discussed in the post Backup, Deduplication, and Incremental Forever]

Storage-based deduplication, particularly with respect to backup, is a paradox. The intent of deduplication technology is to increase the amount of local storage effectively available for backup. Of course, all technologies have unintended consequences; in the case of backup deduplication, some of these tend to be:

  • An increase in the backup window due to a decrease in backup ingest performance. Put more simply, backups get slower because deduplication takes time.
  • An increased risk of being unable to recover data, because a storage failure affecting a single bit of information can render more than one backup (and in many cases, quite a few backups) unrecoverable.
  • Lower performance for functions such as disk or tape archiving, since the data typically must be “reduplicated” (expanded back to its original size) before the archiving is performed.
  • Lower performance for functions such as recovery, since the data must likewise be “reduplicated” (see the archiving discussion above).

Each of these may be addressed to some degree technically. None of them, however, is the toughest issue, because the toughest issue transcends the technical challenges of deduplication and gets at its business challenge. The toughest issue associated with deduplication is that you are substituting processor and memory cycles for additional storage. Why is this the toughest issue? Because to do this in a manner that benefits customers, you have to do it such that the overall price per terabyte of effective storage (the amount of storage you see after whatever data reduction technique you’re using is taken into account) is the best available to the customer. And therein lies the rub.

There are two forms of deduplication being offered today. The first is based on a dedicated hardware appliance (or in some cases a dedicated virtual appliance, which in effect is the same thing): a small amount of physical storage is in effect made larger by using a great deal of processor and memory capacity. The problem with this approach is that with 1TB 7200RPM SATA drives costing less than $90, a vendor has trouble making the case, even at a 20:1 deduplication ratio, that its $100,000 device is worth the money.
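The arithmetic behind that claim can be sketched in a few lines. The $90 drive price and the 20:1 ratio come from the figures above; the appliance's raw capacity is a hypothetical assumption made purely for illustration.

```python
# Illustrative cost-per-effective-terabyte comparison. The $90/TB drive
# price and 20:1 dedup ratio are from the text; the appliance's 20TB raw
# capacity is an invented assumption for the sake of the example.

def cost_per_effective_tb(price_usd, raw_tb, dedup_ratio):
    """Price divided by effective capacity after data reduction."""
    return price_usd / (raw_tb * dedup_ratio)

# Plain 1TB SATA drive, no deduplication (1:1 ratio).
plain = cost_per_effective_tb(90, 1, 1)

# Hypothetical $100,000 appliance with 20TB raw storage at 20:1 dedup.
appliance = cost_per_effective_tb(100_000, 20, 20)

print(f"raw SATA: ${plain:.0f}/effective TB")      # $90/effective TB
print(f"appliance: ${appliance:.0f}/effective TB")  # $250/effective TB
```

Even granting the appliance its full 20:1 ratio, the effective price per terabyte in this sketch is still several times that of raw commodity disk.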

The other form of deduplication is the increasingly popular “software-only” approach currently being taken by commodity backup software vendors. All this approach does, however, is push the very tough decisions regarding data reduction optimization back onto the user of the software. These products don’t tend to work as advertised, which is not surprising given how difficult it is to match hardware and software optimally, and even less surprising given that most people don’t buy commodity software so that they can then spend a lot of money building a state-of-the-art hardware platform underneath it for backup.

Introducing Adaptive Deduplication

This is the reason Unitrends has chosen a different route for the deduplication feature we are releasing in May (in our Release 5). What we’ve done is create a data reduction technology that takes advantage of inexpensive storage to gain the benefits of deduplication without requiring expensive processor/memory complexes; in other words, we’re able to deliver an affordable effective price per terabyte for our customers.

We do this by merging the advantages of advanced compression with deduplication, using predictive algorithms that take into account both the processor/memory capabilities and the amount of storage available, so that the optimal data reduction technique is chosen automatically on a per-record basis.
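To make the idea concrete, here is a minimal sketch of what per-record adaptive selection could look like. This is not Unitrends’ actual algorithm, which is not public; the function name, thresholds, and resource signals are all invented assumptions used only to illustrate the general technique of trading processor/memory cycles against available storage.

```python
# Hypothetical sketch of adaptive, per-record data reduction selection.
# All thresholds and signal names below are invented for illustration;
# they do not describe the actual Unitrends implementation.

def choose_reduction(cpu_idle_pct, free_mem_mb, free_storage_pct):
    """Pick a data reduction technique for one record based on resources."""
    if free_storage_pct > 50:
        # Storage is plentiful: cheap compression keeps ingest fast
        # and the backup window short.
        return "compress"
    if cpu_idle_pct > 40 and free_mem_mb > 512:
        # Storage is tight but CPU and memory are available: spend
        # those cycles on full deduplication to stretch capacity.
        return "dedup"
    # Constrained on all fronts: fall back to lightweight compression.
    return "compress"

# Tight on storage, rich in CPU/memory -> deduplicate this record.
print(choose_reduction(cpu_idle_pct=60, free_mem_mb=2048, free_storage_pct=20))
```

The point of the sketch is the shape of the decision, not the numbers: the selector consults the resource profile for each record rather than committing the whole system to one reduction technique up front.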

The result? We’ll be bringing out deduplication that is more affordable than that offered by dedicated hardware vendors and yet more effective than that offered by commodity software providers.