How to Get Duped by Deduplication: Disregard Recovery

Deduplication is the process of removing duplicate data. To recover deduplicated data, the inverse of the deduplication process must occur. This process, which is commonly termed “hydration”, “re-hydration”, or “re-duplication”, tends to have two negative consequences:

  • It takes time to transmogrify the data from the deduplicated state to the original state.
  • There is a risk that something could go wrong and data could be lost.

The “transmogrification” of the data (I will admit to you that I first heard this word reading a Calvin and Hobbes cartoon years and years ago) is a fundamental consequence of the technology. The best way to handle this on a technical basis is to use a technique known as reverse referencing: creating the deduplication index/cache from the last backup. There are various ways to make this work, but the best way from the standpoint of recovery is to reverse reference and not allow deduplication across that last backup set. This is the approach that the software developers at Unitrends took, and it was driven by a requirement that deduplication have the least negative impact on recovery times in the most frequent recovery case (recovering the last backup performed). This is one of the reasons (not the only one, but one of the reasons) that a hybrid compression/deduplication data reduction technique was invented for Adaptive Deduplication.
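To make the idea of reverse referencing a little more concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not the Unitrends implementation: the class name `ReverseReferenceStore`, the fixed 4-byte block size, and the use of SHA-256 for the index are all hypothetical choices made to keep the example short. The point it shows is that the newest backup is always stored whole, while older backups are rewritten to point forward into it.

```python
import hashlib

BLOCK_SIZE = 4  # assumption: tiny fixed-size blocks, for illustration only


def split_blocks(data: bytes) -> list:
    """Cut a backup into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class ReverseReferenceStore:
    """Hypothetical store: the latest backup stays whole, older ones hold references."""

    def __init__(self):
        # Oldest backup first. Each backup is a list of entries:
        #   ("raw", block_bytes)  or  ("ref", position_in_next_newer_backup)
        self.backups = []

    def add_backup(self, data: bytes) -> None:
        new_blocks = split_blocks(data)
        if self.backups:
            # Build the deduplication index from the NEW (latest) backup.
            index = {}
            for pos, block in enumerate(new_blocks):
                index.setdefault(hashlib.sha256(block).digest(), pos)
            # Rewrite the previous latest backup so that any block also found
            # in the new backup becomes a reference into the new backup.
            demoted = []
            for kind, block in self.backups[-1]:
                digest = hashlib.sha256(block).digest()
                if digest in index:
                    demoted.append(("ref", index[digest]))
                else:
                    demoted.append(("raw", block))
            self.backups[-1] = demoted
        # The latest backup is never deduplicated: every block is stored raw.
        self.backups.append([("raw", block) for block in new_blocks])

    def restore(self, n: int) -> bytes:
        """Rehydrate backup n by following references toward newer backups."""
        out = []
        for pos in range(len(self.backups[n])):
            backup, kind, value = n, *self.backups[n][pos]
            while kind == "ref":
                backup += 1
                kind, value = self.backups[backup][value]
            out.append(value)
        return b"".join(out)


store = ReverseReferenceStore()
for version in (b"AAAABBBBCCCC", b"AAAAXXXXCCCC", b"AAAAXXXXDDDD"):
    store.add_backup(version)
latest = store.restore(len(store.backups) - 1)  # no references to chase
oldest = store.restore(0)                       # must follow references
```

In this sketch, restoring the last backup is a straight read of raw blocks, while restoring an older backup has to chase references through newer ones. That is the trade-off described above: the rehydration cost is pushed onto the less frequent recovery cases.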

To decrease your risk, you have to pay attention to your recovery time objectives with respect to deduplication. You also have to focus on the underlying reliability of the physical storage on which the deduplicated data is stored.

The Complete Series: How to Get Duped by Deduplication