Backup Concepts: Compression and Deduplication

There’s a lot of confusion concerning deduplication.  In this post I’ll discuss two primary data reduction concepts: compression and deduplication.

Compression

Compression is simply the encoding of data using fewer bits than the original representation. There are two fundamental types of compression: lossless and lossy. Lossless data compression means that you can recover every bit of your original data; lossy data compression means that some of your original data is discarded during compression and cannot be recovered. For the purposes of our discussion, we’re only going to consider the lossless form of data compression.

Lossless data compression typically exploits the statistical redundancy of the underlying data to represent the original data more concisely, while preserving the ability to fully and accurately reconstitute that data when it is later decompressed. Statistical redundancy exists because almost all real-world data isn’t random but instead has specific underlying patterns.
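A quick way to see redundancy at work is to compress patterned data and pattern-free data of the same size. The sketch below uses Python’s standard zlib module; the byte strings are made-up examples:

```python
import os
import zlib

# Highly patterned data: 8,000 bytes built from a repeating motif.
patterned = b"ABABABAB" * 1000
# Pattern-free data of the same size: no statistical redundancy to exploit.
random_ish = os.urandom(8000)

# The patterned input shrinks dramatically; the random input barely shrinks
# at all (it may even grow slightly because of format overhead).
print(len(zlib.compress(patterned)))
print(len(zlib.compress(random_ish)))

# Lossless: decompression recovers every original bit.
assert zlib.decompress(zlib.compress(patterned)) == patterned
```

The repeating input compresses to a tiny fraction of its original size, while the random input stays essentially as large as it started.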

Here’s a trivial example. In standard character sets, every letter of the alphabet is represented by the same number of bits. In English, the letter “e” has a frequency of use of 12.7% while the letter “t” has a frequency of 9.06%. If a fixed encoding spends eight bits on every character, then assigning shorter codes to these frequent letters lets you achieve a significant reduction just by encoding these two letters more efficiently.
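This is exactly what Huffman coding does: it derives variable-length codes from symbol frequencies so that common letters get short codes. A minimal sketch (the sample sentence and the function name are invented for illustration):

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits}; frequent symbols get shorter codes."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (subtree weight, unique tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # merge the two lightest subtrees,
        w2, _, b = heapq.heappop(heap)   # pushing every symbol one level deeper
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "the letter e appears everywhere"
lengths = huffman_code_lengths(text)
fixed_bits = 8 * len(text)                       # fixed 8-bit encoding
var_bits = sum(lengths[c] for c in text)         # variable-length encoding
print(fixed_bits, var_bits)
```

The most frequent symbol (“e” here) ends up with the shortest code, and the variable-length total comes in well under the fixed eight-bits-per-character total.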

Compression is typically a trade-off: it spends microprocessor cycles and primary memory in order to reduce the utilization of secondary storage and of transmission lines such as a WAN.
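You can observe this trade-off directly by dialing a compressor’s effort level up and down. The sketch below again uses Python’s zlib, whose compression levels run from 1 (fast, larger output) to 9 (slow, smaller output); the input data is synthetic:

```python
import time
import zlib

# Synthetic stand-in for a redundant backup stream.
data = b"backup stream with repeated records " * 50_000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Higher levels generally burn more CPU time to produce smaller output.
    print(f"level {level}: {len(out)} bytes in {elapsed:.4f}s")
```

Whether the extra CPU time is worth it depends on which resource is scarcer: processor cycles on the host, or capacity on the disk or WAN link.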

Deduplication

Deduplication is a specific form of lossless compression in which redundant data is eliminated. I realize that this sounds strange – deduplication is compression? It sounds strange only because hundreds of millions of marketing dollars have been spent representing deduplication as something magical and revolutionary.

In deduplication, duplicate data is deleted, leaving only one copy of the data to be stored. The deleted duplicate data is said to be “stubbed.” Stubbing replaces the deleted data with an indication that the data has been removed and a pointer to the “index” entry for that data (the one copy that is not deleted).

Here’s a trivial example of deduplication. If you have ten servers, and each server has the same 1GB file, deduplication across all of those servers should allow nine of the ten copies to be stubbed. Your data has then been deduplicated by a factor of 10:1.
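A minimal sketch of that example, assuming a content-hash index (the DedupStore class and its method names are invented for illustration, and a 1MB byte string stands in for the 1GB file):

```python
import hashlib

class DedupStore:
    """Store each unique block once; duplicates become stubs pointing at the index."""
    def __init__(self):
        self.index = {}   # sha256 digest -> the single stored copy
        self.stubs = []   # one pointer per logical file

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.index:      # first copy: store the data itself
            self.index[key] = data
        self.stubs.append(key)         # every copy becomes a stub (a pointer)
        return key

    def get(self, key: str) -> bytes:
        return self.index[key]         # follow the stub back to the one copy

    def dedup_ratio(self) -> float:
        logical = sum(len(self.index[k]) for k in self.stubs)
        physical = sum(len(v) for v in self.index.values())
        return logical / physical

store = DedupStore()
file_contents = b"x" * 1_000_000       # stand-in for the identical 1GB file
for _ in range(10):                    # ten servers, same file
    store.put(file_contents)
print(store.dedup_ratio())             # 10.0 -> the 10:1 ratio above
```

Only one physical copy is kept; the other nine logical copies are just hash pointers, which is what produces the 10:1 ratio.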

While deduplication is a form of compression, it is typically at odds with the more traditional forms of compression. Deduplication tends to achieve better data reduction against smaller backup sets (the amount of data being backed up each time), while traditional compression tends to achieve better results against larger data sets.