Deduplication


The purpose of deduplication is to reduce the amount of storage needed for a given data set. The goal is to store each piece of data only once and to use “pointers” to reference it everywhere the same information appears. The data is literally stored once and referenced many times.

There are a few basic types of storage-oriented deduplication, each with advantages and disadvantages.

File-Level Deduplication

File-level deduplication operates by comparing files and storing only the first occurrence of a unique file. File-level deduplication algorithms must be aware of the data to the extent that they recognize what a “file” is.

File-level deduplication is fairly effective, because a growing share of files are never updated, and often never accessed, after they are first created.

The problem with file-level deduplication is that it doesn’t handle changes within files. Remember that files include not only files within file systems but also what is commonly called structured data (e.g., databases and images). Any change within a file, database, image, or other entity causes the entire entity to be marked as unique, so no deduplication occurs.
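The behavior described in this section can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the class name `FileLevelStore`, the use of SHA-256 as the file fingerprint, and the file names are all assumptions made for illustration. It shows an identical copy being stored as a pointer only, and a one-byte change forcing a second full copy.

```python
import hashlib


class FileLevelStore:
    """Toy file-level deduplicating store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blobs = {}    # digest -> file content, stored once
        self.catalog = {}  # file name -> digest (the "pointer")

    def put(self, name: str, content: bytes) -> bool:
        """Record a file; return True if new content had to be stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.catalog[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer back to the single stored copy."""
        return self.blobs[self.catalog[name]]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = FileLevelStore()
report = b"quarterly report " * 1000                      # 17,000 bytes
first = store.put("a/report.doc", report)                  # first occurrence: stored
copy = store.put("b/report-copy.doc", report)              # identical file: pointer only
changed = store.put("c/report-v2.doc", report + b"!")      # one extra byte: stored in full
```

After these three calls the store holds two blobs, not one: the one-byte difference in the third file is enough to make the whole file unique.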

Block-Level Deduplication

Block-level deduplication treats the information in a backup as a “black box”. It looks at the content in grouped pieces of storage (known as blocks) and identifies duplicated blocks regardless of where they occur.
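A rough sketch of the idea in Python follows. It assumes fixed-size 4K blocks and SHA-256 fingerprints; the class `BlockLevelStore` and the backup names are hypothetical, and real products may chunk and index data quite differently.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative 4K block size (an assumption of this sketch)


class BlockLevelStore:
    """Toy block-level deduplicating store (hypothetical, for illustration only)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}   # 32-byte digest -> block content, stored once
        self.recipes = {}  # backup name -> ordered list of digests

    def put(self, name: str, data: bytes) -> int:
        """Store a backup; return how many previously unseen blocks were written."""
        recipe = []
        new_blocks = 0
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()  # 256-bit hash key
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)
        self.recipes[name] = recipe
        return new_blocks

    def get(self, name: str) -> bytes:
        """Rebuild a backup from its list of block pointers."""
        return b"".join(self.blocks[d] for d in self.recipes[name])


store = BlockLevelStore()
block_a = b"A" * BLOCK_SIZE
block_b = b"B" * BLOCK_SIZE
block_c = b"C" * BLOCK_SIZE
monday = block_a + block_b + block_a     # block_a repeats inside one backup
tuesday = block_a + block_c + block_b    # only block_c is new
new_monday = store.put("monday", monday)
new_tuesday = store.put("tuesday", tuesday)
```

Six blocks are written across the two backups, but only three are stored. The store never needs to know what a file is; it only compares block fingerprints.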

Depending upon the implementation, block-level deduplication can work very well to reduce data. The problem with block-level deduplication isn’t the theoretical ability to perform data reduction; it’s the practical consequences of doing so. Imagine a block-level deduplication device with 20TB of physical storage. If your block size is 4K, that means you need a table with 5 billion entries to keep track of each block. If a 256-bit hash key is used to represent each block (hash keys are typically used in deduplication algorithms to provide a unique representation of the data), that means that you need 32 bytes per entry, or 160GB of dedicated memory for the tracking table (5 billion entries × 32 bytes).
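The arithmetic can be checked with a short calculation. The sketch below assumes decimal units (1TB = 10^12 bytes, 4K = 4,000 bytes) and counts only the hash keys themselves, ignoring any per-entry overhead a real tracking table would add.

```python
TB = 10**12
GB = 10**9

physical_storage = 20 * TB    # 20TB of physical storage
block_size = 4 * 10**3        # 4K blocks (decimal units assumed)
hash_bytes = 256 // 8         # a 256-bit hash key occupies 32 bytes

entries = physical_storage // block_size   # one table entry per block
table_bytes = entries * hash_bytes         # memory for the hash keys alone

print(f"{entries:,} entries")              # 5,000,000,000 entries
print(f"{table_bytes / GB:.0f} GB")        # 160 GB
```

With binary units (20 TiB and 4 KiB) the entry count rises to about 5.4 billion and the table comes to 160 GiB, so the conclusion is the same either way.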

Even with optimizations, block-level deduplication devices that perform acceptably tend to be expensive. And the performance has to be quite good—because if you’re using a dedicated block-level deduplication device you’re already going to be incurring performance penalties when you transfer the backup data from your backup server and software to the deduplication device.

Byte-Level Deduplication

Byte-level deduplication combines the resource efficiency of file-level deduplication with the data reduction effectiveness of block-level deduplication. Byte-level deduplication algorithms can evaluate objects more intelligently by looking at small pieces within the content that is being backed up, resulting in dramatically fewer tracking table entries per physical storage terabyte. This results in smaller physical resource requirements, including processors, memory, and physical storage.
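The text does not spell out how a byte-level algorithm works internally, so the following is only one plausible reading: compare a new version of an object against the previously stored version and keep just the bytes that differ. The sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in for a real differencing engine; the function names `make_delta`, `apply_delta`, and `literal_bytes` and the sample records are assumptions for illustration.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus runs of literal new bytes."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reference bytes already stored
        elif j2 > j1:
            delta.append(("data", new[j1:j2]))   # replaced or inserted bytes
        # pure "delete" opcodes add nothing to the new version
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(delta: list) -> int:
    """Count the bytes that actually had to be stored for the new version."""
    return sum(len(op[1]) for op in delta if op[0] == "data")


old = b"customer record: alice, balance 100\n" * 50
new = old.replace(b"balance 100", b"balance 250", 1)   # change a single record
delta = make_delta(old, new)
rebuilt = apply_delta(old, delta)
```

Here one object-level delta with a handful of entries replaces what would otherwise be a per-block fingerprint table, which is the kind of reduction in tracking entries the paragraph above describes. `SequenceMatcher` is far too slow for real backup volumes; it is used only to keep the example self-contained.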

Next-generation backup appliances offer integrated byte-level deduplication at a lower cost, without the performance penalties associated with moving data from the backup server to a secondary dedicated deduplication device.

Inline Deduplication

Inline deduplication removes duplicate data immediately on ingest, reducing the amount of storage capacity required to store backup data, which in turn extends retention time. Inline deduplication is particularly efficient on large data sets with small data changes. Additionally, since inline deduplication removes duplicate data on ingest, redundant data can be eliminated across multiple backup data streams simultaneously, thus improving deduplication efficiency. Furthermore, for critical data that needs to be quickly replicated to a disaster recovery (DR) site, inline deduplication allows data to be immediately replicated across the WAN, improving Recovery Point Objectives (RPOs).

New-generation backup appliances are combining inline deduplication and byte-level deduplication in order to minimize capacity requirements, meet tight backup windows, and maximize replication performance across all data types in a single, integrated, cost-effective solution.
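As a rough illustration of the ingest path, the Python sketch below checks each chunk for duplicates before it is written, shares one chunk index across every backup stream, and queues only new chunks for replication. The class `InlineDedupTarget`, the 4K chunk size, and the `replication_queue` list are hypothetical names and choices, not a description of any particular appliance.

```python
import hashlib


class InlineDedupTarget:
    """Toy backup target that deduplicates chunks as they arrive (illustrative only)."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}             # digest -> chunk, shared by all streams
        self.replication_queue = []  # digests of new chunks awaiting WAN transfer

    def ingest(self, stream: bytes) -> int:
        """Ingest one backup stream; return the number of bytes actually written."""
        written = 0
        for offset in range(0, len(stream), self.chunk_size):
            chunk = stream[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest not in self.chunks:          # duplicate check happens before storing
                self.chunks[digest] = chunk
                self.replication_queue.append(digest)
                written += len(chunk)
        return written


target = InlineDedupTarget()
# Ten distinct 4K chunks standing in for data common to two servers.
shared = b"".join(bytes([i]) * 4096 for i in range(10))
server_a = shared + b"a" * 4096    # shared data plus one unique chunk
server_b = shared + b"b" * 4096

written_a = target.ingest(server_a)   # everything is new on the first stream
written_b = target.ingest(server_b)   # only the unique chunk is written
```

The second stream writes a single chunk, and only 12 chunks in total are queued for the DR site even though 22 were ingested. Because duplicates are dropped before anything lands on disk, replication can start as soon as a new chunk is seen.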

 





Deduplication Devices are Backup Appliances, the Check is in the Mail, and Other Lies

Recently I've run into some marketing material from deduplication device vendors stressing that what they are selling are "backup appliances." This reminds me of the story of the car salesman who sells a car and then later admits that the engine, steering wheel, and seats are "options" that will cost more. By this logic, Seagate could sell its disk drives as "laptop computers" for school children to use in class. I don't know what offends me more: the blatant lie, or the assumption some marketing person made that their intended buyer is either too stupid or too bored to be able to tell the difference.



Last Updated: 06/18/2013
Categories: White Paper Recovery-Series