Unitrends is supporting byte-level deduplication on its backup appliances with its release 6 (due this month, March 2010).  Since there’s some confusion concerning byte-level versus block-level deduplication, I thought I’d take the opportunity to explain the differences between file-level deduplication (what Unitrends supported prior to release 6), block-level deduplication, and byte-level deduplication.

File-level deduplication operates by eliminating redundant files.  Despite what many pundits state, file deduplication is very efficient (and note that I’m stating this even though Unitrends is bringing out byte-level deduplication with release 6).  The reason is two-fold: the concept of temporal data access locality and the concept of data getting “colder.”  The bottom line to both of these concepts is that, statistically, data that has been recently used is more likely to be used again, while data that hasn’t been recently used is less likely to be used again.  This same behavior is the reason that master/differential backup policies have held up so well over time: data usage tends to be “clumped” together temporally.
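To make "eliminating redundant files" concrete, here is a minimal sketch in Python.  It assumes files are identified by a SHA-256 hash of their full contents; the function name file_level_dedup and the sample file names are made up for illustration and say nothing about how any particular product does it.

```python
import hashlib

def file_level_dedup(files):
    """files maps a file name to its bytes. Returns (store, catalog)."""
    store = {}    # content digest -> file bytes, kept only once
    catalog = {}  # file name -> content digest
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # a second identical file adds nothing
        catalog[name] = digest
    return store, catalog

# Hypothetical sample data: two identical files and one slightly different file.
files = {
    "a.txt": b"quarterly report",
    "copy_of_a.txt": b"quarterly report",
    "b.txt": b"quarterly report, revised",
}
store, catalog = file_level_dedup(files)
print(len(files), "files,", len(store), "unique contents stored")
```

Note that b.txt is stored in full even though it shares most of its bytes with a.txt.  Whole-file hashing only helps when the files are identical.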

The downside of file-level deduplication concerns data reduction on what is typically called “structured data.”  Structured data includes things like databases, e-mail repositories, virtual machine image backups, image-based dissimilar bare metal backups, and the like.  Since there are no files per se, file-level deduplication can’t eliminate redundant data for structured data.

Block-level deduplication has higher overhead than file-level deduplication but has the tremendous advantage of deduplicating structured data.  In addition, block-level deduplication can deduplicate at a sub-file level; i.e., when only a section of a file changes, block-level deduplication can often enable the unchanged section or sections of the file to continue to be deduplicated.
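The sub-file behavior described above can be sketched as follows.  This is an illustration only: it assumes fixed-size blocks (real systems may use variable-size blocks), an unrealistically small block size of 4 bytes so the example is easy to read, and SHA-256 as the block fingerprint.  All names here are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose, for illustration only

def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def block_level_dedup(stream, store):
    """Add the stream's blocks to the store; return the list of digests
    (the "recipe") needed to rebuild the stream."""
    recipe = []
    for block in split_blocks(stream):
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # already-seen blocks are not stored again
        recipe.append(digest)
    return recipe

def restore(recipe, store):
    """Rebuild the original stream from its recipe."""
    return b"".join(store[d] for d in recipe)

store = {}
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # only the third block differs from v1
r1 = block_level_dedup(v1, store)
r2 = block_level_dedup(v2, store)
print(len(store), "unique blocks stored for", len(r1) + len(r2), "blocks backed up")
```

The second version adds only one new block to the store, and both versions can still be restored from their recipes.  The extra bookkeeping (one fingerprint per block rather than per file) is where the higher overhead comes from.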

Byte-level deduplication is a form of block-level deduplication that understands the content, or “semantics”, of the data.  These systems are sometimes called CAS – Content Aware Systems.  Typically, deduplication devices perform block-level deduplication that is content-agnostic – blocks are blocks.  The problem of course is that certain blocks of data are much more likely to change than other blocks of data.  For backup systems, the “metadata” (data about data) that contains information about the actual backup tends to change continuously while the backup data statistically changes much less often.  The advantage to byte-level deduplication is that by understanding the content of the data the system can more efficiently deduplicate the bytes within the data stream that is being deduplicated.
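Here is a toy illustration of why content awareness helps.  It assumes a made-up backup stream format in which every record is an 8-byte metadata header (a date stamp) followed by an 8-byte payload; the format, the function names, and the sample data are all hypothetical and are not a description of the Unitrends implementation.

```python
import hashlib

HEADER_LEN = 8   # assumed fixed-length metadata header per record
PAYLOAD_LEN = 8  # assumed fixed-length payload per record
RECORD_LEN = HEADER_LEN + PAYLOAD_LEN

def digest(data):
    return hashlib.sha256(data).hexdigest()

def agnostic_dedup(stream, store):
    """Content-agnostic: every record-sized block is an opaque blob."""
    for i in range(0, len(stream), RECORD_LEN):
        block = stream[i:i + RECORD_LEN]
        store.setdefault(digest(block), block)

def content_aware_dedup(stream, store):
    """Content-aware: split each record into metadata and payload, and
    deduplicate only the payload. The header would be kept per backup."""
    for i in range(0, len(stream), RECORD_LEN):
        record = stream[i:i + RECORD_LEN]
        payload = record[HEADER_LEN:]
        store.setdefault(digest(payload), payload)

def make_backup(stamp, payloads):
    """Build a stream in the assumed header-plus-payload format."""
    return b"".join(stamp + p for p in payloads)

payloads = [b"payload1", b"payload2"]        # the data itself does not change
monday = make_backup(b"20100301", payloads)
tuesday = make_backup(b"20100302", payloads)  # only the metadata changes

agnostic_store, aware_store = {}, {}
for backup in (monday, tuesday):
    agnostic_dedup(backup, agnostic_store)
    content_aware_dedup(backup, aware_store)
print(len(agnostic_store), "blocks stored without content awareness")
print(len(aware_store), "payloads stored with content awareness")
```

Because the changing metadata is mixed into every block, the content-agnostic pass sees four different blocks across the two backups, while the content-aware pass recognizes that the two payloads never changed.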

Ironically, file-level deduplication is a form of byte-level deduplication since there must be some degree of content-awareness in order to detect a file versus some other form of data.  But of course the problem, as described above, is that file-level deduplication can’t handle structured data and can’t handle changes at the sub-file level.

What Unitrends has done with its release 6 is to create a byte-level deduplication system that is integrated with the backup appliance so that the appliance inherently deduplicates at the byte level without the delays between the backup server and backup storage associated with dedicated deduplication devices.  And of course an integrated all-in-one backup appliance doesn’t force the customer to integrate products from different vendors (e.g., the server vendor, the backup software vendor, and the deduplication device vendor).