Unitrends and Veeam: Content-Aware Deduplication

[This post wasn’t originally in the Unitrends and Veeam series; I created it because of a great request by a reader “Marcuscassius” to discuss content-aware deduplication.]

Content-aware deduplication means that the deduplication engine of a backup system is aware of the content that is being protected and thus can optimize the deduplication of that content.  For example, if you know that the backup stream being ingested consists of files, and you know where the file boundaries are, then your deduplication system can take advantage of that to align deduplication on file boundaries.  This applies to structured data as well – in other words, Exchange, SQL Server, and other applications have structured data stores that lend themselves to content-aware deduplication.  Whether you get content-aware deduplication typically depends more on the backup approach than on whether your vendor is Unitrends or Veeam – it typically requires support for different backup strategies (not just agentless).

The figures below depict, side by side, the difference between content-aware deduplication and “content-ignorant” deduplication.  I will review a full backup and an incremental backup and the effect of content-awareness with respect to deduplication.  In each of these figures, the raw storage is depicted at the bottom (Underlying Storage), hypervisor-based storage above that (VMFS Block A and VMFS Block B – reflecting a VMware virtualization solution), operating system storage above that (OS Block 1, OS Block 2, and OS Block 3), and a file system above that (File X, two copies of File Y, and File Z).  Both the left and the right have identical storage architectures – both depict virtualized storage.

A full backup occurs.

On the left, a content-aware backup leads to content-aware deduplication.  In this case, the backup and deduplication engines copy all files – and one copy of File Y is deduplicated since it is duplicate data.  On the right, a content-ignorant backup leads to no deduplication since no virtualized “block” of data is a duplicate block.  Now – clearly there is duplicate data (“File Y”) – but because the core unit of deduplication is the hypervisor block, the duplicate doesn’t show up in a content-ignorant backup and deduplication scheme.
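The full-backup comparison above can be illustrated with a toy example.  This is only a sketch under stated assumptions: the file contents, the 8-byte block size, and the helper name `dedup_count` are all hypothetical, and real products use far larger blocks and more sophisticated chunking.  It simply shows why two copies of File Y are caught when the unit of deduplication is the file, but missed when the unit is a fixed block that ignores file boundaries.

```python
import hashlib

def dedup_count(chunks):
    """Return (total chunks, unique chunks), comparing chunks by SHA-256 fingerprint."""
    fingerprints = {hashlib.sha256(chunk).hexdigest() for chunk in chunks}
    return len(chunks), len(fingerprints)

# Hypothetical file contents: File X, two copies of File Y, and File Z.
files = [
    ("File X", b"x0x1x2x3x4"),
    ("File Y", b"y0y1y2y3y4y5"),
    ("File Y (copy)", b"y0y1y2y3y4y5"),
    ("File Z", b"z0z1z2z3z4"),
]

# Content-aware: the unit of deduplication is the file itself.
file_total, file_unique = dedup_count([data for _, data in files])

# Content-ignorant: the same bytes seen only as fixed-size blocks.
# The block size is an assumption chosen to keep the example small.
BLOCK_SIZE = 8
image = b"".join(data for _, data in files)
blocks = [image[i:i + BLOCK_SIZE] for i in range(0, len(image), BLOCK_SIZE)]
block_total, block_unique = dedup_count(blocks)

print(f"content-aware:    {file_total} files, {file_unique} stored")
print(f"content-ignorant: {block_total} blocks, {block_unique} stored")
```

In this sketch the two copies of File Y start at different offsets within their blocks, so no two blocks are byte-for-byte identical even though the duplicate file is plainly there.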

Next, an incremental backup occurs.

Through the use of the color gold, we depict that “File Z” changes.  The content-aware system (depicted on the left) copies only this file into the backup system, while the content-ignorant system (depicted on the right) copies a larger block.
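To put rough numbers on the incremental case, the following sketch compares the bytes copied when only File Z changes.  The 4096-byte block size and the offset and length of File Z are hypothetical values picked for illustration, and `blocks_touched` is a helper invented for this example, not part of any product.

```python
def blocks_touched(offset, length, block_size):
    """Return the indices of the fixed-size blocks that a byte range overlaps."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

# Hypothetical values: a 300-byte File Z stored at byte offset 5000.
BLOCK_SIZE = 4096
FILE_Z_OFFSET = 5000
FILE_Z_LENGTH = 300

touched = blocks_touched(FILE_Z_OFFSET, FILE_Z_LENGTH, BLOCK_SIZE)

# Content-aware: copy the changed file only.
content_aware_bytes = FILE_Z_LENGTH

# Content-ignorant: copy every block that contains any part of the file.
content_ignorant_bytes = len(touched) * BLOCK_SIZE

print(f"blocks touched: {touched}")
print(f"content-aware copies    {content_aware_bytes} bytes")
print(f"content-ignorant copies {content_ignorant_bytes} bytes")
```

If the file happened to straddle a block boundary, both blocks would be copied, which widens the gap further.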

Note that we have purposely shown a simple case of content-aware deduplication.  Files are actually not monolithic – they are broken up into segments (or blocks) themselves, so that if one segment or block changes, only what has changed is backed up.  The key here is that the concept of a file is implicitly understood by both the backup and the deduplication system – and thus both can be optimized around the content and not just a block.
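The segment idea can be sketched as follows.  This is a simplified illustration, not a description of how any particular product segments files: the 4-byte segment size, the file contents, and the function names are assumptions, and the comparison is purely positional (it handles an in-place change, not data inserted in the middle of a file).

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical; real segments are far larger

def segment_fingerprints(data, size=SEGMENT_SIZE):
    """Split data into fixed-size segments and fingerprint each one."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]

def changed_segments(old, new, size=SEGMENT_SIZE):
    """Return indices of segments in `new` that differ from, or are absent in, `old`."""
    old_fps = segment_fingerprints(old, size)
    new_fps = segment_fingerprints(new, size)
    return [
        index
        for index, fingerprint in enumerate(new_fps)
        if index >= len(old_fps) or old_fps[index] != fingerprint
    ]

# Hypothetical File Z before and after a change to its third segment.
file_z_before = b"aaaabbbbccccdddd"
file_z_after = b"aaaabbbbCCCCdddd"

changed = changed_segments(file_z_before, file_z_after)
bytes_to_copy = sum(
    len(file_z_after[i * SEGMENT_SIZE:(i + 1) * SEGMENT_SIZE]) for i in changed
)

print(f"changed segments: {changed}")
print(f"bytes to copy: {bytes_to_copy} of {len(file_z_after)}")
```

Only the one modified segment needs to be sent to the backup system; the other three are already stored.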

Content-aware deduplication is one of those concepts that seems so clear and straightforward that the obvious question is why all deduplication systems don’t offer it.  The reason is that content-aware deduplication is typically associated with agent-based backup, while content-ignorant deduplication is typically associated with agentless backup.  Thus companies that don’t offer agent-based backups and that haven’t implemented modern deduplication have an incentive to ignore or obscure the topic.

Do the different techniques described above have an impact on recovery?  The techniques themselves do not.  However, there are fundamental differences in how agentless and agent-based backups work.  Agentless backup typically relies upon the hypervisor’s CBT (Changed Block Tracking) engine for incremental backups; agent-based backup typically relies upon operating-system-based techniques.  Both approaches make use of the Microsoft VSS (Volume Shadow Copy Service) engine to quiesce and back up Windows-based systems.

The Unitrends adaptive inline deduplication is inherently a content-aware deduplication approach.  When you’re considering backup (and continuity) solutions, you need to ask your vendor (whether Veeam or any other vendor) what they do with respect to content awareness.  Be careful when a vendor that offers only one strategy tells you that its approach to backup is the best way – when all you have is a hammer, the whole world looks like a nail.  Also be careful if a vendor (whether Veeam or any other vendor) says “yes” to this when what they’re actually doing is “re-deduplicating” replication traffic rather than storage and labeling that “WAN acceleration” or “WAN optimization” – it’s very different.

Thanks to the reader who asked about this.  If you have questions or concerns about this or any other topic, we’d love to hear from you.

 

