[Chart: Actual Unitrends and REDACTED VMware vSphere 5.5 Storage Deduplication for a Single VM and a Single VMDK]

[Preface: Note that although the data is real, the name of our competitor has been obscured, changed, and otherwise redacted to avoid hurting anyone’s feelings.  The original post in this series describes this in more detail.  The second post in this series describes HOS vs GOS backup and deduplication.  I was also asked if this post applied only to VMware – it actually applies to Microsoft Hyper-V and other virtualization platforms as well.]

Before I begin, a quick note about writing this blog post.  The toughest part was the title.  I found myself using Google to find euphemisms for “calling bullcrap.”  In that vein, let’s cut to the chase.

Per-job deduplication is incredibly, remarkably inefficient.  Don’t take my word for it – look at the data in the chart above.  The environment in which this test was run should be the poster child for HOS-based (i.e., hypervisor-based) per-job deduplication – it’s the best possible case for demonstrating why per-job deduplication is absolutely great.  I am not showing some complicated environment in which “real” deduplication (and by that, I mean either inline or post-processing deduplication that deduplicates across jobs as well as across time) has an advantage – although that would be fair enough to do.  Instead, I’m showing the storage footprint for a single VM, with a 5% change rate, being backed up and replicated by REDACTED and by Unitrends.  I’ve also run this at the same block size – 1MB – that is recommended for all the products for virtualized backup.

Note that this isn’t really fair – to real deduplication.  The fundamental weakness of per-job deduplication is that it doesn’t deduplicate across backup jobs.  Since the probability of finding duplicate data not only on clones but also on VMs with the same operating system is relatively high, per-job deduplication misses a lot unless you can put all VMs into a single backup job.  What shocked me when I ran this test was the discovery that, even with a single VM, real deduplication beats per-job deduplication by 200% after four weeks – and the gap keeps widening as per-job deduplication has increasing problems finding duplicate data in prior backup jobs.
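
For readers who want to see the cross-job weakness in miniature, here is a rough Python sketch.  It is a toy model, not how REDACTED or Unitrends actually work, and it does not reproduce the numbers in the chart.  The job names, the block contents, and the two helper functions are all made up for illustration; the only idea taken from the discussion above is that a per-job scheme looks for duplicates inside one job, while "real" deduplication looks across every job.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Identify a block by its SHA-256 digest (an assumed choice of hash)."""
    return hashlib.sha256(block).hexdigest()


def stored_blocks_per_job(jobs: dict) -> int:
    """Toy per-job model: each job only deduplicates against itself."""
    total = 0
    for blocks in jobs.values():
        # A fresh fingerprint set per job, so nothing is shared between jobs.
        total += len({fingerprint(b) for b in blocks})
    return total


def stored_blocks_global(jobs: dict) -> int:
    """Toy global model: one fingerprint set shared by every job."""
    seen = set()
    for blocks in jobs.values():
        seen.update(fingerprint(b) for b in blocks)
    return len(seen)


# Hypothetical data: two VMs cloned from the same OS image, one job each.
os_image = [f"os-block-{i}".encode() for i in range(8)]
jobs = {
    "job_vm_a": os_image + [b"app-a-data-0", b"app-a-data-1"],
    "job_vm_b": os_image + [b"app-b-data-0", b"app-b-data-1"],
}

per_job_total = stored_blocks_per_job(jobs)
global_total = stored_blocks_global(jobs)

print("blocks stored, per-job dedup:", per_job_total)
print("blocks stored, global dedup: ", global_total)
```

In this toy, the per-job scheme stores the shared OS blocks once for each job, while the global scheme stores them once in total.  Real products differ in block size, hashing, and how incrementals are chained, so treat this as an illustration of the idea only.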

Have a different perspective?  As always, would love to hear from you.