Actual Unitrends and REDACTED VMware vSphere 5.5 Storage Deduplication for a Single VM and a Single VMDK

[Preface: Note that although the data is real, the name of our competitor has been obscured, changed, and otherwise redacted to avoid hurting anyone’s feelings.  The original post in this series describes this in more detail.  I was also asked if this post referred only to VMware – it applies to Microsoft Hyper-V and other virtualization platforms as well.]

I had promised more posts on this topic a few weeks ago – this got delayed by a month I spent in Sydney with our new team over there.  I’ll be writing a lot more on that – and AWS, OpenStack, and other clouds in particular – in the next few months.  But I’ve received quite a few e-mails about the chart above, so I thought I’d begin with one of the most common questions: what are GOS and HOS virtualization backup, and why do the HOS and GOS VMware deduplication results above differ?

GOS (guest operating system) backup treats a VM like a physical server; HOS (host operating system) backup protects one or more VMs at the hypervisor level using hypervisor-level primitives such as VADP (vStorage APIs for Data Protection).

HOS-based virtualization backup has many clear advantages over GOS-based virtualization backup – we typically recommend HOS-level backup for virtualized environments.  However, there are a few cases in which GOS-based backup has advantages.  For more on this, see the extensive discussion in Dogma, Faith, and Fact: Do You Have to Pick a Virtual Backup Religion?  But – shame on me – I never raised deduplication (and its sister, replication – which is basically fueled by the efficacy of your deduplication implementation) in that article.

All HOS-level virtualization backup techniques of which I’m aware use a form of block-level deduplication.  They carve the virtual machine into “blocks” of data and attempt to find matches of those blocks in prior backups so that storage can be saved.  These blocks can range from as low as 32KB or less to as high as 16TB or more.  Some techniques are based on fixed blocks, some on variable-sized blocks whose size depends upon a number of circumstances.  Regardless, the basic techniques of HOS-level backup are typically independent of the file system and application semantics of the operating system and applications within the VM.
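
To make the block-matching idea concrete, here is a minimal Python sketch of fixed-block deduplication. It is a toy under stated assumptions, not how any particular product works: the 4-byte block size, the SHA-256 fingerprints, the in-memory `store` dictionary, and the function names are all hypothetical choices made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


def dedupe_fixed_blocks(image, store, block_size=BLOCK_SIZE):
    """Split an image into fixed-size blocks and store each unique block once.

    Returns the "recipe": the ordered list of block fingerprints needed to
    rebuild the image from the store.
    """
    recipe = []
    for offset in range(0, len(image), block_size):
        block = image[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:  # only blocks never seen before cost storage
            store[digest] = block
        recipe.append(digest)
    return recipe


def restore(recipe, store):
    """Rebuild an image by concatenating the stored blocks in recipe order."""
    return b"".join(store[digest] for digest in recipe)


store = {}
monday = b"AAAABBBBCCCCDDDD"   # first backup of a toy "VM image"
tuesday = b"AAAABBBBXXXXDDDD"  # one block changed in place overnight

monday_recipe = dedupe_fixed_blocks(monday, store)
blocks_after_monday = len(store)
tuesday_recipe = dedupe_fixed_blocks(tuesday, store)
new_blocks_tuesday = len(store) - blocks_after_monday
```

In this sketch the second backup adds only the one changed block to the store, and both images can still be rebuilt from their recipes. Note that nothing in the code knows or cares what files live inside the image.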

GOS-level virtualization backup techniques that are aware of the objects that make up the storage have an advantage over HOS-level techniques in that they can deduplicate at the file system level on both a variable (file-based) and chunk (block-based within a file) level.  This gives an advantage to a well-implemented GOS-based deduplication algorithm – and you see that advantage above.
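
One way to see where that advantage can come from is the toy comparison below. It is a deliberately simplified sketch, not a model of any real file system or product: the “disk image” is just file contents concatenated in name order, the 4-byte block size is arbitrary, and every function name is hypothetical. It shows a worst case for fixed blocks applied to the image, where a small new file shifts everything behind it; variable-sized blocks at the image level are one way to soften this.

```python
import hashlib


def fingerprint(data):
    """Content fingerprint used as the deduplication key."""
    return hashlib.sha256(data).hexdigest()


def new_blocks_image_level(image, index, block_size=4):
    """Count fixed-size image blocks not yet in the index, adding them as we go."""
    new = 0
    for offset in range(0, len(image), block_size):
        fp = fingerprint(image[offset:offset + block_size])
        if fp not in index:
            index.add(fp)
            new += 1
    return new


def new_files_file_level(files, index):
    """Count whole files whose content is not yet in the index."""
    new = 0
    for content in files.values():
        fp = fingerprint(content)
        if fp not in index:
            index.add(fp)
            new += 1
    return new


def flatten(files):
    """Toy 'disk image': file contents concatenated in name order (an assumption)."""
    return b"".join(files[name] for name in sorted(files))


day1 = {"b.txt": b"11112222", "c.txt": b"33334444"}
day2 = dict(day1)
day2["a.txt"] = b"xy"  # a 2-byte file lands in front and shifts the rest

image_index, file_index = set(), set()
image_day1 = new_blocks_image_level(flatten(day1), image_index)
image_day2 = new_blocks_image_level(flatten(day2), image_index)
file_day1 = new_files_file_level(day1, file_index)
file_day2 = new_files_file_level(day2, file_index)
```

On day two the image-level pass finds no matching blocks at all, because every block boundary moved by two bytes, while the file-aware pass recognizes both unchanged files and stores only the new one.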

Now – with all of this said – you can implement HOS-level deduplication with wildly different levels of sophistication and efficacy.  The REDACTED vendor above does something called per-job deduplication – and it is really ineffective.  The next post in this series will discuss why we see such major differences between per-job deduplication and more accepted forms of deduplication (what some people call “global” or “storage” deduplication).
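
As a preview, the sketch below shows the difference in scope between the two approaches. It rests on an assumption about what the terms mean: “per-job” is taken to mean each backup job keeps its own fingerprint index, and “global” to mean all jobs share one. The job names, the image contents, and the `stored_blocks` helper are hypothetical and do not describe any vendor’s implementation.

```python
import hashlib


def stored_blocks(jobs, per_job, block_size=4):
    """Count the blocks physically stored for a set of backup jobs.

    per_job=True:  each job gets its own fingerprint index (assumed meaning
                   of "per-job" deduplication).
    per_job=False: all jobs share one index (assumed meaning of "global").
    """
    shared_index = set()
    total = 0
    for images in jobs.values():
        index = set() if per_job else shared_index
        for image in images:
            for offset in range(0, len(image), block_size):
                fp = hashlib.sha256(image[offset:offset + block_size]).digest()
                if fp not in index:
                    index.add(fp)
                    total += 1
    return total


# Two hypothetical jobs protecting VMs that share the same "OS" and "library" blocks.
jobs = {
    "job_web": [b"OSOSLIBSweb1"],
    "job_db": [b"OSOSLIBSdb01"],
}

per_job_total = stored_blocks(jobs, per_job=True)
global_total = stored_blocks(jobs, per_job=False)
```

With separate indexes, the blocks common to both VMs are stored once per job; with a shared index they are stored once in total.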

As always, I would love to hear your views on this or any other subject – and thanks to the dozen or so people who wrote asking me about this particular topic.