Deduplication, Windows Server 2012, the 1990s, and Seinfeld
I love Spiceworks – just can’t get enough of it. A great debate has been going on over there, involving a virtualization-only vendor, about backup, Windows Server 2012 (WS2012), deduplication, and a whole set of related issues. Our very own Maria Pressley, who is Chief Evangelist (she ran pre-sales before that, and engineering before that – so she tends to be pretty sharp), waded into the fray as well.
I tend to be pretty careful about what I post on Spiceworks, so I decided that it would be better to simply use this blog to respond to a few things that Maria already responded to in a Spiceworks thread – with perhaps an additional point or two. So here goes…
* wow, you need an agent inside the VMs just to do backups? the 90’s just called 🙂
Unitrends actually doesn’t need an agent inside the VM to do backups. (Note: I’m going to ignore the fact that virtualization-only vendors also need some code inside the VM if you want to do management tasks there; for more on this, see Everybody Lies: Backup and Secret Agents.) Instead, let’s get at the heart of the issue. Unitrends isn’t a “religiously fanatical” company that says you can only do backup one way. We support host operating system (HOS)-level backup, as virtualization-only vendors do, and we support guest operating system (GOS)-level backup as well. In other words, we support not only 100% virtual environments but physical environments as well.
I’m not sure what the 1990s have to do with anything, but I always thought it was a step back when vendors limited choice (only virtualization, no tape, whatever) rather than embracing the fact that IT administrators have a huge job in front of them: being both responsive and proactive in delivering agile IT. Obviously there are a lot of folks who buy data protection limited to 100% virtualized environments, and – to use the words of a 1990s sitcom
(Note: Maria does a great job of pointing out all the things you can choose to do at the GOS level that you can’t do at the HOS level, and I encourage anyone here to read the Spiceworks thread. But it really just comes down to the old saying that when you have a hammer, everything looks like a nail. When you’re virtual only, you tend to look at the world through virtual-only lenses.)
* as you are doing it “above the file system” this implies you do not care about the footprint this customer has already saved but you are rehydrating the data in full, sending that over the wire and afterwards deduplicating it back your own way.
Actually, this is a good point concerning rehydration and the choice of whether to use HOS- or GOS-level backup. I can absolutely see backing up VMware vSphere or Microsoft Hyper-V with a WS2012 deduplicated file system being much more effective at the HOS (block) level. Of course, if I wished to exclude a 9TB file within a 10TB VHDX, then the rehydration penalty associated with GOS-level backup doesn’t look very bad. This is the reason that I tend to talk trade-offs rather than having a binary, black-or-white point of view in backup – I think it typically comes down to flexibility.
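To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes the 10TB is the logical (rehydrated) size of the data in the VHDX, and it uses a hypothetical 2:1 deduplication ratio inside the guest; neither figure comes from the Spiceworks thread, and the function names are made up for illustration.

```python
def hos_transfer_tb(logical_tb: float, dedupe_ratio: float) -> float:
    """HOS-level: send the already-deduplicated blocks, but nothing can be excluded."""
    return logical_tb / dedupe_ratio


def gos_transfer_tb(logical_tb: float, excluded_tb: float) -> float:
    """GOS-level: data is rehydrated, but excluded files are never read or sent."""
    return logical_tb - excluded_tb


# Assumed figures: 10TB logical VHDX, 9TB file excluded, hypothetical 2:1 dedupe.
hos_tb = hos_transfer_tb(logical_tb=10.0, dedupe_ratio=2.0)
gos_tb = gos_transfer_tb(logical_tb=10.0, excluded_tb=9.0)

print(f"HOS-level sends {hos_tb} TB, GOS-level sends {gos_tb} TB")
```

Under these assumptions the guest file system would need to deduplicate at 10:1 before the HOS-level backup sent as little data as the GOS-level backup with the exclusion. Without the exclusion, the HOS-level approach wins easily.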
Now – you might think – isn’t this convenient for a guy whose company offers both to feel this way? Of course. But then again, it’s why I chose this company.
* as you are talking about global dedupe on your target device only, this also implies that the footprint saved in any way cannot leave the system. If you want parts of your backups (ex. only weekly) offsite you’ll have to rehydrate again and do the same on the other side. Because [virtualization-only vendor] was so smart to do the inline dedupe only on job level, you can take the file anywhere in that state.
Fascinating. By this argument, deduplication is always bad. Think about it. A file exists alone. A job is a container for a file. If you do no deduplication, then you can carry that file anywhere. If you do deduplication at the job level, you can take the job collection of files around and rehydrate as needed. So the maximum in portability is no deduplication.
But the reason that people care about deduplication is data reduction. The more efficient your data reduction, the more deduplicated blocks you have. Thus using global deduplication (or for that matter, any form of deduplication that is time-based rather than job-based) means you’re typically going to have a higher data reduction ratio.
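A toy illustration of why the scope of deduplication matters, assuming fixed-size blocks and SHA-256 fingerprints. This is a simplified sketch, not how any particular product implements it; the block size, job names, and data are invented for the example.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny, hypothetical block size for the demo


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(data: bytes) -> set[bytes]:
    """Set of unique block fingerprints found in data."""
    return {hashlib.sha256(block).digest() for block in split_blocks(data)}


def stored_blocks_job_level(jobs: dict[str, bytes]) -> int:
    """Job-level dedupe: each job only removes duplicates within itself."""
    return sum(len(fingerprints(data)) for data in jobs.values())


def stored_blocks_global(jobs: dict[str, bytes]) -> int:
    """Global dedupe: duplicates are removed across every job on the target."""
    seen: set[bytes] = set()
    for data in jobs.values():
        seen |= fingerprints(data)
    return len(seen)


# Two nightly jobs that share most of their blocks (made-up data).
jobs = {
    "monday": b"AAAABBBBCCCCAAAA",
    "tuesday": b"AAAABBBBCCCCDDDD",
}

logical_blocks = sum(len(split_blocks(data)) for data in jobs.values())
job_level_blocks = stored_blocks_job_level(jobs)
global_blocks = stored_blocks_global(jobs)

print(f"job-level ratio: {logical_blocks / job_level_blocks:.2f}:1")
print(f"global ratio:    {logical_blocks / global_blocks:.2f}:1")
```

In this made-up data set, job-level deduplication stores 7 of the 8 logical blocks while global deduplication stores 4, because the blocks that Monday and Tuesday have in common are kept only once.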
A smarter way to do this is to either rehydrate before you archive (or whatever) or carry around with you a subset of the set of deduplicated blocks for later rehydration.
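The second option, carrying a subset of the deduplicated blocks, can be sketched as follows. This is a minimal illustration under assumed data structures (a dictionary as the global block store and an ordered list of fingerprints as each job's "recipe"); the names `ingest`, `export_job`, and `rehydrate` are hypothetical and do not describe any vendor's actual format.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny, hypothetical block size for the demo


def ingest(store: dict[bytes, bytes], data: bytes) -> list[bytes]:
    """Add data to the global block store and return the job's recipe
    (the ordered list of block fingerprints needed to rebuild it)."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # keep each unique block only once
        recipe.append(digest)
    return recipe


def export_job(store: dict[bytes, bytes], recipe: list[bytes]) -> dict[bytes, bytes]:
    """Copy only the blocks this job references, still deduplicated."""
    return {digest: store[digest] for digest in set(recipe)}


def rehydrate(blocks: dict[bytes, bytes], recipe: list[bytes]) -> bytes:
    """Rebuild the original data from a block set and its recipe."""
    return b"".join(blocks[digest] for digest in recipe)


store: dict[bytes, bytes] = {}
monday = b"AAAABBBBCCCCAAAA"
tuesday = b"AAAABBBBCCCCDDDD"
monday_recipe = ingest(store, monday)
tuesday_recipe = ingest(store, tuesday)

# Take only Monday's backup offsite: its recipe plus the blocks it needs.
portable = export_job(store, monday_recipe)
restored = rehydrate(portable, monday_recipe)

print(len(store), "blocks in the global store,", len(portable), "blocks exported")
```

The exported set stays deduplicated and is enough to restore Monday's backup on its own, without shipping the whole global store or rehydrating first.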
I’ve got to admit – a pretty interesting argument – that lower deduplication data reduction ratios are good. Incredibly, insanely, creative.
In any case, that’s what I think. Would appreciate any insight or issues from anyone.