I was recently speaking to a buyer about backup and global deduplication. Basically, I was told that global deduplication was a hard requirement (what we sometimes call a KPC, or Key Purchase Criterion). I told the buyer that we didn’t have global deduplication – and thus we didn’t meet their requirements. Well – because the overall price of our appliances was so much less than the solution the buyer was looking at, they told me that as long as we had global deduplication on our roadmap, they’d make the purchase. That led to an interesting discussion – and this post.
Global deduplication is a feature that some backup storage vendors (also called secondary storage vendors) offer – Data Domain most famously. It allows deduplication to work across multiple storage “nodes” so that deduplication ratios are higher. So if you have two 10TB secondary storage devices with global deduplication implemented across them, you theoretically get the same effect as having a single 20TB device. In other words, deduplication eliminates repeated sections of data across both of the 10TB secondary storage devices.
Local deduplication is the opposite of global deduplication: where you have two 10TB secondary storage devices, repeated blocks are eliminated separately on each of the two devices. Thus, in the simplest case, if you have a thousand identical files, five hundred of which are on one 10TB device and five hundred of which are on the other, you still have two copies of that file with local deduplication. With global deduplication, you’d have only one. So in this theoretical case, your deduplication ratio would be 1000:2 with local deduplication and 1000:1 with global deduplication.
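The local-versus-global distinction above can be sketched with a toy content-hash index. This is just an illustration of the concept, not any vendor’s actual implementation – the workload (1000 identical files split across two nodes) matches the example above, and the `dedup_ratio` helper is made up for this sketch:

```python
import hashlib

def dedup_ratio(files, nodes, global_dedup):
    """Count logical copies vs. physically stored copies.

    files: list of (node_index, content_bytes) pairs -- a toy workload.
    With local dedup, each node keeps its own index of block hashes;
    with global dedup, all nodes share a single index.
    Returns (logical_copies, stored_copies).
    """
    if global_dedup:
        indexes = [set()]                          # one shared hash index
    else:
        indexes = [set() for _ in range(nodes)]    # one index per node
    for node, content in files:
        digest = hashlib.sha256(content).digest()
        target = indexes[0] if global_dedup else indexes[node]
        target.add(digest)                         # duplicates are free
    stored = sum(len(ix) for ix in indexes)
    return len(files), stored

# 1000 identical files: 500 land on node 0, 500 on node 1
workload = [(i % 2, b"same file contents") for i in range(1000)]

print(dedup_ratio(workload, nodes=2, global_dedup=False))  # (1000, 2) -> 1000:2
print(dedup_ratio(workload, nodes=2, global_dedup=True))   # (1000, 1) -> 1000:1
```

Local deduplication stores one copy per node (1000:2); global deduplication stores one copy total (1000:1), because the shared index sees that the second node’s blocks already exist.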
Global deduplication solves a specific problem: how do you get higher deduplication ratios for an existing backup solution (i.e., an existing backup server and backup software)? The problem is that global deduplication requires significant engineering – and these secondary storage backup device vendors tend to charge a pretty significant price in terms of $/TB (on a deduplicated basis, but even more so on a raw-TB basis). It can be difficult to get high real-world ingest rates with global deduplication – and the price of a new “shelf” for this type of storage is quite often the reason we get a call to “rip and replace” an existing backup solution. Even if we assume higher deduplication ratios, in every situation I’ve run across the savings are completely wiped out by the higher pricing. (Caveat: I work in the SMB space, with companies ranging from ten or so people up to a few thousand employees – so there’s a bias here. From what I’ve read, if I were more associated with Global 1000 companies with petabytes of data, I’d more often run into situations where global deduplication might be remotely in the ballpark in terms of price.)
[See part two of this series for a different approach to the problem.]