With respect to backup, there has been more written concerning deduplication than almost any other subject.  I find it to be endlessly entertaining on a technical level – there are some really smart people working in the area and there’s some really neat technology.  What I find most interesting, however, is how rarely deduplication is compared to other forms of data reduction.

A Data Domain DD510 has 3.75TB of physical storage and a $22K MSRP.  So the first question is what the effective storage capacity of the device is.  From the best studies and sources I can find, you are going to see a 4:1 deduplication ratio in short retention spans (under a month or two) and you’re going to see 10:1 deduplication ratios in longer retention periods.

If you believe some of the claims, you can get a 20:1 data reduction ratio from this system.  Of course it depends on the data.  Yes – I’ve read claims of 50:1 – but when discussing data reduction ratios the term “probability” must come into play.  There’s no doubt that variably-based block-level deduplication can achieve 50:1 – but then again there’s no doubt that it’s possible that monkeys may fly out of my rear end.  Both are improbable – but not impossible.

So going back to the probable, let’s assume 10:1 data reduction.  This means that my effective storage on my DD510 is 37.5TB.  At a $22K MSRP, this means that each terabyte of data costs me about $587.

Now, let’s put together a decent NAS.  Let’s say I use 1.5TB drives at $126 per drive (the latest pricing I saw when I looked this up on Google).  I go get a 12-drive dual power supply chassis and a RAID-6 adapter and a motherboard – so I have an effective 15TB, since RAID-6 gives up two drives’ worth of capacity to parity.  I use standard compression techniques at roughly 2:1 and I’ll get 30TB.  Putting this together for $5000 is pretty easy; so each terabyte of data costs me $167.  Note that I have a significant advantage with the non-deduplicated NAS at a 10:1 data reduction rate; heck, I have a significant advantage with the non-deduplicated NAS at a 20:1 data reduction rate!
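If you want to rerun this arithmetic with your own prices, here is a minimal Python sketch.  The function and variable names are mine, purely for illustration, and the 2:1 compression ratio is an assumption inferred from the 15TB-to-30TB figure above.  It also works out the deduplication ratio the DD510 would need to match the NAS on cost per terabyte.

```python
def cost_per_effective_tb(price_usd, usable_tb, reduction_ratio):
    """Dollars per terabyte of logical (pre-reduction) data stored."""
    return price_usd / (usable_tb * reduction_ratio)


# Figures from the post: DD510 at $22K MSRP with 3.75TB of physical storage.
DD510_PRICE = 22_000
DD510_TB = 3.75

dd510_at_10 = cost_per_effective_tb(DD510_PRICE, DD510_TB, 10)
dd510_at_20 = cost_per_effective_tb(DD510_PRICE, DD510_TB, 20)

# Home-built NAS: 12 x 1.5TB drives in RAID-6 (two drives' worth of parity),
# roughly $5000 all in, with an assumed 2:1 compression ratio.
NAS_PRICE = 5_000
nas_usable_tb = (12 - 2) * 1.5
nas = cost_per_effective_tb(NAS_PRICE, nas_usable_tb, 2)

# Deduplication ratio at which the DD510 would cost the same per terabyte.
break_even_ratio = DD510_PRICE / (DD510_TB * nas)

print(f"DD510 at 10:1: ${dd510_at_10:,.2f} per TB")
print(f"DD510 at 20:1: ${dd510_at_20:,.2f} per TB")
print(f"NAS at 2:1:    ${nas:,.2f} per TB")
print(f"Break-even deduplication ratio: {break_even_ratio:.1f}:1")
```

Under these assumptions the break-even ratio comes out at about 35:1, which is why the NAS still wins at 20:1.  The sketch only covers purchase price per terabyte; it says nothing about ingest rates, recovery time, power, or administration.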

How do my ingest rates compare?  Unless you’re smoking crack, you understand that the reason deduplication vendors quote ingest rates is because there’s a problem.  If you do inline deduplication, there’s a big problem that is most amenable to “grid” scaling multiple units (elegant but expensive).  If you do post-processing deduplication, then once you have more data than your “landing zone” can hold you’re still at the mercy of the deduplication rate.  So you can assume that your ingest rate is going to be significantly better on your non-deduplicated system.  How about your recovery time?  Please.  You know the answer – re-duplicating data is slow.

Now, are they ripping you off on the DD510?  No – because it’s really expensive to provide the huge memory and fast processors necessary to map the 3.75TB of storage on a block-by-block basis so that you can deduplicate.

Does this mean that deduplication isn’t a valid technology?  Of course not.  There are situations where it’s incredibly important – such as multi-year retention scenarios with a relatively low data change rate (so that ingest isn’t a problem) and a ton of data.  It just means that it’s not yet an affordable technology when compared to simpler approaches.  It also means that most of the R&D that has gone into deduplication tends to focus on throwing hardware at the problem rather than thinking through the problem in a customer-focused, affordable manner.

I think a better way to look at the problem is to consider all forms of data reduction.  For example, there are ways to do advanced compression that yield 4:1 ratios on unstructured data and 10:1 on structured data (disclaimer: the company I work for does this kind of thing).  I also think that there are forms of data deduplication that are much, much less demanding in terms of hardware resources but that can scale as the hardware resources scale.

Bottom line: if anyone tells you deduplication is a panacea, remember to do your homework.  Very effective technology (at least by some vendors) but a pretty expensive one.  As always, do your due diligence.

Comments

  1. If you are backing up a bunch of Windows boxes, there are tons of OS and application files on every machine that will be duplicates. I would think file-level deduplication in cases like this could make great gains. I think most of the gains of deduplication could be had through file-level deduplication. What are the chances of a significant block of data in one random file being exactly duplicated in another file? Compare this to the chances that EVERY SINGLE Windows box will have an identical copy of Arial.ttf installed. That is where the simple file-level deduplication should pay off.

  2. Boot from SAN (from certain vendors) and some virtualization products address non-duplication of OS-related files.

    Chris

  3. Agreed that boot from SAN addresses non-duplication of OS related files. So if you're going to a posture that all notebooks, PCs, workstations, and servers boot from SAN – in other words, you make a decision that the SAN is the only storage you'll support – this works well. On the other hand, if you have more storage than that found in the SAN, you're going to want something a bit more flexible.

    But I'd note that if you use a flexible approach for backup you're going to be able to exclude those system files anyway – so you can achieve “deduplication” by simply not “duplicating” in the first place.

    On virtualization, can you say a bit more about the virtualization addressing deduplication of OS related files? I'm aware of source-level deduplication in some of these products, but I'm not aware of anything that is OS specific.
