Overview of Source Versus Target Deduplication for Backup

Source deduplication means that deduplication occurs at the client being protected (i.e., the server, PC, or workstation, but not the SAN or NAS), while target deduplication means that deduplication occurs on the backup medium. Note that “backup medium” can mean the server on which the backup software resides, a dedicated deduplication device attached to that backup server, or a backup appliance.

Source deduplication has the advantage of using less LAN (on-premises) bandwidth but the disadvantage of using more client resources; target deduplication has the advantage of using fewer client resources but the disadvantage of using more LAN bandwidth.
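To make that trade-off concrete, here is a minimal sketch of the two approaches. It assumes fixed-size chunking and SHA-256 fingerprints, which is only one of several ways deduplication can be done; the names `BackupTarget`, `source_dedup_backup`, and `target_dedup_backup` are made up for illustration and do not come from any particular product. The byte counts cover chunk payload only and ignore the (much smaller) fingerprint traffic that source deduplication adds.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, chosen so the example is easy to follow


def chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(chunk: bytes) -> str:
    """Fingerprint a chunk; equal chunks always get equal fingerprints."""
    return hashlib.sha256(chunk).hexdigest()


class BackupTarget:
    """Hypothetical backup medium that keeps one copy of each unique chunk."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk

    def missing(self, digests):
        """Return the fingerprints the target has not stored yet."""
        return {d for d in digests if d not in self.store}

    def put(self, chunk: bytes) -> None:
        self.store[digest(chunk)] = chunk


def source_dedup_backup(data: bytes, target: BackupTarget) -> int:
    """The client does the hashing; only unseen chunks cross the LAN.

    Returns the number of payload bytes sent to the target.
    """
    by_digest = {digest(c): c for c in chunks(data)}  # client CPU/memory cost
    sent = 0
    for d in target.missing(by_digest.keys()):
        target.put(by_digest[d])
        sent += len(by_digest[d])
    return sent


def target_dedup_backup(data: bytes, target: BackupTarget) -> int:
    """The client ships everything; the target does the hashing.

    Returns the number of payload bytes sent to the target.
    """
    for c in chunks(data):
        target.put(c)  # hashing cost lands on the target, not the client
    return len(data)


if __name__ == "__main__":
    data = b"AAAABBBBAAAACCCC"  # 16 bytes, with one repeated chunk
    print(source_dedup_backup(data, BackupTarget()))  # 12
    print(target_dedup_backup(data, BackupTarget()))  # 16
```

Both paths leave the target holding the same three unique chunks; what differs is where the hashing work happens and how many bytes cross the LAN to get there.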

How To Best Use Source Versus Target Deduplication

I’ve heard everything from “source deduplication sucks” to “target deduplication sucks.” (Note: it’s much rarer to hear “deduplication sucks,” although if you read between the lines of what some of the dedicated compression vendors are really saying, you’ll see that point of view as well.) The truth, as John Kerry would say, is more nuanced.

If you look at long-term trends, then you’d have to believe that source-level deduplication is the long-term answer, since processor/memory/disk performance and density are improving so much faster than LAN bandwidth and latency. The trouble with long-term trends is that, as John Maynard Keynes famously noted, in the long run we’re all dead. In other words, quite often long-term trends don’t tell you much about what to do today. There is absolutely a need for primary storage deduplication (which has very little traction to date) to incorporate source-level deduplication. In the case of backup, techniques such as incremental forever backup dramatically reduce the advantages of source-level deduplication. Note the term “dramatically reduce” – source-level deduplication still helps even with incremental forever backup, just to a radically lesser degree.
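A bit of back-of-the-envelope arithmetic shows why incremental forever backup takes most of the wind out of source-level deduplication. The model below is a deliberately crude sketch: the function `lan_bytes` and all of its inputs (dataset size, daily change, and the share of changed data that duplicates something already backed up) are hypothetical numbers picked for illustration, not measurements. It also assumes the first backup always ships the whole dataset.

```python
def lan_bytes(dataset_gb: int, changed_gb_per_day: int, dup_pct_in_changes: int,
              days: int, incremental_forever: bool, source_dedup: bool) -> int:
    """Rough LAN traffic (GB) over `days` daily backups under a toy model."""
    first_backup = dataset_gb  # assumption: day one always sends everything
    if source_dedup:
        # Only changed data that is not a duplicate crosses the LAN,
        # whether the job is labeled "full" or "incremental".
        per_day = changed_gb_per_day * (100 - dup_pct_in_changes) // 100
    elif incremental_forever:
        # All changed data crosses the LAN, duplicates included.
        per_day = changed_gb_per_day
    else:
        # Repeated full backups resend the whole dataset every day.
        per_day = dataset_gb
    return first_backup + per_day * (days - 1)


if __name__ == "__main__":
    # Hypothetical workload: 1000 GB dataset, 20 GB changes daily,
    # 25% of the changed data duplicates existing data, 30 days of backups.
    args = dict(dataset_gb=1000, changed_gb_per_day=20,
                dup_pct_in_changes=25, days=30)
    for incremental in (False, True):
        without = lan_bytes(**args, incremental_forever=incremental, source_dedup=False)
        with_dedup = lan_bytes(**args, incremental_forever=incremental, source_dedup=True)
        print(incremental, without, with_dedup, without - with_dedup)
    # False 30000 1435 28565
    # True 1580 1435 145
```

Under these made-up numbers, source deduplication saves 28,565 GB of LAN traffic against daily full backups but only 145 GB against incremental forever: still a benefit, just a radically smaller one.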

What compounds the problems of source-level deduplication is that there are always software technologies coming along that increase the utilization of processor/memory/disk systems. Virtualization is the big kahuna of examples for this. Since source-level deduplication competes for those same resources, you find that a lot of businesses shy away from the resource contention on the servers, PCs, workstations, notebooks, and whatever else they’re protecting.

Backup Source and Target Deduplication and Recovering/Restoring Data

Another reason that you’re not going to see target deduplication go away is that advanced backup systems and their smarter-than-average users are always concerned with recovering/restoring what they backed up. (Note: Every time I say something like this, it feels like a blinding flash of the obvious – but it really is overlooked quite often.) Backup techniques that focus on faster and better restore/recovery behavior, from differential backups to synthetic master/full backups, quite often re-introduce additional redundancy at the target, which in turn can be optimized through target-based deduplication techniques.

A Few Links to Discussions of Source and Target Deduplication

A lot has been written on this subject; a few links to some of the better articles are given below.

Network Computing: SNW and Thoughts on Deduplication

Network Computing: Source-Side Deduplication

Wikibon: Source Versus Target Based Deduplication

The Many Flavors of Deduplication

