My mom used to tell me that when life hands you lemons, make lemonade. This simple idiom, of course, just means that you should take a bad situation and try to make it better.
The announcement by Data Domain of their “boost” software was a great example of EMC and Data Domain finally admitting to some of the “lemons” associated with dedicated data deduplication devices. What I find so surprising is that there are so many business executives and IT leaders who were themselves surprised by this announcement. Since I have been getting more questions about this each month, I thought it was time to write an article about it.
To get at the heart of what Data Domain’s “Boost” software does, let’s lift the description directly from the press release:
The DD Boost software library is distributed to the backup server and identifies data segments inline as they arrive. After asking the Data Domain storage system which segments are new, it compresses and forwards only the unique segments. In addition to increasing the aggregate backup throughput of the storage system, this reduces local network traffic from 80 to 99 percent, since redundant data segments do not traverse the wire only to be discarded as duplicates later. It also reduces the overall resource use on existing backup servers by 20 to 40 percent because it minimizes data copy overhead.
I’ve had calls from Data Domain customers asking me what this means, so let me be as clear as possible. Writing backup data to a separate storage device takes time and costs processor cycles. This is the fundamental weakness of target-based data deduplication storage. Yes – some folks are surprised by this because it NEVER gets talked about by the data deduplication vendors themselves – and yet it has been, and always will be, the Achilles heel (the fundamental weakness) of dedicated data deduplication devices.
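To make the mechanism in the press release concrete, here is a minimal Python sketch of that client-side deduplication handshake. It is an illustration of the general technique, not Data Domain’s actual protocol: I’ve assumed fixed-size segments and used an in-memory set to stand in for the storage system’s fingerprint index, whereas the real product uses variable-length segments and a network round-trip. All names (`segment`, `backup`, `SEGMENT_SIZE`) are hypothetical.

```python
import hashlib
import zlib

# Hypothetical fixed segment size; real deduplication appliances
# typically use variable-length, content-defined segments.
SEGMENT_SIZE = 8 * 1024

def segment(data: bytes, size: int = SEGMENT_SIZE) -> list:
    """Split a byte stream into fixed-size segments."""
    return [data[i:i + size] for i in range((0), len(data), size)]

def backup(data: bytes, server_index: set) -> tuple:
    """Client-side dedup sketch: fingerprint each segment, ask the
    'server' (here just a set) whether it already holds that
    fingerprint, and compress/send only the new segments.
    Returns (segments_total, segments_sent)."""
    segments = segment(data)
    sent = 0
    for seg in segments:
        fp = hashlib.sha256(seg).digest()
        if fp not in server_index:        # the "is this new?" round-trip
            payload = zlib.compress(seg)  # only unique segments are compressed
            server_index.add(fp)          # ...and shipped over the wire
            sent += 1
    return len(segments), sent

# A highly repetitive stream: most segments are duplicates,
# so very little data would actually cross the network.
index = set()
total, sent = backup(b"A" * SEGMENT_SIZE * 10 + b"B" * SEGMENT_SIZE, index)
# total == 11, sent == 2 (one unique "A" segment, one unique "B" segment)
```

The point of the sketch is the ratio between `total` and `sent`: when the backup stream is mostly redundant, the duplicate segments never traverse the wire, which is exactly the bandwidth saving the press release is claiming.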
To understand the spin that EMC and Data Domain put on this, let’s again take a verbatim quote from their press release announcing the “boost” software:
“There has been no significant change in roles between backup clients, backup servers and target storage in the last 20 years of traditional backup software deployment architectures, until right now,” said Brian Biles, Vice President of Product Management, EMC Backup Recovery Systems Division. “By distributing parts of the deduplication process to the backup server, everything from the backup server to the Data Domain systems gets faster and more efficient. It represents a pivotal architectural change that will underpin the next generation of disk-based data protection products. DD Boost literally thinks outside the box.”
This is a pretty strange assertion. After all, EMC acquired Avamar for its source-level deduplication – which pushes processing back onto the client being protected. But from Data Domain’s point of view, this is revolutionary stuff because they are pushing source-level deduplication to the backup server – which is the “source” from the perspective of the dedicated deduplication storage device.
[Warning…a little technical…] My first job in the mid-1980s was helping create what was then called a “functionally partitioned” computer, in which dedicated processing power was assigned to different types of I/O and different types of computation. It didn’t take much analysis to find the fundamental flaw in the architecture – most workloads didn’t partition very well under that scheme. This is why “symmetric multiprocessor” systems ended up dominating the landscape – they avoided the per-task overhead that a functionally partitioned system always imposed. [Okay…end of slightly technical stuff… :)]
This is the fundamental problem with dedicated data deduplication devices – there is always overhead in the communication between the backup processing and the storage processing. The best way to overcome this is to do what Intel and the rest of the industry figured out decades ago for general computing – create a pool of processing elements that can be scaled simply by adding more elements. In the case of backup, that means each backup process doesn’t have to go out over a relatively slow LAN (slow compared to processor, memory, and disk speeds) to store each and every bit of the backup.
That’s a “scalable backup” rather than “federated storage” approach. It’s simply superior.
One other note. While this article points out the negatives of what Data Domain is doing, they should at least be given credit for trying to lessen the pain of a fundamental architectural problem with data deduplication devices – they are ahead of the curve compared to Exagrid and the other data deduplication vendors that have not even begun to put a band-aid on this.