Backup, Compression, Encryption, Deduplication, and Replication: Solution
In the previous post on backup, compression, encryption, and deduplication I called out a great post by Stephen Foskett on this subject and promised I’d talk a little bit about what our company does with respect to the issues raised.
First, let’s talk about what Stephen did. As he notes in the referenced post, he found the following was effective for him:
rsync is a solid protocol, reducing network demands by only sending the changed blocks of a file. But, as noted, compression and encryption tools change the whole file even if only a tiny bit has been altered. A few years back, the folks behind rsync (who also happen to be the minds behind the Samba CIFS server) developed a patch for gzip which causes it to compress files in chunks rather than in their entirety. This patch, called gzip-rsyncable, hasn’t been added to the main source even after a dozen years, but yields amazing results in accelerating rsync performance.
The same technique was then applied to RSA and AES cryptography to create rsyncrypto. This open source encryption tool makes a simple tweak to the standard CBC encryption schema (reusing the initialization vector) to allow encrypted files to be sent more efficiently over rsync. In fact, it relies on gzip-rsyncable to work its magic. Of course, the resulting file is somewhat less secure, but it is probably more than enough to keep a casual snooper at bay.
This works as long as the key never changes and all parties that deduplicate against “global federated data” use the same key. Also, if the changed data crosses an encryption block boundary, more data appears changed than really has, which degrades deduplication savings.
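To make the "chunks rather than in their entirety" idea concrete, here is a minimal sketch of content-defined chunking, where chunk boundaries are picked from a rolling sum over the data itself so that they fall back into place after a small insertion. This is not the gzip-rsyncable patch or rsyncrypto code; the function names (content_chunks, fixed_chunks, shared) and the WINDOW and MASK values are assumptions chosen for illustration only.

```python
import hashlib
import random

WINDOW = 32    # bytes covered by the rolling sum (illustrative value)
MASK = 0x3FF   # cut where the low 10 bits of the sum are zero (illustrative value)


def content_chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by a rolling sum of the last WINDOW bytes."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # drop the byte that left the window
        # Cut only when the current chunk is at least WINDOW bytes long.
        if i + 1 - start >= WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> list[bytes]:
    """Split data into fixed-size chunks, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(old: list[bytes], new: list[bytes]) -> int:
    """Count distinct chunks of `old` that also appear in `new`, compared by SHA-256."""
    digests_old = {hashlib.sha256(c).digest() for c in old}
    digests_new = {hashlib.sha256(c).digest() for c in new}
    return len(digests_old & digests_new)


if __name__ == "__main__":
    rng = random.Random(0)
    old = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
    new = old[:10] + b"EDIT!" + old[10:]   # small insertion near the front
    print("fixed-size chunks shared:", shared(fixed_chunks(old), fixed_chunks(new)))
    print("content-defined chunks shared:", shared(content_chunks(old), content_chunks(new)))
```

With fixed-size chunks, the five inserted bytes shift every later boundary, so almost nothing matches. With content-defined boundaries, only the chunk or two around the edit should differ. The same reasoning explains the block-boundary caveat above: the unit that appears changed is a whole chunk or encryption block, not just the bytes that were actually edited.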
For encryption, deduplication, compression, and both private and public cloud computing (with replication as the underlying technology), Unitrends encrypts first and then compresses on the on-premise appliance, for the precise reasons Stephen outlines. An added benefit is that we don’t have to decrypt on our off-premise appliance.
Our replication technology takes advantage of encryption blocks when transferring data from the on-premise appliance to the off-premise appliance. The on-premise deduplication technology has to uncompress but can keep the data encrypted and still find blocks that can be deduplicated.
Our on-premise deduplication decompresses and decrypts the data in primary storage to calculate hashes on the raw data (incoming data is typically compressed inline on ingestion). To actually deduplicate the data, it is again decompressed and decrypted in primary storage, then encrypted before being stored on disk in deduplicated format. This allows the deduplicated data to live across key changes or differing keys. When encryption is requested, the data on disk is always encrypted.
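As a rough illustration of why hashing the raw data lets deduplication survive key changes, here is a small sketch. It is not Unitrends code. The names toy_cipher, protect, and DedupStore, and the idea of a single storage_key for data at rest, are assumptions made for the example. The "cipher" is a placeholder XOR stream built from SHA-256 and is not secure; a real system would use a vetted cipher such as AES.

```python
import hashlib
import zlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder XOR stream cipher (NOT secure). Applying it twice with the
    same key returns the original data."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def protect(raw: bytes, key: bytes) -> bytes:
    """Encrypt first, then compress, in the order described in the post."""
    return zlib.compress(toy_cipher(key, raw))


class DedupStore:
    """Hypothetical block store indexed by the hash of the raw data."""

    def __init__(self, storage_key: bytes):
        self.storage_key = storage_key   # assumed key for data at rest
        self.blocks = {}                 # SHA-256 hex of raw data -> encrypted block

    def ingest(self, payload: bytes, key: bytes) -> str:
        # Decompress and decrypt in memory so the hash covers the raw data.
        raw = toy_cipher(key, zlib.decompress(payload))
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blocks:
            # Re-encrypt before storing; the stored copy is never plaintext.
            self.blocks[digest] = toy_cipher(self.storage_key, raw)
        return digest

    def read(self, digest: str) -> bytes:
        return toy_cipher(self.storage_key, self.blocks[digest])


if __name__ == "__main__":
    store = DedupStore(b"storage-key")
    raw = b"the same backup block " * 100
    first = store.ingest(protect(raw, b"key-a"), b"key-a")
    second = store.ingest(protect(raw, b"key-b"), b"key-b")
    print(first == second, len(store.blocks))
```

Because the index is computed on the raw data, the same block arriving under two different keys maps to one stored copy. Hashing the ciphertext instead would make the two arrivals look unrelated.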