This is the fourth in the series “everybody lies: backup”. You can find the first three installments here, here, & here.

The fundamental problem with computer-based systems is that they lose data. The fundamental problem in backup is the linear relationship between the amount of primary storage and the amount of secondary storage needed to back it up. The holy grail of backup is to intrinsically change that relationship between primary and secondary data.

That sounded pretty fancy, right? It’s actually relatively simple. If I have 100TB of data, then to back it up I need some amount of backup storage – for backup systems, typically best-case around another 100TB. Why? Because typical compression yields about 2:1 (note: yes, there are cases where you’ll achieve much bigger numbers, just as there are cases where you’ll achieve much lower ones), so each full backup consumes about 50TB, and you’ll need room for at least two full backups. Why two? Because if you don’t have the storage space for two full backups, then you have to delete the first backup before the second one has completed. If that second backup fails after you’ve deleted the first one, then you’re no longer a backup solution – you’re a boat anchor.
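The arithmetic above can be sketched in a few lines. This is a hypothetical back-of-the-envelope calculation; the 2:1 compression and two-full-copies figures are the assumptions from the paragraph, not universal constants:

```python
def min_backup_storage_tb(primary_tb, compression_ratio=2.0, full_copies=2):
    """Best-case secondary storage: enough room for `full_copies`
    compressed full backups, so the previous backup survives until
    the next one completes."""
    return primary_tb / compression_ratio * full_copies

# 100TB of primary data at 2:1 compression, two fulls -> 100TB of backup storage
print(min_backup_storage_tb(100))  # 100.0
```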

Over the years, many technologies beyond compression have been created and marketed as solving this problem. One is deduplication. The catch is that deduplication works best when there’s a lot of redundant data for it to remove. Where does redundant data come from? Either your environment contains a whole lot of duplicate data to begin with – or you keep a lot of retention. The reason you see all of these “deduplication calculators” that assume you’re doing a few centuries of daily full backups is that those assumptions show deduplication in the best possible light.
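To see how retention inflates the ratio, here’s a toy model. All numbers are illustrative assumptions – in particular the 2% daily change rate – not measurements from any real product:

```python
def dedupe_ratio_daily_fulls(primary_tb, retention_days, daily_change_rate=0.02):
    """Deduplication ratio for daily full backups under ideal dedupe.
    Logical data: one full backup per day for the retention period.
    Stored data: one full, plus only the changed blocks for each later day."""
    logical = primary_tb * retention_days
    stored = primary_tb + primary_tb * daily_change_rate * (retention_days - 1)
    return logical / stored

print(round(dedupe_ratio_daily_fulls(100, 30), 1))   # ~19:1 with a month of retention
print(round(dedupe_ratio_daily_fulls(100, 365), 1))  # ~44:1 with a year of retention
```

The longer the retention of daily fulls you assume, the bigger the ratio gets – which is exactly why the calculators assume it.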

Too often, backup strategy isn’t discussed when talking about deduplication. It should be. Backup strategy can dramatically increase the deduplication ratio of the system – while simultaneously having no impact on the amount of secondary backup storage needed.

That doesn’t make sense, right? Or does it?

The reason this can occur is simple: if your backup strategy is a full backup every day, and you have enough backup storage to keep some retention, you’re going to see great deduplication ratios. In essence, you’ve got that deduplication subsystem working really hard. Nothing wrong with that. But you’ve also got everything else working hard – from the systems you’re backing up, to your network, to your backup server.

The best way to do deduplication is to start by generating as little redundant data as possible. Typically that means an incremental-forever policy: one full backup followed by incremental backups, with the backup system able to create a “synthetic full” from them. You still use deduplication, but it will report a lower ratio, simply because you’re feeding it less redundant data.
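A quick comparison makes the earlier point concrete. Under the same ideal-dedupe assumptions (hypothetical numbers, including the 2% daily change rate), both strategies end up storing the same amount of unique data – but the reported ratios differ wildly:

```python
def backup_stats(primary_tb, days, change_rate, strategy):
    """Return (stored_tb, dedupe_ratio) under ideal deduplication.
    The unique data stored is identical either way; only the logical
    (pre-dedupe) data fed into the system differs by strategy."""
    daily_change = primary_tb * change_rate
    stored = primary_tb + daily_change * (days - 1)
    if strategy == "daily_full":
        logical = primary_tb * days                       # a full every day
    else:                                                 # incremental forever
        logical = primary_tb + daily_change * (days - 1)  # one full, then changes
    return stored, logical / stored

for strategy in ("daily_full", "incremental_forever"):
    stored, ratio = backup_stats(100, 30, 0.02, strategy)
    print(f"{strategy}: {stored:.0f}TB stored, {ratio:.1f}:1 dedupe ratio")
```

Same 158TB on the backend either way; the daily-full strategy just reports a much more impressive number.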

The moral of this story? Deduplication is incredibly valuable – but don’t fall for unbelievable deduplication ratios that are based on doing a full backup every day.