Overview of Inline Versus Post-Processing Deduplication for Backup

Inline deduplication means that the backup device deduplicates the backup data while ingesting it, before the data is written to disk. Post-processing deduplication means that the data is first written to disk and deduplicated afterward.
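To make the distinction concrete, here is a minimal sketch of the two ingest paths. It assumes fixed-size chunks and SHA-256 chunk hashes, and the function names (inline_ingest, landing_ingest, post_process, rehydrate) are hypothetical; real products typically use variable-size chunking and far more elaborate indexes.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only (assumption)


def dedupe(data: bytes, store: dict) -> list:
    """Add unseen chunks to the store; return the backup's recipe (list of chunk hashes)."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only the first copy of a chunk is kept
        recipe.append(digest)
    return recipe


def inline_ingest(data: bytes, store: dict) -> list:
    """Inline: deduplicate during ingest, so raw data is never written."""
    return dedupe(data, store)


def landing_ingest(data: bytes, landing: list) -> None:
    """Post-processing, step 1: write the raw backup to the landing area."""
    landing.append(data)


def post_process(landing: list, store: dict) -> list:
    """Post-processing, step 2: deduplicate landed backups later, then free the space."""
    recipes = [dedupe(data, store) for data in landing]
    landing.clear()
    return recipes


def rehydrate(recipe: list, store: dict) -> bytes:
    """Rebuild a backup from its recipe; this is the restore-time overhead."""
    return b"".join(store[digest] for digest in recipe)
```

Both paths end with the same deduplicated store. The difference is when the hashing work happens and whether the raw backup ever occupies disk space.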

Advocates of each approach claim that theirs is superior. However, inline deduplication tends to decrease ingest rates against real-world data. (Be very careful about quoted ingest rates: they are often measured with tuned data that deduplicates faster than real-world data would.) Post-processing deduplication, for its part, only increases ingest rates if the “landing site” for the data is large enough to accommodate all incoming backups and the system can deduplicate them before the next backup arrives.
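The landing-site condition can be checked with simple arithmetic. The sketch below is a rough model under stated assumptions: the function name and parameters are hypothetical, the deduplication pass is assumed to run at a steady rate, and the whole interval until the next backup is assumed to be available for it.

```python
def landing_site_keeps_up(landing_capacity_gb: float,
                          backup_size_gb: float,
                          dedup_rate_gb_per_hour: float,
                          hours_until_next_backup: float) -> bool:
    """Return True if the backup fits in the landing site and can be
    deduplicated before the next backup arrives (simplified model)."""
    fits = backup_size_gb <= landing_capacity_gb
    dedup_hours = backup_size_gb / dedup_rate_gb_per_hour
    return fits and dedup_hours <= hours_until_next_backup
```

For example, with made-up numbers, a 1,500 GB backup landing on a 2,000 GB site and deduplicated at 100 GB per hour needs 15 hours, which fits inside a 24-hour backup cycle. Halve the deduplication rate and the same backup needs 30 hours, so the next backup arrives before the landing site has been cleared.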

The Impact of Inline Versus Post-Processing Deduplication on Recovering/Restoring Data

Inline deduplication will typically incur hydration/rehydration overhead when you go to recover/restore your data, since the data is stored only in deduplicated form. The impact of post-processing deduplication on hydration/rehydration depends on the post-processing strategy used.

As an example of how the post-processing deduplication strategy can change recovery/restoration efficiency (the time it takes to recover/restore your data), I’ll take one example from the way Adaptive Deduplication was implemented. In Adaptive Deduplication, the data is compressed on ingestion and stored in that form. The post-processing then uses that latest backup as the deduplication cache (also called a deduplication index). This means that recovery of the latest backup incurs no hydration/rehydration overhead.
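The restore-path consequence can be sketched as follows. This is not the actual Adaptive Deduplication implementation: the BackupStore class, its method names, the zlib compression, and the fixed-size chunking are all assumptions made for illustration, and the sketch does not model using the latest backup as the deduplication index. It only shows why a backup kept whole in compressed form restores without rehydration, while older, deduplicated backups do not.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only (assumption)


class BackupStore:
    """Hypothetical store: newest backup kept whole (compressed), older ones deduplicated."""

    def __init__(self):
        self.latest = None  # compressed bytes of the newest backup
        self.chunks = {}    # chunk hash -> chunk bytes, shared by older backups
        self.recipes = []   # one list of chunk hashes per older backup

    def ingest(self, data: bytes) -> None:
        """Compress the new backup on ingest. The backup it displaces is
        deduplicated here for brevity; a real system would do that in a later pass."""
        if self.latest is not None:
            self._post_process(zlib.decompress(self.latest))
        self.latest = zlib.compress(data)

    def _post_process(self, data: bytes) -> None:
        """Deduplicate a backup that is no longer the newest."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.recipes.append(recipe)

    def restore_latest(self) -> bytes:
        """Decompress only: no chunk lookups, so no rehydration step."""
        return zlib.decompress(self.latest)

    def restore_older(self, index: int) -> bytes:
        """Rehydrate an older backup chunk by chunk from its recipe."""
        return b"".join(self.chunks[digest] for digest in self.recipes[index])
```

Restoring the newest backup is a single decompress call, whereas restoring an older one walks its recipe and looks up every chunk.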

Of course, any acceleration of recovery of backup data comes at a price. In this case, the price is that a post-processing deduplication scheme takes more disk space than an inline deduplication scheme, since the “landing zone” for the backup must be taken into account when using post-processing deduplication. When inline deduplication is used, there is going to be an ingest-related penalty, but you don’t need the extra storage. Then again, just as you pay a price on ingest performance, you’re always going to pay the price on hydration/rehydration performance as well.
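A back-of-the-envelope comparison of the disk space involved might look like this. The function names, the uniform deduplication ratio, and every number used with them are hypothetical, chosen only to show where the landing zone enters the total.

```python
def inline_disk_gb(backup_gb: float, retained_backups: int, dedup_ratio: float) -> float:
    """Disk needed when every retained backup is stored only in deduplicated form."""
    return backup_gb * retained_backups / dedup_ratio


def post_processing_disk_gb(backup_gb: float, retained_backups: int,
                            dedup_ratio: float, landing_zone_gb: float) -> float:
    """Same deduplicated store, plus the landing zone that holds raw incoming data."""
    return inline_disk_gb(backup_gb, retained_backups, dedup_ratio) + landing_zone_gb
```

With 30 retained backups of 1,000 GB each and an assumed 10:1 deduplication ratio, the inline scheme needs about 3,000 GB. A post-processing scheme with a landing zone sized for one full backup needs about 4,000 GB for the same data.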