Deduplication is the backbone of modern continuity and data protection systems.  It’s also one of the hardest features to assess in a typical 30-day evaluation, because there are so many techniques that vendors can use to mask poor deduplication approaches.  In this post, we’re going to discuss the topics and questions you want to make sure you ask, and get answered, when considering Veeam:

Ask about global deduplication versus per-job deduplication.  Deduplication removes duplicate data.  The probability of duplicate data typically increases as the amount of data and the number of systems being protected increase.  Thus you want to make sure that your deduplication method effortlessly and seamlessly accommodates growth in the data and systems that you’re protecting.  Global deduplication means that you’re deduplicating across all of the data that you’re protecting; per-job deduplication means that you’re deduplicating only within a single backup job at a time.  The figure above depicts the difference between the two approaches.  If you’re doing ten 1TB backups and your deduplication ratio is 10:1 (90%), then you’d expect to see a huge difference between the two approaches on just your first full.  Per-job deduplication yields 90% on each backup, so your footprint would be 100GB per backup, or 1TB total.  With global deduplication, your footprint is 100GB, an order of magnitude better.  And note that this example is actually biased toward per-job deduplication: in practice, global deduplication will typically yield a higher deduplication ratio, not the same one.  Note also that this effect compounds over time.
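To make the difference in scope concrete, here is a minimal sketch in Python.  It is a toy model, not how any particular product works: the `stored_chunks` function, the fixed-size "chunks", and the use of SHA-256 as a fingerprint are all assumptions made for illustration.  The only thing that changes between the two runs is whether the fingerprint index is kept across jobs (global) or thrown away at the start of each job (per-job).

```python
import hashlib

def stored_chunks(jobs, global_scope):
    """Count how many chunks must actually be stored for a list of backup jobs.

    Hypothetical helper: each job is a list of byte strings ("chunks").
    """
    index = set()   # fingerprints of chunks already stored
    stored = 0
    for job in jobs:
        if not global_scope:
            index = set()  # per-job: the index forgets what earlier jobs stored
        for chunk in job:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in index:
                index.add(digest)
                stored += 1
    return stored

# Ten made-up jobs of 1,000 chunks each; every job holds the same 100 distinct chunks,
# so each job deduplicates 10:1 on its own.
distinct = [f"block-{i}".encode() for i in range(100)]
jobs = [distinct * 10 for _ in range(10)]

per_job_total = stored_chunks(jobs, global_scope=False)
global_total = stored_chunks(jobs, global_scope=True)
print(per_job_total, global_total)  # 1000 100
```

The toy data is chosen to mirror the example above: each job reduces 10:1 on its own, but only the global index notices that the ten jobs contain the same chunks, so it stores a tenth of what the per-job approach does.  Real data will not be this uniform.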

Ask about deduplication for incremental backups.  This falls under reading the fine print.  If a vendor states that they offer deduplication, a common assumption is that it applies across all backup types.  However, there are vendors that do not deduplicate incremental backups.  With increasingly common backup strategies such as incremental forever, this can mean that no data beyond the initial backup is deduplicated.

Ask about deduplication across fulls and synthetics.  To add insult to injury regarding global versus per-job deduplication, some products that do per-job deduplication also don’t deduplicate between any of their fulls (or, if they use a “forever” policy that does one full and then only synthetics afterwards, there is no deduplication between the first full and the synthetics that follow).  Thus, assuming the same 10:1 ratio within each backup, if your first full is 10TB and your second full (or synthetic) is 10TB with 10% change, you will see roughly 2TB of storage used.  On a system in which deduplication occurs across fulls and synthetics, you will see 1.1TB of storage used, because only the changed 10% has to be stored again.  And of course this compounds over time as well.
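The arithmetic behind those figures, and how it compounds, can be sketched in a few lines of Python.  This is a back-of-the-envelope model under stated assumptions, not a sizing tool: the `footprint_tb` function is hypothetical, and it assumes a 10:1 ratio within each backup, that the same ratio applies to changed data, and that every later full or synthetic has the same size and change rate.

```python
def footprint_tb(full_tb, change_rate, ratio, num_backups, cross_full):
    """Estimate stored TB after num_backups fulls/synthetics (illustrative model)."""
    first = full_tb / ratio  # the first full is always stored, deduplicated within itself
    if cross_full:
        # Deduplicating across fulls: only the changed data is stored again.
        later = full_tb * change_rate / ratio
    else:
        # No deduplication between fulls: each one is stored as if it were new.
        later = full_tb / ratio
    return round(first + (num_backups - 1) * later, 2)

for n in (2, 12):
    without = footprint_tb(10, 0.10, 10, n, cross_full=False)
    with_cross = footprint_tb(10, 0.10, 10, n, cross_full=True)
    print(f"{n} backups: {without} TB without cross-full dedup, {with_cross} TB with")
```

With two backups the model gives about 2TB versus 1.1TB, in line with the example above; with twelve backups the gap widens to about 12TB versus 2.1TB, which is what compounding over time looks like under these assumptions.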

Ask about deduplication and replication.  Vendors with per-job deduplication often need to re-deduplicate prior to replication to the cloud or another system.  The reason is that poor local deduplication leads to missed RPOs (Recovery Point Objectives): the amount of duplicate data sent overwhelms WAN bandwidth, and local re-deduplication consumes time, processor, memory, and I/O resources.  Ask hard questions about precisely how deduplication and replication interact, and ask specifically whether data is being “more deduplicated” prior to replication; if so, replication will tend to use far more backup system resources and will require an extra, complex step in the backup software.

Ask about the impact of third-party deduplication on not only price but also TCO and continuity.  When a vendor has poor deduplication, they will often point to third-party deduplication as the way to solve it.  You want to ask about not only the increased purchase price but also the performance and deduplication drawbacks of these types of solutions.  These drawbacks typically arise because backup and deduplication are handled by separate policies; basically, you can deduplicate data better if you understand the data.  The increased TCO of managing multiple vendors is, of course, another issue that needs to be understood.  And finally, understand that modern backup is about integrating your continuity (instant recovery, archiving, replication) with your deduplication architecture, so that you’re not seeing worse-than-advertised RPOs (Recovery Point Objectives) and RTOs (Recovery Time Objectives) because backup and deduplication are segregated.

For a more comprehensive view of deduplication and deduplication vendors in general, please see

What are you seeing regarding deduplication – past, present, and future?  As always, we’d love to hear from you.

 
