Deduplication is the backbone of modern continuity and data protection systems. It's also one of the hardest features to evaluate in a typical 30-day evaluation, because there are so many techniques vendors can use to mask poor deduplication approaches. In this post, we're going to discuss the topics and questions you want to make sure you ask and get answered when considering Veeam:

Ask about global deduplication versus per-job deduplication. Deduplication removes duplicate data, and the probability of duplicate data typically increases as the amount of data and the number of systems being protected grow. You therefore want a deduplication method that effortlessly and seamlessly accommodates that growth. Global deduplication means deduplicating across all of the data you're protecting; per-job deduplication means deduplicating only within a single backup job at a time. The figure above depicts the difference between the two approaches. Suppose you're running ten 1TB backup jobs and your deduplication ratio is 10:1 (a 90% reduction). On just your first full you'd see a huge difference between the two approaches. Per-job deduplication yields a 90% reduction within each job, so your footprint would be 100GB per backup, or 1TB total. With global deduplication, your footprint is 100GB – an order of magnitude better. And note that this example is actually biased toward per-job deduplication: in practice the two approaches won't achieve the same ratio, because global deduplication sees far more data and typically deduplicates better. This effect also compounds over time.
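
To make the arithmetic concrete, here is a minimal sketch of the example above (the ten 1TB jobs, the uniform 10:1 ratio, and the assumption that the jobs contain largely the same data all come from the example; real-world ratios vary by workload):

```python
# First-full storage footprint: per-job vs. global deduplication.
# Assumptions from the example above: ten 1 TB backup jobs that contain largely
# the same data, and a uniform 10:1 (90%) deduplication ratio.

JOBS = 10
JOB_SIZE_TB = 1.0
DEDUP_RATIO = 10  # 10:1, i.e. a 90% reduction

# Per-job deduplication: each job is deduplicated only against itself,
# so every job lands on disk at roughly 1/10 of its size.
per_job_footprint_tb = JOBS * (JOB_SIZE_TB / DEDUP_RATIO)   # 10 * 0.1 = 1.0 TB

# Global deduplication: all jobs share one deduplication store, so data that is
# duplicated across jobs is stored only once, roughly one job's worth of
# unique data, itself reduced 10:1.
global_footprint_tb = JOB_SIZE_TB / DEDUP_RATIO             # 0.1 TB (100 GB)

print(f"Per-job deduplication footprint: {per_job_footprint_tb:.1f} TB")
print(f"Global deduplication footprint:  {global_footprint_tb:.1f} TB")
```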

Ask about deduplication for incremental backups. This falls under reading the fine print. If a vendor states that they offer deduplication, a common assumption is that it applies across all backup types. However, some vendors do not deduplicate incremental backups. With common backup strategies such as incremental forever, this can mean that no data beyond the initial full backup is deduplicated.
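
As a rough illustration of why this matters with an incremental-forever strategy, here is a sketch; the 10TB protected set, 2% daily change rate, 10:1 ratio, and 30-day window are illustrative assumptions, not figures from this post:

```python
# Storage consumed after 30 days of incremental-forever backups, comparing a
# product that deduplicates incrementals with one that stores them raw.
# Illustrative assumptions: 10 TB protected, 10:1 dedup ratio, 2% daily change.

FULL_TB = 10.0
DEDUP_RATIO = 10
DAILY_CHANGE_TB = FULL_TB * 0.02   # 0.2 TB of new/changed data per day
DAYS = 30

first_full_tb = FULL_TB / DEDUP_RATIO                                # 1.0 TB either way

with_incremental_dedup = first_full_tb + DAYS * (DAILY_CHANGE_TB / DEDUP_RATIO)
without_incremental_dedup = first_full_tb + DAYS * DAILY_CHANGE_TB

print(f"Incrementals deduplicated:     {with_incremental_dedup:.1f} TB")    # 1.6 TB
print(f"Incrementals not deduplicated: {without_incremental_dedup:.1f} TB") # 7.0 TB
```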

Ask about deduplication across fulls and synthetics. To add insult to injury regarding global versus per-job deduplication, some products that do per-job deduplication also don't deduplicate between fulls (or, if using a "forever" policy that takes one full and then synthetics afterwards, between the first full and all subsequent synthetics). Thus if your first full is 10TB and your second full (or synthetic) is 10TB with a 10% change, you will see roughly 2TB of storage used, again assuming a 10:1 ratio. On a system that deduplicates across fulls and synthetics, you will see about 1.1TB of storage used. And of course this compounds over time as well.
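
The same numbers as a short sketch, again assuming the 10:1 ratio used throughout this post:

```python
# Two 10 TB fulls (or a full plus a synthetic) with 10% change between them,
# assuming a 10:1 deduplication ratio as in the example above.

FULL_TB = 10.0
DEDUP_RATIO = 10
CHANGE_RATE = 0.10

# No deduplication across fulls: each full deduplicates only within itself.
no_cross_full_tb = 2 * (FULL_TB / DEDUP_RATIO)                                    # 2.0 TB

# Deduplication across fulls/synthetics: only the 10% of changed data adds
# new (deduplicated) blocks on top of the first full.
cross_full_tb = (FULL_TB / DEDUP_RATIO) + (FULL_TB * CHANGE_RATE) / DEDUP_RATIO   # 1.1 TB

print(f"No cross-full deduplication: {no_cross_full_tb:.1f} TB")
print(f"Cross-full deduplication:    {cross_full_tb:.1f} TB")
```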

Ask about deduplication and replication. Vendors with per-job deduplication often need to re-deduplicate data prior to replicating it to the cloud or to another system. Poor local deduplication leads to missed RPOs (Recovery Point Objectives), both because the amount of duplicate data being sent overwhelms WAN bandwidth and because local re-deduplication consumes time, processor, memory, and I/O resources. Ask hard questions about precisely how deduplication and replication interact, and ask specifically whether data is being "more deduplicated" prior to replication; if so, it will tend to consume far more backup system resources and require an extra, complex step in the backup software before replication can occur.
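
To see how duplicate data translates into missed RPOs, here is a back-of-the-envelope sketch; the 1TB of daily changed data, the 100 Mbps WAN link, and the 10:1 ratio are illustrative assumptions, not figures from this post:

```python
# Back-of-the-envelope: hours needed to replicate one day's worth of backup data
# over a WAN, with and without effective deduplication before transfer.
# Illustrative assumptions: 1 TB of daily changed data, a 100 Mbps WAN link,
# and a 10:1 deduplication ratio.

DAILY_CHANGE_TB = 1.0
WAN_MBPS = 100        # megabits per second
DEDUP_RATIO = 10

def replication_hours(data_tb: float, link_mbps: float) -> float:
    """Hours to push data_tb terabytes over a link_mbps megabit/second link."""
    bits = data_tb * 1e12 * 8
    return bits / (link_mbps * 1e6) / 3600

raw = replication_hours(DAILY_CHANGE_TB, WAN_MBPS)                    # ~22 hours
deduped = replication_hours(DAILY_CHANGE_TB / DEDUP_RATIO, WAN_MBPS)  # ~2.2 hours

print(f"Without deduplication before replication: {raw:.1f} hours")
print(f"With 10:1 deduplication before replication: {deduped:.1f} hours")
```

Under these assumptions, the undeduplicated transfer alone consumes nearly an entire day against a daily RPO, leaving no headroom for the backup itself or any re-deduplication step.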

Ask about the impact of third-party deduplication on not only price but also TCO and continuity. When a vendor has poor deduplication, they will often point to third-party deduplication appliances as the fix. Ask about not only the increased purchase price but also the performance and deduplication drawbacks of these solutions. These drawbacks typically stem from segregating backup from deduplication – put simply, you can deduplicate data better if you understand the data. The increased TCO of managing multiple vendors also needs to be understood. Finally, understand that modern backup is about integrating your continuity features (instant recovery, archiving, replication) with your deduplication architecture, so that segregated backup and deduplication don't leave you with worse-than-advertised RPOs (Recovery Point Objectives) and RTOs (Recovery Time Objectives).

For a more comprehensive view of deduplication and deduplication vendors in general, please see:

  • Deduplication and Continuity. This paper focuses on Unitrends and its adaptive inline deduplication feature, first available in release 8.2. Deduplication is also discussed within the context of continuity – since advanced deduplication can adversely affect how quickly recovery and disaster recovery can be performed – and we discuss the techniques Unitrends developed to overcome that issue.

What are you seeing regarding deduplication – past, present, and future?  As always, we’d love to hear from you.

 


Comments

  1. What I don’t see mentioned here are the effects of deduplication on restore times. Restoring a fully deduplicated backup, especially one that uses forward referencing, will thrash the backend storage as it tries to piece together all of the bits from all of the backups. The last thing you want when trying to restore a large system is a bunch of random read I/O from your backup store.

    1. By the way – comments do not format well at all! So if you have trouble reading the comment below, I recommend going to that link, where we have a white paper on this.

  2. Kevin – seriously great point. If you take a look at the document linked in the text – “Deduplication and Continuity,” at https://www.unitrends.com/docs/papers/deduplication-and-continuity – I think we hit on exactly what you’re calling out. It took us two years and a bunch of really good software developers to get inline deduplication rehydrating fast. For Unitrends, which supports not only replication but also local archive, this was particularly important. Below is the text from that paper that talks about recovery speeds.

    Prior to Unitrends Release 8.2, we used a landing zone architecture for just this reason. After two years and millions of dollars of work, we were able to eliminate the landing zone and still have excellent backup and recovery speeds. Our backup ingest (how fast we can read the backup) is equal to competitors who do “fake backup” and is 200% or more faster than many of our competitors. Our restore speeds are up to an order of magnitude faster than some of our competitors who show better than average deduplication efficacy (note: but still lower deduplication efficacy than Unitrends.) How did we do it?

    • Deduplication-optimized backups. By ensuring backups ingest data on known deduplication-friendly boundaries, our backups contain far less fragmentation than many backup systems. This is more difficult than it sounds because the applications, operating systems, hypervisors, and other assets that we protect have natural boundaries, and we must ensure that our ingest algorithms and deduplication algorithms work together on a protected-data basis—in other words, we must be content aware.

    • Optimized buffer management. General performance optimization of restore code paths and the optimization of application and filesystem buffers, including an intelligent algorithm to know when performance can be improved by not caching data in buffers. This results in faster recoveries due to faster rehydration.

    • Anticipatory pre-fetch. We use anticipatory pre-fetch with respect to read-ahead buffering of the single-instance deduplication store to perform recoveries faster due to faster rehydration.

    • Parallelization. We use a fine-grained multiprocessing architecture to optimize both deduplication and rehydration to increase the performance of backup ingest and recovery rehydration.

    • IOPS optimization. We reduce the number of IOPS (I/O Operations Per Second) required of our backup storage through a number of IOPS optimization techniques; this in turn means that our backup ingest and recovery rehydration is much faster.

    Thanks tremendously for taking the time to comment on the post – please let me know if there are any other questions, issues, or comments you have. Thanks!

  3. You mentioned content awareness. If you have the time, and without giving away the keys to the kingdom, can you explain a bit more? In my experience with some products, Veeam in particular, there is little difference in the treatment of the data. If there is a difference in the way that software treats my data which introduces errors or the possibility of a restore failure, I’d like to know about it. Disclosure – I don’t use Veeam or AWS right now. Just trying to stay current.
