Overview of File- and Block-Level Deduplication for Backup
File-level deduplication operates on whole files, eliminating duplicate copies of identical files; block-level deduplication operates on blocks within files (either fixed-size or variable-size blocks), eliminating duplicate blocks.
The advantage of file-level deduplication is that it requires fewer resources and thus may be deployed over larger amounts of physical storage; the disadvantage is that it can’t eliminate redundant “chunks” of data smaller than a whole file. The advantage of block-level deduplication is the reverse: it can eliminate chunks of data smaller than a file, but it requires more resources and therefore can’t be deployed over as much physical storage.
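To make the file-level case concrete, here is a minimal Python sketch (the function names are my own, and a real backup product would add a persistent index, collision handling, and so on). Identical files hash to the same digest, so their content is stored only once:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Hash a file's entire contents; byte-identical files share a digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_files(paths):
    """Map each path to its digest; keep content only for unseen digests."""
    store = {}   # digest -> canonical path whose content we actually keep
    index = {}   # path -> digest (enough to restore every file)
    for p in paths:
        d = file_digest(p)
        index[p] = d
        store.setdefault(d, p)
    return index, store
```

Note that the whole unit of comparison is the file: change one byte in a 10 GB file and the entire file is stored again, which is exactly the limitation described above.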
A Few Caveats Regarding File- Versus Block-Level Deduplication
Now, to be fair, when I say that file-level deduplication requires fewer resources and thus may be deployed over larger amounts of physical storage, that only holds if you treat time as a constant. What do I mean by that? I mean that if you have unlimited time and resources (CPUs and physical memory), then block-level deduplication with sufficiently small, flexible blocks will always outperform file-level deduplication. The trouble, of course, is that there are relatively few environments with unlimited time or resources! So it’s always a trade-off.
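The resource trade-off is easy to see in a fixed-size block sketch (again, hypothetical names; a toy in-memory version of what a real system keeps in an indexed chunk store). Instead of one digest per file, you now track one digest per block, so the index grows in proportion to storage divided by block size, and smaller blocks mean more CPU and memory per terabyte:

```python
import hashlib

BLOCK_SIZE = 4096  # smaller blocks find more duplicates but cost more index

def dedupe_blocks(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks; store each unique block once and
    return the 'recipe' of digests needed to reassemble the original."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        d = hashlib.sha256(block).hexdigest()
        store.setdefault(d, block)
        recipe.append(d)
    return recipe

def reassemble(recipe: list, store: dict) -> bytes:
    """Rebuild the original bytes from the recipe and the block store."""
    return b"".join(store[d] for d in recipe)
```

Four blocks of which three are identical cost only two stored blocks, but the system must now keep (and look up) a digest for every block ever seen, which is precisely the extra resource burden the paragraph above describes.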
There are many, many technical and algorithmic approaches to defining the “chunks” that are to be deduplicated: file, subfile, chunk, block, byte, and bit are all used to describe various approaches. Algorithmically, you want to determine what HAS NOT changed as quickly as possible, so you can move on to deduplicating what has changed as efficiently as possible.
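Variable-size (content-defined) chunking is the classic answer to the “find what hasn’t changed quickly” problem. The toy sketch below uses a gear-table rolling hash in the spirit of Rabin-fingerprint/FastCDC-style chunkers (the parameters and names here are illustrative, not any product’s actual algorithm): a chunk boundary is declared wherever the hash matches a bit pattern, so inserting a byte shifts boundaries only locally, and all chunks before and well after the edit keep their old digests:

```python
import random

random.seed(0)  # fixed seed so the gear table (and thus boundaries) is stable
GEAR = [random.getrandbits(32) for _ in range(256)]
MASK = (1 << 12) - 1  # boundary when low 12 hash bits are zero: ~4 KiB chunks

def cdc_chunks(data: bytes, min_size=512, max_size=16384) -> list:
    """Cut data at content-defined boundaries; min/max sizes bound the
    chunk-size distribution so pathological data can't defeat the scheme."""
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend on content rather than offsets, a one-byte insertion leaves the earlier chunks byte-identical; a fixed-size chunker would instead shift every subsequent block and re-store all of them.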