[In part one of this blog post I talked about backup and global deduplication for backup appliances. In this section I’m going to offer an alternative way to think about the problem.]
Most folks who are interested in backup have heard about the various forms of deduplication – but far fewer have heard of NoSQL or sharding. In this post, I’m going to tie these concepts together.
Back when the Internet was getting going, companies like Google, YouTube, and Facebook faced a pretty significant issue. They wanted to be able to scale to millions and billions of users, but the old way of handling data – a single large SQL database – had run out of steam. So these companies implemented what are generically called NoSQL and sharding architectures. NoSQL simply means that a database doesn’t use the SQL query language, which frees the database implementation from having to support table schemas, joins, and full ACID (atomicity, consistency, isolation, and durability) guarantees – all the things people pay big bucks to Oracle and other database vendors for. Instead, NoSQL databases are designed to be much more inherently scalable than big SQL-based databases.
Sharding is a technique where the rows of a database are spread across multiple computers, rather than being split by column (i.e., all that stuff associated with SQL database normalization and vertical partitioning). A shard is simply a horizontal partition of data.
So what in the heck does this have to do with backup and global deduplication?
Real backup appliances are architecturally similar to NoSQL and sharding. Rather than having a single backup server and backup software – or even a collection of backup servers – talking to a single monolithic pool of backup storage, backup appliances that can be monitored and, more importantly, managed as one allow all aspects of the backup to be parallelized. In essence, where sharding spreads data across multiple computers for tremendously higher performance, backup appliances allow the disassociation of backup storage so that the backup software, backup server, and backup storage are all part of a single entity. By coupling the backup software, backup server, and backup storage together, you tend to get much higher performance at a far lower cost – because it’s a simpler architecture that just inherently scales better across data.
If you were to logically conflate the pools of data across real backup appliances (not mere backup storage devices, but appliances that handle all phases of backup), then you’d lose that advantage – and move to a more complex and more costly architecture.
This is a LOT more complicated than my normal modern backup post – but I thought folks might be interested in understanding some of the deeper technology underlying what we do. Please let me know your thoughts on this as well!