Stretch the analogy
Mentioning MapReduce in connection with backup will probably get lots of funky agile programmers rolling their eyes at me, but hey! I am a simple guy and I saw an analogy that might work… let’s see.
Previously we found that in order to back up at scale we need to automate the living daylights out of our backup processes. This can be done by using off-the-shelf products like EMC Avamar integrated with vCloud Director, or by building a bespoke backup environment yourself (which is really only for a few huge, Google-scale environments).
Distribute your load
So onwards with the shoehorned MapReduce analogy; my simple-minded view of the MapReduce process is as follows:
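For anyone who prefers code to pictures, here is a minimal Python sketch of the general MapReduce pattern: split the work into chunks, let workers process each chunk independently (map), then collect and combine what they return (reduce). The function names `map_chunk`, `reduce_pairs` and `map_reduce` are made up for illustration, and a thread pool stands in for a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: a worker turns its own chunk into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_pairs(mapped_chunks):
    """Reduce step: collect every worker's pairs and sum the values per key."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


def map_reduce(chunks, workers=4):
    # Distribute the chunks across the workers, then gather the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_pairs(mapped)


result = map_reduce(["backup at scale", "automate backup", "distribute backup"])
```

The point to hold on to is that the coordinator never touches the raw data itself; it only hands out work and adds up the answers.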
To back up at scale we really need to do the same type of distribution of the workload and then collection of the results. So a backup system built around the MapReduce architecture would exhibit this type of workflow:
In traditional backup architectures you would have to roll out backup clients to all these application or file servers in order to get them to do a backup. This locks the backup servers into doing a load of IO and encapsulating all the backups into a proprietary backup format which is not massively scalable (big, but not huge).
However, the more modern, scalable approach is to integrate with the backup function supplied by the application, get this function to write the data to some protection storage (a deduplication appliance, for instance) and then report back to a central catalog that the backup is done. This way you can more easily scale your backup catalog, because that server isn’t bogged down with the workload of actually moving the data around. So schematically the architecture would look like this:
To back up at scale, take the IO workload away from the backup server and distribute it throughout the enterprise, using the resources on the application servers. Send the backups directly to the protection storage in the application’s native format to keep recoveries simple. Create a central backup authority for maintaining a backup catalog, enforcing the backup policies, collecting alerts and providing operational and chargeback reports.
Summary of the two articles on how to back up at scale: Automate and Distribute, simples…
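To make that split of responsibilities concrete, here is a rough Python sketch under some loud assumptions: `native_backup`, `Catalog`, `BackupReport` and the server names are all hypothetical, a plain dict stands in for the protection storage, and a thread pool stands in for the application servers. It is not how any particular product works, just the shape of the idea: the data goes straight from the application to storage, and only a small report goes to the central catalog.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BackupReport:
    server: str
    location: str
    size_bytes: int
    finished_at: datetime


def native_backup(server, payload, protection_storage):
    """Runs on the application server: writes the data directly to
    protection storage and returns only metadata about the backup."""
    location = f"{server}/latest"
    protection_storage[location] = payload  # native format, no repackaging
    return BackupReport(server, location, len(payload), datetime.now(timezone.utc))


@dataclass
class Catalog:
    """Central authority: records reports and checks policy, moves no data."""
    max_age: timedelta
    entries: dict = field(default_factory=dict)

    def record(self, report):
        self.entries[report.server] = report

    def out_of_policy(self, servers, now):
        # A server breaks policy if it has no backup or its backup is too old.
        return [s for s in servers
                if s not in self.entries
                or now - self.entries[s].finished_at > self.max_age]


def run_backups(app_data, protection_storage, catalog):
    # Distribute: every application server backs itself up in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(native_backup, server, data, protection_storage)
                   for server, data in app_data.items()]
        # Collect: the catalog only receives the small reports.
        for future in futures:
            catalog.record(future.result())


app_data = {"sql01": b"database dump", "files01": b"file share contents"}
storage = {}
catalog = Catalog(max_age=timedelta(hours=24))
run_backups(app_data, storage, catalog)
missing = catalog.out_of_policy(["sql01", "files01", "mail01"],
                                datetime.now(timezone.utc))
```

Here `missing` ends up listing `mail01`, the one server that never reported a backup, which is exactly the kind of alert the central authority is there to raise.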
And if this looks a bit like the EMC data protection vision, it is completely coincidental! … honest