Backup at Scale – Part 2 – MapReduce for Backup

Stretch the analogy

Mentioning MapReduce in connection with backup will probably get lots of funky agile programmers rolling their eyes at me, but hey! I am a simple guy and saw an analogy that might work… let’s see.

So previously we found that in order to back up at scale we need to automate the living daylights out of our backup processes. This can be done by using off-the-shelf products like EMC Avamar integrated with vCloud Director, or by building a bespoke backup environment yourself (this is really only for a few huge Google-scale environments).

Distribute your load

So onwards with the shoehorned MapReduce analogy;  my simple minded view of the MapReduce process is as follows:

[Figure: My Simple View]

In order to back up at scale we really need to do the same type of distribution of the workload and then collection of the results. So a backup system built around the MapReduce architecture would exhibit this type of workflow:

[Figure: Backup MapReduce]
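To make the analogy a bit more concrete, here is a minimal Python sketch of that workflow. It is only an illustration: the host names, the `run_native_backup` function and the sizes it returns are all made up, and a real system would call the application's own backup tool at that point.

```python
from concurrent.futures import ThreadPoolExecutor


def run_native_backup(host):
    """Map step: pretend to run the application's own backup on one host.

    Hypothetical stand-in: it returns a small result record instead of
    moving any real data.
    """
    size_gb = len(host) * 10  # fake, deterministic size for the example
    return {"host": host, "status": "OK", "size_gb": size_gb}


def summarise(results):
    """Reduce step: fold the per-host results into one summary."""
    summary = {"hosts": 0, "failed": [], "total_gb": 0}
    for result in results:
        summary["hosts"] += 1
        summary["total_gb"] += result["size_gb"]
        if result["status"] != "OK":
            summary["failed"].append(result["host"])
    return summary


def backup_all(hosts, backup_fn=run_native_backup, workers=4):
    """Distribute the backups across workers, then collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(backup_fn, hosts))
    return summarise(results)


if __name__ == "__main__":
    print(backup_all(["db01", "fs02", "vm003"]))
```

The point of the split is that `summarise` only ever sees small result records, never the backup data itself.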

In traditional backup architectures you would have to roll out backup clients to all these application or file servers in order to get them to do a backup.  This locks the backup servers into doing a load of IO and encapsulating all the backups into a proprietary backup format which is not massively scalable (big, but not huge). 

However, the more modern, scalable approach is to integrate with the backup function supplied by the application, get this function to write the data to some protection storage (a deduplication appliance, for instance) and then report back to a central catalog that the backup is done. This way you can more easily scale your backup catalog because that server isn’t bogged down with the workload of actually moving the data around. So schematically the architecture would look like this:

[Figure: MapReduce backup architecture]

Summary

To back up at scale, take the IO workload away from the backup server and distribute it throughout the enterprise, using the resources on the application servers. Send the backups directly to the protection storage in the application-native format to make recoveries simple. Create a central backup authority for maintaining a backup catalog, enforcing the backup policies, collecting alerts and providing operational and chargeback reports.
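The central backup authority could start life as something as small as the sketch below. This is a toy under stated assumptions, not how any product does it: the `BackupCatalog` class, its method names and the 24-hour policy window are all hypothetical. Note that it only receives reports and never touches the backup data.

```python
from datetime import datetime, timedelta


class BackupCatalog:
    """Hypothetical central catalog: stores reports, moves no data."""

    def __init__(self, max_age=timedelta(hours=24)):
        self.max_age = max_age   # assumed policy: one good backup per day
        self.last_success = {}   # application name -> time of last success

    def report(self, app, finished_at, ok=True):
        """Called by an application after its native backup finishes."""
        if not ok:
            return  # failed runs never count towards the policy
        previous = self.last_success.get(app)
        if previous is None or finished_at > previous:
            self.last_success[app] = finished_at

    def policy_violations(self, known_apps, now):
        """Return apps with no successful backup inside the policy window."""
        late = []
        for app in known_apps:
            last = self.last_success.get(app)
            if last is None or now - last > self.max_age:
                late.append(app)
        return sorted(late)


if __name__ == "__main__":
    now = datetime(2012, 5, 1, 12, 0)
    catalog = BackupCatalog()
    catalog.report("oracle", now - timedelta(hours=2))
    catalog.report("exchange", now - timedelta(hours=30))
    print(catalog.policy_violations(["oracle", "exchange", "nas"], now))
```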

Summary of the two articles on how to back up at scale – Automate and Distribute, simples…

And if this looks a bit like the EMC data protection vision it is completely coincidental! … honest 😉


Backup at Scale – Part 1 – Linear is badness

In a few recent technologies we see that, by design, performance grows linearly as building blocks are added. In clustered systems a building block will include CPU, memory and disk, resulting in linear growth of compute performance and capacity. In the backup world linear just doesn’t cut the mustard.

Who cuts mustard anyway?

Don’t get sidetracked with silly questions like that, use Google!  What I am trying to say is that for backup systems there is a requirement for the “work done to achieve backups” to grow significantly slower than the growth of data to protect.

Imagine a world where 1TB of protected data requires 10% of a building block of “work done”.  Where “work done” is a combination of admin time, compute, backup storage etc.  If our backup processes and technologies required a linear growth of work done then much badness occurs.  Diagrammatically…


No one would ever get to the situation described in the diagram above as they would soon realise that “this just ain’t workin’” and rethink their systems.  However the question is what should the “work done” growth look like?  It needs to be a shallower growth curve than that of the data protected and needs to slow as the capacities increase.  So we can imagine that we would want to achieve something like this:

[Figure: slow growth]
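To put some rough numbers on the two curves, here is a small Python sketch. The 10% of a building block per TB comes from the example above; the square-root shape of the slower curve is purely my assumption, picked because it flattens as capacity grows, and is not a measurement of any real system.

```python
BLOCKS_PER_TB = 0.10  # from the example above: 1 TB needs 10% of a building block


def linear_work(tb):
    """Work done grows in step with the data protected."""
    return BLOCKS_PER_TB * tb


def sublinear_work(tb, exponent=0.5):
    """Work done grows more slowly than the data protected.

    The power-law shape and the exponent are assumptions for illustration.
    """
    return BLOCKS_PER_TB * tb ** exponent


if __name__ == "__main__":
    for tb in (1, 100, 10_000):
        print(tb, linear_work(tb), sublinear_work(tb))
```

Both curves start at the same point for 1 TB, but the gap between them widens quickly as the protected capacity grows, which is the whole argument for not settling for linear.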

But how… How… HOW!?!

A number of methodologies can be employed to work towards this goal.  The first and most obvious step is to A-U-T-O-M-A-T-E (sounds better if you say it in a robotty way).

Phase 1 – Take the drudge processes (and believe me there are plenty) and automate them:

  1. Checking backup logs for failures
  2. Restarting backups that have failed
  3. Generating reports
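The three Phase 1 jobs can be sketched in a few lines of Python. Everything here is an assumption for illustration: the log format of `<job> <STATUS>` per line is made up, and `restart` is a hypothetical hook you would wire into whatever your backup product offers for re-running a job.

```python
def latest_status(log_lines):
    """Map each job name to its most recent status.

    Assumes a made-up log format of "<job> <STATUS>" per line;
    lines that do not match are ignored.
    """
    latest = {}
    for line in log_lines:
        parts = line.split()
        if len(parts) == 2:
            job, status = parts
            latest[job] = status.upper()
    return latest


def check_and_restart(log_lines, restart):
    """Find failed jobs, restart each one, and return a small report."""
    latest = latest_status(log_lines)
    failed = sorted(job for job, status in latest.items() if status == "FAILED")
    for job in failed:
        restart(job)  # hypothetical hook into the backup product
    return {"jobs": len(latest), "failed": failed}


if __name__ == "__main__":
    log = ["db01 OK", "fs02 FAILED", "vm03 FAILED", "vm03 OK"]
    print(check_and_restart(log, restart=lambda job: print("restarting", job)))
```

A job that failed and later succeeded (like `vm03` in the usage example) is not restarted, because only the most recent status counts.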

Phase 2 – Take some of the more difficult but boring jobs and automate them too!

  1. Restore testing
  2. New backup client requests
  3. Restore requests

If your environment is at Google scale you may want to automate crazy things like purchasing, receipt and labelling of new backup media. This is an extreme case but you get the principle: break down the tasks done in the backup process and see what you can get machines to do better and more accurately than humans.

There are plenty of people that have already done all this and many products to look at for help. Start Googling…

Is that it? – No, we will return with other methods to help back up at scale


Introducing… The Backup Storage Admin

The good old world of storage…

Typically in organisations there are two distinct roles assigned in the storage department.

1. Storage Admin – the person who provisions and supports primary storage to application or server admins

2. Backup Admin – the person who administers the backup software, tape solution and/or VTL

So where is the role Backup Storage Admin then?
Firstly we need to describe what is changing to open up the field for this new role. Traditionally all backup tasks and operations have been dragged through some kind of backup application. This consists of a number of components:

1. The Backup server – tracks and catalogs all backups, manages schedules and retention etc
2. Media or Storage node – performs the data movement of the data to be protected
3. Backup client – works out which files need backing up and informs the server.

All of these components make up the backup product. They need specialist skills and training to understand how to implement them. The application guy doesn’t know the details of how NetBackup, NetWorker or TSM work; he understands how his application needs protecting.

So what needs to change in the way backups are done to allow for this new role?

In our last post we discussed the DD Boost for RMAN feature of the EMC Data Domain. This allows the application guy to use his own backup tools to speak directly to the back end Data Domain storage. This puts the application admin in control of scheduling, retention and management of backups.

Scale this up and think about how this might develop. Imagine you had a VMware admin who wanted to manage his own backups to a central backup repository, or a NAS administrator who wanted to point his NAS device directly at the backup storage device. Repeat the same question for any number of apps and databases.

No need for dedicated backup software?
What we have described is that each application uses its own custom data protection method to send backups to a central backup storage device. This means the backup admin is no longer managing all the database and file modules of a backup product, but is instead provisioning storage to application teams to be used for backups. In fact the provisioning of the backup storage could be automated too, so that when a service is brought online the backup storage is automatically provisioned. I will avoid using the word cloud at this point 😉

Who will know what is backed up though!?
Good question! There would have to be a way to catalog all these disparate types of backups into a common format that could then be used to report and review backup success. Easier said than done you may think but not an impossibility by any means.
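As a rough idea of what "a common format" could mean, here is a Python sketch. The two input record shapes are invented for the example and are not the real output of any database or hypervisor tool; the point is only that each source gets a small adapter into one shared record, and reporting then works on the shared record.

```python
def from_database_tool(rec):
    """Adapter for a hypothetical database backup record.

    Assumed input shape: {"db": str, "completed": bool, "bytes": int}
    """
    return {
        "source": "database",
        "name": rec["db"],
        "ok": rec["completed"],
        "size_bytes": rec["bytes"],
    }


def from_vm_tool(rec):
    """Adapter for a hypothetical VM backup record.

    Assumed input shape: {"vm": str, "result": "success" or "error", "size_mb": int}
    """
    return {
        "source": "vm",
        "name": rec["vm"],
        "ok": rec["result"] == "success",
        "size_bytes": rec["size_mb"] * 1024 * 1024,
    }


def success_rate(records):
    """Fraction of catalogued backups that succeeded (0.0 if there are none)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["ok"]) / len(records)


if __name__ == "__main__":
    catalog = [
        from_database_tool({"db": "orders", "completed": True, "bytes": 2048}),
        from_vm_tool({"vm": "web01", "result": "error", "size_mb": 1}),
    ]
    print(success_rate(catalog))
```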

So in summary then, the responsibilities of a Backup Storage Admin would be:
– Manage a pot of storage specifically as a backup target
– Provision backup storage based on requests from application or server teams
– Manage the backup catalog
– Report on capacity trending
– Chargeback storage usage to the application or server teams
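The last two responsibilities are simple arithmetic, sketched here in Python. The team names, the flat rate per GB and the sample figures are all made up; a real chargeback model would likely be more involved (deduplication ratios, retention tiers and so on).

```python
def average_growth(samples_gb):
    """Average change in used capacity per sampling period.

    Expects at least two samples, oldest first.
    """
    if len(samples_gb) < 2:
        raise ValueError("need at least two capacity samples")
    return (samples_gb[-1] - samples_gb[0]) / (len(samples_gb) - 1)


def chargeback(usage_gb, rate_per_gb):
    """Cost per team at a flat (assumed) rate per GB of backup storage."""
    return {team: round(gb * rate_per_gb, 2) for team, gb in usage_gb.items()}


if __name__ == "__main__":
    print(average_growth([100, 120, 150]))
    print(chargeback({"oracle": 500, "vmware": 1200}, rate_per_gb=0.05))
```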

So surely this is all a pipe dream? Well, keep your eye on some of the stuff at EMC World 2012 and what Stephen Manley has to say here