Do you have a clearly defined Recovery Point Objective (RPO) for your data? What about a clearly defined Recovery Time Objective (RTO)?
One challenge I run into quite often is that, while most customers assume they need to protect their data in some way, they don't have clear-cut RPO and RTO requirements, nor do they have a realistic budget for deploying backup and/or other data protection solutions. This makes it difficult to choose the appropriate solution for their specific environment. Answering the above questions will help you choose a solution that is the most cost-effective and technically appropriate for your business.
But how do you answer these questions?
First, let’s discuss WHY you back up… The purpose of a backup is to guarantee your ability to restore data at some point in the future, in response to some event. The event could be inadvertent deletion, virus infection, corruption, physical device failure, fire, or natural disaster. So the key to any data protection solution is the ability to restore data if/when you decide it is necessary. This ability to restore is dependent on a variety of factors, ranging from the reliability of the backup process, to the method used to store the backups, to the media and location of the backup data itself. What I find interesting is that many customers do not focus on the ability to restore data; they merely focus on the daily pains of just getting it backed up. Restore is key! If you never intend to restore data, why would you back it up in the first place?
What is the Risk?
USA Today published an article in 2006 titled “Lost Digital Data Cost Businesses Billions,” referencing a whole host of surveys and reports showing the frequency and cost of data loss to the businesses that experience it.
Two key statistics in the article stand out.
- 69% of business people lost data due to accidental deletion, disk or system failure, viruses, fire or another disaster
- 40% lost data two or more times in the last year
Flipped around, you have at least a 40% chance of having to restore some or all of your data each year. Unfortunately, you won’t know ahead of time what portion of data will be lost. What if you can’t successfully restore that data?
This is why one of my coworkers refuses to talk to customers about “Backup Solutions”, instead calling them “Restore Solutions”, a term I have adopted as well. The key to evaluating Restore Solutions is to match your RPO and RTO requirements against the solution’s backup speed/frequency and restore speed respectively.
Recovery Point Objective (RPO)
Since RPO represents the amount of data that will be lost in the event a restore is required, the RPO can be improved by running a backup job more often. The primary limiting factor is the amount of time a backup job takes to complete. If the job takes 4 hours, then you could, at best, achieve a 4-hour RPO if you ran backup jobs back-to-back all day. If you can double the throughput of a backup, then you could get the RPO down to 2 hours. In reality, CPU, network, and disk performance of the production system can be (and usually is) affected by backup jobs, so it may not be desirable to run backups 24 hours a day. Some solutions can protect data continuously without running a scheduled job at all.
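The relationship between backup duration and the best achievable RPO can be sketched with a trivial calculation (the dataset size and throughput figures below are hypothetical, not from any specific environment):

```python
# Sketch: for scheduled backups run back-to-back, the duration of one
# backup job is the floor on the achievable RPO.
def best_achievable_rpo_hours(data_gb: float, throughput_gb_per_hour: float) -> float:
    """Best-case RPO equals the time one backup job takes to complete."""
    return data_gb / throughput_gb_per_hour

# A 1000 GB dataset at 250 GB/hour takes 4 hours per job, so even
# continuous back-to-back jobs leave a 4-hour RPO.
print(best_achievable_rpo_hours(1000, 250))  # → 4.0
# Doubling throughput halves the best achievable RPO.
print(best_achievable_rpo_hours(1000, 500))  # → 2.0
```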
Recovery Time Objective (RTO)
Since RTO represents the amount of time it takes to restore the application once a recovery operation begins, reducing the RTO means both shortening the time to begin the restore process and speeding up the restore process itself. Starting the restore process earlier requires the backup data to be located closer to the production location. Whether a tape sits in the tape library, in a vault, or at a remote location, for example, affects this time. Disk is technically closer than tape since there is no requirement to mount the tape and fast-forward it to find the data. The speed of the process itself depends on the backup/restore technology, network bandwidth, the type of media the backup was stored on, and other factors. Improving the performance of a restore job can be done one of two ways: increase network bandwidth, or decrease the amount of data that must be moved across the network for the restore.
This simple graph shows the relationship of RTO and RPO to the cost of the solution as well as the potential loss. The values here are all relative, since every environment has a unique profit situation and the myriad backup/restore options on the market cover every possible budget.
Improving RTO and/or RPO generally increases the cost of a solution. This is why you need to define the minimum RPO and RTO requirements for your data up front, and why you need to know the value of your data before you can do that. So how do you determine the value?
Start by answering two questions…
How much is the data itself worth?
If your business buys or creates copyrighted content and sells that content, then the content itself has value. Understanding the value of that data to your business will help you define how much you are willing to spend to ensure that data is protected in the event of corruption, deletion, fire, etc. This can also help determine what Recovery Point Objective you need for this data, ie: how much of the data can you lose in the event of a failure.
If the total value of your content is $1000 and you generate $1 of new content per day, it might be worth spending 10% of the total value ($100) to protect the data and achieve an RPO of 24 hours. Remember, this 10% investment is essentially an insurance policy against the 40% chance of data loss mentioned above which could involve some or all of your $1000 worth of content. Also keep in mind that you will lose up to 24 hours of the most recent data ($1 value) since your RPO is 24 hours. You could implement a more advanced solution that shortens the RPO to 1 hour or even zero, but if the additional cost of that solution is more than the value of the data it protects, it might not be worth doing. Legal, Financial, and/or Government regulations can add a cost to data loss through fines which should also be considered. If the loss of 24 hours of data opens you up to $100 in fines, then it makes sense to spend money to prevent that situation.
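The insurance-style reasoning above can be captured in a few lines. This is only a sketch of the worked example, using the same hypothetical figures ($1000 of content, a 10% protection budget, and the 40% annual loss rate cited earlier):

```python
# Sketch: frame the protection budget as insurance against expected loss
# (all figures hypothetical, taken from the worked example above).
def protection_budget(total_value: float, budget_fraction: float = 0.10) -> float:
    """Amount worth spending to protect the data."""
    return total_value * budget_fraction

def expected_annual_exposure(total_value: float, loss_probability: float) -> float:
    """Worst-case expected exposure if the loss event hits all the data."""
    return total_value * loss_probability

print(protection_budget(1000))                # → 100.0 ($100 to protect $1000)
print(expected_annual_exposure(1000, 0.40))   # → 400.0 (exposure at a 40% loss rate)
```

A $100 policy against a $400 expected exposure is an easy sell; the same arithmetic shows when a more expensive, shorter-RPO solution stops paying for itself.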
How much value does the data create per minute/hour/day?
Whether or not your data has value on its own, the ability to access it may have value. For example, if your business sells products or services through a website and a database must be online for sales transactions to occur, then an outage of that database causes loss of revenue. Understanding this will help you define a Recovery Time Objective, ie: how long is it acceptable for this database to be down in the event of a failure, and how much should you spend trying to shorten the RTO before you hit diminishing returns.
If you have a website that supports company net profits of $1000 a day, it’s pretty easy to put together an ROI for a backup solution that can restore the website back into operation quickly. In this example, every hour you save in the restore process prevents $42 of net loss. Compare the cost of improving restore times against the net loss per hour of outage. There is a crossover point which will provide a good return on your investment.
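That crossover point is easy to estimate. Here is a minimal sketch using the hypothetical $1000/day figure from the example (the extra solution cost, hours saved, and outage frequency are all assumptions you would replace with your own numbers):

```python
# Sketch of the ROI crossover for faster restores (hypothetical figures).
DAILY_NET_PROFIT = 1000.0
LOSS_PER_HOUR = DAILY_NET_PROFIT / 24  # ≈ $41.67 of net loss per hour of outage

def years_to_break_even(extra_solution_cost: float,
                        hours_saved_per_outage: float,
                        outages_per_year: float = 1.0) -> float:
    """Years until the faster restore pays back its added cost."""
    annual_savings = LOSS_PER_HOUR * hours_saved_per_outage * outages_per_year
    return extra_solution_cost / annual_savings

# Spending $500 more to shave 4 hours off each restore, with one outage
# per year, breaks even in about 3 years.
print(round(years_to_break_even(500, 4), 1))
```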
Your vendor will be happy when you give them specific RPO and RTO requirements.
Nothing derails a backup/recovery solution discussion quicker than a lack of requirements. Your vendor of choice will most likely be happy to help you define them, but it will help immensely if you have some idea of your own before discussions start. There are many different data protection solutions on the market, and each has its own unique characteristics that can provide a range of RPOs and RTOs as well as fit different budgets. Several vendors, including EMC, have multiple solutions of their own — one size definitely does not fit all. Once you understand the value of your data, you can work with your vendor(s) to come up with a solution that meets your desired RPO and RTO while also keeping a close eye on the financial value of the solution.
Well, not exactly. What you really need is a restore solution!
I was discussing this with a colleague recently as we compared difficulties multiple customers are having with backups in general. My colleague was relating a discussion he had with his customer where he told them, “stop thinking about how to design a backup solution, and start thinking about how to design a restore solution!”
Most of our customers are in the same boat: they work really hard to make sure that their data is backed up within some window of time, and offsite as soon as possible, in order to ensure protection in the event of a catastrophic failure. What I've noticed in my previous positions in IT, and even more so now as a technical consultant with EMC, is that most people don't really think about how that data is going to get restored when it is needed. There are a few reasons for this:
- Backing up data is the prerequisite for a restore; IT professionals need to get backups done, regardless of whether they need to restore the data. It’s difficult to plan for theoretical needs and restore is still viewed, incorrectly, as theoretical.
- Backup throughput and duration are easily measured on a daily basis; restores occur much more rarely and are not normally reported on.
- Traditional backup has been done largely the same way for a long time and most customers follow the same model of nightly backups (weekly full, daily incremental) to disk and/or tape, shipping tape offsite to Iron Mountain or similar.
I think storage vendors, EMC and NetApp particularly, are very good at pointing out the distinction between a backup solution and a restore solution, while backup vendors are not quite as good at this. So what is the difference?
When designing a backup solution the following factors are commonly considered:
- Size of Protected Data – How much data do I have to protect with backup (usually GB or TB)
- Backup Window – How much time do I have each night to complete the backups (in hours)
- Backup Throughput – How fast can I move the data from its normal location to the backup target
- Applications – What special applications do I have to integrate with (Exchange, Oracle, VMware)
- Retention Policy – How long do I have to hang on to the backups for policy or legal purposes
- Offsite storage – How do I get the data stored at some other location in case of fire or other disaster
If you look at it from a restore perspective, you might think about the following:
- How long can I afford to be down after a failure? Recovery Time Objective (RTO): This will determine the required restore speed. If all backups are stored offsite, the time to recall a tape or copy data across the WAN affects this as well.
- How much data can I afford to lose if I have to restore? Recovery Point Objective (RPO): This will determine how often the backup must occur, and in many cases this is less than 24 hours.
- Where do I need to restore the application? This will help in determining where to send the data offsite.
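The first two questions above reduce to a simple pass/fail check of a candidate solution against your stated objectives. This is only a sketch; the solution names and numbers are hypothetical:

```python
# Sketch: a candidate solution qualifies only if it satisfies BOTH the
# RPO (backup frequency) and the RTO (restore speed) requirements.
def meets_requirements(required_rpo_hrs: float, required_rto_hrs: float,
                       backup_interval_hrs: float, restore_time_hrs: float) -> bool:
    return (backup_interval_hrs <= required_rpo_hrs and
            restore_time_hrs <= required_rto_hrs)

# Nightly backup (24h interval, 8h restore) vs. a 4h RPO / 2h RTO target:
print(meets_requirements(4, 2, 24, 8))    # → False (traditional backup falls short)
# Hourly snapshots with minutes-long reverts easily qualify:
print(meets_requirements(4, 2, 1, 0.25))  # → True
```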
Answer these questions first and you may find that a traditional backup solution is not going to fulfill your requirements. You may need to look at other technologies, like snapshots, clones, replication, CDP, etc. If a backup takes 8 hours, the restore of that data will most likely take at least 8 hours, if not closer to 16. If you are talking about a highly transactional database backing customer-facing web sites and processing millions of dollars per hour, 8 hours of downtime for a restore is going to cost you tens of millions of dollars in lost revenue.
Two of my customers have database instances hosted on EMC storage, for example, which are in the 20TB size range. They've each architected a backup solution that can get that 20TB database backed up within their backup window. The problem is, once that backup completes, they still have to offsite the backup and replicate it to their DR site across a relatively small WAN link. They both use compressed database dumps for backup because, from the DBA's perspective, dumps are the easiest type of backup to restore from, and the compression helps get 20TB of data pushed across 1Gb Ethernet connections to the backup server. One of the customers is actually backing up all of their data to DataDomain deduplication appliances already; the other is planning to deploy DataDomain. The problem in both cases is that, if you pre-compress the backup data, you break deduplication, and you get no benefit from the DataDomain appliance versus traditional disk. Turning off compression in the dump is not an option because the backup would then take longer than the backup window allows. The answer here is to step back, think about the problem you are actually trying to solve (restoring data as quickly as possible in the event of a failure) and design for that problem.
How might these customers leverage what they already have, while designing a restore solution to meet their needs?
Since they are already using EMC storage, the first step would be to start taking snapshots and/or clones of the database. These snapshots can be used for multiple purposes…
- In the event of database corruption, or other host/filesystem/application level problem, the production volume can be reverted to a snapshot in a matter of minutes regardless of the size of the database (better RTO). Snapshots can be taken many times a day to reduce the amount of data loss incurred in the event of a restore (better RPO).
- A snapshot copy of the database can be mounted to a backup server directly and backed up straight to tape or backup disk. This eliminates the requirement to perform database dumps at all, as well as any network bottleneck between the database server and backup server. Since there is no dump process, and no requirement to pre-compress the data, de-duplication (via DataDomain) can be employed most efficiently. Using a small 10Gbps private network between the backup media servers and DataDomain appliances, in conjunction with DD Boost, throughput can be 2.5X faster than with CIFS, NFS, or VTL to the same DataDomain appliance. And with de-duplication being leveraged, retention can be very long since each day's backup only adds a small amount of new data to the DataDomain.
- Now that we’ve improved local restore RTO/RPO, eliminated the backup window entirely for the database server, and decreased the amount of disk required for backup retention, we can replicate the backup to another DataDomain appliance at the DR site. Since we are taking full advantage of de-duplication now, the replication bandwidth required is greatly reduced and we can offsite the backup data in a much shorter period of time.
- Next, we give the DBAs back the ability to restore databases easily, and at will, by leveraging EMC Replication Manager. RM manages the snapshot schedules, mounting of snaps to the backup server, and initiation of backup jobs from the snapshot, all in a single GUI that storage admins and DBAs can access simultaneously.
So we leveraged the backup application they already own, the DataDomain appliances they already own, and the storage arrays they already own, built a small high-bandwidth backup network, and layered on some additional functionality to drastically improve their ability to restore critical data. The next time they have a data integrity problem that requires a restore, these customers will save literally millions of dollars due to their ability to restore in minutes vs. hours.
If RPOs of a few hours are not acceptable, then a Continuous Data Protection (CDP) solution could be added to this environment. EMC RecoverPoint CDP can journal all database activity to be used to restore to any point in time, bringing data loss (RPO) to zero or near-zero, something no amount of snapshots can provide, while keeping restore time (RTO) within minutes (like snapshots). Further, the journaled copy of the database can be stored on a different storage array, providing complete protection for the entire hardware/software stack. RecoverPoint CDP can be combined with Continuous Remote Replication (CRR) to replicate the journaled data to the DR site and provide near-zero RPO and extremely low RTO in a DR/BC scenario. Backups could be transitioned to the DR site leveraging the RecoverPoint CRR copies to reduce or eliminate the need to replicate backup data. EMC Replication Manager manages RecoverPoint jobs in the same easy-to-use GUI as snapshot and clone jobs.
There are a whole host of options available from EMC (and other storage vendors) to protect AND restore data in ways that traditional backup applications cannot match. This does not mean that backup software is not also needed, as it usually ends up being a combined solution.
The key to architecting a restore solution is to start thinking about what would happen if you had to restore data, how that impacts the business and the bottom line, and then architect a solution that addresses the business’ need to run uninterrupted, rather than a solution that is focused on getting backups done in some arbitrary daily/nightly window.
In my new role at EMC, I am one of the first people to learn of major problems that my customers experience. In general, customers seem to call their sales team before technical support when a big problem happens. In the past week, I’ve been involved in recovery efforts with two different customers, both resulting from complete power outages in their production datacenters.
Both of these customers process millions of dollars through their global customer facing websites. The smaller customer of the two does not have a disaster recovery site of any kind, while the other (larger) customer does have a recovery site, but it is not designed for 100% operation and is hundreds of miles away.
What became clear through both of these incidents is that having a very clear, very well known recovery plan is critical to the business. Interestingly, these experiences drove home the point that even if you don't have a recovery site, aren't using replication, and otherwise don't have any way to recover the data offsite, you still need a plan that encompasses what you CAN do. More often than not, major outages are short-lived and you will be recovering in your primary datacenter anyway, so you need to have a pre-determined plan to prevent major issues and shorten the time to recover.
Here are some things to think about when creating a recovery plan:
- Get the application owners together and build a list of all the applications running in your environment. Document the purpose of each application and map dependencies that each application has on other applications.
- Next, involve the server/systems admins and document the server names, database names, IP addresses, and DNS names for each application on the list.
- Finally, involve the infrastructure teams (storage, network, datacenter) and document the network dependencies (subnets, routers, VPN connections, load balancers, etc). Document any SAN storage used by the servers/applications. Also document how each infrastructure component affects others (ie: the SAN switches are required to be operational before servers can connect to storage arrays.)
- Work with business leaders to prioritize the applications. The idea is to understand how much impact each application has to the business both from a productivity perspective as well as direct financial impact. There may be legal requirements or service level agreements with customers to consider as well.
- If possible, identify the maximum amount of time each application can be down in the event of a catastrophic event (RTO – Recovery Time Objective) and how much data can be lost without significant impact to the business (RPO – Recovery Point Objective). These metrics are usually measured in minutes, hours, and days.
- Document the backup method for each server and application. How often are backups run? What is the retention period? How long does it take to complete backups? What is the expected time to restore the data? How long does it take to recall tapes from offsite storage?
- At this point you have a prioritized list of applications; now build a step-by-step recovery plan that lists the exact order in which you must recover systems. The list should include server names as well as validation points to ensure certain systems are working before moving to the next step. For example:
- Step 1: bring up the network switches and routers
- Step 2: bring up the DNS/DHCP servers
- Step 3: bring up Active Directory servers
- Step 4: bring up SAN fabric switches
- Step 5: bring up SAN storage arrays, verify health of arrays with help from vendor
- Step 6: …
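The bring-up order above is really a dependency graph, and a topological sort produces a valid sequence automatically. Here is a minimal sketch (the component names and dependencies are illustrative, not a prescription for any particular environment):

```python
# Sketch: model recovery order as a dependency graph and let a
# topological sort produce a valid bring-up sequence, where every
# component's prerequisites come before it.
from graphlib import TopologicalSorter  # Python 3.9+

# Maps each component to the set of components it depends on.
dependencies = {
    "DNS/DHCP":         {"network"},
    "Active Directory": {"DNS/DHCP"},
    "SAN fabric":       {"network"},
    "storage arrays":   {"SAN fabric"},
    "database":         {"storage arrays", "Active Directory"},
    "web application":  {"database"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # "network" comes first; "web application" comes last.
```

Even if you never automate the sequence, drawing the graph this way exposes circular dependencies and forgotten prerequisites before an outage does.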
I recommend that one of the first steps before starting recovery is to contact your key vendors (storage array vendors at least) to notify them of your outage so they can get support resources ready to troubleshoot any hardware issues you may run into during the recovery.
- Identify key players needed in a recovery, at least primary and secondary contacts for every application and vendor contacts for hardware/software, facilities, UPS/Generator support teams, etc.
- Establish a standard communication plan to include at least the following…
- A method to notify employees of an outage and give instructions
- A method to notify key players for recovery
- A mechanism for key players to communicate with each other during the recovery
- Personal (not corporate/business) contact information for all of the key players
The key thing to remember here is that you cannot rely on any communication tools that are part of your infrastructure. You must assume your PBX/VOIP system will be down, Email will be down, corporate instant messenger will be down, Sharepoint will be unavailable, etc.
- If you have a remote recovery site, with or without replication technology, and intend to use the remote site to recover production applications in the event of a large failure, be sure to document the triggers for moving to the recovery site. As an example, you may want to attempt recovery in the primary site, and then move to the recovery site if recovery at the primary site will take too long — be sure to document that time and get executive buyoff. You should not hear “how long do we wait until we move to the DR site?” during an active recovery operation. That decision needs to be made during the planning exercise.
- Document the entire plan and store the digital copies in a readily accessible place (file shares, Sharepoint site, etc). Keep additional copies on USB sticks or CDs stored in a safe place. Keep even MORE copies in another location outside the primary datacenter facility (ie: safe deposit box, remote office safe, etc). Print copies as well and store the printed copies in similar safe places. Assume that a building may not be accessible due to fire or flood. I know one customer who issues fingerprint-secured USB sticks to every manager. Each manager must sync their USB stick to a server at least monthly or upper management is notified.
- Make sure that everyone is aware of the recovery plan, who has access to the plan, where the copies are stored, and what role each of the key players is expected to play during a recovery.
There is far more to think about but hopefully you can get a good start with what I’ve listed above. If you have a recovery plan already, you should review it regularly and think about anything that needs to be added or modified in the plan.
If you are trying to get approval for a remote recovery site and replication technology and are having trouble getting executive approval, going through this exercise and defining application priority with RPO/RTO for each application could give you the ammo you need. Traditional backup architectures aren't designed for RPOs under 24 hours, while storage-array-based replication can get RPOs down into the minutes; restoring from tape also takes far longer than restoring from replicated data.
Last but not least, keep the plan updated as your environment changes: add new application and server details to the plan as part of the implementation process for new applications, or as part of change control procedures for significant changes to the infrastructure.