Do you have a clearly defined Recovery Point Objective (RPO) for your data? What about a clearly defined Recovery Time Objective (RTO)?
One challenge I run into quite often is that, while most customers assume they need to protect their data in some way, they don’t have clear-cut RPO and RTO requirements, nor do they have a realistic budget for deploying backup and/or other data protection solutions. This makes it difficult to choose the appropriate solution for their specific environment. Answering the above questions will help you choose a solution that is the most cost effective and technically appropriate for your business.
But how do you answer these questions?
First, let’s discuss WHY you back up… The purpose of a backup is to guarantee your ability to restore data at some point in the future, in response to some event. The event could be inadvertent deletion, virus infection, corruption, physical device failure, fire, or natural disaster. So the key to any data protection solution is the ability to restore data if/when you decide it is necessary. This ability to restore is dependent on a variety of factors, ranging from the reliability of the backup process, to the method used to store the backups, to the media and location of the backup data itself. What I find interesting is that many customers do not focus on the ability to restore data; they merely focus on the daily pains of just getting it backed up. Restore is key! If you never intend to restore data, why would you back it up in the first place?
What is the Risk?
USA Today published an article in 2006 titled “Lost Digital Data Cost Businesses Billions” referencing a whole host of surveys and reports showing the frequency and cost to businesses that experience data loss.
Two key statistics in the article stand out.
- 69% of business people lost data due to accidental deletion, disk or system failure, viruses, fire or another disaster
- 40% lost data two or more times in the last year
Flipped around, you have at least a 40% chance of having to restore some or all of your data each year. Unfortunately, you won’t know ahead of time what portion of data will be lost. What if you can’t successfully restore that data?
This is why one of my coworkers refuses to talk to customers about “Backup Solutions”, instead calling them “Restore Solutions”, a term I have adopted as well. The key to evaluating Restore Solutions is to match your RPO and RTO requirements against the solution’s backup speed/frequency and restore speed respectively.
Recovery Point Objective (RPO)
Since RPO represents the amount of data that will be lost in the event a restore is required, the RPO can be improved by running a backup job more often. The primary limiting factor is the amount of time a backup job takes to complete. If the job takes 4 hours then you could, at best, achieve a 4-hour RPO if you ran backup jobs all day. If you can double the throughput of a backup, then you could get the RPO down to 2 hours. In reality, CPU, network, and disk performance of the production system can be (and usually is) affected by backup jobs, so it may not be desirable to run backups 24 hours a day. Some solutions can protect data continuously without running a scheduled job at all.
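The arithmetic behind that 4-hour and 2-hour example can be sketched in a few lines of Python. The function name and the 400 GB / 100 GB-per-hour figures are assumptions chosen only so that one job takes 4 hours; they are not measurements from any real environment.

```python
def best_case_rpo_hours(data_gb: float, throughput_gb_per_hour: float) -> float:
    """Best-case RPO when backup jobs run back to back: the length of one job."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return data_gb / throughput_gb_per_hour

# Hypothetical data size and throughput, picked so a single job takes 4 hours.
baseline_rpo = best_case_rpo_hours(400, 100)   # one job takes 4 hours
doubled_rpo = best_case_rpo_hours(400, 200)    # double the throughput, half the RPO

print(f"baseline: {baseline_rpo} h, doubled throughput: {doubled_rpo} h")
```

This ignores the production impact mentioned above, so treat the result as a floor rather than a target.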
Recovery Time Objective (RTO)
Since RTO represents the amount of time it takes to restore the application once a recovery operation begins, reducing the RTO can be achieved by shortening the time to begin the restore process, and speeding up the restore process itself. Starting the restore process earlier requires the backup data to be located closer to the production location. A tape located in the tape library, versus in a vault, versus at a remote location, for example affects this time. Disk is technically closer than tape since there is no requirement to mount the tape and fast forward it to find the data. The speed of the process itself is dependent on the backup/restore technology, network bandwidth, type of media the backup was stored on, and other factors. Improving the performance of a restore job can be done one of two ways – increase network bandwidth or decrease the amount of data that must be moved across the network for the restore.
This simple graph shows the relationship of RTO and RPO to the cost of the solution as well as the potential loss. The values here are all relative since every environment has a unique profit situation and the myriad backup/restore options on the market cover every possible budget.
Improving RTO and/or RPO generally increases the cost of a solution. This is why you need to define the minimum RPO and RTO requirements for your data up front, and why you need to know the value of your data before you can do that. So how do you determine the value?
Start by answering two questions…
How much is the data itself worth?
If your business buys or creates copyrighted content and sells that content, then the content itself has value. Understanding the value of that data to your business will help you define how much you are willing to spend to ensure that data is protected in the event of corruption, deletion, fire, etc. This can also help determine what Recovery Point Objective you need for this data, i.e., how much of the data you can afford to lose in the event of a failure.
If the total value of your content is $1000 and you generate $1 of new content per day, it might be worth spending 10% of the total value ($100) to protect the data and achieve an RPO of 24 hours. Remember, this 10% investment is essentially an insurance policy against the 40% chance of data loss mentioned above which could involve some or all of your $1000 worth of content. Also keep in mind that you will lose up to 24 hours of the most recent data ($1 value) since your RPO is 24 hours. You could implement a more advanced solution that shortens the RPO to 1 hour or even zero, but if the additional cost of that solution is more than the value of the data it protects, it might not be worth doing. Legal, Financial, and/or Government regulations can add a cost to data loss through fines which should also be considered. If the loss of 24 hours of data opens you up to $100 in fines, then it makes sense to spend money to prevent that situation.
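As a rough sketch of the insurance reasoning above, the numbers can be put side by side in Python. This assumes the 40% figure is treated as an annual probability of a loss event and that the fraction of content lost in an event is a free parameter; both are simplifications of mine, not something the survey states.

```python
TOTAL_VALUE = 1000.0       # total value of the content, from the example
LOSS_PROBABILITY = 0.40    # assumed to be an annual probability of data loss
PROTECTION_COST = 100.0    # 10% of the total value
NEW_CONTENT_PER_DAY = 1.0  # value created per day

def expected_annual_loss(fraction_lost: float) -> float:
    """Expected yearly loss with no protection, for a given fraction lost per event."""
    return TOTAL_VALUE * LOSS_PROBABILITY * fraction_lost

# Worst case: a loss event wipes out all of the content.
worst_case = expected_annual_loss(1.0)

# Fraction of content lost per event at which protection pays for itself.
break_even_fraction = PROTECTION_COST / (TOTAL_VALUE * LOSS_PROBABILITY)

# With a 24-hour RPO, a restore still gives up at most one day of new content.
residual_loss_per_restore = NEW_CONTENT_PER_DAY * 1

print(worst_case, break_even_fraction, residual_loss_per_restore)
```

Under these assumptions the $100 spend is justified once a typical loss event would destroy more than about a quarter of the content.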
How much value does the data create per minute/hour/day?
Whether or not your data has value on its own, the ability to access it may have value. For example, if your business sells products or services through a website and a database must be online for sales transactions to occur, then an outage of that database causes loss of revenue. Understanding this will help you define a Recovery Time Objective, i.e., how long it is acceptable for this database to be down in the event of a failure, and how much you should spend trying to shorten the RTO before you get diminishing returns.
If you have a website that supports company net profits of $1000 a day, it’s pretty easy to put together an ROI for a backup solution that can restore the website back into operation quickly. In this example, every hour you save in the restore process prevents $42 of net loss. Compare the cost of improving restore times against the net loss per hour of outage. There is a crossover point which will provide a good return on your investment.
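The $42 figure and the crossover idea can be worked through with a short Python sketch. The $500 upgrade cost and the 8 hours saved per restore are hypothetical values used only to show the calculation.

```python
def loss_per_hour(net_profit_per_day: float) -> float:
    """Net profit lost for each hour the site is down."""
    return net_profit_per_day / 24

def savings_per_restore(net_profit_per_day: float, hours_saved: float) -> float:
    """Loss avoided on one restore by shortening it by hours_saved."""
    return loss_per_hour(net_profit_per_day) * hours_saved

hourly_loss = loss_per_hour(1000)   # about 41.67, the "$42" in the text

# Hypothetical: a $500 upgrade that makes every restore 8 hours faster.
upgrade_cost = 500.0
restores_to_break_even = upgrade_cost / savings_per_restore(1000, 8)

print(f"loss per hour: ${hourly_loss:.2f}")
print(f"restores needed to pay back the upgrade: {restores_to_break_even:.1f}")
```

If the number of restores you expect over the life of the solution is above the break-even count, the faster restore is past the crossover point.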
Your vendor will be happy when you give them specific RPO and RTO requirements.
Nothing derails a backup/recovery solution discussion quicker than a lack of requirements. Your vendor of choice will most likely be happy to help you define them but it will help immensely if you have some idea of your own before discussions start. There are many different data protection solutions on the market and each has its own unique characteristics that can provide a range of RPOs and RTOs as well as fit different budgets. Several vendors, including EMC, have multiple solutions of their own; one size definitely does not fit all. Once you understand the value of your data, you can work with your vendor(s) to come up with a solution that meets your desired RPO and RTO while also keeping a close eye on the financial value of the solution.
I came across this press release today from a company that I wasn’t familiar with and immediately wanted more information. Cirtas Systems has announced support for Atmos-based clouds, including AT&T Synaptic Storage. Whenever I see these types of announcements, I read on in hopes of seeing real Fibre Channel block storage leveraging cloud-based architectures in some way. So far I’ve been a bit disappointed since the closest I’ve seen has been NAS-based systems, at best including iSCSI.
Cirtas BlueJet Cloud Storage Controller is pretty interesting in its own right though. It’s essentially an iSCSI storage array with a cache and a small amount of SSD and SAS drives for local storage. Any data beyond the internal 5TB of usable capacity is stored in “the cloud” which can be an onsite Private Cloud (Atmos or Atmos/VE) and/or a Public Cloud hosted by Amazon S3, Iron Mountain, AT&T Synaptic, or any Atmos-based cloud service provider.
The neat thing with BlueJet is that it leverages a ton of the functionality that many storage vendors have been developing recently such as data de-duplication, compression, some kind of block level tiering, and space efficient snapshots to improve performance and reduce the costs of cloud storage. It seems that pretty much all of the local storage (SAS, SSD, and RAM) is used as a tiered cache for hot data. This gives users and applications the sense of local SAN performance even while hosting the majority of data offsite.
While I haven’t seen or used a BlueJet device and can’t make any observations about performance or functionality, I believe this sort of block->cloud approach has pretty significant customer value. It reduces physical datacenter costs for power and cooling, and it presents some rather interesting disaster recovery opportunities.
Similar to how Compellent’s signature feature, tiered block storage, has been added to more traditional storage arrays, I think modified implementations of Cirtas’ technology will inevitably come from the larger players, such as EMC, as a feature in standard storage arrays. If you consider that EMC Unified Storage and EMC Symmetrix VMAX both have large caches and block-level tiering today, it’s not too much of a stretch to integrate Atmos directly into those storage systems as another tier. EMC already does this for NAS with the EMC File Management Appliance.
I can imagine leveraging FASTCache and FASTVP to tier locally for the data that must be onsite for performance and/or compliance reasons and pushing cold/stale blocks off to the cloud. Additionally, adding cloud as a tier to traditional storage arrays allows customers to leverage their existing investment in storage, FC/FCoE networks, reporting and performance trending tools, the extensive replication options available, and the existing support for VMware APIs like SRM and VAAI.
With this model, replication of data for disaster recovery/avoidance only needs to be done for the onsite data since the cloud data could be accessed from anywhere. At a DR site, a second storage system connects to the same cloud and can access the cold/stale data in the event of a disaster.
Another option would be adding this functionality to virtualization platforms like EMC VPLEX for active/active multi-site access to SAN data, while only needing to store the majority of the company’s data once in the cloud for lower cost. Customers would no longer have to buy double the required capacity to implement a disaster recovery strategy.
I’m eagerly awaiting the implementation of cloud into traditional block storage and I can see how some vendors will be able to do this easily, while others may not have the architecture to integrate as easily. It will be interesting to see how this plays out.
My recent post about Compression vs Dedupe, which was sparked by Vaughn’s blog post about NetApp’s new compression feature, got me thinking more about the use of de-duplication and compression at the same time. Can they work together? What is the resulting effect on storage space savings? What if we throw encryption of data into the mix as well?
What is Data De-Duplication?
De-duplication in the data storage context is a technology that finds duplicate patterns of data in chunks of blocks (sized from 4-128KB or so depending on implementation), stores each unique pattern only once, and uses reference pointers in order to reconstruct the original data when needed. The net effect is a reduction in the amount of physical disk space consumed.
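A minimal sketch of that idea in Python, assuming fixed 4KB chunks and SHA-256 fingerprints. Both choices, and the `DedupStore` name, are illustrative; real products differ in chunk size, chunking method, and hashing.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes

class DedupStore:
    """Toy de-duplicating store: each unique chunk is kept once."""

    def __init__(self) -> None:
        self.chunks = {}  # fingerprint -> chunk bytes

    def write(self, data: bytes) -> list:
        """Store data and return the list of reference pointers (fingerprints)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(fingerprint, chunk)  # keep first copy only
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Reconstruct the original data by following the pointers."""
        return b"".join(self.chunks[fp] for fp in recipe)

    def physical_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())

store = DedupStore()
data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
recipe = store.write(data)
restored = store.read(recipe)

print(f"logical: {len(data)} bytes, physical: {store.physical_bytes()} bytes")
```

Five logical chunks are written, but only the two distinct patterns consume physical space.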
What is Data Compression?
Compression finds very small patterns in data (down to just a couple bytes or even bits at a time in some cases) and replaces those patterns with representative patterns that consume fewer bytes than the original pattern. An extremely simple example would be replacing 1000 x “0”s with “0-1000”, reducing 1000 bytes to only 6.
Compression works on a more micro level, where de-duplication takes a slightly more macro view of the data.
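The “0-1000” example above is essentially run-length encoding, which can be sketched in Python as follows. This is the simplest possible scheme, shown only to make the idea concrete; production compression algorithms are far more sophisticated.

```python
from itertools import groupby

def rle_encode(text: str) -> list:
    """Collapse each run of a repeated character into a (character, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs: list) -> str:
    """Expand (character, count) pairs back into the original text."""
    return "".join(char * count for char, count in pairs)

original = "0" * 1000
encoded = rle_encode(original)
label = "".join(f"{char}-{count}" for char, count in encoded)

print(label, len(label))  # the 6-character form used in the example above
```

Note that the text label is only for display; a real format would need an unambiguous way to separate characters from counts.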
What is Data Encryption?
In a very basic sense, encryption resembles compression in that both transform the patterns in the original data into new patterns. Rather than compare the original data to itself to save space, encryption uses an input (a key) to compute new patterns from the original patterns, making the data impossible to understand if it is read without the matching key.
Encryption and Compression break De-Duplication
One of the interesting things about most compression and encryption algorithms is that a small change in the source data changes most of the output that follows it, and many encryption schemes deliberately produce different output each time, even for identical input. This means that even if the source data has repeating patterns, the compressed and/or encrypted version of that data most likely does not. So if you are using a technology that looks for repeating patterns of bytes in fairly large chunks (4-128KB), such as data de-duplication, compression and encryption both reduce the space savings significantly, if not completely.
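The effect is easy to reproduce with a small Python experiment. This sketch assumes fixed 4KB chunks, SHA-256 fingerprints, zlib as the compressor, and two synthetic "dumps" that differ in a single row; none of this reflects how any particular product is implemented.

```python
import hashlib
import random
import zlib

CHUNK = 4096   # assumed fixed chunk size in bytes
ROWS = 16384   # 16384 rows x 64 bytes = 1 MiB per dump

def make_dump(change_first_row: bool = False) -> bytes:
    """Build a synthetic database dump made of fixed-width 64-byte rows."""
    rng = random.Random(1)  # fixed seed, so both dumps contain the same rows
    rows = [f"{i:08d},{rng.getrandbits(64):020d},{'A' * 33}\n".encode()
            for i in range(ROWS)]
    if change_first_row:
        # Same length as the row it replaces, so chunk boundaries do not move.
        rows[0] = f"{0:08d},{0:020d},{'B' * 33}\n".encode()
    return b"".join(rows)

def chunk_stats(*streams: bytes) -> tuple:
    """Return (unique_chunks, total_chunks) across all of the given streams."""
    seen, total = set(), 0
    for stream in streams:
        for offset in range(0, len(stream), CHUNK):
            seen.add(hashlib.sha256(stream[offset:offset + CHUNK]).digest())
            total += 1
    return len(seen), total

monday = make_dump()
tuesday = make_dump(change_first_row=True)

raw_unique, raw_total = chunk_stats(monday, tuesday)
zip_unique, zip_total = chunk_stats(zlib.compress(monday), zlib.compress(tuesday))

print(f"raw dumps:        {raw_unique} unique of {raw_total} chunks")
print(f"compressed dumps: {zip_unique} unique of {zip_total} chunks")
```

The two raw dumps share every chunk except the one containing the changed row, so the second dump costs almost nothing to store. The compressed dumps would be expected to share few chunks, if any, because the one-row change alters the compressed stream from that point on.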
I see this problem a lot in backup environments with DataDomain customers. When a customer encrypts or compresses the backup data before it gets through the backup application and into the DataDomain appliance, the space savings drops and many times the customer becomes frustrated by what they perceive as a failing technology. A really common example is using Oracle RMAN or using SQL LightSpeed to compress database dumps prior to backing up with a traditional backup product (such as NetWorker or NetBackup).
Sure LightSpeed will compress the dump 95%, but every subsequent dump of the same database is unique data to a de-duplication engine and you will get little if any benefit from de-duplication. If you leave the dump uncompressed, the de-duplication engine will find common patterns across multiple dumps and will usually achieve higher overall savings. This gets even more important when you are trying to replicate backups over the WAN, since de-duplication also reduces replication traffic.
It all depends on the order
The truth is you CAN use de-duplication with compression, and even encryption. The key is the order in which the data is processed by each algorithm. Essentially, de-duplication must come first. After data is processed by de-duplication, there is enough data in the resulting 4-128KB blocks to be compressed, and the resulting compressed data can be encrypted. Similar to de-duplication, compression will have lackluster results with encrypted data, so encrypt last.
Original Data -> De-Dupe -> Compress -> Encrypt -> Store
There are good examples of this already:
EMC DataDomain – After incoming data has been de-duplicated, the DataDomain appliance compresses the blocks using a standard algorithm. If you look at statistics on an average DDR appliance you’ll see 1.5-2X compression on top of the de-duplication savings. DataDomain also offers an encryption option that encrypts the filesystem and does not affect the de-duplication or compression ratios achieved.
EMC Celerra NAS – Celerra De-Duplication combines single instance store with file level compression. First, the Celerra hashes the files to find any duplicates, then removes the duplicates, replacing them with a pointer. Then the remaining files are compressed. If Celerra compressed the files first, the hash process would not be able to find duplicate files.
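The hash-first, compress-second order described for file-level single instancing can be sketched in Python like this. SHA-256, zlib, and the function names are assumptions for illustration; the source does not say which hash or compression algorithm Celerra uses.

```python
import hashlib
import zlib

def single_instance_then_compress(files: dict) -> tuple:
    """Hash whole files to find duplicates, then compress each unique file once.

    files maps a file name to its content (bytes).
    Returns (pointers, blobs): name -> content hash, and hash -> compressed bytes.
    """
    pointers = {}
    blobs = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        pointers[name] = digest                      # duplicates become pointers
        if digest not in blobs:
            blobs[digest] = zlib.compress(content)   # compress unique files only
    return pointers, blobs

def read_file(name: str, pointers: dict, blobs: dict) -> bytes:
    """Follow the pointer and decompress to get the original content back."""
    return zlib.decompress(blobs[pointers[name]])

files = {
    "report.txt": b"quarterly numbers\n" * 200,
    "report_copy.txt": b"quarterly numbers\n" * 200,
    "notes.txt": b"meeting notes\n" * 200,
}
pointers, blobs = single_instance_then_compress(files)

logical = sum(len(content) for content in files.values())
physical = sum(len(blob) for blob in blobs.values())
print(f"{len(files)} files, {len(blobs)} stored, {logical} -> {physical} bytes")
```

Hashing the uncompressed content keeps duplicate detection independent of whatever the compressor produces.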
So what’s up with NetApp’s numbers?
Back to my earlier post on Dedupe vs. Compression; what is the deal with NetApp’s dedupe+compression numbers being mostly the same as with compression alone? Well, I don’t know all of the details about the implementation of compression in ONTAP 8.0.1, but based on what I’ve been able to find, compression could be happening before de-duplication. This would easily explain the storage savings graph that Vaughn provided in his blog. Also, NetApp claims that ONTAP compression is inline, and we already know that ONTAP de-duplication is a post-process technology. This suggests that compression is occurring during the initial writes, while de-duplication is coming along after the fact looking for duplicate 4KB blocks. Maybe the de-duplication engine in ONTAP uncompresses the 4KB block before checking for duplicates but that would seem to increase CPU overhead on the filer unnecessarily.
Encryption before or after de-duplication/compression – What about compliance?
I make a recommendation here to encrypt data last, i.e., after all data-reduction technologies have been applied. However, the caveat is that for some customers, with some data, this is simply not possible. If you must encrypt data end-to-end for compliance or business/national security reasons, then by all means, do it. The unfortunate byproduct of that requirement is that you may get very little space savings on that data from de-duplication both in primary storage and in a backup environment. This also affects WAN bandwidth when replicating since encrypted data is difficult to compress and accelerate as well.
The more I talk with customers, the more I find that the technical details of how something works are much less important than the business outcome it achieves. When it comes to storage, most customers just want a device that will provide the capacity and performance they need, at a price they can afford–and it better not be too complicated. Pretty much any vendor trying to sell something will attempt to make their solution fit your needs even if they really don’t have the right products. It’s a fact of life: sell what you have. Along these lines, there has been a lot of back and forth between vendors about dedup vs. compression technology and which one solves customer problems best.
After snapshots and thin provisioning, data reduction technology in storage arrays has become a big focus in storage efficiency lately; and there are two primary methods of data reduction — compression and deduplication.
While EMC has been marketing compression technology for block and file data in Celerra, Unified, and Clariion storage systems, NetApp has been marketing deduplication as the technology of choice for block and file storage savings. But which one is the best choice? The short answer is: it depends. Some data types benefit most from deduplication while others get better savings with compression.
Currently, EMC supports file compression on all EMC Celerra NS20, 40, 80, 120, 480, 960, VG2, and VG8 systems running DART 5.6.47.x+ and block compression on all CX4 based arrays running FLARE30.x+. In all cases, compression is enabled on a volume/LUN level with a simple check box and processing can be paused, resumed, and disabled completely, uncompressing the data if desired. Data is compressed out-of-band and has no impact on writes, with minimal overhead on reads. Any or all LUN(s) and/or Filesystem(s) can be compressed if desired even if they existed prior to upgrading the array to newer code levels.
With the release of OnTap 8.0.1, NetApp has added support for in-line compression within their FAS arrays. It is enabled per-FlexVol and as far as I have been able to determine, cannot be disabled later (I’m sure Vaughn or another NetApp representative will correct me if I’m wrong here.) Compression requires 64-bit aggregates which are new in OnTap 8, so FlexVols that existed prior to an upgrade to 8.x cannot be compressed without a data migration which could be disruptive. Since compression is inline, it creates overhead in the FAS controller and could impact performance of reads and writes to the data.
Vaughn Stewart, of NetApp, expertly blogged today about the new compression feature, including some of the caveats involved, and to me the most interesting part of the post was the following graphic he included showing the space savings of compression vs. dedup for various data types.
Yesterday, in his blog post entitled “Myth Busting: Storage Guarantees”, Vaughn Stewart from NetApp blogged about the EMC 20% Guarantee and posted a chart of storage efficiency features from EMC and NetApp platforms to illustrate his point. Chuck Hollis from EMC called it “chartsmithing” in a comment but didn’t elaborate specifically on the chart’s deficiencies. Well, allow me to take that ball…
As presented, Vaughn’s chart (below) is technically factual (with one exception which I’ll note), but it plays on the human emotion of Good vs Bad (Green vs Red) by attempting to show more Red on EMC products than there should be.
The first and biggest problem is the chart compares EMC Symmetrix and EMC Clariion dedicated-block storage arrays with NetApp FAS, EMC Celerra, and NetApp vSeries which are all Unified storage systems or gateways. Rather than put n/a or leave the field blank for NAS features on the block-only arrays, the chart shows a resounding and red NO, leading the reader to assume that the feature should be there but somehow EMC left it out.
As far as keeping things factual, some of the EMC and NetApp features in this chart are not necessarily shipping today (very soon though, and since it affects both vendors I’ll allow it here). And I must make a correction with respect to EMC Symmetrix and Space Reclamation, which IS available on Symm today.
I’ve taken the liberty of massaging Vaughn’s chart to provide a more balanced view of the feature comparison. I’ve also added EMC Celerra gateway on Symmetrix to the comparison as well as an additional data point which I felt was important to include.
1.) I removed the block only EMC configuration devices because the NetApp devices in the comparison are Unified systems.
2.) I removed the SAN data row for Single Instance storage because Single Instance (identical file) data reduction technology is inherently NAS related.
3.) Zero Space Reclamation is a feature available in Symmetrix storage. In Clariion, the Compression feature can provide a similar result since zero pages are compressible.
I left the 3 different data reduction techniques as individually listed even though the goal of all of them is to save disk space. Depending on the data types, each method has strengths and weaknesses.
One question, if a bug in OnTap causes a vSeries to lose access to the disk on a Symmetrix during an online Enginuity upgrade, who do you call? How would you know ahead of time if EMC hasn’t validated vSeries on Symmetrix like EMC does with many other operating systems/hosts/applications in eLab?
The goal of my post here really is to show how the same data can be presented in different ways to give readers a different impression. I won’t get into too much as far as technical differences between the products, like how comparing FAS to Symmetrix is like comparing a box truck to a freight train, or how fronting an N+1 loosely coupled clustered, global cached, high-end storage array with a midrange dual-controller gateway for block data might not be in a customer’s best interest.
What do you think?
Well, not exactly. What you really need is a restore solution!
I was discussing this with a colleague recently as we compared difficulties multiple customers are having with backups in general. My colleague was relating a discussion he had with his customer where he told them, “stop thinking about how to design a backup solution, and start thinking about how to design a restore solution!”
Most of our customers are in the same boat, they work really hard to make sure that their data is backed up within some window of time, and offsite as soon as possible in order to ensure protection in the event of a catastrophic failure. What I’ve noticed in my previous positions in IT and more so now as a technical consultant with EMC is that (in my experience) most people don’t really think about how that data is going to get restored when it is needed. There are a few reasons for this:
- Backing up data is the prerequisite for a restore; IT professionals need to get backups done, regardless of whether they need to restore the data. It’s difficult to plan for theoretical needs and restore is still viewed, incorrectly, as theoretical.
- Backup throughput and duration is easily measured on a daily basis, restores occur much more rarely and are not normally reported on.
- Traditional backup has been done largely the same way for a long time and most customers follow the same model of nightly backups (weekly full, daily incremental) to disk and/or tape, shipping tape offsite to Iron Mountain or similar.
I think storage vendors, EMC and NetApp particularly, are very good at pointing out the distinction between a backup solution and a restore solution, where backup vendors are not quite as good at this. So what is the difference?
When designing a backup solution the following factors are commonly considered:
- Size of Protected Data – How much data do I have to protect with backup (usually GB or TB)
- Backup Window – How much time do I have each night to complete the backups (in hours)
- Backup Throughput – How fast can I move the data from its normal location to the backup target
- Applications – What special applications do I have to integrate with (Exchange, Oracle, VMware)
- Retention Policy – How long do I have to hang on to the backups for policy or legal purposes
- Offsite storage – How do I get the data stored at some other location in case of fire or other disaster
If you look at it from a restore perspective, you might think about the following:
- How long can I afford to be down after a failure? Recovery Time Objective (RTO): This will determine the required restore speed. If all backups are stored offsite, the time to recall a tape or copy data across the WAN affects this as well.
- How much data can I afford to lose if I have to restore? Recovery Point Objective (RPO): This will determine how often the backup must occur, and in many cases this is less than 24 hours.
- Where do I need to restore the application? This will help in determining where to send the data offsite.
Answer these questions first and you may find that a traditional backup solution is not going to fulfill your requirements. You may need to look at other technologies, like Snapshots, Clones, replication, CDP, etc. If a backup takes 8 hours, the restore of that data will most likely take at least 8 hours, if not closer to 16 hours. If you are talking about a highly transactional database, hosting customer facing web sites, and processing millions of dollars per hour, 8 hours of downtime for a restore is going to cost you tens or hundreds of millions of dollars in lost revenue.
Two of my customers have database instances hosted on EMC storage, for example, which are in the 20TB size range. They’ve each architected a backup solution that can get that 20TB database backed up within their backup window. The problem is, once that backup completes, they still have to offsite the backup, and replicate it to their DR site across a relatively small WAN link. They both use compressed database dumps for backup because, from the DBA’s perspective, dumps are the easiest type of backup to restore from, and the compression helps get 20TB of data pushed across 1GbE connections to the backup server. One of the customers is actually backing up all of their data to DataDomain deduplication appliances already; the other is planning to deploy DataDomain. The problem in both cases is that, if you pre-compress the backup data, you break deduplication, and you get no benefit from the DataDomain appliance vs. traditional disk. Turning off compression in the dump can’t be done because the backup would take longer than the backup window allows. The answer here is to step back, think about the problem you are trying to solve–restoring data as quickly as possible in the event of failure–and design for that problem.
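To see why these customers feel they cannot turn compression off, it helps to work out how long 20TB takes to cross a single 1GbE link. The Python sketch below assumes decimal terabytes, a link running at full line rate, and a hypothetical 4:1 compression ratio; real throughput would be lower and the actual ratio depends on the data.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link of link_gbps gigabits/s."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 1e12 * 8                        # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

uncompressed_hours = transfer_hours(20, 1)           # full 20TB at line rate
compressed_hours = transfer_hours(20 / 4, 1)         # hypothetical 4:1 compression

print(f"uncompressed: {uncompressed_hours:.1f} h")
print(f"compressed 4:1: {compressed_hours:.1f} h")
```

Even at line rate, the uncompressed dump takes nearly two days over a single 1GbE link, which is why the fix has to change how the data moves rather than just turning compression off.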
How might these customers leverage what they already have, while designing a restore solution to meet their needs?
Since they are already using EMC storage, the first step would be to start taking snapshots and/or clones of the database. These snapshots can be used for multiple purposes…
- In the event of database corruption, or other host/filesystem/application level problem, the production volume can be reverted to a snapshot in a matter of minutes regardless of the size of the database (better RTO). Snapshots can be taken many times a day to reduce the amount of data loss incurred in the event of a restore (better RPO).
- A snapshot copy of the database can be mounted to a backup server directly and backed up directly to tape or backup disk. This eliminates the requirement to perform database dumps at all as well as any network bottleneck between the database server and backup server. Since there is no dump process, and no requirement to pre-compress the data, de-duplication (via DataDomain) can be employed most efficiently. Using a small 10Gbps private network between the backup media servers and DataDomain appliances, in conjunction with DD-BOOST, throughput can be 2.5X faster than with CIFS, NFS, or VTL to the same DataDomain appliance. And with de-duplication being leveraged, retention can be very long since each day’s backup only adds a small amount of new data to the DataDomain.
- Now that we’ve improved local restore RTO/RPO, eliminated the backup window entirely for the database server, and decreased the amount of disk required for backup retention, we can replicate the backup to another DataDomain appliance at the DR site. Since we are taking full advantage of de-duplication now, the replication bandwidth required is greatly reduced and we can offsite the backup data in a much shorter period of time.
- Next, we give the DBAs back the ability to restore databases easily, and at will, by leveraging EMC Replication Manager. RM manages the snapshot schedules, mounting of snaps to the backup server, and initiation of backup jobs from the snapshot, all in a single GUI that storage admins and DBAs can access simultaneously.
So we leveraged the backup application they already own, the DataDomain appliances they already own, storage arrays they already own, built a small high-bandwidth backup network, and layered some additional functionality, to drastically improve their ability to restore critical data. The very next time they have a data integrity problem that requires a restore, these customers will save literally millions of dollars due to their ability to restore in minutes vs. hours.
If RPOs of a few hours are not acceptable, then a Continuous Data Protection (CDP) solution could be added to this environment. EMC RecoverPoint CDP can journal all database activity to be used to restore to any point in time, bringing data loss (RPO) to zero or near-zero, something no amount of snapshots can provide, and keeping restore time (RTO) within minutes (like snapshots). Further, the journaled copy of the database can be stored on a different storage array providing complete protection for the entire hardware/software stack. RecoverPoint CDP can be combined with Continuous Remote Replication (CRR) to replicate the journaled data to the DR site and provide near-zero RPO and extremely low RTO in a DR/BC scenario. Backups could be transitioned to the DR site leveraging the RecoverPoint CRR copies to reduce or eliminate the need to replicate backup data. EMC Replication Manager manages RecoverPoint jobs in the same easy to use GUI as snapshot and clone jobs.
There are a whole host of options available from EMC (and other storage vendors) to protect AND restore data in ways that traditional backup applications cannot match. This does not mean that backup software is not also needed, as it usually ends up being a combined solution.
The key to architecting a restore solution is to start thinking about what would happen if you had to restore data, how that impacts the business and the bottom line, and then architect a solution that addresses the business’ need to run uninterrupted, rather than a solution that is focused on getting backups done in some arbitrary daily/nightly window.
This past week, during EMC World 2010 in Boston, EMC made several announcements of updates to the Celerra and CLARiiON midrange platforms. Some of the most impressive were new capabilities coming to CLARiiON FLARE in just a couple short months. Major updates to Celerra DART will coincide with the FLARE updates and if you are already running CLARiiON CX4 hardware, or are evaluating CX4 (or Celerra), you will want to check these new features out. They will be available to existing CX4(120,240,480,960)/NS(120,480,960) systems as part of a software update.
Here’s a list of key changes in FLARE 30:
- Unified management for midrange storage platforms including CLARiiON and Celerra today, plus RecoverPoint, Replication Manager and more in the future. This is a true single pane of glass for monitoring AND managing SAN, NAS, and data protection and it’s built in to the platform. “EMC Unisphere” replaces Navisphere Manager and Celerra Manager and supports multiple storage systems simultaneously in a single window. (Video Demo)
- Extremely large cache (i.e., FASTCache) – Up to 2TB of additional read/write cache in CLARiiON using SSDs (Video Demo)
- Block level Fully Automated Storage Tiering (i.e., sub-LUN FAST) – Fully automated assignment of data across multiple disk types
- Block Level Compression – Compress LUNs in the CLARiiON to reduce disk space requirements
- VAAI Support – Integrate with vSphere ESX for improved performance
These features are in addition to existing features like:
- Seamless and non-disruptive mobility of LUNs within a storage array – (via Virtual LUNs)
- Non-Disruptive Data Migration – (via PowerPath Migration Enabler)
- VMWare Aware Storage Management – (Navisphere, Unisphere, and vSphere Plugins giving complete visibility and self-service provisioning for VMWare admins (Video Demo) AND Storage Admins)
- CIFS and NFS Compression – Compress production data on Celerra to reduce disk space requirements including VMs
- Dynamic SAN path load balancing – (via PowerPath)
- At-Rest-Encryption – (via PowerPath w/RSA)
- SSD, FC, and SATA drives in the same system – Balance performance and capacity as needed for your application
- Local and Remote replication with array level consistency – (SnapView, MirrorView, etc)
- Hot-swap, Hot-Add, Hot-Upgrade IO Modules – Upgrade connectivity for FC, FCoE, and iSCSI with no downtime
- Scale to 1.8PB of storage in a single system
- Simultaneously provide FC, iSCSI, MPFS, NFS, and CIFS access
All together, this is an impressive list of features for a single platform. In fact, while many of EMC’s competitors have similar features, none of them have all of them in the same platform, or leverage them all simultaneously to gain efficiency. When CLARiiON CX4 and Celerra NS are integrated and managed as a single Unified storage system with EMC Unisphere there is tremendous value as I’ll point out below…
Improve Performance easily…
- Install a couple of SSD drives into a CLARiiON and enable FASTCache to increase the array’s read/write cache from the industry-competitive 4GB-32GB up to 2TB of array-based non-volatile Read AND Write cache available to ALL applications including NAS data hosted by the array.
- Install PowerPath on Windows, Linux, Solaris, AND VMWare ESX hosts to automatically balance IO across all available paths to storage. PowerPath detects latency and queuing occurring on each path and adjusts automatically, improving performance at the storage array AND for your hosts. This is a huge benefit in VMWare environments especially.
- When VMWare releases the updated version of vSphere ESX that supports VAAI, ESX will be able to leverage VAAI support in the CLARiiON to reduce the amount of IO required to do many tasks, improving performance across the environment again.
- Upgrade from 1GbE iSCSI to 10GbE iSCSI, or from 4Gb Fibre Channel to 8Gb Fibre Channel, without a screwdriver or downtime.
- Provide NAS shared file access with block-level performance for any application using EMC’s MPFS protocol.
Improve Efficiency and cost easily…
- Create a single pool of storage containing some SSD, some FC, and some SATA drives, that automatically monitors and moves portions of data to the appropriate disk type to both improve performance AND decrease cost simultaneously.
- Non-disruptively compress volumes and/or files with a single click to save 50% of your disk space in many cases.
- Convert traditional LUNs to more efficient Thin-LUNs non-disruptively using PowerPath Migration Enabler, saving more disk space.
Increase and Manage Capacity easily…
- Add additional storage non-disruptively with SSD, FC, and SATA drives in any mix up to 1.8PB of raw storage in a single CLARiiON CX4.
- Using FASTCache, iSCSI, FC, and FCoE connectivity simultaneously does not reduce total capacity of the system.
- Expanding LUNs, RAID Groups, and Storage Pools is non-disruptive.
- Migrating LUNs between RAID groups and/or Storage Pools is non-disruptive using built-in CLARiiON LUN Migration, as is migrating data to a different storage array (using PowerPath Migration Enabler)!
- Balancing workload between storage processors is non-disruptive and at individual LUN granularity.
Protect your data easily…
- Snapshot, Clone, and Replicate any of the data to anywhere with built-in array tools that can maintain complete data consistency across a single application, or multiple applications, without installing software.
- Maintain application consistency for Exchange, SQL, Oracle, SAP, and much more, even within VMWare VMs, while replicating to anywhere with a single pane-of-glass.
- Encrypt sensitive data seamlessly using PowerPath Encryption w/RSA.
- While you can do all of these things quickly and simply, you still have the flexibility to create traditional RAID sets using RAID 0, 1, 5, 6, and 10 where you need highly predictable performance, or tune read and write cache at the array and LUN level for specific workloads. Do you want read/write snapshots? How about full copy clones on completely separate disks for workload isolation and failure protection? What about the ability to roll back data to different points in time using snapshots without deleting any other snapshots? EMC Storage arrays have been able to do this for a long time and that hasn’t changed.
There are few manufacturers aside from EMC that can provide all of these capabilities, let alone provide them within a single platform. That’s the definition of simple, efficient, Unified Storage in my opinion.
Okay, now that I’ve talked about backing up the datacenter with NetBackup and DataDomain, and backing up remote sites with NetBackup and PureDisk, it’s time to discuss how to get all that data offsite to protect against a catastrophic event at the datacenter.
As mentioned before we have a primary datacenter with the majority of our systems including the backup environment, and a secondary “disaster recovery” datacenter to which we replicate tier 1 applications for business continuity purposes. Since we really wanted to get away from using tapes and instead store the backups on disk in our datacenter we have a second backup environment in the DR datacenter and we replicate the backup data there.
There are several ways to replicate backup data between two sites but most of them have drawbacks.
1.) Duplicate the backup data from disk to tape and ship the tapes to the remote site to be ready for restore. This is the easiest and probably cheapest way. But there’s that pesky tape yet again with its media handling and shipping. And restore could take a while since you have to deal with restoring the catalog from tape, then importing the media, etc.
2.) Duplicate the backup data directly from the local disk to the disk in the second location across the WAN. This is not very feasible with any significant amount of data because every byte of data that is backed up in the datacenter has to be copied across the much slower WAN. It could take many days to duplicate a single night’s backup. You’d also need a special Catalog backup job that wrote to a storage device across the WAN. The good here is that the backup application knows there is a second copy of the data and knows how to find it.
3.) Replicate the data with the backup storage device’s native replication. Whether it’s PureDisk, Avamar, or DataDomain, pretty much every source-based or target-based deduplication solution has replication built in that leverages the deduplication to reduce the amount of data that traverses the WAN. The advantage here is that you can have a copy of all of your backup data in a second location in a much shorter time than a traditional copy process. If your deduplication device stores the data with 10:1 compression, then your WAN usage is reduced by 90%. The savings in practice are actually better than that. The drawback is that the backup application (hence the catalog) has no knowledge that there is a second copy of the backup data and after recovering the catalog, you would need to import all of the disk-based media, which could take a long time.
4.) Leverage NetBackup Lifecycle Policies with Symantec OpenSTorage (OST) and an OST-capable backup storage system like DataDomain or PureDisk (with PDDO). Basically this has all the advantages of option #2, where it is a catalog-aware duplication, combined with the advantages of WAN bandwidth savings from option #3. Time to copy the data offsite is much shorter due to deduplication, and time to restore is very fast since the data is already in the catalog and available on the disk.
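The arithmetic behind the 10:1 example in option 3 can be sketched in a few lines of Python. The 9 TB nightly volume and the 100 Mbps WAN link below are hypothetical inputs chosen only to show the scale of the difference, and the calculation ignores protocol overhead and any link sharing.

```python
def wan_reduction_percent(dedup_ratio):
    """Percentage of WAN traffic avoided when only deduplicated data is replicated."""
    if dedup_ratio <= 0:
        raise ValueError("dedup_ratio must be positive")
    return (1 - 1 / dedup_ratio) * 100


def transfer_hours(num_bytes, link_mbps):
    """Hours needed to move num_bytes over a link of link_mbps megabits per second."""
    seconds = num_bytes * 8 / (link_mbps * 1e6)
    return seconds / 3600


nightly_bytes = 9e12   # hypothetical 9 TB nightly backup (decimal terabytes)
link_mbps = 100        # hypothetical WAN link speed
dedup_ratio = 10       # the 10:1 ratio from option 3

full_copy = transfer_hours(nightly_bytes, link_mbps)
deduped = transfer_hours(nightly_bytes / dedup_ratio, link_mbps)

print(f"WAN reduction:   {wan_reduction_percent(dedup_ratio):.0f}%")
print(f"Full copy:       {full_copy:.0f} hours")
print(f"Deduplicated:    {deduped:.0f} hours")
```

Under these assumptions a plain copy (option 2) would take days for one night of backups, while the deduplicated replication fits inside a single day, which is the gap options 3 and 4 are meant to close.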
OpenSTorage (OST) is a network protocol that Symantec developed to interface with disk-based backup storage systems and DataDomain was an early adopter of OST. OST allows NetBackup to control replication between OST-capable storage systems and keep track of the replicated copies of backups in the Catalog just as if NetBackup had made both copies itself. OST is also used as the protocol to send the backup data to the storage device as opposed to CIFS/NFS or VTL. DataDomain appliances support OST as does PureDisk when used in conjunction with the PDDO option discussed earlier. In NetBackup, replication controlled by OST is called “optimized duplication” and is controlled primarily through Lifecycle Policies.
Traditionally, when creating NetBackup job policies, the administrator will specify a Storage Unit (either a disk storage unit or a tape library or drive) that the job policy will send backups to. Lifecycle Policies are treated like Storage Units as far as the Job Policy is concerned but the Lifecycle Policy includes a list of storage units, each with its own data retention, that the backup data must be stored onto in order for NetBackup to consider the data fully protected. Typically there is a “Backup” target which is where the actual data coming from the client is stored, followed by one or more “Duplication” targets. After the backup job completes, NetBackup will copy the backup data from the “backup” location to all of the “duplication” locations. This works with pretty much any type of storage and you can mix and match tape and disk in the same policy. Since these are duplication operations, NetBackup will read ALL of the data from the backup location, and write ALL of the data to each duplication location. This can take a long time even on the local network and trying to offsite a lot of data over the WAN is not very feasible.
With OST, the lifecycle policy operates exactly the same except that it uses “optimized duplication”, instructing the storage device to copy the file rather than performing the copy through a media server. So in the case of DataDomain, OST issues the command to the DDR, the DDR then copies the file to the second DDR in the remote site and gets all the benefits of deduplication and compression between the two. The media server doesn’t actually do any work. Once the duplication is complete, the DDR notifies NetBackup and the catalog is updated with a record of the second copy of the backup. Lifecycle Policies are fully automated; you can’t even restart a failed duplication. In the event of a transient failure like a WAN hiccup, NetBackup will retry a duplication job forever until it succeeds in order to satisfy the lifecycle policy.
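The shape of a lifecycle policy, one backup destination followed by duplication destinations that are retried until they succeed, can be modeled with a small Python sketch. This is a conceptual model only: `Destination`, `LifecyclePolicy`, `run_lifecycle`, and the storage unit names are hypothetical and do not correspond to NetBackup’s actual commands or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Destination:
    storage_unit: str     # hypothetical storage unit name
    retention_days: int   # each destination carries its own retention


@dataclass
class LifecyclePolicy:
    backup: Destination                                   # where client data lands first
    duplications: List[Destination] = field(default_factory=list)


def run_lifecycle(policy: LifecyclePolicy, image: str,
                  copy_fn: Callable[[str, str, str], bool],
                  catalog: Dict[str, List[str]]) -> int:
    """Record the backup copy, then duplicate it to every duplication target.

    copy_fn(image, source, target) returns True on success. A failed copy is
    retried until it succeeds. Returns the total number of copy attempts.
    """
    catalog.setdefault(image, []).append(policy.backup.storage_unit)
    attempts = 0
    for dest in policy.duplications:
        while True:
            attempts += 1
            # In a flat lifecycle every duplication reads from the backup copy.
            if copy_fn(image, policy.backup.storage_unit, dest.storage_unit):
                # The catalog learns about the copy only after it completes.
                catalog[image].append(dest.storage_unit)
                break
    return attempts
```

In this model the only difference between a regular duplication and an optimized one is what `copy_fn` does: either the media server reads and rewrites every byte, or it simply asks the storage device to replicate the image itself. A real scheduler would also wait between retries rather than looping immediately.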
As you can probably surmise, this is REALLY nice for a tape-less backup environment. Our DD690 offsites over 9TB of data every night DURING the backup window. When the last backup job completes, the offsite copies are complete within 30 minutes. And there is absolutely no management of the offsite process or duplication jobs besides configuring the lifecycle policies up front. The drawback to regular NetBackup lifecycle policies is that all duplications are taken from the initial backup copy, which limits what you can do with the copies.
Enter NetBackup 6.5.4… Despite the small 6.5.3 -> 6.5.4 version number change, the 6.5.4 release had quite a few new features added. The biggest one was a revamping of the Lifecycle Policy engine to allow for nested duplications. Now you can create a copy of a backup, then create multiple copies from the copy, then create copies from the other copies. Why is this useful?
Remember when I discussed using NetBackup with PDDO to back up remote sites? Well, the data backed up from the remote site is all stored in the primary datacenter and we need to get the second copy to the DR datacenter. Plus, we wanted to have a small cache of recently backed up data sitting on the remote media server for fast restore. Well, nested lifecycles are the key. The lifecycle writes the initial backup copy onto the media server’s local disk which is configured as a capacity-managed staging area (i.e., it stores as much as it can and expires data when it needs more space for new backups). The lifecycle then creates a duplicate of the backup onto the PureDisk storage unit in the primary datacenter. Since bandwidth to the remote site is very limited we don’t want to copy it from the remote site twice, so the lifecycle has a second duplication nested under the first to copy it to the DR datacenter. The source of the second copy is the primary datacenter copy, NOT the remote media server copy.
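One way to picture a nested lifecycle is as a tree in which every copy is made from its parent rather than from the original backup. The Python sketch below models the remote-site example that way; the `Copy` class, the `duplication_plan` function, and the location names are hypothetical illustrations, not NetBackup objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Copy:
    location: str                                   # hypothetical storage location name
    children: List["Copy"] = field(default_factory=list)


def duplication_plan(node: Copy) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs, parents before their nested copies."""
    plan = []
    for child in node.children:
        plan.append((node.location, child.location))
        plan.extend(duplication_plan(child))
    return plan


# Remote-site lifecycle: staging disk -> primary datacenter -> DR datacenter.
remote_site = Copy("remote-media-server-disk", [
    Copy("primary-dc-puredisk", [
        Copy("dr-dc-puredisk"),
    ]),
])

for source, destination in duplication_plan(remote_site):
    print(f"{source} -> {destination}")
```

The plan contains only one copy that leaves the remote site. In a flat lifecycle both children would hang directly off the staging disk, and the slow remote link would carry the data twice.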
Where else can we use this? Let’s consider our tape-less datacenter backups. We back up the clients to the DataDomain in our primary datacenter, then using a lifecycle policy and OST, create a copy on the DataDomain in the DR datacenter. If we also wanted to have a tape copy for long term archive or vaulting we could create a nested duplication to make a copy to a tape library in the DR datacenter from the disk copy that is also in the DR datacenter. Without nested lifecycles the only workable solution would be to create the tape in the primary datacenter. Every copy of the backup made via the lifecycle policy, whether it is using OST or not, is maintained by the catalog and easily used for restore. Furthermore, using OST as the protocol between NetBackup and DataDomain actually increases throughput to the DataDomain DDR systems by approximately 2X vs VTL/CIFS/NFS.
Now to the caveats. Optimized duplication via OST is only available when you are using OST as the protocol between the media server and the storage unit. This means it doesn’t work with VTL even when the DataDomain IS the VTL. OST only works over an Ethernet network, which is why we skipped VTL completely and used 10gbps networks for the DDR connections. We even skipped VTL/Tape for the NAS systems, connected them directly to the 10gbps network, and use 3-way NDMP to back them up over the network, through the media servers, to the DataDomain. We get the benefit of lifecycle policies, optimized duplication, and, as I may have mentioned before, no pesky tape even with NDMP/NAS backups. And the interesting thing is that with the 10gbps connection, the NDMP dumps are faster than direct fiber to tape.
There were other enhancements to NetBackup 6.5.4 centered around OST functionality but the lifecycle policy improvements were huge in my opinion.
To cover the catalog replication, we run NetBackup hot catalog backups to a CIFS share that is hosted by the DataDomain. The DDR replicates that share using DataDomain native replication to the DDR in the DR datacenter where the same data is available via a similar CIFS share. Our standby NetBackup master server is already connected to the CIFS share for catalog restore and connected to the DDR via OST. A single operation restores the catalog from the replicated copy. In a real disaster we can begin restoring user data within 30 minutes from the DR datacenter.
In a previous post I discussed the new backup environment I’ve been deploying, what solutions we picked, and how they apply to the datacenter. But I also mentioned that we had remote sites with systems we need to back up but I didn’t explain how we addressed them. Frankly, the previous post was getting long and backing up remote offices is tricky so it deserved its own discussion.
Now that we had Symantec NetBackup running in the datacenter, backing up the bulk of our systems to disk by way of DataDomain, we needed to look at remote sites. For this we deployed Symantec NetBackup PureDisk. Despite the fact that it has NetBackup in the name, PureDisk is an entirely different product with its own servers, clients, and management interfaces. There are some integration points that are not obvious at first but become important later. Essentially PureDisk is two solutions in a single product: 1) a “source-dedupe” backup solution that can be deployed independent of any other solution, and 2) a “target-dedupe” backup storage appliance specifically integrated with the core NetBackup product via an option called PDDO.
As previously discussed, backing up a remote site across a WAN is best accomplished with a source-dedupe solution like PureDisk or Avamar. This is exactly what we intended to do. Most of our remote site clients are some flavor of UNIX or Windows and installing PureDisk clients was easily accomplished. Backup policies were created in PureDisk and a little over a day later we had the first full backup complete. All subsequent nightly backups transfer very small amounts of data across the WAN because they are incremental backups AND because the PureDisk client deduplicates the data before sending it to the PureDisk server. The downside to this is that the PureDisk jobs have to be scheduled, managed, and monitored from the PureDisk interface, completely separate from the NetBackup administration console. Backups are sent to the primary datacenter and stored on the local PureDisk server, then the backed up data is replicated to the PureDisk server in the DR datacenter using PureDisk native replication. Restores can be run from either of the PureDisk servers but must un-deduplicate the data before sending it across the WAN, making restores much slower than backups. This was a known issue and still meets our SLAs for these systems.
Our biggest hurdle with PureDisk was the client OS support. Since we have a very diverse environment we ran into a couple of clients which had operating systems that PureDisk does not support. Both Netware and x86 versions of Solaris are currently not supported, both of which were running in our remote sites.
We had a few options:
1.) Use the standard NetBackup client at the remote site and push all of the data across the WAN
2.) Deploy a NetBackup media server in the remote site with a tape library and send the tapes offsite
3.) Deploy a NetBackup media server in the remote site with a small DataDomain appliance and replicate
4.) Deploy a NetBackup media server and ALSO use PureDisk via the PDDO option (PureDisk Deduplication Option)
Option 1 is not feasible for any serious amount of data, Option 2 requires a costly tape library and some level of media handling every day, and Option 3 just plain costs too much money for a small remote site.
Option 4, using PDDO, leverages PureDisk’s “target-dedupe” persona and ends up being a very elegant solution with several benefits.
PDDO is a plug-in that installs on a NetBackup media server. The PDDO plug-in deduplicates data that is being backed up by that media server and sends it across the network to a PureDisk server for storage. The beauty of this option is that we were able to put a NetBackup media server in our remote site without any tape or other storage. The data is copied from the client to the media server over the LAN, de-duplicated by PDDO, then sent over the WAN to the datacenter’s PureDisk server. We get the bandwidth and storage efficiencies of PureDisk while using standard NetBackup clients. A byproduct of this is that you get these PureDisk benefits without having to manage the backups in PureDisk’s separate management console. To reduce the effects of the WAN on the performance of the backup jobs themselves, and to make the majority of restores faster, we put some internal disk on the media server that the backup jobs write to first. After the backup job completes to the local disk, NetBackup duplicates the backup data to the PureDisk storage server, then duplicates another copy to the DR datacenter. This is all handled by NetBackup lifecycle policies which became about 1000X more powerful with the 6.5.4 release. I’ll discuss the power of lifecycle policies, specifically with the 6.5.4 release, when I talk about OST later.
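The idea of deduplicating on the near side of the WAN and sending only what the far side has not already stored can be shown with a toy Python sketch. Fixed-size chunks, SHA-256 fingerprints, and an in-memory dictionary standing in for the PureDisk server are all simplifying assumptions made for illustration; this is not how PDDO or PureDisk is actually implemented.

```python
import hashlib


def dedupe_and_send(data, server_store, chunk_size=4):
    """Split data into chunks and 'send' only chunks the server has not seen.

    Returns (recipe, bytes_sent): the ordered list of chunk fingerprints needed
    to rebuild the data, and the number of bytes that crossed the 'WAN'.
    """
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in server_store:
            server_store[fingerprint] = chunk   # only new chunks are transferred
            bytes_sent += len(chunk)
        recipe.append(fingerprint)
    return recipe, bytes_sent


def restore(recipe, server_store):
    """Rebuild the original data; every byte has to be sent back in full."""
    return b"".join(server_store[fingerprint] for fingerprint in recipe)


server = {}  # stands in for the datacenter's dedupe store

first_recipe, first_sent = dedupe_and_send(b"AAAABBBBAAAA", server)
second_recipe, second_sent = dedupe_and_send(b"AAAABBBBCCCC", server)

print(first_sent)    # 8  (the repeated AAAA chunk is sent once)
print(second_sent)   # 4  (only CCCC is new to the server)
print(restore(second_recipe, server))  # b'AAAABBBBCCCC'
```

The `restore` function also hints at why restores across the WAN are slower than backups: the rebuilt data travels at full size, with none of the savings the backup direction enjoyed.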
So the result of using PureDisk/PDDO/NetBackup together is a seamless solution, completely managed from within NetBackup, with all the client OS support the core NetBackup product has, the WAN efficiencies of source-dedupe, the storage efficiencies of target-dedupe, and the restore performance of local storage, but with very little storage in the remote site.
Remote Site Backup… Done!!
For the near future, I’m considering putting NetBackup media servers with PDDO on VMWare in all of the remote sites so I can manage all of the backups in NetBackup without buying any new hardware at all. This is not technically supported by Symantec but there is no tape/SCSI involved so it should work fine. Did I mention we wanted to avoid tape as much as possible?
Incidentally, despite my love for Avamar, I don’t believe they have anything like PDDO available in the NetWorker/Avamar integration and Avamar’s client OS support, while better than PureDisk’s, is still not quite as good as NetBackup and NetWorker.
Okay, so how does OST play into NetBackup, PureDisk, PDDO, and DataDomain? What do the lifecycle policies have to do with it? And what is so damned special about lifecycle policies in NetBackup 6.5.4? All that is next…
I support a very diverse environment with a mix of Windows, Netware, Linux, Solaris, and Mac clients running on standard servers as well as VMWare ESX, plus two different brands of NAS, a few iSeries systems, and an Apple XSAN thrown in for good measure. We have hundreds of applications running on these systems including SQL, Oracle, MySQL, Sharepoint, Documentum, and Agile. These applications are mostly contained in our primary datacenter but we also have a few remote datacenters for specific applications and for disaster recovery as well as a couple remote business offices.
Recently I’ve been working on a project to replace our existing backup application with a new one. We were experiencing extremely long backup windows, low throughput per client, and high backup failure rates with our existing solution and it was time to make a change of some kind. The goal was to protect all of our systems regardless of their location with both an onsite backup in our primary datacenter and an offsite copy for disaster recovery purposes. Additionally we wanted to use little or no tape. After research, lots of vendor meetings, a consulting engagement, and lengthy debate we chose Symantec NetBackup with Symantec NetBackup PureDisk and DataDomain. This combination was chosen for several reasons which will become clearer below.
For those of you who are not familiar with these products here’s a brief description.
Symantec NetBackup is a traditional backup solution that is designed to move data from many clients, as fast as possible, to disk or tape. It is similar to EMC NetWorker, Symantec BackupExec, and any number of other backup products. NetBackup supports a wide variety of clients, NAS devices, applications (SQL, Exchange, etc), as well as tape libraries and disk storage for the backed up data. Since it simply copies all of the data that resides on the client directly to the backup server it is not particularly tuned for backing up remote offices across the WAN but it can easily flood a local LAN during a backup.
Symantec NetBackup PureDisk is currently a separate solution from the base NetBackup product; it is designed specifically for backing up data over the WAN. PureDisk is a “source-dedupe” solution and is very similar in function to EMC’s Avamar product with which I have a long standing love affair. PureDisk performs an incremental-forever style of backup where only the data that changed since the last backup is copied to the backup server. It then uses deduplication technology to reduce the resulting backup dataset down to an even smaller size before it gets copied across the network. The data is collected and stored (in its deduplicated form) on the backup server. With this design PureDisk saves network bandwidth as well as disk space on the backup server making it ideal for backups across the WAN, VPN, etc. Symantec’s goal is to merge PureDisk into NetBackup as a single solution at some point probably next year. PureDisk backup servers can replicate backed up data to other PureDisk backup servers in de-duplicated form for redundancy across sites. The downside to PureDisk is that raw throughput on a PureDisk backup server is not high enough for datacenter use and client support is more limited than the standard NetBackup product.
DataDomain (now part of EMC) has been making its DDR products for a while now and has been very successful (prompting the recent bidding war between NetApp and EMC to purchase the company). DataDomain appliances are “target-dedupe” devices that are designed to replace tape libraries in traditional backup environments, like NetBackup. The DDR appliance presents itself as a VTL (virtual tape library) via SAN, a CIFS (Windows) file server, and/or an NFS (UNIX) file server, making it compatible with pretty much any type of backup system. DataDomain also supports Symantec’s OpenSTorage (OST) API which is available in NetBackup 6.5. The DDR system receives all of the data that NetBackup copies from backup clients, deduplicates the data in real-time, then stores it on its own internal disk. Because the DDR is purpose built and has fast processors it can process data at relatively high throughput rates. For example, a single DD690 model is rated at 2.7TB/hour (about 6gbps) when using OST. The deduplication in a DDR provides disk-space savings but does not reduce the amount of data copied from backup clients. DDRs can also replicate data (in deduplicated form) to other DDRs across the LAN or WAN, great for offsite backups.
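The “2.7TB/hour (about 6gbps)” conversion is easy to check in Python. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second), and the function name is just an illustrative helper.

```python
def tb_per_hour_to_gbps(tb_per_hour):
    """Convert a rate in decimal terabytes per hour to gigabits per second."""
    bytes_per_second = tb_per_hour * 1e12 / 3600
    return bytes_per_second * 8 / 1e9


rated = tb_per_hour_to_gbps(2.7)
print(f"{rated:.1f} Gbps")  # 6.0 Gbps
```

That rate is well beyond a single 1gbps link, which helps explain the choice of 10gbps connections for the DDR.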
For an explanation of de-duplication, check out my prior post on the topic.
Two of the challenges we faced when designing the final solution had to do with the cost per TB of DataDomain disk and the slightly limited client OS support of PureDisk. But we had a clean slate to work from–there was no interest in utilizing any of the existing backup infrastructure aside from the two IBM tape libraries we had. We were not required to use the libraries but we wouldn’t be buying new ones if we planned on using tape as part of the new solution.
For the primary datacenter we deployed NetBackup Master and Media servers, a DataDomain DD690, and connected them to each other with Cisco 4900M 10gbps switches. We deployed a warm-standby master server plus a media server and another DD690 in our DR datacenter but did not use 10gbps there due to the additional cost.
With this setup we covered all of the clients in our primary datacenter. Systems that have large amounts of data (like Microsoft Exchange, SAS Financials, etc) were connected directly to the 4900M switches (via 1gbps connections). Aggregate throughput of the backups during a typical night averages 400-500MB/sec with all of the data going to the DataDomain. The Exchange servers flood their network links, pushing over 100MB/sec per server when backing up the email databases. We currently back up 9TB of data per night with 3 media servers and a single DDR in about 5 hours. Our primary bottlenecks are with the VCB Proxy server (we need more of them) and the aging datacenter core network having an aggregate throughput of barely more than 1gbps.
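As a sanity check on those numbers, the Python sketch below works out the average throughput implied by 9TB in about 5 hours and the theoretical ceiling of a 1gbps link. It assumes decimal units and ignores protocol overhead, and the function names are illustrative helpers only.

```python
def required_mb_per_sec(total_tb, window_hours):
    """Average MB/sec needed to move total_tb (decimal TB) within window_hours."""
    return total_tb * 1e6 / (window_hours * 3600)


def link_ceiling_mb_per_sec(link_gbps):
    """Theoretical maximum MB/sec of a link, ignoring protocol overhead."""
    return link_gbps * 1000 / 8


aggregate = required_mb_per_sec(9, 5)
per_media_server = aggregate / 3          # three media servers share the load
gigabit_ceiling = link_ceiling_mb_per_sec(1)

print(f"Aggregate:        {aggregate:.0f} MB/sec")
print(f"Per media server: {per_media_server:.0f} MB/sec")
print(f"1gbps ceiling:    {gigabit_ceiling:.0f} MB/sec")
```

The result lines up with the observed 400-500MB/sec aggregate, and the 125MB/sec ceiling shows why a server pushing over 100MB/sec is effectively saturating its 1gbps connection.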
But what about those remote sites? What does OST really add? How do you tackle the NAS backups without resorting to tape? All that and more is coming up soon…