Today, EMC announced the new VNX and VNXe Unified Storage platforms, which merge the functionality of, and replace, EMC’s popular Clariion and Celerra products. VNX is faster, more scalable, more efficient, more flexible, and easier to manage than the platforms it replaces.
Key differences between CX4/NS and VNX:
- VNX replaces the 4Gb FC Arbitrated Loop back-end buses with a 6Gb SAS point-to-point switched back-end.
- Fast and Reliable
- VNX supports both 3.5” and 2.5” SAS drives in EFD (SSD), SAS, and NearLine-SAS varieties.
- Flexible and Efficient
- VNX has more cache, more front-end ports, and faster CPUs
- Fast and Flexible
- VNX systems can manage larger FASTCache configurations.
- Fast and Efficient
- VNX builds on the management simplicity enhancements started in EMC Unisphere on CX4/NS by adding application aware provisioning.
- Simple and Efficient
- VNX allows you to start with Block-only or NAS-only and upgrade to Unified later if desired, or start with Unified at deployment.
- Cost Effective and Flexible
- VNX will support advanced data services like deduplication in addition to FASTVP, FASTCache, Block QoS, Compression, and other features already available in Clariion and Celerra.
- Flexible and Efficient
Just as with every manufacturer, newer products take advantage of the latest technologies (faster Intel processors and SAS connectivity in this case), but that’s only part of the story with VNX.
Earlier, I mentioned Application Aware Provisioning has been added to Unisphere:
Prior to Application Aware Unisphere, if tasked with provisioning storage for Microsoft Exchange (for example), a storage admin would take the mailbox count and size requirements, use best practices and formulas from Microsoft for calculating required IOPS, and then map that data to the storage vendors’ best practices to determine the best disk layout (RAID Type, Size, Speed, quantity, etc). After all that was done, then the actual provisioning of RAID Groups and/or LUNs would be done.
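The manual sizing math described above can be sketched roughly like this. Every constant below (per-mailbox IOPS, read/write mix, RAID write penalty, per-disk IOPS) is an illustrative assumption for the sketch, not a published Microsoft or EMC figure:

```python
import math

def spindles_needed(mailboxes, iops_per_mailbox=0.5,
                    read_pct=0.6, raid_write_penalty=4,
                    disk_iops=180):
    """Rough spindle-count estimate for a mailbox workload (RAID 5 assumed)."""
    total_iops = mailboxes * iops_per_mailbox
    # Writes cost extra back-end IOs due to the RAID write penalty.
    backend_iops = (total_iops * read_pct
                    + total_iops * (1 - read_pct) * raid_write_penalty)
    return math.ceil(backend_iops / disk_iops)

print(spindles_needed(5000))  # disks needed for 5000 mailboxes under these assumptions
```

Application Aware Unisphere effectively does this kind of calculation (with the vendors’ real best-practice numbers) behind the wizard.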
Now with Application Aware Unisphere, the storage admin simply enters the mailbox count and size requirements into Unisphere and the rest is done automatically. EMC has embedded the best practices from Microsoft, VMWare, and EMC into Unisphere and created simple wizards for provisioning Hyper-V, VMWare, NAS, and Microsoft Exchange storage using those best practices.
Combine Unisphere’s Application Aware Provisioning with the already-included vCenter integration and support for VMWare VAAI, and you have a broad set of integration from the application layer down through to the storage system for optimum performance, simple and efficient provisioning, and unparalleled visibility. This is especially useful for small to medium sized businesses with small IT departments.
EMC has also simplified licensing of advanced features on VNX. Rather than licensing individual software products based on the exact features you want, VNX has 5 simple Feature Packs plus a few bundle packs. The packs are organized by overall purpose rather than by feature, e.g. Local Protection rather than Snapshots or Clones.
- FAST Suite includes FASTVP, FASTCache, Block QoS, and Unisphere Analyzer
- Security and Compliance Pack includes File Level Retention along with File and Block Encryption
- Local Protection Pack includes Snapshots for block and file, full copy clones, and RecoverPoint/CDP
- Remote Protection Pack includes Synchronous and Asynchronous replication for block and file as well as RecoverPoint/CRR for near-CDP remote replication of block and/or file data.
- Application Protection Pack extends the application integration by adding Replication Manager for application integrated replication and Data Protection Advisor for SLA based replication monitoring and reporting.
You can also get the Total Protection Pack, which includes the Local Protection, Remote Protection, and Application Protection packs at a discounted cost, or the Total Efficiency Pack, which includes all five. That’s it; there are no other software options for VNX/VNXe. Compression and Deduplication are included in the base unit, as is SANCopy. You will also find that the cost of these packs is extremely compelling once you talk with your EMC rep or favorite VAR.
So there you have it — powerful, simple and efficient storage, unified management, extensive data protection features, simplified licensing, and class-leading functionality (FASTVP, FASTCache, Integrated CDP, Quality of Service for Block, etc.) in a single platform. That’s Unified, That’s EMC VNX.
I didn’t have time to touch on VNXe here, but there is even more cool stuff going on there. You can read more about these products here.
I came across this press release today from a company that I wasn’t familiar with and immediately wanted more information. Cirtas Systems has announced support for Atmos-based clouds, including AT&T Synaptic Storage. Whenever I see these types of announcements, I read on in hopes of seeing real fiber channel block storage leveraging cloud-based architectures in some way. So far I’ve been a bit disappointed since the closest I’ve seen has been NAS based systems, at best including iSCSI.
Cirtas BlueJet Cloud Storage Controller is pretty interesting in its own right though. It’s essentially an iSCSI storage array with a cache and a small amount of SSD and SAS drives for local storage. Any data beyond the internal 5TB of usable capacity is stored in “the cloud” which can be an onsite Private Cloud (Atmos or Atmos/VE) and/or a Public Cloud hosted by Amazon S3, Iron Mountain, AT&T Synaptic, or any Atmos-based cloud service provider.
The neat thing with BlueJet is that it leverages a ton of the functionality that many storage vendors have been developing recently such as data de-duplication, compression, some kind of block level tiering, and space efficient snapshots to improve performance and reduce the costs of cloud storage. It seems that pretty much all of the local storage (SAS, SSD, and RAM) is used as a tiered cache for hot data. This gives users and applications the sense of local SAN performance even while hosting the majority of data offsite.
While I haven’t seen or used a BlueJet device and can’t make any observations about performance or functionality, I believe this sort of block->cloud approach has pretty significant customer value. It reduces physical datacenter costs for power and cooling, and it presents some rather interesting disaster recovery opportunities.
Similar to how Compellent’s signature feature, tiered block storage, has been added to more traditional storage arrays, I think modified implementations of Cirtas’ technology will inevitably come from the larger players, such as EMC, as a feature in standard storage arrays. If you consider that EMC Unified Storage and EMC Symmetrix VMAX both have large caches and block-level tiering today, it’s not too much of a stretch to integrate Atmos directly into those storage systems as another tier. EMC already does this for NAS with the EMC File Management Appliance.
I can imagine leveraging FASTCache and FASTVP to tier locally for the data that must be onsite for performance and/or compliance reasons and pushing cold/stale blocks off to the cloud. Additionally, adding cloud as a tier to traditional storage arrays allows customers to leverage their existing investment in Storage, FC/FCoE networks, reporting and performance trending tools, extensive replication options available, and the existing support for VMWare APIs like SRM and VAAI.
With this model, replication of data for disaster recovery/avoidance only needs to be done for the onsite data since the cloud data could be accessed from anywhere. At a DR site, a second storage system connects to the same cloud and can access the cold/stale data in the event of a disaster.
Another option would be adding this functionality to virtualization platforms like EMC VPLEX for active/active multi-site access to SAN data, while only needing to store the majority of the company’s data once in the cloud for lower cost. Customers would no longer have to buy double the required capacity to implement a disaster recovery strategy.
I’m eagerly awaiting the implementation of cloud into traditional block storage and I can see how some vendors will be able to do this easily, while others may not have the architecture to integrate as easily. It will be interesting to see how this plays out.
My recent post about Compression vs Dedupe, which was sparked by Vaughn’s blog post about NetApp’s new compression feature, got me thinking more about the use of de-duplication and compression at the same time. Can they work together? What is the resulting effect on storage space savings? What if we throw encryption of data into the mix as well?
What is Data De-Duplication?
De-duplication in the data storage context is a technology that finds duplicate patterns of data in chunks of blocks (sized from 4-128KB or so depending on implementation), stores each unique pattern only once, and uses reference pointers in order to reconstruct the original data when needed. The net effect is a reduction in the amount of physical disk space consumed.
What is Data Compression?
Compression finds very small patterns in data (down to just a couple bytes or even bits at a time in some cases) and replaces those patterns with representative patterns that consume fewer bytes than the original pattern. An extremely simple example would be replacing 1000 x “0”s with “0-1000”, reducing 1000 bytes to only 6.
Compression works on a more micro level, where de-duplication takes a slightly more macro view of the data.
What is Data Encryption?
In a very basic sense, encryption is a more advanced version of compression. Rather than compare the original data to itself, encryption uses an input (a key) to compute new patterns from the original patterns, making the data impossible to understand if it is read without the matching key.
Encryption and Compression break De-Duplication
One of the interesting things about compression and encryption algorithms is that their output no longer resembles the input: even if the source data has repeating patterns, the compressed and/or encrypted version of that data most likely does not. So if you are using a technology that looks for repeating patterns of bytes in fairly large chunks (4-128KB), such as data de-duplication, compression and encryption both reduce the space savings significantly, if not completely.
I see this problem a lot in backup environments with DataDomain customers. When a customer encrypts or compresses the backup data before it passes through the backup application and into the DataDomain appliance, the space savings drop and many times the customer becomes frustrated by what they perceive as a failing technology. A really common example is using Oracle RMAN or SQL LightSpeed to compress database dumps prior to backing up with a traditional backup product (such as NetWorker or NetBackup).
Sure, LightSpeed will compress the dump 95%, but every subsequent dump of the same database is unique data to a de-duplication engine and you will get little if any benefit from de-duplication. If you leave the dump uncompressed, the de-duplication engine will find common patterns across multiple dumps and will usually achieve higher overall savings. This gets even more important when you are trying to replicate backups over the WAN, since de-duplication also reduces replication traffic.
It all depends on the order
The truth is you CAN use de-duplication with compression, and even encryption. The key is the order in which the data is processed by each algorithm. Essentially, de-duplication must come first. After data is processed by de-duplication, there is enough data in the resulting 4-128KB blocks to be compressed, and the resulting compressed data can be encrypted. Just as with de-duplication, compression will have lackluster results on encrypted data, so encrypt last.
Original Data -> De-Dupe -> Compress -> Encrypt -> Store
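A toy demonstration of why this ordering matters. The “encryption” below is a stand-in keystream built from SHA-256 (for illustration only, NOT real cryptography); it serves only to show that both ciphertext and compressed output hide the repeating blocks a de-duplication engine needs:

```python
import hashlib
import zlib

def keystream_encrypt(data, key=b"secret"):
    # Toy stream cipher: XOR the data with a SHA-256-derived keystream.
    # For illustration only -- NOT real cryptography.
    stream, counter = bytearray(), 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

def unique_blocks(data, block_size=16):
    # What a de-dupe engine sees: how many distinct fixed-size blocks exist?
    return len({data[i:i + block_size] for i in range(0, len(data), block_size)})

data = b"0123456789ABCDEF" * 64   # 1KB: the same 16-byte block repeated 64 times

# De-duplicated first: a single unique block plus pointers.
assert unique_blocks(data) == 1

# Encrypted first: every ciphertext block is distinct -- nothing left to de-dupe.
print(unique_blocks(keystream_encrypt(data)))

# Compressed first: the output likewise has almost no repeating blocks to find.
print(unique_blocks(zlib.compress(data)))
```

Run de-duplication on the raw data and you keep one block; run the same data through the cipher or the compressor first and the repetition disappears before the de-dupe engine ever sees it.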
There are good examples of this already:
EMC DataDomain – After incoming data has been de-duplicated, the DataDomain appliance compresses the blocks using a standard algorithm. If you look at statistics on an average DDR appliance you’ll see 1.5-2X compression on top of the de-duplication savings. DataDomain also offers an encryption option that encrypts the filesystem and does not affect the de-duplication or compression ratios achieved.
EMC Celerra NAS – Celerra De-Duplication combines single instance store with file level compression. First, the Celerra hashes the files to find any duplicates, then removes the duplicates, replacing them with a pointer. Then the remaining files are compressed. If Celerra compressed the files first, the hash process would not be able to find duplicate files.
So what’s up with NetApp’s numbers?
Back to my earlier post on Dedupe vs. Compression: what is the deal with NetApp’s dedupe+compression numbers being mostly the same as compression alone? Well, I don’t know all of the details about the implementation of compression in ONTAP 8.0.1, but based on what I’ve been able to find, compression could be happening before de-duplication. This would easily explain the storage savings graph that Vaughn provided in his blog. Also, NetApp claims that ONTAP compression is inline, and we already know that ONTAP de-duplication is a post-process technology. This suggests that compression occurs during the initial writes, while de-duplication comes along after the fact looking for duplicate 4KB blocks. Maybe the de-duplication engine in ONTAP uncompresses the 4KB blocks before checking for duplicates, but that would seem to increase CPU overhead on the filer unnecessarily.
Encryption before or after de-duplication/compression – What about compliance?
I make a recommendation here to encrypt data last, i.e. after all data-reduction technologies have been applied. However, the caveat is that for some customers, with some data, this is simply not possible. If you must encrypt data end-to-end for compliance or business/national security reasons, then by all means, do it. The unfortunate byproduct of that requirement is that you may get very little space savings on that data from de-duplication, both in primary storage and in a backup environment. This also affects WAN bandwidth when replicating, since encrypted data is difficult to compress and accelerate as well.
The more I talk with customers, the more I find that the technical details of how something works are much less important than the business outcome it achieves. When it comes to storage, most customers just want a device that will provide the capacity and performance they need, at a price they can afford, and it better not be too complicated. Pretty much any vendor trying to sell something will attempt to make their solution fit your needs even if they really don’t have the right products. It’s a fact of life: sell what you have. Along these lines, there has been a lot of back and forth between vendors about dedup vs. compression technology and which one solves customer problems best.
After snapshots and thin provisioning, data reduction technology in storage arrays has become a big focus in storage efficiency lately; and there are two primary methods of data reduction — compression and deduplication.
While EMC has been marketing compression technology for block and file data in Celerra, Unified, and Clariion storage systems, NetApp has been marketing deduplication as the technology of choice for block and file storage savings. But which one is the best choice? The short answer is: it depends. Some data types benefit most from deduplication while others get better savings with compression.
Currently, EMC supports file compression on all EMC Celerra NS20, 40, 80, 120, 480, 960, VG2, and VG8 systems running DART 5.6.47.x+ and block compression on all CX4 based arrays running FLARE30.x+. In all cases, compression is enabled on a volume/LUN level with a simple check box and processing can be paused, resumed, and disabled completely, uncompressing the data if desired. Data is compressed out-of-band and has no impact on writes, with minimal overhead on reads. Any or all LUN(s) and/or Filesystem(s) can be compressed if desired even if they existed prior to upgrading the array to newer code levels.
With the release of OnTap 8.0.1, NetApp has added support for in-line compression within their FAS arrays. It is enabled per-FlexVol and as far as I have been able to determine, cannot be disabled later (I’m sure Vaughn or another NetApp representative will correct me if I’m wrong here.) Compression requires 64-bit aggregates which are new in OnTap 8, so FlexVols that existed prior to an upgrade to 8.x cannot be compressed without a data migration which could be disruptive. Since compression is inline, it creates overhead in the FAS controller and could impact performance of reads and writes to the data.
Vaughn Stewart, of NetApp, expertly blogged today about the new compression feature, including some of the caveats involved, and to me the most interesting part of the post was the following graphic he included showing the space savings of compression vs. dedup for various data types.
Yesterday, in his blog post entitled “Myth Busting: Storage Guarantees“, Vaughn Stewart from NetApp blogged about the EMC 20% Guarantee and posted a chart of storage efficiency features from EMC and NetApp platforms to illustrate his point. Chuck Hollis from EMC called it “chartsmithing” in a comment but didn’t elaborate specifically on the chart’s deficiencies. Well, allow me to take that ball…
As presented, Vaughn’s chart (below) is technically factual (with one exception which I’ll note), but it plays on the human emotion of Good vs Bad (Green vs Red) by attempting to show more Red on EMC products than there should be.
The first and biggest problem is the chart compares EMC Symmetrix and EMC Clariion dedicated-block storage arrays with NetApp FAS, EMC Celerra, and NetApp vSeries which are all Unified storage systems or gateways. Rather than put n/a or leave the field blank for NAS features on the block-only arrays, the chart shows a resounding and red NO, leading the reader to assume that the feature should be there but somehow EMC left it out.
As far as keeping things factual, some of the EMC and NetApp features in this chart are not necessarily shipping today (very soon though, and since it affects both vendors I’ll allow it here). And I must make a correction with respect to EMC Symmetrix and Space Reclamation, which IS available on Symm today.
I’ve taken the liberty of massaging Vaughn’s chart to provide a more balanced view of the feature comparison. I’ve also added EMC Celerra gateway on Symmetrix to the comparison as well as an additional data point which I felt was important to include.
1.) I removed the block only EMC configuration devices because the NetApp devices in the comparison are Unified systems.
2.) I removed the SAN data row for Single Instance storage because Single Instance (identical file) data reduction technology is inherently NAS related.
3.) Zero Space Reclamation is a feature available in Symmetrix storage. In Clariion, the Compression feature can provide a similar result since zero pages are compressible.
I left the 3 different data reduction techniques as individually listed even though the goal of all of them is to save disk space. Depending on the data types, each method has strengths and weaknesses.
One question, if a bug in OnTap causes a vSeries to lose access to the disk on a Symmetrix during an online Enginuity upgrade, who do you call? How would you know ahead of time if EMC hasn’t validated vSeries on Symmetrix like EMC does with many other operating systems/hosts/applications in eLab?
The goal of my post here really is to show how the same data can be presented in different ways to give readers a different impression. I won’t get into too much as far as technical differences between the products, like how comparing FAS to Symmetrix is like comparing a box truck to a freight train, or how fronting an N+1 loosely coupled, clustered, global-cached, high-end storage array with a midrange dual-controller gateway for block data might not be in a customer’s best interest.
What do you think?
This past week, during EMC World 2010 in Boston, EMC made several announcements of updates to the Celerra and CLARiiON midrange platforms. Some of the most impressive were new capabilities coming to CLARiiON FLARE in just a couple short months. Major updates to Celerra DART will coincide with the FLARE updates and if you are already running CLARiiON CX4 hardware, or are evaluating CX4 (or Celerra), you will want to check these new features out. They will be available to existing CX4(120,240,480,960)/NS(120,480,960) systems as part of a software update.
Here’s a list of key changes in FLARE 30:
- Unified management for midrange storage platforms including CLARiiON and Celerra today, plus RecoverPoint, Replication Manager and more in the future. This is a true single pane of glass for monitoring AND managing SAN, NAS, and data protection and it’s built in to the platform. “EMC Unisphere” replaces Navisphere Manager and Celerra Manager and supports multiple storage systems simultaneously in a single window. (Video Demo)
- Extremely large cache (ie: FASTCache) – Up to 2TB of additional read/write cache in CLARiiON using SSDs (Video Demo)
- Block level Fully Automated Storage Tiering (ie: sub-LUN FAST) – Fully automated assignment of data across multiple disk types
- Block Level Compression – Compress LUNs in the CLARiiON to reduce disk space requirements
- VAAI Support – Integrate with vSphere ESX for improved performance
These features are in addition to existing features like:
- Seamless and non-disruptive mobility of LUNs within a storage array – (via Virtual LUNs)
- Non-Disruptive Data Migration – (via PowerPath Migration Enabler)
- VMWare Aware Storage Management – (Navisphere, Unisphere, and vSphere plugins giving complete visibility and self-service provisioning for VMWare admins (Video Demo) AND Storage Admins)
- CIFS and NFS Compression – Compress production data on Celerra to reduce disk space requirements including VMs
- Dynamic SAN path load balancing – (via PowerPath)
- At-Rest-Encryption – (via PowerPath w/RSA)
- SSD, FC, and SATA drives in the same system – Balance performance and capacity as needed for your application
- Local and Remote replication with array level consistency – (SnapView, MirrorView, etc)
- Hot-swap, Hot-Add, Hot-Upgrade IO Modules – Upgrade connectivity for FC, FCoE, and iSCSI with no downtime
- Scale to 1.8PB of storage in a single system
- Simultaneously provide FC, iSCSI, MPFS, NFS, and CIFS access
All together, this is an impressive list of features for a single platform. In fact, while many of EMC’s competitors have similar features, none of them have all of them in the same platform, or leverage them all simultaneously to gain efficiency. When CLARiiON CX4 and Celerra NS are integrated and managed as a single Unified storage system with EMC Unisphere there is tremendous value as I’ll point out below…
Improve Performance easily…
- Install a couple SSD drives into a CLARiiON and enable FASTCache to increase the array’s read/write cache from the industry-competitive 4GB-32GB up to 2TB of array-based non-volatile read AND write cache available to ALL applications, including NAS data hosted by the array.
- Install PowerPath on Windows, Linux, Solaris, AND VMWare ESX hosts to automatically balance IO across all available paths to storage. PowerPath detects latency and queuing occurring on each path and adjusts automatically, improving performance at the storage array AND for your hosts. This is a huge benefit in VMWare environments especially.
- When VMWare releases the updated version of vSphere ESX that supports VAAI, ESX will be able to leverage VAAI support in the CLARiiON to reduce the amount of IO required to do many tasks, improving performance across the environment again.
- Upgrade from 1GbE iSCSI to 10GbE iSCSI, or from 4Gb Fibre Channel to 8Gb Fibre Channel, without a screwdriver or downtime.
- Provide NAS shared file access with block-level performance for any application using EMC’s MPFS protocol.
Improve Efficiency and cost easily…
- Create a single pool of storage containing some SSD, some FC, and some SATA drives, that automatically monitors and moves portions of data to the appropriate disk type to both improve performance AND decrease cost simultaneously.
- Non-disruptively compress volumes and/or files with a single click to save 50% of your disk space in many cases.
- Convert traditional LUNs to more efficient Thin-LUNs non-disruptively using PowerPath Migration Enabler, saving more disk space.
Increase and Manage Capacity easily…
- Add additional storage non-disruptively with SSD, FC, and SATA drives in any mix up to 1.8PB of raw storage in a single CLARiiON CX4.
- Using FASTCache, iSCSI, FC, and FCoE connectivity simultaneously does not reduce total capacity of the system.
- Expanding LUNs, RAID Groups, and Storage Pools is non-disruptive.
- Migrating LUNs between RAID groups and/or Storage Pools is non-disruptive using built-in CLARiiON LUN Migration, as is migrating data to a different storage array (using PowerPath Migration Enabler)!
- Balancing workload between storage processors is non-disruptive and at individual LUN granularity.
Protect your data easily…
- Snapshot, Clone, and Replicate any of the data to anywhere with built in array tools that can maintain complete data consistency across a single, or multiple applications without installing software.
- Maintain application consistency for Exchange, SQL, Oracle, SAP, and much more, even within VMWare VMs, while replicating to anywhere with a single pane-of-glass.
- Encrypt sensitive data seamlessly using PowerPath Encryption w/RSA.
- While you can do all of these things quickly and simply, you still have the flexibility to create traditional RAID sets using RAID 0, 1, 5, 6, and 10 where you need highly predictable performance, or tune read and write cache at the array and LUN level for specific workloads. Do you want read/write snapshots? How about full copy clones on completely separate disks for workload isolation and failure protection? What about the ability to rollback data to different points in time using snapshots without deleting any other snapshots? EMC Storage arrays have been able to do this for a long time and that hasn’t changed.
There are few manufacturers aside from EMC that can provide all of these capabilities, let alone provide them within a single platform. That’s the definition of simple, efficient, Unified Storage in my opinion.
In a previous post I discussed the new backup environment I’ve been deploying, what solutions we picked, and how they apply to the datacenter. But I also mentioned that we had remote sites with systems we need to back up, and I didn’t explain how we addressed them. Frankly, the previous post was getting long, and backing up remote offices is tricky, so it deserved its own discussion.
Now that we had Symantec NetBackup running in the datacenter, backing up the bulk of our systems to disk by way of DataDomain, we needed to look at remote sites. For this we deployed Symantec NetBackup PureDisk. Despite the fact that it has NetBackup in the name, PureDisk is an entirely different product with its own servers, clients, and management interfaces. There are some integration points that are not obvious at first but become important later. Essentially PureDisk is two solutions in a single product — 1) a “source-dedupe” backup solution that can be deployed independent of any other solution, and 2) a “target-dedupe” backup storage appliance specifically integrated with the core NetBackup product via an option called PDDO.
As previously discussed, backing up a remote site across a WAN is best accomplished with a source-dedupe solution like PureDisk or Avamar. This is exactly what we intended to do. Most of our remote site clients are some flavor of UNIX or Windows, and installing PureDisk clients was easily accomplished. Backup policies were created in PureDisk and a little over a day later we had the first full backup complete. All subsequent nightly backups transfer very small amounts of data across the WAN because they are incremental backups AND because the PureDisk client deduplicates the data before sending it to the PureDisk server. The downside to this is that the PureDisk jobs have to be scheduled, managed, and monitored from the PureDisk interface, completely separate from the NetBackup administration console. Backups are sent to the primary datacenter and stored on the local PureDisk server, then the backed up data is replicated to the PureDisk server in the DR datacenter using PureDisk native replication. Restores can be run from either of the PureDisk servers but must un-deduplicate the data before sending it across the WAN, making restores much slower than backups. This was a known issue and still meets our SLAs for these systems.
Our biggest hurdle with PureDisk was client OS support. Since we have a very diverse environment, we ran into a couple of clients with operating systems that PureDisk does not support. Both NetWare and x86 versions of Solaris are currently unsupported, and both were running in our remote sites.
We had a few options:
1.) Use the standard NetBackup client at the remote site and push all of the data across the WAN
2.) Deploy a NetBackup media server in the remote site with a tape library and send the tapes offsite
3.) Deploy a NetBackup media server in the remote site with a small DataDomain appliance and replicate
4.) Deploy a NetBackup media server and ALSO use PureDisk via the PDDO option (PureDisk Deduplication Option)
Option 1 is not feasible for any serious amount of data, Option 2 requires a costly tape library and some level of media handling every day, and Option 3 just plain costs too much money for a small remote site.
Option 4, using PDDO, leverages PureDisk’s “target-dedupe” persona and ends up being a very elegant solution with several benefits.
PDDO is a plug-in that installs on a Netbackup media server. The PDDO plug-in deduplicates data that is being backed up by that media server and sends it across the network to a PureDisk server for storage. The beauty of this option is that we were able to put a Netbackup media server in our remote site without any tape or other storage. The data is copied from the client to the media server over the LAN, de-duplicated by PDDO, then sent over the WAN to the datacenter’s PureDisk server. We get the bandwidth and storage efficiencies of PureDisk while using standard NetBackup clients. A byproduct of this is that you get these PureDisk benefits without having to manage the backups in PureDisk’s separate management console. To reduce the effects of the WAN on the performance of the backup jobs themselves, and to make the majority of restores faster, we put some internal disk on the media server that the backup jobs write to first. After the backup job completes to the local disk, NetBackup duplicates the backup data to the PureDisk storage server, then duplicates another copy to the DR datacenter. This is all handled by NetBackup lifecycle policies which became about 1000X more powerful with the 6.5.4 release. I’ll discuss the power of lifecycle policies, specifically with the 6.5.4 release, when I talk about OST later.
So the result of using PureDisk/PDDO/NetBackup together is a seamless solution, completely managed from within NetBackup, with all the client OS support the core NetBackup product has, the WAN efficiencies of source-dedupe, the storage efficiencies of target-dedupe, and the restore performance of local storage, but with very little storage in the remote site.
Remote Site Backup… Done!!
For the near future, I'm considering putting NetBackup media servers with PDDO on VMWare in all of the remote sites so I can manage all of the backups in NetBackup without buying any new hardware at all. This is not technically supported by Symantec, but since there is no tape or SCSI pass-through involved, it should work fine. Did I mention we wanted to avoid tape as much as possible?
Incidentally, despite my love for Avamar, I don't believe EMC offers anything like PDDO in the NetWorker/Avamar integration, and Avamar's client OS support, while better than PureDisk's, is still not quite as good as NetBackup's or NetWorker's.
Okay, so how does OST play into NetBackup, PureDisk, PDDO, and DataDomain? What do the lifecycle policies have to do with it? And what is so damned special about lifecycle policies in NetBackup 6.5.4? All that is next…
This is the 3rd part of a multi-part discussion on capacity vs performance in SAN environments. My previous post discussed the use of thin provisioning to increase storage utilization. Today we are going to focus on a newer technology called Data De-Duplication.
Data De-Duplication can be likened to an advanced form of compression. It is a way to store large amounts of data with the least amount of physical disk possible.
De-duplication technology was originally targeted at lowering the cost of disk-based backup. DataDomain (recently acquired by EMC Corp) was a pioneer in this space. Each vendor has its own implementation of de-duplication technology, but they are generally similar: they take raw data, look for similarities in relatively small chunks of that data, and remove the duplicates. The diagram below is the simplest one I could find on the web. You can see that where there were multiple C, D, B, etc. blocks in the original data, the final “de-duplicated” data has only one of each. The system then stores metadata (essentially pointers) to track what the original data looked like for later reconstruction.
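The chunk-and-pointer idea above is easy to sketch in a few lines of Python. This is a toy illustration of fixed-size block-level de-duplication (real products use variable-size chunking and far more sophisticated indexes); the function names are my own, not any vendor's API:

```python
import hashlib

def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns a chunk store (hash -> chunk) and a 'recipe' (ordered list of
    hashes) -- the metadata/pointers used to rebuild the original stream.
    """
    store = {}
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep only the first copy of a chunk
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe):
    """Reconstruct the original data from the stored chunks and the recipe."""
    return b"".join(store[h] for h in recipe)

# Highly repetitive data dedupes down to a handful of unique chunks.
data = (b"A" * 4096 + b"B" * 4096) * 10
store, recipe = dedupe(data)
assert rebuild(store, recipe) == data
print(len(recipe), "logical chunks,", len(store), "unique chunks stored")
# → 20 logical chunks, 2 unique chunks stored
```

The recipe is what makes reconstruction possible later: the system never needs the duplicates back, only the pointers and the unique blocks.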
The first and most widely used implementations of de-dupe technology were in the backup space, where much of the same data is backed up during each backup cycle and many days of history (retention) must be maintained. Reduction ratios from de-duplication alone can easily exceed 10:1 in backup systems. The neat thing here is that when the de-duplication technology works at the block level (rather than the file level), duplicate data is found across completely disparate data sets. There is commonality between Exchange email, Microsoft Word documents, and SQL data, for example. In my own backup environment, backing up 5.7TB of data to disk in a 24-hour period yields a de-dupe ratio of 19.2X, plus an additional 1.7X of standard compression on the post-de-duped data, consuming only 173.9GB of physical disk space. The entire set of backed-up data, totaling 106TB currently stored on disk, consumes only 7.5TB of physical disk. The benefits are pretty obvious, as you can see how we can store large amounts of data in much less physical disk space.
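As a quick sanity check on those numbers (the figures are the ones quoted above, taken as given), the total reduction is simply the product of the de-dupe and compression ratios:

```python
# Figures from the environment described above.
daily_logical_tb = 5.7   # data backed up in a 24-hour window
dedupe_ratio = 19.2      # de-duplication alone
compression_ratio = 1.7  # standard compression applied after de-dupe

combined = dedupe_ratio * compression_ratio       # total reduction factor
physical_gb = daily_logical_tb * 1024 / combined  # physical disk consumed

print(f"combined reduction: {combined:.1f}X")      # combined reduction: 32.6X
print(f"physical disk used: {physical_gb:.1f} GB") # ~178.8 GB, near the observed 173.9GB
print(f"overall on-disk ratio: {106 / 7.5:.1f}X")  # overall on-disk ratio: 14.1X
```

The small gap between the computed ~178.8GB and the observed 173.9GB just reflects that the quoted ratios are rounded.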
There are numerous de-duplication systems available for backup applications — DataDomain, Quantum DXi, EMC DL3D, NetApp VTL, IBM Diligent, and several others. Most of these are “target-based” de-duplication systems because they do all the work at the storage layer with the primary benefit being better use of disk space. They also integrate easily into most traditional backup environments. There are also “source-based” de-duplication systems — EMC Avamar and Symantec PureDisk are two primary examples. These systems actually replace your existing backup application entirely and perform their work on the client machine that is being backed up. They save disk space just like the other systems but also reduce bandwidth usage during the backup which is extremely useful when trying to get backups of data across a slow network connection like a WAN.
So now you know why de-duplication is good and how it helps in a backup environment. But what about using it for primary storage in NAS/SAN environments? It turns out several vendors are playing in that space as well. NetApp was the first major vendor to add de-duplication to primary storage with its A-SIS (Advanced Single Instance Storage) feature. EMC followed with its own implementation of de-duplication on Celerra NAS. The two are entirely different in their implementations but attempt to address the same problem of ballooning storage requirements.
EMC Celerra de-dupe performs file-level single-instancing to eliminate duplicate files in a filesystem, and then uses a proprietary compression engine to reduce the size of the files themselves. Celerra does not deal with portions of files. In practice, this feature can significantly reduce the storage requirements for a NAS volume. In a test I performed recently for storing large amounts of syslog data, Celerra de-dupe easily saved 90% of the disk space consumed by the logs, and it hadn't even touched all of the files yet.
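To see how file-level single-instancing differs from the block-level approach, here is a toy sketch in the spirit of what's described above (my own illustration, not EMC's engine): duplicate files collapse to one stored copy, each unique file is compressed whole, and partial-file duplication goes undetected.

```python
import hashlib
import zlib

def single_instance_store(files):
    """File-level single-instancing plus whole-file compression.

    `files` maps path -> bytes. Returns the stored (compressed) unique
    files keyed by content hash, and a catalog mapping each path to its hash.
    """
    stored = {}   # content hash -> compressed bytes
    catalog = {}  # path -> content hash
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in stored:
            stored[digest] = zlib.compress(content)  # compress unique files only
        catalog[path] = digest
    return stored, catalog

# Repetitive syslog-style data: one exact duplicate file, one distinct file.
files = {
    "a.log": b"timestamp host message\n" * 1000,
    "b.log": b"timestamp host message\n" * 1000,  # exact duplicate of a.log
    "c.log": b"timestamp host other\n" * 1000,
}
stored, catalog = single_instance_store(files)
print(len(stored), "unique files stored")  # → 2 unique files stored
```

Repetitive log data like this compresses extremely well, which is consistent with the large savings seen in the syslog test, even before any files were duplicates of each other.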
NetApp's A-SIS works at a 4KB block size and compares all data within a filesystem regardless of data type. Besides NAS shares, A-SIS also works on block volumes (i.e., Fibre Channel and iSCSI LUNs), where EMC's implementation does not. Celerra has an advantage when working with files that contain high amounts of duplication at very small block sizes (like 50 bytes), since NetApp only looks at 4KB chunks. Celerra's use of a more traditional compression engine saves more space in the syslog scenario, but NetApp's block-level approach could save more space than Celerra's when dealing with lots of large files.
The ability to work on traditional LUNs presents some interesting opportunities, especially in a VMWare/Hyper-V environment. As I mentioned in my previous post, virtual environments have lots of redundant data, since many systems potentially running the same operating system share the disk subsystem. If you put 10 Windows virtual machines on the same LUN, de-duplication will likely save you tons of disk space on that LUN. There are limitations that prevent the full benefits from being realized, however. VMWare best practices require you to limit the number of virtual machine disks sharing the same SAN LUN for performance reasons (VMFS locking issues), and A-SIS can only de-dupe data within a LUN, not across multiple LUNs. So in a large environment your savings are limited. NetApp's recommendation is to use NFS volumes for VMWare instead of FC or iSCSI LUNs, because you can eliminate the VMFS locking issue and place many VMs on a single very large NFS volume, which can then be de-duplicated. Unfortunately, there are limits on certain VMWare features when using NFS, so this may not be an option for some applications or environments. Specifically, VMWare Site Recovery Manager, which coordinates site-to-site replication and failover of entire VMWare environments, does not support NFS as of this writing.
When it comes to de-duplication's impact on performance, the story is all over the map. In backup applications, most target-based systems either perform the work in memory as the data comes in or run it as a post-process job after the day's backups have completed. In either case, the hardware is designed for high throughput and performance is not really a problem. For primary data, both EMC's and NetApp's implementations are post-process and do not generally impact write performance. However, EMC limits the size of files that can be de-duplicated, because beyond that size a modification to a compressed file causes a significant delay. Since it also limits de-duplication to files that have not been accessed or modified for some period of time, the problem is minimal in most environments. NetApp claims little performance impact on either reads or writes with A-SIS. This has much to do with the architecture of the NetApp WAFL filesystem and how A-SIS interacts with it, but it would take an entirely new post to describe how that all works. Suffice it to say that NetApp A-SIS is useful in more situations than EMC's Celerra de-duplication.
Where I do see potential performance problems, regardless of vendor, is in the same situation as thin provisioning. If your application requires 1000 IOPS but you've only got 2 disks in the array because of the space savings from thin provisioning and/or de-duplication, application performance will suffer. You still need to service the IOPS, and each disk can deliver only a finite number of them (generally 100-200 for FC/SCSI). Flash/SSD changes the situation dramatically, however.
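To make that concrete, here's a back-of-the-envelope spindle-count check. The per-disk IOPS figures are the rough rules of thumb quoted above, not vendor specifications, and this ignores RAID write penalties and cache:

```python
import math

def disks_needed(required_iops: int, iops_per_disk: int) -> int:
    """Minimum spindle count to service a workload's IOPS requirement."""
    return math.ceil(required_iops / iops_per_disk)

# A 1000-IOPS workload on FC disks at ~180 IOPS each needs 6 spindles,
# even if capacity math after dedupe/thin provisioning says 2 disks is enough.
print(disks_needed(1000, 180))  # → 6
print(2 * 180)                  # → 360 (all that 2 disks can deliver)
```

In other words, capacity efficiency and performance sizing are separate calculations, and the larger of the two spindle counts wins.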
Right now I believe that de-duplication is extremely useful for backups but not quite ready for prime time when it comes to primary storage. There are just too many caveats to make any general recommendations. If you happen to purchase an EMC Celerra or a NetApp FAS/IBM nSeries that supports de-duplication, make sure to read all of the vendor's best-practices documentation, decide whether your environment can use de-duplication effectively, and then experiment with it in a lab or dev/test environment. It could save you tons of disk space and money, or it could be more trouble than it's worth. The good thing is that it's pretty much a free option from both EMC and NetApp, depending on the hardware you own/purchase and your maintenance agreements.