You are currently browsing the tag archive for the ‘celerra’ tag.

Yesterday, In his blog posted entitled “Myth Busting: Storage Guarantees“, Vaughn Stewart from NetApp blogged about the EMC 20% Guarantee and posted a chart of storage efficiency features from EMC and NetApp platforms to illustrate his point.  Chuck Hollis from EMC called it “chartsmithing” in comment but didn’t elaborate specifically on the charts deficiencies.  Well allow me to take that ball…

As presented, Vaughn’s chart (below) is technically factual (with one exception which I’ll note), but it plays on the human emotion of Good vs Bad (Green vs Red) by attempting to show more Red on EMC products than there should be.

The first and biggest problem is the chart compares EMC Symmetrix and EMC Clariion dedicated-block storage arrays with NetApp FAS, EMC Celerra, and NetApp vSeries which are all Unified storage systems or gateways.  Rather than put n/a or leave the field blank for NAS features on the block-only arrays, the chart shows a resounding and red NO, leading the reader to assume that the feature should be there but somehow EMC left it out.

As far as keeping things factual, some of the EMC and NetApp features in this chart are not necessarily shipping today (very soon though, and since it affects both vendors I’ll allow it here).  And I must make a correction with respect to EMC Symmetrix and Space Reclamation, which IS available on Symm today.

I’ve taken the liberty of massaging Vaughn’s chart to provide a more balanced view of the feature comparison.  I’ve also added EMC Celerra gateway on Symmetrix to the comparison as well as an additional data point which I felt was important to include.

I’ve included some footnotes in the chart to explain some of the results but I’ll explain a little here as well.

1.) I removed the block only EMC configuration devices because the NetApp devices in the comparison are Unified systems.

2.) I removed the SAN data row for Single Instance storage because Single Instance (identical file) data reduction technology is inherently NAS related.

3.) Zero Space Reclamation is a feature available in Symmetrix storage.  In Clariion, the Compression feature can provide a similar result since zero pages are compressible.

I left the 3 different data reduction techniques as individually listed even though the goal of all of them is to save disk space.  Depending on the data types, each method has strengths and weaknesses.

One question, if a bug in OnTap causes a vSeries to lose access to the disk on a Symmetrix during an online Enginuity upgrade, who do you call?  How would you know ahead of time if EMC hasn’t validated vSeries on Symmetrix like EMC does with many other operating systems/hosts/applications in eLab?

The goal if my post here really is to show how the same data can be presented in different ways to give readers a different impression.  I won’t get into too much as far as technical differences between the products, like how comparing FAS to Symmetrix is like comparing a box truck to a freight train, or how fronting an N+1 loosely coupled clustered, global cached, high-end storage array with a midrange dual-controller gateway for block data might not be in a customer’s best interest.

What do you think?

I started this post before I started working for EMC and got sidetracked with other topics.  Recent discussions I’ve had with people have got me thinking more about orchestration of data protection, replication, and disaster recovery, so it was time to finish this one up…

———————————–

Prior to me coming to work for EMC, I was working on a project to leverage NetApp and EMC storage simultaneously for redundancy.  I had a chance to put various tools from EMC and NetApp into production and have been able to make some observations with respect to some of the differences.  This is a follow up my previous NetApp and EMC posts…

NetApp and EMC: Real world comparisons
NetApp and EMC: Startup and First Impressions
NetApp and EMC: ESX and Exchange 2007 CCR
NetApp and EMC: Exchange 2007 Replication

Specifically this post is a comparison between NetApp SnapManager 5.x and EMC Replication Manager 5.x.  First, here’s a quick background on both tools based on my personal experience using them.

Description

EMC Replication Manager (RM) is a single application that runs on a dedicated “Replication Manager Server.”  RM agents are deployed to the hosts of applications that will be replicated.  RM supports local and remote replication features in EMC’s Clariion storage array, Celerra Unified NAS, Symmetrix DMX/V-Max, Invista, and RecoverPoint products.  With a single interface, Replication Manager lets you schedule, modify, and monitor snapshot, clone, and replication jobs for Exchange, SQL, Oracle, Sharepoint, VMWare, Hyper-V, etc.  RM supports Role-Based authentication so application owners can have access to jobs for their own applications for monitoring and managing replication.  RM can manage jobs across all of the supported applications, array types, and replication technologies simultaneously.  RM is licensed by storage array type and host count. No specific license is required to support the various applications.

NetApp SnapManager is actually a series of applications designed for each application that NetApp supports.  There are versions of SnapManager for Exchange, SQL, Sharepoint, SAP, Oracle, VMWare, and Hyper-V.  The SnapManager application is installed on each host of an application that will be replicated, and jobs are scheduled on each specific host using Windows Task Scheduler.  Each version of SnapManager is licensed by application and host count.  I believe you can also license SnapManager per-array instead of per-host which could make financial sense if you have lots of hosts.

Commonality

EMC Replication Manager and NetApp SnapManager products tackle the same customer problem–provide guaranteed recoverability of an application, in the primary or a secondary datacenter, using array-based replication technologies.  Both products leverage array-based snapshot and replication technology while layering application-consistency intelligence to perform their duties.  In general, they automate local and remote protection of data.  Both applications have extensive CLI support for those that want that.

Differences

  • Deployment
    • EMC RM – Replication Manager is a client-server application installed on a control server.  Agents are deployed to the protected servers.
    • NetApp SM – SnapManager is several applications that are installed directly on the servers that host applications being protected.
  • Job Management
    • EMC RM – All job creation, management, and monitoring is done from the central GUI. Replication Manager has a Java based GUI.
    • NetApp SM – Job creation and monitoring is done via the SnapManager GUI on the server being protected.  SnapManager utilizes an MMC based GUI.
  • Job Scheduling
    • EMC RM – Replication Manager has a central scheduler built-in to the product that runs on the RM Server.  Jobs are initiated and controlled by the RM Server, the agent on the protected server performs necessary tasks as required.
    • NetApp SM – SnapManager jobs are scheduled with Windows Task Scheduler after creation.  The SnapManager GUI creates the initial scheduled task when a job is created through the wizard.  Modifications are made by editing the scheduled task in Windows task scheduler.

So while the tools essentially perform the same function, you can see that there are clear architectural differences, and that’s where the rubber meets the road.  Being a centrally managed client-server application, EMC Replication Manager has advantages for many customers.

Simple Comparison Example: Exchange 2007 CCR cluster
(snapshot and replicate one of the two copies of Exchange data)

With NetApp SnapManager, the application is installed on both cluster nodes, then an administrator must log on to the console on the node that hosts the copy you want to replicate, and create two jobs which run on the same schedule.  Job A is configured to run when the node is the active node, Job B is configured to run when the node is passive.  Due to some of the differences in the settings, I was unable to configure a single job that ran successfully regardless of whether the node was active or passive.  If you want to modify the settings, you either have to edit the command line options in the Scheduled Task, or create a new job from scratch and delete the old one.

With EMC Replication Manager, you deploy the agent to both cluster nodes, then in the RM GUI, create a job against the cluster virtual name, not the individual node.  You define which server you want the job to run on in the cluster, and whether the job should run when the node is passive, active, or both.  All logs, monitoring, and scheduling is done in the same RM GUI, even if you have 50 Exchange clusters, or SQL and Oracle for that matter.  Modifying the job is done by right-clicking on the job and editing the properties.  Modifying the schedule is done in the same way.

So as the number of servers and clusters increases in your environment, having a central UI to manage and monitor all jobs across the enterprise really helps.  But here’s where having a centrally managed application really shines…

But what if it gets complicated?

Let’s say you have a multi-tier application like IBM FileNet, EMC Documentum, or OpenText and you need to replicate multiple servers, multiple databases, and multiple file systems that are all related to that single application.  Not only does EMC Replication Manager support SQL and Filesystems in the same GUI, you can tie the jobs together and make them dependent on each other for both failure reporting and scheduling.  For example, you can snapshot a database and a filesystem, then replicate both of them without worrying about how long the first job takes to complete.  Jobs can start other jobs on completely independent systems as necessary.

Without this job dependence functionality, you’d generally have to create scheduled tasks on each server and have dependent jobs start with a delay that is long enough to allow the first job to complete while as short as possible to prevent the two parts of the application from getting too far out of sync.  Some times the first job takes longer than usual causing subsequent jobs to complete incorrectly.  This is where Replication Manager shows it’s muscle with it’s ability to orchestrate complex data protection strategies, across the entire enterprise, with your choice of protection technologies (CDP, Snapshot, Clone, Bulk Copy, Async, Sync) from a single central user interface.

I’ve been having some fun discussions with one of my customers recently about how to tackle various application problems within the storage environment and it got me thinking about the value of having “options”.  This customer has an EMC Celerra Unified Storage Array that has Fiber Channel, iSCSI, NFS, and CIFS protocols enabled.  This single storage system supports VMWare, SQL, Web, Business Intelligence, and many custom applications.

The discussion was specifically centered on ensuring adequate storage performance for several different applications, each with a different type of workload…

1.)  Web Servers – Primarily VMs with general-purpose IO loads and low write ratios.

2.)  SQL Servers – Physical and Virtual machines with 30-40% write ratios and low latency requirements.

3.)  Custom Application  – A custom application database with 100% random read profiles running across 50 servers.

The EMC Unified solution:

EMC Storage already sports virtual provisioning in order to provision LUNs from large pools of disk to improve overall performance and reduce complexity.  In addition, QoS features in the array can be used to provide guaranteed levels of performance for specific datasets by specifying minimum and maximum bandwidth, response time, and IO requirements on a per-LUN basis.  This can help alleviate disk contention when many LUNs share the same disks, as in a virtual pool.  Enterprise Flash Drives (EFD) are also available for EMC Storage arrays to provide extremely high performance to applications that require it and they can coexist with FC and SATA drives in the same array.  Read and write cache can also be tuned at an array and LUN level to help with specific workloads.  With the updates to the EMC Unified Platform that I discussed previously, Sub-LUN FAST (auto tiering), and FAST Cache (EFD used as array cache) will be available to existing customers after a simple, non-disruptive, microcode upgrade, providing two new ways to tackle these issues.

So which feature should my customer use to address their 3 different applications?

Sub-LUN FAST (Fully Automated Storage Tiering)

Put all of the data into large Virtual Provisioning pools on the array, add a few EFD (SSD) and SATA disks to the mix and enable FAST to automatically move the blocks to the appropriate tier of storage.  Over time the workload would even out across the various tiers and performance would increase for all of the workloads with much fewer drives, saving on power, floor space, cooling, and potentially disk cost depending on the configuration.  This happens non-disruptively in the background.  Seems like a no-brainer right?

For this customer, FAST helps the web server VMs and the general-purpose SQL databases where the workload is predominately read and much of the same data is being accessed repeatedly (high locality of reference).   As long as the blocks being accessed most often are generally the same, day-to-day, automated tiering (FAST) is a great solution.  But what if the workload is much more random?  FAST would want to push all of the data into EFD, which generally wouldn’t be possible due to capacity requirements.  Okay, so tiering won’t solve all of their problems.  What about FAST Cache?

FAST Cache

Exponentially increase the size of the storage array’s read AND write cache with EFD (SSD) disks.  This would improve performance across the entire array for all “cache friendly” applications.

For this customer, increasing the size of write cache definitely helps performance for SQL (50% increase in TPM, 50% better response time as an example) but what about their custom database that is 100% random read?  Increasing the size of read cache will help get more data into cache and reduce the need to go to disk for reads, but the more random the data, the less useful cache is.   Okay, so very large caches won’t solve all of their problems.   EFDs must be the answer right?

EFD Disks

Forget SATA and FC disks; just use EFD for everything and it will be super fast!!   EFD has extremely high random read/write performance, low latency at high loads, and very high bandwidth.  You will even save money on power and cooling.

The total amount of data this customer is dealing with in these three applications alone exceeds 20TB.  To store that much in EFD would be cost prohibitive to say the least.  So, while EFD can solve all of this customer’s technical problems, they couldn’t afford to acquire enough EFD for the capacity requirements.

But wait, it’s not OR, it’s AND

The beauty of the EMC Unified solution is that you can use all of these technologies, together, on the same array, simultaneously.

In this customer’s case, we put FC and SATA into a virtual pool with FAST enabled and provision the web and general-purpose SQL servers from it.  FAST will eventually migrate the least used blocks to SATA, freeing the FC disks for the more demanding blocks.

Next, we extend the array cache using a couple EFDs and FAST Cache to help with random read, sequential pre-fetching, and bursty writes across the whole array.

Finally, for the custom 100% random read database, we dedicate a few EFDs to just that application, snapshot the DB and present copies to each server.  We disable read and write cache for the EFD backed volumes which leaves more cache available to the rest of the applications on the array, further improving total system performance.

Now, if and when the customer starts to see disk contention in the virtual pool that might affect performance of the general-purpose SQL databases, QoS can be tuned to ensure low response times on just the SQL volumes ensuring consistent performance.  If the disks become saturated to the point where QoS cannot maintain the response time or the other LUNs are suffering from load generated by SQL, any of the volumes can be migrated (non-disruptively) to a different virtual pool in the array to reduce disk contention.

Options

If you look at offerings from the various storage vendors, many promote large virtual pools, some also promote large caches of some kind, others promote block level tiering, and a few promote EFD (aka SSDs) to solve performance problems.  But, when you are consolidating multiple workloads into a single platform, you will discover that there are weaknesses in every one of those features and you are going to wish you had the option to use most or all of those features together.

You have that option on EMC Unified.

This past week, during EMC World 2010 in Boston, EMC made several announcements of updates to the Celerra and CLARiiON midrange platforms.  Some of the most impressive were new capabilities coming to CLARiiON FLARE in just a couple short months.  Major updates to Celerra DART will coincide with the FLARE updates and if you are already running CLARiiON CX4 hardware, or are evaluating CX4 (or Celerra), you will want to check these new features out.  They will be available to existing CX4(120,240,480,960)/NS(120,480,960) systems as part of a software update.

Here’s a list of key changes in FLARE 30:

  • Unified management for midrange storage platforms including CLARiiON and Celerra today, plus RecoverPoint, Replication Manager and more in the future.  This is a true single pane of glass for monitoring AND managing SAN, NAS, and data protection and it’s built in to the platform.  ”EMC Unisphere” replaces Navisphere Manager and Celerra Manager and supports multiple storage systems simultaneously in a single window. (Video Demo)
  • Extremely large cache (ie: FASTCache) – Up to 2TB of additional read/write cache in CLARiiON using SSDs (Video Demo)
  • Block level Fully Automated Storage Tiering (ie: sub-LUN FAST) – Fully automated assignment of data across multiple disk types
  • Block Level Compression – Compress LUNs in the CLARiiON to reduce disk space requirements
  • VAAI Support – Integrate with vSphere ESX for improved performance

These features are in addition to existing features like:

  • Seamless and non-disruptive mobility of LUNs within a storage array – (via Virtual LUNs)
  • Non-Disruptive Data Migration – (via PowerPath Migration Enabler)
  • VMWare Aware Storage Management – (Navisphere, Unisphere, and vSphere Plugins giving complete visibility  and self-service provisioning for VMWare admins (Video Demo) AND Storage Admins
  • CIFS and NFS Compression – Compress production data on Celerra to reduce disk space requirements including VMs
  • Dynamic SAN path load balancing – (via PowerPath)
  • At-Rest-Encryption – (via PowerPath w/RSA)
  • SSD, FC, and SATA drives in the same system – Balance performance and capacity as needed for your application
  • Local and Remote replication with array level consistency – (SnapView, MirrorView, etc)
  • Hot-swap, Hot-Add, Hot-Upgrade IO Modules – Upgrade connectivity for FC, FCoE, and iSCSI with no downtime
  • Scale to 1.8PB of storage in a single system
  • Simultaneously provide FC, iSCSI, MPFS, NFS, and CIFS access

All together, this is an impressive list of features for a single platform. In fact, while many of EMC’s competitors have similar features, none of them have all of them in the same platform, or leverage them all simultaneously to gain efficiency.  When CLARiiON CX4 and Celerra NS are integrated and managed as a single Unified storage system with EMC Unisphere there is tremendous value as I’ll point out below…

Improve Performance easily…

  • Install a couple SSD drives into a CLARiiON and enable FASTCache to increase the array’s read/write cache from the industry competive 4GB-32GB up to 2TB of array based non-volatile Read AND Write cache available to ALL applications including NAS data hosted by the array.
  • Install PowerPath on Windows, Linux, Solaris, AND VMWare ESX hosts to automatically balance IO across all available paths to storage.  PowerPath detects latency and queuing occuring on each path and adjusts automatically, improving performance at the storage array AND for your hosts.  This is a huge benefit in VMWare environments especially.
  • When VMWare releases the updated version of vSphere ESX that supports VAAI, ESX will be able to leverage VAAI support in the CLARiiON to reduce the amount of IO required to do many tasks, improving performance across the environment again.
  • Upgrade from 1gbe iSCSI to 10gbe iSCSI, or from 4gbe FiberChannel to 8gbe FiberChannel, without a screwdriver or downtime.
  • Provide NAS shared file access with block-level performance for any application using EMC’s MPFS protocol.

Improve Efficiency and cost easily…

  • Create a single pool of storage containing some SSD, some FC, and some SATA drives, that automatically monitors and moves portions of data to the appropriate disk type to both improve performance AND decrease cost simultaneously.
  • Non-disruptively compress volumes and/or files with a single click to save 50% of your disk space in many cases.
  • Convert traditional LUNs to more efficient Thin-LUNs non-disruptively using PowerPath Migration Enabler, saving more disk space.

Increase and Manage Capacity easily…

  • Add additional storage non-disruptively with SSD, FC, and SATA drives in any mix up to 1.8PB of raw storage in a single CLARiiON CX4.
  • Using FASTCache, iSCSI, FC, and FCoE connectivity simultaneously does not reduce total capacity of the system.
  • Expanding LUNs, RAID Groups, and Storage Pools is non-disruptive.
  • Migrating LUNs between RAID groups and/or Storage Pools is non-disruptive using built-in CLARiiON LUN Migration, as is migrating data to a different storage array (using PowerPath Migration Enabler)!
  • Balancing workload between storage processors is non-disruptive and at individual LUN granularity.

Protect your data easily…

  • Snapshot, Clone, and Replicate any of the data to anywhere with built in array tools that can maintain complete data consistency across a single, or multiple applications without installing software.
  • Maintain application consistency for Exchange, SQL, Oracle, SAP, and much more, even within VMWare VMs, while replicating to anywhere with a single pane-of-glass.
  • Encrypt sensitive data seamlessly using PowerPath Encryption w/RSA.

Maintain Flexibility…

  • While you can do all of these things quickly and simply, you still have the flexibility to create traditional RAID sets using RAID 0, 1, 5, 6, and 10 where you need highly predicable performance, or tune read and write cache at the array and LUN level for specific workloads.  Do you want read/write snapshots? How about full copy clones on completely separate disks for workload isolation and failure protection? What about the ability to rollback data to different points in time using snapshots without deleting any other snapshots?  EMC Storage arrays have been able to do this for a long time and that hasn’t changed.

There are few manufacturers aside from EMC that can provide all of these capabilities, let alone provide them within a single platform.  That’s the definition of simple, efficient, Unified Storage in my opinion.

In the last post, I talked about a project I am involved in right now to deploy NetApp storage alongside EMC for SAN and NAS.  Today, I’m going to talk about my first impressions of the NetApp during deployment and initial configuration.

First Impressions

I’m going to be pretty blunt — I have been working with EMC hardware and software for a while now, and I’m generally happy with the usability of their GUIs.  Over that time, I’ve used several major revisions of Navisphere Manager and Celerra Manager, and even more minor revisions, and I’ve never actually found a UI bug.  To be clear, EMC, IBM, NetApp, HDS, and every other vendor have bugs in their software, and they all do what they can to find and fix them quickly, but I just haven’t personally seen one in the EMC UIs despite using every feature offered by those systems. (I have come across bugs in the firmware)

Contrast that with the first day using the new NetApp, running the latest 7.3.1.1L1 code, where we discovered a UI problem in the first 10 minutes.  When attempting to add disks to an aggregate in FilerView, we could not select FC disk to add.  We could, however, add SATA disk to the FC aggregate.  The only way to get around the issue was to use the CLI via SSH.  As I mentioned in my previous post, our NetApp is actually an IBM nSeries, and IBM claims they perform additional QC before their customers get new NetApp code.

Shortly after that, we found a second UI issue in FilerView.  When creating a new Initiator group, FilerView populates the initiator list with the WWNs that have logged in to it.  Auto-populating is nice but the problem is that FilerView was incorrectly parsing the WWN of the server HBAs and populating the list with NodeWWNs rather than PortWWNs.  We spent several hours trying to figure out why the ESX servers didn’t see any LUNs before we realized that the WWNs in the Initiator group were incorrect.  Editing the 2nd digit on each one fixed the problem.

I find it interesting that these issues, which seemed easy to discover, made it through the QC process of two organizations.  ONTap 7.3.2RC1 is available now, but I don’t know if these issues were addressed.

Manageability

As far as FilerView goes, it is generally easy to use once you know how NetApp systems are provisioned.  The biggest drawback in an HA-Filer setup is the fact you have to open FilerView separately for each Filer and configure each one as a separate storage system.  Two HA-Filer pairs? Four FilerView windows.  If you include the initial launch page that comes up before you get to the actual FilerView window, you double the number of browser windows open to manage your systems.  NetApp likes to mention that they have unified management for NAS and SAN where EMC has two separate platforms, each with their own management tools. EMC treats the two storage processors (SPs) in a Clariion in a much more unified manner, and provisioning is done against the entire Clariion, not per SP.  Further, Navisphere can manage many Clariions in the same UI.  Celerra Manager acts similarly for EMC NAS.  Six of one, half a dozen of the other some say, except that I find that I generally provision NAS storage and SAN storage at different times, and I’d rather have all of the controllers/filers in the same window than NAS and SAN in the same window.  Just my preference.

I should mention, NetApp recently released System Manager 1.0 as a free download.  This new admin tool does present all of the controllers in one view and may end up being a much better tool than FilerView.  For now, it’s missing too many features to be used 100% of the time and it’s Windows only since it’s based on MMC.  Which brings me to my other problem with managing the NetApp.  Neither FilerView nor System Manager can actually do everything you might need to do, and that means you end up in the CLI, FREQUENTLY.  I’m comfortable with CLIs and they are extremely powerful for troubleshooting problems, and especially for scripting batch changes, but I don’t like to be forced into the CLI for general administration.  GUI based management helps prevent possibly crippling typos and can make visualizing your environment easier.  During deployment, we kept going back and forth between FilerView and CLI to configure different things.  Further, since we were using MultiStore (vFilers) for CIFS shares and disaster recovery, we were stuck in the CLI almost entirely because System Manager can’t even see vFilers, and FilerView can only create them and attach volumes.

Had I not been managing Celerra and Clariion for so long, I probably wouldn’t have noticed the above problems.  After several years of configuring CIFS, NFS, iSCSI, Virtual DataMovers, IP Interfaces, Snapshots, Replication, and DR Failover, etc. on Celerra, as well as literally thousands of LUNs for hundreds of servers on Clariion, I don’t recall EVER being forced to use the CLI.  CelerraCLI and NaviCLI are very powerful, and I have written many scripts leveraging them, and I’ll use CLI when troubleshooting an issue.  But for every single feature I’ve ever used on the Celerra or Clarrion, I was able to completely configure from start to finish using the GUI.  Installing a Celerra from scratch even uses a GUI based installation wizard.  Comparing Clariion Storage Groups with NetApp Initiator groups and LUN maps isn’t even fair.  For MS Exchange, I mapped about 50 LUNs to the ESX cluster, which took about 30 minutes in FilerView.  On the Clariion, the same operation is done by just editing the Storage Group and checking each LUN, taking only a couple minutes for the entire process.

Now, all of the above commentary has to do with the management tools, UIs, and to some degree personal preferences, and does not have any bearing on the equipment or underlying functionality.  There are, of course, optional management tools like Operations Manager, Provisioning Manager, and Protection Manager available from NetApp, just as there is Control Center from EMC (which incidentally can monitor the NetApp) or Command Central from Symantec.  Depending on your overall needs, you may want to look at optional management tools; or, FilerView may be perfectly fine.

In the next post,  I’ll get into more specifics about how the Exchange 2007 CCR cluster turned out in this new environment, along with some notes on making CCR truly redundant.  I’ve also been working on the NAS side of the project, so I’ll also post about that some time soon.

I’ve been tasked recently on a project to increase availability of applications through the use of multiple/disparate storage systems.  This environment has heavily invested in EMC Clariion and Celerra storage systems over the past few years and needed a non-EMC platform from which to build the second half of a redundant storage environment.  For various reasons I won’t go into here, we chose IBM nSeries as that second platform. (Since the IBM system is rebranded NetApp FAS, I will refer to this as a NetApp filer.)  I’ve been working on implementing the new equipment as well as integrating it into the Business Continuity strategy.

The overall strategy is to continue to use the EMC Clariion/Celerra systems for production and disaster recovery replication and split applications between and across the two storage platforms for local redundancy.  The NetApp will also perform disaster recovery replication for some of the applications.  Here’s a really simple diagram that might help if the description is confusing:

EMC and NetApp Redundancy

EMC and NetApp Redundancy

Now this may sound easy, but it is, in fact, NOT straightforward.  This strategy requires close coordination with application owners and careful planning.  As we move forward on this project, I’ll talk about various idiosyncrasies, caveats, and problems we’ve faced, how we got around them, and I’ll also talk a lot about the differences between the Clariion/Celerra and NetApp platforms’ features and functionality, application support, and manageability.  These comparisons will include using both systems with FiberChannel connections as well as CIFS/NFS NAS, all in conjunction with DR replication and failover.

To start off, I figure we should compare some of the terminology between EMC and NetApp systems.  Some terms don’t directly translate, but I matched them up as close as I could and noted where there is no equivalent.   Below are two tables: one for Block Storage, and the other for NAS Storage.  Click on them to see full size versions.

EMC-NetApp Block Storage Terminology table

EMC-NetApp Block Storage Terminology

EMC-NetApp NAS Storage Terminology

EMC-NetApp NAS Storage Terminology

In the next update, I’ll start talking about the deployment itself.  The point of these articles is to discuss the differences, advantages, and disadvantages of each platform so that you can understand how each one might work in your environment.  I do not intend to disparage either platform or vendor.  I will try to be vendor agnostic as much as possible, and I do feel like I have a somewhat unique position of comparing new and recent hardware and firmware from both vendors, in the same production capacities, simultaneously, in the same environment.  I am NOT comparing old ONTap code to new FLARE/DART code or vise-versa, nor am I comparing old Clariion CX hardware to new NetApp/IBM hardware, etc.

Stay tuned!

Okay, now that I’ve talked about backing up the datacenter with NetBackup and DataDomain, and backing up remote sites with NetBackup and PureDisk, it’s time to discuss how to get all that data offsite to protect against a catastrophic event at the datacenter.

As mentioned before we have a primary datacenter with the majority of our systems including the backup environment, and a secondary “disaster recovery” datacenter to which we replicate tier 1 applications for business continuity purposes.  Since we really wanted to get away from using tapes and instead store the backups on disk in our datacenter we have a second backup environment in the DR datacenter and we replicate the backup data there.

There are several ways to replicate backup data between two sites but most of them have drawbacks..

1.) Duplicate the backup data from disk to tape and ship the tapes to the remote site to be ready for restore.  This is the easiest and probably cheapest way.  But there’s that pesky tape yet again with it’s media handling and shipping.  And restore could take a while since you have to deal with restoring the catalog from tape, then importing the media, etc.

2.) Duplicate the backup data directly from the local disk to the disk in the second location across the WAN.  This is not very feasible with any significant amount of data because every byte of data that is backed up in the datacenter has to be copied across the much slower WAN.  It could take many days to duplicate a single nights’ backup.  You’d also need a special Catalog backup job that wrote to a storage device across the WAN.  The good here is that the backup application knows there is a second copy of the data and knows how to find it.

3.) Replicate the data with the backup storage devices’s native replication.  Whether it’s PureDisk, Avamar, or DataDomain, pretty much every source-based or target-based deduplication solution has replication built in that leverages the deduplication to reduce the amount of data that traverses the WAN.  The advantage here is that you can have a copy of all of your backup data in a second location in a much shorter time than a traditional copy process.  If your deduplication device stores the data with 10:1 compression, then your WAN usage is reduced by 90%.  The savings in practice is actually better than that.  The drawback is that the backup application (hence the catalog) has no knowledge that there is a second copy of the backup data and after recovering the catalog, you would need to import all of the disk-based media which could take a long time.

4.) Leverage NetBackup Lifecycle Policies with Symantec OpenSTorage (OST) and an OST-capable backup storage system like DataDomain or PureDisk(with PDDO).  Basically this has all the advantages of option #2, where it is a catalog-aware duplication, combined with the advantages of WAN bandwidth savings from option #3.  Time to copy the data offsite is much shorter due to deduplication, and time to restore is very fast since the data is already in the catalog and available on the disk.

OpenSTorage (OST) is a network protocol that Symantec developed to interface with disk-based backup storage systems and DataDomain was an early adopter of OST.  OST allows Netbackup to control replication between OST-capable storage systems and keep track of the replicated copies of backups in the Catalog just as if Netbackup had made both copies itself.  OST is also used as the protocol to send the backup data to the storage device as opposed to CIFS/NFS or VTL.  DataDomain appliances support OST as does PureDisk when used in conjunction with the PDDO option discussed earlier.   In NetBackup, replication controlled by OST is called “optimized duplication” and is controlled primarily through Lifecycle Policies.

Traditionally, when creating NetBackup job policies, the administrator will specify a Storage Unit (either a disk storage unit or a tape library or drive) that the job policy will send backups to.  Lifecycle Policies are treated like Storage Units as far as the Job Policy is concerned but the Lifecycle Policy includes a list of storage units, each with it’s own data retention, that the backup data must be stored onto in order for NetBackup to consider the data fully protected.  Typically there is a “Backup” target which is where the actual data coming from the client is stored, followed by one or more “Duplication” targets.  After the backup job completes, NetBackup will copy the backup data from the “backup” location to all of the “duplication” locations.  This works with pretty much any type of storage and you can mix and match tape and disk in the same policy.  Since these are duplication operations, NetBackup will read ALL of the data from the backup location, and write ALL of the data to each duplication location.  This can take a long time even on the local network and trying to offsite a lot of data over the WAN is not very feasible.

With OST, the lifecycle policy operates exactly the same except that it uses “optimized duplication”, instructing the storage device to copy the file rather than performing the copy through a media server.  So in the case of DataDomain, OST issues the command to the DDR, the DDR then copies the file to the second DDR in the remote site and gets all the benefits of deduplication and compression between the two.  The media server doesn’t actually do any work.  Once the duplication is complete, the DDR notifies NetBackup and the catalog is updated with a record of the second copy of the backup.  Lifecycle Policies are fully automated, you can’t even restart a failed duplication, so in the event of a transient failure like a WAN hiccup NetBackup will retry a duplication job forever until it succeeds in order to satisfy the lifecycle policy.

As you can probably surmise, this is REALLY nice for a tape-less backup environment.  Our DD690 offsites over 9TB of data every night DURING the backup window.  When the last backup job completes, the offsite copies are complete within 30 minutes.  And there is absolutely no management of the offsite process or duplication jobs besides configuring the lifecycle policies up front.  The drawback to regular Netbackup lifecycle policies is that all duplications are taken from the initial backup copy which limits what you can do with the copies.

Enter NetBackup 6.5.4…  Despite the small 6.5.3 -> 6.5.4 version number change, the 6.5.4 release had quite a few new features added.  The biggest one was a revamping of the Lifecycle Policy engine to allow for nested duplications.  Now you can create a copy of a backup, then create multiple copies from the copy, then create copies from the other copies.  Why is this useful?

Remember when I discussed using NetBackup with PDDO to backup remote sites?  Well the data backed up from the remote site is all stored in the primary datacenter and we need to get the second copy to the DR datacenter.  Plus, we wanted to have a small cache of recently backed up data sitting on the remote media server for fast restore.  Well, nested lifecycle’s are the key.  The lifecycle writes the initial backup copy onto the media server’s local disk which is configured as a capacity-managed staging area (ie: it stores as much as it can and expires data when it needs more space for new backups).  The lifecycle then creates a duplicate of the backup onto the PureDisk storage unit in the primary datacenter.  Since bandwidth to the remote site is very limited we don’t want to copy it from the remote site twice so the lifecycle has a second duplication nested under the first to copy it to the DR datacenter.  The source of the second copy is the primary datacenter copy, NOT the remote media server copy.

Where else can we use this?  Let’s consider our tape-less datacenter backups..  We backup the clients to the DataDomain in our primary datacenter, then using a lifecycle policy and OST, create a copy on the DataDomain in the DR datacenter.  If we also wanted to have a tape copy for long term archive or vaulting we could create a nested duplication to make a copy to a tape library in the DR datacenter from the disk copy that is also in the DR datacenter.  Without nested lifecycle’s the only workable solution would be to create the tape in the primary datacenter.  Every copy of the backup made via the lifecycle policy whether it is using OST or not is maintained by the catalog and easily used for restore.  Furthermore, using OST as the protocol between Netbackup and DataDomain actually increases throughput to the DataDomain DDR systems by approximately 2X vs VTL/CIFS/NFS.

Now to the caveats.. Optimized duplication via OST is only available when you are using OST as the protocol between the media server and the storage unit.  This means it doesn’t work with VTL even when the DataDomain IS the VTL.  OST only works over an ethernet network which is why we skipped VTL completely and used 10gbps networks for the DDR connections.  We even skipped VTL/Tape for the NAS systems, connected them directly to the 10gbps network and use 3-way NDMP to backup them up over the network, through the media servers, to the DataDomain.  We get the benefit of lifecycle policies, optimized duplication, and I may have mentioned before–no pesky tape even with NDMP/NAS backups.  And the interesting thing is that with the 10gbps connection, the NDMP dumps are faster than direct fiber to tape.

There were other enhancements to NetBackup 6.5.4 centered around OST functionality but the lifecycle policy improvements were huge in my opinion.

To cover the catalog replication, we run Netbackup hot catalog backups to a CIFS share that is hosted by the DataDomain.  The DDR replicates that share using DataDomain native replication to the DDR in the DR datacenter where the same data is available via a similar CIFS share.  Our standby Netbackup master server is already connected to the CIFS share for catalog restore and connected to the DDR via OST.  A single operation restores the catalog from the replicated copy.  In a real disaster we can begin restoring user data within 30 minutes from the DR datacenter.

This is the 3rd part of a multi-part discussion on capacity vs performance in SAN environments.  My previous post discussed the use of thin provisioning to increase storage utilization.  Today we are going to focus on a newer technology called Data De-Duplication.

Data De-Duplication can be likened to an advanced form of compression.  It is a way to store large amounts of data with the least amount of physical disk possible.

De-duplication technology was originally targeted at lowering the cost of disk-based backup.  DataDomain (recently acquired by EMC Corp) was a pioneer in this space.  Each vendor has their own implementation of de-duplication technology but they are generally similar in that they take raw data, look for similarities in relatively small chunks of that data and remove the duplicates.  The diagram below is the simplest one I could find on the web.  You can see that where there were multiple C, D, B, etc blocks in the original data, the final “de-duplicated” data has only one of each.  The system then stores metadata (essentially pointers) to track what the original data looked like for later reconstruction.

Image from eWeek Article

The first and most widely used implementations of de-dupe technology were in the backup space where much of the same data is being backed up during each backup cycle and many days of history (retention) must be maintained.  Compression ratios using de-duplication alone can easily exceed 10:1 in backup systems.  The neat thing here is that when the de-duplication technology works at the block-level (rather than file-level) duplicate data is found across completely disparate data-sets.  There is commonality between Exchange email, Microsoft Word, and SQL data for example.  In a 24 hour period, backing up 5.7TB of data to disk, the de-dupe ratio in my own backup environment is 19.2X plus an additional 1.7X of standard compression on the post de-dupe’d data, consuming only 173.9GB of physical disk space.  The entire set of backed up data, totaling 106TB currently stored on disk, consumes only 7.5TB of physical disk.  The benefits are pretty obvious as you can see how we can store large amounts of data in much less physical disk space.

There are numerous de-duplication systems available for backup applications — DataDomain, Quantum DXi, EMC DL3D, NetApp VTL, IBM Diligent, and several others.  Most of these are “target-based” de-duplication systems because they do all the work at the storage layer with the primary benefit being better use of disk space.  They also integrate easily into most traditional backup environments.  There are also “source-based” de-duplication systems — EMC Avamar and Symantec PureDisk are two primary examples.  These systems actually replace your existing backup application entirely and perform their work on the client machine that is being backed up.  They save disk space just like the other systems but also reduce bandwidth usage during the backup which is extremely useful when trying to get backups of data across a slow network connection like a WAN.

So now you know why de-duplication is good, and how it helps in a backup environment.. But what about using it for primary storage like NAS/SAN environments?  Well it turns out several vendors are playing in that space as well.  NetApp was the first major vendor to add de-duplication to primary storage with their A-SIS (Advanced Single Instance Storage) product.  EMC followed with their own implementation of de-duplication on Celerra NAS.  They are entirely different in their implementation but attempt to address the same problem of ballooning storage requirements.

EMC Celerra de-dupe performs file-level single-instancing to eliminate duplicate files in a filesystem, and then uses a proprietary compression engine to reduce the size of the files themselves.  Celerra does not deal with portions of files.  In practice, this feature can significantly reduce the storage requirements for a NAS volume.  In a test I performed recently for storing large amounts of Syslog data, Celerra de-dupe easily saved 90% of the disk space consumed by the logs and it hadn’t actually touched all of the files yet.

NetApp’s A-SIS works at a 4KB block size and compares all data within a filesystem regardless of the data type.  Besides NAS shares, A-SIS also works on block volumes (ie: FiberChannel and iSCSI LUNs) where EMC’s implementation does not.  Celerra has an advantage when working with files which contain high amounts of duplication in very small block sizes (like 50 bytes) since NetApp looks at 4KB chunks.  Celerra’s use of a more traditional compression engine saves more space in the syslog scenario but NetApp’s block level approach could save more space than Celerra when dealing with lots of large files.

The ability to work on traditional LUNs presents some interesting opportunities, especially in a VMWare/Hyper-V environment.  As I mentioned in my previous post, virtual environments have lots of redundant data since there are many systems potentially running the same operating system sharing the disk subsystem.  If you put 10 Windows virtual machines on the same LUN, de-duplication will likely save you tons of disk space on that LUN.  There are limitations to this that prevent the full benefits from being realized.  VMWare best practices require you to limit the number of virtual machine disks sharing the same SAN lun for performance reasons (VMFS locking issues) and A-SIS can only de-dupe data within a LUN but not across multiple LUNs.  So in a large environment your savings are limited.  NetApp’s recommendation is to use NFS NAS volumes for VMWare instead of FC or iSCSI LUNs because you can eliminate the VMFS locking issue and place many VMs on a single very large NFS volume which can then be de-duplicated.  Unfortunately there are limits on certain VMWare features when using NFS so this may not be an option for some applications or environments.  Specifically, VMWare Site Recovery Manager which coordinates site-to-site replication and failover of entire VMWare environments does not currently support NFS as of this writing.

When it comes to de-duplication’s impact on performance it’s kind of all over the map.  In backup applications, most target based systems either perform the work in memory while the data is coming in or as a post-process job that runs when the backups for that day have completed.  In either case, the hardware is designed for high throughput and performance is not really a problem.  For primary data, both EMC and NetApp’s implementations are post-process and do not generally impact write performance.  However, EMC has limitations on the size of files that can be de-duplicated before a modification to a compressed file causes a significant delay.  Since they also limit de-duplication to files that have not been accessed or modified for some period of time, the problem is minimal in most environments.  NetApp claims to have little performance impact to either read or writes when using A-SIS.  This has much to do with the architecture of the NetApp WAFL filesystem and how A-SIS interacts with it but it would take an entirely new post to describe how that all works.  Suffice it to say that NetApp A-SIS is useful in more situations than EMC’s Celerra De-duplication.

Where I do see potential problems with performance regardless of the vendor is in the same situation as thin provisioning.  If your application requires 1000 IOPS but you’ve only got 2 disks in the array because of the disk space savings from thin-provisioning and/or de-duplication, the application performance will suffer.  You still need to service the IOPS and each disk has a finite number of IOPS (100-200 generally for FC/SCSI).  Flash/SSD changes the situation dramatically however.

Right now I believe that de-duplication is extremely useful for backups, but not quite ready for prime-time when it comes to primary storage.  There are just too many caveats to make any general recommendations.  If you happen to purchase an EMC Celerra or NetApp FAS/IBM nSeries that supports de-duplication, make sure to read all the best-practices documentation from the vendor and make a decision on whether your environment can use de-duplication effectively, then experiment with it in a lab or dev/test environment.  It could save you tons of disk space and money or it could be more trouble than it’s worth.  The good thing is that it’s pretty much a free option from EMC and NetApp depending on the hardware you own/purchase and your maintenance agreements.

In my previous post, where I discussed the problem of unusable (or slack) disk space on a SAN, I promised a follow-up with techniques on how to increase storage utilization.  I realized that I should discuss some related technologies first and then follow that up with how to put it all together.  So today I start by talking about Thin Provisioning.  I will then follow up with an explanation of De-Duplication and finally talk about how to use multiple technologies together to get the most use out of your storage.

So what is Thin Provisioning?  It is a technology that allows you to create LUNs or Volumes on a storage device such that the LUN/Volume(s) appear to the host or client to be larger than they actually are.   In general, NAS clients and SAN attached hosts see “Thin Provisioned” LUNs just as they see any other LUN but the actual amount of disk space used on the storage device can be significantly smaller than the provisioned size.  How does this help increase storage utilization?  Well, with thin provisioning you provide applications with exactly the storage they want and/or need but you don’t have to purchase all of the disk capacity up front.

Let’s start with a comparison of using standard LUNs vs thin LUNs with a theoretical application set:

Say we have 3 servers, each running Windows Server.  The operating system partition is on local disk and application data drives are on SAN.  Each server runs an application that collects and stores data over time and the application owner expects that over the next year or so the data will grow to 1TB on each server.  In this particular case we also know that the application’s performance requirements are relatively low.

With traditional provisioning we might create 3 LUNs that are 1TB each and present them to the servers.  This provides the application with room for the expected growth.  Using 300GB FC disks we can carve out three 4+1 RAID5 sets, create one LUN in each and it would work fine.  Alternatively we could use wide striping (ie: a MetaLUN on EMC Clariion) and put all three LUNs on the same 15 disks.  Either way we’ve just burned 15 disks on the storage array based on uncertain future requirements.  If we were stingier with storage we could create smaller LUNs (500GB for example) and use LUN expansion technology to increase the size when the application data fills the disk to that capacity.

In the Thin Provisioning world we still create three 1TB LUNs but they would start out by taking no space.  The pool of disk that the LUNs get provisioned from doesn’t even need to have 3TB of capacity.  As the application data grows over the next 12 months or longer the pool size only needs to grow to accommodate the actual amount of data stored.  Depending on the storage array, we can add disks to the pool one at a time.  So on day one we start with 3 disks in the pool, and then add additional disks one by one throughout the year.  We can then create additional LUNs for other applications without adding disks.  As we add disks to the pool, we expand the capacity available for all of the LUNs to grow (up to each LUN’s maximum size) and we increase performance for ALL of the LUNs in the pool since we are adding spindles.  The real-world benefits come as we consolidate numerous LUNs into a single disk pool.

The nice thing about this approach is that we stop managing the size of individual LUNs and just manage the underlying disk pool as a whole.   And the cost-per-GB for SAN disk constantly goes down so we can spend only what we have to today, and when we add more later it will likely be a little cheaper.  Disk capacity utilization will be much higher in a thin model compared with the traditional/thick model.

The story gets even better in a virtual server environment such as with MS Hyper-V or VMWare ESX.  First, the virtual server OS drives are on the SAN in addition to the application data, and there can be multiple virtual disks on the same LUN.  Whether physical or virtual, we need to maintain some free space in the disks to keep applications running, plus with virtual systems we need some free space on the LUN for features of the virtualization technology like snapshots.  The net effect is that in a virtualized environment, disk utilization never gets much above 50% when slack space at both the virtual layer and inside the virtual servers is considered.  With thin provisioning we could potentially store twice the number of virtual servers on the same physical disks.

There are caveats of course.  Maintaining performance is the primary concern.  Whether used in a thick LUN or thin LUN, each disk has a specific amount of performance.   Thin provisioning has no effect on the amount of IOPS or bandwidth the application requires nor the amount of IOPS the physical disk can handle.  So even if thin provisioning saves 50% disk space in your environment, you may not be able to use all of that reclaimed space before running into performance bottlenecks.  If the storage array has QOS features (ie: EMC Clariion NQM) it is possible to prioritize the more important applications in your disk pool to maintain performance where it matters.

Other problems that you may encounter have to do with interoperability.  For starters, some applications are not “thin-friendly”; ie: they write data in such a way as to negate any benefit that thin provisioning provides.  Also, while many storage arrays support thin provisioning, each has different rules about the use of thin LUNs.  For example, in some scenarios you can’t replicate thin LUNs using native array tools.  It pays to do your homework before choosing a new storage array or implementing thin provisioning.

I didn’t cover thin provisioning in NAS environments directly but the feature works in the same manner.  Thin volumes are provisioned from pools of storage and users/clients see a large amount of available disk space even if the disk pool itself is very small.  Since NAS is traditionally used for user home directories and departmental shares, absolute performance is usually not as much of a concern so thin provisioning is much easier to implement and in many cases is the default behavior or simply a check box on NAS appliances like EMC Celerra or NetApp FAS.

Thin provisioning is a powerful technology when used where it makes sense.  In my next post I’ll explain de-duplication technology and then talk about how these technologies can be used together plus some workarounds for the caveats that I’ve mentioned.