Short Answer: Yes!
In my dealings with customers I’ve been requesting performance data from their storage systems whenever I can to see how different applications and environments react to new features. Today I’m going to give you some more real-world data, straight from a customer’s production EMC NS480.
I’ve pulled various stats out of Analyzer for this customer’s Exchange server, which has 3 mail databases totaling about 1TB of mail stored on the NS480 via FibreChannel connect. Since this customer is not extremely large (similar to most of our customers) they are using this NS480 for pretty much everything from VMWare, SQL, and Exchange, to NAS, web/app content, and Business Intelligence systems. There is about 30TB of block data and another 100TB of NAS data. FASTCache is enabled for all LUNs and Pools with just 183GB of usable FASTCache space (4 x 100GB SSDs). So in this environment, with a modest amount of FASTCache and very mixed workload, how does Exchange fare?
Let’s first take a look at the Exchange workload itself for a 24 hour period: (Note: There were no reads from the Exchange log LUNs to speak of so I left that out of this analysis.)
Total Read IOPS for the 3 databases: (the largest peak is a result of database maintenance jobs and the smaller peaks are due to backup jobs) Here it’s tough to see due to the maintenance and backup peaks, but production IO during the work day is about 200-400 IOPS. By the way, a source-deduplicating incremental-forever backup technology, such as Avamar, could drastically reduce the IO load and duration of the nightly backup.
Total Write IOPS for the 3 databases: Obviously more changes to the database occurring during the work day.
Total Write IOPS for the 3 Log files: Log data is typically cached easily in the SP cache so FAST Cache isn’t terribly required here but I’m including it to show whether there is any value to using FASTCache with Exchange logs.
Now let’s look at the FASTCache hit ratios for this same set of data: (average of all 3 DBs)
First, the Read Activity: Here you can see that aside from the maintenance and backup jobs, FASTCache is servicing 70-90% of the Read IOPs. Keep in mind that a FASTCache miss could still be a Cache Hit if the data is in SP Cache. What’s interesting about this is that it looks like the nightly maintenance job is pushing the highest load.
And the Write Activity: Because EMC’s FASTCache implementation is a read/write cache, the benefit extends beyond just read IO. Here you see that FASTCache is servicing 60-80% of the writes for these Exchange Databases. That’s a huge load off the backend disks.
And the Log Writes: Since Log writes are usually not a performance problem, I would say that FASTCache is not necessary here, and the average 30% hit ratio shown here is not great. If you wanted to spend the time to tune FASTCache a bit, you might consider disabling FASTCache for Log LUNs to devote the FASTCache capacity to more cache friendly workloads.
All in all you can see that for the database data, FASTCache is servicing a significant portion of the user generated workload, reducing the backend disk load and improving overall performance.
Hopefully this gives you a sense of what FASTCache could do for your Exchange environment, reducing backend disk workload for reads AND writes. I must reiterate, since an SP Cache hit is shown as a FASTCache miss, an 80% FASTCache hit ratio does not mean that 20% of the IOs are hitting disk. To illustrate this, I’ve graphed the sum of SP Cache Hits and FAST Cache Hits for a single database. You can see that in many cases we’re hitting a total of 100% cache hits.
Most interesting is the backup window where SP Cache is really handling a huge amount of the load. This is actually due to the Prefetch algorithms kicking in for the sequential read profile of a backup, something CX/VNX is very good at.
One of the features that has been added to Analyzer (Navisphere and Unisphere) in recent versions is the ability to search for specific LUNs based on criteria. This feature is actually pretty powerful because the criteria itself is pretty flexible. For example, you can search for all LUNs attached to a specific host, or with a specific set of characters in the LUN name. In addition you can search against performance metrics like Throughput, Response Time, or LUN Utilization. This is where it gets interesting because you can look for poorly performing LUNs really quickly. In the following example, I am going to build a search that looks for LUNs that have EX in the name (since all of my Exchange server LUNs have EX in the name) that ALSO have high LUN utilization for several polling intervals.
Once you’ve launched Analyzer and opened an Archive, click on the binocular icon in the tool bar to bring up the search dialog.
You can choose a predefined search (a search you previously created and saved) or a new Object Based Query. In this example we are going to build a new query, so select “Object Based Query” and choose All LUNs in the drop down box. If you wanted, you could narrow down the search to just Pool Based LUNs, just MetaLUNs, Component LUNs, etc.
Next we’ll define the LUN criteria by selecting the Name property, choosing Contains, and entering the “EX” value. This will filter the search to only those LUNs that have EX in the name. Finally we’ll set a threshold. In this example, I’m looking for LUNs that have a LUN Utilization value greater than 90% for at least 10 polling samples. I could add more LUN criteria and/or more thresholds to further narrow down the results with AND or OR combinations.
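The same criteria can be applied to exported utilization data outside of Analyzer. The sketch below is only an illustration in Python, not Analyzer’s own query engine: the function name, the dictionary layout, and the sample values are hypothetical, and it assumes that “at least 10 polling samples” means 10 samples in total rather than 10 consecutive ones.

```python
def find_busy_luns(utilization_by_lun, name_contains="EX",
                   threshold_pct=90.0, min_samples=10):
    """Return names of LUNs that match the name filter and exceed the
    utilization threshold in at least min_samples polling intervals."""
    matches = []
    for lun_name, samples in utilization_by_lun.items():
        if name_contains not in lun_name:
            continue  # name criterion: LUN name must contain the substring
        samples_over = sum(1 for pct in samples if pct > threshold_pct)
        if samples_over >= min_samples:
            matches.append(lun_name)
    return sorted(matches)


# Hypothetical per-interval LUN Utilization (%) values.
example = {
    "EX_DB1": [95.0] * 12 + [40.0] * 6,   # busy for 12 intervals
    "EX_LOG1": [95.0] * 3 + [20.0] * 15,  # busy for only 3 intervals
    "SQL_DATA": [99.0] * 18,              # busy, but the name does not match
}
busy = find_busy_luns(example)
print(busy)
```

Only the LUN that satisfies both the name criterion and the threshold is returned, which mirrors the AND combination built in the search dialog.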
Optionally, you can save the query so that it will be listed in the “Predefined Query” list in the future. Click Search and set or edit the name of the search.
After clicking OK, Analyzer will create a new tab and populate the results of the search. Once the search is complete you can graph metrics for the LUNs like normal. Here I’ve selected Utilization to show why this LUN matched the search criteria — note the high utilization between 2am and 7am.
You can get much more granular with your searches if you are looking for something specific, or use metrics like Response Time to look for poorly performing LUNs attached to a specific server. It’s pretty flexible. I started using the search feature recently and thought others might be interested in it. Try it out and let me know what you think.
<< Back to Part 4 — Part 5 — Go to Part 6 >>
Sorry for the delay on this next post. Between EMC World and my 9 month old, it’s been a battle for time…
Okay, so you have an EMC Unified storage system (Clariion, Celerra, or VNX) with FASTCache and you’re wondering how FASTCache is helping you. Today I’m going to walk you through how to tease FASTCache performance data out of Analyzer.
I’m assuming you already have Analyzer launched and a NAR archive open. One thing to understand about Analyzer stats as they relate to FASTCache is that stats are gathered at the LUN level for traditional RAID Group LUNs, but for Pool based LUNs, the stats are gathered at the pool level. As a result, graphing data for FASTCache differs for the two scenarios.
First we’ll take a look at the overall array performance. Here we’ll see how much of the write workload is being handled by FASTCache. In the SP Tab of Analyzer, select both SPs (be sure no LUNs or other objects are selected). Select Write Throughput (IO/s), and then click the clipboard icon (with I’s and O’s).
Launch Microsoft Excel and paste into the sheet, and then perform the text-to-column change discussed in the previous post if necessary.
Next create a formula in the D column, adding the values for both SPs into a single total. We’re not going to graph it quite yet though.
Back in Analyzer, deselect the two SPs, switch to the Storage Pool Tab, right-click on the array and choose Select All -> LUNs, then Select All -> Pools.
Click on a RAID Group LUN or Pool in the tree, it doesn’t matter which one, deselect Write Throughput (IO/s) and select FAST Cache Write Hits/s. In a moment, you’ll end up with a graph like this.
Click the clipboard icon again to copy this data and paste it into a new sheet of the same workbook in Excel. Insert a blank column between columns A and B, then in the new column B create a formula to add the values from column C through ZZ (i.e., =SUM(C2:ZZ2)).
Then copy that formula and paste it into every row of column B. This column will be our Total FAST Cache Write Hits for the whole array. Finally, click the header for Column B to select it, then copy (CTRL-C). Back on the first sheet, paste the “Values” (123 icon) into Column E.
Now that we have the Total Write IOPS and Total FAST Cache Write Hits in adjacent columns of the same worksheet, we can graph them together. Select both columns (D and E in my example), click Insert, and choose 2D Area Chart. You’ll get a nice little graph that looks something like the following.
Since it’s a 2D Area Chart, and not a stacked graph, the FASTCache Write IOPS are layered over the Total Write IOPS such that visually it shows the portion of total IOPS handled by FASTCache. Follow this same process again for Read Throughput and FASTCache Read Hits. Further manipulation in Excel will allow you to look at total IOPS (read and write) or drill down to individual Pools or RAID Group LUNs.
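If you would rather script the column summing than do it in Excel, the idea can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of the exact Analyzer export: it assumes the copied data is comma-separated with a header row and the timestamp in the first column, and the function names and sample values are hypothetical.

```python
import csv
import io


def total_per_sample(csv_text):
    """Sum every object column for each polling interval.

    Assumes the first row is a header and the first column is the timestamp.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    totals = []
    for row in reader:
        if not row:
            continue
        values = [float(cell) for cell in row[1:] if cell.strip()]
        totals.append((row[0], sum(values)))
    return totals


def fast_cache_share(total_write_iops, fast_cache_write_hits):
    """Fraction of write IOPS absorbed by FAST Cache for one interval."""
    if total_write_iops <= 0:
        return 0.0
    return fast_cache_write_hits / total_write_iops


# Hypothetical FAST Cache Write Hits/s export: two pools and one RAID Group LUN.
sample_csv = """Poll Time,Pool 0,Pool 1,LUN 12
10:00,400,250,50
10:05,500,300,100
"""
hits = total_per_sample(sample_csv)
print(hits)
```

The per-interval totals play the role of column B in the spreadsheet, and dividing them by the summed SP write throughput gives the same ratio the layered area chart shows visually.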
Another thing to note when looking at FASTCache stats… FAST Cache Misses are IOPS that were not handled by FASTCache, but they may still have been handled by SP Cache. So to get a feel for how many read IOs are actually hitting the disks, you’d want to subtract both SP Read Cache Hits and Total FASTCache Read Hits (calculated as in the above example) from SP Read Throughput. The same applies to write cache misses.
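As a small worked example of that subtraction, here is a Python sketch. The function name and the numbers are hypothetical; the result is clamped at zero as a guard against sampling noise, which is an assumption on my part rather than anything Analyzer does.

```python
def reads_reaching_disk(sp_read_iops, sp_read_cache_hits, fast_cache_read_hits):
    """Estimate read IOPS that actually reach the back-end disks.

    Reads served by SP Cache or FAST Cache are subtracted from the total.
    The result is clamped at zero in case the counters do not line up exactly.
    """
    remaining = sp_read_iops - sp_read_cache_hits - fast_cache_read_hits
    return max(0.0, float(remaining))


# Hypothetical interval: 1000 read IOPS, 150 SP Cache hits, 800 FAST Cache hits.
disk_reads = reads_reaching_disk(1000, 150, 800)
print(disk_reads)
```

In this made-up interval an 80% FAST Cache hit ratio leaves only 5% of reads going to disk, not 20%, because SP Cache absorbs most of the FAST Cache misses.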
I hope this helps you better understand your FASTCache workload. I’ll be working on FASTVP next, which is quite a bit more involved.
Making Lemonade from Lemons.
In the last post, we looked at the storage processor statistics to check for cache health, excessive queuing, and response time issues and found that SPA has some performance degradation which seems to be related to write IO. Now we need to drill down on the individual LUNs to see where that IO is being directed. This is done in the LUN tab of Analyzer. First, right click on the storage array itself in the left pane and choose deselect all -> items. Then click the LUN tab and right click on the top level of the tree “LUNs”, choose select all -> LUNs. Click on one of the LUNs to highlight it, then choose Write Throughput (IO/s) from the bottom pane. It may take a second for Analyzer to render the graph but you’ll end up with something like this…
You’ll quickly realize that this view doesn’t really help you figure out what’s going on. With many LUNs, there is simply too much data to display it this way. So click the clipboard button that has the I’s and O’s in it (next to the red arrow) to copy the graph data (in CSV format) into your desktop clipboard. Now launch Microsoft Excel, select cell A1 and type Ctrl-V to paste the data. It will look like the following image at first, with all LUNs statistics pasted into Column A.
Now we need to break out the various metrics into their own columns to make meaningful data, so go to the Data menu and click Text to Columns (see red arrow above). Select Delimited and click Next. Select ONLY comma as the delimiter, then click Next, Next, Finish. Excel will separate the data into many columns (one column per LUN). Next we’ll create a graph that can actually tell us something. First, click the triangle button at the upper left corner of the sheet to select all of the data in the sheet at once. Then click the area chart icon, select Area, then the Stacked Area icon (see red arrows below). Click OK.
You’ll get a nice little graph like this one below that is completely useless because the default chart has the X and Y axis reversed from what we need for Analyzer data.
To fix this, right click on the graph, choose “Select Data”, click the Switch row/column button, and click OK.
Now you have a useful graph like the one below. What we are seeing here is each band of color representing the Write IOPS for a particular LUN. You’ll note that about 6 LUNs have very thick bands, and the rest of the over 100 LUNs have very small bands. In this case, 6 LUNs are driving more than 50% of the total write IOPS on the array. Since the column header in the Excel sheet has the LUN data, you can mouse over the color band to see which LUN it represents.
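The “which few LUNs carry most of the load” question can also be answered numerically. The following Python sketch ranks LUNs by write IOPS and returns the smallest set that reaches a given share of the total; the function name, LUN names, and values are hypothetical and only illustrate the idea behind reading the thick bands off the stacked chart.

```python
def top_contributors(iops_by_lun, share=0.5):
    """Return the busiest LUNs that together reach `share` of total IOPS."""
    total = sum(iops_by_lun.values())
    if total <= 0:
        return []
    picked = []
    running = 0.0
    ranked = sorted(iops_by_lun.items(), key=lambda item: item[1], reverse=True)
    for lun_name, iops in ranked:
        picked.append(lun_name)
        running += iops
        if running / total >= share:
            break
    return picked


# Hypothetical average write IOPS per LUN.
write_iops = {"LUN_1": 500, "LUN_2": 300, "LUN_3": 100, "LUN_4": 100}
heavy = top_contributors(write_iops, share=0.75)
print(heavy)
```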
Now that you know where to look, you can go back to Analyzer, deselect all LUNs and drill down to the individual LUNs you need to look at. You may also want to look at the hosts that are using the busy LUNs to see what they are doing. In Analyzer, check the Write IO Size for the LUNs you are interested in and see if the size is in line with your expectations for the application involved. Very large IO sizes coupled with high IOPS (ie: high bandwidth) may cause write cache contention. In the case of this particular array, these 6 LUNs are VMFS datastores, and based on the Thin LUN space utilization and write IO loads, I would recommend that the customer convert them from Thin LUNs to Thick LUNs in the same Virtual Pool. Thick LUNs have better write performance and lower processor overhead compared with Thin LUNs and the amount of free space in these Thin LUNs is fairly small. This conversion can be done online with no host impact using LUN Migration.
You can use this copy/paste technique with Excel to graph all sorts of complex datasets from Analyzer that are pretty much not viewable with the default Analyzer graph. This process lets you select specific data or groups of metrics from a complete Analyzer archive and graph just the data you want, in the way you want to see it. There is also a way to do this as a bulk export/import, which can be scheduled too, and I’ll discuss that in the next post.
Disclaimer: Performance Analysis is an art, not a science. Every array is different, every application is different, and every environment has a different mix of both. These posts are an attempt to get you started in looking at what the array is doing and pointing you in a direction to go about addressing a problem. Keep in mind, a healthy array for one customer could be a poorly performing array for a different customer. It all comes down to application requirements and workload. Large block IO tends to have higher response times vs. small block IO for example. Sequential IO also has a smaller benefit from (and sometimes can be hindered by) cache. High IOPS and/or Bandwidth is not a problem, in fact it is proof that your array is doing work for you. But understanding where the high IOPS are coming from and whether a particular portion of the IO is a problem is important. You will not be able to read this series of posts and immediately dive in and resolve a performance problem on your array. But after reading these, I hope you will be more comfortable looking at how the system is performing and when users complain about a performance problem, you will know where to start looking. If you have a major performance issue and need help, open a case.
Starting from the top…
First let’s check the health of the front end processors and cache. The data for this is in the SP Tab which shows both of the SPs. The first thing I like to look at is the “SP Cache Dirty Pages (%)” but to make this data more meaningful we need to know what the write cache watermarks are set to. You can find this by right-clicking on the array object in the upper-left pane and choosing properties. The watermarks are shown in the SP Cache tab.
Once you note the watermarks, close the properties window and check the boxes for SPA and SPB. In the lower pane, deselect Utilization and choose SP Cache Dirty Pages (%).
Dirty pages are pages in write cache that have received new data from hosts, but have not been flushed to disk. Generally speaking you want to have a high percentage of dirty pages because it increases the chance of a read coming from cache or additional writes to the same block of data being absorbed by the cache. Any time an IO is served from cache, the performance is better than if the data had to be retrieved from disk. This is why the default watermarks are usually around 60/80% or 70/90%.
What you don’t want is for dirty pages to reach 100%. If the write cache is healthy, you will see the dirty pages value fluctuating between the high and low watermarks (as SPB is doing in the graph). Periodic spikes or drops outside the watermarks are fine, but repeatedly hitting 100% indicates that the write cache is being stressed (SPA is having this issue on this system). The storage system compensates for a full cache by briefly delaying host IO and going into a forced flushing state. Forced Flushes are high priority operations to get data moved out of cache and onto the back end disks to free up write cache for more writes. This WILL cause performance degradation. Sustained Large Block Write IO is a common culprit here.
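To make the “between the watermarks versus repeatedly at 100%” check concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function name, the 60/80 default watermarks (use the values from your own array properties), the sample data, and in particular the `repeat_limit` of 3 samples at 100% that I use to stand in for “repeatedly”.

```python
def summarize_dirty_pages(samples, low_watermark=60.0, high_watermark=80.0,
                          repeat_limit=3):
    """Summarize SP Cache Dirty Pages (%) samples for one SP."""
    within = sum(1 for pct in samples if low_watermark <= pct <= high_watermark)
    at_full = sum(1 for pct in samples if pct >= 100.0)
    return {
        "samples": len(samples),
        "within_watermarks": within,
        "at_100_pct": at_full,
        # repeat_limit is a hypothetical cutoff, not an EMC-defined value
        "write_cache_stressed": at_full >= repeat_limit,
    }


# Hypothetical samples: SPB hovers near the watermarks, SPA keeps hitting 100%.
spb = summarize_dirty_pages([62, 70, 78, 81, 65])
spa = summarize_dirty_pages([85, 100, 100, 92, 100, 100])
print(spb)
print(spa)
```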
While we’re here, deselect Dirty Pages (%) and select Utilization (%) and look for two things here:
1.) Is either SP running at a load of higher than 70%? This will increase application response time. Check whether the SPs seem to fluctuate with the business day. For non-disruptive upgrades, both SPs need to be under 50% utilization.
2.) Are the two SPs balanced? If one is much busier than the other that may be something to investigate.
Now look at Response time (ms) and make sure that, again, both SPs are relatively even, and that Response time is within reasonable levels. If you see that one SP has high utilization and response time but the other SP does not, there may be a LUN or set of LUNs owned by the busy SP that are consuming more array resources. Looking at Total Throughput and Total Bandwidth can help confirm this; then graph Read vs. Write Throughput and Bandwidth to see what the IO operations actually are. If both SPs have relatively similar throughput but one SP has much higher bandwidth, then there is likely some large block IO occurring that you may want to track down.
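The throughput versus bandwidth comparison boils down to average IO size: bandwidth divided by throughput. A minimal Python sketch follows, assuming bandwidth is in MB/s, throughput is in IO/s, and 1 MB is 1024 KB; the function name and the sample figures are hypothetical.

```python
def average_io_size_kb(bandwidth_mb_per_s, throughput_io_per_s):
    """Average IO size in KB, derived from bandwidth and throughput."""
    if throughput_io_per_s <= 0:
        return 0.0
    return bandwidth_mb_per_s * 1024.0 / throughput_io_per_s


# Hypothetical SPs with similar throughput but very different bandwidth.
spa_size = average_io_size_kb(200, 3200)  # large block IO
spb_size = average_io_size_kb(25, 3200)   # small block IO
print(spa_size, spb_size)
```

With similar IOPS on both SPs, the one moving eight times the bandwidth is handling IOs that are eight times larger on average, which is the large block signature described above.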
As an example, I’ve now seen two different customers where a Microsoft Sharepoint server running in a virtual machine (on a VMFS datastore) had a stuck process that caused SQL to drive nearly 200MB/sec of disk bandwidth to the backend array. Not enough to cause huge issues, but enough to overdrive the disks in that LUN’s RAID Group, increasing queue length on the disks and SP, which in turn increased SP utilization and response time on the array. This increased response time affected other applications unrelated to Sharepoint.
Next, let’s check the Port Queue Full Count. This is the number of times that a front end port issued a QFULL response back to the hosts. If you are seeing QFULLs there are two possible causes. One is that the Queue Depth on the HBA is too large for the LUNs being accessed. Each LUN on the array has a maximum queue depth that is calculated using a formula based on the number of data disks in the RAID Group. For example, a RAID5 4+1 LUN will have a queue depth of 88. Assuming your HBA queue depth is 64 then you won’t have a problem. However, if the LUN is used in a cluster file system (Oracle ASM, VMWare VMFS, etc) where multiple hosts are accessing the LUN simultaneously, you could run into problems here. Reducing the HBA Queue Depth on the hosts will alleviate this issue.
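To see why a cluster changes the picture, here is a small Python sketch. The post only gives the result (88 for a RAID5 4+1 LUN); the formula used below, (14 x data disks) + 32, is the commonly cited one for these arrays and reproduces that number, but treat it as an assumption and check it against the documentation for your FLARE release. The function names are hypothetical.

```python
def lun_queue_depth(data_disks):
    """Assumed LUN queue depth formula: (14 x data disks) + 32."""
    return 14 * data_disks + 32


def hosts_can_overrun_lun(hosts, hba_queue_depth, data_disks):
    """True if the hosts together could queue more IOs than the LUN accepts."""
    return hosts * hba_queue_depth > lun_queue_depth(data_disks)


raid5_4_plus_1 = lun_queue_depth(4)                      # 4 data disks
single_host = hosts_can_overrun_lun(1, 64, 4)            # one host, HBA depth 64
two_host_cluster = hosts_can_overrun_lun(2, 64, 4)       # two hosts sharing the LUN
print(raid5_4_plus_1, single_host, two_host_cluster)
```

A single host with an HBA queue depth of 64 stays under the LUN limit, but two hosts sharing the same LUN can together queue 128 IOs, which is more than 88.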
The second cause is when there are many hosts accessing the same front end ports and the HBA Execution Throttle is too large on those hosts. A Clariion/VNX front end port has a queue depth of 1600 which is the maximum number of simultaneous IO’s that port can process. If there are 1600 IOs in queue and another IO is issued, the port responds with QFULL. The host HBA responds by lowering its own Queue Depth (per LUN) to 1 and then gradually increasing the queue depth over time back to normal. An example situation might be 10 hosts, all driving lots of IO, with HBA Execution Throttle set to 255. It’s possible that those ten hosts can send a total of 2550 IOs simultaneously. If they are all driving that IO to the same front end port, that will flood the port queue. Reducing the HBA Execution throttle on the hosts will alleviate this issue.
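The port-level arithmetic from that example can be written out as a quick Python sketch. The 1600 port queue depth and the 10 hosts at an Execution Throttle of 255 come from the text above; the function names are hypothetical, and the “safe” throttle is just a worst-case division that assumes every host sends its full queue to the same port.

```python
PORT_QUEUE_DEPTH = 1600  # front end port queue depth quoted in the post


def worst_case_outstanding(hosts, execution_throttle):
    """Maximum IOs the hosts could have in flight to one port at once."""
    return hosts * execution_throttle


def max_throttle_without_qfull(hosts, port_queue_depth=PORT_QUEUE_DEPTH):
    """Largest per-host throttle that cannot overflow the port queue."""
    return port_queue_depth // hosts


outstanding = worst_case_outstanding(10, 255)
overflows = outstanding > PORT_QUEUE_DEPTH
safe_throttle = max_throttle_without_qfull(10)
print(outstanding, overflows, safe_throttle)
```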
Looking at the Port Throughput, you can see here that 2 ports are driving the majority of the workload. This isn’t necessarily a problem by itself, but PowerPath could help spread the load across the ports which could potentially improve performance.
In VMWare environments specifically, it is very common to see many hosts all accessing many LUNs over only 1 or 2 paths even though there may be 4 or 8 paths available. This is due to the default path selection (lowest port) on boot. This could increase the chances of a QFULL problem as mentioned above or possibly exceeding the available bandwidth of the ports. You can manually change the paths on each LUN on each host in a VMWare cluster to balance the load, or use Round-Robin load balancing. PowerPath/VE automatically load balances the IO across all active paths with zero management overhead.
Another thing to look for is an imbalance of IO or Bandwidth on the processors. Look specifically at Write Throughput and Write Bandwidth first as writes have the most impact on the storage system and more specifically the write cache. As you can see in this graph, SPA is processing a fair bit more write IOPS compared to SPB. This correlates with the high Dirty Pages and Response Time on SPA in the previous graphs.
So we’ve identified that there is performance degradation on SPA and that it is probably related to Write IO. The next step is to dig down and find out if there are specific LUNs causing the high write load and see if those could be causing the high response times.
Okay, so you’ve got the Analyzer enabler on your array and enabled logging, and you’ve installed Unisphere Server, Unisphere Client, and Microsoft Excel on your workstation. Next step is to download a NAR file from the array. In Navisphere, right click on the array, go to the Analyzer menu and retrieve an archive. You can get the archive from either SP of the array; both have the same data. You will eventually see multiple NAR files, each covering some period of time. Retrieve the one for the period of time you want to look at. You can also merge multiple files together to get larger time periods into a single Analyzer session. In Unisphere, the process is essentially the same: select the array, then go to Monitoring -> Analyzer.
You’ve got your workstation set up and you have a NAR file downloaded to your workstation. Let’s get to it. Launch Unisphere Client from the Start Menu and connect to “localhost” when prompted. Login to Unisphere. You’ll see something like this…
In the drop down menu change to the “Unisphere Server – 127.0.0.1” which will change the main screen to Event Notification most likely. Click on Monitoring, then Analyzer.
Let’s set some defaults before we open a NAR file.
- In the left pane, click Customize Charts
- In the General Tab, check the Advanced box so we can see more detailed metrics in Analyzer
- In the Archive Tab, under Analyzer, select Performance Detail and make sure Initially Check All Tree Objects is unchecked.
- Click OK to save.
In the right pane, click on Open Archive , browse to the NAR file you want to view and open.
Because the NAR file can contain many hours (sometimes multiple days) of performance data, you will be prompted to set a time range. The default times will show all data available in the archive. If you want to narrow down to a smaller time range, change the Graph Start and End times, otherwise just click OK.
The Performance Detail window will launch and the LUN tab will be selected. No items should be selected and as such no data will be graphed.
My personal methodology is to take a top-down approach when it comes to performance analysis and troubleshooting.
- Check the SP’s, Cache, and SP Ports for obvious issues. If a user is complaining of poor performance the Cache is usually the first place I look.
- Drill down to RAID Groups, Pools, and LUNs to find the culprits
- Drill down to the physical disk level if necessary
- Export data to Excel for better graphs that make it easier to see what’s happening
- Do you have an application owner complaining about performance?
- Do you want to get a general idea of how your array is performing?
- Do you want to turn this.. into this..?
I’ve been doing a lot of performance analysis with EMC Clariion CX3, CX4, and VNX storage recently and have a sort of an informal methodology I follow. I’ve had a couple customers ask me to show them how to get useful data and graphs from their arrays and more recently after posting about FASTCache and FASTVP results I’ve had even more queries on the topic. So I’ve decided to put together a sort of how-to guide. It will take several posts to go through the whole process, so this first post will focus on making sure you have the right tools.
First, you MUST have the Navisphere/Unisphere Analyzer enabler on the storage array. If you don’t have it, all you can really do is send an encrypted archive to EMC for help when you have a performance problem. Analyzer is an indispensable performance analysis tool for CX/VNX systems and is really quite powerful. Unfortunately, many customers don’t see the value during the purchase process but end up needing it someday in the future. Make sure Analyzer is included in EVERY array purchase.
If you haven’t already, you also need to enable Statistics on the array AND in more recent versions of FLARE you need to enable Archive Logging. Statistics logging is enabled in the array properties dialog, shown here…
Archive Logging is enabled in the Monitoring -> Analyzer -> Data Logging dialog, shown here…
In practice, 5 minutes is a good interval for archives. Also make sure that periodic archiving is enabled, which will generate a new NAR file every so often (it depends on the interval).
Next, you need an Analyzer workstation. You can run Analyzer directly off an array through Navisphere Manager or Unisphere but I prefer installing the software directly on my PC. It lets me work on the analysis from home or anywhere else, and since I look at data from many different customers’ arrays it’s easier. You can download the latest version of Unisphere Server and Unisphere Client directly from PowerLink (Home > Support > Software Downloads and Licensing > Downloads T-Z > Unisphere Server Software). Once you install both, you can launch the client and log in to your local Unisphere server. You can then open Analyzer archive files (NAR files) from any array for analysis.
Third, you need a graphing tool. I currently use Microsoft Excel 2010 on the same workstation as my Unisphere installation, which happens to be my corporate laptop. While Analyzer does graph the data you select, there is only one type of graph available and sometimes when many objects are being graphed together it’s almost impossible to actually compare them to each other.
Another reason to use Excel is that while Analyzer has a wealth of different statistics available for all sorts of array objects, there are some exceptions right now. For example, if you are using newer features such as FASTCache or FASTVP on your array and want to see statistics for those technologies, there is not much in Analyzer to see. I’ll go through some methods for teasing that data out as well.
Part 1 — Go to Part 2 >>
I have a customer who just recently upgraded their EMC Celerra NS480 Unified Storage Array (based on Clariion CX4-480) to FLARE30 and enabled FASTCache across the array, as well as FASTVP automated tiering for a large amount of their block data. Now that it’s been configured and the customer has performed a large amount of non-disruptive migrations of data from older RAID groups and VP pools into the newer FASTVP pool, including thick-to-thin conversions, I was able to get some performance data from their array and thought I’d share these results.
This is Real-World data
This is NOT some edge case where the customer’s workload is perfect for FASTCache and FASTVP and it’s also NOT a crazy configuration that would cost an arm and a leg. This is a real production system running in a customer datacenter, with a few EFDs split between FASTCache and FASTVP and some SATA to augment capacity in the pool for their existing FC based LUNS. These are REAL results that show how FASTVP has distributed the IO workload across all available disks and how a relatively small amount of FASTCache is absorbing a decent percentage of the total array workload.
This NS480 array has nearly 480 drives in total and has approximately 28TB of block data (I only counted consumed data on the thin LUNs) and about 100TB of NAS data. Out of the 28TB of block LUNs, 20TB is in Virtual Pools, 14TB of which is in a single FASTVP Pool. This array supports the customer’s ERP application, entire VMWare environment, SQL databases, and NAS shares simultaneously.
In this case FASTCache has been configured with just 183GB of usable capacity (4 x 100GB EFD disks) for the entire storage array (128TB of data) and is enabled for all LUNs and Pools. The graphs here are from a 4 hour window of time after the very FIRST FASTVP re-allocation completed using only about 1 day’s worth of statistics. Subsequent re-allocations in the FASTVP pool will tune the array even more.
First, let’s take a look at the array as a whole. Here you can see that the array is processing approximately 10,000 IOPS through the entire interval.
FASTCache is handling about 25% of the entire workload with just 4 disks. I didn’t graph it here but the total array IO Response time through this window is averaging 2.5 ms. The pools and RAID Groups on this array are almost all RAID5 and the read/write ratio averages 60/40 which is a bit write heavy for RAID5 environments, generally speaking.
If you’ve done any reading about EMC FASTCache, you probably know that it is a read/write cache. Let’s take a look at the write load of the array and see how much of that write load FASTCache is handling. In the following graph you can see that out of the ~10,000 total IOPS, the array is averaging about 2500-3500 write IOPS with FASTCache handling about 1500 of that total.
That means FASTCache is reducing the back-end writes to disk by about 50% on this system. On the NS480/CX4-480, FASTCache can be configured with up to 800GB usable capacity, so this array could see higher overall performance if needed by augmenting FASTCache further. Installing and upgrading FASTCache is non-disruptive so you can start with a small amount and upgrade later if needed.
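The “about 50%” figure is simply the FASTCache write hits divided by total write IOPS. Because total writes range from roughly 2500 to 3500, the share moves with the load; the short Python sketch below works it out from the numbers quoted above. The function name is hypothetical and the midpoint of 3000 is my own simplification.

```python
def fast_cache_write_share(fast_cache_write_iops, total_write_iops):
    """Fraction of write IOPS handled by FAST Cache."""
    return fast_cache_write_iops / total_write_iops


share_at_peak = fast_cache_write_share(1500, 3500)      # busiest intervals
share_at_trough = fast_cache_write_share(1500, 2500)    # quietest intervals
share_at_midpoint = fast_cache_write_share(1500, 3000)  # assumed midpoint
print(round(share_at_peak, 2), round(share_at_trough, 2), share_at_midpoint)
```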
FASTVP and FASTCache Together
Next, we’ll drill down to the FASTVP pool which contains 190 total disks (5 x EFD, 170 x FC, and 15 x SATA). There is no maximum number of drives in a Virtual Pool on FLARE30 so this pool could easily be much larger if desired. I’ve graphed the IOPS-per-tier as well as the FASTCache IOPS associated with just this pool in a stacked graph to give an idea of total throughput for the pool as well as the individual tiers.
The pool is servicing between 5,000 and 8,000 IOPS on average which is about half of the total array workload. In case you didn’t already know, FASTVP and FASTCache work together to make sure that data is not duplicated in EFDs. If data has been promoted to the EFD tier in a pool, it will not be promoted to FASTCache, and vice versa. As a result of this intelligence, FASTCache acceleration is additive to an EFD-enabled FASTVP pool. Here you can see that the EFD tier and FASTCache combined are servicing about 25-40% of the total workload, the FC tier another 40-50%, and the SATA tier services the remaining IOPS. Keep in mind that FASTCache is accelerating IO for other Pools and RAID Group LUNs in addition to this one, so it’s not dedicated to just this pool (although that is configurable).
FASTVP IO Distribution
Lastly, to illustrate FASTVP’s effect on IO distribution at the physical disk layer, I’ve broken down IOPS-per-spindle-per-tier for this pool as well. You can see that the FC disks are servicing relatively low IO and have plenty of headroom available while the EFD disks, also not being stretched to their limits, are servicing vastly more IOPS per spindle, as expected. The other thing you may have noticed here is that the EFDs are seeing the majority of the workload’s volatility, while the FC and SATA disks have a pretty flat workload over time. This illustrates that FASTVP has placed the more bursty workloads on EFD where they can be serviced more effectively.
Hopefully you can see here how a very small amount of EFDs used with both FASTCache and FASTVP can relieve a significant portion of the workload from the rest of the disks. FASTCache on this system adds up to only 0.14% of the total data set size and the EFD tier in the FASTVP pool only accounts for 2.6% of the total dataset in that pool.
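For anyone who wants to check the FASTCache percentage, here is a minimal sketch. It assumes the "total data set" is the roughly 30TB of block data plus 100TB of NAS data described at the start of the post, and that decimal units (1TB = 1000GB) are close enough; the helper name is made up for the example.

```python
def percent_of(part_gb, whole_tb, gb_per_tb=1000):
    """Express part_gb as a percentage of whole_tb."""
    return part_gb / (whole_tb * gb_per_tb) * 100

fastcache_gb = 183      # usable FASTCache capacity
dataset_tb = 30 + 100   # assumed: block data + NAS data on the array
print(f"{percent_of(fastcache_gb, dataset_tb):.2f}%")  # prints 0.14%
```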
What do you think of these results? Have you added FASTCache and/or FASTVP to your array? If so, what were your results?
It’s been a little while since I’ve posted, mostly due to my life being turned on its rear after our first child was born 8 weeks ago. As things start to settle into a rhythm (as much as is possible) I’ve been back online more, reading blogs, following Twitter, and working with customers regularly. As some of you may know, EMC announced support for pNFS in Celerra with the release of DART 6.x and there have been several recent posts about the technology which piqued my interest a little.
- Chuck Hollis – I Want My pNFS
- Chuck Hollis – More on pNFS
- Storagebod – Deja Vu
- Chad Sakac – pNFS – it’s here! (Almost!)
- Steve Foskett – Is NFSv3 really that bad?
- Storagezilla – NFSv4 vs NFSv4? FIGHT!
The other bloggers have done a good job of describing what pNFS is and what is new in NFS4.1 itself so I won’t repeat all of that. I want to focus specifically on pNFS and why it IS a big deal.
Prior to my coming to work for EMC, I worked in internal IT at a company that deals with large binary files in support of product development, as well as video editing for marketing purposes. I had a chance to evaluate, implement, and support multiple clustered file system technologies. The first was for an HD video editing solution using Macs and we followed the likely path of implementing Apple’s XSAN solution which you may know is an OEM’d version of Quantum (ADIC) StorNext. StorNext allows you to create large filesystems across many disks and access them as local disk on many clients. File open, close, byte-range locking, etc. are handled by MetaData Controllers (MDCs) across an IP network while the actual heavy lifting of read/write IO is done over FibreChannel from the clients to the storage directly. All the shared filesystem benefits of NAS with the performance benefits of SAN.
The second project was specifically targeted at moving large files (4+GB each) through a workflow across many computers as quickly as possible so we could ship products. Faster processing of the workflow translated to more completed projects per person/per day which meant better margins and keeping our partners and customers happy. The workflow was already established, using Windows based computers and a file server. The file server was running out of steam and the amount of data being stored at any given time had increased from 500GB to 8TB over the past 12 months. We needed a simple way to increase the performance of the file server and also allow for better scalability. Working with our local EMC SE, we tested and deployed MPFSi using a Celerra NS40 with integrated storage.
MPFS has been around a long time (also known as High Road) and works with Windows and various *nix based platforms. It is similar to XSAN/StorNext in that open/close/locking activity is handled over IP by the metadata controller (the Celerra datamover in the case of MPFS) while the read/write IO is handled over block storage technology (MPFS supports FibreChannel and iSCSI connectivity to storage). The advantage of MPFS over many other solutions is that the metadata controller and storage are all built-in to the EMC Celerra storage device and you don’t have to deploy any other servers.
In our case we chose iSCSI due to the cost of FC (switches and HBAs) and used the GigE ports on the Celerra’s CX3 backend for block connectivity. In testing we showed that CIFS alone provided approximately 240Mbps of throughput over GigE connections while enabling MPFSi netted about 750Mbps, even if we used the same NIC on the client. So we tripled throughput over the same LAN by installing a software client. Had we gone the extra mile to deploy FibreChannel for the block IO we would have seen much higher throughput.
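To put those numbers in terms of the workflow, here is a quick back-of-the-envelope sketch in Python. It reads the throughput figures as megabits per second, uses decimal units, takes a 4GB file as the example size, and ignores protocol overhead, so treat the results as rough estimates rather than measurements.

```python
def transfer_seconds(file_gb, throughput_mbps):
    """Seconds to move file_gb gigabytes at throughput_mbps megabits per second."""
    megabits = file_gb * 8 * 1000  # GB -> megabits, decimal units
    return megabits / throughput_mbps

cifs, mpfsi = 240, 750  # measured throughput from our testing
print(f"speedup: {mpfsi / cifs:.1f}x")
for name, rate in (("CIFS", cifs), ("MPFSi", mpfsi)):
    print(f"{name}: {transfer_seconds(4, rate):.0f} s per 4GB file")
# prints:
# speedup: 3.1x
# CIFS: 133 s per 4GB file
# MPFSi: 43 s per 4GB file
```

Multiply that difference by every hop a file makes through the workflow and it adds up quickly.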
Even better, the use of MPFS did not preclude the use of NDMP for backup to tape directly from the Celerra, accelerating backup many times over the old fileserver. Clients that did not have MPFS software installed accessed the same files over traditional CIFS with no problems. Another side benefit of MPFS over traditional CIFS is that the block I/O stack is much more efficient than the NAS I/O stack, so even with increased throughput, CPU utilization is lower on the client, returning cycles to the application which is doing work for your business.
There are many clustered file system / clustered NAS solutions on the market from a variety of vendors (StorNext, MPFS, GFS, Polyserve, etc.) and most of these products are trying to solve the same basic problems of storing more data and increasing performance. The problem is they are all proprietary and because of that you end up with multiple solutions deployed in the same company. In our case we couldn’t use MPFS for the video editing solution because EMC has not provided a client for Mac OSX. And this is where pNFS really becomes attractive. Storage vendors and operating system vendors alike will be upgrading the already ubiquitous NFS stack in their code to support NFS4.1 and pNFS. And that support means that I could deploy an EMC Celerra MPFS-like solution using the same Celerra based storage, with no extra servers, and no special client software, just the native NFS client in my operating system of choice. Perhaps Apple will include a pNFS capable client in a future version of Mac OSX.
If you look at the pNFS standard you’ll see that it supports the use of not only block storage, but object and file based storage as well. So as we build out larger and larger environments and private clouds start to expand into public clouds you could tier your pNFS data across FibreChannel storage, object storage (think Atmos on premises), as well as out to a service provider cloud (e.g. AT&T Synaptic). Now you’ve dramatically increased performance for the data that needs it, saved money storing the data that you need to keep long term, and geographically dispersed the data that needs to be close to users, with a single protocol supported by most of the industry and a single point of management.
Personally I think pNFS could kill off proprietary solutions over the long run unless they include support for it in their products.
This is just my opinion of course…
A comment about HDS’s Zero Page Reclaim on one of my previous posts got me thinking about the effectiveness of thin provisioning in general. In that previous post, I talked about the trade-offs between increased storage utilization through the use of thin-provisioning and the potential performance problems associated with it.
There are intrinsic benefits that come with the use of thin provisioning. First, new storage can be provisioned for applications without nearly as much planning. Next, application owners get what they want, while storage admins can show they are utilizing the storage systems effectively. Also, rather than managing the growth of data in individual applications, storage admins are able to manage the growth of data across the enterprise as a whole.
Thin provisioning can also provide performance benefits… For example, consider a set of virtual Windows servers running across several LUNs contained in the same RAID group. Each Windows VM stores its OS files in the first few GB of its VMDK file. Each VMDK file is stored in order in each LUN, with some free space at the end. In essence, we have a whole bunch of OS sections separated by gaps of no data. If all VMs were booting at approximately the same time, the disk heads would have to move continuously across the entire disk, increasing disk latency.
Now take the same disks, configured as a thin pool, and create the same LUNs (as thin LUNs) and the same VMs. Because thin-provisioning in general only writes data to the physical disks as it’s being written by the application, starting from the beginning of the disk, all of those Windows VMs’ OS files will be placed at the beginning of the disks. This increased data locality will reduce IO latency across all of the VMs. The effect is probably minor, but reduced disk latency translates to possibly higher IOPS from the same set of physical disks. And the only change is the use of thin-provisioning.
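To make the locality argument concrete, here is a toy model in Python. It is only a sketch: the VM count and sizes are made up, it assumes one VM per LUN, and it ignores RAID striping, allocation granularity, and anything written to the pool between OS installs. It simply compares how much of the address space the heads have to cover when every VM reads its OS files.

```python
def boot_span_gb(num_vms, os_gb, lun_gb, thin):
    """Distance in GB between the first and last OS block touched at boot."""
    if thin:
        # Thin pool: only written data is allocated, so OS regions sit back to back.
        return num_vms * os_gb
    # Thick LUNs: each OS region starts at its own LUN offset, followed by a gap.
    return (num_vms - 1) * lun_gb + os_gb

# Hypothetical example: 10 VMs, 8GB of OS files each, 100GB LUNs.
thick = boot_span_gb(10, 8, 100, thin=False)
thin = boot_span_gb(10, 8, 100, thin=True)
print(f"thick: {thick}GB span, thin: {thin}GB span")
# prints: thick: 908GB span, thin: 80GB span
```

In this made-up case the same boot storm is confined to less than a tenth of the address range, which is the effect described above.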
So back to HDS Zero Page Reclaim. The biggest problem with thin provisioning is that it doesn’t stay thin for long. Windows NTFS, for example, is particularly NOT thin-friendly since it favors previously untouched disk space for new writes rather than overwriting deleted files. This activity eventually causes a thin LUN to grow to its maximum size over time, even though the actual amount of data stored in the LUN may not change. And Windows isn’t the only one with the problem. This means that thin provisioning may make provisioning easier, or possibly improve IO latency, but it might not actually save you any money on disk. This is where HDS’s Zero Page Reclaim can help. Hitachi’s Dynamic Provisioning (with ZPR) can scan a LUN for sections where all the bytes are zero and reclaim that space for other thin LUNs. This is particularly useful for converting thick LUNs into thin LUNs. But, it can only see blocks of zeros, and so it won’t necessarily see space freed up by deleting files. Hitachi’s own documentation points out that many file systems are not thin-friendly, and ZPR won’t help with long-term growth of thin LUNs caused by actively writing and then deleting data.
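The zero-detection idea is simple enough to sketch in a few lines of Python. This is not Hitachi’s implementation, just an illustration of the concept; the function name and the tiny 4-byte page size are made up to keep the example readable.

```python
def find_zero_pages(data, page_size):
    """Return the indexes of pages in data whose bytes are all zero."""
    pages = []
    for offset in range(0, len(data), page_size):
        page = data[offset:offset + page_size]
        if not any(page):  # any() is False only when every byte is 0
            pages.append(offset // page_size)
    return pages

# Four pages: never written, live data, never written, deleted-but-not-zeroed.
lun = bytes(4) + b"mail" + bytes(4) + b"\x00\x00\x00\x01"
print(find_zero_pages(lun, 4))  # prints [0, 2]
```

Note that the last page holds leftover bytes from a deleted file, so a scan like this cannot reclaim it. That is exactly the limitation with file systems that delete without zeroing.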
Although there are ways to script the writing of zeros to free space on a server so that ZPR can reclaim that space, you would need to run that script on all of your servers, requiring a unique tool for each operating system in your environment. The script would also have to run periodically, since the file system will grow again afterward.
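For illustration, a zero-fill script along those lines might look something like the Python sketch below. Everything here is an assumption for the example rather than a vendor tool: the function name, the temporary file name, and the headroom parameter, which leaves some free space so running applications are not starved while the script runs. File systems that compress data or create sparse files may never write the zeros to disk at all, so test before relying on anything like this.

```python
import os
import shutil

def zero_fill_free_space(directory, headroom_bytes, chunk_bytes=1024 * 1024, max_bytes=None):
    """Fill free space with a zero-filled temp file, then delete it. Returns bytes written."""
    free = shutil.disk_usage(directory).free
    target = max(free - headroom_bytes, 0)  # leave headroom for running applications
    if max_bytes is not None:
        target = min(target, max_bytes)     # optional cap, handy for testing
    path = os.path.join(directory, "zerofill.tmp")  # hypothetical file name
    zeros = bytes(chunk_bytes)
    written = 0
    try:
        with open(path, "wb") as f:
            while written < target:
                n = min(chunk_bytes, target - written)
                f.write(zeros[:n])
                written += n
            f.flush()
            os.fsync(f.fileno())  # push the zeros down to the storage
    finally:
        if os.path.exists(path):
            os.remove(path)       # hand the now-zeroed space back to the file system
    return written
```

You would still need to schedule something like this on every server, and find an equivalent for each operating system, which is the management burden described above.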
NetApp’s SnapDrive tool for Windows can scan an NTFS file system, detect deleted files, then report the associated blocks back to the Filer to be added back to the aggregate for use by other volumes/LUNs. The Space Reclamation scan can be run as needed, and I believe it can be scheduled; but, it appears to be Windows only. Again, this will have to be done periodically.
But what if you could solve the problem across most or all of your systems, regardless of operating system, regardless of application, with real-time reclamation? And what if you could simultaneously solve other problems? Enter Symantec’s Storage Foundation with Thin-Reclamation API. Storage Foundation consists of VxFS, VxVM, DMP, and some other tools that together provide dynamic grow/shrink, snapshots, replication, thin-friendly volume usage, and dynamic SAN multipathing across multiple operating systems. Storage Foundation’s Thin-Reclamation API is to thin-provisioning what OST is to Backup Deduplication. Storage vendors can now add near-real-time zero page reclaim for customers that are willing to deploy VxFS/VxVM on their servers. For EMC customers, DMP can replace PowerPath, thereby offsetting the cost.
As far as I know, 3PAR is the first and only storage vendor to write to Symantec’s thin-API, which means they now have the most dynamic, non-disruptive, zero-page-reclaim feature set on the market. As a storage engineer myself, I have often wondered if VxVM/VxFS could make management of application data storage in our diverse environment easier and more dynamic. Adding Thin-Reclamation to the mix makes it even more attractive. I’d like to see more storage vendors follow 3PAR’s lead and write to Symantec’s API. I’d also like to see Symantec open up both OST and the Thin-Reclamation API for others to use, but I doubt that will happen.