Your NetApp DFM (OnCommand Core) Performance Advisor is lying to you… It doesn’t want to. And it doesn’t mean to. But it is. This was an issue that a customer came to me with yesterday.
If you asked your NetApp or DFM, “Hey dude, what’s my CPU utilization?” DFM may reply back, “Ohh man, you’re screwed, it’s 95%!” You instantly crap yourself since you just paid big money for all this stuff, which should last for years. Don’t worry, more than likely you’re good.
Why is it lying to you?
First, we need to set things straight about what “counters” are.
Counters are performance statistics about the system. You can get deep, very very deep, into what NetApp will tell you. Sometimes it’s information overload. You have dozens of statistics for each and every object, such as LUNs, volumes, and CPUs.
For example, a single CPU on an 8-core SAN controller tracks all of this data nonstop, and DFM stores it at intervals of a minute or less:
    processor:processor0:processor_busy:4%
    processor:processor0:domain_busy.idle:96%
    processor:processor0:domain_busy.kahuna:0%
    processor:processor0:domain_busy.storage:0%
    processor:processor0:domain_busy.exempt:0%
    processor:processor0:domain_busy.raid:0%
    processor:processor0:domain_busy.target:0%
    processor:processor0:domain_busy.netcache:0%
    processor:processor0:domain_busy.netcache2:0%
    processor:processor0:domain_busy.cifs:0%
    processor:processor0:domain_busy.wafl_exempt:0%
    processor:processor0:domain_busy.wafl_xcleaner:0%
    processor:processor0:domain_busy.sm_exempt:0%
    processor:processor0:domain_busy.cluster:0%
    processor:processor0:domain_busy.protocol:0%
    processor:processor0:domain_busy.nwk_exclusive:0%
    processor:processor0:domain_busy.nwk_exempt:0%
    processor:processor0:domain_busy.nwk_legacy:0%
    processor:processor0:domain_busy.nwk_ctx1:0%
    processor:processor0:domain_busy.nwk_ctx2:0%
    processor:processor0:domain_busy.nwk_ctx3:0%
    processor:processor0:domain_busy.nwk_ctx4:0%
    processor:processor0:domain_busy.hostOS:1%
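If you want to slice a dump like that yourself, each line is just object:instance:counter:value, so it splits cleanly on the colons. Here is a minimal PowerShell sketch, assuming you have already captured the lines as text. The four sample lines are copied from the listing above, and the property names (Object, Instance, Counter, Percent) are made up for illustration.

```powershell
# Sample lines copied from the listing above; a real run would read them from captured output.
$lines = @(
    'processor:processor0:processor_busy:4%',
    'processor:processor0:domain_busy.idle:96%',
    'processor:processor0:domain_busy.kahuna:0%',
    'processor:processor0:domain_busy.hostOS:1%'
)

# Split each "object:instance:counter:value" line into its four fields.
$counters = foreach ($line in $lines) {
    $object, $instance, $counter, $value = $line.Split(':')
    [pscustomobject]@{
        Object   = $object
        Instance = $instance
        Counter  = $counter
        Percent  = [int]$value.TrimEnd('%')
    }
}

# Keep only the domains that are doing something other than sitting idle.
$busy = $counters | Where-Object {
    $_.Counter -like 'domain_busy.*' -and
    $_.Counter -ne 'domain_busy.idle' -and
    $_.Percent -gt 0
}
$busy
```

With the sample lines above, the only non-idle domain left after filtering is hostOS.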
That’s a lot of data. Now imagine how much DFM/OnCommand Core/Operations Manager and the rest actually store. Wow.
Notice the “domain_busy” up there. WTF. Well, if you don’t know what domains refer to above, they are groups of work inside Data ONTAP (kahuna, raid, storage, and so on) that get scheduled across the CPUs, not the CPUs themselves.
If you were to ask Operations Manager or DFM Performance Advisor “HAL, show me my CPU utilization for the last 3 months!”, what it would actually show you is one specific counter, ‘system:system:cpu_busy’.
This is the same counter you see on the filer’s command prompt as the CPU column in sysstat. Look below. Notice the CPU value in sysstat 1: 43-58%. Really? I don’t think so. Now, let’s look at the actual per-processor stats with sysstat -m 1. Wtf? Why is ANY nothing like AVG or the real usage?
    netapp1*> sysstat 1
     CPU    NFS   CIFS   HTTP       Net kB/s       Disk kB/s   Tape kB/s Cache
                                   in    out    read   write  read write   age
     43%     34      0      0     512     26   76204   50960     0     0    2s
     46%    190      0      0   11031    354   73632      24     0     0    2s
     55%    124      0      0     539   6492  124696  191992     0     0    3s
     46%    162      0      0     527   6249   80700   84700     0     0   31s
     58%     73      0      0     264   1904   70816   40448     0     0    7s
    netapp1*> sysstat -m 1
     ANY  AVG CPU0 CPU1 CPU2 CPU3
     58%  22%  16%  18%  15%  39%
     56%  21%  15%  16%  13%  38%
     74%  36%  30%  27%  35%  50%
     62%  25%  18%  17%  21%  43%
     49%  19%  13%  15%  22%  26%
     54%  18%  13%  10%  11%  40%
AVG is actually more like what we want. That’s the REAL average of the CPUs. NetApp chooses to be a big ol’ pessimist, and for ANY it uses the system:cpu_busy metric.
cpu_busy: Percentage of time one or more processors is busy in the system
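You can check the “real average” claim with nothing more than arithmetic. The sketch below takes three rows from the sysstat -m 1 output above, averages the per-CPU columns, and shows how far ANY sits above that mean. The variable and property names are only illustrative. Because sysstat rounds every column it prints, expect a recomputed mean to land within a point or so of the AVG column rather than matching it exactly on every row.

```powershell
# Rows copied from the sysstat -m 1 output above: ANY, AVG, and CPU0..CPU3 (all percent).
$rows = @(
    @{ Any = 58; Avg = 22; Cpu = 16, 18, 15, 39 },
    @{ Any = 74; Avg = 36; Cpu = 30, 27, 35, 50 },
    @{ Any = 49; Avg = 19; Cpu = 13, 15, 22, 26 }
)

$results = foreach ($row in $rows) {
    # Plain arithmetic mean of the per-CPU values.
    $mean = ($row.Cpu | Measure-Object -Average).Average
    [pscustomobject]@{
        Any          = $row.Any
        ReportedAvg  = $row.Avg
        ComputedMean = $mean
        Gap          = $row.Any - $mean   # how far ANY overshoots the mean
    }
}
$results | Format-Table -AutoSize
```

In these three rows ANY runs 30 or more points above the mean of the CPUs, which is the whole problem in a nutshell.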
Note: For systems running Data ONTAP 7.2 or earlier, the cpu_busy counter is the amount of time that any one CPU is busy. This results in a value for cpu_busy that is inflated. For systems running Data ONTAP 7.2.1 or later, the cpu_busy counter is the greater of either average CPU utilization or the busiest domain.
The BUSIEST DOMAIN… so if you have 1 CPU or 1 domain flipping out, your metrics will be screwed. Just for the record, I have NEVER seen ANY come in anywhere near the AVG on a large or highly utilized filer.
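To make that note concrete, here is a small sketch of the rule as the documentation words it for Data ONTAP 7.2.1 and later: take the average CPU utilization and the busiest domain, and report whichever is greater. The function name Get-CpuBusyEstimate and the input numbers are hypothetical. This only mirrors the wording of the note; it is not how ONTAP computes the counter internally.

```powershell
# Sketch of the documented rule for Data ONTAP 7.2.1 and later:
# cpu_busy = the greater of (average CPU utilization, busiest domain).
# Get-CpuBusyEstimate is a made-up helper name, not a NetApp or DFM command.
function Get-CpuBusyEstimate {
    param(
        [double[]]$CpuPercent,      # per-CPU utilization, in percent
        [hashtable]$DomainPercent   # domain name -> utilization, in percent
    )
    $average = ($CpuPercent | Measure-Object -Average).Average
    $busiest = ($DomainPercent.Values | Measure-Object -Maximum).Maximum
    [math]::Max([double]$average, [double]$busiest)
}

# Hypothetical numbers, not from a real filer: four mostly idle CPUs and one hot domain.
$cpus     = 10, 12, 8, 14
$domains  = @{ kahuna = 55; raid = 6; storage = 3 }
$estimate = Get-CpuBusyEstimate -CpuPercent $cpus -DomainPercent $domains
$estimate
```

With these made-up inputs the CPUs average 11%, yet the rule reports 55% because a single domain is busy.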
Want a little more insight into what is being used? Time for some unsupported lovin…
    netapp1*> priv set diag
    Warning: These diagnostic commands are for use by NetApp
             personnel only. (hehehe yeah right...)
    netapp1*> sysstat -M 1
    ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s   CP
      29%   12%    3%    1%  12%   9%  13%  14%  11%      1%       0%      0%      4%   6%     4%     4%     14%( 11%)          0%        0%   0%    12%   2%   1%  1412 100%
      12%    5%    1%    0%   5%   3%   6%   8%   3%      1%       0%      0%      3%   5%     0%     2%      1%(  1%)          0%        0%   0%     5%   1%   1%   138 100%
      11%    3%    0%    0%   4%   2%   5%   5%   4%      1%       0%      0%      3%   3%     1%     2%      1%(  1%)          0%        0%   0%     3%   1%   1%   172  58%
      10%    2%    0%    0%   3%   2%   4%   3%   5%      1%       0%      0%      3%   1%     1%     3%      2%(  2%)          0%        0%   0%     1%   1%   1%   367   0%
       7%    1%    0%    0%   2%   2%   4%   2%   2%      1%       0%      0%      3%   1%     1%     1%      1%(  1%)          0%        0%   0%     0%   1%   1%   252   0%
      25%    7%    2%    0%   9%   7%  10%   9%  11%      1%       0%      0%      3%   2%     4%     3%     11%( 11%)          0%        0%   0%     8%   2%   1%  1907   0%
      30%    8%    1%    0%  10%   8%  10%  10%  14%      3%       0%      0%      3%   1%     5%     4%     15%( 13%)          0%        0%   0%     8%   2%   1%  2613   0%
      18%    6%    1%    0%   7%   5%   7%   6%  10%     11%       0%      0%      2%   1%     0%     6%      4%(  4%)          0%        0%   0%     0%   2%   1%   920   0%
       8%    1%    0%    0%   3%   2%   4%   3%   3%      1%       0%      0%      3%   1%     1%     2%      1%(  1%)          0%        0%   0%     1%   1%   1%   351   0%
Nifty! Now I can see more of what is being used. But, that does not help me with my task at hand. I need to get the REAL average cpu utilization of my system for the last 3 months. system:system:avg_processor_busy to the rescue!
avg_processor_busy: Average processor utilization across all processors in the system
You can test different counters directly on the filer to see whether they give you what you want.
    extfs02> stats show -i 1 system:system:avg_processor_busy
    Instance avg_processo
                        %
      system            3
      system            8
      system           14
      system            8
Yep, that’s more like what I want. Sweet.
Now, I have DFM (now called OnCommand Core) and within it, I have had Performance Advisor gathering data for a good long while.
Let’s see if I have the data I want. I want the MAXIMUM value in every 60-minute interval of the 1-minute averages, which should give me a good baseline. First, a quick sanity check of the raw DFM output on my DFM host, using a shorter window: the last 36000 seconds (10 hours) in 600-second (10-minute) steps.
    PS H:\> dfm perf data retrieve -o netapp1 -C system:avg_processor_busy -d 36000 -S simple -m max -s 600
    Timestamp               netapp1:avg_processor_busy
    -------------------------------------------------------------------------------
    2012-01-20 06:50:53     0.700
    2012-01-20 07:00:53     1.655
    2012-01-20 07:10:53     0.690
    ... truncated ...
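If all you need is a number rather than a chart, you can turn those lines into objects and let Measure-Object do the work. This is a rough sketch that uses the three data lines shown above as sample input. It assumes the timestamp and value are separated by a tab, which is the same assumption the script further down makes when it splits each line. The property name CpuPct is made up for illustration.

```powershell
# Sample data lines in the shape shown above. The tab separator is an assumption;
# adjust the split if your dfm output is delimited differently.
$sample = @(
    "2012-01-20 06:50:53`t0.700",
    "2012-01-20 07:00:53`t1.655",
    "2012-01-20 07:10:53`t0.690"
)

$points = foreach ($line in $sample) {
    $stamp, $value = $line.Split("`t")
    [pscustomobject]@{
        Timestamp = [datetime]::ParseExact($stamp, 'yyyy-MM-dd HH:mm:ss', [cultureinfo]::InvariantCulture)
        CpuPct    = [double]::Parse($value, [cultureinfo]::InvariantCulture)
    }
}

# Peak and mean of the sampled values.
$stats = $points | Measure-Object -Property CpuPct -Maximum -Average
$stats
```

On a real run you would swap $sample for the dfm output with the header lines skipped, the same way the script below does with select-object -skip 3.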
Awesome, looks like it is working. Now let’s wrap that in some PowerShell to make it open Excel and draw a scatter chart. Save the following as get_max_avg_cpu.ps1
    # ################## get_max_avg_cpu.ps1 ###########
    # Quick and very dirty graphing of DFM data by JK-47.com
    ####################################################

    # Read in the cli filer name
    $filername=$args[0]

    # Populate a var with the dfm performance data (skip the header lines)
    $DFMdata = dfm perf data retrieve -o $filername -C system:avg_processor_busy -d 7889231 -S simple -m max -s 3600 | select-object -skip 3

    # Do some excel voodoo
    $excel=New-Object -COM "Excel.Application"
    $excel.Visible=$true
    $excel.Usercontrol=$true
    $Workbook=$excel.Workbooks.add()

    # Grab the first worksheet
    $Worksheet=$Workbook.Worksheets.Item(1)

    # Split the DFM data into 2 columns.
    $row=1
    $DFMdata | % {
        $s = $_.Split("`t")
        $Workbook.ActiveSheet.Cells.Item($row,1).Value2 = $s[0]
        $Workbook.ActiveSheet.Cells.Item($row,2).Value2 = $s[1]
        $row++
    }

    # Add a chart of the active data
    $objRange=$Worksheet.UsedRange
    $colCharts=$excel.Charts
    $objChart=$colCharts.Add()
    # 75 = xlXYScatterLinesNoMarkers
    $objChart.ChartType=75
    # Parentheses are needed to actually call the method
    $a=$objChart.Activate()
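In case the magic numbers in that dfm command look odd, here is the arithmetic behind them. It assumes -d and -s are both given in seconds, which fits the 10-minute spacing of the timestamps in the earlier test run with -s 600. The variable names are just for illustration.

```powershell
$durationSeconds = 7889231   # value passed to -d in the script
$stepSeconds     = 3600      # value passed to -s in the script

$days    = $durationSeconds / 86400                          # 86400 seconds per day
$samples = [math]::Floor($durationSeconds / $stepSeconds)    # one sample per step

'{0:N1} days, roughly {1} hourly samples' -f $days, $samples
```

That works out to a little over 91 days, or about three months, and a bit under 2,200 rows for Excel to chart.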
Run the script with your filer name exactly as it is stored in DFM/Operations Manager/OnCommand.
    PS H:\> .\get_max_avg_cpu.ps1 myfiler1.jk-47.com
There you have it. An easy, albeit ugly, graph of your actual avg cpu usage on your filer!