Tip #4: Graphing Performance Data from NetApp w/ PowerShell – Part 1

Your NetApp DFM (OnCommand Core) Performance Advisor is lying to you… It doesn’t want to. And it doesn’t mean to. But it is.  This was an issue that a customer came to me with yesterday.

If you asked your NetApp or DFM, “Hey dude, what’s my CPU utilization?” DFM may reply back, “Ohh man, you’re screwed, it’s 95%!” You instantly crap yourself since you just paid big money for all this stuff, which should last for years.  Don’t worry, more than likely you’re good.

Why is it lying to you?

First, we need to set things straight about what “counters” are.

Counters are performance statistics about the system. You can get deep, very very deep, into how much NetApp will tell you. Sometimes it’s information overload. You have dozens of statistics for each and every object, such as LUNs, volumes, and CPUs.

For example, a single CPU on an 8-core SAN controller tracks a long list of counters nonstop, and DFM stores them at intervals of a minute or less.

That’s a lot of data. Now imagine how much DFM/OnCommand Core/Operations Manager and the rest of the suite actually store. Wow.

Notice the “domain_busy” counter among them. WTF. Well, if you don’t know what domains refer to, they are the categories of work (network, storage, RAID, Kahuna, and so on) that Data ONTAP splits its processing into, not individual CPUs.

If you were to ask Operations Manager or DFM Performance Advisor “HAL, show me my CPU utilization for the last 3 months!”, what it would actually show you is one specific counter: ‘system:system:cpu_busy’.

This is the same counter that shows up as CPU in sysstat on the filer’s command prompt. Notice the CPU value in sysstat 1: 43-58%. Really? I don’t think so. Now, let’s look at the actual per-processor stats with sysstat -m 1. WTF? Why is the ANY column nothing like the AVG or the real usage?

 

AVG is actually more like what we want.  That’s the REAL average of the CPUs.  NetApp chooses to be a big ol’ pessimist, and for ANY it uses the system:cpu_busy metric.

cpu_busy: Percentage of time one or more processors is busy in the system

Note: For systems running Data ONTAP 7.2 or earlier, the cpu_busy counter is the amount of time that any one CPU is busy. This results in a value for cpu_busy that is inflated. For systems running Data ONTAP 7.2.1 or later, the cpu_busy counter is the greater of either average CPU utilization or the busiest domain.

The BUSIEST DOMAIN… so if you have 1 CPU or 1 domain flipping out, your metrics will be screwed.  Just for the record, I have NEVER seen ANY come anywhere near the AVG on a large or highly utilized filer.
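To make the rule in that note concrete, here is a minimal PowerShell sketch of "the greater of either average CPU utilization or the busiest domain". The function name `Get-CpuBusyEstimate` and the sample percentages are made up for illustration; this is not how Data ONTAP computes the counter internally, just the arithmetic the note describes.

```powershell
# Hypothetical helper: applies the documented rule for cpu_busy on
# Data ONTAP 7.2.1 and later (greater of average CPU or busiest domain).
function Get-CpuBusyEstimate {
    param(
        [double[]]$PerCpuBusy,     # percent busy for each processor
        [double[]]$PerDomainBusy   # percent busy for each domain
    )
    $avg     = ($PerCpuBusy    | Measure-Object -Average).Average
    $busiest = ($PerDomainBusy | Measure-Object -Maximum).Maximum

    [pscustomobject]@{
        AvgProcessorBusy = [math]::Round($avg, 1)
        BusiestDomain    = $busiest
        CpuBusy          = [math]::Max($avg, $busiest)
    }
}

# Made-up sample values, not taken from a real filer:
# eight fairly idle CPUs, but one domain running hot.
$result = Get-CpuBusyEstimate -PerCpuBusy 12, 9, 15, 8, 11, 10, 14, 9 `
                              -PerDomainBusy 5, 58, 20, 3
$result
```

With numbers like these, the average across processors is low while the reported cpu_busy follows the single hot domain, which is exactly the gap between AVG and ANY described above.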

Want a little more insight into what is being used?  Time for some unsupported lovin…

Nifty! Now I can see more of what is being used. But that does not help me with my task at hand.  I need to get the REAL average CPU utilization of my system for the last 3 months.  system:system:avg_processor_busy to the rescue!

avg_processor_busy: Average processor utilization across all processors in the system

You can test different counters on the NetApp directly to see if they give you what you want.

 

Yep, that’s more like what I want. Sweet.

Now, I have DFM (now called OnCommand Core) and within it, I have had Performance Advisor gathering data for a good long while.

Let’s see if I have the data I want.  I want the MAXIMUM of the 1-minute averages within every 60-minute interval.  This should give me a good baseline.  Let me check my base DFM output, on my DFM host.
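If you want to see what that roll-up means, here is a rough PowerShell sketch: take 1-minute samples, bucket them by hour, and keep the maximum in each bucket. The function name `Get-HourlyMax`, the `Timestamp`/`Value` property names, and the sample data are all assumptions for illustration; DFM does this aggregation for you on the server side.

```powershell
# Hypothetical helper: reduce 1-minute samples to one maximum per hour.
# Assumes each sample has a [datetime] Timestamp and a numeric Value.
function Get-HourlyMax {
    param([object[]]$Samples)

    $inv = [cultureinfo]::InvariantCulture
    $Samples |
        Group-Object { $_.Timestamp.ToString('yyyy-MM-dd HH', $inv) } |
        ForEach-Object {
            [pscustomobject]@{
                Hour = $_.Name
                Max  = ($_.Group | Measure-Object -Property Value -Maximum).Maximum
            }
        }
}

# Made-up data: 120 one-minute samples covering two hours.
$start   = [datetime]'2012-01-01T00:00:00'
$samples = 0..119 | ForEach-Object {
    [pscustomobject]@{
        Timestamp = $start.AddMinutes($_)
        Value     = [math]::Floor($_ / 3)   # slowly rising fake CPU %
    }
}

$hourly = @(Get-HourlyMax -Samples $samples)
$hourly
```

Taking the max rather than the mean of each hour keeps short spikes visible, which is what you want for a capacity baseline.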

 

Awesome, looks like it is working. Now let’s wrap that in some PowerShell to make it open Excel and build a scatter chart. Save the following to get_max_avg_cpu.ps1

Run the ps1 with your filer name as it is saved in DFM/Operations Manager/OnCommand/etc.
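For the charting half only, here is a minimal sketch of driving Excel from PowerShell over COM. It is not the get_max_avg_cpu.ps1 script itself: the function name `New-CpuScatterChart` and the `Time`/`Max` property names on the input rows are assumptions, it does not call DFM at all, and it needs Windows with Excel installed.

```powershell
# Sketch only: requires Windows with Excel installed.
# Hypothetical helper; the Time/Max row shape is an assumption.
function New-CpuScatterChart {
    param(
        [object[]]$Rows,    # objects with Time ([datetime]) and Max (percent)
        [string]$Title = 'avg_processor_busy (hourly max)'
    )

    $excel = New-Object -ComObject Excel.Application
    $excel.Visible = $true
    $sheet = $excel.Workbooks.Add().Worksheets.Item(1)

    # Header row, then one row per data point.
    $sheet.Cells.Item(1, 1) = 'Time'
    $sheet.Cells.Item(1, 2) = 'Max CPU %'
    $r = 2
    foreach ($row in $Rows) {
        $sheet.Cells.Item($r, 1) = $row.Time
        $sheet.Cells.Item($r, 2) = $row.Max
        $r++
    }

    # Embed a chart on the sheet and point it at the data.
    $chart = $sheet.Shapes.AddChart().Chart
    $chart.ChartType = -4169          # xlXYScatter
    $chart.SetSourceData($sheet.Range("A1:B$($r - 1)"))
    $chart.HasTitle = $true
    $chart.ChartTitle.Text = $Title
}

# Example call with made-up rows (uncomment on a machine with Excel):
# New-CpuScatterChart -Rows @(
#     [pscustomobject]@{ Time = [datetime]'2012-01-01T00:00:00'; Max = 19 },
#     [pscustomobject]@{ Time = [datetime]'2012-01-01T01:00:00'; Max = 39 }
# )
```

Writing cells one at a time is slow for three months of hourly points, but it keeps the sketch short; exporting to CSV first and opening that in Excel is a reasonable alternative.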

There you have it. An easy, albeit ugly, graph of your actual avg cpu usage on your filer!

 
