Just an FYI: when you jump from ONTAP 7-Mode 7.3.x to 8.1.4, you can have a recurrence of NetApp bug 568758, which had to do with block deletes killing performance and spiking CPU due to serialization of volume cleanup processes. (Even though NetApp ‘fixed’ this bug in 8.1.4 for new occurrences, snapshots laid out on the volumes in a certain way can still cause it to crop up. Perfect storm.)
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=568758
I have run into something totally new which is really based on something old with that bug…
I just did a headswap from FAS3140 to FAS3220, and what I thought at first was high CPU due to dedupe runs and fingerprint updating ended up lasting through the morning. The customer called me in a panic before I headed out of town.
High CPU, as seen in System Manager’s performance view. Sitting at 100% all morning. Even though real AVG is just 33% and the issue isn’t a major meltdown.
I never ever ever trust any CPU information because most tools collect the wrong counter to be useful. OK, let’s look deeper.
netapp2> priv set diag sysstat -m 1
 ANY   AVG  CPU0  CPU1  CPU2  CPU3
100%   33%   22%   25%   20%   66%
100%   33%   22%   22%   18%   70%
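Quick aside on why ANY and AVG can disagree this badly. As I understand the counters, ANY is the share of time at least one CPU was busy, while AVG is the mean busy time across all CPUs. A minimal Python sketch with made-up busy flags (hypothetical numbers, not data from this filer) shows ANY sitting at 100% while AVG stays low:

```python
# Hypothetical per-interval busy flags for 4 CPUs (1 = busy, 0 = idle).
# These samples are invented for illustration only.
samples = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]


def any_and_avg(samples):
    """Return (ANY%, AVG%) for a list of per-CPU busy tuples."""
    intervals = len(samples)
    cpus = len(samples[0])
    # ANY: fraction of intervals in which at least one CPU was busy.
    any_busy = sum(1 for s in samples if any(s))
    # AVG: busy CPU-intervals divided by all CPU-intervals.
    total_busy = sum(sum(s) for s in samples)
    return 100.0 * any_busy / intervals, 100.0 * total_busy / (intervals * cpus)


any_pct, avg_pct = any_and_avg(samples)
print(f"ANY {any_pct:.0f}%  AVG {avg_pct:.0f}%")  # ANY 100%  AVG 25%
```

So a single thread that never lets go of a CPU is enough to hold ANY at 100%, even when the box as a whole is mostly idle.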
Well, now my customer is all freaked to hell. CPU3 is the hottest at 66-70% and the “ANY” or CPU Domain is 100%. Let’s look deeper at wtf is going on!
netapp2*> sysstat -M 1
ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s  CP
 100%   21%    3%    1%  34%  22%  27%  23%  62%     13%       0%      0%      1%   1%     3%     1%    103%( 99%)          0%        0%   0%     1%   7%   4%   901  0%
 100%   16%    2%    0%  31%  20%  18%  17%  70%      7%       0%      0%      1%   1%     2%     3%    100%( 97%)          0%        0%   0%     1%   6%   4%   587  0%
Huh? No RAID or Kahuna, but Kahuna exempt [WAFL_Ex(Kahu)] CPU is pegged at 103%?
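Eyeballing 25 columns gets old fast. Here is a rough Python sketch that picks the busiest domain out of one `sysstat -M` sample. The header and row are pasted from the output above; the function name `busiest_domain` and the choice to treat everything from Network through Host as "domains" are my own assumptions, not anything NetApp ships:

```python
import re

# Header and first sample row copied from the sysstat -M output above.
HEADER = ("ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol "
          "Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean "
          "SM_Exempt Cifs Exempt Intr Host Ops/s CP").split()
ROW = ("100% 21% 3% 1% 34% 22% 27% 23% 62% 13% 0% 0% 1% 1% 3% 1% "
       "103%( 99%) 0% 0% 0% 1% 7% 4% 901 0%")


def busiest_domain(header, row):
    """Return (domain name, percent) for the hottest domain column."""
    # Glue "103%( 99%)" into one token so values line up with the header.
    row = re.sub(r"\(\s+", "(", row)
    values = row.split()
    if len(values) != len(header):
        raise ValueError("column count does not match header")
    # Assumption: the domain columns run from Network through Host.
    start, end = header.index("Network"), header.index("Host")
    domains = {}
    for name, val in zip(header[start:end + 1], values[start:end + 1]):
        # Keep only the leading number, e.g. "103%(99%)" becomes 103.
        domains[name] = int(re.match(r"\d+", val).group())
    top = max(domains, key=domains.get)
    return top, domains[top]


print(busiest_domain(HEADER, ROW))  # ('WAFL_Ex(Kahu)', 103)
```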
Let’s look deeper!
I looked to see if throttling settings were set right. This has caused major CPU issues on other systems in the past.
options wafl.trunc.throttle.hipri.enable off
options wafl.trunc.throttle.slice.enable off
options wafl.trunc.throttle.system.max 30
options wafl.trunc.throttle.vol.max 30
options wafl.trunc.throttle.min.size 1530
They look great.
How’s wafl scan status? Perfect. No scans.
How’s aggr status? Perfect, in RLW_Upgrading but no active scrubs.
How’s options raid.scrub.perf_impact? Low.
Well, crap. What could be wrong?
netapp2*> ps -c 5
Process statistics over 609.281 seconds...
  ID State Domain %CPU StackUsed %StackUsed Name
   5 RR    i       81%      1000        24% idle_thread0
   6 RR    i       81%       968        23% idle_thread1
   7 RR    i       83%       952        23% idle_thread2
   8 RR    i       37%       952        23% idle_thread3
1560 BR    w       33%     10624        32% wafl_exempt00
Screw this, let’s get crazy.
netapp2*> options stats.wafltop.config
stats.wafltop.config         volume,process
netapp2*> options stats.wafltop.config volume,process,message
netapp2*> wafltop start
netapp2*> wafltop show -v cpu

CPU Utilization Percent

Application                                      Total   STRIPE  VOL_LOG  VOL_VBN      VBN      VOL AGGR_VBN     AGGR   SERIAL XCleaner
-----------                                   -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
aggr0:CSV_2008_1:walloc:WAFL_DELAYED_FREE_WO:       96        0        0       96        0        0        0        0        0        0
netapp2*> wafltop stop

... other output removed...
WAFL_DELAYED_FREE_WO ?? Awwww bawls. This is going to screw my day up. So much for sleep or food any time soon.
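If you are saving these captures for later, the `wafltop` row is easy to pull apart. This is just a sketch in Python: the row is copied from the output above, and I am assuming the application key reads aggregate:volume:process:message because that matches the `volume,process,message` config I set. The helper name `parse_wafltop_line` is hypothetical:

```python
# Numeric column names taken from the wafltop header above.
COLUMNS = ["Total", "STRIPE", "VOL_LOG", "VOL_VBN", "VBN",
           "VOL", "AGGR_VBN", "AGGR", "SERIAL", "XCleaner"]

# Row copied from the wafltop output above.
LINE = "aggr0:CSV_2008_1:walloc:WAFL_DELAYED_FREE_WO: 96 0 0 96 0 0 0 0 0 0"


def parse_wafltop_line(line):
    """Split one wafltop CPU row into its key parts and per-column percents."""
    key, *numbers = line.split()
    # Assumption: the key is aggregate:volume:process:message with a trailing colon.
    aggr, volume, process, message = key.rstrip(":").split(":")
    return {
        "aggr": aggr,
        "volume": volume,
        "process": process,
        "message": message,
        "cpu": dict(zip(COLUMNS, map(int, numbers))),
    }


row = parse_wafltop_line(LINE)
print(row["volume"], row["message"], row["cpu"]["Total"])
```

The useful part is the volume name: that is the one to go after with the scan.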
Let’s just nip this off right now.
netapp2*> wafl scan blk_reclaim CSV_2008_1 -f ##### This was the magic secret sauce with bbq.

netapp2*> wafl scan status CSV_2008_1
Volume CSV_2008_1:
 Scan id                   Type of scan     progress
      17    active bitmap rearrangement     fbn 1674 of 17356 w/ max_chain_len 3
     420    container block reclamation     block 319 of 17357 (fbn 319)

######### Wait a few minutes…

netapp2*> sysstat -m 1
 ANY   AVG  CPU0  CPU1  CPU2  CPU3
 28%   10%   12%   12%   11%    4%
 31%   12%   13%   14%   12%    7%
 13%    5%    5%    6%    7%    3%
 14%    5%    6%    6%    5%    4%
Well, how’s about that crap. boooo ya!!!