Do not buy your FC HBAs from your server manufacturer!

My company has had 2 major issues with QLogic HBAs (3 different models) in the last 6 months. In both cases, the issues were due to “branded” QLogic cards. I am not saying QLogic is bad, not at all. I have used thousands of their cards over the years.

What I am saying is… don’t freaking buy an HBA with a server! Don’t REUSE cards you have lying around if they were sourced from any server manufacturer or a different brand of SAN.

Really, don’t do it. It’s bad enough you are still using Fibre Channel instead of 10GbE and NFS for stuff like Oracle. When you add a CPU-heavy layer like FC on top of it, and then have to deal with the freaking fiasco of firmware/BIOS/driver/model certification… ugh, what the hell. It’s a nightmare.

Issue #1: HP “branded” QLogic 8Gb FC HBA. What was wrong with this? Well… the joyous beauty would work WONDERFULLY while direct connected, but the moment you plugged it into a switch, it stopped functioning right. All sorts of weird ass issues.

Yes, the fill word (IDLE) port config on the switch was toggled to the new method. That wasn’t the issue. No one could figure out what was wrong. Not QLogic, not Brocade, not NetApp. We switched to another FC HBA…

Issue #2: A recent personal hell… 4 cards, 2 models: QLE2462 and QLE2562. All sourced through Dell. Tried on Windows AND Linux. This was intended as an Oracle RAC cluster.

Windows, being a steaming pile of cow dung, didn’t actually list any errors. It would simply stall on IO, and STONITH one of the hosts into a reboot loop. (STONITH = Shoot The Other Node In The Head: one node sees the other is wonking out, evicts it from the RAC cluster, and then reboots it.)

We moved over to Linux, thank God! Then we were able to see all the FCP SCSI abort errors being listed.

Also, IO would simply stall. Not just that, but the system WOULD NOT KNOW it stalled!!!

Check this crap out.

A good response time on the “real” and “sys” would be about 5 seconds. Every few runs, the box would just go mad and sit there hung on this process for up to 2 minutes, with the reported “sys” time actually showing only 4-5 seconds. strace would show nothing wrong. It too would be hung. Then the kernel would go tits up and of course it would time out on the kernel.

Yeah, when it ran right, it would be fast as hell. But who can tolerate 1/4 of your IO being stuck in a 2-minute limbo?
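If you want to catch this kind of stall yourself, here is a minimal sketch of the idea: run a command, measure elapsed wall-clock time with a monotonic clock, and compare it to the CPU time (user + sys) charged to the child. This is not the tool we used on the project. The function names and the `ratio` and `min_wall` thresholds are made up for illustration, and it assumes a Unix-like host, since Python's `resource` module is not available on Windows.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (wall, user, sys) seconds for the child process."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    sys_ = after.ru_stime - before.ru_stime
    return wall, user, sys_


def looks_stalled(wall, user, sys_, ratio=5.0, min_wall=1.0):
    """Flag a run whose wall time dwarfs its CPU time.

    ratio and min_wall are hypothetical thresholds; tune them for your workload.
    """
    cpu = max(user + sys_, 0.001)  # avoid comparing against zero CPU time
    return wall >= min_wall and wall > ratio * cpu


if __name__ == "__main__":
    if len(sys.argv) > 1:
        wall, user, sys_ = time_command(sys.argv[1:])
        flag = "STALLED?" if looks_stalled(wall, user, sys_) else "ok"
        print(f"real={wall:.2f}s user={user:.2f}s sys={sys_:.2f}s {flag}")
    else:
        print("usage: stallcheck.py COMMAND [ARGS...]")
```

Run it in a loop against your IO test. A run that takes 2 minutes of wall clock for about 5 seconds of CPU gets flagged, while a healthy 5-second run does not. Anything that legitimately sleeps or waits will also be flagged, so treat a hit as a hint and not a diagnosis.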

We were plagued with these errors from all the QLogic cards on hand:

We swapped out every fiber cable and SFP in between. We tried different ports on the switch. All the same issues. One by one, we eliminated every possible cause: 4 different kernels, 2 different models of cards, multiple card slots in the servers, 2 different servers, etc. etc.

We used ALL the recommended firmware and drivers required by NetApp.  Not only that, but QLogic cards inject drivers

I tuned the hell out of multipath.conf. I removed multipathing. I did whatever I could. All the while, the customer’s deadline approached. Grrrrrr, the life of project work. Whenever you have Fibre Channel, always pad your deadline with 5x more time than you need, because you WILL hit weird ass issues. God, I wish every Oracle project could be 10GbE and NFS/dNFS!

Since this was Oracle Enterprise Linux (hey, don’t badmouth it… most of my customers are running it…) we decided to call Oracle. When the customer called Oracle, they went through the debugging and said “not my issue”. Brocade was called. “Not my issue,” they said. QLogic was called… they said “hmmm wtf”. Yeah, WTF!

So, QLogic dug deeper…

I’m taking some artistic license and toning down the vulgarity of my frustration with this 2-month-long debugging process, but here is how the last week went:

QLogic: Hey, wait… did you buy this card from Dell?
Us: You mean “cards”?  Ummm yeah, why?
QLogic: “Ohh this is a QLE2562-DELL card, it’s only certified for Compellent. You can trade it in and buy a new one that’s certified for NetApp”
Me in my head: You got to be kidding… that’s a joke right?
Customer: What?
Me: This is the dumbest shit I ever heard. Why not just be able to flash the damn thing with the compatible BIOS? Why hardcode and break shit? I hate FC most times…

So… we bought a set of Emulex cards off a NetApp SKU.  They worked.  Problem solved.  Yeah.

Not only that, but the “inbox” drivers, the ones that are baked into the kernel, “just work” and are supported by NetApp. That makes life so much easier than trying to track down the EXACT tgz download of some weird ass QLogic driver which only works for that one minor kernel point release, which is like a needle in a freaking haystack!

So, now my work continues, getting back to the original issue I was working on with RAC, where a certain command was reporting a NULL db name and keeping me from expanding to another node.
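Before you burn 2 months like we did, it is worth at least dumping what each HBA says about itself. The sketch below reads a few attributes that, as far as I know, the Linux FC transport class exposes under `/sys/class/fc_host/hostN/`. Treat the attribute list as an assumption and check it on your own kernel. On many drivers `symbolic_name` carries the model plus firmware and driver versions, but I can’t promise an OEM suffix like “-DELL” will show up there. `read_fc_hosts` is a name I made up for this example.

```python
from pathlib import Path

# Attributes assumed to exist under /sys/class/fc_host/hostN/ (verify locally).
FC_HOST_ATTRS = ("port_name", "port_state", "speed", "symbolic_name")


def read_fc_hosts(root="/sys/class/fc_host"):
    """Return {host_name: {attr: value or None}} for every FC host found."""
    hosts = {}
    base = Path(root)
    if not base.is_dir():
        return hosts  # no FC HBAs, or not a Linux box
    for host_dir in sorted(base.iterdir()):
        info = {}
        for attr in FC_HOST_ATTRS:
            try:
                info[attr] = (host_dir / attr).read_text().strip()
            except OSError:
                info[attr] = None  # attribute missing or unreadable
        hosts[host_dir.name] = info
    return hosts


if __name__ == "__main__":
    for name, info in read_fc_hosts().items():
        print(name)
        for attr, value in info.items():
            print(f"  {attr}: {value}")
```

Take that output and compare it against your storage vendor’s interoperability matrix before you start swapping cables.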






