isp(4) QLE2462 initiator failure with 10.3-RELEASE

robroy · Oct 4, 2016

FreeBSD Friends,

A chum and I have been setting up some FreeBSD 10.3-RELEASE servers at work, which access ZFS pools on Hitachi Modular and Enterprise family arrays. FreeBSD attaches to the Brocade fabric with a QLE2462 FC HBA, and sees four paths to each LU.

Here's a drawing showing the basic idea. Not shown in the drawing are a few more storage arrays (of the same types) and various additional FC switches in the fabric between the computer and the arrays.

The problem

The first FC HBA port, isp0, stopped working spontaneously after several weeks of uptime with light I/O. All LU paths automatically failed over to isp1, yet paths through isp0 remain non-functional even now.

The first sign of trouble appeared in /var/log/messages, followed by many more similar errors for other LU paths:

Code:

isp0: Chan 0 Abort Cmd for N-Port 0x0005 @ Port 0x111300
isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
isp0: isp_watchdog: timeout for handle 0x6570200d
(da5:isp0:0:4:1): FIN dl16384 resid 0 CDB=0x2a 0x00 0x03 0x51 0x1b 0xe5 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
(da5:isp0:0:4:1): WRITE(10). CDB: 2a 00 03 51 1b e5 00 00 20 00
(da5:isp0:0:4:1): CAM status: Command timeout
(da5:isp0:0:4:1): Retrying command

These caused successful fail-overs to paths through isp1, which looked like this in /var/log/messages:

Code:

(da5:isp0:0:4:1): Error 5, Retries exhausted
GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL
GEOM_MULTIPATH: da17 is now active path in 85040360_0999
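
For anyone reading the timed-out commands in the log lines above, here is a minimal sketch of how a 10-byte WRITE(10) CDB can be decoded by hand. The function name decode_cdb10 and the BLOCK_SIZE variable are made up for illustration, and the 512-byte block size is an assumption (it happens to agree with the dl16384 value logged next to the CDB).

```sh
#!/bin/sh
# Decode a 10-byte READ(10)/WRITE(10) CDB given as ten space-separated hex bytes.
# Assumption: 512-byte logical blocks unless BLOCK_SIZE is set.
decode_cdb10() {
    set -- $1                       # split the byte string into $1..$10
    [ $# -eq 10 ] || { echo "expected 10 bytes" >&2; return 1; }
    lba=$(( 0x$3$4$5$6 ))           # CDB bytes 2-5: starting LBA, big-endian
    blocks=$(( 0x$8$9 ))            # CDB bytes 7-8: transfer length in blocks
    echo "opcode=0x$1 lba=$lba blocks=$blocks bytes=$(( blocks * ${BLOCK_SIZE:-512} ))"
}
```

Running decode_cdb10 "2a 00 03 51 1b e5 00 00 20 00" on the da5 command prints opcode=0x2a lba=55647205 blocks=32 bytes=16384, so the failing request was an ordinary 16 KiB write rather than anything unusual.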

What I've already tried

I tried manually failing back to paths through isp0 with commands like gmultipath restore 66209_002E da2 followed by gmultipath rotate 66209_002E. When I/Os are attempted over isp0, it shows the same original symptom (shown below in context) until GEOM_MULTIPATH fails over to a path through isp1 again.

Code:

GEOM_MULTIPATH: da3 in 66209_002E is marked OK.
GEOM_MULTIPATH: da3 is now active path in 66209_002E
isp0: Chan 0 Abort Cmd for N-Port 0x0004 @ Port 0x0e2000
isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
isp0: isp_watchdog: timeout for handle 0x65a7200d
(da3:isp0:0:3:0): FIN dl2560 resid 0 CDB=0x2a 0x00 0x04 0x2a 0xa7 0x89 0x00 0x00 0x05 0x00  STS 0x0 XS_ERR=0xb
(da3:isp0:0:3:0): WRITE(10). CDB: 2a 00 04 2a a7 89 00 00 05 00
(da3:isp0:0:3:0): CAM status: Command timeout
(da3:isp0:0:3:0): Retrying command

I've tried failing over to every possible array target for an LU, over isp0; it was the same for each target.
I've tried replacing every fiber optic cabling segment between the isp0 HBA port and the switch; the behavior was unchanged.
I've tried physically swapping the isp0 and isp1 HBA port connections--the symptom stuck to isp0, even when its I/Os were being attempted through the physical connection formerly used (successfully) by isp1.
I've tried disabling and re-enabling the Brocade switch port. When the port was enabled, it assumed the In_Sync state (instead of the Online state it shows when it's working):
Code:
```
    2   2   150200   id    N4       In_Sync     FC
```

Computer information

This is a Hitachi CR220H, which is based on an MSI S0051a motherboard.

FC HBA information

This is a QLE2462 at firmware level 8.01.02 and BIOS level 3.29. ispfw(4) is in use and reports having successfully placed its own firmware on the card during boot, presumably overriding the levels I flashed (mentioned here).

Related sysctl(8)s:

Code:

# sysctl -a | grep dev.isp
dev.isp.1.topo: 3
dev.isp.1.loopstate: 9
dev.isp.1.fwstate: 3
dev.isp.1.linkstate: 1
dev.isp.1.speed: 4
dev.isp.1.role: 2
dev.isp.1.gone_device_time: 30
dev.isp.1.loop_down_limit: 60
dev.isp.1.wwpn: 2378182195041974935
dev.isp.1.wwnn: 2305843126027336343
dev.isp.1.%parent: pci3
dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
dev.isp.1.%location: pci0:3:0:1
dev.isp.1.%driver: isp
dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.0.topo: 3
dev.isp.0.loopstate: 9
dev.isp.0.fwstate: 3
dev.isp.0.linkstate: 1
dev.isp.0.speed: 4
dev.isp.0.role: 2
dev.isp.0.gone_device_time: 30
dev.isp.0.loop_down_limit: 60
dev.isp.0.wwpn: 2377900720063167127
dev.isp.0.wwnn: 2305843126025239191
dev.isp.0.%parent: pci3
dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
dev.isp.0.%location: pci0:3:0:0
dev.isp.0.%driver: isp
dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.%parent:
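
The wwpn and wwnn sysctls above are printed in decimal, which is awkward to compare with the colon-separated hex form that switches and arrays display. A small sketch of the conversion follows; the helper name wwn_to_hex is made up for illustration, and it assumes a printf that handles 64-bit integers.

```sh
#!/bin/sh
# Convert a decimal WWN (as printed by dev.isp.N.wwpn / dev.isp.N.wwnn)
# to colon-separated hex.
wwn_to_hex() {
    # Zero-pad to 16 hex digits, then insert a colon after every byte
    # and drop the trailing one.
    printf '%016x\n' "$1" | sed 's/../&:/g; s/:$//'
}
```

With the values above, wwn_to_hex 2377900720063167127 gives 21:00:00:1b:32:82:ba:97 for isp0, and wwn_to_hex 2378182195041974935 gives 21:01:00:1b:32:a2:ba:97 for isp1.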

FC switch information

Each FC HBA port is attached to a separate Brocade 6510 running FOS v7.4.1. The symptom is not specific to either of these switches (I tried swapping the connections around, and the symptom stuck to isp0).

Array information

LUs from both Hitachi Modular (AMS) and Enterprise (VSP) arrays are visible over the QLE2462. When this problem happens, the behavior's uniform for all array paths; the symptom's not specific to any one array, or array family.

What's happening now

I'm guessing that this problem would temporarily go away if I rebooted the computer, yet we won't be able to continue on with the project until we figure out what happened to isp0--we're afraid that it'll happen again, naturally at the most inopportune time possible. So the computer's still in its problem state now.

Thanks so much for any words of wisdom,
Robroy

robroy · Oct 6, 2016

I've mailed the freebsd-questions mailing list about this; sorry to duplicate the question in two places. Here's what I sent. Thanks for taking a gander!

mav@ · Oct 15, 2016

I guess firmware on isp0 may go south for some reason. That is absolutely black box for us. If you can not reboot, I would try to reinitialize it by disabling the port and reenabling it back by setting dev.isp.0.role sysctl to zero and then resetting back to 2.
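
For readers who want to try the role toggle described above, a minimal sketch might look like the following. The function name bounce_isp_port, the SYSCTL override, and the 5-second pause are illustrative assumptions, not anything the driver requires; it has to be run as root, and it has not been tested against a port in the failed state.

```sh
#!/bin/sh
# Sketch: reinitialize an isp(4) port by setting its role sysctl to 0
# and then back to 2 (the role value shown for both ports in this thread).
# SYSCTL can be overridden (e.g. SYSCTL=echo) to dry-run the sequence.
SYSCTL="${SYSCTL:-sysctl}"

bounce_isp_port() {
    unit="$1"           # isp unit number, e.g. 0 for isp0
    pause="${2:-5}"     # seconds to wait before re-enabling (assumed value)
    "$SYSCTL" "dev.isp.${unit}.role=0" || return 1   # disable the port
    sleep "$pause"
    "$SYSCTL" "dev.isp.${unit}.role=2"               # restore the original role
}
```

Calling bounce_isp_port 0 would apply this to isp0. Setting SYSCTL=echo first only prints the two assignments, which is a cheap way to check the sequence before touching a live port.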

robroy · Oct 15, 2016

mav@, thank you so much for replying!

mav@ said:
I guess firmware on isp0 may go south for some reason. That is absolutely black box for us.

Ah, okay. Thanks! I did flash the card with firmware revision 8.01.02, and I think ispfw(4) is loading an older version during boot. Before turning on ispfw(4), I noticed that isp(4) declined to latch on to the card, presumably because of this newer flashed firmware level.

Do you think I should re-flash the card with the same (older) version ispfw(4) loads?

mav@ said:
If you can not reboot, I would try to reinitialize it by disabling the port and reenabling it back by setting dev.isp.0.role sysctl to zero and then resetting back to 2.

Wow, thanks so much; I didn't know that this was possible. If this happens again, I'll try this and post the results.

We actually did wind up rebooting the computer, to carry on with it in other ways. And the card returned to perfect working condition following the reboot.

I wish I knew of a way to reproduce this, yet for now we'll just have to wait and see if it happens again.

mav@, I can't thank you enough for taking a gander at this and replying. I sure appreciate it.

mav@ · Oct 16, 2016

robroy said:
Before turning on ispfw(4), I noticed that isp(4) declined to latch on to the card, presumably because of this newer flashed firmware level.

That is why ispfw(4) exists -- to supply firmware that was tested with the isp(4) driver.

robroy said:
Do you think I should re-flash the card with the same (older) version ispfw(4) loads?

It should not matter as long as you load ispfw(4).

robroy · Oct 16, 2016

Thanks very much mav@! I'll carry on then without re-flashing the card.

robroy · May 8, 2017

Update for those curious about how this turned out: the computer has been running steadily ever since (with very light I/O going over the QLogic), and the symptom hasn't happened again. It has been completely reliable.

mav@, thank you again for your kind and helpful response last year. If this does ever happen again, I feel much better knowing that I have something to try. Thank you.