FreeBSD Friends,
A chum and I have been setting up some FreeBSD 10.3-RELEASE servers at work, which access ZFS pools on Hitachi Modular and Enterprise family arrays. FreeBSD attaches to the Brocade fabric with a QLE2462 FC HBA, and sees four paths to each LU.
Here's a drawing showing the basic idea. Not shown in the drawing are a few more storage arrays (of the same types) and various additional FC switches in the fabric between the computer and the arrays.
The problem
The first FC HBA port, isp0, stopped working spontaneously after several weeks of uptime with light I/O. All LU paths automatically failed over to isp1, yet paths through isp0 remain non-functional even now.
The first sign of trouble appeared in /var/log/messages, followed by many more similar errors for other LU paths:
Code:
isp0: Chan 0 Abort Cmd for N-Port 0x0005 @ Port 0x111300
isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
isp0: isp_watchdog: timeout for handle 0x6570200d
(da5:isp0:0:4:1): FIN dl16384 resid 0 CDB=0x2a 0x00 0x03 0x51 0x1b 0xe5 0x00 0x00 0x20 0x00 STS 0x0 XS_ERR=0xb
(da5:isp0:0:4:1): WRITE(10). CDB: 2a 00 03 51 1b e5 00 00 20 00
(da5:isp0:0:4:1): CAM status: Command timeout
(da5:isp0:0:4:1): Retrying command
These caused successful fail-overs to paths through isp1, which looked like this in /var/log/messages:
Code:
(da5:isp0:0:4:1): Error 5, Retries exhausted
GEOM_MULTIPATH: Error 5, da5 in 85040360_0999 marked FAIL
GEOM_MULTIPATH: da17 is now active path in 85040360_0999
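For a quick per-port tally of these timeouts, something like the following could be run against log text. It is only a sketch: count_timeouts is a made-up helper name, and it assumes each timeout line carries a (daN:ispN:...) tuple as in the excerpts in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): count "Command timeout" lines
# per HBA port. Reads messages-style log text on stdin.
count_timeouts() {
    awk '
        /CAM status: Command timeout/ {
            # Find the CAM path tuple, e.g. "(da5:isp0:", anywhere in the line.
            if (match($0, /\(da[0-9]+:isp[0-9]+:/)) {
                tuple = substr($0, RSTART + 1, RLENGTH - 2)  # "da5:isp0"
                split(tuple, part, ":")
                count[part[2]]++                             # key on "isp0"
            }
        }
        END { for (hba in count) print hba, count[hba] }
    ' | sort
}

# Example with two timeout lines taken from the logs in this post:
printf '%s\n' \
    '(da5:isp0:0:4:1): CAM status: Command timeout' \
    '(da3:isp0:0:3:0): CAM status: Command timeout' | count_timeouts
# prints: isp0 2
```

On the live system the input would be the log itself, e.g. count_timeouts < /var/log/messages.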
What I've already tried
- I tried manually failing back to paths through isp0 with commands like "gmultipath restore 66209_002E da2" followed by "gmultipath rotate 66209_002E". When I/O is attempted over isp0, the same original symptom appears (shown below in context) until it fails back to a path through isp1.
Code:
GEOM_MULTIPATH: da3 in 66209_002E is marked OK.
GEOM_MULTIPATH: da3 is now active path in 66209_002E
isp0: Chan 0 Abort Cmd for N-Port 0x0004 @ Port 0x0e2000
isp0: Polled Mailbox Command (0x54) Timeout (5000000us) (started @ isp_control:4733)
isp0: Mailbox Command 'EXECUTE IOCB A64' failed (TIMEOUT)
isp0: isp_watchdog: timeout for handle 0x65a7200d
(da3:isp0:0:3:0): FIN dl2560 resid 0 CDB=0x2a 0x00 0x04 0x2a 0xa7 0x89 0x00 0x00 0x05 0x00 STS 0x0 XS_ERR=0xb
(da3:isp0:0:3:0): WRITE(10). CDB: 2a 00 04 2a a7 89 00 00 05 00
(da3:isp0:0:3:0): CAM status: Command timeout
(da3:isp0:0:3:0): Retrying command
- I've tried failing over to every possible array target for an LU, over isp0; it was the same for each target.
- I've tried replacing every fiber optic cabling segment between the isp0 HBA port and the switch; the behavior was unchanged.
- I've tried physically swapping the isp0 and isp1 HBA port connections--the symptom stuck to isp0, even when its I/Os were being attempted through the physical connection formerly used (successfully) by isp1.
- I've tried disabling and re-enabling the Brocade switch port. When the port was enabled, it assumed the In_Sync state (instead of the Online state it shows when it's working):
Code:
2 2 150200 id N4 In_Sync FC
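When checking many ports at once, saved switchshow text could be filtered for anything that is not Online. This is a rough sketch: ports_not_online is a made-up helper name, and it assumes the column order Index, Port, Address, Media, Speed, State, Proto seen in the line quoted above (chassis with a Slot column would need different field numbers).

```shell
#!/bin/sh
# Hypothetical helper: list ports whose State column is not "Online".
# Assumes State is the 6th whitespace-separated field, as in the quoted line.
ports_not_online() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 6 && $6 != "Online" {
        print "port " $2 ": " $6
    }'
}

# The first line is a made-up Online port for contrast; the second is the
# line from this post.
printf '%s\n' \
    '  1   1   150100   id    N4   Online      FC  F-Port' \
    '  2   2   150200   id    N4   In_Sync     FC' | ports_not_online
# prints: port 2: In_Sync
```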
Computer information
This is a Hitachi CR220H, which is based on an MSI S0051a motherboard.
FC HBA information
This is a QLE2462 at firmware level 8.01.02 and BIOS level 3.29. ispfw(4) is being used, and claims to have successfully loaded its own firmware onto the card during boot, presumably overriding the levels I flashed (mentioned here).
Related sysctl(8)s:
Code:
# sysctl -a | grep dev.isp
dev.isp.1.topo: 3
dev.isp.1.loopstate: 9
dev.isp.1.fwstate: 3
dev.isp.1.linkstate: 1
dev.isp.1.speed: 4
dev.isp.1.role: 2
dev.isp.1.gone_device_time: 30
dev.isp.1.loop_down_limit: 60
dev.isp.1.wwpn: 2378182195041974935
dev.isp.1.wwnn: 2305843126027336343
dev.isp.1.%parent: pci3
dev.isp.1.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
dev.isp.1.%location: pci0:3:0:1
dev.isp.1.%driver: isp
dev.isp.1.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.0.topo: 3
dev.isp.0.loopstate: 9
dev.isp.0.fwstate: 3
dev.isp.0.linkstate: 1
dev.isp.0.speed: 4
dev.isp.0.role: 2
dev.isp.0.gone_device_time: 30
dev.isp.0.loop_down_limit: 60
dev.isp.0.wwpn: 2377900720063167127
dev.isp.0.wwnn: 2305843126025239191
dev.isp.0.%parent: pci3
dev.isp.0.%pnpinfo: vendor=0x1077 device=0x2432 subvendor=0x1077 subdevice=0x0138 class=0x0c0400
dev.isp.0.%location: pci0:3:0:0
dev.isp.0.%driver: isp
dev.isp.0.%desc: Qlogic ISP 2432 PCI FC-AL Adapter
dev.isp.%parent:
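The wwpn and wwnn values above are printed in decimal, whereas switch name-server and zoning listings normally show WWNs as colon-separated hex. A small sketch for converting them, so the HBA ports can be matched against what the fabric sees (wwn_hex is a made-up helper name, not a FreeBSD utility):

```shell
#!/bin/sh
# Hypothetical helper: print a decimal 64-bit WWN as colon-separated hex.
wwn_hex() {
    # Zero-pad to 16 hex digits, add ":" after every byte, drop the last ":".
    printf '%016x\n' "$1" | sed 's/../&:/g; s/:$//'
}

wwn_hex 2377900720063167127   # dev.isp.0.wwpn -> 21:00:00:1b:32:82:ba:97
wwn_hex 2378182195041974935   # dev.isp.1.wwpn -> 21:01:00:1b:32:a2:ba:97
```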
FC switch information
Each FC HBA port's attached to a (separate) Brocade 6510 running FOS v7.4.1. The symptom's not specific to either of these switches (I tried swapping the connections around, and the symptom stuck to isp0).
Array information
LUs from both Hitachi Modular (AMS) and Enterprise (VSP) arrays are visible over the QLE2462. When this problem happens, the behavior's uniform for all array paths; the symptom's not specific to any one array, or array family.
What's happening now
I'm guessing that this problem would temporarily go away if I rebooted the computer, yet we won't be able to continue on with the project until we figure out what happened to isp0--we're afraid that it'll happen again, naturally at the most inopportune time possible. So the computer's still in its problem state now.
Thanks so much for any words of wisdom,
Robroy