Hi all,
I am a bit short on knowledge to explain the following behaviour on my system. I have the following ZFS array on a FreeBSD 9.1 guest (64 GB of RAM, 4 vCPUs) under ESXi 5.1, with two MPS controllers in PCI passthrough; both controllers run the IT-mode firmware:
Code:
  pool: zstuff
 state: ONLINE
  scan: scrub repaired 0 in 4h29m with 0 errors on Sat May 11 20:44:37 2013
config:

        NAME           STATE     READ WRITE CKSUM
        zstuff         ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0
As you can see, I have done a recent and successful scrub.
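For the record, the scrub was started and checked with nothing more than the standard commands:
Code:
zpool scrub zstuff
zpool status zstuff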
Before that, I had an issue where two HDDs of this array were shown as UNAVAIL after a smartctl 'long' self-test. They came back ONLINE after power-cycling the host. At that time, I captured the following error messages:
Code:
May 11 15:28:57 softimage kernel: mps0: mpssas_alloc_tm freezing simq
May 11 15:28:58 softimage kernel: mps0: IOCStatus = 0x4b while resetting device 0xc
May 11 15:28:58 softimage kernel: mps0: IOCStatus = 0x4b while resetting device 0xb
May 11 15:28:58 softimage kernel: mps0: mpssas_free_tm releasing simq
May 11 15:28:58 softimage kernel: (da2:mps0:0:(pass3:9:mps0:0:0): lost device - 0 outstanding, 2 refs
May 11 15:28:58 softimage kernel: 9:0): passdevgonecb: devfs entry is gone
May 11 15:28:58 softimage kernel: (da4:mps0:0:11:0): lost device - 0 outstanding, 2 refs
May 11 15:28:58 softimage kernel: (pass5:mps0:0:11:0): passdevgonecb: devfs entry is gone
May 11 15:28:59 softimage kernel: (da4:mps0:0:11:0): removing device entry
May 11 15:28:59 softimage kernel: (da2:mps0:0:9:0): removing device entry
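For completeness: I ended up power cycling the host to get the disks back, but I assume something along the following lines would re-probe the bus and bring the vdevs back online without a full power cycle (gpt/diskN is only a placeholder for whichever label dropped out; I have not verified this procedure):
Code:
camcontrol rescan all          # re-scan the CAM buses for the lost targets
camcontrol devlist             # confirm the da devices are visible again
zpool online zstuff gpt/diskN  # placeholder: the GPT label of the dropped disk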
And here is the extract from dmesg for the two controllers:
Code:
mps0: <LSI SAS2008> port 0x5000-0x50ff mem 0xd2500000-0xd2503fff,0xd2540000-0xd257ffff irq 19 at device 0.0 on pci11
mps0: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps1: <LSI SAS2008> port 0x6000-0x60ff mem 0xd2600000-0xd2603fff,0xd2640000-0xd267ffff irq 16 at device 0.0 on pci19
mps1: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd
mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
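Side note: the firmware on both controllers (15.00.00.00) is one major version ahead of the 14.00.00.01-fbsd driver; I do not know whether that mismatch matters here. If it helps, the same firmware/driver information should also be visible at runtime in the driver's sysctl tree:
Code:
sysctl dev.mps.0
sysctl dev.mps.1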
Here is the output of smartctl -a /dev/da4, and the same output for the second drive (which has moved from da2 to da5 after the reboot, since I changed its slot to try to identify a root cause).

Could you help me understand how a ZFS scrub can complete successfully despite the kind of errors reported by smartctl?
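For reference, this is roughly how the long test was started and read back (da4 as an example; the device names move around between reboots here):
Code:
smartctl -t long /dev/da4      # start the long (extended) self-test
smartctl -l selftest /dev/da4  # self-test log, once the test has finished
smartctl -l error /dev/da4     # SMART error log
smartctl -A /dev/da4           # attributes (reallocated / pending sectors, etc.)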
Am I correct in thinking that my best bet is to start the RMA process with WD, i.e. pull the unhealthy HDDs and run the WD diagnostics on them before returning them?
Note:
I have the following in /boot/loader.conf:
Code:
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"
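(In case it is useful, the live values can be verified at runtime with the command below.)
Code:
sysctl vfs.zfs.vdev.min_pending vfs.zfs.vdev.max_pending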
Let me know if you need more details.
Boris