Scrub is clean but smartmon reports errors on HDD

Hi all,

I don't know enough to explain the following behaviour on my system. I have the ZFS array below on a FreeBSD 9.1 guest (64GB + 4 vcores) running under ESXi 5.1, with two MPS controllers in passthrough. Both controllers run the IT mode firmware:

Code:
  pool: zstuff
 state: ONLINE
  scan: scrub repaired 0 in 4h29m with 0 errors on Sat May 11 20:44:37 2013
config:

	NAME           STATE     READ WRITE CKSUM
	zstuff         ONLINE       0     0     0
	  raidz1-0     ONLINE       0     0     0
	    gpt/disk1  ONLINE       0     0     0
	    gpt/disk2  ONLINE       0     0     0
	    gpt/disk3  ONLINE       0     0     0
	    gpt/disk4  ONLINE       0     0     0
	    gpt/disk5  ONLINE       0     0     0

As you can see, I have done a recent and successful scrub.

Before that, I had issues where two HDDs of this array were seen as UNAVAIL after a smartmon 'long' test. They came back ONLINE after power cycling the host. At that time, I captured the following error messages:

Code:
May 11 15:28:57 softimage kernel: mps0: mpssas_alloc_tm freezing simq
May 11 15:28:58 softimage kernel: mps0: IOCStatus = 0x4b while resetting device 0xc
May 11 15:28:58 softimage kernel: mps0: IOCStatus = 0x4b while resetting device 0xb
May 11 15:28:58 softimage kernel: mps0: mpssas_free_tm releasing simq
May 11 15:28:58 softimage kernel: (da2:mps0:0:(pass3:9:mps0:0:0): lost device - 0 outstanding, 2 refs
May 11 15:28:58 softimage kernel: 9:0): passdevgonecb: devfs entry is gone
May 11 15:28:58 softimage kernel: (da4:mps0:0:11:0): lost device - 0 outstanding, 2 refs
May 11 15:28:58 softimage kernel: (pass5:mps0:0:11:0): passdevgonecb: devfs entry is gone
May 11 15:28:59 softimage kernel: (da4:mps0:0:11:0): removing device entry
May 11 15:28:59 softimage kernel: (da2:mps0:0:9:0): removing device entry

and here is the extract from dmesg for the two controllers:

Code:
mps0: <LSI SAS2008> port 0x5000-0x50ff mem 0xd2500000-0xd2503fff,0xd2540000-0xd257ffff irq 19 at device 0.0 on pci11
mps0: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps1: <LSI SAS2008> port 0x6000-0x60ff mem 0xd2600000-0xd2603fff,0xd2640000-0xd267ffff irq 16 at device 0.0 on pci19
mps1: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd
mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

Here is the output of smartctl -a /dev/da4, with the same output for the second drive (which moved from da2 to da5 after the reboot, since I changed its slot to try to identify a root cause).

Could you help me understand how a ZFS scrub can be successful with the kind of error reported by smartmon?

Am I correct in thinking that my best bet is to start the RMA process with WD, pulling the unhealthy HDDs and running the WD diagnostics on them before returning them?

Note:

I have the following in /boot/loader.conf:

Code:
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"

Let me know if you need more details.

Boris
 
Try replacing the SATA cable. I've seen this before on one of my servers, and replacing the SATA cable did the trick. A low-grade SATA cable will do this. Try a different SATA port, and reseat or replace the cable.

If that doesn't work, you can run a diagnostic on the drive before RMAing it. If the drive passes with flying colors, then it's possible your controller card is going bad or the driver support is poor.
 
OK, thanks. I will have to check the backplane, the cable and the controller. The drives are in a Norco 4224, and I am not entirely sure the backplane is not somehow faulty. Has anybody experienced problems like this caused by power delivery? I have removed a few HDDs to reduce the power requirements. I have a Seasonic 750W, but I use only two of the six power lines available at the back of the PSU, as seen here.

Is there any good tutorial to relate the issues reported by smartmon to, respectively, the controller, the cable, the lack of 'quality' power to the HDD (due to the backplane) and finally the HDD itself?

I guess it might not be that easy but I have read threads where people could isolate an issue to a component in a chain based on the type of error reported by smartmon.
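To make the idea concrete, here is a minimal Python sketch of the kind of triage I have in mind. It reads the attribute table printed by smartctl -A and flags nonzero raw counters. The grouping is only a rule of thumb I am assuming: UDMA_CRC_Error_Count is usually blamed on the link (cable, backplane or port), while reallocated, pending and uncorrectable sector counts are usually blamed on the drive itself. The names classify, LINK_ATTRS, MEDIA_ATTRS and SAMPLE are made up for this example, and the sample rows are invented values, not output from my drives.

```python
import re

# Assumed rule-of-thumb grouping of attribute names as printed by `smartctl -A`.
# This is a hint about where to look first, not a diagnosis.
LINK_ATTRS = {"UDMA_CRC_Error_Count"}
MEDIA_ATTRS = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

# One attribute row: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def classify(smartctl_text):
    """Return {attribute_name: (raw_value, hint)} for nonzero counters of interest."""
    findings = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and anything that is not an attribute row
        name, raw = m.group(2), int(m.group(3))
        if raw == 0:
            continue
        if name in LINK_ATTRS:
            findings[name] = (raw, "link")
        elif name in MEDIA_ATTRS:
            findings[name] = (raw, "media")
    return findings

# Invented sample rows, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
"""

if __name__ == "__main__":
    for name, (raw, hint) in classify(SAMPLE).items():
        print(f"{name}: raw={raw} -> check {hint} first")
```

A "link" hit would send me to the cable and backplane first, a "media" hit to the WD diagnostics. Power problems would probably not show up cleanly in either group, which is part of why I am asking.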

Thanks!
 