ZFS Disk check fails to show health

Hi,
We are trying to use smartmontools (the smartctl command) to check the health of our file server's disks.
We have an LSI MegaRAID (Dell-branded) controller attached to a DAS array. Each of the 12 disks is a single-disk volume, presenting as mfidXX. I'm aware that there are /dev/passYY devices for these. How do I translate from the mfidXX device to the passYY device in order to run smartctl -a /dev/passYY?

It seems these two numbering schemes don't line up. The output from mfiutil show drives below lists the physical drives, but the first number doesn't line up with /dev/passYY either: for example, smartctl -a /dev/pass6 says it's an enclosure device, not a drive.
Any help understanding how this numbering works would be appreciated. Thanks so much.

mfiutil show drives
Code:
mfi0 Physical Drives:
4 (  559G) ONLINE <SEAGATE ST3600057SS ES66 serial=6SL52HK2> SAS E1:S2
5 (  559G) ONLINE <SEAGATE ST3600057SS ES66 serial=6SL52H4L> SAS E1:S5
6 (  559G) ONLINE <SEAGATE ST3600057SS ES66 serial=6SL51047> SAS E1:S4
7 (  559G) ONLINE <SEAGATE ST3600057SS ES66 serial=6SL510BV> SAS E1:S3
8 (  559G) ONLINE <SEAGATE ST3600957SS ESF5 serial=6SL4Q19Y> SAS E1:S8
9 (  559G) ONLINE <SEAGATE ST3600957SS ESF5 serial=6SL4Q0X7> SAS E1:S7
10 (  559G) ONLINE <SEAGATE ST3600057SS ES65 serial=6SL4ZYYY> SAS E1:S6
11 (  559G) ONLINE <SEAGATE ST3600057SS ES65 serial=6SL50DPP> SAS E1:S9
12 (  559G) ONLINE <SEAGATE ST3600057SS ES66 serial=6SL52GYR> SAS E1:S1
13 (  559G) ONLINE <SEAGATE ST3600057SS ES66 serial=6SL510R5> SAS E1:S0
14 (  559G) ONLINE <SEAGATE ST3600957SS ESF5 serial=6SL4LE2T> SAS E1:S10
15 (  559G) ONLINE <SEAGATE ST3600957SS ESF5 serial=6SL4PCCG> SAS E1:S11
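
Since the serial number is the one field in this listing that can also be read from the drive itself, it may help to reduce the listing to a small lookup table. The following is a minimal sketch, assuming the output keeps the layout shown above; mfi_drive_table is a hypothetical helper name, not part of mfiutil.

```shell
# Sketch: turn `mfiutil show drives` output (read on stdin) into rows of
# "id serial enclosure:slot". mfi_drive_table is a hypothetical helper name.
mfi_drive_table() {
    awk '
        /serial=/ {
            id = $1                       # first field: controller device ID
            serial = $0
            sub(/.*serial=/, "", serial)  # drop everything up to the serial
            sub(/>.*/, "", serial)        # drop the closing bracket and the rest
            print id, serial, $NF         # last field: enclosure and slot
        }
    '
}

# Usage on the file server (needs root):
#   mfiutil show drives | mfi_drive_table
```

Each output row can then be compared against the serial number that smartctl reports for a given /dev/passYY device.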
 
AFAIK, Seagate SAS drives sadly do not have a S.M.A.R.T. statistics interface. They only do their proprietary stuff.

You can use Seatools for DOS or Windows.
It only displays "PASS" or "FAIL".
In case of errors, it allows you to make a "repair run" (sector remapping I guess) also.
This can make unusable disks work again.
I did this last month with a Cheetah 15K6 that had been on the shelf for a year and, after being put back into use, refused to write to some sectors when I wiped it with dd.
 
The HDDs should support S.M.A.R.T. according to
https://www.seagate.com/files/www-c...s/enterprise/Cheetah/15K.7/SAS/100516226f.pdf

Maybe it is just disabled?

"Controlling S.M.A.R.T.
The operating mode of S.M.A.R.T. is controlled by the DEXCPT and PERF bits on the Informational Exceptions Control mode page (1Ch). Use the DEXCPT bit to enable or disable the S.M.A.R.T. feature. Setting the DEXCPT bit disables all S.M.A.R.T. functions."

Or there is a problem with the controller driver.
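
One way to test the "just disabled" theory might be to read mode page 0x1c through the pass device and look at the DEXCPT bit. The following is only a sketch: it assumes camcontrol modepage prints one "Name: value" line per field and that the field is labelled DEXCPT in some capitalisation, neither of which is confirmed in this thread. dexcpt_state is a hypothetical helper name.

```shell
# Sketch: read `camcontrol modepage <dev> -m 0x1c` output on stdin and report
# the state implied by the DEXCPT bit. The "Name: value" line layout is an
# assumption about the camcontrol output format.
dexcpt_state() {
    awk -F': *' '
        {
            name = tolower($1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
        }
        name == "dexcpt" {
            found = 1
            # DEXCPT set means S.M.A.R.T. functions are disabled
            print (($2 + 0) ? "disabled (DEXCPT=1)" : "enabled (DEXCPT=0)")
            exit
        }
        END { if (!found) print "unknown (no DEXCPT field found)" }
    '
}

# Usage (needs root and a working pass device):
#   camcontrol modepage pass0 -m 0x1c | dexcpt_state
```

If the result is "unknown", the controller may simply not be passing the mode page through, which would point at the controller or driver rather than the drive.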
 
Not all mfi(4)-based controllers pass on access to SMART data.
Those that don't may well monitor the SMART status themselves, and flag drives as failed or failing in the MegaRAID tools if SMART numbers reach failure thresholds. There are so many variations of MegaRAID out there, and each OEM customises them a little bit as well.
 
We are trying to use smartmontools (the smartctl command) to check the health of our file server's disks.
We have an LSI MegaRAID (Dell-branded) controller attached to a DAS array. Each of the 12 disks is a single-disk volume, presenting as mfidXX. I'm aware that there are /dev/passYY devices for these. How do I translate from the mfidXX device to the passYY device in order to run smartctl -a /dev/passYY?
# camcontrol devlist

This is on a PowerEdge R710 with a PERC H700:
Code:
(0:1) host:~terry# camcontrol devlist
<SEAGATE ST9146852SS HT66>         at scbus0 target 0 lun 0 (pass0)
<SEAGATE ST9146852SS HT66>         at scbus0 target 1 lun 0 (pass1)
<SEAGATE ST9146852SS HT66>         at scbus0 target 2 lun 0 (pass2)
<SEAGATE ST9146852SS HT66>         at scbus0 target 3 lun 0 (pass3)
<SEAGATE ST9146852SS HT66>         at scbus0 target 4 lun 0 (pass4)
<SEAGATE ST9146852SS HT66>         at scbus0 target 5 lun 0 (pass5)
<DP BACKPLANE 1.07>                at scbus0 target 32 lun 0 (ses0,pass6)
If each drive shows up as a volume, you'll probably see something like:
Code:
<SEAGATE ST9146852SS HT66>         at scbus0 target 0 lun 0 (mfid0,pass0)
instead of just the /dev/passN devices.
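
To avoid pointing smartctl at the backplane by accident (the /dev/pass6 case from the question), the devlist output can be filtered down to the drive entries. This is a minimal sketch that assumes the line layout shown above; cam_pass_devices is a hypothetical helper name.

```shell
# Sketch: read `camcontrol devlist` output on stdin and print one line per
# drive: "/dev/passN other-periph" ("-" when only a pass device is listed).
# Entries that carry a ses device (enclosures/backplanes) are skipped.
cam_pass_devices() {
    awk '
        /\(.*pass[0-9]+.*\)/ {
            periphs = $0
            sub(/.*\(/, "", periphs)      # keep the text inside the final parentheses
            sub(/\).*/, "", periphs)
            if (periphs ~ /(^|,)ses[0-9]+/) next   # enclosure, not a drive
            n = split(periphs, p, ",")
            pass = ""; other = "-"
            for (i = 1; i <= n; i++) {
                if (p[i] ~ /^pass[0-9]+$/) pass = p[i]
                else other = p[i]
            }
            if (pass != "") print "/dev/" pass, other
        }
    '
}

# Usage:
#   camcontrol devlist | cam_pass_devices
```

With the R710 listing above this would print /dev/pass0 through /dev/pass5 and leave out the DP BACKPLANE entry.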

Note that you'll need to load the undocumented mfip(4) module to have the individual disks show up as /dev/passN devices:
Code:
(0:1) host:~terry# kldload mfip
You'll also want to add:
Code:
mfip_load="yes"
to /boot/loader.conf.

If you (or rather someone else reading this in the future, as your drives are SAS) have SATA devices, you'll probably get a warning about "MegaRAID SAT layer is reportedly buggy, use '-d sat' to try anyhow" from # smartctl.

Once you can access the drives with # smartctl, you can compare the serial number reported for each /dev/passN with the serial number shown in the PERC's BIOS setup menu. The Dell controllers are usually pretty well-behaved and don't shuffle the /dev/passN devices on you. Some of the genuine LSI controllers do shuffle them, which can be annoying.
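
The serial number comparison can be scripted roughly as follows. This is a sketch, not a tested recipe for this controller: it assumes smartctl -i prints a "Serial number:" line for these drives (matched case-insensitively here), and smart_serial, map_pass_serials and the SMARTCTL override variable are hypothetical names introduced for illustration.

```shell
# Sketch: read `smartctl -i` output on stdin and print the serial number.
smart_serial() {
    awk -F': *' 'tolower($1) == "serial number" { print $2; exit }'
}

# Sketch: for each /dev/passN named in the first column of stdin, print
# "device serial". SMARTCTL is a hypothetical override for the binary name.
map_pass_serials() {
    while read -r dev rest; do
        # stdin is redirected so the command cannot swallow the device list
        serial=$("${SMARTCTL:-smartctl}" -i "$dev" </dev/null 2>/dev/null | smart_serial)
        printf '%s %s\n' "$dev" "${serial:-unknown}"
    done
}

# Usage (needs root, with mfip loaded):
#   ls /dev/pass* | map_pass_serials
```

A device that reports "unknown" is either not a drive (for example the enclosure) or is not returning identity data through the controller.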

If you run into an mfi(4) controller that doesn't expose the /dev/passN devices even after loading mfip(4), there is a tunable that will force the controller to make the individual drives show up, even when they are members of a volume. This is not recommended, as it makes it easy to clobber the data on the individual drives. If you absolutely have to do it, the tunable is hw.mfi.allow_cam_disk_passthrough.
 