[Solved] Replace failed drive in raidz2

The following is the zdata array in a FreeBSD 13.2-RELEASE system:

Code:
HD Device     Passthrough Device  Serial     GPT               Model                   Firmware  Slot #
mfisyspd0     pass0             V8H9UVMR     data_disk11       HGST HUS726T6TALE6L4     40H      0
mfisyspd1     pass1             V9G5S89L     data_disk12_1     HGST HUS726T6TALE6L4     984      1
mfisyspd2     pass2             V9HTSEEL     data_disk13_1     HGST HUS726T6TALE6L4     984      2
mfisyspd3     pass3             V8H9DH1R     data_disk14       HGST HUS726T6TALE6L4     40H      3
mfisyspd4     pass4             V9G81MWL     data_disk15_1     HGST HUS726T6TALE6L4     460      4
mfisyspd5     pass5             V8H9US9R     data_disk16       HGST HUS726T6TALE6L4     40H      5
mfisyspd6     pass6             V8H9V1LR     data_disk17       HGST HUS726T6TALE6L4     40H      6
mfisyspd7     pass7             V9H3L26L     data_disk18_2     HGST HUS726T6TALE6L4     984      7
mfisyspd8     pass8             V8KZXZWF     data_disk19_1     HGST HUS726T6TALE6L4     984      8
mfisyspd9     pass10            V8H9G39R     data_disk10       HGST HUS726T6TALE6L4     40H      9

The failing drive is the one in slot 9 with the HD device name of mfisyspd9 and the passthrough name of pass10.

This is the procedure that I use to replace the failing drive:

1) take out failing drive
2) insert replacement drive
3) gpart as follows:
# gpart create -s gpt mfisyspd9
# gpart add -t freebsd-zfs -l data_disk10_1 mfisyspd9
4) confirm gpt structure
# gpart backup mfisyspd9
5) start replacement of failing drive (may take a day or so to resilver)
# zpool replace zdata gpt/data_disk10 gpt/data_disk10_1
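The steps above can be sketched as a small dry-run helper that only echoes the commands it would run, so nothing touches the pool; `replace_plan` and its arguments are placeholders for illustration, not part of my actual workflow:

```shell
# Dry-run sketch of the replacement procedure: echoes the commands instead
# of executing them. replace_plan is a hypothetical helper for this example.
replace_plan() {
  disk=$1 old_label=$2 new_label=$3
  echo "gpart create -s gpt $disk"
  echo "gpart add -t freebsd-zfs -l $new_label $disk"
  echo "gpart backup $disk"
  echo "zpool replace zdata gpt/$old_label gpt/$new_label"
}
replace_plan mfisyspd9 data_disk10 data_disk10_1
```

Reviewing the echoed commands before running them for real is a cheap sanity check when a resilver is going to take a day or more.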

I've successfully replaced a few drives in this manner in the past. However, this time, when I inserted the new drive, it did not get recognized as mfisyspd9. When I attempted to 'gpart create -s gpt mfisyspd9', it barfed an error message as follows:

gpart: arg0 'mfisyspd9': invalid argument

Running 'camcontrol devlist -v' shows the new disk as pass10, which is expected, but when I enumerate the /dev directory, there is no entry for the mfisyspd9 device. I noticed that the new drive has newer firmware than the rest of the drives, which carry three different firmware versions (40H, the original; 460; and 984). The firmware of this new drive is 9G0. Would this explain why the drive fails to show up as mfisyspd9?

To compound the issue further, while researching it I noted the use of the 'zpool offline' command and used it to offline the failing drive. I then swapped the bad drive for the good one. In the process of troubleshooting, I accidentally ran 'zpool online' on the good drive. I offlined it, swapped the failing drive back in, and tried to online that one. It failed, saying it was not the expected drive.

At this point, I am not sure what steps I need to take. Do I need to run the following command:

# zpool replace zdata gpt/data_disk10 gpt/data_disk10

It seems counterintuitive, in that I'd be using the same GPT label...

Please advise.

~Doug
 
When I remove and reinsert the new drive, this is what is shown in /var/log/messages:

Code:
Apr  3 09:20:03 backup kernel: [511279] mfi0: 12754 (765451187s/0x0002/WARN) - Removed: PD 10(e0x0f/s9)
Apr  3 09:20:03 backup kernel: [511279] mfi0: 12755 (765451187s/0x0002/info) - Removed: PD 10(e0x0f/s9) Info: enclPd=0f, scsiType=0, portMap=00, sasAddr=50030480010e4d35,0000000000000000
Apr  3 09:20:03 backup kernel: [511279] mfi0: 12756 (765451187s/0x0002/info) - State change on PD 10(e0x0f/s9) from JBOD(40) to UNCONFIGURED_BAD(1)
Apr  3 09:20:21 backup kernel: [511296] mfi0: 12757 (765451205s/0x0002/info) - Inserted: PD 10(e0x0f/s9)
Apr  3 09:20:21 backup kernel: [511296] mfi0: 12758 (765451205s/0x0002/info) - Inserted: PD 10(e0x0f/s9) Info: enclPd=0f, scsiType=0, portMap=00, sasAddr=50030480010e4d35,0000000000000000
Apr  3 09:20:25 backup kernel: [511300] ses0: pass10 in 'Slot 10', SAS Slot: 1 phys at slot 9
Apr  3 09:20:25 backup kernel: [511300] ses0:  phy 0: SATA device
Apr  3 09:20:25 backup kernel: [511300] ses0:  phy 0: parent 50030480010e4d3f addr 50030480010e4d35
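Events like these can be grepped out of /var/log/messages to follow one slot's state transitions at a glance. A minimal sketch, using a few sample lines copied from the log above (the 's9)' pattern picking out slot 9 is an assumption for this example):

```shell
# Sketch: extract the PD removal/insertion/state-change events for one slot
# from mfi(4) kernel log text. The sample is abbreviated from the log above.
log='Apr  3 09:20:03 backup kernel: [511279] mfi0: 12754 (765451187s/0x0002/WARN) - Removed: PD 10(e0x0f/s9)
Apr  3 09:20:03 backup kernel: [511279] mfi0: 12756 (765451187s/0x0002/info) - State change on PD 10(e0x0f/s9) from JBOD(40) to UNCONFIGURED_BAD(1)
Apr  3 09:20:21 backup kernel: [511296] mfi0: 12757 (765451205s/0x0002/info) - Inserted: PD 10(e0x0f/s9)'
# Keep only lines for slot 9 and strip the timestamp/event-id prefix:
printf '%s\n' "$log" | grep -E 's9\)' | sed -E 's/.*- //'
```

On the real system you would feed it /var/log/messages instead of the here-string. The JBOD(40) to UNCONFIGURED_BAD(1) transition on removal is the interesting part here.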

Code:
[root@backup 03.Apr 8:56am ~]# camcontrol devlist -v
scbus0 on ahd0 bus 0:
<>                                 at scbus0 target -1 lun ffffffff ()
scbus1 on mfi0 bus 0:
<ATA HGST HUS726T6TAL W40H>        at scbus1 target 4 lun 0 (pass0)
<ATA HGST HUS726T6TAL W984>        at scbus1 target 5 lun 0 (pass1)
<ATA HGST HUS726T6TAL W984>        at scbus1 target 6 lun 0 (pass2)
<ATA HGST HUS726T6TAL W40H>        at scbus1 target 7 lun 0 (pass3)
<ATA HGST HUS726T6TAL W984>        at scbus1 target 8 lun 0 (pass4)
<ATA HGST HUS726T6TAL W40H>        at scbus1 target 9 lun 0 (pass5)
<ATA HGST HUS726T6TAL W40H>        at scbus1 target 10 lun 0 (pass6)
<ATA HGST HUS726T6TAL W984>        at scbus1 target 11 lun 0 (pass7)
<ATA HGST HUS726T6TAL W984>        at scbus1 target 12 lun 0 (pass8)
<LSI SAS2X28 0e12>                 at scbus1 target 15 lun 0 (ses0,pass9)
<ATA HGST HUS726T6TAL W9G0>        at scbus1 target 16 lun 0 (pass10)                       <<<<--------------------------------------------------
scbus2 on ahcich0 bus 0:
<INTEL SSDSC2BB600G4 D2010370>     at scbus2 target 0 lun 0 (ada0,pass11)
<>                                 at scbus2 target -1 lun ffffffff ()
scbus3 on ahcich1 bus 0:
<INTEL SSDSC2BB600G4 D2010370>     at scbus3 target 0 lun 0 (ada1,pass12)
<>                                 at scbus3 target -1 lun ffffffff ()
scbus4 on ahcich2 bus 0:
<>                                 at scbus4 target -1 lun ffffffff ()
scbus5 on ahcich3 bus 0:
<>                                 at scbus5 target -1 lun ffffffff ()
scbus6 on ahcich4 bus 0:
<>                                 at scbus6 target -1 lun ffffffff ()
scbus7 on ahcich5 bus 0:
<>                                 at scbus7 target -1 lun ffffffff ()
scbus8 on ahciem0 bus 0:
<AHCI SGPIO Enclosure 2.00 0001>   at scbus8 target 0 lun 0 (ses1,pass13)
<>                                 at scbus8 target -1 lun ffffffff ()
scbus9 on umass-sim0 bus 0:
<WD Elements 25A3 1030>            at scbus9 target 0 lun 0 (pass14,da0)
scbus10 on umass-sim1 bus 1:
<WD My Book 25EE 4004>             at scbus10 target 0 lun 0 (da1,pass15)
<WD SES Device 4004>               at scbus10 target 0 lun 1 (pass16,ses2)
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun ffffffff (xpt0)
[root@backup 03.Apr 9:16am ~]#

The new drive is listed as pass10 above.

Code:
[root@backup 03.Apr 9:16am ~]# ll /dev/mfi*
crw-r-----  1 root  operator  0x30 Mar 28 11:18 /dev/mfi0
crw-r-----  1 root  operator  0x5f Mar 28 11:18 /dev/mfisyspd0
crw-r-----  1 root  operator  0x60 Mar 28 11:18 /dev/mfisyspd0p1
crw-r-----  1 root  operator  0x62 Mar 28 11:18 /dev/mfisyspd1
crw-r-----  1 root  operator  0x63 Mar 28 11:18 /dev/mfisyspd1p1
crw-r-----  1 root  operator  0x65 Mar 28 11:18 /dev/mfisyspd2
crw-r-----  1 root  operator  0x67 Mar 28 11:18 /dev/mfisyspd2p1
crw-r-----  1 root  operator  0x68 Mar 28 11:18 /dev/mfisyspd3
crw-r-----  1 root  operator  0x6a Mar 28 11:18 /dev/mfisyspd3p1
crw-r-----  1 root  operator  0x6b Mar 28 11:18 /dev/mfisyspd4
crw-r-----  1 root  operator  0x6e Mar 28 11:18 /dev/mfisyspd4p1
crw-r-----  1 root  operator  0x6c Mar 28 11:18 /dev/mfisyspd5
crw-r-----  1 root  operator  0x6f Mar 28 11:18 /dev/mfisyspd5p1
crw-r-----  1 root  operator  0x70 Mar 28 11:18 /dev/mfisyspd6
crw-r-----  1 root  operator  0x76 Mar 28 11:18 /dev/mfisyspd6p1
crw-r-----  1 root  operator  0x72 Mar 28 11:18 /dev/mfisyspd7
crw-r-----  1 root  operator  0x77 Mar 28 11:18 /dev/mfisyspd7p1
crw-r-----  1 root  operator  0x74 Mar 28 11:18 /dev/mfisyspd8
crw-r-----  1 root  operator  0x78 Mar 28 11:18 /dev/mfisyspd8p1
[root@backup 03.Apr 9:22am ~]#

The new drive doesn't show up as mfisyspd9 above. It should, but it doesn't.
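A quick way to spot the gap is to walk the expected device numbers and report any missing nodes. This is a sketch, not my actual procedure; `check_mfisyspd` is a hypothetical helper, and the demo runs against a scratch directory mimicking /dev rather than /dev itself:

```shell
# Sketch: report which mfisyspdN nodes are missing from a device directory.
# check_mfisyspd is a hypothetical helper; point it at /dev on a real system.
check_mfisyspd() {
  dir=$1 count=$2 i=0
  while [ "$i" -lt "$count" ]; do
    [ -e "$dir/mfisyspd$i" ] || echo "missing: mfisyspd$i"
    i=$((i + 1))
  done
}
# Demo against a scratch directory standing in for /dev, with mfisyspd9 absent:
demo=$(mktemp -d)
for n in 0 1 2 3 4 5 6 7 8; do : > "$demo/mfisyspd$n"; done
check_mfisyspd "$demo" 10    # prints: missing: mfisyspd9
rm -rf "$demo"
```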

Code:
[root@backup 02.Apr 5:28pm ~]# zpool status zdata
  pool: zdata
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 10:26:05 with 0 errors on Wed Mar  6 18:06:05 2024
config:

        NAME                   STATE     READ WRITE CKSUM
        zdata                  DEGRADED     0     0     0
          raidz3-0             DEGRADED     0     0     0
            gpt/data_disk10    OFFLINE     88   301     0
            gpt/data_disk11    ONLINE       0     0     0
            gpt/data_disk12_1  ONLINE       0     0     0
            gpt/data_disk13_1  ONLINE       0     0     0
            gpt/data_disk14    ONLINE       0     0     0
            gpt/data_disk15_1  ONLINE       0     0     0
            gpt/data_disk16    ONLINE       0     0     0
            gpt/data_disk17    ONLINE       0     0     0
            gpt/data_disk18_2  ONLINE       0     0     0
            gpt/data_disk19_1  ONLINE       0     0     0
        logs
          mirror-1             ONLINE       0     0     0
            gpt/log0           ONLINE       0     0     0
            gpt/log1           ONLINE       0     0     0
        cache
          gpt/cache0           ONLINE       0     0     0
          gpt/cache1           ONLINE       0     0     0

errors: No known data errors
[root@backup 03.Apr 8:56am ~]#

Says it's OFFLINE here.
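For pools with many vdevs, an awk one-liner can filter the status output down to just the members that are not ONLINE. A sketch, using two lines abbreviated from the status block above:

```shell
# Sketch: filter 'zpool status'-style output for vdevs that are not ONLINE.
# The sample text is abbreviated from the status block above.
status='gpt/data_disk10    OFFLINE     88   301     0
gpt/data_disk11    ONLINE       0     0     0'
# Device rows have at least 5 fields: name, state, read, write, cksum.
printf '%s\n' "$status" | awk 'NF >= 5 && $2 != "ONLINE" {print $1, $2}'
```

In real use the input would be `zpool status zdata` piped into the same awk expression.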

Is there an mfi tool I can use to figure out why the system doesn't see it as an mfisyspd device? We're using an LSI 9240-4i controller card...
 
Ah!

Code:
[root@backup 03.Apr 9:34am ~]# mfiutil show drives
mfi0 Physical Drives:
 4 ( 5589G) JBOD <HGST HUS726T6TAL W40H serial=V8H9UVMR> SATA E1:S0
 5 ( 5589G) JBOD <HGST HUS726T6TAL W984 serial=V9G5S89L> SATA E1:S1
 6 ( 5589G) JBOD <HGST HUS726T6TAL W984 serial=V9HTSEEL> SATA E1:S2
 7 ( 5589G) JBOD <HGST HUS726T6TAL W40H serial=V8H9DH1R> SATA E1:S3
 8 ( 5589G) JBOD <HGST HUS726T6TAL W984 serial=V9G81MWL> SATA E1:S4
 9 ( 5589G) JBOD <HGST HUS726T6TAL W40H serial=V8H9US9R> SATA E1:S5
10 ( 5589G) JBOD <HGST HUS726T6TAL W40H serial=V8H9V1LR> SATA E1:S6
11 ( 5589G) JBOD <HGST HUS726T6TAL W984 serial=V9H3L26L> SATA E1:S7
12 ( 5589G) JBOD <HGST HUS726T6TAL W984 serial=V8KZXZWF> SATA E1:S8
16 ( 5589G) JBOD <HGST HUS726T6TAL W9G0 serial=V9KUGVVL> SATA E1:S9
[root@backup 03.Apr 9:34am ~]#

Seems the adapter recognizes the new drive: it's the last drive on the list.
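That output can also be turned into a serial-to-slot map, which is handy when matching a failed drive to a physical bay. A minimal sketch over two sample lines copied from the output above:

```shell
# Sketch: turn 'mfiutil show drives' lines into a serial -> enclosure:slot
# map. The two sample lines are copied from the mfiutil output above.
drives='10 ( 5589G) JBOD <HGST HUS726T6TAL W40H serial=V8H9V1LR> SATA E1:S6
16 ( 5589G) JBOD <HGST HUS726T6TAL W9G0 serial=V9KUGVVL> SATA E1:S9'
printf '%s\n' "$drives" | awk '{
  s = $0
  sub(/.*serial=/, "", s)   # drop everything up to the serial number
  sub(/>.*/, "", s)         # drop the closing bracket and the rest
  print s, $NF              # serial, then the E<encl>:S<slot> field
}'
```

On the real system you would pipe `mfiutil show drives` straight into the awk script.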

Why can't I successfully use the following:

Code:
[root@backup 03.Apr 9:34am ~]# gpart create -s gpt mfisyspd9
gpart: arg0 'mfisyspd9': Invalid argument
[root@backup 03.Apr 9:36am ~]#
 
Turns out that we needed to reboot the server in order for the system to recognize the new drive as mfisyspd9.

Now resilvering away!
 
My personal experience is that the mfi driver is not very well done, and it appears to be more or less abandonware by now.
I don't know what your hardware is, but if it's supported by mrsas then it's better to switch.
mrsas handles dynamic events much better.
 
It's a 10-year-old Supermicro server. I'm not sure whether it can be switched to mrsas in place; can it? The motherboard is a Supermicro X9DAi-iF.
 
The following is the zdata array in a FreeBSD 13.2-RELEASE system:

Code:
HD Device     Passthrough Device  Serial     GPT               Model                   Firmware  Slot #
mfisyspd0     pass0             V8H9UVMR     data_disk11       HGST HUS726T6TALE6L4     40H      0
mfisyspd1     pass1             V9G5S89L     data_disk12_1     HGST HUS726T6TALE6L4     984      1
mfisyspd2     pass2             V9HTSEEL     data_disk13_1     HGST HUS726T6TALE6L4     984      2
mfisyspd3     pass3             V8H9DH1R     data_disk14       HGST HUS726T6TALE6L4     40H      3
mfisyspd4     pass4             V9G81MWL     data_disk15_1     HGST HUS726T6TALE6L4     460      4
mfisyspd5     pass5             V8H9US9R     data_disk16       HGST HUS726T6TALE6L4     40H      5
mfisyspd6     pass6             V8H9V1LR     data_disk17       HGST HUS726T6TALE6L4     40H      6
mfisyspd7     pass7             V9H3L26L     data_disk18_2     HGST HUS726T6TALE6L4     984      7
mfisyspd8     pass8             V8KZXZWF     data_disk19_1     HGST HUS726T6TALE6L4     984      8
mfisyspd9     pass10            V8H9G39R     data_disk10       HGST HUS726T6TALE6L4     40H      9
What CLI command did you use to get that wonderful output?
 
Nothing magical: I collate these in a wiki based on the output of a variety of CLI commands.
 
I'm particularly interested in the "Slot #" heading. Is the slot # = SATA or SAS port number on the mobo?

Could you show me all the CLI commands you use for this so I can create a similar spreadsheet entry?

Identifying which drive tossed an error can be challenging sometimes.

Thanks.
 
Since an LSI 9240-4i controller card is used here, I used the mfiutil command. I also used the smartctl and gpart CLIs to derive additional information. From there I was able to organize the data into a meaningful summary to assist with disaster recovery.

# mfiutil show drives
# smartctl -i /dev/pass10 -d sat,0
# gpart show

The above commands are what I used to get the desired data.
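Collecting the smartctl output for every drive can be scripted as a loop over the passthrough devices. This is a dry-run sketch that only echoes the invocations; the device list here is a placeholder, since on the real box it would come from `camcontrol devlist`:

```shell
# Dry-run sketch: echo the smartctl invocation for each passthrough device
# instead of running it. The device names below are placeholders.
for dev in pass0 pass1 pass10; do
  echo "smartctl -i /dev/$dev -d sat,0"
done
```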

See https://man.freebsd.org/cgi/man.cgi?query=mfiutil&sektion=8&manpath=FreeBSD+8.0-RELEASE for more info.
 