ZFS Disk Failed - mfiutil

Hi,

We have FreeBSD 9.1, and one of the drives in our RAID-1 array has failed. What is the proper way to replace the failed drive while the system is still running? If the second drive fails we lose the OS and won't be able to boot the server. Specifically, I'm looking for:
1.) How to figure out which physical slot the failed drive is in
2.) The command to replace the drive
3.) The command to rebuild the RAID array

The main / filesystem is running on the RAID-1 array. Here is the output; let me know if you need any other information:
Code:
root # mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe   State   Cache   Name
 mfid0 (  930G) RAID-1      64k DEGRADED Enabled
 mfid1 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid2 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid3 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid4 ( 3725G) RAID-0      64k OPTIMAL  Enabled
     3 ( 3725G) RAID-0      64k OFFLINE  Enabled
mfid18 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid19 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid20 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid21 ( 3725G) RAID-0      64k OPTIMAL  Writes

root # mfiutil show drives
mfi0 Physical Drives:
 0 (  931G) FAILED <ST31000524AS JC4B serial=5VPDLEQ8> SATA E1:S0
 1 (  931G) ONLINE <ST31000524AS JC4B serial=5VPDLPGZ> SATA E1:S1
 2 ( 3726G) ONLINE <WL4000GSA6472E\011 1KX1 serial=WOL240256793\011> SATA E1:S2
 3 ( 3726G) ONLINE <WL4000GSA6472E 1KX0 serial=WOL240241285> SATA E1:S3
 4 ( 3726G) FAILED <ST4000DM000-1F21 CC52 serial=W300ANQ8> SATA E1:S4
 5 ( 3726G) ONLINE <WL4000GSA6472E 1KX0 serial=WOL240256926> SATA E1:S5
 6 ( 3726G) ONLINE <ST4000DM000-1F21 CC54 serial=Z3015J64> SATA E1:S6
 7 ( 3726G) ONLINE <ST4000DM000-1F21 CC54 serial=Z3015L8E> SATA E1:S7
 8 ( 3726G) ONLINE <WL4000GSA6472E 1KX1 serial=WOL240256967> SATA E1:S8
 9 ( 3726G) ONLINE <WL4000GSA6472E HP00 serial=WOL240241417> SATA E1:S9
10 ( 3726G) ONLINE <WL4000GSA6472E\011 1KX0 serial=WOL240256966\011> SATA E1:S10
The other failed drive is part of a zpool:
Code:
root # zpool status
  pool: sysvol
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 14 08:45:01 2018
        2.13T scanned out of 12.8T at 266M/s, 11h42m to go
        272G resilvered, 16.62% done
config:

        NAME                       STATE     READ WRITE CKSUM
        sysvol                     DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            mfid1                  ONLINE       0     0     0
            mfid2                  ONLINE       0     0     0
            mfid3                  ONLINE       0     0     0
            mfid4                  ONLINE       0     0     0
            spare-4                REMOVED      0     0     0
              3265633713998955857  REMOVED      0     0     0  was /dev/mfid5
              mfid21               ONLINE       0     0     0  (resilvering)
            mfid18                 ONLINE       0     0     0
            mfid19                 ONLINE       0     0     0
            mfid20                 ONLINE       0     0     0
        logs
          ada0                     ONLINE       0     0     0
        spares
          11427004879980126793     INUSE     was /dev/mfid21
 
We have FreeBSD 9.1
FreeBSD 9.1 has been End-of-Life since December 2014 and is not supported any more. Please upgrade to a supported version as soon as possible.
Topics about unsupported FreeBSD versions
https://www.freebsd.org/security/unsupported.html


mfiutil locate "E1:S0" on, then look for a blue (probably flashing) LED on the enclosure. Pull the drive and replace it. You'll want to check disk "E1:S4" too; it's also in a FAILED state and is what's causing the ZFS pool to be DEGRADED. Same thing: use the locator LED to find the drive, pull it and replace it.
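In command form that's roughly the following (the enclosure:slot addresses come from your mfiutil show drives output; don't forget to turn the LED back off afterwards):
Code:
mfiutil locate E1:S0 on     # light the locator LED on the failed RAID-1 member
# pull the failed drive and insert the replacement
mfiutil locate E1:S0 off    # turn the locator LED off again once the swap is done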
 
Thanks for the info! Do I have to issue any command, or will the drive (E1:S0) automatically rebuild itself? As for E1:S4, that one is part of the zpool, so would I just issue a 'zpool replace sysvol mfid5'? Assuming the new drive comes up as mfid5 again.
 
Most of the time it'll start rebuilding by itself. I've only had a couple of instances when it didn't.
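If you want to keep an eye on it, something like this should show the rebuild state and progress:
Code:
mfiutil show drives            # the replaced disk should show up as REBUILD
mfiutil drive progress E1:S0   # progress and ETA for the operation on that slot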
As for E1:S4, that one is part of the zpool, so would I just issue a 'zpool replace sysvol mfid5'? Assuming the new drive comes up as mfid5 again.
Careful with the replace command; note that the pool is currently resilvering onto a spare disk. I would just offline the bad drive, replace it, then "online" it again. If everything goes as planned things usually rebuild themselves and the used spare will become an available spare again.
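As a rough sketch, assuming the replacement disk comes back under the same mfid5 name:
Code:
zpool offline sysvol mfid5   # take the faulted member out of the pool first
# physically swap the disk in slot E1:S4
zpool online sysvol mfid5    # bring the replacement back in; the spare should detach once the pool is healthy again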
 
Thanks for the help! I ordered replacement drives, and will be replacing them next week.

Just to be clear on the process, I should do the following:
-mfiutil locate E1:S4 on
-zpool offline sysvol mfid5
-pull out the drive that has the blue light, replace it, plug it back in
-zpool online sysvol mfid5

The E1:S4 drive was showing up as mfid5 when I ran 'mfiutil show volumes'; now it shows as 3. When I plug the new drive back into the same slot, will it show up as mfid5 again? Also, what is mfiutil? I checked on another FreeBSD server, and when I run 'mfiutil show volumes', I get 'mfi_open: No such file or directory'. When I replaced a failed drive on that server, I used 'zpool replace sysvol faileddiskname'.

Sorry for all the questions, just don't want to screw something up and lose data.
 
When I plug the new drive back into the same slot, will it show up as mfid5 again?
Likely, but it may shuffle the disk numbering around in the process.
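If you want to double-check which physical drive ended up behind which volume, mfiutil show config should tell you:
Code:
mfiutil show config    # lists each volume together with the physical drive(s) backing it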

I checked on another FreeBSD server, and when I run 'mfiutil show volumes', I get 'mfi_open: No such file or directory'.
mfiutil(8) only works on mfi(4) controllers. There's also mptutil(8) for mpt(4), mpsutil(8) for mps(4) and mprutil(8) for mpr(4). If you want something that works for all LSI-based controllers, use sysutils/megacli.
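If you're not sure which driver a particular box is using, something along these lines usually narrows it down:
Code:
dmesg | egrep 'mfi[0-9]|mpt[0-9]|mps[0-9]|mpr[0-9]'   # see which RAID driver attached at boot
pciconf -lv | grep -B4 -i raid                        # identify the controller hardware itself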

Sorry for all the questions, just don't want to screw something up and lose data.
Mirrors or RAID are no substitute for good backups of course, so I hope you have them. There's always a bit of risk involved.
 
I replaced the drives. E1:S0 (in RAID-1) is currently rebuilding.

I ran 'zpool offline sysvol mfid5', replaced the drive, then ran 'zpool online sysvol mfid5' and got the following error:
warning: device 'mfid5' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

When I run 'mfiutil show drives' I get the following:
Code:
mfi0 Physical Drives:
 0 (  931G) REBUILD <ST31000524AS HP64 serial=9VPDE9XB> SATA E1:S0
 1 (  931G) ONLINE <ST31000524AS JC4B serial=5VPDLPGZ> SATA E1:S1
 2 ( 3726G) ONLINE <WL4000GSA6472E\011 1KX1 serial=WOL240256793\011> SATA E1:S2
 3 ( 3726G) ONLINE <WL4000GSA6472E 1KX0 serial=WOL240241285> SATA E1:S3
 4 ( 3726G) UNCONFIGURED GOOD <ST4000DM000-1F21 CC54 serial=Z300PTNK> SATA E1:S4
 5 ( 3726G) ONLINE <WL4000GSA6472E 1KX0 serial=WOL240256926> SATA E1:S5
 6 ( 3726G) ONLINE <ST4000DM000-1F21 CC54 serial=Z3015J64> SATA E1:S6
 7 ( 3726G) ONLINE <ST4000DM000-1F21 CC54 serial=Z3015L8E> SATA E1:S7
 8 ( 3726G) ONLINE <WL4000GSA6472E 1KX1 serial=WOL240256967> SATA E1:S8
 9 ( 3726G) ONLINE <WL4000GSA6472E HP00 serial=WOL240241417> SATA E1:S9
10 ( 3726G) ONLINE <WL4000GSA6472E\011 1KX0 serial=WOL240256966\011> SATA E1:S10

And when I run 'mfiutil show volumes' I get the following:
Code:
mfi0 Volumes:
  Id     Size    Level   Stripe   State   Cache   Name
 mfid0 (  930G) RAID-1      64k DEGRADED Enabled
 mfid1 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid2 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid3 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid4 ( 3725G) RAID-0      64k OPTIMAL  Enabled
mfid18 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid19 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid20 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid21 ( 3725G) RAID-0      64k OPTIMAL  Writes

What do I need to do to get E1:S4 from UNCONFIGURED GOOD to ONLINE? Is it possible to get it to go back to being the mfid5 volume? And finally, what zpool commands do I need to run before or after changing the drive with mfiutil?

Thanks!
 
I ran 'zpool offline sysvol mfid5', replaced the drive, then ran 'zpool online sysvol mfid5' and got the following error:
warning: device 'mfid5' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
Then I suggest doing what it says.
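A rough sketch of what that could look like here, using the GUID zpool status reports for the missing member (the new device name is a placeholder; use whatever the replacement actually shows up as):
Code:
# the old member no longer exists as /dev/mfid5, so refer to it by its GUID
zpool replace sysvol 3265633713998955857 mfid5   # "mfid5" is a placeholder for the new device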

What does zpool status tell you regarding the current status of the pool anyway?

Also: am I right to assume that you're basically running ZFS software RAID on top of a hardware RAID? Because that could create some weird issues in itself as well.
 
That is correct, it is running ZFS on top of a hardware RAID. zpool status shows the following:

Code:
  pool: sysvol
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1.60T in 14h35m with 0 errors on Wed Mar 14 23:20:31 2018
config:

        NAME                       STATE     READ WRITE CKSUM
        sysvol                     DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            mfid1                  ONLINE       0     0     0
            mfid2                  ONLINE       0     0     0
            mfid3                  ONLINE       0     0     0
            mfid4                  ONLINE       0     0     0
            spare-4                DEGRADED     0     0     0
              3265633713998955857  REMOVED      0     0     0  was /dev/mfid5
              mfid21               ONLINE       0     0     0
            mfid18                 ONLINE       0     0     0
            mfid19                 ONLINE       0     0     0
            mfid20                 ONLINE       0     0     0
        logs
          ada0                     ONLINE       0     0     0
        spares
          11427004879980126793     INUSE     was /dev/mfid21

errors: No known data errors

But when I run 'mfiutil show volumes', mfid5 isn't listed:
Code:
mfi0 Volumes:
  Id     Size    Level   Stripe   State   Cache   Name
 mfid0 (  930G) RAID-1      64k OPTIMAL  Enabled
 mfid1 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid2 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid3 ( 3725G) RAID-0      64k OPTIMAL  Enabled
 mfid4 ( 3725G) RAID-0      64k OPTIMAL  Enabled
mfid18 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid19 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid20 ( 3725G) RAID-0      64k OPTIMAL  Writes
mfid21 ( 3725G) RAID-0      64k OPTIMAL  Writes

And when I run 'mfiutil show drives', disk 4 shows a status of UNCONFIGURED GOOD:
Code:
mfi0 Physical Drives:
 0 (  931G) ONLINE <ST31000524AS HP64 serial=9VPDE9XB> SATA E1:S0
 1 (  931G) ONLINE <ST31000524AS JC4B serial=5VPDLPGZ> SATA E1:S1
 2 ( 3726G) ONLINE <WL4000GSA6472E\011 1KX1 serial=WOL240256793\011> SATA E1:S2
 3 ( 3726G) ONLINE <WL4000GSA6472E 1KX0 serial=WOL240241285> SATA E1:S3
 4 ( 3726G) UNCONFIGURED GOOD <ST4000DM000-1F21 CC54 serial=Z300PTNK> SATA E1:S4
 5 ( 3726G) ONLINE <WL4000GSA6472E 1KX0 serial=WOL240256926> SATA E1:S5
 6 ( 3726G) ONLINE <ST4000DM000-1F21 CC54 serial=Z3015J64> SATA E1:S6
 7 ( 3726G) ONLINE <ST4000DM000-1F21 CC54 serial=Z3015L8E> SATA E1:S7
 8 ( 3726G) ONLINE <WL4000GSA6472E 1KX1 serial=WOL240256967> SATA E1:S8
 9 ( 3726G) ONLINE <WL4000GSA6472E HP00 serial=WOL240241417> SATA E1:S9
10 ( 3726G) ONLINE <WL4000GSA6472E\011 1KX0 serial=WOL240256966\011> SATA E1:S10

I am trying to figure out how I can get drive 4 to show as ONLINE, and configure the drive as an mfid5 volume. If I can do that, I'm guessing I can just offline/online the drive in zpool, or possibly use zpool replace.
 