Solved: degraded zpool advice

Hi,

Today I got the following email from Zabbix:
Code:
Trigger: zroot is in a DEGRADED state!
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:

Item values:

1. Health on zroot (r610.trinitech.co.uk:vfs.zpool.get[zroot,health]): DEGRADED
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*

Original event ID: 19178
As a result, I checked with zpool status and got this:
Code:
  pool: zroot
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 772K in 5h17m with 0 errors on Mon Feb 11 14:23:17 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        zroot                     DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            mfid0p3               ONLINE       0     0     0
            mfid1p3               ONLINE       0     0     0
            mfid2p3               ONLINE       0     0     0
            mfid3p3               ONLINE       0     0     0
            mfid4p3               ONLINE       0     0     0
            11019374657424073610  REMOVED      0     0     0  was /dev/mfid5p3

errors: No known data errors

I am the only person accessing this server and I didn't remove the disk, so I am assuming it is dead :(
  1. Will it do any harm to run zpool online to see if it finds the disk?
  2. Could someone please confirm that my command below is correct to replace the failed drive?
zpool replace zroot 11019374657424073610 mfid5p3
I have a couple of old disks but no new one. Will the resilvering process wipe the disk as part of the process, or do I have to wipe the disk myself?

Thank you
 
Do I run the zpool replace command before or after I replace the disk?
I also forgot to mention that it is a ZFS root pool.
 
I have a couple of old disks but no new one. Will the resilvering process wipe the disk as part of the process, or do I have to wipe the disk myself?
Wipe it yourself. And don't forget to (re)create the partition tables.
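A minimal sketch of what that could look like, assuming the replacement ends up as mfid5 and a healthy member such as mfid4 carries the layout to copy (essentially the approach worked out further down):
Code:
gpart destroy -F mfid5                        # only needed if the reused disk still carries an old partition table
gpart backup mfid4 | gpart restore -F mfid5   # clone the GPT layout from a healthy pool member
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 mfid5   # reinstall the boot code on the new disk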
 
And don't forget to (re)create the partition tables.
Ha, forgot about that.
So is the procedure below now correct?

Step 1 - Replace the faulty disk with a new one
Step 2 - Partition the new disk

Code:
gpart create -s gpt mfid5
gpart add -a 4k -s 512k -t freebsd-boot mfid5
gpart add -a 4k -s 1T -t freebsd-zfs -l disk5 mfid5
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 mfid5
Step 3 - Put the disk offline
zpool offline zroot mfid5p3
Step 4 - Tell ZFS about the new drive
zpool replace zroot 11019374657424073610 mfid5p3

Could you please help me understand whether the above commands are correct for creating the partition table?
gpart show returns:
Code:
=>       40  285474736  mfid0  GPT  (136G)
         40       1024      1  freebsd-boot  (512K)
       1064        984         - free -  (492K)
       2048    4194304      2  freebsd-swap  (2.0G)
    4196352  281276416      3  freebsd-zfs  (134G)
  285472768       2008         - free -  (1.0M)

=>       40  285474736  mfid1  GPT  (136G)
         40       1024      1  freebsd-boot  (512K)
       1064        984         - free -  (492K)
       2048    4194304      2  freebsd-swap  (2.0G)
    4196352  281276416      3  freebsd-zfs  (134G)
  285472768       2008         - free -  (1.0M)

=>       40  285474736  mfid2  GPT  (136G)
         40       1024      1  freebsd-boot  (512K)
       1064        984         - free -  (492K)
       2048    4194304      2  freebsd-swap  (2.0G)
    4196352  281276416      3  freebsd-zfs  (134G)
  285472768       2008         - free -  (1.0M)

=>       40  285474736  mfid3  GPT  (136G)
         40       1024      1  freebsd-boot  (512K)
       1064        984         - free -  (492K)
       2048    4194304      2  freebsd-swap  (2.0G)
    4196352  281276416      3  freebsd-zfs  (134G)
  285472768       2008         - free -  (1.0M)

=>       40  285474736  mfid4  GPT  (136G)
         40       1024      1  freebsd-boot  (512K)
       1064        984         - free -  (492K)
       2048    4194304      2  freebsd-swap  (2.0G)
    4196352  281276416      3  freebsd-zfs  (134G)
  285472768       2008         - free -  (1.0M)
 
gpart backup mfid4 | gpart restore -F mfid5
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 mfid5
zpool offline zroot mfid5p3
zpool replace zroot 11019374657424073610 mfid5p3

and I should be ok, right?

gkontos nice to see you again :)
 
I went to the Data Centre and replaced the faulty drive.
gpart create -s gpt mfid5 returns:
Code:
gpart: arg0 'mfid5': Invalid argument
more /var/run/dmesg.boot | grep mfid
Code:
mfid0 on mfi0
mfid0: 139392MB (285474816 sectors) RAID volume 'Disk0' is optimal
mfid1 on mfi0
mfid1: 139392MB (285474816 sectors) RAID volume 'Disk1' is optimal
mfid2 on mfi0
mfid2: 139392MB (285474816 sectors) RAID volume 'Disk2' is optimal
mfid3 on mfi0
mfid3: 139392MB (285474816 sectors) RAID volume 'Disk3' is optimal
mfid4 on mfi0
mfid4: 139392MB (285474816 sectors) RAID volume 'Disk4' is optimal
mfid5 on mfi0
mfid5: 139392MB (285474816 sectors) RAID volume 'Disk5' is optimal
According to the above, the name is correct.

Where am I going wrong?

Thank you
 
Bobi B.
I can see a 'detached' error
dmesg | grep mfid
Code:
[4088370] mfid3: hard error cmd=read 120842368-120842375
[4088688] mfid3: hard error cmd=read 121529000-121529007
[4287749] mfid3: hard error cmd=read 131831560-131831567
[4665924] mfid3: hard error cmd=read 119035528-119035559
[4666810] mfid3: hard error cmd=read 133877400-133877407
[4668277] mfid3: hard error cmd=read 125216152-125216159
[4670233] mfid3: hard error cmd=read 131851416-131851535
[4674287] mfid3: hard error cmd=read 134222104-134222287
[4674298] mfid3: hard error cmd=read 66180832-66180839
[4674846] mfid3: hard error cmd=read 124604200-124604215
[4675401] mfid3: hard error cmd=read 129997344-129997423
[4677565] mfid3: hard error cmd=read 128941680-128941703
[4677800] mfid3: hard error cmd=read 127419504-127419527
[4677800] mfid3: hard error cmd=read 127419248-127419503
[4678426] mfid3: hard error cmd=read 131096976-131097207
[4678444] mfid3: hard error cmd=read 131186944-131187055
[4679286] mfid3: hard error cmd=read 133913624-133913631
[4679571] mfid3: hard error cmd=read 134283920-134283943
[4680155] mfid3: hard error cmd=read 120349000-120349255
[4683062] mfid3: hard error cmd=read 133883872-133883943
[4683147] mfid3: hard error cmd=read 128403152-128403351
[4683725] mfid3: hard error cmd=read 128700872-128700999
[4683794] mfid3: hard error cmd=read 128658504-128658543
[4683821] mfid3: hard error cmd=read 129632456-129632479
[4684426] mfid3: hard error cmd=read 127772432-127772471
[4972766] mfid5: hard error cmd=read 96548384-96548391
[4972768] mfid5: hard error cmd=write 97780920-97780927
[4972768] mfid5: hard error cmd=write 97781200-97781319
[4972768] mfid5: hard error cmd=read 4196880-4196895
[4972768] mfid5: hard error cmd=read 285471760-285471775
[4972768] mfid5: hard error cmd=read 285472272-285472287
[4972768] mfid5: hard error cmd=write 97780928-97780935
[4972768] mfid5: detached

Can I get more details as to why it dropped?
 
Code:
mfid5: 139392MB (285474816 sectors) RAID volume 'Disk5' is optimal
Looks like you configured each disk as an individual RAID0 volume instead of using JBOD.
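If so, the controller will not expose a new mfidX device until a volume has been created for the replacement disk, which would also explain the gpart error above. A minimal sketch of checking the controller's view, assuming the mfi(4) driver and mfiutil(8):
Code:
mfiutil show drives     # physical disks the controller can see
mfiutil show volumes    # RAID volumes, i.e. the mfidX devices that currently exist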
 
SirDice,

Now that you mention it, yes I did, and I didn't know better back then.
So do I need to shut down the server, set the new disk up as RAID0, and then run
gpart backup mfid4 | gpart restore -F mfid5
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 mfid5
zpool offline zroot mfid5p3
zpool replace zroot 11019374657424073610 mfid5p3

Or is this zpool in serious trouble?
 
Bobi B.
I can see a 'detached' error
dmesg | grep mfid
Code:
[4088370] mfid3: hard error cmd=read 120842368-120842375
...
[4972768] mfid5: detached

Can I get more details as to why it dropped?
You might try reading the controller's event log with mfiutil(8); typing from memory: mfiutil show events.
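For example (the exact flags are in mfiutil(8)):
Code:
mfiutil show events | tail -n 50         # the most recent controller events
mfiutil show events | grep -i 'PD 05'    # only the events for the disk in slot 5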

Code:
mfid5: 139392MB (285474816 sectors) RAID volume 'Disk5' is optimal
Looks like you configured each disk as an individual RAID0 volume instead of using JBOD.
Unfortunately JBOD is not an option on some older LSI adapters; just RAID0 (we have MR9271 and MR9261 in the office, although they now use mrsas(4) with FreeBSD 11.x).
 
Tip: if your disks move around (I have an mpt(4) controller that does this), write down the serial number of the disk before you insert it. Then use smartctl -i /dev/mfid5 to check the serial.

On my controller, if I remove da2 for example, the remaining disks all change device numbers. Then, when you insert the new disk, it's suddenly da4 instead of the expected da2. The first time this happened to me I inadvertently wiped the wrong disk and broke the whole pool.

Now I double-check the drive's serial number so I'm sure I'm wiping and replacing the correct disk.
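A quick sketch of that check, assuming the drives show up as /dev/mfid0 through /dev/mfid5 and smartctl can reach them through the controller:
Code:
for d in /dev/mfid[0-5]; do
    printf '%s: ' "$d"
    smartctl -i "$d" | grep -i 'serial number'
done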
 
SirDice, I am sorry but I am not following you.
I wiped the disk before putting it in the server.
I need to understand how I tell zpool that the new disk can now take over.
 
Bobi B., here is the end of the output of mfiutil show events:
Code:
42223 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ae 10 00 00 28 00
42224 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ae 48 00 00 20 00
42225 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ad 88 00 00 08 00
42226 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ad f8 00 00 08 00
42227 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ae 00 00 00 08 00
42229 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42231 (Wed Feb 13 08:19:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42233 (Wed Feb 13 09:04:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42234 (Thu Feb 14 08:19:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42236 (Thu Feb 14 09:04:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42237 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f c0 00 00 18 00
42238 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f 70 00 00 10 00
42239 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7c 48 00 00 18 00
42240 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7e 00 00 00 20 00
42241 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7d f8 00 00 08 00
42242 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7b e8 00 00 18 00
42243 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7e f0 00 00 10 00
42244 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f e8 00 00 10 00
42245 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f 58 00 00 10 00
42246 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7c b0 00 00 20 00
42247 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f 00 00 00 08 00
42248 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f 10 00 00 40 00
42249 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7e 50 00 00 10 00
42250 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7c 68 00 00 10 00
42251 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f 80 00 00 28 00
42252 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7c 00 00 00 18 00
42253 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7e b8 00 00 08 00
42254 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - PD 05(e0x20/s5) Path 5000cca0153aaef9  reset (Type 03)
42255 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 82 98 00 00 20 00
42256 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 30 00 00 08 00
42257 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 82 e8 00 00 10 00
42258 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 e8 00 00 10 00
42259 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 a0 00 00 10 00
42260 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 70 00 00 10 00
42261 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 b8 00 00 20 00
42262 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 80 f8 00 00 08 00
42263 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 80 80 00 00 18 00
42264 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 48 00 00 18 00
42265 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 83 08 00 00 38 00
42266 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 80 00 00 10 00
42267 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 80 78 00 00 08 00
42268 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 82 00 00 00 60 00
42269 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 00 00 00 10 00
42270 (Thu Feb 14 22:20:40 GMT 2019/DRIVE/WARN) - Removed: PD 05(e0x20/s5)
42274 (Thu Feb 14 22:20:40 GMT 2019/0x0021/FATAL) - VOL 5 event: Controller cache pinned for missing or offline VD 05/5
42275 (Thu Feb 14 22:20:40 GMT 2019/VOLUME/FATAL) - VOL 5 event: VD 05/5 is now OFFLINE
42286 (Thu Feb 14 22:22:48 GMT 2019/ENCL/CRIT) - Enclosure PD 20(c None/p0) phy bad for slot 5
 
Bobi B., here is the end of the output of mfiutil show events:
Code:
42223 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ae 10 00 00 28 00
42224 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Command timeout on PD 03(e0x20/s3) Path 500000e114cecc12, CDB: 2a 00 08 1b ae 48 00 00 20 00
...
42229 (Tue Feb 12 12:39:01 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42231 (Wed Feb 13 08:19:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42233 (Wed Feb 13 09:04:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42234 (Thu Feb 14 08:19:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42236 (Thu Feb 14 09:04:06 GMT 2019/DRIVE/WARN) - Predictive failure: PD 03(e0x20/s3)
42237 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7f c0 00 00 18 00
...
42253 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 7e b8 00 00 08 00
42254 (Thu Feb 14 22:20:35 GMT 2019/DRIVE/WARN) - PD 05(e0x20/s5) Path 5000cca0153aaef9  reset (Type 03)
42255 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 82 98 00 00 20 00
...
42269 (Thu Feb 14 22:20:37 GMT 2019/DRIVE/WARN) - Command timeout on PD 05(e0x20/s5) Path 5000cca0153aaef9, CDB: 2a 00 0d 92 81 00 00 00 10 00
42270 (Thu Feb 14 22:20:40 GMT 2019/DRIVE/WARN) - Removed: PD 05(e0x20/s5)
42274 (Thu Feb 14 22:20:40 GMT 2019/0x0021/FATAL) - VOL 5 event: Controller cache pinned for missing or offline VD 05/5
42275 (Thu Feb 14 22:20:40 GMT 2019/VOLUME/FATAL) - VOL 5 event: VD 05/5 is now OFFLINE
42286 (Thu Feb 14 22:22:48 GMT 2019/ENCL/CRIT) - Enclosure PD 20(c None/p0) phy bad for slot 5
Well, I'm not that knowledgeable, but I believe the messages speak for themselves. Usually the disk gets stuck on an I/O request, most likely an unsuccessful read, and stops executing other requests. When the disk is directly attached (i.e. not hidden behind a RAID controller as part of a hardware RAID volume), you'll see its queue length (L(q)) in gstat(8) increase. I/O operations to that volume will then freeze and the affected processes will block in the kernel. In the unlucky case that it is your root volume, well, you're toast. When such a disk is part of a hardware RAID volume, the RAID controller usually drops the disk after some timeout, because waiting for a successful read would block the whole volume indefinitely.
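A sketch of watching for that on directly attached disks, assuming the providers are named mfid0 through mfid5:
Code:
gstat -f '^mfid'    # keep an eye on the L(q) column; a stuck disk shows a steadily growing queue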

The problem with mfi(4) is that the disks are not directly visible, hence their health (reallocated sectors, other errors, etc.) cannot be monitored with smartctl(8). Also notice that there is a predictive failure notification for PD 03 as well. How old are those disks? Can you shut down the host and check their health/metrics with smartctl in another box?
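For the bench test, something as simple as the following would show the overall verdict and the reallocated-sector counters, assuming the disk appears as ada0 in the other box:
Code:
smartctl -H /dev/ada0    # overall health self-assessment
smartctl -A /dev/ada0    # vendor attributes, e.g. Reallocated_Sector_Ct and Current_Pending_Sector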
 
Hi, the whole lot is old.
I just bought a new server to move everything across, but it looks like I am running out of time.
I have a server with a direct HBA to the disks.
Would it be better to zfs send the entire pool to the new server that has a proper HBA for ZFS?
 
Ha, forgot about that.
So is the procedure below now correct?

Step 1 - Replace the faulty disk with a new one
Step 2 - Partition the new disk

Code:
gpart create -s gpt mfid5
gpart add -a 4k -s 512k -t freebsd-boot mfid5
gpart add -a 4k -s 1T -t freebsd-zfs -l disk5 mfid5
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 mfid5
Step 3 - Put the disk offline
zpool offline zroot mfid5p3
Step 4 - Tell ZFS about the new drive
zpool replace zroot 11019374657424073610 mfid5p3

Step 3 is not needed. The rest looks good.
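Once the partition table is in place and zpool replace has been run, a couple of quick checks (using the device names from this thread):
Code:
gpart show mfid5        # should now match the layout of the other pool members
zpool status -v zroot   # the new mfid5p3 should appear and start resilvering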
 
I have no experience with zfs send; I tend to use rsync(1), as it supports full & incremental copies. Perhaps SirDice would give better advice?
I rarely use zfs send/receive. At home I like to live dangerously and rarely back anything up; I don't have anything important enough to back up anyway. At work it's mostly UFS due to hardware RAID, and most of those machines aren't backed up either (almost everything is set up automatically with Puppet and can be reinstalled in a heartbeat). The actual website code is handled by the developers (GitHub is the main source, I believe, with regular code backups stored 'offline' by one of the guys who maintains everything). The databases are on several servers, online at the datacenter and one off-site at the office. Disaster recovery is possible, but copying the databases back is going to take quite some time (in the meantime I would have plenty of time to restore almost all functionality from scratch).
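For completeness, a minimal zfs send/receive sketch for moving a whole pool, assuming a recursive snapshot named migrate1 and a pool called zroot already created on the new host (reachable here as newhost; both names are placeholders):
Code:
zfs snapshot -r zroot@migrate1                                   # recursive snapshot of every dataset
zfs send -R zroot@migrate1 | ssh newhost zfs receive -Fdu zroot  # replicate the snapshot tree to the new box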
 
Yes, but you might want to run sysutils/smartmontools on the disk that has failed. I had similar issues recently and it turned out that it was a faulty power supply.
You were right: the server backplane was not working for the 6th HDD.
We replaced the server, with a direct HBA this time (lesson learnt).
As for restoring the services, since we use CBSD jails it was easier to migrate the jails to the new server (CBSD node) than to use zfs send/receive (as I'm not experienced enough with it).
 