Weird and slow ZFS resilver

This pool is in quite a state - four 2TB WD RE3 drives, two of which were failing (now three), plus two 4TB RE4 drives (both healthy). Remote hands put two new drives in on Friday, and while that went well, last night the resilver/scan "finished" but the two mirrors were still marked "DEGRADED", perhaps because there was a single error reported (one file in a snapshot). On running a "zpool clear zroot" to get rid of the error, the resilver/scan started over... Ugh.

So help me out with a few things here that I'm not following:

Code:
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Sep 17 03:01:18 2018
    2.28T scanned out of 6.82T at 1.46K/s, (scan is slow, no estimated time)
        1.24T resilvered, 33.50% done
config:

    NAME                       STATE     READ WRITE CKSUM
    zroot                      DEGRADED     0     0     0
      mirror-0                 DEGRADED     0     0     0
        gpt/zdisk0             ONLINE       0     0     0
        replacing-1            DEGRADED     0     0     0
          3337922232420531863  REMOVED      0     0     0  was /dev/gpt/zdisk1/old
          gpt/zdisk1           ONLINE       0     0     0  block size: 512B configured, 4096B native
      mirror-1                 DEGRADED     0     0     0
        replacing-0            DEGRADED     0     0     0
          8989923517117392608  REMOVED      0     0     0  was /dev/gpt/zdisk2/old
          gpt/zdisk2           ONLINE       0     0     0  block size: 512B configured, 4096B native
        gpt/zdisk3             ONLINE       0     0     0
      mirror-3                 ONLINE       0     0     0
        gpt/zdisk5             ONLINE       0     0     0
        gpt/zdisk4             ONLINE       0     0     0
    logs
      gpt/zil0                 ONLINE       0     0     0

errors: No known data errors

I see that we now get warned about doing something stupid (putting 4K drives in a 512B pool), and that's nice.

What is confusing me here is that while the general status output says a resilver is in progress, the individual vdevs do not have "(resilvering)" next to them. I don't know if that indicates something is wrong or if it's just a change in ZFS that I've not noticed. Also the speed: zpool status claims 1.46K/s, yet if I look at gstat, the drives being resilvered are certainly quite busy (and the drives that aren't resilvering are almost idle).

Any pointers here? I don't have enough capacity anywhere to copy this all over elsewhere and recreate the pool (wish I could, as when this finishes and I drop in two more new drives they'll all be 4K-sector drives). My feeling is that when the resilver hits a bad block on one of the dicey drives ("zdisk3"), ZFS gets really confused and, instead of skipping it and moving on, never considers the resilver complete.

And a lesson learned - yeah, multiple drives bought at the same time certainly can all start showing bad blocks within a week of each other. :)
 
Just "zpool detach" the two drives via the long ID number shown there.

If that doesn't work, you should be able to "zpool remove" them.
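
Using the numeric IDs from your status output above, that should be something along these lines (double-check the IDs against your own output first):

Code:
zpool detach zroot 3337922232420531863
zpool detach zroot 8989923517117392608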
 
Oh, also I forgot to mention: 11.2-RELEASE.

So "zpool detach" got rid of those old devices and the output looks more normal. But I'm still totally confused on what it's doing. The resilver process restarted when the drives were removed, but for the life of me I can't figure out *what* it's resilvering or why:

Code:
  pool: zroot
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Sep 18 00:18:46 2018
    20.0G scanned out of 5.87T at 28.5M/s, 59h49m to go
        12.7G resilvered, 0.33% done
config:

    NAME            STATE     READ WRITE CKSUM
    zroot           ONLINE       0     0     0
      mirror-0      ONLINE       0     0     0
        gpt/zdisk0  ONLINE       0     0     0
        gpt/zdisk1  ONLINE       0     0     0  block size: 512B configured, 4096B native
      mirror-1      ONLINE       0     0     0
        gpt/zdisk2  ONLINE       0     0     0  block size: 512B configured, 4096B native
        gpt/zdisk3  ONLINE       0     0     0
      mirror-3      ONLINE       0     0     0
        gpt/zdisk5  ONLINE       0     0     0
        gpt/zdisk4  ONLINE       0     0     0
    logs
      gpt/zil0      ONLINE       0     0     0

errors: No known data errors

Everything is "ONLINE", nothing is "DEGRADED", but it's resilvering/scanning... I'd really like it to stop banging on "zdisk3", as that one is not long for this world. I have two drives ready to go (to replace "zdisk3" and "zdisk0"), but I have no idea what would happen if I pulled those drives while it thinks it's resilvering something.

Data is happening:

Code:
dT: 1.003s  w: 1.000s
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    2    207    207  15173   12.8      0      0    0.0   99.9  ada0
    1    176      0      0    0.0    176  15074    1.8   31.4  ada1
    0    188      0      0    0.0    188  18480    2.0   37.2  ada2
    2    193    193  18646   12.8      0      0    0.0   99.8  ada3
    0     11     11     63   11.2      0      0    0.0    4.3  ada4
    0     11     11     64    9.0      0      0    0.0    3.8  ada5
    0      0      0      0    0.0      0      0    0.0    0.0  da0

The "old" drives, ada0 (zdisk0) and ada3 (zdisk3) are generally near 100% busy with reads, the new drives, ada1 (zdisk1) and ada2 (zdisk2) are a bit less busy and with writes. That suggests data is being copied from old to new (again).
 
That is kind of odd.

Why do those drives have 512b sectors rather than 4k? It seems odd to me that you're using 512b sector size on just those disks. I use that same model of HDD and I've always used the 4k sector size that matches the native sector size.

If you're using smaller sector sizes, it seems reasonable to expect performance degradation to result. I'd try removing those disks and changing the sector size to match the 4k that the drives want.
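
If you want to double-check what ashift those vdevs were actually created with, something like this should show it (assuming the pool is in the default cache file):

Code:
zdb -C zroot | grep ashift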

FWIW, I've done a lot of resilvering and scrubbing with that model of disk recently and it tends to start out slow like that and get faster as the process goes on. Usually topping out at around 55M/s, but that's going to depend a bit on your machine, mine is relatively old with only 16gigs of RAM.
 
That is kind of odd.

Why do those drives have 512b sectors rather than 4k? It seems odd to me that you're using 512b sector size on just those disks. I use that same model of HDD and I've always used the 4k sector size that matches the native sector size.

512B is the native size on those drives; they're pretty old. The new ones are 4K.

Speed has ramped up to maybe 50MB/s, but the thing that bugs me is that the status output is not showing the resilvering happening on any particular vdev. I've never seen that before. Like, what happens if my third failing drive gives up the ghost? Will ZFS decide that my pool is hot garbage and refuse to boot? And if so, why? It's already completed at least one resilver, and the status output currently says nothing is "DEGRADED". It's all just a bit too weird for comfort.
 
It finished, but it really feels like there's a bug either in the zpool command or in something deeper in ZFS, since the "status" command is not showing the resilver. My gut feeling is that when the ashift warnings were added, someone goofed up the code that prints the status...

Do any ZFS committers ever pop into the forums? I have two more drives to replace, so I can reproduce and debug this if someone is interested in fixing it...
 
I think the best approach is to create a new 4k pool.
To my knowledge you can't add 4k drives to a 512b pool (though it looks like it's possible after all).
Running 512b drives in a 4k pool works well.

Partition all drives, 4k and 512b ones alike, with gpart add -t freebsd-zfs -a 4k ... (after gpart create -s gpt on each).
Set sysctl vfs.zfs.min_auto_ashift=12 before creating the new pool.
Move your data...
That's what I would do.
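
A rough sketch per drive (the device name, labels and pool name here are just placeholders; adjust to your layout):

Code:
# enforce ashift=12 for any newly created vdevs
sysctl vfs.zfs.min_auto_ashift=12
# fresh GPT plus a 4k-aligned partition on one drive (placeholder device/label)
gpart create -s gpt ada6
gpart add -t freebsd-zfs -a 4k -l newdisk0 ada6
# ...repeat for the other drives, then create the new pool:
zpool create newpool mirror gpt/newdisk0 gpt/newdisk1 mirror gpt/newdisk2 gpt/newdisk3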
 
Move your data...
That's what I would do.

Yeah, I get all that. I currently don't have room to move this elsewhere and recreate the pool. Other than performance issues, I don't think there are any functional concerns with mixed sector sizes.

Multiple problems, and only one remains:
  • Resilvering is shown as active, but the per-device notice of resilvering is not being shown
  • Resilvering is slow, but that's mainly due to the mismatch in sector sizes
  • Resilvering is slow because one of the source drives has bad sectors
  • I didn't realize I had to manually detach a "zpool replace"'d drive
The resilvering finished on the first two drives without issue and I have two additional drives resilvering right now. The thing that sent me down all the wrong paths was that the drives being resilvered are not identified as such. I'm opening a bug in bugzilla for that since it's clearly not the right behavior.
 