ZFS resilver starts slow... then speeds up

I just wanted to report some behaviour from a resilver I'm currently running. It may help somebody else stay calm and patient during theirs.

I noticed my backup script failed, then realized my backup pool was degraded. One of the 1TB drives had failed. If I were in the office, I would have tried popping the drive out and in again. Occasionally on my JBOD, a drive stops responding; popping it out and in again, followed by a zpool online, usually resolves it. The drives are usually perfectly OK, just a SAS controller or backplane hiccup, I guess.
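
For reference, the pop-out/in recovery I'd normally do looks something like this (the label name here is made up for illustration):
Code:
# confirm the controller sees the drive again
camcontrol devlist

# bring the device back into the pool
zpool online backup label/array1disk22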

Anyway, since I'm at home, I decided to detach and re-attach instead, since I'm using mirrors in all my vdevs, not raidz. That was a mistake! It would have been faster for me to cycle to the office and pop the drive!

I then decided to attach a spare, so that I had redundancy until the next day, when I could (hopefully) get the original drive attached again.
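
In commands, the detach-and-attach route was roughly this (the spare's label matches the status output below; the failed disk's label is hypothetical):
Code:
# remove the failed disk from its mirror (leaving that vdev unredundant)
zpool detach backup label/array1disk22

# attach the spare to the surviving half of the mirror
zpool attach backup label/array1disk23 label/array1disk7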

So after attaching the spare drive, I noticed it was very slow to resilver. I know from scrubbing that it can speed up if I wait a little while, but after 2 hours it was still moving extremely slowly, and I was frustrated because I just want it to mirror the blocks from one drive to another; being a mirror, that should be simple. After more than 2 hours, it still said it would take 600 hours to complete, having resilvered only 1.91G (0.27%). By then I was thinking "what are my options?", and I very nearly decided to update to FreeBSD 8.4 (from 8.3) to get feature flags (pool version 5000) and, hopefully, a faster resilver.

BUT ALAS, after 10 more minutes of surfing the web about slow ZFS resilvering, it sped up dramatically!

Here is my output; the first zpool status was taken after 2 hours, the second about 10 minutes later:
Code:
zpool status -v backup
  pool: backup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun  4 23:44:18 2014
        15.7G scanned out of 5.67T at 2.77M/s, 595h46m to go
        1.91G resilvered, 0.27% done
config:

        NAME                    STATE     READ WRITE CKSUM
        backup                  ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            label/array1disk8   ONLINE       0     0     0
            label/array1disk9   ONLINE       0     0     0
          mirror-1              ONLINE       0     0     0
            label/array1disk10  ONLINE       0     0     0
            label/array1disk11  ONLINE       0     0     0
          mirror-2              ONLINE       0     0     0
            label/array1disk12  ONLINE       0     0     0
            label/array1disk13  ONLINE       0     0     0
          mirror-3              ONLINE       0     0     0
            label/array1disk14  ONLINE       0     0     0
            label/array1disk15  ONLINE       0     0     0
          mirror-4              ONLINE       0     0     0
            label/array1disk16  ONLINE       0     0     0
            label/array1disk17  ONLINE       0     0     0
          mirror-5              ONLINE       0     0     0
            label/array1disk18  ONLINE       0     0     0
            label/array1disk19  ONLINE       0     0     0
          mirror-6              ONLINE       0     0     0
            label/array1disk20  ONLINE       0     0     0
            label/array1disk21  ONLINE       0     0     0
          mirror-7              ONLINE       0     0     0
            label/array1disk23  ONLINE       0     0     0
            label/array1disk7   ONLINE       0     0     0  (resilvering)

errors: No known data errors


[root@beastie1 /backup]# zpool status -v backup
  pool: backup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun  4 23:44:18 2014
        67.7G scanned out of 5.67T at 11.1M/s, 147h25m to go
        8.27G resilvered, 1.16% done
config:

        NAME                    STATE     READ WRITE CKSUM
        backup                  ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            label/array1disk8   ONLINE       0     0     0
            label/array1disk9   ONLINE       0     0     0
          mirror-1              ONLINE       0     0     0
            label/array1disk10  ONLINE       0     0     0
            label/array1disk11  ONLINE       0     0     0
          mirror-2              ONLINE       0     0     0
            label/array1disk12  ONLINE       0     0     0
            label/array1disk13  ONLINE       0     0     0
          mirror-3              ONLINE       0     0     0
            label/array1disk14  ONLINE       0     0     0
            label/array1disk15  ONLINE       0     0     0
          mirror-4              ONLINE       0     0     0
            label/array1disk16  ONLINE       0     0     0
            label/array1disk17  ONLINE       0     0     0
          mirror-5              ONLINE       0     0     0
            label/array1disk18  ONLINE       0     0     0
            label/array1disk19  ONLINE       0     0     0
          mirror-6              ONLINE       0     0     0
            label/array1disk20  ONLINE       0     0     0
            label/array1disk21  ONLINE       0     0     0
          mirror-7              ONLINE       0     0     0
            label/array1disk23  ONLINE       0     0     0
            label/array1disk7   ONLINE       0     0     0  (resilvering)

errors: No known data errors

So, the moral of the story is: ZFS and its scrubs and resilvers (replace/attach) are a complex beast. Perhaps wait just a bit longer before doing something drastic like rebooting or upgrading to get a newer ZFS.

A tip for staying relaxed: use two-drive redundancy (raidz2 or a 3-way mirror). That way you still have redundancy left while a resilver or scrub runs.

I use 3-way mirrors for the vdevs on my data pool, with cheap WD Blacks to offset the cost. The pool in question (above) is my backup pool, so I felt a 2-way mirror with WD Blues was good enough. I personally don't like raidz; I find it (i) over-complicates things, (ii) is less flexible during expansion, (iii) reduces performance, (iv) has limitations, and (v) gives more to go wrong. But I'm no ZFS expert, it's just my opinion.
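
For illustration, building a pool from 3-way mirror vdevs just means listing three devices per mirror keyword (pool and device names made up):
Code:
zpool create tank mirror da0 da1 da2 mirror da3 da4 da5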

...And yes, I know I should upgrade to 8.4 anyway, and I do plan to. It's just that everything is working so well that I don't want to risk it just now. I have 230 days of continuous uptime, barely needing even to restart a service, and I want to enjoy that for a bit longer yet.
 
And... about another 20 minutes later, it's really looking good:

Code:
zpool status -v backup
  pool: backup
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Jun  4 23:44:18 2014
        875G scanned out of 5.67T at 102M/s, 13h47m to go
        109G resilvered, 15.07% done
config:

        NAME                    STATE     READ WRITE CKSUM
        backup                  ONLINE       0     0     0
          mirror-0              ONLINE       0     0     0
            label/array1disk8   ONLINE       0     0     0
            label/array1disk9   ONLINE       0     0     0
          mirror-1              ONLINE       0     0     0
            label/array1disk10  ONLINE       0     0     0
            label/array1disk11  ONLINE       0     0     0
          mirror-2              ONLINE       0     0     0
            label/array1disk12  ONLINE       0     0     0
            label/array1disk13  ONLINE       0     0     0
          mirror-3              ONLINE       0     0     0
            label/array1disk14  ONLINE       0     0     0
            label/array1disk15  ONLINE       0     0     0
          mirror-4              ONLINE       0     0     0
            label/array1disk16  ONLINE       0     0     0
            label/array1disk17  ONLINE       0     0     0
          mirror-5              ONLINE       0     0     0
            label/array1disk18  ONLINE       0     0     0
            label/array1disk19  ONLINE       0     0     0
          mirror-6              ONLINE       0     0     0
            label/array1disk20  ONLINE       0     0     0
            label/array1disk21  ONLINE       0     0     0
          mirror-7              ONLINE       0     0     0
            label/array1disk23  ONLINE       0     0     0
            label/array1disk7   ONLINE       0     0     0  (resilvering)

errors: No known data errors

Of course it still seems a bit slow, but that's just how ZFS works. Pretty cool that it's all done online with very little performance hit on the rest of the server.

I just wish there were a switch to un-throttle it while it's out of office hours, then switch it back to normal once the resilver completes.
 
ethoms said:
Of course it still seems a bit slow, but that's just how ZFS works. Pretty cool that it's all done online with very little performance hit on the rest of the server. I just wish there were a switch to un-throttle it while it's out of office hours, then switch it back to normal once the resilver completes.
Try adjusting these:
Code:
vfs.zfs.resilver_min_time_ms: Min millisecs to resilver per txg
vfs.zfs.resilver_delay: Number of ticks to delay resilver
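
On a system that has those OIDs, something like this should let the resilver run flat out after hours and then put things back (the values below are illustrative; check your own system's defaults first):
Code:
# let the resilver run with no artificial delay, and give it more time per txg
sysctl vfs.zfs.resilver_delay=0
sysctl vfs.zfs.resilver_min_time_ms=5000

# afterwards, restore whatever the defaults were on your system
sysctl vfs.zfs.resilver_delay=2
sysctl vfs.zfs.resilver_min_time_ms=3000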
 
Thanks for the suggestion, toast. However, my zpool version and/or kernel/base is too old for those tunables:

Code:
[root@beastie1 ~]# sysctl vfs.zfs.resilver_min_time_ms
sysctl: unknown oid 'vfs.zfs.resilver_min_time_ms'
[root@beastie1 ~]# sysctl vfs.zfs.resilver_delay
sysctl: unknown oid 'vfs.zfs.resilver_delay'

BTW, it sped up a lot and got to 60%, but over the last hour it's back to a crawl. I have thousands, perhaps tens of thousands, of snapshots across many filesystems, and the pool is 78% full. So I think the drives and controller are OK; it's just that my pool is pushing the boundaries of ZFS a little.

Can you tell me if a replace is considerably faster than an attach? I went into the office overnight and popped the drive out and in. It registered in camcontrol no problem, so the original drive seems to be OK. I am hoping that a zpool replace of the spare will be faster than this current zpool attach resilver.
 
Sorry, rude of me not to post my versions; here they are:

Code:
[root@beastie1 ~]# uname -r
8.3-RELEASE
[root@beastie1 ~]# zpool upgrade
This system is currently running ZFS pool version 28.

All pools are formatted using this version.
 
The time estimate is a crude approximation based on the data that has already been scrubbed. At the start of a scrub it is usually way off towards the slower direction, and it takes some time before a better estimate appears. The actual scrubbing speed is still the same during the whole operation.
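
For illustration, the "to go" figure is just the remaining data divided by the average rate so far, which is why the early estimates above look so scary (numbers taken from the status outputs, roughly rounded):
Code:
(5.67 TiB - 15.7 GiB) / 2.77 MiB/s  ≈ 5,929,000 MiB / 2.77  ≈ 2,140,000 s  ≈ 595 h
(5.67 TiB - 67.7 GiB) / 11.1 MiB/s  ≈ 5,876,000 MiB / 11.1  ≈   529,000 s  ≈ 147 h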
 
My guess is that when a resilver first starts, it has to read all the metadata. Most metadata blocks are less than 4KB in size, and a hard disk's transfer rate for 4KB random reads is very poor. Once all that metadata has been read, the typical block size becomes 128KB, hence the significant speed-up after a while. It all depends on the typical block size in use and the limitations of the hard disks.
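
Some back-of-the-envelope numbers (assuming a disk that manages roughly 100 random reads per second, which is ballpark for 7200 rpm drives) show how much the block size matters:
Code:
~100 IOPS x   4 KiB  ≈  0.4 MiB/s   (metadata phase)
~100 IOPS x 128 KiB  ≈ 12.5 MiB/s   (random 128K data blocks)
sequential reads     ≈  100 MiB/s   (best case, large contiguous data)
That lines up reasonably well with the 2.77M/s, 11.1M/s and 102M/s figures in the status outputs above.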
 
Thanks for the encouragement. It's slowed down again, or at least the stats have.

Can anybody tell me if a replace is any faster than an attach?
 
ethoms said:
Can anybody tell me if a replace is any faster than an attach?
It shouldn't matter much, as they both do the same thing. zpool replace is just "attach/resilver/detach" in one command.
Code:
     zpool replace [-f] pool device [new_device]

         Replaces old_device with new_device.  This is equivalent to attaching
         new_device, waiting for it to resilver, and then detaching
         old_device.
 
Thanks, toast. My attach has finished resilvering now. I'm going to get my backup going again first, then thin out my snapshots, and then replace the spare with the original drive. The replace should be faster once the snapshots have been thinned out.
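
In rough commands, the plan is something like this (the snapshot naming pattern and the original disk's label are hypothetical):
Code:
# thin out old snapshots, e.g. anything matching a daily pattern
zfs list -H -t snapshot -o name | grep '@daily-' | xargs -n 1 zfs destroy

# then swap the original drive back in for the spare
zpool replace backup label/array1disk7 label/array1disk22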
 