ZFS 'zfs send/recv' during resilver/scrub

Hey all!

A long time ago now, when I started writing my backup script called 'replicate', I had this notion that you shouldn't run anything while a resilver/scrub was in progress, based on something I had read that was even older information. So the script halts if there is an ongoing resilver or scrub on either the source or destination system.
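The check the script performs boils down to something like this minimal sketch (the pool name "tank" is only an example; the actual 'replicate' code is not shown here):

```shell
# Halt if the given `zpool status` output shows an active scrub or
# resilver. Pool name "tank" is a made-up example.
scan_in_progress() {
    echo "$1" | grep -qE '(scrub|resilver) in progress'
}

if scan_in_progress "$(zpool status tank 2>/dev/null)"; then
    echo "scrub/resilver in progress on tank - halting replication" >&2
    exit 1
fi
```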

Now, when searching for facts regarding this matter, I come up empty... So, thinking about it, is it really necessary? It might just be superstition, based on old FUD that I've been taking for granted. Or have I been imagining this, and it has never been an issue in the first place?

Thing is, with bigger, heavily loaded systems, scrubs and resilvers can take days, even weeks, especially if the pool is up at, say, 80-90% capacity. And not having any recent backups for that long just isn't right. So even if replicating while a resilver or scrub is running could cause problems, it's still preferable to having no backups at all.

How do others handle this, e.g. FreeNAS? Do they replicate while scrubbing?

What are your thoughts about this?

/Sebulon
 
There's no reason you can't run send/recv while doing a scrub/replace, other than that it will be slower than normal.

We do send/recv every day from three local backups servers to the off-site backups server, regardless of what else is running on either system. On a "normal" day, the process takes only a couple of hours each. If one of the local backups servers is running a scrub or replace, then it can take up to 8 hours to do the send. If the remote backups server is running a scrub or replace, then each send can take up to twice as long as normal.

Scrub and resilver are designed to run in the background without affecting "normal" I/O too much.
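A daily incremental run like the one described above can be sketched roughly as follows (the host, pool, and snapshot names are all assumptions, not the actual setup; DRYRUN defaults to echo so the commands are only printed):

```shell
# Sketch of a daily incremental send/recv; "offsite", "tank", "backup"
# and the snapshot names are made-up examples. DRYRUN=echo prints the
# commands; set DRYRUN= (empty) to actually run them.
DRYRUN="${DRYRUN:-echo}"
PREV="tank@backup-yesterday"
SNAP="tank@backup-today"

$DRYRUN zfs snapshot -r "$SNAP"
# The incremental send works fine while either side is scrubbing or
# resilvering; it just takes longer to complete.
$DRYRUN sh -c "zfs send -R -i $PREV $SNAP | ssh offsite zfs recv -Fdu backup"
```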
 
Now, when searching for facts regarding this matter, I come up empty... So, thinking about it, is it really necessary? It might just be superstition, based on old FUD that I've been taking for granted. Or have I been imagining this, and it has never been an issue in the first place?
The closest thing I can think of is the adage that "the most likely time for a drive to fail is during a RAID rebuild". Aside from Murphy's Law, this is because the rebuild can stress the other drives (seeks, etc.) beyond what they see during a typical workload.

Thing is, with bigger, heavily loaded systems, scrubs and resilvers can take days, even weeks, especially if the pool is up at, say, 80-90% capacity.

What is this "slow" of which you speak? :cool:
Code:
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb 27 14:23:19 2013
        5.36T scanned out of 17.8T at 3.56G/s, 0h59m to go
        0 resilvered, 30.18% done
 
Code:
$ zpool status storage | head
  pool: storage
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub in progress since Sat Oct 10 16:00:02 2015
        31.2T scanned out of 49.2T at 77.7M/s, 67h28m to go
        0 repaired, 63.45% done

You obviously don't have a nearly-full, very fragmented pool running with dedupe enabled. :) Takes a full week to resilver a 2 TB drive, and 3 weeks-ish to do a full scrub. And there's only 24 drives in that server. :) (We've since turned off dedupe on all new servers, and the ones with 90 drives in them resilver at 100s of MB/s speeds.)
 
You obviously don't have a nearly-full, very fragmented pool running with dedupe enabled. :)
ZFS scrubs on pools with dedupe enabled on FreeBSD were a disaster the last time I looked at it (a couple of years ago). Something was artificially limiting the I/O rate: on a system with nothing else going on, gstat showed very little pool I/O, and there was little to no CPU / memory usage. After an extended period of trying suggestions from both other users and the developers (ranging from changing sysctls all the way through code patches), nothing helped. I looked at the savings from dedupe in my environment (only a few percent) and decided to stop using it and recreate the pools that had been built with dedupe turned on.

Takes a full week to resilver a 2 TB drive, and 3 weeks-ish to do a full scrub. And there's only 24 drives in that server. :) (We've since turned off dedupe on all new servers, and the ones with 90 drives in them resilver at 100s of MB/s speeds.)
Ouch! What is the normal throughput of the pool for reads and writes? Does gstat show any sort of "hot spot" during scrubs? I assume there are sufficient resources (CPU / memory) and tunables are optimized?
 
It still manages to saturate a gigabit link for zfs send/recv, so it's "fast enough" for now. It also completes backup runs for the remote servers it looks after within the 8-hour backups window, so we aren't too worried about it. If either of those situations changes, though, then we'll look into it more deeply.

Those 2 particular boxes are "the worst case scenario": highly fragmented pool, very low free space (generally under 4 TB free), running with dedupe enabled. We don't expect high performance, high IOps out of them. Just so long as they let the backups and replication finish within the allotted time. :)

We do have two other storage boxes that don't use dedupe, only lz4 compression, and those ones resilver a 2 TB disk in around 2 hours, and complete a scrub in under a day (90 disks). :) They also have 20-30 TB of free space, and are newer so much less fragmentation.

So we know that it's possible to get lots of performance/IOps out of ZFS. And we know it's possible for ZFS to drag along like molasses flowing uphill in January. :D
 
Thank you for your input!

As most of this discussion seemed to go off-topic and only Freddie seemed to have an opinion, I have decided to rip those sections out of the script and let it run regardless, albeit more slowly. Still preferable to not having any recent backups, in my opinion.
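The change amounts to demoting the old hard stop to a warning, along these lines (a sketch, not the actual 'replicate' code; "tank" is an example pool name):

```shell
# Warn instead of halting when the pool is being scrubbed/resilvered;
# $1 is the output of `zpool status <pool>`. A slow backup beats none.
warn_if_scanning() {
    if echo "$1" | grep -qE '(scrub|resilver) in progress'; then
        echo "WARNING: scrub/resilver running, replication will be slower" >&2
    fi
    return 0   # never abort the backup for this
}

warn_if_scanning "$(zpool status tank 2>/dev/null)"
# ...proceed with the zfs send/recv as usual...
```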

/Sebulon
 
There's also a handful of sysctls / loader.conf tunables you can tweak to either prioritise scrub/resilver I/O (to make it complete faster, but slow down normal I/O) or prioritise normal I/O (slow down scrub/resilver).

Code:
$ sysctl -d vfs.zfs | egrep -ie "scrub|resilver"
vfs.zfs.no_scrub_prefetch: Disable scrub prefetching
vfs.zfs.no_scrub_io: Disable scrub I/O
vfs.zfs.resilver_min_time_ms: Min millisecs to resilver per txg
vfs.zfs.scan_min_time_ms: Min millisecs to scrub per txg
vfs.zfs.scrub_delay: Number of ticks to delay scrub
vfs.zfs.resilver_delay: Number of ticks to delay resilver
vfs.zfs.vdev.scrub_max_active: Maximum number of I/O requests of type scrub active for each device
vfs.zfs.vdev.scrub_min_active: Initial number of I/O requests of type scrub active for each device

Don't ask what those do, or what settings to use, though. :) It's beyond my pay-grade. :D
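As an illustration only, deprioritising scrub I/O might look like the following. The default values in the comments are from FreeBSD of that era and the numbers chosen are arbitrary; verify the defaults and behaviour on your own release before changing anything.

```shell
# Illustrative only - check the defaults on your own release first.
# Increase the delay between scrub I/Os (default was 4 ticks) so
# normal I/O, including zfs send, gets more of the disks:
sysctl vfs.zfs.scrub_delay=8
# Spend less time per txg on scrubbing (default was 1000 ms):
sysctl vfs.zfs.scan_min_time_ms=500
# Or the opposite trade-off - finish the scrub faster at the cost of
# foreground I/O:
# sysctl vfs.zfs.scrub_delay=0
```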
 