Solved Resizing a vdev; restore from backup or resilver?

`Orum

Active Member

Thanks: 14
Messages: 121

#1
I have a storage pool at home with a single vdev, 3 disks in raidz1. As I've found myself needing more space, I'm going to replace the existing disks with new ones, but this leaves two options:
  • Destroy the existing pool, recreate it with the new disks, and restore from backup via another zfs send/recv.
  • Replace each disk one at a time, resilvering between each replacement.
My main concern on this pool is minimizing fragmentation, which is already above 25%. My understanding is that the latter method will not reduce fragmentation, as resilvering does not reorganize data on disk at all. Therefore the former seems better, as zfs send/recv rewrites everything and should lay it out less fragmented. Is that correct?
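In command terms, the two options look roughly like this (pool and device names are hypothetical placeholders, not from the original post):

```shell
# Option 1: destroy and restore (pool is offline for the duration).
zpool destroy tank
zpool create tank raidz1 da4 da5 da6
zfs send -R backup/tank@latest | zfs receive -Fdu tank

# Option 2: rolling replacement (pool stays online).
# Repeat once per disk, waiting for the resilver to finish in between.
zpool replace tank da1 da4
```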

I will also be doing the same on several servers at work, where uptime is a bigger concern and I'll be transferring both root and storage pools. I assume the only option there that doesn't involve any downtime is the latter, but I'm thinking of ways to perform the former that minimize downtime. So far the best I can come up with is:
  1. Snapshot and transfer the full pool to a new pool.
  2. Take an incremental snapshot and transfer it from the original pool to the new one, covering the time between the last snapshot and now.
  3. Repeat step 2 as many times as necessary to asymptotically reduce the incremental transfer time. (Edit: I'm guessing this might negate the defragmentation benefits of this method?)
  4. Shut down the machine, boot from other media, and send one final snapshot as an incremental stream.
  5. Tweak the necessary configs to make the new pool bootable, remove the old one, and boot.
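Steps 1-3 can be sketched as follows; the pool names ("tank", "newtank") and snapshot labels are made up for illustration:

```shell
# Step 1: initial full transfer of the whole pool hierarchy.
zfs snapshot -r tank@migrate1
zfs send -R tank@migrate1 | zfs receive -Fdu newtank

# Step 2: catch-up pass, sending only what changed since the last snapshot.
zfs snapshot -r tank@migrate2
zfs send -R -i tank@migrate1 tank@migrate2 | zfs receive -Fdu newtank

# Step 3: repeat with @migrate3, @migrate4, ... until the delta is small
# enough that the final offline pass (step 4) is quick.
```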
Any thoughts?

Edit: Actually, looking at my backups, even one made fresh with zfs send/recv, with no incremental snapshots transferred, shows a huge amount of fragmentation. Perhaps it's no better than simply resilvering...
 

sko

Well-Known Member

Thanks: 206
Messages: 407

#2
Just replace each disk and let the pool resilver. Fragmentation drops as available free space increases, because the metric measures the distribution of the free space across the disk: as free space shrinks, the remaining free spots are more randomly scattered over the disk, and hence the fragmentation metric rises.

Matt Ahrens explained fragmentation in the context of space allocation in his talk at BSDcan 2016: https://www.bsdcan.org/2016/schedule/events/710.en.html
From slide #7:
FWFS = CHUNK_SIZE * NUM_CHUNKS * (1 - PCT)
frag% = 1 - FWFS / TOTAL_FREE
where FWFS is the fragmentation-weighted free space and PCT is the fragmentation percentage assigned to free chunks of a given size (see the histogram on slide #7).

The important equation is the second one: as a pool fills up, TOTAL_FREE shrinks and the remaining free chunks get smaller and more scattered, so FWFS falls even faster than TOTAL_FREE and frag% rises.
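To make the second equation concrete, here is a toy calculation with made-up numbers (not from any real pool). Growing the pool adds large contiguous free chunks, which carry almost no fragmentation weight (PCT near 0), so frag% falls:

```shell
# Before the resize: 100 GB free, weighted down to FWFS = 60 GB
# because much of it sits in small, scattered chunks.
before=$(awk 'BEGIN { printf "%.0f", (1 - 60/100) * 100 }')
echo "frag% before: $before"   # 40

# After the resize: 100 GB of new, contiguous free space is added,
# and nearly all of it counts toward FWFS.
after=$(awk 'BEGIN { printf "%.0f", (1 - (60+100)/(100+100)) * 100 }')
echo "frag% after: $after"     # 20
```

Same data on disk, yet the reported fragmentation halves, purely because TOTAL_FREE grew with well-shaped free space.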

So if you increase TOTAL_FREE by growing the vdev/pool, frag% will drop rather significantly.
Also: 25% fragmentation is no big deal. IIRC anything below 80% won't have any significant impact on performance nowadays, e.g. due to the write throttling Matt Ahrens explained in the talk. On SSDs, fragmentation usually has no effect (unless you're also running out of space), because SSDs really don't care whether you're writing adjacent or randomly distributed blocks.

Some of our root pools (on SSD) have been at ~60-70% fragmentation for quite a while now. It seems to me this is kind of a normal value for (small) pools where lots of snapshots are taken/removed each day.
 
OP

`Orum

Active Member

Thanks: 14
Messages: 121

#3
Just an update for anyone else who stumbles across this later on. I'm doing the resilvering method, but I think it's slower. Why? I need to resilver, then scrub (to make sure the resilver actually wrote the correct data), and repeat that process three times. Each resilver plus scrub takes around 12 hours, and that's assuming I'm there to change the drives right when they need to be swapped. That's a minimum of 36 hours total, for those of you keeping track.

In contrast, scrubbing your backup (assuming you do regular incremental backups and treat the original disks as your "second" backup) and then writing the whole thing back would take an estimated 20 hours (I don't know exactly how long the restore would take), followed by a scrub I estimate at 5 more hours, for a total of about 25 hours. Additionally, you only need to be there once to swap all the drives.

To be fair, you don't have to scrub between each resilver, but it seems like the safest way to catch a problem with one of the new disks before you depend on it and have to restore from backup; skipping the scrubs would put the two methods roughly on par with one another.
 

sko

Well-Known Member

Thanks: 206
Messages: 407

#4
The scrubbing really isn't necessary, as ZFS a) stores multiple copies and checksums of everything and b) will likely re-read the blocks from the new drive as soon as you replace the next disk.
Because you keep the old disk in the pool until the replace finishes, you still have the "original" copy from that disk during the resilver, so you'd never end up with insufficient copies, even if the previously replaced disk contains errors and the other copy (or copies) of that block are on the disk being replaced.

I also highly suspect that ZFS is verifying written blocks when resilvering, but I couldn't find any info on that.


And FTR: resilver is a background job, so it's low-priority and hence might take longer than a send|receive. But with resilver you can do all the disk replacements on a running system without having to take it or any of its services down.


You also don't need to be there for each drive replacement. Just put all drives in and chain all the replacements (& scrubs if you still insist) together:
zpool replace && zpool replace && zpool replace && zpool scrub && `zpool status | mail root`

Before using ZFS I did a burn-in on all new disks by running badblocks 2-3 times and letting the drive run a long/offline SMART self-test. I stopped doing that after I got a drive for my server at home that was faulty from the beginning: it didn't even make it through the resilver and was kicked out of the pool during the process due to write errors. ZFS takes enough care of itself to detect faulty drives at resilver time, so I spare myself the additional work...
 
OP

`Orum

Active Member

Thanks: 14
Messages: 121

#5
While it would definitely read the blocks during the resilver, I'd hate for the first time those blocks are read to be during the resilver of a degraded pool. While I could probably just put the old drive back in, as you mention, it seems like more hassle than it's worth when a scrub ensures everything is copacetic prior to resilvering.

I'm almost certain ZFS does not verify written blocks when resilvering (i.e. those written to the new disk). I assume all read blocks are checked as usual, just as when a file is being read. I base this on the traffic I see from gstat during the resilver, which reports no read operations on the new disk. Curiously though, gstat does report write operations during a scrub, and I'm not sure why. Perhaps it's looking at traffic on the AHCI interface, where issuing a read command counts as a "write" operation (because the command is being sent to the disk)? Maybe a developer or someone more knowledgeable can enlighten us. I should probably be using zpool iostat instead...
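For anyone wanting to watch resilver traffic at the pool level rather than through GEOM, zpool iostat can break it down per vdev and leaf device (pool name hypothetical):

```shell
# Per-vdev read/write operations and bandwidth, refreshed every 5 seconds.
zpool iostat -v tank 5
```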

And yeah, I'm going to be doing resilvers on all the work machines, just for the convenience of not having to take the machines down. I realize it might take longer, but it's probably less of a headache than scheduling downtime.

You also don't need to be there for each drive replacement. Just put all drives in and chain all the replacements (& scrubs if you still insist) together:
zpool replace && zpool replace && zpool replace && zpool scrub && `zpool status | mail root`
I see two problems here. The first one is that in my case, I'm extremely limited by the SATA ports on that machine. It has seven internal SATA ports, and all are in use. My backups are done via the one remaining eSATA port. On our servers at work this is less of an issue.

The other is that the commands are non-blocking: they return as soon as (or shortly after) you run them, and the work continues in the background, so chaining them with && doesn't actually wait for each resilver. I'm sure this could be scripted around with a crude cron job, or clever use of at plus parsing the output of zpool status, but writing that script takes a lot of time for a task I almost never do.
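For what it's worth, a crude version of that script is only a few lines, assuming zpool status reports "resilver in progress" while a resilver is running, and assuming a machine with enough ports to hold old and new disks at once (pool and device names are placeholders):

```shell
#!/bin/sh
# Block until the named pool is no longer resilvering.
wait_for_resilver() {
    while zpool status "$1" | grep -q "resilver in progress"; do
        sleep 60
    done
}

zpool replace tank da1 da4; wait_for_resilver tank
zpool replace tank da2 da5; wait_for_resilver tank
zpool replace tank da3 da6; wait_for_resilver tank
zpool status tank | mail -s "tank: replacements done" root
```

A scrub after the last replacement would need the same polling treatment, since zpool scrub also returns immediately.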
 