ZFS Buffering zfs send/receive

gpw928

Well-Known Member

Reaction score: 129
Messages: 378

Hi,

I am converting the tank on my ZFS server from a 5-spindle RAIDZ1 (da0 - da4) to a 7-spindle RAIDZ2 (da0 - da6).

In order to do that I have to copy all the data out of the tank to temporary storage, re-create the tank in RAIDZ2 format, and copy all the data back again.

There's currently about 7.5 TB of data to move (though I plan to get that down to 6.5 TB). That's going to take quite a while, and I want to explore any option to speed it up.

The two extra spindles required for the 7 spindle RAIDZ2 pool (da5 and da6) are already installed.

I also have two new 4 TB external USB 3.1 disks (da7 and da8) to use as temporary storage for the data shuffle. They are ultimately destined for off-site backup rotation.

Does anyone have experience with buffering the zfs send/receive processes when they are both running on the same host?

Given that the recordsize for all the ZFS file systems is 128K, would it speed things up if I replace pv with mbuffer -s 128k -m 1G below?
Code:
$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   13.6T  7.60T  6.03T        -         -    31%    55%  1.00x  ONLINE  -
zroot   216G  6.70G   209G        -         -     0%     3%  1.00x  ONLINE  -
# Set up the temporary storage on two 4 TB USB disks
zpool create tankstore /dev/da7 /dev/da8
# Take the initial snapshot
zfs snapshot -r tank@replica1    # -r Recursively create snapshots of all descendent datasets.
zfs send -R tank@replica1 | pv | zfs receive -dF tankstore    # -R moves all properties, snapshots and clones.
# Shut down all client access to the tank, and then
zfs snapshot -r tank@replica2
zfs send -Ri tank@replica1 tank@replica2 | pv | zfs receive -dF tankstore
# Destroy the tank
zpool destroy -f tank    # -f is an option of the destroy subcommand, so it follows it
# Re-create the tank as raidz2 with 7 spindles
zpool create tank raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6
# Populate the new tank
# tankstore already holds @replica1 and @replica2 (received above), so no new snapshot is needed
zfs send -R tankstore@replica2 | pv | zfs receive -dF tank
 

sebhtml

New Member

Reaction score: 2
Messages: 8

Why do you need tank@replica2?
Isn't it the same as tank@replica1?
 
gpw928


sebhtml said:
Why do you need tank@replica2 ?
Is it not the same as tank@replica1 ?
The first zfs send will take many hours. It is followed by the comment "Shut down all client access to the tank", before the second snapshot is taken.

The second zfs send is incremental, sending only the changes made to the tank while the first send was running.
 

Eric A. Borisch

Aspiring Daemon

Reaction score: 310
Messages: 528

I have found that buffering speeds up transfers somewhat, because both the reads and the writes tend to be “bursty” (in terms of bytes going into or coming out of the send/recv stream), but locally the speedup is generally small. (It can be more beneficial when sending over a network: run on the receiving side, it lets the send saturate the connection more consistently.)

I can’t imagine either tool being a bottleneck (compared to spinning rust) since they only manipulate data in RAM. Either pv or mbuffer will do; you can set pv’s buffer size with -B. You do want to set it fairly large (don’t push into swap) to get the most benefit.
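As a rough sketch of the pv suggestion above: the helper name `pv_copy` and the 512M default are assumptions for illustration, not values from this thread. Pick a size that fits comfortably in free RAM.

```shell
# Hypothetical helper: same-host send/receive through pv with an enlarged buffer.
# $1 = source snapshot, $2 = destination pool, $3 = pv buffer size (assumed default 512M)
pv_copy() {
    zfs send -R "$1" | pv -B "${3:-512M}" | zfs receive -dF "$2"
}

# Example call (not executed here):
#   pv_copy tank@replica1 tankstore 1G
```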

One thing that is very nice is to dry-run with zfs send -nP <remainder of zfs send flags/args> and pass the “size NNNN” value from the final output line to pv as -s NNNN to get a completion percentage / time remaining indicator.
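A minimal sketch of that dry-run trick. The function names `send_size` and `copy_with_progress` are made up for illustration. It assumes the last line of the `zfs send -nP` output has the form `size<TAB>NNNN` and that the dry run writes it to standard output.

```shell
# Extract the byte count from `zfs send -nP` output read on stdin.
# Assumes a final line of the form: size<TAB>NNNN
send_size() {
    awk '$1 == "size" { n = $2 } END { print n }'
}

# Hypothetical wrapper: estimate the stream size first, then send with a progress meter.
# $1 = source snapshot, $2 = destination pool
copy_with_progress() {
    bytes=$(zfs send -nP -R "$1" | send_size)
    zfs send -R "$1" | pv -s "$bytes" | zfs receive -dF "$2"
}

# Example call (not executed here):
#   copy_with_progress tank@replica1 tankstore
```

The estimate is only as good as the dry run, so the percentage may drift a little on a pool that is still being written to.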
 
gpw928


Eric A. Borisch said:
One thing that is very nice is to dry-run with zfs send -nP <remainder of zfs send flags/args> and use the final out line’s “size NNNN” value with the pv switch -s NNNN to get a completion percentage / time remaining indicator.
I just tested this. It's great. I'm expecting each send/receive to take in the vicinity of a day, so a progress meter will be really handy.
 

sko

Aspiring Daemon

Reaction score: 269
Messages: 500

I usually use mbuffer on both sides (especially when piping through nc to another server), with buffer sizes depending on the system's amount of RAM. On pools with many datasets and lots of snapshots in particular, this considerably speeds up the transfer after the "warm up" phase in which metadata is gathered. It's usually a good sign if the buffers on both ends are fairly full: zfs send|receive tends to be very bursty, which massively hurts transfer speeds from/to spinning rust, and a full buffer dampens this to a relatively steady transfer rate.
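For the same-host case asked about at the top of the thread, a sketch with mbuffer in place of pv might look like this. The helper name `mbuffer_copy` is hypothetical. The `-s 128k -m 1G` values are the ones proposed in the original question, not tuned recommendations.

```shell
# Hypothetical helper: same-host copy with mbuffer between send and receive.
# $1 = source snapshot, $2 = destination pool, $3 = buffer memory (default 1G, as in the question)
mbuffer_copy() {
    zfs send -R "$1" | mbuffer -s 128k -m "${3:-1G}" | zfs receive -dF "$2"
}

# Example call (not executed here):
#   mbuffer_copy tank@replica1 tankstore 2G
```

mbuffer prints its fill level on standard error while it runs, which makes it easy to see whether the buffer stays reasonably full.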
 