ZFS send snapshot hangs on the first packet, then times out

Hi,

I'm trying to replicate my production ZFS system to a backup server at a different site.
We use ZFS snapshots, and I want to sync them daily so I can keep a longer snapshot history on the backup server.
Our ZFS system is big (200TB), so a full replication takes around 10 days over the 9GB connection between the two sites.

Our last full replication was interrupted by a power outage. The first snapshot and the data it referenced were received in full, but I can see that the second snapshot's transfer was cut short by the outage.

Since then, I've been trying to restart the transfer by sending the missing snapshots, but I can't. Every time, the zfs send command starts, sends its first packet, and then after around 15 minutes it just stops. There is no error message on either the source or the destination.
Here are the snapshots I have on the source:
Code:
root@fs01:/etc# zfs list -t snapshot
NAME                                  USED  AVAIL     REFER  MOUNTPOINT
pool01/ds01@2021-12-28_12.00.01--2w  6.18T      -      188T  -
pool01/ds01@2021-12-29_12.00.01--2w  1.63T      -      188T  -
pool01/ds01@2021-12-30_12.00.01--2w  1.72T      -      187T  -
pool01/ds01@2021-12-31_12.00.01--2w  1.73T      -      186T  -
pool01/ds01@2022-01-01_12.00.01--2w  1.76T      -      186T  -
pool01/ds01@2022-01-02_12.00.01--2w  1.67T      -      185T  -
pool01/ds01@2022-01-03_12.00.01--2w  1.37T      -      184T  -
pool01/ds01@2022-01-04_12.00.01--2w  1.46T      -      184T  -
pool01/ds01@2022-01-05_12.00.01--2w  1.85T      -      184T  -
pool01/ds01@2022-01-06_12.00.01--2w   489G      -      185T  -

Here is what I have on the destination:
Code:
root@04fs01:/tmp# zfs list -t snapshot
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
pool01/repl2-ds01@2021-12-28_12.00.01--2w  6.20T      -      188T  -

Here is the command I run to try to resume the sync:

Code:
root@fs01:/home# zfs send -cvRI pool01/ds01@2021-12-30_12.00.01--2w pool01/ds01@2021-12-31_12.00.01--2w | nc 192.168.13.2 54321
skipping snapshot pool01/ds01@2022-01-01_12.00.01--2w because it was created after the destination snapshot (2021-12-31_12.00.01--2w)
skipping snapshot pool01/ds01@2022-01-02_12.00.01--2w because it was created after the destination snapshot (2021-12-31_12.00.01--2w)
skipping snapshot pool01/ds01@2022-01-03_12.00.01--2w because it was created after the destination snapshot (2021-12-31_12.00.01--2w)
skipping snapshot pool01/ds01@2022-01-04_12.00.01--2w because it was created after the destination snapshot (2021-12-31_12.00.01--2w)
skipping snapshot pool01/ds01@2022-01-05_12.00.01--2w because it was created after the destination snapshot (2021-12-31_12.00.01--2w)
skipping snapshot pool01/ds01@2022-01-06_12.00.01--2w because it was created after the destination snapshot (2021-12-31_12.00.01--2w)
send from @2021-12-30_12.00.01--2w to pool01/ds01@2021-12-31_12.00.01--2w estimated size is 5.57T
total estimated size is 5.57T
TIME        SENT   SNAPSHOT pool01/ds01@2021-12-31_12.00.01--2w
21:49:07    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:08    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:09    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:10    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:11    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:12    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:13    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:14    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:15    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:16    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:17    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:18    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:19    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:20    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:21    392K   pool01/ds01@2021-12-31_12.00.01--2w
21:49:22    392K   pool01/ds01@2021-12-31_12.00.01--2w

On my destination I'm running the following:
Code:
nc -l 54321 -w 60 | zfs receive pool01/repl2-ds01


So I'm looking for a way to resume the transfer without having to restart from the beginning and spend another 10 days copying. But if that's the only way, I'll do it; I need this replication to work.

I am by no means an expert with ZFS, so I'm thankful for any help.
 
Send to /dev/null.
If it fails, then the problem is on the sender.
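Something like this, reusing the same incremental send command from your first post but discarding the stream locally instead of piping it through nc:
Code:
root@fs01:/home# zfs send -cvRI pool01/ds01@2021-12-30_12.00.01--2w pool01/ds01@2021-12-31_12.00.01--2w > /dev/null

If the SENT counter climbs well past 392K here, zfs send itself is working.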
Thanks for the answer, but I don't understand what you mean by that.

If I send it to a new destination, it will simply send everything from scratch, which is exactly the situation I'm trying to avoid.
 
If you can send a snapshot to /dev/null without problems, it means zfs send works and the problem is at the receiver (or in TCP transit).
Then send to /dev/null at the receiving end (pipe the receiving netcat to /dev/null).
If that works too, then zfs receive is the problem.
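Roughly like this, assuming the same port and addresses as in your commands above; the netcat on the destination just discards the stream instead of feeding zfs receive:
Code:
# on the destination: listen as before, but throw the stream away
nc -l 54321 -w 60 > /dev/null
# on the source: the same send/nc pipeline you already use
zfs send -cvRI pool01/ds01@2021-12-30_12.00.01--2w pool01/ds01@2021-12-31_12.00.01--2w | nc 192.168.13.2 54321

If the counter keeps moving in this setup, the network path and nc are fine and zfs receive on the destination is where to look.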
 