ZFS 'zfs send' over ssh fails with 'invalid backup stream'

Hi all,

I've been seeing an issue for a while now that I haven't been able to make much progress on, so I'm turning to you all.

I have a backup script that does incremental zfs sends over ssh to a remote host. This works great on smaller snapshots, but once my snapshots reach 3+ GB I see it occasionally fail.

The failure is always the same:
Code:
receiving incremental stream of test/files/data@backup_snap.delta into remote_pool/backups/test/files/data@backup_snap.delta
cannot receive incremental stream: invalid backup stream

At first I thought it was just the ssh pipe being unreliable, so the next time it happened I dumped the delta snapshot that failed into a file on a thumbdrive:
# zfs send -i backup_snap test/files/data@backup_snap.delta > /mnt/thumbdrive/backup_snap.delta.bin

I then took the thumb drive and drove over to the remote machine, then tried to recv the snapshot directly from the thumbdrive.
# cat /mnt/thumbdrive/backup_snap.delta.bin | zfs recv -duv remote_pool/backups/test/files/data

Much to my surprise, this too failed with the same error!

Then, my curiosity was piqued, so I copied the snapshot file off the thumbdrive and onto a separate pool in the machine.

This time it worked!

So somehow the exact same snapshot stream failed with 'invalid backup stream' over two different media, but worked on the third attempt.

I've since had cases where the recv worked from the thumbdrive and not the internal pool.
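
Since the same file both failed and succeeded, one check that might narrow things down is hashing the stream file at each hop (sending host, thumbdrive, copy on the other pool) to see whether the bytes themselves ever differ. This is only a rough sketch: the helper names stream_sum and same_stream are made up for illustration, and it assumes either FreeBSD's sha256(1) or a sha256sum(1) is installed.

```shell
# Hypothetical helper: print the SHA-256 of a file using whichever tool exists.
# FreeBSD ships sha256(1); many other systems ship sha256sum(1) instead.
stream_sum() {
    if command -v sha256 >/dev/null 2>&1; then
        sha256 -q "$1"
    else
        sha256sum "$1" | cut -d ' ' -f 1
    fi
}

# Hypothetical helper: succeed only if both files exist and hash identically.
same_stream() {
    a=$(stream_sum "$1") || return 1
    b=$(stream_sum "$2") || return 1
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use (the second path is an assumption, not from my setup):
#   same_stream /mnt/thumbdrive/backup_snap.delta.bin /otherpool/backup_snap.delta.bin \
#       && echo "identical" || echo "streams differ"
```

If the hashes match everywhere and the recv still fails only some of the time, that would point away from the transport and toward the receiving side.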

So, has anyone seen this error? At this point I can hit it about once a week, so I'm shocked I haven't found much on it other than some ZFS on Linux threads.

The 'local' machine doing the send is running 10.3, the remote machine is 11.0, both AMD64. I'm up for any sort of debugging, I'd love to squash this bug.
 
I searched the code on GitHub and found that the error string corresponds to the error code EZFS_BADSTREAM, which appears to be used in only a single file, libzfs_sendrecv.c.

Does anyone have any experience or guides on using dtrace with libzfs? I'm thinking some simple probes should be able to tell me exactly which check is causing that error code to be returned, and from there I can work backward and figure out what's failing.
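
For reference, this is roughly the kind of probe I have in mind; it is an untested sketch. It assumes the pid provider can see libzfs's zfs_error(hdl, error, msg) function in the zfs process, and that the recv reads the stream from stdin when started through dtrace -c. The wrapper names zfs_error_probe and trace_recv are made up for illustration.

```shell
# Hypothetical helper: emit a D program that fires whenever zfs_error() is
# entered, printing the numeric error code (arg1), the message (arg2), and
# the userland stack so the failing check can be located.
zfs_error_probe() {
    cat <<'EOF'
pid$target::zfs_error:entry
{
    printf("zfs_error code=%d msg=%s\n", arg1, copyinstr(arg2));
    ustack();
}
EOF
}

# Hypothetical wrapper (run as root): the first argument is the stream file,
# the remaining arguments are passed to "zfs recv".
trace_recv() {
    stream=$1
    shift
    dtrace -n "$(zfs_error_probe)" -c "zfs recv $*" < "$stream"
}

# Example use:
#   trace_recv /mnt/thumbdrive/backup_snap.delta.bin -duv remote_pool/backups/test/files/data
```

If this works the way I expect, the ustack() output should show which function inside libzfs_sendrecv.c called zfs_error.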
 