Solved ZFS Snapshot send/recv fails when triggered via cronjob

Hi,

I have a script that runs every night and basically does the following (with some logging around it and some checks to see if older snapshots exist):

Code:
zfs snapshot -r $snapshot_today
zfs send -R -i $snapshot_yesterday $snapshot_today | zfs recv -Fdu $destination
The backup works if I trigger the script manually.

But when I call the script from a cronjob, it fails every now and then during the send/recv with the following error:

Code:
Assertion failed: (0 == nvlist_lookup_nvlist(snapprops, thissnap, &snapprops)), file /usr/src/cddl/lib/libzfs/../../../cddl/contrib/opensolaris/lib/libzfs/common/libzfs_sendrecv.c, line 1237.
cannot receive: failed to read from stream
Abort trap (core dumped)
Afterwards I have to recreate the backup from scratch.

This only ever happens when I call the script from a cronjob. I have done my backups manually for over a month now, just by opening a tmux session and calling the script, to verify that everything works. I even tried running the script in a detached screen session like so:

Code:
/usr/local/bin/screen -d -Logfile /var/log/zfs_backup.log -L -m /usr/local/bin/system_backup.sh
This also works when done manually but fails with the above error as a cronjob.

Is there anything I have to do differently to have zfs send/recv work as a cronjob, or is this a bug?

Regards,
Stefan
 
Is there anything I have to do differently to have zfs send/recv work as a cronjob, or is this a bug?
Nothing needs to be done differently.

What I can imagine happening is that those two snapshot variables occasionally contain something weird, and that's what's causing the error.
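If you want to rule that out, you could log the raw variable contents right before the send/recv, something like this (untested sketch, the debug log path is just an example):
Code:
# log the raw variables right before the transfer
printf 'today=[%s] yesterday=[%s] dest=[%s]\n' \
    "$snapshot_today" "$snapshot_yesterday" "$destination" >> /var/log/zfs_backup_debug.log

# refuse to run if either snapshot name came out empty
if [ -z "$snapshot_today" ] || [ -z "$snapshot_yesterday" ]; then
    echo "snapshot variable is empty, aborting" >&2
    exit 1
fi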
 
Hmm, at least the logging directly before the send/recv looks as expected. For example, for today's failed run:
Code:
taking todays snapshot, storage@autobackup-2018-01-18
yesterday snapshot, storage@autobackup-2018-01-17, exists lets proceed with backup

What I forgot to mention is that it always fails at a different dataset@snapshot. It starts out fine, but at some point while traversing the dataset snapshots something gets messed up.

Here's the whole script for your information. The script gets the pool name passed as its first argument. This is done by a wrapper script that imports and exports the backup pool and triggers a scrub every couple of days if a modulo condition on the current date is matched (a rough sketch of that wrapper follows the script):
Code:
#!/bin/sh

if [ -z "$1" ]; then
    echo "zpool to backup has to be supplied as argument to ${0}"
    exit 1
fi

type="autobackup"
pool=$1
destination="backup/${1}"

today=`date +"$type-%Y-%m-%d"`
yesterday=`date -v -1d +"$type-%Y-%m-%d"`

zfs_bin=/sbin/zfs

snapshot_today="$pool@$today"
# look for a snapshot with this name
if $zfs_bin list -H -o name -t snapshot | sort | grep "$snapshot_today$" > /dev/null; then
    echo " snapshot, $snapshot_today, already exists"
    exit 1
else
    echo " taking todays snapshot, $snapshot_today"
    $zfs_bin snapshot -r $snapshot_today
fi

snapshot_yesterday="$pool@$yesterday"

if $zfs_bin list -H -o name -t snapshot | sort | grep "$snapshot_yesterday$" > /dev/null; then
    echo " yesterday snapshot, $snapshot_yesterday, exists lets proceed with backup"

    echo " => Command used is $zfs_bin send -R -i $snapshot_yesterday $snapshot_today | $zfs_bin recv -Fdu $destination"
    $zfs_bin send -R -i $snapshot_yesterday $snapshot_today | $zfs_bin recv -Fdu $destination

    echo " backup complete destroying yesterday snapshot"
    $zfs_bin destroy -r $snapshot_yesterday
    exit 0
else
    echo " missing yesterday snapshot aborting, $snapshot_yesterday"
    exit 1
fi
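The wrapper itself isn't shown here; roughly, it does something like the following (untested sketch: the pool names are the ones from above, the weekly scrub interval and the sleep are just placeholders):
Code:
#!/bin/sh
# import the backup pool, run the backup, scrub occasionally, export again
zpool import backup || exit 1

/usr/local/bin/system_backup.sh storage

# scrub roughly once a week, keyed off the number of days since the epoch
if [ $(( $(date +%s) / 86400 % 7 )) -eq 0 ]; then
    zpool scrub backup
    # wait for the scrub to finish before exporting the pool
    while zpool status backup | grep -q "scrub in progress"; do
        sleep 300
    done
fi

zpool export backup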
 
Try using a buffer (e.g. mbuffer) between send and recv. I've observed some weird behaviour with zfs send|recv, especially over slow or high-latency network links (VPN), which disappeared once I started using mbuffer for caching on both sides.
As a side effect, the send|recv finishes faster, because neither side has to wait for the other to comb through metadata or finish writing, and the data stream is much more constant.
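For a local pool-to-pool transfer a single mbuffer between send and recv is enough; over a network you run one on each end. Something like this (the block size, buffer memory, host name and port are only examples to tune):
Code:
# local transfer with mbuffer as a cache between send and recv
zfs send -R -i "$snapshot_yesterday" "$snapshot_today" \
    | mbuffer -q -s 128k -m 1G \
    | zfs recv -Fdu "$destination"

# over the network: start the receiving side first, then the sender
# receiver: mbuffer -I 9090 -s 128k -m 1G | zfs recv -Fdu backup/storage
# sender:   zfs send -R -i "$snapshot_yesterday" "$snapshot_today" | mbuffer -s 128k -m 1G -O backuphost:9090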
 
I will give mbuffer a try, even though I don't think it will solve this, since everything works flawlessly when I do a "live" backup instead of a cronjob one. There has to be a difference between a login session and the cronjob "session". But even if the environment is different, I do not understand why zfs cannot read from the stream. I will also switch to eSATA and check that. It still feels a bit excessive, since in general it is working.
 
I changed the shebang to
Code:
#!/usr/local/bin/bash -l
and for now it is working as expected. So this weird behaviour, where an assertion fails but only sometimes and for unrelated datasets, seems to be triggered by the minimal cron environment. I still have to narrow it down further, because I don't want to spawn a fully fledged bash login shell where plain sh with a few fitting environment variables would be sufficient. But for now I'll mark this as solved. Thanks for your ideas.
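My current plan is to keep /bin/sh and just set PATH explicitly at the top of the script (or as a PATH= line in the crontab itself, which cron also understands). Roughly like this:
Code:
#!/bin/sh
# give the script the PATH it would normally inherit from a login shell,
# instead of relying on cron's minimal environment
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin
export PATH

# ... rest of the backup script unchanged ...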
 
Always try to write proper/valid Bourne shell code - such scripts are far more portable (bash isn't available on most BSD or Solaris/illumos default installations!) and less error-prone, because you avoid the inconsistencies between bash versions.

From a quick glance I'd suspect your if statements are the reason why the script won't work with /bin/sh. Try putting [ ] brackets around the whole condition and think about using proper evaluation, e.g. by running grep -c on the zfs list output and comparing the count to an integer with -ne 0. That way the desired logic behind the if statement is much easier to grasp if you look at the code again in one or five years...
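Something along these lines (untested, just to illustrate the style):
Code:
# count matching snapshots and test the number explicitly
count=$($zfs_bin list -H -o name -t snapshot | grep -c "^${snapshot_today}\$")
if [ "$count" -ne 0 ]; then
    echo " snapshot, $snapshot_today, already exists"
    exit 1
fi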
 
Yeah, I do know better shell scripting, but this script was just copied from somewhere else. Its usability is pretty bad, but I haven't spent any time revising it yet; I just needed it to work for the moment.
I honestly don't think the if statements are the reason for the error, though, but rather some environment information that the zfs binary is missing when run through a cronjob. Because, as I have said, it works when not run through crond.
 
I observed the same kind of problem when sending a zfs snapshot from fcron.
Taking a snapshot worked fine.
The solution was specifying the full pathname on every command.
As a test, export PATH="" and verify that every script still runs correctly.
The default PATH in which fcron jobs run is nothing more than "/usr/bin:/bin".
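For example, you can simulate the stripped-down cron environment by hand (the pool argument is just an example):
Code:
# run the script with a minimal, cron-like environment
env -i HOME=/root PATH="/usr/bin:/bin" /bin/sh /usr/local/bin/system_backup.sh storage

# stricter test: with an empty PATH, anything not called by its full pathname will fail
env -i PATH="" /bin/sh /usr/local/bin/system_backup.sh storage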
 