ZFS send/receive over slow lines

Happy new year to all!

I am having an issue with a ZFS send/receive backup script that constantly stalls overnight, and I don't know how to troubleshoot it.

The script runs on many servers without any problems. In this case we have two servers with ZFS on root in the US and an offsite server in Germany that receives the differential snapshots of both US servers overnight via a cron job.

The link is over an IPsec VPN, and I have to admit I am very disappointed with the speed on the receiving end in Germany: 4.5 Mbit/s max. The latency is not that bad though, averaging 105 ms.
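For what it's worth, that ceiling is roughly what a single TCP stream with a default ~64 KB window can reach over a 105 ms path, so the latency alone may explain the number. A quick back-of-the-envelope check (assuming the 64 KB window):

Code:
# max single-stream TCP throughput ~= window size / round-trip time
# 65536 bytes * 8 bits / 0.105 s ~= 5 Mbit/s
echo "scale=2; 65536 * 8 / 0.105 / 1000000" | bc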

The first server completes all the transfers successfully but the second does not.

Code:
#!/bin/sh

pool="zroot"
destination="zxf2"
host="10.49.0.10"

# snapshot names are date-based, e.g. zroot@2013-01-02
today=`date +"%Y-%m-%d"`
yesterday=`date -v -1d +"%Y-%m-%d"`

# create today's snapshot
snapshot_today="$pool@$today"
# look for a snapshot with this name
if zfs list -H -o name -t snapshot | sort | grep -q "^$snapshot_today$"; then
        echo " snapshot, $snapshot_today, already exists"
        exit 1
else
        echo " taking today's snapshot, $snapshot_today"
        zfs snapshot -r "$snapshot_today"
fi

# look for yesterday's snapshot
snapshot_yesterday="$pool@$yesterday"
if zfs list -H -o name -t snapshot | sort | grep -q "^$snapshot_yesterday$"; then
        echo " yesterday's snapshot, $snapshot_yesterday, exists; proceeding with the backup"

        zfs send -R -i "$snapshot_yesterday" "$snapshot_today" | ssh root@$host zfs receive -Fdv "$destination"

        echo " backup complete, destroying yesterday's snapshot"
        zfs destroy -r "$snapshot_yesterday"
        exit 0
else
        echo " missing yesterday's snapshot, aborting: $snapshot_yesterday"
        exit 1
fi

For some reason the second server does not complete, and I have to perform the operation manually. When I then run it by hand, I see that most of the data has already been transferred, and it finishes within 2-3 hours. We are talking about 77 GB of data, and the differential is usually no more than 1 GB.
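As a troubleshooting aid, the size of a differential can be estimated before sending anything, assuming a zfs version that supports the -n dry-run flag (the snapshot names here are examples):

Code:
# -n = dry run (send nothing), -v = print the estimated stream size
zfs send -nv -R -i zroot@2013-01-01 zroot@2013-01-02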

If you need more information, please let me know. I would appreciate any help at this point.
 
Your data transfer is taking place, which means your script is working fine; the script may not be the problem.
The most likely problem would be with IPsec.
Did you notice any connection flapping in the tunnel?
If you suspect the problem is with IPsec, then I would be able to help.
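A rough way to check is to log timestamped pings across the tunnel overnight (address taken from your script) and look for gaps in the icmp_seq numbers:

Code:
# one ping every 10 seconds, each line timestamped
ping -i 10 10.49.0.10 | while read line; do
        echo "`date '+%Y-%m-%d %H:%M:%S'` $line"
done >> /tmp/tunnel-ping.log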
 
The problem is not the IPsec tunnel. To rule it out, I temporarily ran the script outside the VPN. The results were exactly the same.

Unfortunately, Hetzner, where the offsite backup server is located, maxes out at 4.5 Mbit/s. Their official response is that they cannot guarantee bandwidth outside their premises. I have sent them reports and statistics from FTP transfers that I performed from various FTP servers in Germany. Most of them max out at 90 Mbit/s.
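Raw TCP throughput over the same path can also be compared with benchmarks/iperf; running parallel streams shows whether a single connection is the bottleneck (the hostname below is a placeholder):

Code:
# receiving side (Germany)
iperf -s
# sending side (US): four parallel TCP streams for 60 seconds
iperf -c german-host.example -t 60 -P 4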
 
As I understand it, your situation is:
1. You don't have any problem with the link on the German side (slow, but working steadily).
2. First server (US) -> Hetzner (Germany): everything okay.

That means your problem is located on the US side.
If we presume the second server is somehow much closer to the source of the problem, there are two things you can do:
1. Schedule the cron job a second time, at 4 AM or so, so the transfer can be retried.
2. Use mbuffer, as below.

When I had a problem with slow zfs send/receive transfer speeds, because the stream is bursty in nature, I solved it using mbuffer.
mbuffer can increase speed dramatically.

Try something like this on the receiving side:
mbuffer -s 128k -m 1G -I 8085 | zfs receive zpool/dataset

On the sender side:
zfs send -i zpool1/dataset@snap1 zpool1/dataset@snap2 | mbuffer -s 128k -m 1G -O 10.0.0.1:8085

You can also use it with ssh, so there is no need for IPsec tunneling; with ssh we got better speed.

If you want to use it in a script, then:
zfs send zpool/dataset@snap | mbuffer -s 128k -m 1G 2>/dev/null | ssh -c arcfour128 $remotehost "mbuffer -q -s 128k -m 1G 2>/dev/null | zfs recv -F -v zpool/dataset"

Hopefully this will help.
 
abhay4589 said:
As I understand it, your situation is:
1. You don't have any problem with the link on the German side (slow, but working steadily).
2. First server (US) -> Hetzner (Germany): everything okay.

That means your problem is located on the US side.

No, definitely not. It's just that the first server has very minor daily changes.

abhay4589 said:
If we presume the second server is somehow much closer to the source of the problem, there are two things you can do:
1. Schedule the cron job a second time, at 4 AM or so, so the transfer can be retried.
2. Use mbuffer, as below.

When I had a problem with slow zfs send/receive transfer speeds, because the stream is bursty in nature, I solved it using mbuffer.
mbuffer can increase speed dramatically.

Try something like this on the receiving side:
mbuffer -s 128k -m 1G -I 8085 | zfs receive zpool/dataset

On the sender side:
zfs send -i zpool1/dataset@snap1 zpool1/dataset@snap2 | mbuffer -s 128k -m 1G -O 10.0.0.1:8085

You can also use it with ssh, so there is no need for IPsec tunneling; with ssh we got better speed.

If you want to use it in a script, then:
zfs send zpool/dataset@snap | mbuffer -s 128k -m 1G 2>/dev/null | ssh -c arcfour128 $remotehost "mbuffer -q -s 128k -m 1G 2>/dev/null | zfs recv -F -v zpool/dataset"

Hopefully this will help.

mbuffer has also been under consideration, and since all traffic passes through the VPN there are no security issues here. Your script has also been tested (sorry), but it still relies on ssh. The correct way to use it over a secure line requires two actions (sketched below):

  • Start the listening socket on the receiving side
  • Initiate the transfer from the sending side
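A minimal sketch of those two steps, using the addresses and pool names from my script above and the port from your example (the snapshot names are hypothetical):

Code:
#!/bin/sh
host="10.49.0.10"
snap_old="zroot@2013-01-01"   # hypothetical snapshot names
snap_new="zroot@2013-01-02"

# 1. start the listening socket on the receiving side, in the background
ssh root@$host "mbuffer -q -s 128k -m 1G -I 8085 | zfs receive -Fd zxf2" &
sleep 5   # give the listener a moment to come up

# 2. initiate the transfer from the sending side; the stream travels over the VPN, not ssh
zfs send -R -i $snap_old $snap_new | mbuffer -q -s 128k -m 1G -O $host:8085
wait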

At this point I am using a "bucket" directory where all transfers are kept. The snapshots are then sent as plain files over rsync. If the receive part succeeds, the bucket gets cleared.
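In outline it works like this (a rough sketch; the paths and snapshot names are made up):

Code:
#!/bin/sh
host="10.49.0.10"
bucket="/var/backups/bucket"

# dump the differential stream into the bucket as a plain file
zfs send -R -i zroot@2013-01-01 zroot@2013-01-02 > "$bucket/diff.zfs"

# --partial lets rsync resume an interrupted transfer on the next run;
# the bucket is cleared only after the remote receive succeeds
rsync --partial --timeout=300 "$bucket/diff.zfs" root@$host:/bucket/ &&
ssh root@$host "zfs receive -Fd zxf2 < /bucket/diff.zfs" &&
rm "$bucket/diff.zfs"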
 
abhay4589 said:
I ran out of arsenal that day. Please post the solution if you find one to this weird problem. :)

OK, it sounds funny (and stupid), but it turns out to be a cron issue which I had no idea about:

Code:
.....
zfs send -R -i $snapshot_yesterday $snapshot_today | ssh root@$host zfs receive -Fdv $destination
.....

When this command is executed, zfs receive -v writes its progress to standard output. Cron has a problem with that and the process dies!
So, changing the line to:

Code:
zfs send -R -i $snapshot_yesterday $snapshot_today | ssh root@$host zfs receive -Fdv $destination > /dev/null

Seems to resolve the issue.
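The redirect can also live in the crontab entry itself instead of the script (the path below is made up):

Code:
# run nightly at 03:00, discarding stdout (and stderr)
0 3 * * * /root/bin/zfsbackup.sh > /dev/null 2>&1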
 
@gkontos

Since you're running it from cron and won't be able to see that progress anyway, you might as well just omit the "v" from recv.

/Sebulon
 
After dealing with several machines and slow links I decided to enhance the script.

The idea is to log everything to a file and have an external tool monitor it and alert us. I personally use net-mgmt/zabbix2-server, but anything similar would do.

The full script:

Code:
#!/bin/sh
export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

pool="zroot"
destination="pool"
host="10.10.10.1"

# refuse to start if the previous run is still in progress
if [ -f /tmp/backupscript.lock ]; then
        logger -p local5.notice "Backup Still Running FAILED"
        exit 1
else
        touch /tmp/backupscript.lock
fi

# snapshot names are date-based, e.g. zroot@2013-01-02
today=`date +"%Y-%m-%d"`
yesterday=`date -v -1d +"%Y-%m-%d"`
day=`date -v -5d +"%Y-%m-%d"`

# create today's snapshot
snapshot_today="$pool@$today"

# look for a snapshot with this name
if zfs list -H -o name -t snapshot | sort | grep -q "^$snapshot_today$"; then
        echo " snapshot, $snapshot_today, already exists, skipping"
else
        echo " taking today's snapshot, $snapshot_today"
        zfs snapshot -r "$snapshot_today"
fi

# look for yesterday's snapshot
snapshot_yesterday="$pool@$yesterday"
if zfs list -H -o name -t snapshot | sort | grep -q "^$snapshot_yesterday$"; then

        if zfs send -R -i "$snapshot_yesterday" "$snapshot_today" | mbuffer -q -v 0 -s 128k -m 1G | ssh root@$host "mbuffer -s 128k -m 1G | zfs receive -Fd $destination" > /dev/null; then
                logger -p local5.notice "Backup OK"
        else
                logger -p local5.error "Backup FAILED"
        fi
        rm /tmp/backupscript.lock
        # prune the snapshot from 5 days ago, keeping a 5-day history
        zfs destroy -r "$pool@$day"
        exit 0
else
        logger -p local5.error "missing yesterday snapshot Backup FAILED"
        rm /tmp/backupscript.lock
        exit 1
fi

We also need to modify /etc/syslog.conf and add something like this:

Code:
local5.*					/var/log/backup.log
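The log file must exist before syslogd will write to it, and syslogd has to re-read its configuration afterwards:

Code:
touch /var/log/backup.log
service syslogd reload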

This script retains a 5-day history of snapshots. If for some reason the transfer fails, it logs it. If the backup tries to run while the previous backup has not completed, it fails and logs that too. In addition, I am using misc/mbuffer, which improves performance.

So, with the right tool we can generate alerts and monitor the status.
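With net-mgmt/zabbix2-server, for instance, an active log item along these lines can raise the alert (the exact key syntax depends on the agent version):

Code:
log[/var/log/backup.log,"Backup FAILED"]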
 